evmar · 2026-05-27T20:28:21 1779913701

This is my second emulator, and in my first I picked a name more like that and regretted it. A thing I now appreciate about emulators is that it's common to increase scope -- like this one already supports non-wasm output, and I am tinkering with adding support for DOS executables as well, which means the name 'win2wasm' would already become obsolete!

evmar · 2026-05-27T20:06:57 1779912417

Wow, thanks for the link, that is perfect timing! I submitted my post and my own feedback on the discussion: https://github.com/WebAssembly/shared-everything-threads/dis...

evmar · 2026-05-27T19:05:02 1779908702

[post author] I looked into this API but wasn't sure how to make good use of it. I may have misunderstood, maybe you could help!

The doc you linked has two forms of use, sync and async.

For sync: it seems the idea is for the worker to render into an OffscreenCanvas, then postMessage an ImageBitmap created with transferToImageBitmap from the worker to the main thread for drawing. It seems like that would need to allocate a new bitmap for each frame. Currently Theseus puts the pixel data in shared memory and the main thread copies it out (required to create an ImageData). At least in principle that copy could reuse its buffer (though it currently doesn't), which seems better? https://github.com/evmar/theseus/blob/a5a849dbcf8046a2d1837a...

For async: here the idea is to have the worker render into an OffscreenCanvas linked to the on-screen one. But it seems that for a worker to get an OffscreenCanvas, the main thread must call .transferControlToOffscreen() on its canvas and transfer the result to the worker. Under the current synchronization model[1] the only time the worker can receive a message is during startup, because the rest of the time it's deep in its own wasm call stacks. This means that if the worker needs to resize its canvas and then paint to it, it's stuck.

[1] I wrote "current" because after writing this post I learned about JSPI which might help with this.

modeless · 2026-05-27T19:17:18 1779909438

I think after you call transferControlToOffscreen() and transfer the OffscreenCanvas to the worker, then the worker can control the size of the canvas with the width and height properties of the OffscreenCanvas. I'm not sure how this interacts with layout though.

IIRC the transferToImageBitmap path is efficient and doesn't necessarily copy anything. The APIs are designed to allow the ImageBitmap you get to be a reference to the GPU texture that is the backbuffer of the canvas, not a copy. When you transfer it to the main thread you are supposed to draw it using an ImageBitmapRenderingContext, which doesn't need to do an extra blit. It's just directly composited with the rest of the page, all staying on the GPU. In theory if it's a full screen canvas with no DOM on top the browser could even skip compositing entirely though I don't know if that's implemented anywhere.

Of course there are probably a lot of ways to fall off the fast path. And I guess you are doing software rendering for GDI so you're not starting out with pixels on the GPU in the first place. I'm not sure what the best path is for you but I think you can probably benefit from OffscreenCanvas in some way.

evmar · 2026-05-27T20:26:46 1779913606

Thanks a lot for this, I will put it on my list to investigate.

I've gone in circles a few times with how to think about image buffer management because I also support DirectDraw, which is designed to be backed by accelerated graphics, with operations like scaling bitblit. (Currently the Theseus implementation uses a shared "Surface" type as the backing store for both GDI Windows and DirectX Surfaces.)

It's a bit complicated by a few things. (1) DirectX surfaces can be "locked" to access them as pixel buffers, so any accelerated surface indirection would, I guess, need to be able to copy pixels back down into emulator memory. Which I guess I could just implement. (2) There are a bunch of different modes for operations like bitblit, such as setting a color key for transparency, that I can't implement with the canvas API, so I think I'd need to use GL shaders if I want acceleration, not just canvas.
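
For anyone unfamiliar with color keys, here is a minimal software sketch of a color-keyed blit. It is only an illustration, not Theseus's code: the function name, the packed 32-bit pixel format, and the assumption that both surfaces are the same size are all made up for the example.

```rust
/// Copy `src` onto `dst`, skipping every source pixel equal to `color_key`.
/// Assumes packed 32-bit pixels and same-sized surfaces (hypothetical layout).
fn blit_color_key(dst: &mut [u32], src: &[u32], color_key: u32) {
    for (d, &s) in dst.iter_mut().zip(src.iter()) {
        // Pixels matching the key are treated as transparent and left alone.
        if s != color_key {
            *d = s;
        }
    }
}
```

The per-pixel comparison is the part that has no direct equivalent in the 2D canvas drawing calls, which is why a shader comes up as the accelerated alternative.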

evmar · 2026-05-27T18:46:53 1779907613

Gosh, I think that means even when your code is wholly running on workers (where you would be able to use the atomic wait mentioned in the comment), it still will busy loop, doesn't it? At least it's within the allocator and not in the general implementation of Mutex... I think?

kettlecorn · 2026-05-27T20:46:37 1779914797

Yes that first sentence is correct and that's an unfortunate side effect. The Rust ecosystem eventually needs to evolve its multithreaded Wasm approach. Atomics are still only supported on Nightly Rust but it's been that way for 7+ years now.

And yes you're right again it's only in the allocator.
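
To make "busy loop" concrete, here is a minimal spin lock sketch. It is a generic illustration and not the allocator's actual code; the `SpinLock` type and its method names are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A minimal spin lock (illustrative only, not the allocator's real lock).
struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Try once to take the lock; returns true on success.
    fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Busy-loop until the lock is free. The thread keeps burning CPU while
    /// it waits, instead of sleeping the way a wait/notify-based lock would.
    fn lock(&self) {
        while !self.try_lock() {
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

A lock built on atomic wait would instead put the thread to sleep inside `lock` until `unlock` notifies it, which is the behavior available on workers but not on the main thread.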

evmar · 2026-05-17T01:08:15 1778980095

One nice tool for analyzing maps as a tree is the dominator tree. I wrote a bit about it here: https://neugierig.org/software/blog/2023/07/dominator.html
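
As a rough sketch of the idea (not the code from the linked post), the function below computes each node's immediate dominator using the well-known iterative algorithm over reverse postorder. The adjacency-list representation, the choice of node 0 as root, and all names are assumptions for the example.

```rust
/// Compute immediate dominators for a graph given as successor lists.
/// Node 0 is assumed to be the root. Returns `idom[n]`, where
/// `idom[0] == Some(0)` and unreachable nodes are `None`.
fn immediate_dominators(succs: &[Vec<usize>]) -> Vec<Option<usize>> {
    let n = succs.len();
    if n == 0 {
        return Vec::new();
    }

    // Iterative depth-first search to assign postorder numbers.
    let mut order = Vec::with_capacity(n); // nodes in postorder
    let mut post_num = vec![usize::MAX; n]; // node -> postorder index
    let mut visited = vec![false; n];
    let mut stack = vec![(0usize, 0usize)]; // (node, next successor index)
    visited[0] = true;
    while let Some(top) = stack.last_mut() {
        let (node, next) = *top;
        if next < succs[node].len() {
            top.1 += 1;
            let s = succs[node][next];
            if !visited[s] {
                visited[s] = true;
                stack.push((s, 0));
            }
        } else {
            post_num[node] = order.len();
            order.push(node);
            stack.pop();
        }
    }

    // Build predecessor lists.
    let mut preds = vec![Vec::new(); n];
    for (u, ss) in succs.iter().enumerate() {
        for &v in ss {
            preds[v].push(u);
        }
    }

    let mut idom: Vec<Option<usize>> = vec![None; n];
    idom[0] = Some(0);
    let mut changed = true;
    while changed {
        changed = false;
        // Visit reachable nodes in reverse postorder, skipping the root.
        for &node in order.iter().rev() {
            if node == 0 {
                continue;
            }
            let mut new_idom: Option<usize> = None;
            for &p in &preds[node] {
                if idom[p].is_none() {
                    continue; // predecessor not processed yet, or unreachable
                }
                new_idom = Some(match new_idom {
                    None => p,
                    Some(cur) => intersect(&idom, &post_num, p, cur),
                });
            }
            if new_idom != idom[node] {
                idom[node] = new_idom;
                changed = true;
            }
        }
    }
    idom
}

/// Walk two nodes up the current dominator tree until they meet.
fn intersect(idom: &[Option<usize>], post_num: &[usize], mut a: usize, mut b: usize) -> usize {
    while a != b {
        while post_num[a] < post_num[b] {
            a = idom[a].unwrap();
        }
        while post_num[b] < post_num[a] {
            b = idom[b].unwrap();
        }
    }
    a
}
```

For a diamond-shaped graph (0 points to 1 and 2, both point to 3, and 3 points to 4), node 3 is dominated directly by 0 rather than by 1 or 2, because neither branch alone is on every path to it.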

euroderf · 2026-05-17T18:34:27 1779042867

Another interesting type of tree is the multitree. Try this very readable paper:

https://interactivity.ucsd.edu/articles/In_Process/MultiTree...

chrisco255 · 2026-05-17T08:06:21 1779005181

Thank you for sharing, first I've heard of this tree type.

evmar · 2026-05-13T14:23:29 1778682209

The translator I made is only hobbyist quality, but I just have a big table that says “if you indirect jmp to address X then the associated block is at location Y”.

This is slower than a direct jmp (which doesn’t use the table) but also indirect jumps were slower in the original program to begin with and typically don’t occur in performance-critical loops.
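
A minimal sketch of that kind of table, assuming 32-bit original addresses and translated blocks identified by an index. The `JumpTable` type and its methods are hypothetical names for illustration, not the translator's real data structures.

```rust
use std::collections::HashMap;

/// Hypothetical table mapping an original-program address to the index of
/// the translated block that starts there.
struct JumpTable {
    blocks: HashMap<u32, usize>,
}

impl JumpTable {
    fn new(entries: &[(u32, usize)]) -> Self {
        JumpTable {
            blocks: entries.iter().copied().collect(),
        }
    }

    /// Resolve an indirect jump whose target was computed at runtime.
    /// Returns `None` if no translated block starts at `addr`.
    fn resolve(&self, addr: u32) -> Option<usize> {
        self.blocks.get(&addr).copied()
    }
}
```

A direct jump never goes through `resolve`, since its target block is known at translation time; only computed targets pay for the lookup.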

jcranmer · 2026-05-13T14:26:55 1778682415

> also indirect jumps were slower in the original program to begin with and typically don’t occur in performance-critical loops.

The main use-case in performance-critical loops is generally something like a core interpreter loop, where you're dispatching on an opcode.
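
For illustration, a toy interpreter loop of that shape might look like the sketch below. The three-opcode bytecode is invented for the example, and it assumes well-formed input. A `match` over dense opcode values like this is often compiled to a jump table, which is where the indirect jump in the hot loop comes from.

```rust
/// Run a tiny made-up bytecode and return the final stack:
/// 0 = halt, 1 = push the next byte, 2 = add the top two values.
/// Assumes the program is well formed (illustrative only).
fn run(code: &[u8]) -> Vec<i32> {
    let mut stack = Vec::new();
    let mut pc = 0;
    while pc < code.len() {
        let op = code[pc];
        pc += 1;
        // Dispatch on the opcode: this is the performance-critical branch.
        match op {
            0 => break,
            1 => {
                stack.push(code[pc] as i32);
                pc += 1;
            }
            2 => {
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a + b);
            }
            other => panic!("unknown opcode {}", other),
        }
    }
    stack
}
```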

evmar · 2026-04-22T20:14:07 1776888847

Do you have any notes or other artifacts from your recompiler? I’d love to learn more.

evmar · 2026-04-22T20:13:36 1776888816

Yes, I agree that there is little harm in gathering too much code. I have tried out just scanning data memory for values that refer to addresses within the region marked as code and disassembling from those points, as well as scanning the instructions I traverse for any immediate values in the same range.
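
A minimal sketch of the data-scanning half, assuming 32-bit little-endian pointers and a code region given as a half-open address range. The function name is hypothetical, and checking every byte offset rather than only aligned ones is a choice made for the example.

```rust
/// Scan `data` for little-endian 32-bit values that fall inside the code
/// region `[code_start, code_end)`. Each hit is a candidate address to
/// start disassembling from. Every byte offset is checked, not just
/// aligned ones (an assumption made for this sketch).
fn scan_for_code_pointers(data: &[u8], code_start: u32, code_end: u32) -> Vec<u32> {
    let mut found: Vec<u32> = data
        .windows(4)
        .map(|w| u32::from_le_bytes([w[0], w[1], w[2], w[3]]))
        .filter(|&v| v >= code_start && v < code_end)
        .collect();
    // Drop duplicates so each candidate is disassembled once.
    found.sort_unstable();
    found.dedup();
    found
}
```

False positives are expected here (any integer that happens to land in the range counts), which fits the point that gathering too much code does little harm.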

evmar · 2026-04-21T23:06:26 1776812786

Looking through Wikipedia at least, it's not exactly clear to me. They have separate pages for 'binary recompiler' and 'binary translation' that link to each other, and the latter is more about going between architectures (which is the main objective here).

evmar · 2026-04-21T22:58:31 1776812311

[post author] I went down some similar paths in retrowin32, though 32-bit x86 is likely easier.

I was also surprised by how much goop there is between startup and main. In retrowin32 I just implemented it all, though I wonder how much I could get away with not running it in the Theseus replace-some-parts model.

I mostly relied on my own x86 emulator, but I also implemented the thunking between 64-bit and 32-bit mode just to see how it was. It definitely was some asm but once I wrapped my head around it it wasn't so bad, check out the 'trans64' and 'trans32' snippets in https://github.com/evmar/retrowin32/blob/ffd8665795ae6c6bdd7... for I believe all of it. One reframing that helped me (after a few false starts) was to put as much code as possible in my high-level language and just use asm to bridge to it.

jcranmer · 2026-04-22T03:23:31 1776828211

Yeah, 32-bit x86 is somewhat easier because everything's in the same flat address space, and you at least have a system-wide code32 gdt entry that means you can ignore futzing around with the ldt. 16-bit means you get to deal with segmented memory, and the cherry on top is that gdb just stops being useful since it doesn't know anything about segmented memory (I don't think Linux even makes it possible to query the LDT of another process, even with ptrace, to be fair).

As for trying to ignore before main... well, the main benefit for me was being able to avoid emulating DOS interrupts entirely, between skipping the calls to set up various global variables, stubbing out some of the libc implementations, and manually marking in the emulator that code page X was 32-bit (something else that sends tools in a tizzy, a function switching from 16-bit to 32-bit mid-assembly code).

16-bit is weird and kinda fun to work with at times... but there's also a reason that progress on this is incredibly slow for me.

evmar · 2026-04-22T15:49:18 1776872958

Slow progress is fine, it took me like two years to get where I got! (Not that I was working on it full time or anything, but also there were just many false starts and I had no idea what I was doing...)