I'm really not convinced a dynamic recompiler is necessary, nor the answer. Nor would I say that implementing opcode emulation in assembler would necessarily give the biggest gain.
I think one problem is that the code tries to do too much by emulating every variety of Beeb in the same functions. Take the example of reading a byte from the emulated Beeb memory - this is one of the most basic operations, which happens at least once per opcode fetch and potentially up to 4 more times (for an indirect opcode). When emulating a BBC B, this should be no more complex than:
inline byte ReadMemory(word addr)
{
    if (addr < 0x8000)
        return memory[addr];          // direct mapping to emulated RAM (array name assumed)
    return ReadMemoryTranslate(addr); // deal with paged ROM + hardware (non-inline)
}
In BeebEm, this is a complicated function which checks every configuration of machine type it emulates, and does the address translation where necessary. For the simple case (direct mapping to emulated RAM), this is a very straightforward function which should be inlined. In the case where it is addressing hardware or a paged ROM, which needs some more complicated translation, this can call a non-inlined function, which will also benefit by (hopefully) still being in the instruction cache.
But the only way to do this is to have separate code for each model emulated, and the only sane way to do that is with C++ templates. Likewise for the CPU emulation. And there is the main problem - BeebEm isn't really architected in a speed-optimal way. Unfortunately, using templates to generate optimal code for every model means the executable potentially becomes bigger than is acceptable.
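To make the template idea concrete, here is a rough sketch of what per-model code generation could look like. None of these names (Model, ModelTraits, Machine, kDirectRamEnd) come from BeebEm; they are made up for illustration. The 0x3000 limit for the Master is a simplifying assumption (shadow RAM can be paged in above that), and the slow path is just a placeholder stub.

```cpp
#include <array>
#include <cstdint>

using byte = std::uint8_t;
using word = std::uint16_t;

// Hypothetical model tags, not BeebEm's own machine-type handling.
enum class Model { B, Master128 };

// Per-model constants resolved at compile time.
template <Model M> struct ModelTraits;
template <> struct ModelTraits<Model::B> {
    static constexpr word kDirectRamEnd = 0x8000;  // plain RAM below this
};
template <> struct ModelTraits<Model::Master128> {
    // Assumption for illustration: only map directly below the shadow RAM area.
    static constexpr word kDirectRamEnd = 0x3000;
};

template <Model M>
struct Machine {
    std::array<byte, 0x10000> ram{};

    // Slow path: paged ROM, hardware, shadow RAM. Defined out of line.
    byte ReadMemoryTranslate(word addr);

    // Fast path: one compare against a compile-time constant, no checks
    // for which machine type is being emulated.
    inline byte ReadMemory(word addr) {
        if (addr < ModelTraits<M>::kDirectRamEnd)
            return ram[addr];
        return ReadMemoryTranslate(addr);
    }
};

// Placeholder only: a real version would dispatch to ROM banks and devices.
template <Model M>
byte Machine<M>::ReadMemoryTranslate(word addr) {
    (void)addr;
    return 0xFF;
}
```

Each Machine<M> that gets instantiated carries its own copy of the memory access (and, by extension, CPU) code, which is exactly where the executable size cost comes from.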
Take another example, that of ADC. As I demonstrated above, this can be emulated in a very compact way in ARM assembler (by using the fact that ARM and 6502 set their flags in more or less the same way). But BeebEm could still perform a bit better than it does by separating the emulation of ADC in decimal mode (complicated and rare) from ADC in normal mode (simple and common). Inlining the normal-mode code and calling a non-inlined function when decimal mode is on would avoid polluting the instruction cache with rarely executed code.
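Something along these lines is what I mean by the ADC split. This is only a sketch: the Cpu struct and function names are invented here, not BeebEm's, the noinline attribute is GCC/Clang specific, and the decimal path sets Z and N from the final result rather than reproducing the NMOS 6502's decimal-mode flag quirks.

```cpp
#include <cstdint>

using byte = std::uint8_t;

#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Hypothetical register/flag holder for illustration.
struct Cpu {
    byte a = 0;
    bool c = false, z = false, n = false, v = false, d = false;
};

// Rare path: BCD addition, deliberately kept out of line.
// Simplified flags: V is left untouched, Z/N come from the BCD result.
NOINLINE void AdcDecimal(Cpu& cpu, byte value) {
    unsigned lo = (cpu.a & 0x0F) + (value & 0x0F) + (cpu.c ? 1 : 0);
    if (lo > 9) lo += 6;                       // fix up low digit
    unsigned hi = (cpu.a >> 4) + (value >> 4) + (lo > 0x0F ? 1 : 0);
    if (hi > 9) hi += 6;                       // fix up high digit
    cpu.c = hi > 0x0F;
    cpu.a = static_cast<byte>((hi << 4) | (lo & 0x0F));
    cpu.z = cpu.a == 0;
    cpu.n = (cpu.a & 0x80) != 0;
}

// Common path: binary addition, small enough to inline at every use.
inline void Adc(Cpu& cpu, byte value) {
    if (cpu.d) {
        AdcDecimal(cpu, value);
        return;
    }
    unsigned sum = cpu.a + value + (cpu.c ? 1u : 0u);
    // Overflow: operands share a sign that differs from the result's sign.
    cpu.v = (~(cpu.a ^ value) & (cpu.a ^ sum) & 0x80) != 0;
    cpu.c = sum > 0xFF;
    cpu.a = static_cast<byte>(sum);
    cpu.z = cpu.a == 0;
    cpu.n = (cpu.a & 0x80) != 0;
}
```

The inlined part is then just one test of the D flag plus a handful of arithmetic operations, and the BCD fix-up code only gets pulled into the cache when a program actually uses decimal mode.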
Surprisingly, GCC seems to do OK with the fact that all the 6502 registers are emulated in global variables. I expected this might cost extra time (by needing to establish the address of each global with a number of expensive operations), compared with having a pointer to a struct on the stack already, but the generated code actually seems to be pretty well optimised, so kudos to GCC on that.
I'll have a browse through the code later and see if I spot anything else. But I haven't seen anything so awful there... Did profiling really reveal the 6502 emulation to be the bottleneck in BeebEm?