I thought I would start a new thread for continued ARM 65C02 emulator optimization, as otherwise it will get buried in the various Pi threads.
There was a discussion a long time ago on the 6502.org forums, where BigEd pointed out that Acorn's 65tube 6502 emulation was written pretty efficiently. So I thought I would have a go at disassembling this back into source form, and seeing what it looked like.
What's neat about Acorn's 65tube emulation is:
- all of the 6502 registers are kept in ARM registers
- the 6502 stack pointer and program counter are also ARM registers that point directly to the right place in memory
- the 6502 N, Z, C and V flags are implemented directly using the ARM flags, rather than needing to be emulated
Just before Christmas last year I ported this from 26-bit ARM to 32-bit ARM, so it could run on the Pi (attached to a Matchbox Co Pro acting as an SPI bridge).
The first published result was here: This was obtained using the following test configuration, which we stuck with for the remainder of the work.
- Raspberry Pi One with no overclocking (i.e. ARM clock is 700MHz, Core clock is 250MHz)
- BBC Master 128
- BBC Basic IV
- the original version of CLOCKSP, calibrated for a 2MHz Model B running Basic II
These results are a bit skewed (about 25% overstated) because we should be using the version of CLOCKSP that's calibrated for Basic IV on the Master. So bear that in mind, and at the end I'll shift to the Model B, so we can get a more realistic figure for effective 6502 speed.
Dominic (dp11) joined in, and over the next couple of weeks suggested a series of incremental improvements.
At this point it felt like we were running out of steam. Most of the instructions were close to optimal, and the biggest overhead was the cost of checking for interrupts between each 6502 instruction.
Over Christmas, Dominic came up with a new pattern for handling interrupts that was much more efficient. It involved the ARM interrupt handler updating the register that points to the base of the opcode table. This meant the test was now effectively free. And what a difference it made:
Over the next week, further significant improvements were made by optimizing for ARM pipeline stalls:
That takes us up to New Year's Eve last year, and was the last change made to the 6502 code in the PiTubeClient project.
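As a rough illustration of the interrupt pattern described above (the interrupt handler swaps the opcode table base, so the main loop never tests a flag), here is a minimal Python sketch. It is not the real ARM code: the class, the method names and the two-opcode instruction set are all made up for the example, and the way the real emulator enters the 6502 IRQ sequence is not shown in this post.

```python
class Emulator:
    """Toy dispatch loop; all names here are hypothetical, for illustration only."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.a = 0
        self.trace = []
        # Normal table: one handler per opcode (only two are implemented here).
        self.normal_table = [self.op_illegal] * 256
        self.normal_table[0xA9] = self.op_lda_imm
        self.normal_table[0xEA] = self.op_nop
        # Interrupt table: every opcode leads to the interrupt entry code.
        self.irq_table = [self.op_enter_irq] * 256
        # Plays the role of the ARM register holding the table base.
        self.table = self.normal_table

    def raise_irq(self):
        # All the (assumed) ARM interrupt handler does: repoint the table base.
        self.table = self.irq_table

    def op_lda_imm(self):
        self.a = self.program[self.pc + 1]
        self.pc += 2
        self.trace.append("LDA")

    def op_nop(self):
        self.pc += 1
        self.trace.append("NOP")

    def op_illegal(self):
        raise ValueError("opcode not implemented in this sketch")

    def op_enter_irq(self):
        # pc is left alone, so the interrupted instruction runs afterwards.
        self.trace.append("IRQ")
        self.table = self.normal_table

    def step(self):
        opcode = self.program[self.pc]
        self.table[opcode]()  # note: no "is an interrupt pending?" test here


emu = Emulator([0xA9, 0x42, 0xEA, 0xEA])  # LDA #&42 : NOP : NOP
emu.step()
emu.raise_irq()  # interrupt arrives between two 6502 instructions
emu.step()       # dispatches to the interrupt entry instead of NOP
emu.step()
emu.step()
print(emu.trace)
```

The point of the design is that the common path (no interrupt pending) pays nothing extra per instruction; the cost is only paid when an interrupt actually arrives.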
Over the next few months, the focus with PiTubeClient was adding more co-processors (ARM2, 80x86, 32016).
Separately, BigEd and I started working on a fork of PiTubeClient, called PiTubeDirect, where the tube emulation was done in software on the Pi. This work was announced yesterday in its own thread: Pi-based Co-Pro on the cheap - 100MHz 6502 for £10?. The challenge there was reliability, not raw performance. A side effect of that was the need to shrink the ARM 6502 code so that it fitted in the L1 cache.
I've just back-ported that "code shrinkage" to PiTubeClient, in a way that (I hope) preserves most of the pipeline optimization done above.
- Before, the emulator used 65KB of code and 64KB of data.
- After, the emulator uses 9KB of code and 66KB of data.
On the Pi One and Zero, the L1 instruction cache is 16KB, so the code should now "fit".
Here's the change in git:
https://github.com/hoglet67/PiTubeClien ... e7498b7199
The effect it has on performance is pretty nice!
- I re-measured the old code at 176.49MHz
- the "shrunk" code now runs at 195.99MHz
Remember, this is still using our slightly skewed measurement technique (i.e. the wrong version of CLOCKSP)
Here's a set of measurements across a broader range of hardware:
On a Master, running Basic IV, using the "wrong" version of CLOCKSP for this platform
- Pi One (700MHz ARM/250MHz Core) (emulator code from 31/12/2015) 176.49MHz
- Pi One (700MHz ARM/250MHz Core) ("shrunk" emulator code from today) 195.99MHz
- Pi Zero (1000MHz ARM/250MHz Core) (emulator code from 31/12/2015) 232.14MHz
- Pi Zero (1000MHz ARM/250MHz Core) ("shrunk" emulator code from today) 279.00MHz
These measurements are directly comparable with all the numbers reported above.
Finally, on a Model B, running Basic II, using the "right" version of CLOCKSP for this platform:
- Pi One (700MHz ARM/250MHz Core) (emulator code from 31/12/2015) 146.66MHz
- Pi One (700MHz ARM/250MHz Core) ("shrunk" emulator code from today) 157.66MHz
- Pi Zero (1000MHz ARM/250MHz Core) (emulator code from 31/12/2015) 199.39MHz
- Pi Zero (1000MHz ARM/250MHz Core) ("shrunk" emulator code from today) 225.62MHz
These measurements are more externally reportable and are representative of actual 6502 clock speed.
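For anyone who wants to check the arithmetic, the short Python script below works out two things from the figures listed above: the gain from the "shrunk" code on each setup, and the ratio between the Master/Basic IV and Model B/Basic II figures. The numbers are copied from the lists; the dictionary layout and function name are just for this example. Bear in mind that the ratio mixes a change of host machine with a change of BASIC version, so it is only an indication of the CLOCKSP skew.

```python
# CLOCKSP figures in MHz, copied from the two lists above.
master = {
    ("Pi One", "old"): 176.49, ("Pi One", "shrunk"): 195.99,
    ("Pi Zero", "old"): 232.14, ("Pi Zero", "shrunk"): 279.00,
}
model_b = {
    ("Pi One", "old"): 146.66, ("Pi One", "shrunk"): 157.66,
    ("Pi Zero", "old"): 199.39, ("Pi Zero", "shrunk"): 225.62,
}


def percent_gain(before, after):
    """Percentage improvement going from 'before' to 'after'."""
    return 100.0 * (after / before - 1.0)


# Gain from the code shrink, per host and per Pi model.
for host, results in (("Master", master), ("Model B", model_b)):
    for pi in ("Pi One", "Pi Zero"):
        gain = percent_gain(results[(pi, "old")], results[(pi, "shrunk")])
        print(f"{host:8} {pi:8} shrink gain: {gain:5.1f}%")

# How much higher the Master/Basic IV figure is than the Model B/Basic II one.
ratios = {key: master[key] / model_b[key] for key in master}
for (pi, version), ratio in ratios.items():
    print(f"{pi:8} {version:7} Master / Model B = {ratio:.3f}")
```

On these figures the shrink gains come out between roughly 7% and 20%, and the Master numbers are roughly 16% to 24% higher than the Model B ones.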
Interestingly, increasing the core clock speed with the new code makes no difference to performance, which is a strong indicator that all of the code is in L1 cache.
Finally, for a bit of fun, I decided to see how far I could get by overclocking the Pi Zero:
- Pi Zero (1200MHz ARM/250MHz Core) ("shrunk" emulator code from today) 269.59MHz
Not bad for a piece of hardware that costs £4.50!
Dominic, I'm happy to continue tweaking this, just to see how far we can get. Take a look at the change. It basically boils down to an extra level of indirection, via an opcode jump table (pointed to by ip). The change to the code was in the various FETCH_NEXT macros from:
    add lr, ip, r2, lsl #I_ALIGN
to:
    ldr lr, [ip, r2, lsl #2]
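To make the trade-off between the two instructions concrete, here is a small Python model of both dispatch schemes. It is only a sketch: the value of I_ALIGN, the base address and the handler sizes are assumptions picked for illustration (the real values are not given in this post), and the function names are made up.

```python
I_ALIGN = 8        # assumed: each opcode gets a fixed 2**8 = 256 byte slot
NUM_OPCODES = 256


def old_dispatch(code_base, opcode):
    # add lr, ip, r2, lsl #I_ALIGN
    # The handler address is computed from the opcode; no memory read needed,
    # but every handler must sit in its own fixed-size, mostly empty slot.
    return code_base + (opcode << I_ALIGN)


def build_jump_table(code_base, handler_sizes):
    # Pack the handlers back to back and record where each one starts.
    table = []
    addr = code_base
    for size in handler_sizes:
        table.append(addr)
        addr += size
    return table, addr - code_base


def new_dispatch(jump_table, opcode):
    # ldr lr, [ip, r2, lsl #2]
    # One extra load: a 4-byte handler address read from the table at ip.
    return jump_table[opcode]


handler_sizes = [32] * NUM_OPCODES      # made-up handler sizes in bytes
old_footprint = NUM_OPCODES << I_ALIGN  # code bytes spanned by fixed slots
jump_table, packed_code = build_jump_table(0x8000, handler_sizes)
table_bytes = 4 * NUM_OPCODES           # jump table size, counted as data

print("fixed slots :", old_footprint, "bytes of code")
print("packed code :", packed_code, "bytes of code +", table_bytes, "bytes of table")
```

With these assumed numbers the fixed-slot layout spans 64KB regardless of how short each handler is, while the packed layout only needs the handlers themselves plus a 1KB table. The price is one extra memory load per 6502 instruction, which is why it matters that the table and the code both stay in cache.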