Continued ARM 65C02 emulation improvements

User avatar
hoglet
Posts: 6629
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol

Continued ARM 65C02 emulation improvements

Postby hoglet » Sat Jun 18, 2016 10:33 am

Hi Guys,

I thought I would start a new thread for continued ARM 65C02 emulator optimization, as otherwise it will get buried in the various Pi threads.

History

There was a discussion a long time ago on the 6502.org forums, where BigEd pointed out that Acorn's 65tube 6502 emulation was written pretty efficiently. So I thought I would have a go at disassembling this back into source form, and seeing what it looked like.

What's neat about Acorn's 65tube emulation is:
- all of the 6502 registers are kept in ARM registers
- the 6502 stack pointer and program counters are also ARM registers that point directly to the right place in memory
- the 6502 N, Z, C and V flags are implemented directly using the ARM flags, rather than needing to be emulated

Just before Christmas last year I ported this from 26-bit ARM to 32-bit ARM, so it could run on the Pi (attached to a Matchbox Co Pro acting as an SPI bridge).

The first published result was here:
IMG_0189.JPG

This was obtained using the following test configuration, which we stuck with for the remainder of the work.
- Raspberry Pi One with no overclocking (i.e. ARM clock is 700MHz, Core clock is 250MHz)
- BBC Master 128
- BBC Basic IV
- the original version of CLOCKSP, calibrated for a 2MHz Model B running Basic II

These results are a bit skewed (about 25% overstated) because we should be using the version of CLOCKSP that's calibrated for Basic IV on the Master. So bear that in mind, and at the end I'll shift to the Model B, so we can get a more realistic figure for effective 6502 speed.

Dominic (dp11) joined in, and over the next couple of weeks suggested a series of incremental improvements:
- 88.13MHz
- 89.31MHz
- 91.42MHz
- 92.38MHz
- 94.66MHz
- 94.96MHz
- 95.73MHz

At this point it felt like we were running out of steam. Most of the instructions were close to optimal, and the biggest overhead was the cost of checking for interrupts between each 6502 instruction.

Over Christmas, Dominic came up with a new pattern for handling interrupts that was much more efficient. It involved the ARM interrupt handler updating the register that points to the base of the opcode table. This meant the interrupt test was now effectively free. And what a difference it made:
- 134.37MHz

Over the next week, further significant improvements were made by optimizing for ARM pipeline stalls:
- 150.15MHz
- 155.69MHz
- 163.09MHz
- 167.72MHz
- 168.62MHz
- 168.84MHz
- 171.71MHz
- 176.36MHz

That takes us up to New Year's Eve last year, and was the last change made to the 6502 code in the PiTubeClient project.

Over the next few months, the focus with PiTubeClient was adding more co-processors (ARM2, 80x86, 32016).

Separately, BigEd and I started working on a fork of PiTubeClient, called PiTubeDirect, where the tube emulation is done in software on the Pi. This work was announced yesterday in its own thread: Pi-based Co-Pro on the cheap - 100MHz 6502 for £10?. The challenge there was reliability, not raw performance. A side effect of this was the need to shrink the ARM 6502 code so that it fitted in the L1 cache.

New work

I've just back-ported that "code shrinkage" to PiTubeClient, in a way that (I hope) preserves most of the pipeline optimization done above.
- Before, the emulator used 65KB of code and 64KB of data.
- After, the emulator used 9KB of code and 66KB of data.

On the Pi One and Zero, the L1 instruction cache is 16KB, so the code should now "fit".

Here's the change in git:
https://github.com/hoglet67/PiTubeClien ... e7498b7199

The effect it has on performance is pretty nice!
- I re-measured the old code at 176.49MHz
- the "shrunk" code now runs at 195.99MHz :shock:

IMG_0471.JPG

Remember, this is still using our slightly skewed measurement technique (i.e. the wrong version of CLOCKSP).

Here's a set of measurements across a broader range of hardware:

On a Master, running Basic IV, using the "wrong" version of CLOCKSP for this platform:
- Pi One (700MHz ARM/250MHz Core) (emulator code from 31/12/2015) 176.49MHz
- Pi One (700MHz ARM/250MHz Core) ("shrunk" emulator code from today) 195.99MHz
- Pi Zero (1000MHz ARM/250MHz Core) (emulator code from 31/12/2015) 232.14MHz
- Pi Zero (1000MHz ARM/250MHz Core) ("shrunk" emulator code from today) 279.00MHz

IMG_0472.JPG

These measurements are directly comparable with all the numbers reported above.

Finally, on a Model B, running Basic II, using the "right" version of CLOCKSP for this platform:
- Pi One (700MHz ARM/250MHz Core) (emulator code from 31/12/2015) 146.66MHz
- Pi One (700MHz ARM/250MHz Core) ("shrunk" emulator code from today) 157.66MHz
- Pi Zero (1000MHz ARM/250MHz Core) (emulator code from 31/12/2015) 199.39MHz
- Pi Zero (1000MHz ARM/250MHz Core) ("shrunk" emulator code from today) 225.62MHz

IMG_0474.JPG

These measurements are more externally reportable and are representative of actual 6502 clock speed.

Interestingly, increasing the core clock speed with the new code makes no difference to performance, which is a strong indicator that all of the code is in L1 cache.

Finally, for a bit of fun, I decided to see how far I could get by overclocking the Pi Zero:
- Pi Zero (1200MHz ARM/250MHz Core) ("shrunk" emulator code from today) 269.59MHz

IMG_0473.JPG

Not bad for a piece of hardware that costs £4.50!

Dominic, I'm happy to continue tweaking this, just to see how far we can get. Take a look at the change. It basically boils down to an extra level of indirection, via an opcode jump table (pointed to by ip). The change to the code was in the various FETCH_NEXT macros from:

Code: Select all

        add     lr, ip, r2, lsl #I_ALIGN

to

Code: Select all

        ldr     lr, [ip, r2, lsl #2]

Everything else stayed the same.

Dave
Last edited by hoglet on Sat Jun 18, 2016 5:08 pm, edited 1 time in total.

User avatar
BigEd
Posts: 1500
Joined: Sun Jan 24, 2010 10:24 am
Location: West
Contact:

Re: Continued ARM 65C02 emulation improvements

Postby BigEd » Sat Jun 18, 2016 11:02 am

Brilliant - I won't argue with 225MHz!

User avatar
flaxcottage
Posts: 2799
Joined: Thu Dec 13, 2012 8:46 pm
Location: Derbyshire

Re: Continued ARM 65C02 emulation improvements

Postby flaxcottage » Sat Jun 18, 2016 12:38 pm

This is looking really cool. 8)

Does this mean that there now is a BBC emulator that runs on a Pi that is equivalent to !65Host?
- John


dp11
Posts: 708
Joined: Sun Aug 12, 2012 8:47 pm

Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sat Jun 18, 2016 12:41 pm

Excellent work. I have one or two little ideas I need to think about.

It might be interesting to lock the jump table, zero page and the stack in the data cache.


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Sat Jun 18, 2016 12:49 pm

As an experiment, I took the "shrunk" code (PiTubeClient/1af82e9f) and backed out all the FETCH_NEXT pipeline improvements.

On the Pi Zero at 1000MHz on a Model B with Basic II, the performance dropped from 225.62MHz down to 195.92MHz.

So those changes are still working very effectively with the latest "shrunk" code, and make a ~30MHz difference.

It's worth pushing these back to PiTubeDirect; they were removed for simplicity when adding multi-core support. I've just done that, and it's made a good improvement to speed:
- before 150.85MHz
- after 180.58MHz

This also lets us see directly that the cost of emulating the tube is the difference between 225.62MHz (PiTubeClient) and 180.58MHz (PiTubeDirect), a reduction of 20%.

Dave


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Sat Jun 18, 2016 12:52 pm

flaxcottage wrote:Does this mean that there now is a BBC emulator that runs on a Pi that is equivalent to !65Host?

Sorry, but no. This is just emulating a 65C02 Co Processor (like the original Acorn 65TUBE from which it is derived).

For the difference between 65TUBE and 65HOST, see here:
http://www.chiark.greenend.org.uk/~theo ... late65.txt

Dave


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sat Jun 18, 2016 2:27 pm

Space saving optimisations:

Code: Select all

opcode_FB:
        sub     sl, sl, #1
        FETCH_NEXT_MERGED


FETCH_NEXT_MERGED isn't needed here, as the code can fall into the next FETCH_NEXT_MERGED.

Space saving with no real impact: move the following code to the beginning of the NOP section, change it to add sl, sl, #2 and let it fall into the next NOP section.

Code: Select all

// NOP_3
opcode_5C:
opcode_DC:
opcode_FC:
        add     sl, sl, #1
        FETCH_NEXT_MERGED


I'm just playing here, so I've removed some of the macros, but could you test this to see if there is a difference?

Code: Select all
// Opcode B1 - LDA ($00),Y
opcode_B1:

ldrh r0, [r0, fp]
FETCH_NEXT_STAGE_0
add r0, r0, r8, lsr #24
bic r0, r0, #0x10000
ldrb r1, [r0, fp] // Can you also try fp,r0?
FETCH_NEXT_STAGE_1
lsl r6, r1, #24
teq r6, #0
FETCH_NEXT_STAGE_2 mov pc, lr

Edit: actually I can't see the above making any difference.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sat Jun 18, 2016 7:27 pm

Code: Select all

.macro FETCH_NEXT_MERGED
        FETCH_NEXT_STAGE_0
        ldrb    r0, [sl], #1
        ldr     pc, [ip, r2, lsl #2]   
.endm

saves a word in the macro, should make no speed difference.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sat Jun 18, 2016 8:38 pm

Below may also just save a word with no performance issue

Code: Select all

.macro FETCH_NEXT_STAGE_12_merged
        ldrb    r0, [sl], #1
        ldr     pc, [ip, r2, lsl #2]
.endm

// Opcode C9 - CMP #$00
opcode_C9:
        FETCH_NEXT_STAGE_0
        cmp     r6, r0, lsl #24
        FETCH_NEXT_STAGE_12_merged

// Opcode E0 - CPX #$00
opcode_E0:
        FETCH_NEXT_STAGE_0
        cmp     r7, r0, lsl #24
        FETCH_NEXT_STAGE_12_merged


Below may also just save a word with no performance issue

Code: Select all

// Opcode 84 - STY $00
opcode_84:
        FETCH_NEXT_STAGE_0
        lsr     r1, r8, #24
        strb    r1, [fp, r0]
        FETCH_NEXT_STAGE_12_merged

// Opcode 85 - STA $00
opcode_85:
        FETCH_NEXT_STAGE_0
        lsr     r1, r6, #24
        strb    r1, [fp, r0]
        FETCH_NEXT_STAGE_12_merged

// Opcode 91 - STA ($00),Y
opcode_91:
        EA_INDIRECT_Y
        FETCH_NEXT_STAGE_0
        lsr     r1, r6, #24
        STORE_INDIRECT
        FETCH_NEXT_STAGE_12_merged

// Opcode A0 - LDY #$00
opcode_A0:
        FETCH_NEXT_STAGE_0
        lsl     r8, r0, #24
        teq     r8, #0
        FETCH_NEXT_STAGE_12_merged       


If the above shows no performance difference, it can be applied globally wherever there are 1 or 2 instructions between FETCH_NEXT_STAGE_0 and FETCH_NEXT_STAGE_12.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sat Jun 18, 2016 9:13 pm

Better use of the pipeline:

Code: Select all

// Opcode 91 - STA ($00),Y
opcode_91:
       
      ldrh    r0, [r0, fp]
      ldrb    r2, [sl], #1
      lsr     r1, r6, #24      
                     ;stall
      add     r0, r0, r8, lsr #24
      bic     r0, r0, #0x10000
           ldr     lr, [ip, r2, lsl #2]
           strb    r1, [fp, r0]

      ldrb    r0, [sl], #1
                     ;stall
      mov     pc, lr   
      
// Opcode BD - LDA $0000,X
opcode_BD:
        ldrh    r0, [sl, #-1]
        add     sl, sl, #1
   FETCH_NEXT_STAGE_0
        add     r0, r0, r7, lsr #24
        bic     r0, r0, #0x10000
        LOAD_ABSOLUTE
   ldr     lr, [ip, r2, lsl #2]
   ldrb    r0, [sl], #1
        lsl     r6, r1, #24
        teq     r6, #0
        FETCH_NEXT_STAGE_2   
       
// Opcode E6 - INC $00
opcode_E6:
        ldrb    r1, [fp, r0]
        FETCH_NEXT_STAGE_0
        add     r1, r1, #1
        strb    r1, [fp, r0]
        FETCH_NEXT_STAGE_1
        lsl     r1, r1, #24
        teq     r1, #0
        FETCH_NEXT_STAGE_2
       
       


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sat Jun 18, 2016 9:44 pm

I don't know enough about the workings of the branch predictor. The jump table may be better off being a list of branches.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sat Jun 18, 2016 10:03 pm

The single-byte NOPs should really, I think, be:

Code: Select all

        FETCH_NEXT_STAGE_1_I
        FETCH_NEXT_STAGE_2_I
       


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Sun Jun 19, 2016 6:08 am

dp11 wrote:I don't know enough about the workings of the branch predictor. The jump table may be better off being a list of branches.

I tried this a while back and saw far more I cache misses. But a lot has changed since then.

I'm out today, but I'll take a look at these ideas over the next few days.

Thanks,

Dave


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Sun Jun 19, 2016 7:32 pm

I've had some more thoughts. I know we have tried this before, but I think it is very much worth another try: I would like to try pre-caching the next byte as well. I can't remember how we did it last time. We could try and recover R5. In ADC and SBC I think we might need an extra instruction. For ASL I think we can actually just use MOVS R1, R1, LSL #24. For STZ we can just use the table pointer, as the lower byte is always zero.

If I have understood the pipeline, I think this will give about a 10%-20% performance increase.

Some examples of the most important instructions follow. Branches now also take a shortcut if they aren't taken. For my timings I have assumed all branches are taken, which is the worst case. And although I think the branch predictor will now predict some branches, I haven't assumed that it will.

Code: Select all

 // Current BEQ
      sxtabeq sl, sl, r0
      ;stall
      ldrb    r2, [sl], #1
      ldrb    r0, [sl], #1 ;hidden in stall
      ;stall
      ;stall
        ldr     lr, [ip, r2, lsl #2]
      ;stall
      ;stall
      mov     pc, lr ; 4 or 7 cycles depending on prediction
                  ; almost certainly not predicted due to lack of cycles

      ; total of 14/17 cycles
      
// new BEQ
      sxtabeq sl, sl, r0
      ldrne   lr, [ip, r5, lsl #2];hidden in stall
      ldrbeq  r2, [sl,#-1]
      ldrb    r0, [sl], #1 ;hidden in stall
      ldrb    r5, [sl], #1 ;hidden in stall
      ;stall
        ldreq     lr, [ip, r2, lsl #2]
      ;stall
      ;stall
      mov     pc, lr ; 4 or 7 cycles depending on prediction
                  ; almost certainly not predicted due to lack of cycles

      ; total of 14/17 cycles if taken. 10 if not taken

// Current STA $00
       ldrb    r2, [sl], #1
        lsr     r1, r6, #24
        strb    r1, [fp, r0]
      ;stall
        ldr     lr, [ip, r2, lsl #2]
      ldrb    r0, [sl], #1;hidden in stall
      ;stall
      mov     pc, lr ; 4 or 7 cycles depending on prediction
                  ; almost certainly not predicted due to lack of cycles

      ; total of 11/14 cycles
      
// new STA $00
       ldr     lr, [ip, r5, lsl #2]
        lsr     r1, r6, #24
        strb    r1, [fp, r0]
     
      ldrb    r0, [sl], #1
      ldrb    r5, [sl], #1
      mov     pc, lr ; 4 or 7 cycles depending on prediction
                  ; almost certainly predicted

      ; total of 9/12 cycles
      
// Current Opcode B1 - LDA ($00),Y
opcode_B1:     

       ldrh    r1, [r0, fp]
      ldrb    r2, [sl], #1
      ;stall
      ;stall
        add     r1, r1, r8, lsr #24
        bic     r1, r1, #0x10000
        ;stall
        ldrb    r1, [fp, r1]
      ldrb    r0, [sl], #1
      ldr     lr, [ip, r2, lsl #2]
       ;stall
        lsl     r6, r1, #24
        teq     r6, #0
        mov     pc, lr      
      
      //total 17/20 cycles almost certainly not predicted
      
// New Opcode B1 - LDA ($00),Y
opcode_B1:     

       ldrh    r1, [r0, fp]
      ldr     lr, [ip, r5, lsl #2]
      ;stall
      ;stall
        add     r1, r1, r8, lsr #24
        bic     r1, r1, #0x10000
        ;stall
        ldrb    r1, [fp, r1]
      ldrb    r0, [sl], #1
      ldrb   r5, [sl], #1
       ;stall
        lsl     r6, r1, #24
        teq     r6, #0
        mov     pc, lr      
      
      //total 17/20 cycles almost certainly predicted
      
// Current CMP #$00
       ldrb    r2, [sl], #1
        cmp     r6, r0, lsl #24
      ;stall
      ;stall
        ldr     lr, [ip, r2, lsl #2]
      ldrb    r0, [sl], #1 ;hidden stall
      ;stall
      mov     pc, lr ; 4 or 7 cycles depending on prediction
                  ; almost certainly not predicted due to lack of cycles

      ; total of 11/14 cycles      

      
// new CMP #$00
       ldr     lr, [ip, r5, lsl #2]
        cmp     r6, r0, lsl #24
      ldrb    r0, [sl], #1
      ldrb   r5, [sl], #1
   
      mov     pc, lr ; 4 or 7 cycles depending on prediction
                  ; almost certainly predicted

      ; total of 8/11 cycles      

// current INY
      ldr     lr, [ip, r0, lsl #2]
        add     r8, r8, #0x1000000
        teq     r8, #0
        ldrb    r0, [sl], #1
        mov     pc, lr
      
      ; cycles 8/11
      
// new INY
      ldr     lr, [ip, r0, lsl #2]
        add     r8, r8, #0x1000000
        teq     r8, #0
        mov     r0, r5
      ldrb   r5, [sl], #1
        mov     pc, lr
      
      ; cycles 9/12
      
// current Opcode A5 - LDA $00
opcode_A5:
        ldrb    r2, [sl], #1
        ldrb    r6, [fp, r0]
        ldrb    r0, [sl], #1
        ; stall
      ldr     lr, [ip, r2, lsl #2]
        lsl     r6, r6, #24
        teq     r6, #0
        mov     pc, lr   
      
      ; cycle 11/14
      
// new Opcode A5 - LDA $00
opcode_A5:
      ldrb    r6, [fp, r0]
        ldr     lr, [ip, r5, lsl #2]
        ldrb    r0, [sl], #1
        ldrb   r5, [sl], #1
        lsl     r6, r6, #24
        teq     r6, #0
        mov     pc, lr   
      
      ; cycle 10/13      

// current (but extra optimisation applied) Opcode 20 - JSR $0000
opcode_20:
        ldrh    r1, [sl, #-1]
        strh    sl, [r9,#-1]
      sub      r9, r9,#-2
        add     sl, fp, r1
        ldrb    r2, [fp,r1]
        orr     r9, #0x0100
      add     sl, sl, #1
      ;stall
        ldr     lr, [ip, r2, lsl #2]
        ldrb    r0, [sl], #1
      ;stall
        mov     pc, lr   

// new JSR
        ldrh    r1, [sl, #-2]
        strh    sl, [r9,#-1]
      sub      r9, r9,#-2
        add     sl, fp, r1
        ldrb    r2, [fp,r1]
        orr     r9, #0x0100
      add     sl, sl, #1
      ;stall
        ldr     lr, [ip, r2, lsl #2]
        ldrb    r0, [sl], #1
      ;stall
        mov     pc, lr   

Edit: while I remember, a slight improvement to BEQ.
Last edited by dp11 on Tue Jun 21, 2016 4:40 am, edited 1 time in total.

sirbod
Posts: 742
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Continued ARM 65C02 emulation improvements

Postby sirbod » Mon Jun 20, 2016 7:56 am

Isn't the use of LDRH in all the code suggestions above going to fall foul of alignment restrictions? I'm not sure from the source code whether the source address is word aligned or not.

As the code and tables are small, I'd lock them both into their respective caches. ARM's cache replacement policy hasn't kept up with the rest of the core, so there's no prediction when it comes to dropping cache lines at random.

On a more general note, I'd replace all references to r7/r8 that are 6502 registers with something more meaningful (eg r6502_X), it would make the code a lot easier to follow.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Mon Jun 20, 2016 8:11 am

Previous tests suggest LDRH gives a performance increase. If we go down the new route of pre-caching the two bytes, we can just do something like orr r1, r0, r5, lsl #8, which might mean we can also save another cycle in JSR.


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Mon Jun 20, 2016 8:19 am

Dominic,

I'm just wondering if you have the necessary hardware to put together a test environment for this yourself?

There's a few possible configurations that would let you run stuff:
- A BBC Master (or Model B) + Matchbox Co Pro + Raspberry Pi
- An Altera DE1 FPGA board + Raspberry Pi
- A Papilio DUO FPGA board + Classic Computing shield + Raspberry Pi

Dave


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Mon Jun 20, 2016 8:24 am

sirbod wrote:Isn't the use of LDRH in all the code suggestions above going to fall foul of alignment restrictions? I'm not sure from the source code whether the source address is word aligned or not.

We are enabling unaligned access in the ARM control register.
sirbod wrote:As the code and tables are small, I'd lock them both into their respective caches. ARM's cache replacement policy hasn't kept up with the rest of the core, so there's no prediction when it comes to dropping cache lines at random.

Can you say a bit more about this?

The cache is 4-way set associative. I've been assuming that a cache line would only be evicted if all ways for a particular index were in use.

As the working set of the code is now smaller than the cache size, won't it just always remain cached?
sirbod wrote:On a more general note, I'd replace all references to r7/r8 that are 6502 registers with something more meaningful (eg r6502_X), it would make the code a lot easier to follow.

I very much agree with this.

Dave


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Mon Jun 20, 2016 8:25 am

I don't, I'm afraid, but am considering it.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Mon Jun 20, 2016 8:28 am

I'm wondering if there is a way of setting up the equivalent of VNC.


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Mon Jun 20, 2016 8:36 am

dp11 wrote:I'm wondering if there is a way of setting up the equivalent of VNC

I can't see how that would be possible.

There's too many manual steps (e.g. copy firmware, hit Pi reset button, Ctrl-Break on Beeb) that would be hard to do remotely.

Dave


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Mon Jun 20, 2016 1:14 pm

I'm wondering (this is just a random thought): could the serial code you have be extended so that it is entered when a pin is toggled, with an option to download code? This would save having to write SD cards all the time. I could then gather the bits for PiTubeDirect. I could put the Master in the loft with a USB serial adaptor, MMC set to boot a program with *FX 2,1 etc. I could have a second Pi with two serial ports, one for the Master and one for PiTubeDirect. This second Pi can wiggle the Tube RST and then go into serial bootloader mode on PiTubeDirect.


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Mon Jun 20, 2016 1:25 pm

dp11 wrote:I'm wondering (this is just a random thought): could the serial code you have be extended so that it is entered when a pin is toggled, with an option to download code? This would save having to write SD cards all the time. I could then gather the bits for PiTubeDirect. I could put the Master in the loft with a USB serial adaptor, MMC set to boot a program with *FX 2,1 etc. I could have a second Pi with two serial ports, one for the Master and one for PiTubeDirect. This second Pi can wiggle the Tube RST and then go into serial bootloader mode on PiTubeDirect.

There already is a serial boot loader for the Pi that I used for a while:
https://github.com/mrvn/raspbootin
The host side program manages the download of your kernel.img, then becomes a dumb terminal.

I have some doubts as to whether PiTubeDirect is the best choice for optimizing the emulator code, because of the 20% overhead from the ISR/Tube Emulation. I would be concerned about the variability this might introduce into the results.

Dave


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Mon Jun 20, 2016 1:47 pm

I understand your concerns.


Re: Continued ARM 65C02 emulation improvements

Postby sirbod » Mon Jun 20, 2016 2:45 pm

hoglet wrote:
sirbod wrote:As the code and tables are small, I'd lock them both into their respective caches. ARM's cache replacement policy hasn't kept up with the rest of the core, so there's no prediction when it comes to dropping cache lines at random.

Can you say a bit more about this?

You'll want to read the relevant ARM ARM entry on data/instruction cache lockdown. ARM have also provided an example function: lock_d_cache which will load and lock data into the data cache.

The cache replacement policy on all ARM cores is to pick a random cache line, so your 6502 core code will constantly be evicted and re-cached.

Another thing you can do is use PLD at the fetch stage and pre-cache 6502 instructions before you hit them. You need to give it enough time to perform the pre-cache though, which will depend on how many cycles you perform before you're reading the data. As you're emulating instructions, 32 bytes may be enough, but try 64/96/128 to see if there's any difference.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Mon Jun 20, 2016 4:06 pm

Dave, if we could get the CPU mailboxes working on the Pi 3, would that remove your concerns over performance variations?


Re: Continued ARM 65C02 emulation improvements

Postby hoglet » Mon Jun 20, 2016 4:54 pm

sirbod wrote:You'll want to read the relevant ARM ARM entry on data/instruction cache lockdown. ARM have also provided an example function: lock_d_cache which will load and lock data into the data cache.

Thanks for this Jon.

We are already doing exactly this to lock down the FIQ Handler (code and data):
https://github.com/hoglet67/PiTubeDirec ... ube.S#L245
sirbod wrote:The cache replacement policy on all ARM cores is to pick a random cache line, so your 6502 core code will constantly be evicted and re-cached.

It was this statement I was particularly interested in.

My understanding was that a cache line will only be evicted if, on cache miss, all lines at that index are already in use (valid).

This seems to be confirmed on page 7-4 of
http://infocenter.arm.com/help/topic/co ... p7_trm.pdf

Cache line allocation uses the cache replacement algorithm when all cache lines are valid.
If one or more lines is invalid, then the invalid cache line with the lowest way number is
allocated to in preference to replacing a valid cache line. This mechanism does not
allocate to locked cache ways unless all cache ways are locked. See Cache miss handling
when all ways are locked down on page 7-6.

I interpreted this to mean that as long as the code for our emulator is contiguous, and is small enough to fit in the unlocked cache ways, then it will all end up in the L1 I Cache and stay there.

That said, the ARM Performance counters do indicate we are experiencing I Cache misses, so I could be wrong.

Unfortunately, there seem to be several cases where I Cache misses are falsely recorded:
http://infocenter.arm.com/help/index.js ... index.html

3. Event 0x0: Instruction cache miss

The instruction cache must be enabled for this event to be counted. Each instruction cache miss triggers an instruction fetch, which is a burst of four double words (8 words) to fill a cache line.

If the I-Cache miss count appears to be excessive, the most likely reason is that branch prediction is not enabled, as it is at reset. This results in the branch shadow at the end of each code loop being fetched every time around the loop. If the branch shadow extends into a new cache line, which has not yet been executed, then the cache line will continually be discarded after it is fetched, meaning that a new cache line fill will take place next time around the loop. This behaviour is eliminated with branch prediction enabled, because the processor will not fetch the branch shadow, if it predicts correctly.

If branch prediction is already enabled and the I-Cache miss count is still high, this may be the result of alignment of subroutine return instructions. With branch prediction on, subroutine returns are predicted using the return stack, but in this case the prediction is recognized in the cycle following the request to fetch the branch shadow. This is in time to cancel the requested bus access for a potential cache line fill (assuming the alignment of the code would trigger a line fill on the branch shadow), but the I-Cache miss counter has already been updated by this time - so in this case, the 'miss' count is incremented but no external bus access takes place.

Some instruction sequences will generate multiple I-cache misses in the same cache line, but result in a single cache line fill. Thus the I-Cache miss count in the performance monitor is larger than the number of cache line fills seen on the bus.

Based on this, I'm not sure I can rely on the I-Cache miss counts being accurate.

Dave


Re: Continued ARM 65C02 emulation improvements

Postby BigEd » Tue Jan 31, 2017 3:18 pm

Just to note, the next speed bump moved us up to 274MHz, on Nov 21st 2016:
In parallel, Dominic has been continuing work on the ARM Assembler based 6502 Co Pro (Co Pros 0 and 1), and has improved the performance from 225MHz to 274MHz, which is completely amazing given just under a year ago we were at 84MHz, and at that point 100MHz seemed optimistic!

(I'm not sure what the implementation changes were but it looks like a series of micro-optimisations and refactorings.)

Some time later (Jan 19th 2017) this work was released, in the Anaconda release. There were other improvements too, like the addition of banked memory.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Tue Jan 31, 2017 3:28 pm

The big jump from 225MHz to 270MHz-ish came from freeing up another ARM register and using it to pre-cache a future byte. This extra load was hidden in a pipeline stall and so was free.

When we start an instruction we now have:

For 2-byte instructions, the second byte and the byte for the next instruction.
For 1-byte instructions, we need to shuffle the pre-cached bytes (this only actually costs 1 ARM cycle, still quicker than a load).
For 3-byte instructions, both data bytes, which just need ORing together.

For branches we may need to reload this cache, but this doesn't cost us anything.


Re: Continued ARM 65C02 emulation improvements

Postby dp11 » Tue Jan 31, 2017 3:36 pm

I should also say these cached-byte loads are sometimes done early in the emulated instruction; this gives the ARM's cache time to get the data out of RAM before it is needed. I haven't actually seen this give a performance advantage, but it can't do any harm.

Sometimes, due to register usage, this pre-caching happens out of order, so byte +2 is loaded before byte +1.

