ARM3 cache emulation

Post by Rich Talbot-Watkins » Fri Dec 04, 2015 9:00 am

(Split from Filling routines (WIP) Follow up to Zarch thread)

David1664 wrote:
TomWalker wrote:
David1664 wrote: From some previous experience (making a BASIC demo with Arculator in ARM2 mode), I'm guessing that your code will run 2 to 3 times slower on a real 8MHz ARM2.

Probably worth pointing out that the main reason for this is that Arculator 0.99 doesn't emulate ROM timings, so BASIC is running as fast as if you'd *rmfaster'ed it. Assembler code is still a bit off, but nothing like 2-3 times.


Yes, I think I got my wires a bit crossed here. My BASIC demos ('Event Horizon' and 'BodgerBAS') actually targeted the 25MHz ARM3, not the ARM2, and they were developed with Arculator in 25MHz ARM3 mode. You explained at the time:

"Arculator doesn't attempt to accurately emulate the ARM3 cache, on the basis that I haven't yet lost the will to live. Hence it's probably a bit quick here - but adjusting the speed for this would make other things too slow..."


David.
--


Yeah, the ARM3 cache is a horrible thing to emulate, but certainly doable at a decent speed I think. It's 64-way set associative, but you wouldn't want to check through 64 entries to see if a line is in the cache on each memory access. I think I concluded that you could probably do this with a reverse lookup, given that there are only 22 significant bits in an address which maps to a cache line (so essentially you would store which cache line corresponds to a given address, as well as the other way round). I never managed to find out what algorithm it uses for the random replacement; I assume a very simple linear congruential generator.
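
A rough C sketch of the reverse-lookup idea, purely as an illustration. It assumes a 26-bit address space and 16-byte cache lines (which is where the 22 significant bits come from) and 256 cache slots in total. All names are hypothetical, and the victim slot is passed in by the caller because the real replacement algorithm is unknown.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SHIFT 4                  /* assumed 16-byte cache lines */
#define LINE_BITS  22                 /* 26-bit address minus 4 offset bits */
#define NUM_LINES  (1u << LINE_BITS)
#define NUM_SLOTS  256                /* assumed 4 sets x 64 entries */

/* Forward map: which memory line each cache slot currently holds. */
static uint32_t slot_line[NUM_SLOTS];
static uint8_t  slot_valid[NUM_SLOTS];

/* Reverse map: slot number + 1 for every memory line, 0 = not cached.
   One 16-bit entry per line makes this 8 MB on the host. */
static uint16_t line_slot[NUM_LINES];

static uint32_t line_of(uint32_t addr)
{
    return (addr >> LINE_SHIFT) & (NUM_LINES - 1);
}

/* Returns the slot holding addr, or -1 on a miss: a single table read. */
int cache_lookup(uint32_t addr)
{
    return (int)line_slot[line_of(addr)] - 1;
}

/* Places the line containing addr into the given slot, evicting whatever
   was there. Only call this after cache_lookup() reported a miss. */
void cache_fill(uint32_t addr, unsigned slot)
{
    assert(slot < NUM_SLOTS);
    if (slot_valid[slot])
        line_slot[slot_line[slot]] = 0;   /* old line is no longer cached */
    slot_line[slot]  = line_of(addr);
    slot_valid[slot] = 1;
    line_slot[slot_line[slot]] = (uint16_t)(slot + 1);
}
```

The point of keeping both directions is that a hit check never scans the 64 entries of a set; the cost moves into the size of the reverse table instead.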

Re: Filling routines (WIP) Follow up to Zarch thread

Post by sirbod » Fri Dec 04, 2015 3:01 pm

I believe the ARM3 uses a FIFO cache replacement policy; there's no intelligence to it.

With the size of cache on an x64 chip, the whole emulator and ARM3 cache should fit easily into the CPU L2, so speed shouldn't be an issue - certainly not if you're going for 1:1 - you'd be throttling it back for the majority of the time.

Should be fairly easy to emulate, just check the cache array before pulling in from memory and overwriting the next cache line in a circular buffer, probably a dozen lines of code at most.
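
The scheme described in this post could look roughly like the C sketch below. It only illustrates a FIFO (circular buffer) policy with a linear search. The 256-line, 16-byte-line geometry and every name are assumptions, and nothing here is confirmed as real ARM3 behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_LINES 256            /* assumed number of cache lines */
#define FIFO_SHIFT 4              /* assumed 16-byte lines */

static uint32_t fifo_tag[FIFO_LINES];
static bool     fifo_valid[FIFO_LINES];
static unsigned fifo_next;        /* next line to overwrite */

/* Returns true on a hit. On a miss the oldest line is replaced. */
bool fifo_access(uint32_t addr)
{
    uint32_t tag = addr >> FIFO_SHIFT;

    /* Check the cache array before going to memory. */
    for (unsigned i = 0; i < FIFO_LINES; i++)
        if (fifo_valid[i] && fifo_tag[i] == tag)
            return true;

    /* Miss: overwrite the next line in the circular buffer. */
    fifo_tag[fifo_next]   = tag;
    fifo_valid[fifo_next] = true;
    fifo_next = (fifo_next + 1) % FIFO_LINES;
    return false;
}
```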

Re: ARM3 cache emulation

Post by Rich Talbot-Watkins » Fri Dec 04, 2015 5:19 pm

Nah, it's not that simple - there are 4 sets of 64 slots. Bits 4 and 5 of the address determine which of the 4 cache sets it'll use. If the line isn't found in the cache set, it will replace one of the 64 slots "randomly" (from which I assume it's a simple PRNG of the form next=old*a+b).

In hardware it's easy to do a match over 64 slots in parallel, but that kind of iteration in an emulator on each memory access seems a bit steep, hence my thought that it should probably get a reverse lookup.
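
As an illustration, a straightforward C model of that layout might look like the sketch below: bits 4 and 5 pick the set, the 64 slots are scanned one by one, and a miss replaces a slot chosen by a generator of the form next=old*a+b. The multiplier and increment are the well-known "Numerical Recipes" constants used as placeholders; the real ARM3 values (if it works this way at all) are unknown, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 4
#define NUM_WAYS 64

typedef struct {
    uint32_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
} cache_set_t;

static cache_set_t sets[NUM_SETS];
static uint32_t    prng_state = 1;

/* Bits 4 and 5 of the address select one of the four sets. */
unsigned set_of(uint32_t addr) { return (addr >> 4) & 3; }

/* Everything above bit 5 identifies the line within its set. */
uint32_t tag_of(uint32_t addr) { return addr >> 6; }

/* next = old*a + b, with placeholder constants (not the ARM3's). */
static unsigned pick_victim(void)
{
    prng_state = prng_state * 1664525u + 1013904223u;
    return (prng_state >> 16) % NUM_WAYS;
}

/* Returns true on a hit; on a miss a pseudo-randomly chosen slot is refilled. */
bool cache_access(uint32_t addr)
{
    cache_set_t *s = &sets[set_of(addr)];
    uint32_t tag = tag_of(addr);

    for (unsigned i = 0; i < NUM_WAYS; i++)   /* the 64-entry scan per access */
        if (s->valid[i] && s->tag[i] == tag)
            return true;

    unsigned v = pick_victim();
    s->tag[v]   = tag;
    s->valid[v] = true;
    return false;
}
```

The loop in cache_access() is the per-access iteration that hardware performs in parallel and that an emulator would rather avoid.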

The ARM3 cache is pretty unconventional compared to modern CPU caches (I never heard of a 64-way set associative cache until this), but the policy obviously worked well for them. From this it's fairly clear why the ARM3 performs much better with tight little loops, as there's less chance of a cache line clash as it caches instructions.

Re: ARM3 cache emulation

Post by SarahWalker » Fri Dec 04, 2015 8:30 pm

Yes, this should probably be feasible - that should result in a bitmap of 512kb I think. There will probably be some pain involved with synchronising CPU and memory clocks though I imagine.
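
For reference, the 512 KB figure follows from one bit per cache-line-sized chunk of the address space: 2^22 lines / 8 bits per byte = 2^19 bytes. A minimal C sketch of such a presence bitmap is below. It assumes 16-byte lines and a 26-bit address space, uses hypothetical names throughout, and is not taken from Arculator's source.

```c
#include <stdbool.h>
#include <stdint.h>

#define BM_LINE_SHIFT 4                       /* assumed 16-byte lines */
#define BM_LINES      (1u << 22)              /* 64 MB address space / 16 */

static uint8_t cached_bitmap[BM_LINES / 8];   /* 2^22 bits = 512 KB */

static uint32_t bm_line(uint32_t addr)
{
    return (addr >> BM_LINE_SHIFT) & (BM_LINES - 1);
}

/* True if the line containing addr is currently marked as cached. */
bool bm_is_cached(uint32_t addr)
{
    uint32_t l = bm_line(addr);
    return (cached_bitmap[l >> 3] >> (l & 7)) & 1;
}

/* Sets or clears the "cached" bit for the line containing addr. */
void bm_mark(uint32_t addr, bool cached)
{
    uint32_t l = bm_line(addr);
    if (cached)
        cached_bitmap[l >> 3] |= (uint8_t)(1u << (l & 7));
    else
        cached_bitmap[l >> 3] &= (uint8_t)~(1u << (l & 7));
}
```

A bitmap like this only answers hit or miss; on a miss the emulator still needs the forward table to know which line the chosen victim slot held, so that its bit can be cleared.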

I'll stick with trying to improve ARM2 for now - it's looking better than it was, but still a few issues...
Attachments
arculator_memc1_timings.png

Re: ARM3 cache emulation

Post by sirbod » Sat Dec 05, 2015 3:04 am

ARM2 timings look a lot more accurate Tom, quick work.

If the cache replacement is pseudo-random as you suggest, we'd never get it accurate without insider NDA details. I wonder if there's a way of figuring it out programmatically, by presetting the cache and timing an LDM across each full cache line. The code would have to be accurately sized, pre-cached and run with IRQ/FIQ off etc. T1 should be good enough to tell if each cache line read hit the cache or not. Sounds like programming hell though!

I'd like to see a StrongARM or later emulator with an accurate cache, as the Harvard architecture is a royal pain to program on. Having to power up physical machines to test is like going back to the dark ages!!

Re: ARM3 cache emulation

Post by Rich Talbot-Watkins » Sat Dec 05, 2015 10:42 am

Articles from Steve Furber himself suggest that the random numbers may be generated from a simple table, or alternatively a very high frequency counter. But yeah, I think determining it through tests on real hardware would be very tricky, though maybe there's a way.

On emulating the ARM3 cache, I actually think reverse lookup isn't the way to go as the required table would be 2Mb and probably screw the cache. Probably a better approach would just be to store the emulated cache sorted by tag and use a binary search for lookups. That'd certainly be kinder on less powerful platforms.
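
A hedged C sketch of the sorted-tags approach for a single 64-entry set: lookups use a binary search, and a fill keeps the array ordered. The victim index is supplied by the caller since the real replacement policy is unknown, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WAYS 64

typedef struct {
    uint32_t tag[WAYS];   /* kept in ascending order */
    unsigned count;       /* number of valid entries */
} sorted_set_t;

/* Index of the first entry >= tag (count if there is none). */
static unsigned lower_bound(const sorted_set_t *s, uint32_t tag)
{
    unsigned lo = 0, hi = s->count;
    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;
        if (s->tag[mid] < tag)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

bool sorted_contains(const sorted_set_t *s, uint32_t tag)
{
    unsigned i = lower_bound(s, tag);
    return i < s->count && s->tag[i] == tag;
}

/* Inserts a tag that is not yet present. If the set is full, the entry at
   index (victim % WAYS) is removed first to make room. */
void sorted_insert(sorted_set_t *s, uint32_t tag, unsigned victim)
{
    if (s->count == WAYS) {
        victim %= WAYS;
        memmove(&s->tag[victim], &s->tag[victim + 1],
                (WAYS - 1 - victim) * sizeof s->tag[0]);
        s->count--;
    }
    unsigned i = lower_bound(s, tag);
    memmove(&s->tag[i + 1], &s->tag[i], (s->count - i) * sizeof s->tag[0]);
    s->tag[i] = tag;
    s->count++;
}
```

The whole set is 256 bytes of tags, so four of them stay small on the host, at the price of a short memmove on every miss.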

Re: ARM3 cache emulation

Post by sirbod » Sun Dec 06, 2015 3:49 am

p333 of VLSI RISC Architecture and Organization by Steve Furber details the ARM3 cache design, no detail on the random replacement policy though.

There's also some detail in Micropipelined Cache Design Strategies for an Asynchronous Microprocessor from p33, detailing the ARM3 cache design and the 64-way set associativity.

In theory, if the policy is pseudo random, any pseudo random policy should produce a compatible result. It's obviously not going to be an exact mirror of the hardware, but for average speed would probably turn out to be close enough.

Re: ARM3 cache emulation

Post by sirbod » Mon Dec 07, 2015 7:06 pm

Based on the documentation in my previous post, I've coded the attached to detect the cache line eviction order. If it's working correctly (I'm not convinced!) the ARM3/610/710 do appear to use a completely random order, which doesn't repeat and can result in the same cache line being replaced consecutively.

It works by pre-staging all four CAMs, reading a non-cached address that would hit CAM0 and then checking the addresses that should be cached in CAM0 to see which address takes one T1 tick longer. To avoid the code itself being evicted and voiding the result, all code and results are kept within CAM1 addresses. CAM2/3 aren't touched during the test.

Code:

CACHE_LINE%=16    :REM Size of an individual cache line
CACHE_LINES%=256  :REM Amount of cache lines
CAM_SIZE%=64      :REM Entries in each CAM
CACHE_SIZE%=CACHE_LINE% * CACHE_LINES%  :REM Size of the ARM3 cache

MEMC_SOFTCOPY% = &114 : REM Address in Page Zero of the MEMC CR copy
T1_LOW%        = &50  : REM Offset of T1 low in IOC
T1_HIGH%       = &54  : REM Offset of T1 high in IOC
T1_GO%         = &58  : REM Offset of T1 go in IOC
T1_LATCH%      = &5C  : REM Offset of T1 latch in IOC

code%=&20000

FOR A%=0 TO 2 STEP 2
P%=code%
[OPT A%
.start
 STMFD   R13!, {R0-R12, R14}
 MOV     R10, R14                               ;R10=original CPU mode
 SWI     "XOS_EnterOS"

 ORR     R0, R14, #%11 << 26                    ;disable IRQ / FIQ
 TEQP    R0, #0

 MOV     R8, #MEMC_SOFTCOPY%
 LDR     R9, [R8, #0]                           ;R9=original MEMC CR
 BIC     R0, R9, #%11 << 10                     ;disable DMA
 STR     R0, [R0, #0]

 ADR     R8, start                              ;pre-stage cache

 MOV     R3, #256                               ;several times
 .pre_cache_L2
   MOV     R2, #0
   .pre_cache_L1
     LDR     R0, [R8, R2, LSL #4]
     ADD     R2, R2, #1
     TEQ     R2, #CACHE_LINES%
   BNE     pre_cache_L1

   SUBS    R3, R3, #1
 BNE     pre_cache_L2

 ;CAM0/1/2/3 should now be pre-staged

 MOV     R12, #&3200000                         ;setup T1 timer
 ADR     R4, cache_misses

 MOV     R0, #0
 STRB    R0, [R12, #T1_HIGH%]

 MOV     R5, #0
 B       check_full_cache

 FNalignSET
 .check_full_cache
   SUB     R6, R8, #CACHE_SIZE%
   LDRB    R0, [R6, R5, LSL #6]                 ;force a cache miss
   MOV     R2, #0
   B       check_cache_lines

   FNalignSET
   .check_cache_lines
     MOV     R0, #&FF
     STRB    R0, [R12, #T1_LOW%]
     STRB    R0, [R12, #T1_GO%]                 ;start T1
     B       f1

     FNalignSET
     .f1
     LDR     R1, [R8, R2, LSL #6]               ;load cached line
     MOV     R0, R0, LSL R1                     ;pause 250nS
     MOV     R0, R0, LSL R1                     ;pause 250nS
     B       f2

     FNalignSET
     .f2
     MOV     R0, R0                             ;pause 125nS
     B       f3

     FNalignSET
     .f3
     STRB    R0, [R12, #T1_LATCH%]              ;latch T1
     LDRB    R1, [R12, #T1_LOW%]                ;read the low byte
     MOV     R0, #252
     B       f4

     FNalignSET
     .f4
     TEQ     R1, R0                             ;did the cache line miss?
     STREQB  R2, [R4], #CAM_SIZE%               ;YES
     BEQ     exit_loop
     B       f5

     FNalignSET
     .f5
     ADD     R2, R2, #1
     TEQ     R2, #CAM_SIZE%
   BNE     check_cache_lines
   B       done

   .exit_loop
   ADD     R5, R5, #1
   TEQ     R5, #CAM_SIZE%
 BNE     check_full_cache

 .done
 STR     R9, [R9, #0]                           ;reset MEMC CR
 TEQP    R10, #0                                ;reset CPU mode
 MOV     R0, R0

LDMFD   R13!, {R0-R12, PC}

FNalignSET
.cache_misses
]:NEXT

FOR A%=0 TO 4096 STEP 4:cache_misses!A%=0:NEXT
CALL start
FOR A%=0 TO CAM_SIZE%-1:PRINT ;" ";cache_misses?(A%*CAM_SIZE%);" ";:NEXT
WAIT:WAIT
END



DEF FNalignSET
  WHILE (P% AND %111111)<>%10000
    [OPT A%:MOV R0,R0:]
  ENDWHILE
=A%


And here's a sample of the output, showing the address offset that was evicted within the CAM range in the order they were evicted (note we can't actually determine which CAM entry was evicted, only the offset within the CAM range):

51 41 22 26 28 30 9 5 20 22 19 6 2 9 24 28 47 21 26 23 38 17 7 9 16 18 18 16 8 26 20 33 17 23 36 38 11 39 25 40 31 42 8 0 20 1 11 1 5 6 22 6 22 41 15 33 25 7 20 8 9 11 27 16

What is immediately obvious is that very few values are within the top of the CAM range, which is a hint it's not working as expected. I suspect this is down to the issue of actually pre-staging the cache in the first place - as the replacement method is pseudo-random, we can say with certainty that it's impossible to pre-stage the cache on an ARM3, making the above code unreliable.

What we can draw from this, however, is that any random replacement method should be sufficient to emulate an ARM3/610/710 cache, provided it's using bits 4/5 of the address for the CAM selection. You could probably even get away with taking bits 6..11 and adding an incrementing value to get the next CAM entry to evict.
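
That last suggestion could be sketched in C along these lines. It is only an illustration of "bits 6..11 plus an incrementing value"; the counter and the function names are hypothetical, and it makes no claim to match the hardware's eviction order.

```c
#include <stdint.h>

#define CAM_ENTRIES 64

static unsigned evict_counter;    /* hypothetical free-running counter */

/* CAM selection from address bits 4 and 5. */
unsigned cam_of(uint32_t addr)
{
    return (addr >> 4) & 3;
}

/* Entry to evict: address bits 6..11 plus an incrementing value,
   wrapped into the range 0..63. */
unsigned next_victim(uint32_t addr)
{
    unsigned bits_6_11 = (addr >> 6) & (CAM_ENTRIES - 1);
    return (bits_6_11 + evict_counter++) % CAM_ENTRIES;
}
```

On a miss the emulator would evict entry next_victim(addr) of CAM cam_of(addr) and load the new line there.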

EDIT: Noticed a flaw in the code above. I'm writing the results across addresses that fall into CAM0 which will skew the cache result.

EDIT2: The above code now modified to avoid skewing the results by writing to CAM0 addresses

Re: ARM3 cache emulation

Post by SarahWalker » Wed Dec 23, 2015 6:14 pm

Been playing with this a bit, got a prototype working. Results aren't exactly 100% accurate so far, but a lot closer than 0.99. Some benchmark results below, I also note that Doom runs at pretty much the same performance as my R260 now, instead of twice as fast as it used to. I'm currently using a reverse mapping (with a 512kb bitmap). Video DMA isn't handled properly, and synching between ARM and MEMC clocks is incorrect (forced 2:1 relationship at the moment, forcing a 12.5 MHz memory clock, and there's no cost to the synchronisation).

I've moved the Mercurial repo to https://bitbucket.org/pcem_emulator/arculator, as the old one wasn't restored when retrosoftware.co.uk was. The ARM3 changes aren't there yet, but ARM2 is much better.
Attachments
arm3_sick.png
arm3_armsi.png

Re: ARM3 cache emulation

Post by steve3000 » Thu Dec 24, 2015 10:53 am

TomWalker wrote: Been playing with this a bit, got a prototype working. Results aren't exactly 100% accurate so far, but a lot closer than 0.99. Some benchmark results below, I also note that Doom runs at pretty much the same performance as my R260 now, instead of twice as fast as it used to. I'm currently using a reverse mapping (with a 512kb bitmap). Video DMA isn't handled properly, and synching between ARM and MEMC clocks is incorrect (forced 2:1 relationship at the moment, forcing a 12.5 MHz memory clock, and there's no cost to the synchronisation).

Hi Tom, really great to read about your work on this - it will be a huge step forward to have some form of cache implementation in an emulator :D can't wait to try this!

Would you be interested in taking a look at my pre-release 'RasterMan' demo to see if Arculator could be tweaked to allow this to run correctly? If so I'd be more than happy to share this, along with some pointers as to what may need attention to get it running?

Re: ARM3 cache emulation

Post by SarahWalker » Thu Dec 24, 2015 2:31 pm

Yes, that would probably be helpful!

Re: ARM3 cache emulation

Post by steve3000 » Sat Dec 26, 2015 11:06 am

TomWalker wrote: Yes, that would probably be helpful!

You have a PM.

Re: ARM3 cache emulation

Post by sirbod » Fri Jan 01, 2016 12:27 pm

I don't suppose you could post a binary to test?

I may be able to help you out with the VIDC positioning issue in Grievous Bodily 'ARM, as I have it displaying correctly on the Pi. From memory it relies on the monitor type being a specific value and on the chosen MODE also having certain values, as it doesn't set all VIDC parameters. If any of these are wrong, the borders are positioned incorrectly. If you watch the VIDC parameters on both the title screen and in-level, you'll see what I mean.

IIRC, it selects a VGA MODE for the title screen and then changes the parameters to attain the overscan - which isn't VGA compatible! It's not the only game that does this, I've seen a few others that make odd MODE choices.

