Hacker needed ... for Zarch ;-)

chat about arc/risc pc gaming & RISC OS software here (NOT the core OS!)

Related forum: adventures


User avatar
trixster
Posts: 527
Joined: Wed May 06, 2015 11:45 am
Location: York

Re: Hacker needed ... for Zarch ;-)

Postby trixster » Mon Oct 16, 2017 8:33 am

It must be time for an update!?? [-o< :D
A3020 | A3000 | BBC B + 128K RAM/ROM + 20K Shadow + Pi0 + VideoNuLA
BBC Master Turbo + DC | Atom | A1200 060 | A500 | Jaguar | A420/1
A4000/040 060 | Atari Falcon 060 | Saturn | PS1 | SNES | CPC6128 | C64 | 3DO | MD

sirbod
Posts: 736
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Postby sirbod » Mon Oct 16, 2017 8:47 am

trixster wrote:It must be time for an update!?? [-o< :D

You've not watched the "Let's play with the Zarch source code" series on the JASPP YouTube channel then?

I'll have the Pan-O-Vision version running both on the project and a Pi at the London Show. If I get time, I'll try to merge it with the 50Hz version that the YouTube series is about - but I'm not making any promises as I'm still setting up all the machines for the show and will probably run out of time.

User avatar
trixster
Posts: 527
Joined: Wed May 06, 2015 11:45 am
Location: York

Re: Hacker needed ... for Zarch ;-)

Postby trixster » Mon Oct 16, 2017 9:36 am

I have not! But I will now!
A3020 | A3000 | BBC B + 128K RAM/ROM + 20K Shadow + Pi0 + VideoNuLA
BBC Master Turbo + DC | Atom | A1200 060 | A500 | Jaguar | A420/1
A4000/040 060 | Atari Falcon 060 | Saturn | PS1 | SNES | CPC6128 | C64 | 3DO | MD

sirbod
Posts: 736
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Postby sirbod » Mon Oct 16, 2017 10:52 am

trixster wrote:I have not! But I will now!

With the London Show looming and me desperately trying to get a dozen machines ready in time, I've not had a chance to continue the series. Hopefully I'll resume it in November, where I plan to finish the 50Hz work and look at recoding for dynamic resolution and add 16m colour support ...or more accurately use the full 4096 colour palette that Zarch uses internally and solve the "fade to black" issue for the depth plotting.

Once all that's done I'll look at merging Pan-O-Vision into the original code - which on first inspection isn't going to be straight forward. I was hoping the original source would easily allow the width/depth of the landscape to be increased, sadly that doesn't appear to be the case.

User avatar
sbadger
Posts: 232
Joined: Mon Mar 25, 2013 1:12 pm
Location: Farnham, Surrey

Re: Hacker needed ... for Zarch ;-)

Postby sbadger » Tue Oct 17, 2017 8:49 am

Hi Jon,

I'd missed your YT vids too. Fantastic, fps looks so smooth!.
Does the source code allow an increase in resolution on the pi?
A3020| A3000x3| BBCBx3 | Electrn | Masterx3 |RiscPC| RPix3
A600 | C64 bbin x2|C64C | Toastrack |QL | XB360&1X |GB |GBC |GBA |GBASP | DS | 3DS XL x2| MD | MS
Atari 7600 | PS1-2-3-4| PSP |Vita |SNES |GC |N64 |Wii & U |Switch |JammaCab |Sony PVMx2

sirbod
Posts: 736
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Postby sirbod » Tue Oct 17, 2017 10:52 am

sbadger wrote:Does the source code allow an increase in resolution on the pi?

No, although there's variables for the width/height of the screen, the code contains hardcoded *320's using shifting.

crj
Posts: 315
Joined: Thu May 02, 2013 4:58 pm

Re: Hacker needed ... for Zarch ;-)

Postby crj » Tue Oct 17, 2017 1:11 pm

I've only just spotted this thread, so I'm not sure which people here have what levels of background knowledge about writing fast ARMcode for this sort of thing.

Important things to consider:
  • The combination of ARM2 and MEMC, when both code and data are in DRAM, does three relevant kinds of cycle:
    • N-cycle: 2 ticks
    • S-cycle: 1 tick, unless it crosses a 16-byte boundary in which case 2 ticks
    • I-cycle: always 1 tick
  • LDR(B) takes n+i+s
  • STR(B) takes 2n
  • STM of x registers takes 2n+(x-1)s
  • ALU operations usually take s
  • B/BL/PC-dest ALU take n+2s

(Here, a "tick" is a cycle of the un-stretched CPU clock, so 125ns if running at 8MHz.)

If you're trying to fill memory with bytes, there are various strategies you can adopt:










Codebytesticksbytes/tick
STRB r0,[r1],#1140.25
STRB r0,[r1],#1 : STRB r0,[r1],#1280.25
STRB r0,[r1],#1 : STRB r0,[r1],#1 : STRB r0,[r1],#13120.25
LDR r2,[r1,#-1]! : AND r2,r2,#&FF000000 : ORR r2,r2,r0,LSR#8 : STR r2,[r1],#43100.3
STR r0,[r1],#4441
STMIA r4!,{r0-r3} (unaligned)1682
STMIA r4!,{r0-r3} (aligned)1672.29
STMIA r8!,{r0-r7} (aligned)32122.67
STMIA r12!,{r0-r11} (aligned)48172.82
CMP rN,rEND : BLO loop05


That last consideration means that:

Code: Select all

.loop
  STMIA r12!,{r0-r11}
  STMIA r12!,{r0-r11}
  STMIA r12!,{r0-r11}
  STMIA r12!,{r0-r11}
  STMIA r12!,{r0-r11}
  STMIA r12!,{r0-r11}
  CMP r12,r13
  BLO loop

...takes 107 cycles for 288 bytes = 2.69 bytes/tick, whereas:

Code: Select all

.loop
  STMIA r12!,{r0-r11}
  CMP r12,r13
  BLO loop

...takes 22 cycles for 48 bytes = 2.18 bytes/tick.

In summary:
  • Load/mask/store is 20% quicker than LDRBs for storing 3 bytes in a word
  • STR store is 4x speed of STRB (duh!)
  • STMs can be 2x to 2.8x speed of STR
  • Aligned STMs are 15% quicker than unaligned
  • Loop unrolling saves about 20%
  • In code, positioning your loads/stores at addresses 12 mod 16 can save 5-25%

Another consideration: with an ARM3 you need to care about cache occupancy, meaning it's important to keep code size down. With an ARM2 executing the same instruction again is no quicker than executing a different instruction, so the only limit to unrolling is how much RAM you're prepared to sacrifice for the code.

Bringing this all together, I'd be tempted to try something like:

Code: Select all

.entry
  \ r0 = start address
  \ r1 = end address
  \ r4 = byte to store, quadruplicated into every byte of the word

  \ Find the common high-order bits of start and end addresses
  EOR r2,r0,r1
  \ Find the start address modulo 32
  AND r3,r0,#&1F
 
  CMP r2,#&20
  BHS large
 
  \ If we get here, start and end are in the same aligned 32-byte block
  \ Compose r3 into r2
  ORR r2,r2,r3,LSL#5
 
  \ Now dispatch to one of 32*32=1024 different code fragments
  \ Each is up to 16 instructions = 64 bytes long
  \
  \ 16 instructions, because the worst case is:
  \   Copy r4 into 5 more registers
  \   Store 3 bytes in a word (4 instructions)
  \   STM (1 instruction)
  \   Store another 3 bytes in the final word (4 instructions)
  \   MOV PC,lr (1 instruction)
  \ Total: 13 instructions
  \
  \ So the entire table is 64Kbytes, but we've only taken 10 ticks so far,
  \ and the code from this point on is *certain* to be optimal!
  ADD PC,PC,r2,LSL#6

  \ Insert 64K of mechanically-generated code here. (-8
  \ NB: Align .entry so that this table is 16-byte aligned

  \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

  \ This code will be executed last, but has to come before .large so that it's within ADR range.
  \ We will land in this code fragment the appropriate number of instructions back from the end.
  \ To be able to store max_bytes bytes, this table must contain max_bytes/32 entries
 
  STMIA r0!,{r4-r11}
  STMIA r0!,{r4-r11}
  STMIA r0!,{r4-r11}
  \ ...
  STMIA r0!,{r4-r11}
.no_blocks
  \ Align preceding STMs so that this MOV is at an address 12 mod 16
  MOV PC,lr

  \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
 
.large
  \ copy r4 so we have it in eight consecutive registers
  MOV r5,r4 : MOV r6,r4 : MOV r7,r4 : MOV r8,r4 : MOV r9,r4 : MOV r10,r4 : MOV r11,r4

  \ Down this path, we know there are some higher-order bits in r2 to clear
  AND r2,r2,#&1F
  \ Compose r3 into r2, as before
  ORR r2,r2,r3,LSL#5

  ADR r3,no_blocks

  \ Now we dispatch into a different 64Kbyte cluster of code fragments
  \ In this case the job is to:
  \   Write the initial 0-31 bytes, incrementing r0, worst case:
  \     Store 3 bytes in a word (4 instructions)
  \     STM (1 instruction)
  \   Write the final 0-31bytes, decrementing r1, worst case:
  \     Store 3 bytes in a word (4 instructions)
  \     STM (1 instruction)
  \   (Maybe) dispatch into the block storer to fill the gap (3 instructions)
  \ Total: 11 instructions
  \
  \ The final 3 instructions in every case are:
  \   SUBS r2,r1,r0  \ NB: By now, both r1 and r0 are now 32-byte-aligned
  \   MOVEQ PC,lr
  \   SUB PC,r3,r2,LSR#3  \ Offset backwards 4*number of 32-byte blocks
  \
  \ Note that the MOVEQ PC,lr saves 5 ticks for short lines, but costs 1 tick for
  \ longer ones. You'd need profiling to work out whether or not to include it.
 
  ADD PC,PC,r2,LSL#6

  \ Insert second 64K of mechanically-generated code here. (-8
  \ NB: Align .large so that this table is 16-byte aligned

Look ma, no loops!

sirbod
Posts: 736
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Postby sirbod » Tue Oct 17, 2017 3:11 pm

crj wrote:I'm not sure which people here have what levels of background knowledge about writing fast ARMcode for this sort of thing.

If you take a look at this post, I took the guesswork out of it by writing a BASIC program that works out the optimal set of instructions to line fill up to 64 pixels at every offset within a Quad word. There's also various graphs later on that show the performance when using varying numbers of colour registers.

crj
Posts: 315
Joined: Thu May 02, 2013 4:58 pm

Re: Hacker needed ... for Zarch ;-)

Postby crj » Tue Oct 17, 2017 5:06 pm

Hmm. It's tricky to make sense of the output from your BASIC program, but it doesn't look correct. In particular, if I understand it correctly, when filling three pixels there's only one alignment for which it will use load/mask/store yet there are two alignments for which it's optimal.

Come to that, for the aligned case, no misalignment fixup is needed. STRB and load/store/mask seem to be your only two options?

Also, it's definitely not the case that the optimal algorithm can always be expressed by three steps. For many lengths and alignments it will be best to deal with an initial part-word, then initial part-line, then whole lines, then final part-line, then final part-word.

Knowing the correct sequence of operations for a particular length and alignment is only half the battle, though. As you can see, I mainly took that as a given and focused on efficient decode, dispatch and code alignment issues.

sirbod
Posts: 736
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Postby sirbod » Wed Oct 18, 2017 6:56 am

crj wrote:Hmm. It's tricky to make sense of the output from your BASIC program, but it doesn't look correct. In particular, if I understand it correctly, when filling three pixels there's only one alignment for which it will use load/mask/store yet there are two alignments for which it's optimal.

It tries STRB and LDR/mask/STR at both the start and end of the fill, if you run it you'll see the full timing output of every test.

I posted it here so people such as yourself with a good understanding of the CPU could modify it and/or make corrections, so feel free to make the changes you propose and repost.


Return to “software”

Who is online

Users browsing this forum: No registered users and 2 guests