Filling routines (WIP) Follow up to Zarch thread

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Tue Dec 01, 2015 7:57 pm

Hi *. members.

This is a new thread, since it appears the segment filling routines I've been working on are not well suited to Zarch and its small triangles.
(See here: viewtopic.php?f=29&t=10313&hilit=Zarch where Jon explains very well why filling while building the triangles is more appropriate.)
My routines work with a table of pairs of words (memory start address, length in pixels). I believe this is needed when working with RasterMan if RM wobbles the screen in hardware on each scanline: with a table, the hardware shift can be compensated for by altering the memory start address.
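
To make the table idea concrete, here is a minimal sketch in Python (not the ARM code from the archive). It builds one (start address, length) pair per scanline and applies a per-line byte correction to the start address. The names `build_segment_table` and `line_correction`, the 320-byte line pitch and the zero screen base are all assumptions for illustration only.

```python
# Sketch only: a table of (start_address, length_in_pixels) pairs.
# Assumes 1 byte per pixel and a hypothetical 320-byte scanline pitch.

LINE_PITCH = 320  # bytes per scanline (assumption)


def build_segment_table(screen_base, spans, line_correction):
    """Build the fill table.

    spans: list of (y, x_left, x_right), with x_right inclusive.
    line_correction: dict mapping y to a signed byte offset that cancels
    whatever shift the hardware applies on that scanline (hypothetical).
    """
    table = []
    for y, x_left, x_right in spans:
        length = x_right - x_left + 1
        start = screen_base + y * LINE_PITCH + x_left + line_correction.get(y, 0)
        table.append((start, length))
    return table


table = build_segment_table(
    screen_base=0,
    spans=[(0, 10, 19), (1, 8, 21)],
    line_correction={1: 4},  # scanline 1 needs its start address moved by 4 bytes
)
```

The filling code only ever sees the table, so the correction costs nothing inside the inner loop; it is paid once, when the table is built.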

Here is everything I have done so far, in a PackDir archive (filetype &68e).
I made a video to show you how to use it with Arculator (with a module playing, so it's less boring).
https://youtu.be/dWwSKm9iVpY

Please feel free to comment, many of you here are excellent programmers, so don't hesitate to criticize what I did.
I'm on my own so I can only be happy to learn.
So far I think it's smart, but of course this opinion is mine so is biased. :wink:
I know so far:
- routines could be faster if they only had to fill a shape with a uniform colour, but I wanted them to cope with patterned colour, so mine can work with a regular pattern, i.e. col1, col2, col1, col2, and of course col1, col1, col1, col1.
- writing everything by hand isn't the smartest solution
- the filling routines consume a lot of memory (but to me it's not an issue. My view is that an Archie is a 2 Mbyte machine, not 1).
See the source: there's plenty of unused memory space, and these 'holes' can be used to store data or even short pieces of code, so it's not exactly wasted.
Everything I did was with SPEED as the top priority.
- the routines don't make use of the clever sequencing of ARM instructions that saves more cycles (see this thread: viewtopic.php?f=16&t=9055&hilit=MEMC+behaviour&start=30), simply because this information about the Archie architecture is brand new to me. So far I have written the routines to plot segments from 1 to 200 pixels. I intend to extend that to 384 pixels (the intended usage is overscan mode 13), and I'll do it using this new rule; then I'll correct the existing 1 to 200 part.
- some routines can be improved, which is why there are comments in the two-part source expressing my doubts that they are as fast as they should be in their current form. Further optimisations are still possible, for sure.
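
As an illustration of the pattern fill mentioned in the list above, here is a small Python sketch (not the actual ARM routines). It fills a segment with col1, col2, col1, col2, and falls back to a uniform fill when both colours are equal. Tying the pattern phase to the address parity is an assumption made here so that neighbouring segments line up; the name `fill_segment` is hypothetical.

```python
def fill_segment(screen, start, length, col1, col2):
    """Fill `length` bytes of `screen` from `start` with col1, col2, col1, ...

    Assumption: even addresses always receive col1 and odd addresses col2,
    so the pattern stays regular from one segment to the next.
    Passing col1 == col2 gives a plain uniform fill.
    """
    pair = (col1, col2) if start % 2 == 0 else (col2, col1)
    pattern = bytes(pair) * (length // 2 + 1)
    screen[start:start + length] = pattern[:length]


screen = bytearray(16)
fill_segment(screen, 3, 5, 0x11, 0x22)  # bytes 3..7 become 22 11 22 11 22
```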

Comments welcome and highly appreciated.

Enjoy ! (I hope)

Xavier.
Attachments
Uni8bit.txt
(850.75 KiB) Downloaded 35 times

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Thu Dec 03, 2015 12:08 am

New video to time the filling routines.
https://www.youtube.com/watch?v=h1Wz1N4OXms

I honestly didn't know what to time : a very small shape, a big one ?
I chose the one I used to test the routines, being lazy tonight.
The result (with Arculator, ARM2 8 MHz) gives a bandwidth for this shape of 13,333,333 pixels per second, or 12.715 Mbytes per second.

I thought it would be better.
I'll check on a real Archie tomorrow.

David1664
Posts: 52
Joined: Thu Feb 25, 2010 2:24 am

Re: Filling routines (WIP) Follow up to Zarch thread

Postby David1664 » Thu Dec 03, 2015 3:25 am

From some previous experience (making a BASIC demo with Arculator in ARM2 mode), I'm guessing that your code will run 2 to 3 times slower on a real 8MHz ARM2.


David.

sirbod
Posts: 751
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Filling routines (WIP) Follow up to Zarch thread

Postby sirbod » Thu Dec 03, 2015 5:54 am

Arculator is mostly accurate timing wise on LDM/STM, but there are some huge differences elsewhere. ALU with register barrelshift are double speed, LDR is too fast and MUL/MLA are way out. I did a side-by-side timing comparison here with an A310.

I honestly didn't know what to time : a very small shape, a big one ?

If you're aiming for Zarch, there's a graph in the thread showing the max fill lengths and triangle sizes. 56 pixels is the largest fill I think.

If you're testing general fills, I use a pseudo random number generator and plot randomly based on that, so it's a consistent but general comparison.

You'll probably want to test each fill length individually and build a graph as I did, then you'll see where the slow ones are and spot any oddities, looks pretty too :)
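
The two suggestions above (a seeded pseudo random generator for a repeatable general test, and timing each fill length separately) could be sketched like this in Python. This is only an illustration of the method, not sirbod's test harness; `make_segments`, `time_per_length` and `simple_fill` are hypothetical names, and the timings obviously say nothing about ARM2 speed.

```python
import random
import time


def make_segments(count, max_length, screen_bytes, seed=1234):
    """Return the same pseudo random (start, length) list on every run."""
    rng = random.Random(seed)  # fixed seed keeps the comparison consistent
    segments = []
    for _ in range(count):
        length = rng.randint(1, max_length)
        start = rng.randint(0, screen_bytes - length)
        segments.append((start, length))
    return segments


def simple_fill(screen, start, length, colour=0xFF):
    """Stand-in for a real filling routine."""
    screen[start:start + length] = bytes([colour]) * length


def time_per_length(fill, screen, max_length, repeats=200):
    """Time each fill length on its own, ready to be plotted as a graph."""
    results = {}
    for length in range(1, max_length + 1):
        t0 = time.perf_counter()
        for _ in range(repeats):
            fill(screen, 0, length)
        results[length] = (time.perf_counter() - t0) / repeats
    return results
```

Plotting `results` against length is what makes unusually slow lengths stand out.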

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Thu Dec 03, 2015 7:23 am

It takes an A3000 85 seconds to run the code, compared with 75 seconds for Arculator.
https://www.youtube.com/watch?v=SCovPU8DJbY

Bandwidth is now 11.219 Mbytes per second.

Per frame, that is 229.8 kbytes at a refresh rate of 50 frames per second; since a mode 13 bank is 80 kbytes, it means you can basically fill the screen 2.8 times per VBL.
Disappointing, I expected something faster, way over 3 times. (*)
I'll have to work on this better sequencing of ARM instructions, as it's not implemented yet.
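
For anyone wanting to check the arithmetic, here is a short Python sketch. The total of 1,000,000,000 pixels is not stated anywhere in the thread; it is inferred from 13,333,333 pixels per second over 75 seconds, so treat it as an assumption. "Mbytes" and "kbytes" are taken as powers of 1024, which is what reproduces the figures quoted.

```python
KIB = 1024
MIB = 1024 * 1024


def fill_rate(total_pixels, seconds, frames_per_second=50, bank_bytes=80 * KIB):
    """Convert a timed run into the figures used above (1 byte per pixel)."""
    bytes_per_second = total_pixels / seconds
    bytes_per_frame = bytes_per_second / frames_per_second
    return {
        "mib_per_second": bytes_per_second / MIB,
        "kib_per_frame": bytes_per_frame / KIB,
        "screens_per_frame": bytes_per_frame / bank_bytes,
    }


TOTAL_PIXELS = 1_000_000_000  # inferred, see the note above

arculator = fill_rate(TOTAL_PIXELS, 75)
a3000 = fill_rate(TOTAL_PIXELS, 85)
```

With these inputs the A3000 run works out at roughly 11.22 Mbytes per second, about 229.8 kbytes per frame, and a little under 2.9 mode 13 screens per VBL.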

And no Jon I don't think these routines are the best for Zarch, as you very well explained.
It's why I left the thread and created this one.
Any other 3D games or demos with larger objects certainly, or 2D games with a cartoon-like style ... but not Zarch with its small triangles.
What's 'fp' in your pictures? I don't know anything about floating point on the ARM ... It's not really a register when an Archie isn't fitted with an FPA, is it? (I mean: there's nothing I should look for in this direction when my target machine is a standard A3000?)

Readers and watchers please note: my routines are compatible with RasterMan if there are hardware breaks in the video DMA on each scanline, and of course this isn't a small advantage. There's a small but real price to pay for that, but all in all it's worth working this way if you want a hardware-wobbled background screen, or want to implement hardware differential scrolling, etc. This technique will save cycles over the normal one.

(*) Disappointing because I expected to be very much faster than my sprites plotting routines.
I realise the real cost in terms of CPU cycles comes from merging into the background at the beginning and the end of a segment or of a sprite line, and this cost is the same in both the sprite plotting routines and the segment filling routines.

sirbod
Posts: 751
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Filling routines (WIP) Follow up to Zarch thread

Postby sirbod » Thu Dec 03, 2015 10:12 am

What's 'fp' in your pictures?

Which pictures are you referring to? The Arculator/A310 comparison is CPU only.
Disappointing, I expected something faster, way over 3 times.

I wouldn't get too disheartened, I'm sure we can get them a bit faster with alignment and careful STM placement. Don't expect too much gain though, 10% tops I'd say.

Might be worth comparing the speed of the code my programmatic routine outputs with that of your hand-crafted code. I'll test when I get a chance and post a speed graph.

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Thu Dec 03, 2015 12:02 pm

sirbod wrote:
What's 'fp' in your pictures?

Which pictures are you referring to? The Arculator/A310 comparison is CPU only.


fp in the registers list for LDMIA & STMIA
http://www.iconbar.com/forums/attachmen ... 0-ARM2.png
Is it R12? Why is it named 'fp'?

User avatar
Rich Talbot-Watkins
Posts: 1143
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Rich Talbot-Watkins » Thu Dec 03, 2015 12:25 pm

fp = frame pointer, although in the normal ABI, it's assigned to R11. With this naming convention, R12 is normally 'ip' (intra-procedure call scratch register). All this nomenclature is of more relevance to compilers which use the registers in a prescribed way. When writing native ARM code I've always just gone with R0-R15 :)
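
For reference, the procedure call standard aliases can be written out as a small lookup, sketched here in Python. The mapping follows the usual APCS naming (a1-a4, v1-v6, sl, fp, ip, sp, lr, pc); the helper `plain_name` is a hypothetical convenience for turning a disassembler's register list back into R0-R15.

```python
# APCS-style register aliases mapped to plain register numbers.
APCS_ALIASES = {
    "a1": 0, "a2": 1, "a3": 2, "a4": 3,                # argument / scratch
    "v1": 4, "v2": 5, "v3": 6, "v4": 7, "v5": 8, "v6": 9,  # variables
    "sl": 10,  # stack limit
    "fp": 11,  # frame pointer
    "ip": 12,  # intra-procedure call scratch
    "sp": 13,  # stack pointer
    "lr": 14,  # link register
    "pc": 15,  # program counter
}


def plain_name(name):
    """Translate an alias such as 'fp' to 'R11'; leave 'r3' style names alone."""
    key = name.lower()
    if key in APCS_ALIASES:
        return f"R{APCS_ALIASES[key]}"
    return name.upper()
```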

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Thu Dec 03, 2015 12:47 pm

Rich Talbot-Watkins wrote:fp = frame pointer, although in the normal ABI, it's assigned to R11. With this naming convention, R12 is normally 'ip' (intra-procedure call scratch register). All this nomenclature is of more relevance to compilers which use the registers in a prescribed way. When writing native ARM code I've always just gone with R0-R15 :)


OK thanks.
I'm soon going to forget it, my poor brain is only able to obey the KISS principle.

sirbod
Posts: 751
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Filling routines (WIP) Follow up to Zarch thread

Postby sirbod » Thu Dec 03, 2015 12:48 pm

Zarchos wrote:fp in the registers list for LDMIA & STMIA
http://www.iconbar.com/forums/attachmen ... 0-ARM2.png
Is it R12? Why is it named 'fp'?

Sorry, I see what you mean now. Ah, well, back in the days of Arthur, that's how the registers were referenced in the Acorn documentation.

SarahWalker
Posts: 1053
Joined: Fri Jan 14, 2005 3:56 pm
Contact:

Re: Filling routines (WIP) Follow up to Zarch thread

Postby SarahWalker » Thu Dec 03, 2015 6:28 pm

David1664 wrote:From some previous experience (making a BASIC demo with Arculator in ARM2 mode), I'm guessing that your code will run 2 to 3 times slower on a real 8MHz ARM2.

Probably worth pointing out that the main reason for this is that Arculator 0.99 doesn't emulate ROM timings, so BASIC is running as fast as if you'd *rmfaster'ed it. Assembler code is still a bit off, but nothing like 2-3 times.

Arculator v1.0 should be a lot better, if I ever get round to finishing it.

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Thu Dec 03, 2015 7:31 pm

TomWalker wrote:
David1664 wrote:From some previous experience (making a BASIC demo with Arculator in ARM2 mode), I'm guessing that your code will run 2 to 3 times slower on a real 8MHz ARM2.

Probably worth pointing out that the main reason for this is that Arculator 0.99 doesn't emulate ROM timings, so BASIC is running as fast as if you'd *rmfaster'ed it. Assembler code is still a bit off, but nothing like 2-3 times.

Arculator v1.0 should be a lot better, if I ever get round to finishing it.


Good news, so there's hope you'll keep on enhancing it?
I thought you had definitively decided to stop further development.
I know it's not an easy job, but is there any chance of having debugging/tracing functions in the future?
Also, I know someone who'd be happy to see MEMC register changes not performed only during video flyback ... (if I'm not wrong, that's what Arculator does).

David1664
Posts: 52
Joined: Thu Feb 25, 2010 2:24 am

Re: Filling routines (WIP) Follow up to Zarch thread

Postby David1664 » Thu Dec 03, 2015 8:16 pm

TomWalker wrote:
David1664 wrote:From some previous experience (making a BASIC demo with Arculator in ARM2 mode), I'm guessing that your code will run 2 to 3 times slower on a real 8MHz ARM2.

Probably worth pointing out that the main reason for this is that Arculator 0.99 doesn't emulate ROM timings, so BASIC is running as fast as if you'd *rmfaster'ed it. Assembler code is still a bit off, but nothing like 2-3 times.


Yes, I think I got my wires a bit crossed here. My BASIC demos ('Event Horizon' and 'BodgerBAS') actually targeted the 25MHz ARM3 - not the ARM2, and they were developed with Arculator in 25MHz ARM3 mode. You explained at the time:

"Arculator doesn't attempt to accurately emulate the ARM3 cache, on the basis that I haven't yet lost the will to live. Hence it's probably a bit quick here - but adjusting the speed for this would make other things too slow..."


David.
--

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Fri Dec 04, 2015 3:18 pm

Edit: ARM3 cache emulation posts moved to new topic in Emulators.

Rich Talbot-Watkins wrote:Yeah the ARM3 cache is a horrible thing to emulate, but certainly doable at a decent speed I think. It's 64 way set associative, but you wouldn't want to check through 64 entries to see if a line is in the cache on each memory access. I think I concluded that you could probably do this with a reverse lookup, given that there are only 22 significant bits in an address which maps to a cache line (so essentially you would store which cache line corresponds to a given address, as well as the other way round). Never managed to find out what algorithm it uses for the random replacement - I assume a very simple linear congruential generator.
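
A minimal Python sketch of the reverse lookup idea described in the quote: keep both a line-to-tag table and a tag-to-line table, so a hit is a single lookup instead of a search through 64 ways. Several things here are assumptions and not ARM3 facts taken from the thread: the 4 sets of 64 ways, the set selection by the low tag bits, and above all the replacement generator, which is a placeholder LCG since the real algorithm is unknown.

```python
LINE_BYTES = 16   # bytes per cache line (assumption: 4 words)
WAYS = 64         # 64-way set associative, as quoted
SETS = 4          # assumed: 256 lines in total


class CacheModel:
    def __init__(self):
        self.line_to_tag = [None] * (WAYS * SETS)  # forward: line -> tag
        self.tag_to_line = {}                      # reverse: tag -> line
        self.lcg = 1

    def _next_way(self):
        # Placeholder linear congruential generator (hypothetical constants).
        self.lcg = (self.lcg * 1103515245 + 12345) & 0x7FFFFFFF
        return (self.lcg >> 16) % WAYS

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        tag = (address & 0x3FFFFFF) // LINE_BYTES  # 22 significant bits
        if tag in self.tag_to_line:
            return True
        line = (tag % SETS) * WAYS + self._next_way()
        old_tag = self.line_to_tag[line]
        if old_tag is not None:
            del self.tag_to_line[old_tag]          # evict the previous owner
        self.line_to_tag[line] = tag
        self.tag_to_line[tag] = line
        return False
```

A real emulator would more likely use a flat array indexed by the 22-bit tag than a dictionary, but the bookkeeping is the same.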


Gentlemen, that's very interesting.
But as I was told by others (and they were right) in some other threads, please start another thread, it's interesting enough, because we are now quite OT ... :D

Have any of you downloaded my routines, and do you have any comments?
They're not going to change the world, I know, but I'd be grateful to read your comments.

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Sun Dec 20, 2015 8:28 pm

I've started modifying my segment plotting routines to make use of the clever sequencing of ARM instructions.
So far I modified the routines up to plotting a segment of 50 pixels (well that was modifying 50 x 16 = 800 routines #-o when it's been so sunny these days :? ).
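
The 50 x 16 count suggests one routine per (segment length, destination alignment) pair. A hedged Python sketch of how such a routine could be selected follows; reading the 16 as the byte offset of the destination within a quadword (16 bytes) is an assumption on my part, and `routine_index` is a hypothetical name.

```python
QUADWORD = 16  # bytes


def routine_index(start_address, length):
    """Pick one routine per (length, offset within a quadword).

    Assumption: the 16 variants per length correspond to the 16 possible
    byte offsets of the destination from a quadword aligned address.
    """
    if length < 1:
        raise ValueError("length must be at least 1 pixel")
    offset = start_address % QUADWORD
    return (length - 1) * QUADWORD + offset


# Lengths 1..50 with offsets 0..15 give indices 0..799, i.e. 800 routines.
highest = routine_index(15, 50)
```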

And yes, it's now faster.
Have a watch here and read the description text :

https://www.youtube.com/watch?v=cLBSM1thRj8
https://www.youtube.com/watch?v=qqLZoT8pcNo
https://www.youtube.com/watch?v=htmKdVvlkXQ

And yes don't hesitate to offer me a proper camera for Christmas to upload videos with greater quality :wink:

sirbod
Posts: 751
Joined: Mon Apr 09, 2012 8:44 am
Location: Essex
Contact:

Re: Filling routines (WIP) Follow up to Zarch thread

Postby sirbod » Wed Dec 23, 2015 4:25 am

Your eyesight must be better than mine! I can't make out what it says at the end of the videos :roll:

Have you thought about investing in a VGA>HDMI converter and an HDMI recorder? I use this recorder as you can work it without a PC and upload direct to YouTube.

I need to do some recordings on my A440/1 so have been looking at purchasing a VGA>HDMI converter that supports 720x400, I'll let you know if it works.

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Thu Dec 24, 2015 6:21 pm

Yes I agree.
The quality is so poor I removed these videos.

I'm not sure your devices can handle low res TV modes, but it's interesting, of course.

Just posted this, much better, I used my cellphone this time.
V2 routines complete for segments with max length 75 pixels.
https://www.youtube.com/watch?v=SP5RiUjgJ_4

25/12 Woke up at 03:40 so coded a bit :
https://www.youtube.com/watch?v=Pjev7O7fKjA here there's a real speed gain with better sequencing of ARM instructions.

Same day in the evening, I've just completed amending the routines to plot up to 100 pixels and here is a video to see if there's a speed gain, for example when the destination offset is 10 from a quadword aligned address. Starring Nute, one of my cats :D
https://www.youtube.com/watch?v=vHO6E6V-7mE

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Mon Dec 28, 2015 8:48 am

Update.
1st 1600 routines to cope with plotting from 1 to 100 pixels have been amended to use the optimised sequencing of ARM instructions.
Oh yes, I got up in the middle of the night to do that, or early in the morning, like 5 o'clock. It's funny, it reminds me of the time I was coding that awful preview of a demo with the generated code.

So, please load the files enclosed and replace the previous ones by these new files.
You know the filetypes, they are in the name as it all comes from Arculator.
EDIT : well no *. doesn't like the name,extension format, so :
- Demo and Uni8bitp1 are BASIC files
- c1 and fcode filetypes are not important, set 'Text'

Comments welcome.
I'm not sure I spotted every place some cycles could be saved, so in case you see anything, please help.

I also changed the explanations to make the way it all works a bit more understandable :shock:

Final note : the speed gain is more important than the videos show because I was able to save more cycles after these videos, looking back again at the code.
Attachments
Uni8bitp1.txt
(1.17 MiB) Downloaded 28 times
fcode.txt
(402 KiB) Downloaded 29 times
Demo.txt
(6.69 KiB) Downloaded 24 times
c1.txt
(202 KiB) Downloaded 25 times

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Tue Jan 12, 2016 1:05 am

Update of the routines, which in fact were very buggy, but with the 2 very close colours used, I couldn't see it.

Routines are now (I hope) both optimised for quadword aligned DMA addresses (all routines) and optimised with the clever sequencing of ARM instructions (only up to 140 pixels plotted). EDIT : It's now done for all routines up to 200 pixels.

Rename the files and their filetype as follows :
SetDir is Obey
Uni8bitp1, Uni8bitp2, Demo and Append are BASIC files
fcode is some code; you can settype it to Text and it'll work

To use all this run this sequence :
SetDir
Uni8bitp1
Uni8bitp2
Append
Demo

If you don't have 4 Mbytes Uni8bitp1 and 2 won't assemble, but you can still run 'Demo' because the assembled code 'fcode' is here anyway.
PS : You can use Arculator, after all it's what I used for the development of these routines.

EDIT 01/02/16 : New Uni8bitp2. Corrections added: there are now more routines using the clever sequencing of ARM instructions, and some errors are corrected too, where some routines were not fully using the 'MEMC DMA friendly quadword addresses'. I also added '::' in front of each ARM instruction on the 2nd word of a quadword aligned address, to ease reading and finding (perhaps) more efficient routines.
No 'UP' to signal this, since Uni8bitp2 isn't fully 'optimised' yet, only two thirds done tonight. New fcode not uploaded; I'll do that when everything is fully re-read and corrected (this includes Uni8bitp1, not re re re read yet ;-) )

EDIT 22/02/16 : Final version of Uni8bitp1. Some corrections to get better optimised code, where some routines were not fully using the 'MEMC DMA friendly quadword addresses'. I also added '::' in front of each ARM instruction on the 2nd word of a quadword aligned address, to ease reading and finding (perhaps) more efficient routines.
Most interesting changes are better sequencing of ARM instructions to get a 5:1 ratio when crunching the code with a very simple and extremely fast cruncher. The decruncher will also be very simple and extremely fast and shouldn't need more than 200 ARM instructions and a vocabulary of more than 50 ARM words.

EDIT 06/03/16 : Final version of Uni8bitp2. See comments above : same applies here except the code was much lousier, shame on me !
No new fcode sent, you can generate it on your machine, I'd like to reach plotting 384 pixels before a new global update of the files.
Attachments
Uni8bitp2.txt
(1.35 MiB) Downloaded 21 times
Uni8bitp1.txt
(1.21 MiB) Downloaded 18 times
SetDir.txt
(14 Bytes) Downloaded 21 times
royoraw.txt
(80 KiB) Downloaded 24 times
fcode.txt
(402 KiB) Downloaded 23 times
Demo.txt
(7 KiB) Downloaded 26 times
Append.txt
(215 Bytes) Downloaded 23 times
Last edited by Zarchos on Sun Mar 06, 2016 9:23 am, edited 8 times in total.

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Tue Jan 26, 2016 9:06 am

UP. New update. I've replaced the previous files with the latest ones.

Comments welcome, don't hesitate please. I'd be happy to read that somebody took at least 5 minutes to take a glance at what took me several hundred hours to think of + code + amend + debug + optimise + debug again + optimise a second time (implementing the clever sequencing of ARM instructions) + debug :D
And I just loved it :wink:

User avatar
davidb
Posts: 1918
Joined: Sun Nov 11, 2007 10:11 pm
Contact:

Re: Filling routines (WIP) Follow up to Zarch thread

Postby davidb » Tue Jan 26, 2016 1:14 pm

It works for me in Arculator! I had to reenable mouse support in the version I'm using, change the memory in the Next wimpslot and switch to mode 15. I'm sure someone can put it all in an application directory with a !Run file to do all that. Maybe even me. ;)

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Sat Jan 30, 2016 9:38 pm

davidb wrote:It works for me in Arculator! I had to reenable mouse support in the version I'm using, change the memory in the Next wimpslot and switch to mode 15. I'm sure someone can put it all in an application directory with a !Run file to do all that. Maybe even me. ;)


Yes, sure.
I'm re re re re re reading my source and there are some other cases in Uni8bitp2 where 1 cycle can be saved using a different code.
I'll post the amended source with the new source to plot segments from 201 to 300 pixels in a couple of days (I hope).

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Mon Feb 22, 2016 12:58 pm

Up to signal a new update (Uni8bitp1): see the edit of 22/02/16 in the message where I replaced the previous version of the file with the latest and, I hope, definitive version.

Zarchos
Posts: 2355
Joined: Sun May 19, 2013 8:19 am
Location: FRANCE

Re: Filling routines (WIP) Follow up to Zarch thread

Postby Zarchos » Wed Mar 30, 2016 6:46 am

Further progress, as can be seen here by people with very good eyesight : https://www.youtube.com/watch?v=MDg2iOuTkSw
It's only technical and visually not impressive at all.
But believe me once used in a game or a demo it should be very, very impressive, since it's very fast and should help keep a high framerate.
:shock:

