12second Mandelbrot rendering on the BBC Master!
12second Mandelbrot rendering on the BBC Master!
Yes, really! Well, no, not really, but, yes really!
Okay, I was distracted, and I decided to have a play with stupidly limited fixedpoint arithmetic, and it led to this.
The JSBeeb link is here: https://bbc.godbolt.org/?model=Master&d ... d&autoboot
And the writeup is here (including a link to the code): http://cowlark.com/20180521bbcmicromandelbrot/
Okay, I was distracted, and I decided to have a play with stupidly limited fixedpoint arithmetic, and it led to this.
The JSBeeb link is here: https://bbc.godbolt.org/?model=Master&d ... d&autoboot
And the writeup is here (including a link to the code): http://cowlark.com/20180521bbcmicromandelbrot/
Re: 12second Mandelbrot rendering on the BBC Master!
Excellent  thanks for the jsbeeb link! (And thanks for the detailed writeup too. It's interesting to see different tradeoffs in arithmetic.)
Re: 12second Mandelbrot rendering on the BBC Master!
Beautiful!
Made me smile
Made me smile
Re: 12second Mandelbrot rendering on the BBC Master!
I got it down to 4.5 seconds! With some slight problems...
That's using reenigne's tableofsquares algorithm, but 2.6 fixed point just doesn't have enough precision to make it work and it's overflowing nearly everywhere. It's still getting the main set moreorless right, though.
That's using reenigne's tableofsquares algorithm, but 2.6 fixed point just doesn't have enough precision to make it work and it's overflowing nearly everywhere. It's still getting the main set moreorless right, though.
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
Sigh. I was feeling all smug there for a moment...
So reenigne came up with an incredibly cunning and impossibly fast Mandelbrot kernel, using 14point fixed point arithmetic and a 16kB lookup table. After patching it into my renderer, this happened...
Jsbeeb link: https://bbc.godbolt.org/?model=Master&d ... d&autoboot
So, yeah. Now it really is a real Mandelbrot renderer. (That's 32 iterations, too.)
So reenigne came up with an incredibly cunning and impossibly fast Mandelbrot kernel, using 14point fixed point arithmetic and a 16kB lookup table. After patching it into my renderer, this happened...
Jsbeeb link: https://bbc.godbolt.org/?model=Master&d ... d&autoboot
So, yeah. Now it really is a real Mandelbrot renderer. (That's 32 iterations, too.)
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
Very good indeed  what's the secret sauce?
Re: 12second Mandelbrot rendering on the BBC Master!
I'm going to do a proper writeup, because it's really neat, but essentially: there are 16384 possible 14 bit numbers; if we enforce that the LSB is always 0, then there are 8192 possible 14 bit numbers, each of which are 2 apart; the top two bits are unused, so let's mask in $8000; now every possible number is a valid pointer to a unique entry in a 16kB lookup table at 0x8000.
So, every number is a pointer to its own square.
This then allows a relatively standard tableofsquares approach to multiplication. Arithmetic operations are adds, subtractions and squares, all of which preserve the 0 in the LSB; operations which result in a number which might need to be squared require the top two bits to be fixed up; but that's basically all there is to it.
(Although reenigne's original kernel is also supposed to live in zero page and use selfmodifying code. The version in my renderer doesn't, because I haven't figured out how to make the assembler do it. Sigh. I miss the BBC Basic assembler.)
So, every number is a pointer to its own square.
This then allows a relatively standard tableofsquares approach to multiplication. Arithmetic operations are adds, subtractions and squares, all of which preserve the 0 in the LSB; operations which result in a number which might need to be squared require the top two bits to be fixed up; but that's basically all there is to it.
(Although reenigne's original kernel is also supposed to live in zero page and use selfmodifying code. The version in my renderer doesn't, because I haven't figured out how to make the assembler do it. Sigh. I miss the BBC Basic assembler.)
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
Very nice! (I think I'd call it 13 bits, but still.)
 Rich TalbotWatkins
 Posts: 1339
 Joined: Thu Jan 13, 2005 5:20 pm
 Location: Palma, Mallorca
 Contact:
Re: 12second Mandelbrot rendering on the BBC Master!
That's a cool technique!
So, we have:
Zre' = Zre2  Zim2 + Cre
Zim' = 2 Zre Zim + Cim
= ( (Zre + Zim)2  (Zre  Zim)2 ) / 2 + Cim
That way, all the squares come directly from the table.
Did you ever see my take on the Mandelbrot set, here? It uses 32bit precision, and performs a boundary tracing technique rather than the recursive subdivision that you went for. I wonder if mine would perform better using that trick instead? I used to think mine was fast, but yours just blows it out of the water! (Looking at my source code in retrospect, I can see some obvious low hanging fruit for improving it marginally, but I doubt it'd make a huge difference.)
So, we have:
Zre' = Zre2  Zim2 + Cre
Zim' = 2 Zre Zim + Cim
= ( (Zre + Zim)2  (Zre  Zim)2 ) / 2 + Cim
That way, all the squares come directly from the table.
Did you ever see my take on the Mandelbrot set, here? It uses 32bit precision, and performs a boundary tracing technique rather than the recursive subdivision that you went for. I wonder if mine would perform better using that trick instead? I used to think mine was fast, but yours just blows it out of the water! (Looking at my source code in retrospect, I can see some obvious low hanging fruit for improving it marginally, but I doubt it'd make a huge difference.)
Re: 12second Mandelbrot rendering on the BBC Master!
Sadly, I must be absolutely clear that this is no longer my trick! (Although I can still claim responsibility for the recursive subdivision, which is a big help, at least once I fixed the big bug that meant it didn't work properly in my original version.)
How does your boundary tracing version track edges? i.e. data structure management? I'm using video memory as a cache for rendered pixels; mode 2 colour 8, mapped to black so it's invisible, is used to mark unrendered pixels. I'd like to switch to mode 1 for higher resolution but because I need that colour I'd end up with a threecolour image. Might still be worth it, though.
(Also, I had forgotten how nasty pick/plot pixels on the BBC is... that framebuffer format, ugh. Anyone have any pointers to good (i.e. fast) machine code routines?)
(This algorithm would work *really* well on a Z80, with its 16bit adds/subtracts.)
How does your boundary tracing version track edges? i.e. data structure management? I'm using video memory as a cache for rendered pixels; mode 2 colour 8, mapped to black so it's invisible, is used to mark unrendered pixels. I'd like to switch to mode 1 for higher resolution but because I need that colour I'd end up with a threecolour image. Might still be worth it, though.
(Also, I had forgotten how nasty pick/plot pixels on the BBC is... that framebuffer format, ugh. Anyone have any pointers to good (i.e. fast) machine code routines?)
(This algorithm would work *really* well on a Z80, with its 16bit adds/subtracts.)
David Given
http://cowlark.com
http://cowlark.com
 Rich TalbotWatkins
 Posts: 1339
 Joined: Thu Jan 13, 2005 5:20 pm
 Location: Palma, Mallorca
 Contact:
Re: 12second Mandelbrot rendering on the BBC Master!
The border tracing is quite straightforward. Instead of using recursion, it maintains a queue of coordinates to process and carries on until it's empty. First thing it does is pushes all pixels at the screen edge to the queue. Then, for each item in the queue:
I use a similar trick to you with special onscreen colour values (given the 4th colour bit in MODE 2 is redundant). Each element is actually a pair of pixels (because I use dithering to get more colours), so the resolution ends up being 184x136 (overscan); then the two top bits have a special meaning  one to denote that the pixel hasn't yet been written, and one to denote that a pixel has already been added to the queue (otherwise we could quickly fill the queue with duplicates). These are distinct because we don't always write new pixels to the queue, if they happen to be the same colour as the central pixel being tested against.
I wrote a little about the multiplication technique that it uses here. It uses fixed point 3:29 values (between 4 and +4), and does an earlyout when Zre or Zim are >= 2, before they're squared, as clearly that would already yield Z' >= 4. A cute consequence of this multiplication technique is that it can square a number slightly more quickly than a regular multiplication, because many of the partial products are the same.
Yeah, screen address calculation is a pain. I think I use a table of the start address for each character row, and then add in (y AND 7) + (x AND &FE)*4. Source code is in that link in case you want to see how it all works.
 Evaluate the pixel at that position and plot it. Note that the pixel may already be plotted, in which case use the screen as a cache to save reevaluating the pixel.
 Evaluate all 8 neighbours and plot them (again remembering that their result may already be known from the screen itself). If any neighbour differs from the central pixel, we have found a border, so push this to the queue so we trace along the border.
I use a similar trick to you with special onscreen colour values (given the 4th colour bit in MODE 2 is redundant). Each element is actually a pair of pixels (because I use dithering to get more colours), so the resolution ends up being 184x136 (overscan); then the two top bits have a special meaning  one to denote that the pixel hasn't yet been written, and one to denote that a pixel has already been added to the queue (otherwise we could quickly fill the queue with duplicates). These are distinct because we don't always write new pixels to the queue, if they happen to be the same colour as the central pixel being tested against.
I wrote a little about the multiplication technique that it uses here. It uses fixed point 3:29 values (between 4 and +4), and does an earlyout when Zre or Zim are >= 2, before they're squared, as clearly that would already yield Z' >= 4. A cute consequence of this multiplication technique is that it can square a number slightly more quickly than a regular multiplication, because many of the partial products are the same.
Yeah, screen address calculation is a pain. I think I use a table of the start address for each character row, and then add in (y AND 7) + (x AND &FE)*4. Source code is in that link in case you want to see how it all works.
Re: 12second Mandelbrot rendering on the BBC Master!
Four squarings... it turns out, I think, that three squarings are enough:Rich TalbotWatkins wrote:That's a cool technique!
So, we have:
Zre' = Zre2  Zim2 + Cre
Zim' = 2 Zre Zim + Cim
= ( (Zre + Zim)2  (Zre  Zim)2 ) / 2 + Cim
That way, all the squares come directly from the table.
The one multiply was being used to calculate zr * zi. If we instead calculate (zr + zi)^2 then we get zr^2 + 2*zr*zi + zi^2. Since we’ve already calculated zr^2 and zi^2 we can just subtract them off
 Rich TalbotWatkins
 Posts: 1339
 Joined: Thu Jan 13, 2005 5:20 pm
 Location: Palma, Mallorca
 Contact:
Re: 12second Mandelbrot rendering on the BBC Master!
Oooh, great trick! Kicking myself for not spotting that when I was writing it up! I can actually use that to improve the performance of my code, substantially I suspect!
Re: 12second Mandelbrot rendering on the BBC Master!
There's another idea which I'm failing to find in my searches a little like the boundary following, expect it draws a small circle around each computed point. In areas of less detail, it turns out the circle is not so small, of course. The net effect, mathematically justified, is needing to compute fewer points than you paint.
Re: 12second Mandelbrot rendering on the BBC Master!
BTW, for the cost of a couple of base2 logs, which is almost nothing, surely, in binary, and with the addition of a VideoNULA, it's possible to apply smooth shading to the display. See here
http://linas.org/artgallery/escape/smooth.html
http://linas.org/artgallery/escape/smooth.html
 Rich TalbotWatkins
 Posts: 1339
 Joined: Thu Jan 13, 2005 5:20 pm
 Location: Palma, Mallorca
 Contact:
Re: 12second Mandelbrot rendering on the BBC Master!
Out of curiosity, I just tried making this change in my original program to see what kind of difference it makes. It now renders the initial screen in 1:45 instead of 1:59  not exactly the improvement I was expecting, but it's something I guess. And still clearly not 12 secondsBigEd wrote: Four squarings... it turns out, I think, that three squarings are enough:The one multiply was being used to calculate zr * zi. If we instead calculate (zr + zi)^2 then we get zr^2 + 2*zr*zi + zi^2. Since we’ve already calculated zr^2 and zi^2 we can just subtract them off
Not sure what sneaky tricks could be employed to speed up the multiply while still keeping 32 bit precision.
 Attachments

 mandel.zip
 (9.59 KiB) Downloaded 19 times
 Rich TalbotWatkins
 Posts: 1339
 Joined: Thu Jan 13, 2005 5:20 pm
 Location: Palma, Mallorca
 Contact:
Re: 12second Mandelbrot rendering on the BBC Master!
Sounds interesting. Essentially the greatest optimization at this stage is to avoid calculating as many points as possible (particularly those requiring large numbers of iterations, e.g. those in the Mandelbrot set).BigEd wrote:There's another idea which I'm failing to find in my searches a little like the boundary following, expect it draws a small circle around each computed point. In areas of less detail, it turns out the circle is not so small, of course. The net effect, mathematically justified, is needing to compute fewer points than you paint.
I wonder which strategy hits fewer points out of recursive subdivision or boundary tracing? I figured that boundary tracing will only evaluate the border points of the Mandelbrot set, while subdivision will evaluate any which lie on the edge of the block being considered, but maybe there's an overall advantage in the latter approach. Maybe boundary tracing wastes time evaluating neighbours which are more likely to be the same colour as the central element.
Re: 12second Mandelbrot rendering on the BBC Master!
Reenigne's kernel (it's at the bottom of https://github.com/davidgiven/bogomande ... mandel.asm, annotations mostly mine) is computing 2xy from (x+y)^2  x^2  y^2:
I need to port the code from xa65 to the BBC assembler, which is easier to use and much more powerful, and get the kernel into zero page where it belongs (it uses selfmodifying code to store onebyte values in the literal arguments of instructions).
Also it got pointed out to me that there's a really simple optimisation which will save tonnes of cycles per pixel for converting pixel coordinates to complex space. I may be able to use a similar trick for calculating screen addresses.
Reenigne said that they hadn't written any 6502 code before, BTW.
Code: Select all
zr2_plus_zi2 = zr^2 + zi^2
if zr2_plus_zi2 > 4 then bailout
zr_plus_zi = zr + zi
zr2_minus_zi2 = zr^2  zi^2
zr = zr2_minus_zi2 + cr
zi = zr2_plus_zi2  zr_plus_zi + ci
Also it got pointed out to me that there's a really simple optimisation which will save tonnes of cycles per pixel for converting pixel coordinates to complex space. I may be able to use a similar trick for calculating screen addresses.
Reenigne said that they hadn't written any 6502 code before, BTW.
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
WITNESS THE POWER OF MY FULLY OPERATIONAL... PROGRAM THING.
Now with panandzoom support, Julia sets, actual help text, and three (3) levels of zoom!
The writeup with all the maths in it is here: http://cowlark.com/20180526bogomandel/ (and a link to the code).
The jsbeeb link is here: https://bbc.godbolt.org/?model=Master&d ... d&autoboot
Now with panandzoom support, Julia sets, actual help text, and three (3) levels of zoom!
The writeup with all the maths in it is here: http://cowlark.com/20180526bogomandel/ (and a link to the code).
The jsbeeb link is here: https://bbc.godbolt.org/?model=Master&d ... d&autoboot
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
Wow, I would never have thought I would have seen that on a beeb bitd.
Re: 12second Mandelbrot rendering on the BBC Master!
I don't know if you're interested in another micro optimization but
Because of the bcs we know the carry is clear so no need for the clc.
Inlining mandel as it is only used once saves quite a few cycles.
This is all amazing work.
Code: Select all
bcs bailout
sta zr2_p_zi2_hi
; Calculate zr + zi.
clc
lda zr+0 ; A = low(zr)
Inlining mandel as it is only used once saves quite a few cycles.
This is all amazing work.
Last edited by dp11 on Sun May 27, 2018 8:18 am, edited 1 time in total.
Re: 12second Mandelbrot rendering on the BBC Master!
@dp11: ooh, nice! Every second counts  that shaves a tenth of a second off the render speed for the default image (1805 px/s instead of 1795).
...you know, I think I can scrape another bit of precision out of this (which would add another zoom layer). I'd need a 32kB lookup table, which would extend from 0x4000 to 0xc000; the framebuffer would go into shadow RAM. The top two bits of the number representation would would be either %01 or %10, so the top bit would be the inverse of the sign bit. That means my fixup would go from:
lda number+1
and #&3f
ora #&80
sta number+1
...to:
lda number+1
asl A
lsr A ; propagate sign bit to top bit
eor #&80 ; flip sign bit
sta number+1
That's two extra cycles. There are three fixups in the kernel, so I'd expect 1785 px/s  which should be negligable.
Dammit, I thought I'd finished this.
...you know, I think I can scrape another bit of precision out of this (which would add another zoom layer). I'd need a 32kB lookup table, which would extend from 0x4000 to 0xc000; the framebuffer would go into shadow RAM. The top two bits of the number representation would would be either %01 or %10, so the top bit would be the inverse of the sign bit. That means my fixup would go from:
lda number+1
and #&3f
ora #&80
sta number+1
...to:
lda number+1
asl A
lsr A ; propagate sign bit to top bit
eor #&80 ; flip sign bit
sta number+1
That's two extra cycles. There are three fixups in the kernel, so I'd expect 1785 px/s  which should be negligable.
Dammit, I thought I'd finished this.
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
Just edited above post Inlining mandel will save quite a bit of time. After inling try and make .iterator_loop all sit within one page .
Also align the data to a page boundary so no extra cycles are added when accessing the data :
Also align the data to a page boundary so no extra cycles are added when accessing the data :
Code: Select all
.pixels_to_zr_lo skip &100
.pixels_to_zr_hi skip &100
.pixels_to_zi_lo skip &100
.pixels_to_zi_hi skip &100
.col_table_lo skip &80
.col_table_hi skip &80
.row_table_lo skip &100
.row_table_hi skip &100
Re: 12second Mandelbrot rendering on the BBC Master!
Yup, works fine! And I chopped off a bit from the integer part and added it to the fractional part, which gives me two extra bits of precision  you can now zoom all the way down to areas 0.25 across!
Sadly the performance has suffered. The number fixup is more complex than I expected; I ended up with:
= 10.5 cycles, up from 4.
So it's now getting 1590 px/s for the toplevel image instead of 1800px/s, which is a shame. All I want to do is to set bit7 = !bit6; are there any cunninger ways?
(The jsbeeb link is unchanged  the new version's uploaded.)
Re alignment and inline etc: the program is really topheavy; it's massively dominated by the inner loop inside the mandel routine. Optimising things outside that routine has very little effect; I tried aligning the lookup tables and the result was lost in the noise.
reenigne's original version had the entire kernel running in zero page and even more selfmodifying code, just to shave a few extra cycles off. I actually don't do that, because I'd have to stomp all over Basic workspace to make room and the shell program with the UI in it is in Basic (for simplicity).
Sadly the performance has suffered. The number fixup is more complex than I expected; I ended up with:
Code: Select all
(lda number+1)
asl A ; bit 6 > bit 7 > N; 2
clc ; 2
bmi dont_set_c ; 2.5
sec ; 2
.dont_set_c
ror A ; C > bit 7 ; 2
(sta number+1)
So it's now getting 1590 px/s for the toplevel image instead of 1800px/s, which is a shame. All I want to do is to set bit7 = !bit6; are there any cunninger ways?
(The jsbeeb link is unchanged  the new version's uploaded.)
Re alignment and inline etc: the program is really topheavy; it's massively dominated by the inner loop inside the mandel routine. Optimising things outside that routine has very little effect; I tried aligning the lookup tables and the result was lost in the noise.
reenigne's original version had the entire kernel running in zero page and even more selfmodifying code, just to shave a few extra cycles off. I actually don't do that, because I'd have to stomp all over Basic workspace to make room and the shell program with the UI in it is in Basic (for simplicity).
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
Love this  I haven't looked at the code, but extending the colour depth to the full 16 in MODE2 will mean that a bit of NULA magic can also be worked!
d.
d.
Re: 12second Mandelbrot rendering on the BBC Master!
Bit7 isn't known, because it'll get corrupted by the previous computation. The fixup adjusts the number so that it's a pointer to its square in the lookup table: bit6 is the sign bit, so positive numbers go in 0x8000 to 0xc000 (the top two bits are %10) and negative numbers go in 0x4000 to 0x8000 (the top two bits are %01). Adding two positive numbers will therefore reset bit7, but adding a positive number and a negative number will set it, etc. The fixup is necessary to correct this.
I'm on a BBC Master, so I get to use the extra 65C02 instructions, but I'm not sure they'll help. (We don't get the Rockwell and WDC extra instructions, do we?)
I'm on a BBC Master, so I get to use the extra 65C02 instructions, but I'm not sure they'll help. (We don't get the Rockwell and WDC extra instructions, do we?)
David Given
http://cowlark.com
http://cowlark.com
Re: 12second Mandelbrot rendering on the BBC Master!
Wasn't there a code option.
LDA + 8 cycles if it is correct.
Code: Select all
(LDA)
ASL
CMP #128
ROR
EOR #128
(STA)
Last edited by tricky on Sun May 27, 2018 11:38 am, edited 1 time in total.
Re: 12second Mandelbrot rendering on the BBC Master!
The master is the 65SC12  so I think it doesn't get the bit instructions
d.
d.