## Border tracing Mandelbrot generator for the Beeb

### Re: Border tracing Mandelbrot generator for the Beeb

I've just made a Mandelbrot video for the BBC Micro program. The program had worked during about 5 minutes but the video is accelerated. The speed and GUI are very good. IMHO it is the best Mandelbrot in the 8-bit world. The version for the 2nd processor may be better than the best Mandelbrot for Amiga or Atari ST.

### Re: Border tracing Mandelbrot generator for the Beeb

litwr wrote:I've just made a Mandelbrot video for the BBC Micro program. The program had worked during about 5 minutes but the video is accelerated. The speed and GUI are very good. IMHO it is the best Mandelbrot in the 8-bit world. The version for the 2nd processor may be better than the best Mandelbrot for Amiga or Atari ST.

Is this program available for us to try out?

### Re: Border tracing Mandelbrot generator for the Beeb

hoglet wrote:Is this program available for us to try out?

It is the main program of this thread!

### Re: Border tracing Mandelbrot generator for the Beeb

litwr wrote:hoglet wrote:Is this program available for us to try out?

It is the main program of this thread!

Ah yes, sorry I was confused for a moment that you had a new program. Just ignore me

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

litwr wrote:I've just made a Mandelbrot video for the BBC Micro program. The program had worked during about 5 minutes but the video is accelerated. The speed and GUI are very good. IMHO it is the best Mandelbrot in the 8-bit world. The version for the 2nd processor may be better than the best Mandelbrot for Amiga or Atari ST.

Thanks! That's nice to hear. My biggest problem with it now is the precision problem after zooming for the fifth time (which is why zoom gets disabled after that) - this would be fixed by either using more fixed-point bits (currently 29 bits), or by going for some kind of floating-point representation. Neither are great as they will slow things down more. The former would be simpler but would still only increase the number of zoom levels by 4 per extra byte used. The latter would probably allow zooming to a much greater depth, but will add a much greater processing cost. One day I'll have a play with it and see if I can improve the accuracy at high zooms.

By the way, any 2nd processor version would most likely only double the speed (entirely due to the double clock speed of the parasite) - if I could find a way for both CPUs to perform calculations in parallel, this would improve things a bit more, although I expect the synchronisation would probably hit performance.

### Re: Border tracing Mandelbrot generator for the Beeb

I think floating point is no win at all for Mandelbrot - the orbits of the points go all over the map. You just need more bits of precision, and you can stick with fixed point.

Edit:

I don't suppose it would help that the modern co-pro is running at 64MHz, or 256MHz?!?

Edit:

Rich Talbot-Watkins wrote:By the way, any 2nd processor version would most likely only double the speed.

I don't suppose it would help that the modern co-pro is running at 64MHz, or 256MHz?!?

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

Really? I figured that since the algorithm is basically z2+c iterated until it exceeds 2, the important thing is the number of significant bits you have, not the number of total bits. In cases on the edge of the Mandelbrot set at high zooms, you are potentially iterating many times, and the values themselves might be very small, but what's more important is the number of significant bits of the result. I imagined going to floating point would absolutely be a win, but perhaps I need to think more carefully about that (or just try coding it in C++ and comparing fixed vs floating point results).

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

BigEd wrote:I don't suppose it would help that the modern co-pro is running at 64MHz, or 256MHz?!?

Well, yeah, there is that. But I consider that to be cheating!

### Re: Border tracing Mandelbrot generator for the Beeb

Rich Talbot-Watkins wrote:Really? I figured that since the algorithm is basically z2+c iterated until it exceeds 2, the important thing is the number of significant bits you have, not the number of total bits. In cases on the edge of the Mandelbrot set at high zooms, you are potentially iterating many times, and the values themselves might be very small, but what's more important is the number of significant bits of the result. I imagined going to floating point would absolutely be a win, but perhaps I need to think more carefully about that (or just try coding it in C++ and comparing fixed vs floating point results).

Hmm, on reflection I'm not sure any more... this calls for experiment, or a bit of Googling, or perhaps some deep thought.

### Re: Border tracing Mandelbrot generator for the Beeb

Isn't it true that we're not likely to be zooming in very close to (0,0) but close to somewhere on the boundary, so the magnitude of our numbers is, say, around a half? And yet, when zoomed right in, we need to distinguish the evolution of numbers which differ by very small amounts. So, we do need lots of accuracy, and our magnitudes don't vary so much. Floating point would seem like it's just extra work - what matters is how many bits of precision we have.

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

Yeah I also just reflected and think you're probably right actually. The problem is that the ranges for c in "interesting" areas go from (-1.5...0.5, +/-1) - so adding c each iteration can immediately increase the number of significant bits needed to accurately hold the result. Floating point will sometimes help, but other times will just truncate precision, in which case we still have the same problem. Going to a 6 or 8 byte fixed-point representation would probably be the first step in getting better results. Also it needs to increase the maximum iteration count, the higher the zoom! Another thing that'll slow it down!

Edit: yes, basically what you said there.

Edit: yes, basically what you said there.

### Re: Border tracing Mandelbrot generator for the Beeb

With a normal 3 or 4MHz copro, there would be some advantage in getting the 2MHz host to do a bit of work. We know how to upload a bit of code to the host using OSWORD and to give it control using USERV. We need to remember that the host can't do any I/O while it's busy running our code - including passing on an ESCAPE event - so we need fairly small work packets. How about this:

There's a bit of a gotcha there, in that a much faster copro will now be rate-limited by the host's speed. But that's OK if you provide a way to disable this speedup, or can detect this case.

I think computing z² demands two squarings and one multiplication, so that's three roughly-equal bits of work, which is handy.

- - collect three multiplications which need doing

- pass one to the host

- perform the other two

- collect result from the host

There's a bit of a gotcha there, in that a much faster copro will now be rate-limited by the host's speed. But that's OK if you provide a way to disable this speedup, or can detect this case.

I think computing z² demands two squarings and one multiplication, so that's three roughly-equal bits of work, which is handy.

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

I decided to do a few rough calculations to see whether the multiplication is still likely the bottleneck. The 32 bit * 32 bit multiply code (see .mul_fixed in the source code) takes a bit less than 800 cycles (compared to the naive way which would be anything from 2200-3200 cycles, depending on how many bits are set in the multiplicand), so this is a lot better, but still the slowest thing in the code!

Squaring is actually a little faster as some of the partial products are identical and so only need to be calculated once.

Your idea for splitting up the three multiplications between the two CPUs is great, in the case that the parasite is about a factor of two faster. How lucky the algorithm works like that!

Squaring is actually a little faster as some of the partial products are identical and so only need to be calculated once.

Your idea for splitting up the three multiplications between the two CPUs is great, in the case that the parasite is about a factor of two faster. How lucky the algorithm works like that!

### Re: Border tracing Mandelbrot generator for the Beeb

Rich Talbot-Watkins wrote:New version! Now it looks like all the outlines are grown in parallel from the outside inwards - not sure if this is more or less visually interesting, but it has lower space requirements, so it's probably here to stay. Changed to a big overscan MODE 2 (184x272)!

Just catching up with this. Very nice!

### Re: Border tracing Mandelbrot generator for the Beeb

Rich Talbot-Watkins wrote:Squaring is actually a little faster as some of the partial products are identical and so only need to be calculated once.

I like the idea of squaring being a bit cheaper!

Can you say in a couple of sentences what your approach to multiplication is? I think I can see you're using a number of page-sized lookup tables. (I like the trick of Karatsuba multiplication but I'm not sure I've yet seen it used in real life.)

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

Yeah! So there's the reasonably well known "difference of squares method". This basically says that:

ab = f(a+b) - f(a-b)

where f(x) = x2/4

You can show this by expanding (a + b)2 and (a - b)2, like this:

(a + b)2 = a2 + b2 + 2ab -> (I)

(a - b)2 = a2 + b2 - 2ab -> (II)

(I) - (II) gives:

4ab = (a + b)2 - (a - b)2

So if we were to build a table of f(x) for x = 0...510, that would allow us to quickly look up f(a + b) for any positive a and b between 0 and 255:

How about f(a - b)? There's a lovely little hack which involves creating a new table, this time of g(x) = ((x-255)2)/4.

Now we have f(a - b) = g(a + 255 - b).

(255 - b) is just (b EOR &FF), so now we have the entire code snippet:

The first 6 lines of that are a constant preamble for constant mul1, so we get a further optimization if we want to multiply several values by mul1! That works out an 8 bit * 8 bit = 16 bit multiplication in very few cycles.

How about bigger numbers?

Well this is just about adding together partial products:

(a0 + ma1)(b0 + mb1)

= a0b0 + m(a0b1 + a1b0) + m2(a1b1)

(where m=256)

So, to do 16 bit * 16 bit = 32 bit, it's 4 multiplies, and then add the partial products together.

You can extend that idea to as far as you need, but the number of terms becomes a bit hairy after a while. In the source code, you can see how all these partial products need to get added together. Because each one is 16 bit, they have to be done individually so you don't lose any carries - there's no shortcut, just horrible repetitive code, but it still ends up being quicker than the classic 6502 multiply algorithm.

Hmm, that was longer than a couple of sentences huh!

ab = f(a+b) - f(a-b)

where f(x) = x2/4

You can show this by expanding (a + b)2 and (a - b)2, like this:

(a + b)2 = a2 + b2 + 2ab -> (I)

(a - b)2 = a2 + b2 - 2ab -> (II)

(I) - (II) gives:

4ab = (a + b)2 - (a - b)2

So if we were to build a table of f(x) for x = 0...510, that would allow us to quickly look up f(a + b) for any positive a and b between 0 and 255:

Code: Select all

`LDA mul1`

STA mulindex

LDY mul2

.mulindex LDA (mulindex),Y ; (mulindex) points to page aligned multable

How about f(a - b)? There's a lovely little hack which involves creating a new table, this time of g(x) = ((x-255)2)/4.

Now we have f(a - b) = g(a + 255 - b).

(255 - b) is just (b EOR &FF), so now we have the entire code snippet:

Code: Select all

`LDA mul1`

STA mulindex1

STA mulindex3

EOR #&FF

STA mulindex2

STA mulindex4

LDY mul2

SEC

LDA (mulindex1),Y ; points to page aligned multable_lo

SBC (mulindex2),Y ; points to page aligned multable255_lo

STA result_lo

LDA (mulindex3),Y ; points to page aligned multable_hi

SBC (mulindex4),Y ; points to page aligned multable255_hi

STA result_hi

The first 6 lines of that are a constant preamble for constant mul1, so we get a further optimization if we want to multiply several values by mul1! That works out an 8 bit * 8 bit = 16 bit multiplication in very few cycles.

How about bigger numbers?

Well this is just about adding together partial products:

(a0 + ma1)(b0 + mb1)

= a0b0 + m(a0b1 + a1b0) + m2(a1b1)

(where m=256)

So, to do 16 bit * 16 bit = 32 bit, it's 4 multiplies, and then add the partial products together.

You can extend that idea to as far as you need, but the number of terms becomes a bit hairy after a while. In the source code, you can see how all these partial products need to get added together. Because each one is 16 bit, they have to be done individually so you don't lose any carries - there's no shortcut, just horrible repetitive code, but it still ends up being quicker than the classic 6502 multiply algorithm.

Hmm, that was longer than a couple of sentences huh!

### Re: Border tracing Mandelbrot generator for the Beeb

Thanks! That's great... can you see a way to use Karatsuba's observation that only three multiplies are needed - or is it clear that it won't help?

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

Interesting approach! Unfortunately the problem here is that (a0 + a1) and (b0 + b1) are both 9 bit values, and so won't work as multiplicands with the tables (nor with 6502 registers!). I don't think it's really important though because, in this case, the multiplication operation itself is so quick - particularly if one of the operands has already been set up.

### Re: Border tracing Mandelbrot generator for the Beeb

I suppose we'd need at least a BCS and a second byte-indexed table. But an 18-bit result is bit of a squeeze... does it start to make more sense as you move up to 6 byte and to 8 byte operands?

### Re: Border tracing Mandelbrot generator for the Beeb

Ooh, this looks even better:

"A three-way Toom–Cook can do a size-3N multiplication for the cost of five size-N multiplications"

So that's 6x6 for the price of five 2x2s.

"A three-way Toom–Cook can do a size-3N multiplication for the cost of five size-N multiplications"

So that's 6x6 for the price of five 2x2s.

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

No, I don't think it wouldn't be worth the extra hassle unfortunately. In the code, I actually calculate and cache all the partial products first as it's quicker to calculate, for example, all the terms which contain a3 one after the other. When I add them all up, I do this in a different order (LSBs first), so I have to have them precalculated so I can access them as I need.

Once it's set up to multiply with a3, each term takes only 29-33 cycles to calculate. Just building the extra terms needed for the Karatsuba method and handling the extra carry issues would take longer than that!

(I just had a look at the Toom-Cook page but it hurt my head before I got halfway down the page. I suspect it might suffer the same problem though if it's a generalisation of Karatsuba.)

Code: Select all

` ; do multiplication with a3`

LDA mulA+3:STA multab1loptr:STA multab1hiptr

EOR #&FF:STA multab2loptr:STA multab2hiptr

; a3.b0

LDY mulB+0

LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b0lo

LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b0hi

; a3.b1

LDY mulB+1

LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b1lo

LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b1hi

; a3.b2

LDY mulB+2

LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b2lo

LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b2hi

; a3.b3

LDY mulB+3

LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b3lo

LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b3hi

Once it's set up to multiply with a3, each term takes only 29-33 cycles to calculate. Just building the extra terms needed for the Karatsuba method and handling the extra carry issues would take longer than that!

(I just had a look at the Toom-Cook page but it hurt my head before I got halfway down the page. I suspect it might suffer the same problem though if it's a generalisation of Karatsuba.)

### Re: Border tracing Mandelbrot generator for the Beeb

It feels like the square law will get you eventually - but eventually might be a lot more than 8 bytes of operand!

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

Just to add, we're already doing a 9-bit index operation here, by addressing with (table+A),Y! Essentially the index lookup is doing the addition with carry for us!

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

For a 64 bit * 64 bit multiplication, there'd be 64 terms in total, of which we could discard 21 (which contribute only to underflow terms of the result). That's still 43 multiplications and I frankly hate to think how many 16 bit additions, so it's not going to be quick. For comparison, 32*32 multiplication involves 13 terms and 12 additions, so the quadratic complexity is definitely going to kill.

### Re: Border tracing Mandelbrot generator for the Beeb

It's nice that you only need an approximate result, so you can avoid doing a fair fraction of the work - about a third for 8x8 bytes, from what you say.

According to the table in this presentation it might be around the 8x8 byte point that Karatsuba starts being the right choice - I'm assuming 190 bits is to be considered about 6 words of 32 bits. Fortunately, perhaps, Toom-Cook isn't worth using until the 16x16 byte point.

That's neat!

According to the table in this presentation it might be around the 8x8 byte point that Karatsuba starts being the right choice - I'm assuming 190 bits is to be considered about 6 words of 32 bits. Fortunately, perhaps, Toom-Cook isn't worth using until the 16x16 byte point.

Rich Talbot-Watkins wrote:Just to add, we're already doing a 9-bit index operation here, by addressing with (table+A),Y! Essentially the index lookup is doing the addition with carry for us!

That's neat!

### Re: Border tracing Mandelbrot generator for the Beeb

Rich Talbot-Watkins wrote:By the way, any 2nd processor version would most likely only double the speed (entirely due to the double clock speed of the parasite) - if I could find a way for both CPUs to perform calculations in parallel, this would improve things a bit more, although I expect the synchronisation would probably hit performance.

Just wondering if you're still thinking about giving this a go? There are more (and faster) second processors in play than ever before, and we need something to show them off!

(The PiTubeDirect has a game-friendly and historically appropriate setting for 3MHz operation, which lines up with a cheesewedge, and the Matchbox has a 4MHz operation which lines up with the Master Turbo. The PiTubeDirect next release will most likely have a selectable speed from 1MHz to 24MHz or full-on.)

- Rich Talbot-Watkins
**Posts:**1088**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca

### Re: Border tracing Mandelbrot generator for the Beeb

BigEd wrote:Just wondering if you're still thinking about giving this a go? There are more (and faster) second processors in play than ever before, and we need something to show them off!

One day I'd love to. But not any time real soon - I'm going through a bit of a "busy period" in work right now...

### Re: Border tracing Mandelbrot generator for the Beeb

Fair enough!

### Who is online

Users browsing this forum: No registered users and 8 guests