## Border tracing Mandelbrot generator for the Beeb

### Re: Border tracing Mandelbrot generator for the Beeb

I've just made a Mandelbrot video for the BBC Micro program. The program had worked during about 5 minutes but the video is accelerated. The speed and GUI are very good. IMHO it is the best Mandelbrot in the 8-bit world. The version for the 2nd processor may be better than the best Mandelbrot for Amiga or Atari ST.

### Re: Border tracing Mandelbrot generator for the Beeb

Is this program available for us to try out?litwr wrote:I've just made a Mandelbrot video for the BBC Micro program. The program had worked during about 5 minutes but the video is accelerated. The speed and GUI are very good. IMHO it is the best Mandelbrot in the 8-bit world. The version for the 2nd processor may be better than the best Mandelbrot for Amiga or Atari ST.

### Re: Border tracing Mandelbrot generator for the Beeb

It is the main program of this thread!hoglet wrote:Is this program available for us to try out?

### Re: Border tracing Mandelbrot generator for the Beeb

Ah yes, sorry I was confused for a moment that you had a new program. Just ignore melitwr wrote:It is the main program of this thread!hoglet wrote:Is this program available for us to try out?

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Thanks! That's nice to hear. My biggest problem with it now is the precision problem after zooming for the fifth time (which is why zoom gets disabled after that) - this would be fixed by either using more fixed-point bits (currently 29 bits), or by going for some kind of floating-point representation. Neither are great as they will slow things down more. The former would be simpler but would still only increase the number of zoom levels by 4 per extra byte used. The latter would probably allow zooming to a much greater depth, but will add a much greater processing cost. One day I'll have a play with it and see if I can improve the accuracy at high zooms.litwr wrote:I've just made a Mandelbrot video for the BBC Micro program. The program had worked during about 5 minutes but the video is accelerated. The speed and GUI are very good. IMHO it is the best Mandelbrot in the 8-bit world. The version for the 2nd processor may be better than the best Mandelbrot for Amiga or Atari ST.

By the way, any 2nd processor version would most likely only double the speed (entirely due to the double clock speed of the parasite) - if I could find a way for both CPUs to perform calculations in parallel, this would improve things a bit more, although I expect the synchronisation would probably hit performance.

### Re: Border tracing Mandelbrot generator for the Beeb

I think floating point is no win at all for Mandelbrot - the orbits of the points go all over the map. You just need more bits of precision, and you can stick with fixed point.

Edit:

Edit:

I don't suppose it would help that the modern co-pro is running at 64MHz, or 256MHz?!?Rich Talbot-Watkins wrote:By the way, any 2nd processor version would most likely only double the speed.

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Really? I figured that since the algorithm is basically z2+c iterated until it exceeds 2, the important thing is the number of significant bits you have, not the number of total bits. In cases on the edge of the Mandelbrot set at high zooms, you are potentially iterating many times, and the values themselves might be very small, but what's more important is the number of significant bits of the result. I imagined going to floating point would absolutely be a win, but perhaps I need to think more carefully about that (or just try coding it in C++ and comparing fixed vs floating point results).

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Well, yeah, thereBigEd wrote:I don't suppose it would help that the modern co-pro is running at 64MHz, or 256MHz?!?

*is*that. But I consider that to be cheating!

### Re: Border tracing Mandelbrot generator for the Beeb

Hmm, on reflection I'm not sure any more... this calls for experiment, or a bit of Googling, or perhaps some deep thought.Rich Talbot-Watkins wrote:Really? I figured that since the algorithm is basically z2+c iterated until it exceeds 2, the important thing is the number of significant bits you have, not the number of total bits. In cases on the edge of the Mandelbrot set at high zooms, you are potentially iterating many times, and the values themselves might be very small, but what's more important is the number of significant bits of the result. I imagined going to floating point would absolutely be a win, but perhaps I need to think more carefully about that (or just try coding it in C++ and comparing fixed vs floating point results).

### Re: Border tracing Mandelbrot generator for the Beeb

Isn't it true that we're not likely to be zooming in very close to (0,0) but close to somewhere on the boundary, so the magnitude of our numbers is, say, around a half? And yet, when zoomed right in, we need to distinguish the evolution of numbers which differ by very small amounts. So, we do need lots of accuracy, and our magnitudes don't vary so much. Floating point would seem like it's just extra work - what matters is how many bits of precision we have.

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Yeah I also just reflected and think you're probably right actually. The problem is that the ranges for c in "interesting" areas go from (-1.5...0.5, +/-1) - so adding c each iteration can immediately increase the number of significant bits needed to accurately hold the result. Floating point will sometimes help, but other times will just truncate precision, in which case we still have the same problem. Going to a 6 or 8 byte fixed-point representation would probably be the first step in getting better results. Also it needs to increase the maximum iteration count, the higher the zoom! Another thing that'll slow it down!

Edit: yes, basically what you said there.

Edit: yes, basically what you said there.

### Re: Border tracing Mandelbrot generator for the Beeb

With a normal 3 or 4MHz copro, there would be some advantage in getting the 2MHz host to do a bit of work. We know how to upload a bit of code to the host using OSWORD and to give it control using USERV. We need to remember that the host can't do any I/O while it's busy running our code - including passing on an ESCAPE event - so we need fairly small work packets. How about this:

I think computing z² demands two squarings and one multiplication, so that's three roughly-equal bits of work, which is handy.

- - collect three multiplications which need doing

- pass one to the host

- perform the other two

- collect result from the host

I think computing z² demands two squarings and one multiplication, so that's three roughly-equal bits of work, which is handy.

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

I decided to do a few rough calculations to see whether the multiplication is still likely the bottleneck. The 32 bit * 32 bit multiply code (see .mul_fixed in the source code) takes a bit less than 800 cycles (compared to the naive way which would be anything from 2200-3200 cycles, depending on how many bits are set in the multiplicand), so this is a lot better, but still the slowest thing in the code!

Squaring is actually a little faster as some of the partial products are identical and so only need to be calculated once.

Your idea for splitting up the three multiplications between the two CPUs is great, in the case that the parasite is about a factor of two faster. How lucky the algorithm works like that!

Squaring is actually a little faster as some of the partial products are identical and so only need to be calculated once.

Your idea for splitting up the three multiplications between the two CPUs is great, in the case that the parasite is about a factor of two faster. How lucky the algorithm works like that!

### Re: Border tracing Mandelbrot generator for the Beeb

Just catching up with this. Very nice!Rich Talbot-Watkins wrote:New version! Now it looks like all the outlines are grown in parallel from the outside inwards - not sure if this is more or less visually interesting, but it has lower space requirements, so it's probably here to stay. Changed to a big overscan MODE 2 (184x272)!

### Re: Border tracing Mandelbrot generator for the Beeb

I like the idea of squaring being a bit cheaper!Rich Talbot-Watkins wrote:Squaring is actually a little faster as some of the partial products are identical and so only need to be calculated once.

Can you say in a couple of sentences what your approach to multiplication is? I think I can see you're using a number of page-sized lookup tables. (I like the trick of Karatsuba multiplication but I'm not sure I've yet seen it used in real life.)

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Yeah! So there's the reasonably well known "difference of squares method". This basically says that:

ab = f(a+b) - f(a-b)

where f(x) = x2/4

You can show this by expanding (a + b)2 and (a - b)2, like this:

(a + b)2 = a2 + b2 + 2ab -> (I)

(a - b)2 = a2 + b2 - 2ab -> (II)

(I) - (II) gives:

4ab = (a + b)2 - (a - b)2

So if we were to build a table of f(x) for x = 0...510, that would allow us to quickly look up f(a + b) for any positive a and b between 0 and 255:

How about f(a - b)? There's a lovely little hack which involves creating a new table, this time of g(x) = ((x-255)2)/4.

From this we can say: f(x) = g(x + 255).

So now we have f(a - b) = g( (a + 255) - b).

(255 - b) is just (b EOR &FF), so we can write f(a - b) = g(a + (b EOR &FF) ), and now we have the entire code snippet:

The first 6 lines of that are a constant preamble for constant mul1, so we get a further optimization if we want to multiply several values by mul1! That works out an 8 bit * 8 bit = 16 bit multiplication in very few cycles.

How about bigger numbers?

Well this is just about adding together partial products:

(a0 + ma1)(b0 + mb1)

= a0b0 + m(a0b1 + a1b0) + m2(a1b1)

(where m=256)

So, to do 16 bit * 16 bit = 32 bit, it's 4 multiplies, and then add the partial products together.

You can extend that idea to as far as you need, but the number of terms becomes a bit hairy after a while. In the source code, you can see how all these partial products need to get added together. Because each one is 16 bit, they have to be done individually so you don't lose any carries - there's no shortcut, just horrible repetitive code, but it still ends up being quicker than the classic 6502 multiply algorithm.

Hmm, that was longer than a couple of sentences huh!

ab = f(a+b) - f(a-b)

where f(x) = x2/4

You can show this by expanding (a + b)2 and (a - b)2, like this:

(a + b)2 = a2 + b2 + 2ab -> (I)

(a - b)2 = a2 + b2 - 2ab -> (II)

(I) - (II) gives:

4ab = (a + b)2 - (a - b)2

So if we were to build a table of f(x) for x = 0...510, that would allow us to quickly look up f(a + b) for any positive a and b between 0 and 255:

Code: Select all

```
LDA mul1
STA mulindex
LDY mul2
.mulindex LDA (mulindex),Y ; (mulindex) points to page aligned multable
```

From this we can say: f(x) = g(x + 255).

So now we have f(a - b) = g( (a + 255) - b).

(255 - b) is just (b EOR &FF), so we can write f(a - b) = g(a + (b EOR &FF) ), and now we have the entire code snippet:

Code: Select all

```
LDA mul1
STA mulindex1
STA mulindex3
EOR #&FF
STA mulindex2
STA mulindex4
LDY mul2
SEC
LDA (mulindex1),Y ; points to page aligned multable_lo
SBC (mulindex2),Y ; points to page aligned multable255_lo
STA result_lo
LDA (mulindex3),Y ; points to page aligned multable_hi
SBC (mulindex4),Y ; points to page aligned multable255_hi
STA result_hi
```

How about bigger numbers?

Well this is just about adding together partial products:

(a0 + ma1)(b0 + mb1)

= a0b0 + m(a0b1 + a1b0) + m2(a1b1)

(where m=256)

So, to do 16 bit * 16 bit = 32 bit, it's 4 multiplies, and then add the partial products together.

You can extend that idea to as far as you need, but the number of terms becomes a bit hairy after a while. In the source code, you can see how all these partial products need to get added together. Because each one is 16 bit, they have to be done individually so you don't lose any carries - there's no shortcut, just horrible repetitive code, but it still ends up being quicker than the classic 6502 multiply algorithm.

Hmm, that was longer than a couple of sentences huh!

### Re: Border tracing Mandelbrot generator for the Beeb

Thanks! That's great... can you see a way to use Karatsuba's observation that only three multiplies are needed - or is it clear that it won't help?

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Interesting approach! Unfortunately the problem here is that (a0 + a1) and (b0 + b1) are both 9 bit values, and so won't work as multiplicands with the tables (nor with 6502 registers!). I don't think it's really important though because, in this case, the multiplication operation itself is so quick - particularly if one of the operands has already been set up.

### Re: Border tracing Mandelbrot generator for the Beeb

I suppose we'd need at least a BCS and a second byte-indexed table. But an 18-bit result is bit of a squeeze... does it start to make more sense as you move up to 6 byte and to 8 byte operands?

### Re: Border tracing Mandelbrot generator for the Beeb

Ooh, this looks even better:

"A three-way Toom–Cook can do a size-3N multiplication for the cost of five size-N multiplications"

So that's 6x6 for the price of five 2x2s.

"A three-way Toom–Cook can do a size-3N multiplication for the cost of five size-N multiplications"

So that's 6x6 for the price of five 2x2s.

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

No, I don't think it wouldn't be worth the extra hassle unfortunately. In the code, I actually calculate and cache all the partial products first as it's quicker to calculate, for example, all the terms which contain a3 one after the other. When I add them all up, I do this in a different order (LSBs first), so I have to have them precalculated so I can access them as I need.

Once it's set up to multiply with a3, each term takes only 29-33 cycles to calculate. Just building the extra terms needed for the Karatsuba method and handling the extra carry issues would take longer than that!

(I just had a look at the Toom-Cook page but it hurt my head before I got halfway down the page. I suspect it might suffer the same problem though if it's a generalisation of Karatsuba.)

Code: Select all

```
; do multiplication with a3
LDA mulA+3:STA multab1loptr:STA multab1hiptr
EOR #&FF:STA multab2loptr:STA multab2hiptr
; a3.b0
LDY mulB+0
LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b0lo
LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b0hi
; a3.b1
LDY mulB+1
LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b1lo
LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b1hi
; a3.b2
LDY mulB+2
LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b2lo
LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b2hi
; a3.b3
LDY mulB+3
LDA (multab1loptr),Y:SBC (multab2loptr),Y:STA a3b3lo
LDA (multab1hiptr),Y:SBC (multab2hiptr),Y:STA a3b3hi
```

(I just had a look at the Toom-Cook page but it hurt my head before I got halfway down the page. I suspect it might suffer the same problem though if it's a generalisation of Karatsuba.)

### Re: Border tracing Mandelbrot generator for the Beeb

It feels like the square law will get you eventually - but eventually might be a lot more than 8 bytes of operand!

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Just to add, we're

*already*doing a 9-bit index operation here, by addressing with (table+A),Y! Essentially the index lookup is doing the addition with carry for us!- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

For a 64 bit * 64 bit multiplication, there'd be 64 terms in total, of which we could discard 21 (which contribute only to underflow terms of the result). That's still 43 multiplications and I frankly hate to think how many 16 bit additions, so it's not going to be quick. For comparison, 32*32 multiplication involves 13 terms and 12 additions, so the quadratic complexity is definitely going to kill.

### Re: Border tracing Mandelbrot generator for the Beeb

It's nice that you only need an approximate result, so you can avoid doing a fair fraction of the work - about a third for 8x8 bytes, from what you say.

According to the table in this presentation it might be around the 8x8 byte point that Karatsuba starts being the right choice - I'm assuming 190 bits is to be considered about 6 words of 32 bits. Fortunately, perhaps, Toom-Cook isn't worth using until the 16x16 byte point.

According to the table in this presentation it might be around the 8x8 byte point that Karatsuba starts being the right choice - I'm assuming 190 bits is to be considered about 6 words of 32 bits. Fortunately, perhaps, Toom-Cook isn't worth using until the 16x16 byte point.

That's neat!Rich Talbot-Watkins wrote:Just to add, we'realreadydoing a 9-bit index operation here, by addressing with (table+A),Y! Essentially the index lookup is doing the addition with carry for us!

### Re: Border tracing Mandelbrot generator for the Beeb

Just wondering if you're still thinking about giving this a go? There are more (and faster) second processors in play than ever before, and we need something to show them off!Rich Talbot-Watkins wrote:By the way, any 2nd processor version would most likely only double the speed (entirely due to the double clock speed of the parasite) - if I could find a way for both CPUs to perform calculations in parallel, this would improve things a bit more, although I expect the synchronisation would probably hit performance.

(The PiTubeDirect has a game-friendly and historically appropriate setting for 3MHz operation, which lines up with a cheesewedge, and the Matchbox has a 4MHz operation which lines up with the Master Turbo. The PiTubeDirect next release will most likely have a selectable speed from 1MHz to 24MHz or full-on.)

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

One day I'd love to. But not any time real soon - I'm going through a bit of a "busy period" in work right now...BigEd wrote:Just wondering if you're still thinking about giving this a go? There are more (and faster) second processors in play than ever before, and we need something to show them off!

### Re: Border tracing Mandelbrot generator for the Beeb

Fair enough!

### Re: Border tracing Mandelbrot generator for the Beeb

Dave [hoglet] and I just had a bit of a chat about Mandelbrot speedups as a consequence of the 12-second Mandelbrot thread... which now contains a very interesting idea using 13 bit arithmetic and a big table of squares. (I wonder if perhaps the next step on those lines is 24 bit numbers, treated as two 12-bit digits?)

We were talking about ways of using a coprocessor. (It seems Rich is mostly interested in historical machinery, meaning the 3MHz or 4MHz 6502 second processors, which is fair enough, but it might be interesting to make a program which can also get advantage from a much faster Pi-based copro.)

There are two interesting questions, I think: one is how to get code started which cooperates over two CPUs, and the other is how to divide the labour between the two.

Second question first: we can send individual multiplications (or squarings) to the other side, as discussed above. We can then load-balance the multiplications between the two sides according to how fast they make progress. Or we can send a coordinate and an iteration count, like BBC BASIC's MANDEL command, and have the other side do a lot of work in one go - we get less parallelism, but also spend less time moving requests and results around.

As to how to run code, there are probably a few different ways. When a Beeb has a second processor, that's the default place that code will run, for example with *RUN. In the case of Elite and Tube Life, the code on the second processor then uploads some user interface code byte by byte to the host and set its running (in the case of Elite?) or hooks it into the VDU driver (in the case of Tube Life.)

But we had another idea: if the program finds itself running on the second processor, it can *RUN a helper program which has been saved with host-side load and run addresses, and that program will then run on the host. We think it likely that the parasite-side code can also continue running. (If the main program finds itself running on the host, then there is no enabled second processor and it must do all the work itself.)

This second approach might be easier to write and easier to get working. The main program could act as the computational slave to the helper program, which runs the user interface and decides which coordinates need to be iterated on.

We were talking about ways of using a coprocessor. (It seems Rich is mostly interested in historical machinery, meaning the 3MHz or 4MHz 6502 second processors, which is fair enough, but it might be interesting to make a program which can also get advantage from a much faster Pi-based copro.)

There are two interesting questions, I think: one is how to get code started which cooperates over two CPUs, and the other is how to divide the labour between the two.

Second question first: we can send individual multiplications (or squarings) to the other side, as discussed above. We can then load-balance the multiplications between the two sides according to how fast they make progress. Or we can send a coordinate and an iteration count, like BBC BASIC's MANDEL command, and have the other side do a lot of work in one go - we get less parallelism, but also spend less time moving requests and results around.

As to how to run code, there are probably a few different ways. When a Beeb has a second processor, that's the default place that code will run, for example with *RUN. In the case of Elite and Tube Life, the code on the second processor then uploads some user interface code byte by byte to the host and set its running (in the case of Elite?) or hooks it into the VDU driver (in the case of Tube Life.)

But we had another idea: if the program finds itself running on the second processor, it can *RUN a helper program which has been saved with host-side load and run addresses, and that program will then run on the host. We think it likely that the parasite-side code can also continue running. (If the main program finds itself running on the host, then there is no enabled second processor and it must do all the work itself.)

This second approach might be easier to write and easier to get working. The main program could act as the computational slave to the helper program, which runs the user interface and decides which coordinates need to be iterated on.

- Rich Talbot-Watkins
**Posts:**1365**Joined:**Thu Jan 13, 2005 5:20 pm**Location:**Palma, Mallorca-
**Contact:**

### Re: Border tracing Mandelbrot generator for the Beeb

Not at all, I'd totally love to see a 200MHz (or whatever) Pi-based copro blitz through a Mandelbrot render on a Beeb! This being the case, you wouldn't even want to try to share the computation work - just have the co-pro doing the calculations, and host doing the plotting (via a custom WRCH implementation). I suspect the host might even still be the bottleneck in this case!

In the case of 3MHz or 4MHz co-pros, the parasite would still be the bottleneck, and would probably yield a ~1.5-2x speedup (essentially just from the higher clock). Alternatively, your parallel approach would work well, in which the parasite calculates zre' (with two multiplies) while the host calculates zim' (with 1 multiply), for maybe a ~2-3x speedup.

In the case of 3MHz or 4MHz co-pros, the parasite would still be the bottleneck, and would probably yield a ~1.5-2x speedup (essentially just from the higher clock). Alternatively, your parallel approach would work well, in which the parasite calculates zre' (with two multiplies) while the host calculates zim' (with 1 multiply), for maybe a ~2-3x speedup.