12-second Mandelbrot rendering on the BBC Master!

Discuss all aspects of programming here. From 8-bit through to modern architectures.
dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Tue May 29, 2018 8:30 pm

Thanks for aligning the tables I'm slightly surprised it didn't make more of a difference .

tiny change from :

Code: Select all


.recurse

    ; Start recursion.

    lda midx: pha
    lda midy: pha

    ; Calculate centre point.
    clc
    lda boxx1
    adc boxx2 ; produces 9-bit result in C:A
    ror A     ; 9-bit right shift
    sta midx
    cmp boxx1
    beq box_too_small
 
    clc
    lda boxy1
    adc boxy2 ; produces 9-bit result in C:A
    ror A     ; 9-bit right shift
    sta midy
    cmp boxy1
    beq box_too_small midy
to :

Code: Select all

.recurse

    ; Start recursion.

    ; Calculate centre point.
    
    clc
    lda boxx1
    adc boxx2 ; produces 9-bit result in C:A
    ror A     ; 9-bit right shift
    
    cmp boxx1
    beq box_too_small_x
    ldx midx:phx
    sta midx
    
    clc
    lda boxy1
    adc boxy2 ; produces 9-bit result in C:A
    ror A     ; 9-bit right shift
    
    cmp boxy1
    beq box_too_small_y 
    ldy midy: phy
    sta midy
    
 ;.....
 
.box_too_small    
    pla: sta midy
.box_too_small_y
    pla: sta midx
.box_too_small_x    
    rts    
the above might have errors, but you get the idea.

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Tue May 29, 2018 8:49 pm

From :

Code: Select all

    lda iterations
    and #7
    tax
    lda palette, X
    tax

    ; Plot colour A to the current pixel.

    sta temp

    ; Unshifted values refer to the *left* hand pixel, so odd pixels
    ; need adjusting.
    lda screenx     ; Is this an even pixel?
    ror A           ; odd/even bit to C
    lda #&55        ; a = pixel mask
    bcc plot_even_pixel
    lsr temp
    asl A
.plot_even_pixel
    and (screenptr)
    ora temp
    sta (screenptr)

    ; pixel colour in A on entry
    txa
.dont_calculate:
    cmp corecolour
    bne different_colours
    rts
.different_colours
    stz colourflag
    rts
to

Code: Select all

    lda iterations
    and #7
    tax
    lda palette, X
  
    ; pixel colour in A on entry
    
    cmp corecolour
    beq   same_colours
    stz colourflag
.same_colours

    ; Plot colour A to the current pixel.

    sta temp

    ; Unshifted values refer to the *left* hand pixel, so odd pixels
    ; need adjusting.
    lda screenx     ; Is this an even pixel?
    ror A           ; odd/even bit to C
    lda #&55        ; a = pixel mask
    bcc plot_even_pixel
    lsr temp
    asl A
.plot_even_pixel
    and (screenptr)
    ora temp
    sta (screenptr)
    rts
    
 .dont_calculate:
    cmp corecolour
    bne different_colours
    rts
.different_colours
    stz colourflag
    rts   
    

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Tue May 29, 2018 8:54 pm

is it worth making vline and hline macros ?

the clc in hline isn't needed.

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Tue May 29, 2018 9:25 pm

Sorry, my suggestions are only very small improvements.

User avatar
hoglet
Posts: 7608
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hoglet » Tue May 29, 2018 9:28 pm

Fantastic progress guys.

User avatar
hjalfi
Posts: 121
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Tue May 29, 2018 9:41 pm

Ooh, nice.

The calculate simplifications: 2407 -> 2411.
The midpoint simplifications: 2411 -> 2418.
Removing that one flipping clc: 2418 -> 2421!

Can you tell which one of these is in the hotspot?

calculate's annoying because it has to return the colour in A, which means that even though putting the don't-calculate case at the top makes the code quite a lot more logical, the do-calculate case has to preserve the colour across the plot routine. But I need to save the colour in ZP anyway so I can shift it in case we're on an odd pixel (and I finally realise why Acorn chose that weird pixel format). The plot routine offends me; there must be a better way.

I don't think inlining hline and vline themselves would help much --- the routines themselves aren't hotspots (although the loops inside them are). They'd probably lead to a small but useful 5 or so, but I think the code complexity would outweigh the benefit. I'll keep it in mind, though.

But inlining at least the cached-only part of calculate does help, as that's called a huge number of times, and pushes the speed all the way up to 2498 px/s.
Last edited by hjalfi on Tue May 29, 2018 9:46 pm, edited 1 time in total.
David Given
http://cowlark.com

User avatar
hjalfi
Posts: 121
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Tue May 29, 2018 9:45 pm

...hey, everything helps! And a suggestion which nudges me into thinking about something in a different way that leads to a bigger improvement (like inlining the top of calculate) is really useful, so thanks!

I'm keeping track of who suggested what so I know who to blame^H^H^H^H^Hcredit, BTW. Once it looks like it's done I'll write it up. Although I thought it was done last week, so.
David Given
http://cowlark.com

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Tue May 29, 2018 10:02 pm

we have broken Julia.

User avatar
Rich Talbot-Watkins
Posts: 1370
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by Rich Talbot-Watkins » Tue May 29, 2018 10:12 pm

hjalfi wrote:
Tue May 29, 2018 9:41 pm
The plot routine offends me; there must be a better way.
If you're only plotting pixels onto black (or can contrive colour 0 to mean "unplotted" and colour 8 to mean "black, plotted"), you don't need the AND to mask the screen data. If you had enough zp to put some of that code there, you could replace the LDA/(shift)/AND/ORA/STA with a simple LDA/(shift)/TSB (where the TSB operand is the zp screen address pointer)!

User avatar
hoglet
Posts: 7608
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hoglet » Wed May 30, 2018 12:45 pm

I just ran my 6502 profiler on the initial rendering:
mandel.instr.txt
(33.67 KiB) Downloaded 16 times
Here's a log of the BeebASM output, for reference:
mandel.log
(310.2 KiB) Downloaded 14 times
One thing you can get out of the profiling is the % of time a branch is taken, which might allow you to save a cycle or two by swapping the two cases over.

Dave

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Wed May 30, 2018 1:06 pm

Thanks Dave,

Looking at the output I think the numbers look a bit strange at places just before labels e.g &209E

User avatar
hoglet
Posts: 7608
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hoglet » Wed May 30, 2018 1:15 pm

Looking at the branch stats for the first fixup_a:

Code: Select all

     0026   09 80      ORA #&80
     0028   89 40      BIT #&40
     002A   F0 02      BEQ &002E
     002C   29 7F      AND #&7F
You can see the branch is taken only 37% of the time.

Code: Select all

0026 :   173776 cycles (  0.714725%)    86888 ins (2.00 cpi) *****************
0028 :   173776 cycles (  0.714725%)    86888 ins (2.00 cpi) *****************
002a :   206093 cycles (  0.847641%)    86888 ins (2.37 cpi) ********************
002c :   109142 cycles (  0.448891%)    54571 ins (2.00 cpi) **********
This takes:
- 7 cycles 37% of the time
- 8 cycles 63% of the time
Which averages at 7.63 cycles.

If you invert the test:

Code: Select all

AND #&7F
BIT #&40
BNE &002E
ORA #&80
This takes:
- 7 cycles 63% of the time
- 8 cycles 37% of the time
Which averages at 7.37 cycles.

A whopping 0.26 cycles saved, which I think corresponds to 11.3ms.

Dave
Last edited by hoglet on Wed May 30, 2018 4:57 pm, edited 1 time in total.

User avatar
hoglet
Posts: 7608
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hoglet » Wed May 30, 2018 1:15 pm

dp11 wrote:
Wed May 30, 2018 1:06 pm
Looking at the output I think the numbers look a bit strange at places just before labels e.g &209E
OK, I'll look into that.

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Wed May 30, 2018 1:23 pm

I have some of those Branch cases lined up but your profile does confirm my ideas. Thanks.

User avatar
hoglet
Posts: 7608
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hoglet » Wed May 30, 2018 1:51 pm

dp11 wrote:
Wed May 30, 2018 1:06 pm
Looking at the output I think the numbers look a bit strange at places just before labels e.g &209E

Code: Select all

     2097   A5 80      LDA &80
     2099   6A         ROR A
     209A   B2 7E      LDA (&7E)
     209C   90 01      BCC &209F
     209E   0A         ASL A
     209F   29 AA      AND #&AA

2097 :    21003 cycles (  0.086383%)     7001 ins (3.00 cpi) **
2099 :    14002 cycles (  0.057589%)     7001 ins (2.00 cpi) *
209a :    35005 cycles (  0.143972%)     7001 ins (5.00 cpi) ***
209c :    16190 cycles (  0.066588%)     7001 ins (2.31 cpi) *
209e :     9626 cycles (  0.039591%)     4813 ins (2.00 cpi) 
209f :    14002 cycles (  0.057589%)     7001 ins (2.00 cpi) *
I can't see anything unexpected at 209E.

The branch at 209C is taken 31% of the time, so the ASL A is executed 69% of the time, and 4813/7001 gives 68.7% which is consistent.

What have I missed?

Dave

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Wed May 30, 2018 2:00 pm

Sorry, You are right, I miss read the code

steve3000
Posts: 1895
Joined: Sun Nov 25, 2012 12:43 am
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by steve3000 » Wed May 30, 2018 4:48 pm

I've just tried the latest version and it is truly amazing work! I remember waiting with eager anticipation for 20+ minutes for one of these to draw back in the day.

One recommendation - can you clear the keyboard buffer at the end of a frame? The current version (well the one at the top of the thread, running under jsbeeb) is very sensitive to key presses and only replays them at the end of a frame...so a few auto-repeats on a cursor key, and you're locked of of input for several frames as it draws, moves, draws, moves...etc.

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Wed May 30, 2018 6:15 pm

Could the palette table be extended so the and #7 isn't required before accessing the palette table ?

Can .box be tweaked to:

Code: Select all


    pla: sta boxy1
    plx: stx boxx2

    ; Recurse into bottom right.

    ldx boxx1: phx
    ;lda boxy1: ; A now already has boxy1
    pha

I think floodfill will now fit straight after .box and the JMP BPL can be optimised

The stx screenx in floodfill I don't think actually does anything?

.vline can be optimised to :

Code: Select all

.vline
{
    calculate_through_cache

    ; Move to the next vertical pixel.

    inc screenptr+0
    beq inchighbyte
   
.noinc
    lda screeny
    inc A
    sta screeny
    and #7
    beq nextchar
    
    dec sidecount
    bne vline
    rts
    
.inchighbyte
    inc screenptr+1
    bra ininc
    
.nextchar

    clc
    lda screenptr+0
    adc #lo(640-8)
    sta screenptr+0
    lda screenptr+1
    adc #hi(640-8)
    sta screenptr+1

    dec sidecount
    bne vline
    rts
}

User avatar
hjalfi
Posts: 121
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Wed May 30, 2018 6:21 pm

@rtw: d'oh! Obvious in hindsight. Sadly, tsb only has zp and abs addressing modes, so I still need an ora (screenptr); sta (screenptr), but that's a nice 15 px/s.

@steve3000: yes, good idea! Um, 0 px/s? But it's friendlier to use. (At some point all that cursor code should be moved to machine code for speed, but realistically I probably won't.)

@hoglet: the profiler chart's really cool, and very useful. I identified not just the branch you mentioned (which actually made a measurable 7 px/s difference) but another one. Total: 14 px/s.

We're now up to 2527 px/s.

And I should remind people that the really cool bit of this program isn't mine, it's reenigne's --- I just wrapped an app round it...

@dp11: I don't see anything obviously wrong with Julia rendering, except for the precision errors close up. Which are actually quite striking, but hmm. The only show up in Julia rendering. I *think* they're accumulated errors caused by adding very small c values (because we don't get accumulated carries from the lost lower-precision bits), and that they're not showing up in Mandelbrots because c is always different and they got lost in the fuzz.
small.gif
small.gif (15.85 KiB) Viewed 425 times
David Given
http://cowlark.com

User avatar
Rich Talbot-Watkins
Posts: 1370
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by Rich Talbot-Watkins » Wed May 30, 2018 6:38 pm

hjalfi wrote:
Wed May 30, 2018 6:21 pm
@rtw: d'oh! Obvious in hindsight. Sadly, tsb only has zp and abs addressing modes, so I still need an ora (screenptr); sta (screenptr), but that's a nice 15 px/s.
Hence my other suggestion to find a way to move the routine (or some of it) into the zp so that the operand of the TSB abs can actually be screenptr/screenptr+1 - it'd save 5 glorious cycles, and also be a great excuse to use a TSB for something which isn't setting ACCCON. Not sure how much space you have left at this stage though.

User avatar
hjalfi
Posts: 121
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Wed May 30, 2018 7:21 pm

@dp11: re palette, oh, that's cunning! Plus, because the entire iteration->palette mapping is predetermined, I can avoid using black for drawing stripes --- this makes the images look much crisper. Weirdly, though, it doesn't make a measurable difference in speed, which I don't really understand as it's several instructions fewer.

Re branch flips: yup, that works. That (plus another couple of places with the same optimisation) gets about 10. So we're at 2538 px/s now.
David Given
http://cowlark.com

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Wed May 30, 2018 7:33 pm

well I'm baffled why the palette change doesn't make any speed difference.

User avatar
hjalfi
Posts: 121
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Wed May 30, 2018 8:19 pm

@rtw: re tsb: oh, hmm. Missed that. That's very neat.

I have three spare bytes of zero page in the legal storage area. That's enough to fit a tsb <abs> and a rts...

A jmp abs is three cycles, so jumping there and back wouldn't be worth it. So this would only be worth doing if I didn't need an explicit return (and ideally not need an explicit call either). The obvious thing is to try and get the entire tail of the recalculate_pixel routine from go onwards, because I'm already bra-ing to go so jmp-ing to a go in zero page will only cost one extra cycle.

If I move my application variables up to &a8, stomping over the transient command and file system workspace, that means I get &a0-&73 = 29 bytes to extend the kernel (if I stomp on top of Econet). That should be more than enough.

...time passes...

Yup, works fine. 2538 -> 2572 px/s. That's very nice. Thanks!

Sudden horrible thought: are jsbeeb and b-em cycle accurate? Surely they are?
David Given
http://cowlark.com

User avatar
BigEd
Posts: 2140
Joined: Sun Jan 24, 2010 10:24 am
Location: West
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by BigEd » Wed May 30, 2018 8:21 pm

yes, jsbeeb is cycle accurate.

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Wed May 30, 2018 8:30 pm

can you page align the palette table such that when the palette table is read it is ldx absolute ; where the low byte of absolute is the iterations byte?

User avatar
hjalfi
Posts: 121
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Wed May 30, 2018 8:41 pm

@dp11: yes, yes I can. Sadly it's only 4 px/s --- that bit of code's only called at most 128*256 times --- but it's really neat. Thanks. 2576 px/s.
David Given
http://cowlark.com

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Wed May 30, 2018 9:50 pm

Can't give much but a better floodfill if nothing else it is smaller

Code: Select all



; Fill the current box with corecolour, which is corrupted.
.floodfill
{
; Hacky temporary storage, reusing zr and zi in the kernel.
boxx1i = zr+0
boxy1i = zr+1
boxx2i = zi+0
boxy2i = zi+1

    ; Compute pixel colour.

    lda corecolour
    lsr A
    ora corecolour
    sta corecolour

    ; The margins of the box are already drawn. We can use this to avoid
    ; the (expensive) cost of having to draw stray pixels on the left and
    ; right, at the expense of a (very cheap) overdraw.

    lda boxx2
    bit #1
    bne right_margin_odd
    dec A
.right_margin_odd
    sta boxx2i
    
    lda boxx1
    bit #1
    beq left_margin_even
    inc A
.left_margin_even
    sta boxx1i

    ; Check that our box is not empty.

    cmp boxx2i
    bcs exit

    ; Don't redraw top and bottom (this is easy).

    lda boxy2
    dec A
    sta boxy2i

    
    lda boxy1
    inc A
    sta boxy1i
    
    ; Check that our box is not empty.
    
    cmp boxy2i
    bcs exit
    sta screeny
    
    ; Calculate length of line.

    sec
    lda boxx2i
    sbc boxx1i
    lsr A ; to bytes
    
    sta boxx2i
    
.yloop
    ldx boxx1i
    ldy screeny
    calculate_screen_address

    ldx boxx2i

    ldy corecolour
    clc
.xloop
    tya
    sta (screenptr)

    lda screenptr+0
    adc #8
    sta screenptr+0
    bcs inchighbyte
.continue_xloop
    dex
    bpl xloop

    inc screeny
    lda boxy2i
    cmp screeny
    bcs yloop
.exit
    rts
    
.inchighbyte
    inc screenptr+1
    clc
    bra continue_xloop
}    
    

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Fri Jun 01, 2018 4:46 pm

Just when we thought fixup was optimal. As all the important fixups are in ZP we can pull another trick

Code: Select all

macro fixup_a_zp
{
STA	lowbyteoftable  ; 3 cycles
lowbyteoftable = *+1 
LDA  fixuptable;4 cycle 
}

align &100
.fixuptable  
;....


Fixup is now always 7 cycles and 5 bytes long,

The and #&AA can move in macro calculate_through_cache

Code: Select all

; Lazily renders the current point, leaving the colour in A.
macro calculate_through_cache
    ; Pick colour from screenx/screeny (calculate_screen_address must have been
    ; called) into A.

    lda screenx
    ror A ; odd/even bit to C
    lda (screenptr)
    ; Unshifted values refer to the *left* hand pixel, so odd pixels
    ; need adjusting.
    bcc pick_even_pixel
    asl A
.pick_even_pixel
    bmi dont_calculate
    jsr recalculate_pixel
    bra exit

.dont_calculate
    and #&AA	
    ; This pixel is cached, so just check the colour and exit.
    cmp corecolour
    beq exit
    stz colourflag
.exit
endmacro

User avatar
hjalfi
Posts: 121
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Fri Jun 01, 2018 6:20 pm

Yes! PRECOMPUTE ALL THE THINGS! Self-modifying fixups yields 10 px/s.

Moving the and in pick gives 4px/s.

...then I realised that tax; lda fixup_table,X is actually only six cycles, and can be used outside zero page. Unfortunately it consumes X which means there's one occurrence in the kernel which still needs the self-modifying code, but it's still another 18 px/s. (Which is suspiciously much, TBH. It's only one cycle, after all. I'm not convinced the timings are accurate.)

Microoptimising the floodfill yields 5px/s (the bulk of which is the clc move).

So we're now at 2613px/s, or 12.7 seconds for a render. Bear in mind that my original toy, which I was so proud of, took 12.1 seconds to render half of a smaller image at half the precision...
David Given
http://cowlark.com

dp11
Posts: 841
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Fri Jun 01, 2018 6:39 pm

If I can count correctly that is 37px so over 1% gain.

In flood fill very minor but

Code: Select all

    lda screeny
    inc A
    sta screeny
can become

Code: Select all

    inc screeny

Post Reply