12-second Mandelbrot rendering on the BBC Master!

Discuss all aspects of programming here. From 8-bit through to modern architectures.
dp11
Posts: 834
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Fri Jun 01, 2018 7:25 pm

how much more space do you think we can get in ZP ?

Can we get recalculate_pixel at the beginning of kernel but make the beginning self modifying by copying two different code sets one for Julia and one for mandel at first call . saves an LDA, BNE and JMP

User avatar
hjalfi
Posts: 119
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Fri Jun 01, 2018 8:47 pm

Sadly, that bit of code needs to leave screeny in A for a comparison. But I did find out that inc zp; lda zp is one cycle cheaper than lda zp; inc zp; sta zp, so I've changed that throughout. I can't give you a number, though, because I also reworked the floodfill routine completely --- it's now only used if the box is 32 pixels or fewer, because that allows the X loop to use a displacement which will fit into a byte, and so an indexed indirect store.

It's all pretty marginal, though. Everything added together is 10px/s, so the score is 2623.
David Given
http://cowlark.com

dp11
Posts: 834
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Fri Jun 01, 2018 8:57 pm

I think

inc zp;5
lda zp;3

is the same as

lda zp;3
inc a;2
sta zp; 3

But does save code space which is good.

User avatar
Rich Talbot-Watkins
Posts: 1352
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by Rich Talbot-Watkins » Fri Jun 01, 2018 9:41 pm

I think at this stage, the OS interrupt routine is going to be adding noise to the timing results. I'd be inclined to go to 100% assembler, and shut out the OS, so you could use all the zp and have complete control over IRQ handling, but short of that, you could try reducing the OS IRQ footprint. Try OSBYTE 16,0 for a start to turn off ADC conversions. There might be other things you can disable too - I think the 100Hz timer interrupt does a whole load of stuff, but not sure if you can selectively turn off some things or not.

User avatar
hjalfi
Posts: 119
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Fri Jun 01, 2018 9:54 pm

Gah, misread my instruction timing chart. Yes, you're right, sta zp is 3 cycles. That explains why I didn't get any measurable change! (I do most of this stuff last thing at night before I go to bed --- can you tell?)

Re moving recalculate_pixel into the kernel: there's 26 bytes spare, which is loads, so that's quite feasible. It's not quite a hot path, but it's certainly worth doing.

...time passes...

Okay, it's actually a bit problematic --- the trouble is that there would be two kernels, one for Mandelbrot sets and one for Julia sets; I'd load the appropriate kernel once on startup and then, as you say, there'd be no setup cost during the actual execution. Unfortunately all the inlined variables would move depending on which kernel was loaded, and the rest of the program relies heavily on these (in particular, screenptr, which needs to be in the kernel for rtw's neat tsb trick). I can't think of a solution which allows the kernel to load at the same address, these variables to be at the same address, and not have padding inside the kernel and either nops or a branch. Any ideas?

I did try to use a pointer to the calculation routine, but it would have saved a single cycle for Julia sets only and was pretty gnarly, so I took it out again.
David Given
http://cowlark.com

dp11
Posts: 834
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Fri Jun 01, 2018 10:26 pm

I was thinking just one kernel, but two setups. Mandel setup is the longest so that is one block followed by kernel. julia is assembled with a bra to kernel at the end of it. then just swap block over at the start.

User avatar
hjalfi
Posts: 119
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Fri Jun 01, 2018 10:35 pm

@rtw: huh. That never even occurred to me!

Turning off the ADC converter got about 77px/s. Interestingly, now jsbeeb and b-em are reporting the same times. I think jsbeeb might be skimping on the ADC emulation...

So that's 2699 px/s, and a 12.1 second render. Wow. I think you're topping the leaderboard.

I tried masking the 50Hz timer in the system VIA, which gave me another 12, but then Julia sets rendered funny. What is this I don't even.

Masking the 100Hz timer is obviously the next step, but would require manually polling the keyboard and keeping track of elapsed time myself. That's not very hard. I'll have a look tomorrow, but I'm actually off on a trip on Sunday and won't have a chance to work on this again until I get back, by which time I'm sure I'll have been distracted by something else shiny.
David Given
http://cowlark.com

User avatar
Rich Talbot-Watkins
Posts: 1352
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by Rich Talbot-Watkins » Sat Jun 02, 2018 9:06 am

hjalfi wrote:
Fri Jun 01, 2018 10:35 pm
Turning off the ADC converter got about 77px/s. Interestingly, now jsbeeb and b-em are reporting the same times. I think jsbeeb might be skimping on the ADC emulation...
Memory fails me now, but I think it's the other way round. We had a timing test where both B-Em and jsbeeb were out compared to real hardware, and fixing the ADC emulation in jsbeeb brought it into line. BigEd might remember the details.
hjalfi wrote:
Fri Jun 01, 2018 10:35 pm
I tried masking the 50Hz timer in the system VIA, which gave me another 12, but then Julia sets rendered funny. What is this I don't even.
:shock:
hjalfi wrote:
Fri Jun 01, 2018 10:35 pm
Masking the 100Hz timer is obviously the next step, but would require manually polling the keyboard and keeping track of elapsed time myself. That's not very hard. I'll have a look tomorrow, but I'm actually off on a trip on Sunday and won't have a chance to work on this again until I get back, by which time I'm sure I'll have been distracted by something else shiny.
You'd probably be best keeping keyboard IRQs on, so you only scanned for the keys you're interested in when one is pressed, and I've never done that before, so not really sure how it works. Either that or only perform reads every n pixels, but that might feel a bit unresponsive.

User avatar
hjalfi
Posts: 119
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by hjalfi » Sat Jun 02, 2018 10:16 am

Turns out that running the entire computation with interrupts off and polling the system VIA is very easy and quite effective. And I don't need the event handler any more --- I can just test bit 1 in the system VIA IFR and it'll tell me whether there's a pending keypress.

That's 2699 -> 2732 px/s, although the period being timed is now very slightly shorter.

I'm polling the 100Hz clock bit in the IFR every time I attempt to read a pixel, which is the only place which is called often enough to be sure of getting all clock ticks. I did try the 50Hz vsync interrupt, as it happens half as often and so should have less overhead, but the extra cost of doing this:

Code: Select all

lda #vsync_irq
bit sheila+system_via+ifr
beq noclock
vs this:

Code: Select all

bit sheila+system_via+ifr
bvc noclock
...ate up the savings.

Obviously there'd be more savings if I could poll the system VIA less often, but I mustn't lose an interrupt or my timings will be wrong. It may be faster to register my own interrupt handler... the path from the ROM to IRQV1 is actually quite cheap.
David Given
http://cowlark.com

dp11
Posts: 834
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Sat Jun 02, 2018 6:10 pm

just a bit of tidying up really in .box

Code: Select all

    lda #&ff
    sta colourflag
    
I think the lda can be removed


before JSR hline there is this bit of code

Code: Select all

    sec
    lda boxx2
    sbc boxx1
    sta sidecount
    
This can be moved into hline. similar for vline. bne hline will need to be changed to bne hline_loop. I do wonder if hline can be unwound to do two pixels one after another. saving

Code: Select all

    lda screenx
    ror A
    bcs next
or some how use the same result which is hiding in calculate_through_cache

in .recurse we can make better use of X and Y

Code: Select all

.recurse
    clc
    lda boxx1
    adc boxx2 ; produces 9-bit result in C:A
    ror A     ; 9-bit right shift
    cmp boxx1
    beq box_too_small_x

    ldx midx
    phx
    sta midx

    clc
    lda boxy1
    adc boxy2 ; produces 9-bit result in C:A
    ror A     ; 9-bit right shift
    cmp boxy1
    beq box_too_small_y

    ldy midy
    phy
    sta midy

    ; Recurse into top left.

    ldy boxx2: phy
    ldy boxy2: phy

    ldy midx
    sty boxx2
    ;lda midy  *****no longer needed due to using y above**
    sta boxy2
    jsr box

    pla: sta boxy2
    ;pla: sta boxx2 --- immediately pushed back

    ; Recurse into bottom left.

    ;lda boxx2: pha
    lda boxy1: pha
    
    ;lda midx --- already in boxx2.
    ;sta boxx2
    lda midy
    sta boxy1
    jsr box

    pla: sta boxy1
    plx: stx boxx2

    ; Recurse into bottom right.

    ldx boxx1: phx
    ;lda boxy1 ; already in A from previous pull
    pha

    lda midx
    sta boxx1
    lda midy
    sta boxy1
    jsr box

    pla: sta boxy1
    ;pla: sta boxx1 --- immediately pushed back

    ; Recurse into top right.

    ;lda boxx1: pha
    lda boxy2: pha

    ;lda midx --- midx already in boxx1.
    ;sta boxx1
    lda midy
    sta boxy2
    jsr box

    pla: sta boxy2
    pla: sta boxx1

    pla: sta midy
.box_too_small_y
    pla: sta midx
.box_too_small_x

    rts
}
in the flood fill move

Code: Select all

    ; Compute pixel colour.

    lda corecolour
    lsr A
    ora corecolour
    sta corecolour
to just before yloop. no point in doing it if we might exit.

dp11
Posts: 834
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Sat Jun 02, 2018 8:17 pm

The final txa in kernel isn't used in hline or vline (14000ish times) it is used in box (1751 times) so it might be worth in box after calculate_through_cache testing bit 7 of A is it is clear doing ASL A

dp11
Posts: 834
Joined: Sun Aug 12, 2012 8:47 pm
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by dp11 » Sun Jun 03, 2018 8:18 am

If I'm right this

Code: Select all

bit #1
    beq left_margin_even
    inc A
    
Can be replaced with

Code: Select all

   Inc a
   And #&fe
Saves a few bytes

litwr
Posts: 202
Joined: Sun Jun 12, 2016 8:44 am
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by litwr » Tue Jun 12, 2018 1:12 pm

Yet another very fast Mandelbrot from z80 world - https://www.octoate.de/wp/?s=mavdelbrot - it uses 16 colors.

Coeus
Posts: 1035
Joined: Mon Jul 25, 2016 11:05 am
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by Coeus » Wed Jun 13, 2018 7:50 pm

litwr wrote:
Tue Jun 12, 2018 1:12 pm
Yet another very fast Mandelbrot from z80 world - https://www.octoate.de/wp/?s=mavdelbrot - it uses 16 colors.
Ok, so seeing this is for an Amstrad CPC, and as Thomas Harte's ClockSignal emulates a CPC I thought I'd give it a go but never having owned a real one I am stuck as to how one runs the programs from the disk. Any ideas?

litwr
Posts: 202
Joined: Sun Jun 12, 2016 8:44 am
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by litwr » Wed Jun 13, 2018 8:21 pm

Coeus wrote:
Wed Jun 13, 2018 7:50 pm
Ok, so seeing this is for an Amstrad CPC, and as Thomas Harte's ClockSignal emulates a CPC I thought I'd give it a go but never having owned a real one I am stuck as to how one runs the programs from the disk. Any ideas?
1. unzip and get mavdelbrot.dsk;
2. attach the disk image to your emulator;
3. type cat to get directory listing;
4. type run"FILENAME - you can omit the final dot, e.g. run"mavnormc
There are 4 variants of the program on the disk.

litwr
Posts: 202
Joined: Sun Jun 12, 2016 8:44 am
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by litwr » Thu Jun 14, 2018 9:42 am

If we assume that z80 is about 2.5 slower than 6502 at the same СPU frequency then we get that BBC Micro is about 60% faster than Amstad CPC which true CPU frequency is close to 3.2 MHz
So 30 seconds for CPC are roughly equal to 18-20 secs for BBC Micro.

Coeus
Posts: 1035
Joined: Mon Jul 25, 2016 11:05 am
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by Coeus » Mon Jun 18, 2018 11:36 am

litwr wrote:
Thu Jun 14, 2018 9:42 am
If we assume that z80 is about 2.5 slower than 6502 at the same СPU frequency then we get that BBC Micro is about 60% faster than Amstad CPC which true CPU frequency is close to 3.2 MHz
So 30 seconds for CPC are roughly equal to 18-20 secs for BBC Micro.
On the question of the Z80 requiring more clock cycles to perform operations than the 6502 does, I came cross The Z-80 has a 4-bit ALU. Here's how it works. Perhaps that's at the core of the difference.

User avatar
BigEd
Posts: 2099
Joined: Sun Jan 24, 2010 10:24 am
Location: West
Contact:

Re: 12-second Mandelbrot rendering on the BBC Master!

Post by BigEd » Mon Jun 18, 2018 11:43 am

That's a great article Coeus, and one in a very interesting series looking at the internals of the Z80.

But I believe the main reason the Z80 takes more clock cycles than the 6502 is that it takes several cycles to access memory, where the 6502 takes one. The Z80 goes through more internal states for each memory access - which is why it can get away with a nibble-sized ALU, because it has several ticks to deliver a result.

One of the peculiarities of the Z80 is that the memory system needs to be fast enough to perform an access in three ticks, but many memory accesses take four ticks. So there's some inefficiency in that.

Of course, the Z80 is a later, larger, and more complex CPU, with much more register state and some much more complex instructions, which helps to bring back the performance ratio.

One of the big ideas behind ARM's superior performance is the observation that CPU performance varies mostly according to memory efficiency: if you can usefully hit memory as fast as memory will allow, you'll be doing pretty well, and if you leave memory idle some of the time you'll be doing much less well. ARM does especially well because it makes use of page mode for sequential accesses.

Post Reply