A mathematical demo and the request for help from the hardware owners

litwr
Posts: 198
Joined: Sun Jun 12, 2016 8:44 am

Post by litwr » Fri Jul 08, 2016 9:09 am

I have a semi-scientific research project: I am trying to measure the maximum speed each system can reach on a small mathematical algorithm.
http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html
This program contains a very fast 6502 division routine which is still unpublished. The whole code aims to be the fastest possible, so if somebody can find a way to make it faster then IMHO that will be close to a scientific achievement.
I have to use emulators. :( They show good accuracy for the base systems and acceptable accuracy for the 6502 in the TUBE. However, the accuracy for the Z80 in the TUBE is very poor. :( B-em is more than 100% faster than the expected speed. Beebem is more accurate here but it is also much faster than hardware. The program for the Master 512 80186 TUBE crashes both emulators.
So please help to get results for the Z80 and 80186 systems. I would be very grateful for any help with the real iron.
BTW I also have a bigger multi-platform project - http://litwr2.atspace.eu/xlife/retro/xlife8.html
I want to make a BBC Micro port too. The IBM PC port may be used with the Master 512 but the absence of ANSI.SYS makes some key functions unreachable. :(
EDIT. Thanks to jgharston the 80186 code has become useful. :D Pipack is updated to version 11. However the speed ratio for Master 512 for Beebem and B-em is 4:1! Beebem is much faster and B-em looks slightly slower than hardware. So only the real iron may show the proper results...
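
For readers who have not met the algorithm, the benchmark computes π with a spigot method. The sketch below is a Python rendering of the well-known compact spigot that emits four decimal digits per pass. It is assumed, not confirmed, to match the structure of the timed code (an array initialised to 2000 and a division by 10000 are both mentioned later in this thread). The name `pi_spigot` is illustrative only.

```python
def pi_spigot(digits):
    """Return the first `digits` decimal digits of pi as a string.

    `digits` should be a multiple of 4; 14 array terms are used per
    group of 4 digits, as in the classic compact spigot program.
    """
    n = digits // 4 * 14
    r = [2000] * (n + 1)      # remainder array, index 0 unused
    carry = 0                 # low 4 digits held back from the previous pass
    out = []
    for k in range(n, 0, -14):
        d = 0
        for i in range(k, 0, -1):
            d += r[i] * 10000
            g = 2 * i - 1     # odd denominator for this term
            r[i] = d % g
            d //= g
            if i > 1:
                d *= i - 1    # numerator carried to the next term
        out.append("%04d" % (carry + d // 10000))
        carry = d % 10000
    return "".join(out)


print(pi_spigot(16))
```

The inner loop is dominated by one division per term, which is why a fast division routine matters so much for the timings discussed here.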

BigEd
Posts: 2091
Joined: Sun Jan 24, 2010 10:24 am
Location: West

Post by BigEd » Sat Jul 09, 2016 8:04 am

I'm very much in favour of computing Pi, of a fast division for the 6502, and spigot algorithms - so this gets my attention in several ways!

Have you considered also posting to the forums at 6502.org? It should get a good reception!

Post by litwr » Sat Jul 09, 2016 6:16 pm

Thank you very much! :D
I converted and slightly optimized the division from Z80 code which was written by an 8080 expert. So if you want to see a *purer* form of the algorithm then try the examples for the Z80.
I've just reposted the message to 6502.org but my primary aim is to get results from hardware which doesn't have a proper emulator.
EDIT. The Z80 in the Tube may be the fastest Z80 system because, unlike the Amstrad, Spectrum or MSX, it doesn't require any wait states.

tricky
Posts: 2761
Joined: Tue Jun 21, 2011 8:25 am

Post by tricky » Sun Jul 10, 2016 11:24 am

I was having a quick look at the code to see what could be optimised and I only spotted two before I got sidetracked analysing your division routines and tables.

If you count down instead of up to clear the r array, you could save nearly 800 cycles, and removing the ldy at the start of 6502-div6.s would save another 1000, which won't change the time at all.

As I said, I got sidetracked disassembling and going through your code, but couldn't see a use for &381c..div32x16w, it looks unreachable ;)

EDIT: off for lunch, then back to Phoenix maybe :oops:

Post by litwr » Sun Jul 10, 2016 7:13 pm

Thank you! But... :wink: It is not so easy. LDY sets ZF. However, you helped me to improve the code. I am going to replace LDY divisor+1 with CPY #0. It saves 1 cycle. So thank you very much! :) I am going to make PIPACK v12 soon. The 6502 version should be 0.15% faster! :D
tricky wrote:If you count down instead of up to clear the r array, you could save nearly 800 cycles
Sorry, I do not understand your point. Do you mean filling the r array with 2000? It is not timed because r may be filled statically at compile time. So this code is not important for the speed, but I'm ready to use a shorter version if somebody provides it.
tricky wrote:couldn't see a use for &381c..div32x16w, it looks unreachable
Sorry again. Div32x16w is the "slow" division to work with the timer result (it is not used with Basic versions) and to make 4 digits for the print. At &381c (if org = &2100) is the fast code of division of 32-bit dividend by 255.
BTW What is a Phoenix?

richardtoohey
Posts: 3590
Joined: Thu Dec 29, 2011 5:13 am
Location: Tauranga, New Zealand

Post by richardtoohey » Sun Jul 10, 2016 10:02 pm

litwr wrote:BTW What is a Phoenix?
Phoenix is Richard's BBC version of a game: viewtopic.php?f=1&t=8416 We're waiting for it but he keeps getting distracted! :wink: Are we paying him enough? :?: :lol:

Post by litwr » Mon Jul 11, 2016 7:31 am

Thanks. :D BTW I am still looking for opportunities to get more results. This is the link to the 6502.org thread - http://forum.6502.org/viewtopic.php?f=2 ... 265#p46265.
EDIT. CPY #0 is not suitable - it sets CY. :(

Post by tricky » Mon Jul 11, 2016 5:36 pm

Sorry about the LDY, I don't know why I didn't see that!
litwr wrote:Div32x16w is the "slow" division to work with the timer result ... At &381c (if org = &2100) is the fast code of division of 32-bit dividend by 255.
The version that I ran from your .ssd was using Div32x16w, not the code from &381c to Div32x16w, I guess I was running the wrong version!

Code: Select all

23CC JSR div32x16w ; c + d/10000, AC = dividend+3
When I get some spare time, I'll have a proper look, but it all looked OK to me.

I doubt that it would spot anything, but I keep meaning to add this to an editor.
http://web.archive.org/web/200107210645 ... rc/opt65.c

EDIT: just put it through that optimiser and all it did was to incorrectly remove the clearing loop at the top and the LDY [-X

jgharston
Posts: 3211
Joined: Thu Sep 24, 2009 11:22 am
Location: Whitby/Sheffield

Post by jgharston » Mon Jul 11, 2016 7:50 pm

litwr wrote:EDIT. CPY #0 is not suitable - it sets CY. :(
If you don't need the contents of A then TYA sets EQ and MI without changing Cy. If you do need the contents of A then DEY:INY sets EQ and MI without changing Cy.
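
The flag behaviour being discussed can be modelled in a few lines. This is a minimal Python sketch of the documented 6502 rules for the N, Z and C flags only. The function names are made up for illustration and nothing here is taken from the pipack sources.

```python
def _nz(value):
    """N and Z flags for an 8-bit result."""
    value &= 0xFF
    return {"Z": value == 0, "N": bool(value & 0x80)}


def ldy(operand, carry):
    """LDY: N and Z follow the loaded value, C is left untouched."""
    flags = _nz(operand)
    flags["C"] = carry
    return flags


def tya(y, carry):
    """TYA: N and Z follow Y (A is overwritten), C is left untouched."""
    flags = _nz(y)
    flags["C"] = carry
    return flags


def dey_iny(y, carry):
    """DEY followed by INY: Y is restored, N and Z follow Y, C untouched."""
    y = (y - 1) & 0xFF
    y = (y + 1) & 0xFF
    flags = _nz(y)
    flags["C"] = carry
    return flags


def cpy(y, operand, _carry):
    """CPY: N and Z follow Y - operand, C is set when Y >= operand."""
    flags = _nz(y - operand)
    flags["C"] = (y & 0xFF) >= (operand & 0xFF)
    return flags


# With an operand of 0 the comparison Y >= 0 always holds, so CPY #0
# always leaves the carry set, whatever it was before.
print(cpy(0x00, 0, False), cpy(0x7F, 0, False))
```

So CPY #0 gives the same N and Z as the other options but overwrites the carry, which is the problem reported above.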

Post by tricky » Mon Jul 11, 2016 10:58 pm

Unfortunately, A is needed and it is to replace a 3 cycle instruction :(

Post by litwr » Wed Jul 13, 2016 6:04 pm

Thank you very much. I feel like I'm among old friends. :) However, this program is very tricky and improving it may take at least a week of work...
Thanks to your help and participation I could solve the problem with CY and change "LDY zp" to "CPY #0". :) I also got inspiration from the peephole optimizer provided by tricky and wrote a branch optimizer. This optimizer counts the branches which cross a page boundary (and so take 1 more cycle) and tries to find the offset which minimizes this number. This gave a 0.3% speed-up. Together with CPY it gives 0.4%! :) The branch optimizer is part of the just-released pipack-12.
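
The idea behind such a branch optimizer can be sketched as follows. This is not the tool shipped in pipack-12, only a minimal Python illustration under stated assumptions: the whole program is shifted by a candidate offset, every branch is counted equally (no weighting by how often it executes), and the branch list and the names `crossings` and `best_origin` are hypothetical. On the 6502 a taken branch costs one extra cycle when its target lies in a different 256-byte page from the instruction following the branch.

```python
def crossings(branches, origin):
    """Count branches whose target is on another page than the next instruction.

    `branches` holds (branch_offset, target_offset) pairs relative to the
    start of the code; a relative branch instruction is 2 bytes long.
    """
    count = 0
    for branch_off, target_off in branches:
        next_pc = origin + branch_off + 2
        target = origin + target_off
        if (next_pc >> 8) != (target >> 8):
            count += 1
    return count


def best_origin(branches, base, span=256):
    """Try every load address in [base, base + span) and keep the best one."""
    return min(range(base, base + span), key=lambda o: crossings(branches, o))


# Hypothetical example: one forward branch near a page end, one short loop.
branches = [(0xFA, 0x104), (0x10, 0x00)]
print(crossings(branches, 0x2100), hex(best_origin(branches, 0x2100)))
```

A real tool would probably weight each branch by how often it is taken, since a crossing inside the inner loop costs far more than one in setup code.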
I've just found that Beebem is about 1.5% faster than B-em. So this raises a new question: which emulator is more accurate? It may be settled by calculating π to 1000 and/or 3000 digits. Please help with the base Beeb hardware... [-o< :?:

Post by BigEd » Wed Jul 13, 2016 7:44 pm

On my Master and your v12 program (from the ssd in the tar file on your site) 1000 digits reports taking
177.74
(using my serial port for I/O) and then
176.23
running it more conventionally from the keyboard. And same again for a second run.

(I'd expect JSBeeb to be a very accurate model, but it doesn't seem to be for this program, so I'll raise a bug!)

Post by tricky » Thu Jul 14, 2016 10:33 am

In my experience, I would say that jsbeeb is slightly more accurate than b-em, which is more accurate than beebem, but the opposite order for features/hacking and all have different very useful debugging features.

Post by litwr » Thu Jul 14, 2016 7:29 pm

Many thanks to BigEd! :) The table has just been updated. However, more data may remove the ≈ signs. :wink:
IMHO the exactness of the emulators depends on the availability of cycle-exact software. Commodore, MSX and Amstrad CPC computers have a large amount of such software, but the BBC Micro and Amstrad PCW almost completely lack it. So the emulator writers do not have the data to test against.
I should again express my admiration for the BBC Micro. :) Using serial lines to control it looks like working with a remote mainframe... I used an IBM 360 this way in my student days.

Post by BigEd » Fri Jul 15, 2016 11:27 am

Just to note: with a 6502 copro on a Beeb or Master, you can run Hibasic instead of Basic, to get a much higher HIMEM. At which point your program says it can calculate 5264 digits.

I see you've asked before about paged RAM - on a Master that should give you 16k extra space in your machine code program, so again that's some extra digits. Once paged in, and with SHADOW video, you have contiguous RAM up to &BFFF so no need to keep flipping pages or anything like that.

Any interest in a benchmark on a contemporary Pi-based copro, which runs 50x faster?? (The Pi can also act as an ARM copro, which might be of interest if and when you have an ARM version of your spigot program. There's an emulated ARM which is ARM1 or ARM2-like, I think, and a native ARM which is full gigahertz speed but is a slightly different CPU. You might need two versions of code.)

Post by litwr » Sat Jul 16, 2016 6:18 pm

The aim of my research is to get the maximum speed. The number of digits is secondary and unimportant, just an illustration.
However, I've just made several checks. I used a Master 128 with the commands
*SHADOW
P.~HIMEM, ~PAGE
All modes 0 give 8000 and E00
The system with the second 6502 gives 8000 and 800 always. So I can make a version for PAGE=800 and this will give 216 more digits.
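
One way to arrive at the figure of 216 is sketched below in Python. It rests on assumptions that the post does not state: that the program keeps 14 array terms for every 4 digits (as the classic spigot does) and that each term occupies 2 bytes. The function name `extra_digits` and its parameters are illustrative only.

```python
def extra_digits(extra_bytes, bytes_per_term=2, terms_per_group=14,
                 digits_per_group=4):
    """Digits gained from extra RAM, in whole groups of digits."""
    groups = extra_bytes // (bytes_per_term * terms_per_group)
    return groups * digits_per_group


# Lowering PAGE from &E00 to &800 frees &600 = 1536 bytes.
print(extra_digits(0xE00 - 0x800))  # 216
```

Under these assumptions each digit costs 7 bytes, so 1536 bytes hold 54 complete groups of 4 digits.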
I have a Raspberry Pi :D but I am using it with the Pi2go project. I want to have 72 hours in a day... The idea of a Pi in the BBC Tube is fantastic! But I limit my research to hardware of the 80s and mid-90s. If somebody provides me with results of such tests then I will have to make another table for similar results... I am curious which is faster: an 80286 @12MHz or an ARM1 @8MHz? I hope that this curiosity is not only mine. :wink:
tricky wrote:In my experience, I would say that jsbeeb is slightly more accurate than b-em, which is more accurate than beebem, but the opposite order for features/hacking and all have different very useful debugging features.
I agree, but all the emulators show a somewhat surprising inaccuracy. :(

Post by BigEd » Sat Jul 16, 2016 6:35 pm

OK, I understand the Pi copro isn't historically accurate!

To see 48k RAM in the Master 128, which you can't do in your Basic program but you can do within the machine code program, you would have to page in a RAM bank instead of Basic.

To see more than 32k in the 6502 second processor, you need to load HiBasic instead of Basic.

As you say, the number of digits is not primary, so this might be more effort than is justified.

I've checked JSBeeb against my hardware, with some tweaks to make the program more deterministic: the Model B emulation is cycle-exact, but the Master emulation is definitely adrift, perhaps by 10%. I'll help to get that fixed - I've already raised the issue on github.

Post by litwr » Sat Jul 16, 2016 7:19 pm

Do the Model B and Master run at different speeds? :shock: They use the same CPU, video system, ...
I know one way to make the pi-spigot program a bit faster: replace the IRQ handler with a simple one containing only the timer counter. I did this for the C+4 version. But I am not sure how the Beeb would function without the proper handler. Would discs work? And it is not easy for a BBC Micro newbie...

Post by BigEd » Sat Jul 16, 2016 7:25 pm

Both machines run at 2MHz, but they have different CPUs and different versions of the OS. So, for example, printing a character to the screen might take a different number of ticks on a Master compared to a Beeb. They also do have different versions of Basic, so the execution of line 80 where you CALL the machine code could also take a different number of ticks. And the ISR will be different too, most likely.

But they are certainly broadly equal (unless running intensive Basic programs in which case you'll see different timing behaviour there.)

Substituting a simplified ISR is certainly possible, but very expert territory - I personally wouldn't dig that deep!

Post by litwr » Mon Jul 18, 2016 4:14 pm

BigEd wrote: so the execution of line 80 where you CALL the machine code could also take a different number of ticks.
This is negligible compared with the total number of ticks. IMHO only the ISR can make this difference.
BTW I've just made a program for the ARM Evaluation System. :D I had to improve the division published in the ARM assembler reference manual. The results show that for the π-spigot the ARM1 at 8 MHz is close to a 68000 @10MHz, an 8088 @14MHz, or an 80286 @3.2 MHz. The absence of hardware division and multiplication on the ARM1 causes this. I'm not sure about the accuracy of the emulators for the ARM Evaluation System. It is sad, but I am almost sure that getting results from the hardware is almost impossible. :cry: :(
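
On a CPU without a divide instruction, division has to be built from shifts, compares and subtracts. The Python sketch below shows the generic textbook restoring (shift-and-subtract) method, one quotient bit per step. It is not the routine from the ARM manual and not the improved version mentioned above; the name `udiv_shift_subtract` is illustrative.

```python
def udiv_shift_subtract(dividend, divisor, bits=32):
    """Unsigned division producing one quotient bit per iteration."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(bits - 1, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(udiv_shift_subtract(1000003, 10000))  # (100, 3)
```

With 32 iterations per division, the cost of this loop dominates a spigot inner loop, which is why a faster variant pays off so directly.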

1024MAK
Posts: 7868
Joined: Mon Apr 18, 2011 4:46 pm
Location: Looking forward to summer in Somerset, UK...

Post by 1024MAK » Mon Jul 18, 2016 5:15 pm

The problem is, the most common second processors are 6502 variants. There are not that many members here with a real Z80 second processor.

There should be some members with Master 512 machines. Not sure why none of them has answered your call for assistance.

Mark

Post by BigEd » Mon Jul 18, 2016 10:11 pm

I'm very puzzled by the roughly 10% speed discrepancy between my Master 128 and the Master emulation of JSBeeb. If anyone else could run on a plain Master and compare against this result that would be helpful:
The Master calculates 100 digits in 1.67 and 3000 digits in 1434.65 (as reported by the program itself) or computing 80 digits takes 1.18 on my Master, 1.07 on JSBeeb.

Post by richardtoohey » Tue Jul 19, 2016 7:13 am

1024MAK wrote:There are not that many members here with a real Z80 second processor.

There should be some members with Master 512 machines. Not sure why none of them has answered your call for assistance.

Mark
In my case both machines are in the HUGE pile of TODO projects ... :oops: I was going to put my hand up and get involved, but I've done that lots of times and then got distracted so was keeping mum this time. :-#

hoglet
Posts: 7495
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol

Post by hoglet » Tue Jul 19, 2016 7:22 am

BigEd wrote:I'm very puzzled by the roughly 10% speed discrepancy between my Master 128 and the Master emulation of JSBeeb. If anyone else could run on a plain Master and compare against this result that would be helpful:
The Master calculates 100 digits in 1.67 and 3000 digits in 1434.65 (as reported by the program itself) or computing 80 digits takes 1.18 on my Master, 1.07 on JSBeeb.
I'll give this a go today. I'll also test it on the Master version of Beeb FPGA.

Dave

Post by BigEd » Tue Jul 19, 2016 9:23 am

Thanks Dave. To my embarrassment and increasing confusion, late last night the Master was agreeing with JSBeeb. I think there might be at least a couple of things going on, including some minor infelicities in JSBeeb and some more major variation in what the Master does when an IRQ comes along. (Can we rule out NMIs on an idle machine?)

In the github issue conversation, we're invited to switch to TAPE, to switch off ADC conversions, and to squash all non-timer interrupts at source.

Code: Select all

*TAPE
*FX16,0
LDA #&7F:STA &FE4E:STA &FE6E
LDA #&C0:STA &FE4E
Do ROMs get involved in servicing timer interrupts? In that case, more ROMs would mean more time.
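
The two pairs of stores above rely on how a 6522 VIA treats writes to its interrupt enable register: bit 7 of the written value selects whether the other seven bits are set or cleared. On the BBC Micro, &FE4E and &FE6E are the enable registers of the System and User VIAs, and bit 6 is Timer 1. The Python sketch below models just that write rule. The starting value of &7F is a made-up example, not a measured machine state.

```python
TIMER1 = 0x40  # bit 6 of the 6522 interrupt enable register


def write_ier(ier, value):
    """Return the new enable mask after writing `value` to a 6522 IER."""
    bits = value & 0x7F
    if value & 0x80:
        return ier | bits          # bit 7 set: enable the selected sources
    return ier & ~bits & 0x7F      # bit 7 clear: disable the selected sources


system_ier = 0x7F                             # hypothetical: everything enabled
system_ier = write_ier(system_ier, 0x7F)      # LDA #&7F:STA &FE4E clears all
system_ier = write_ier(system_ier, 0xC0)      # LDA #&C0:STA &FE4E enables Timer 1
print(hex(system_ier))
```

After the sequence only the System VIA's Timer 1 interrupt remains enabled, which is the timer the benchmark needs to keep counting.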

Post by hoglet » Tue Jul 19, 2016 10:53 am

Hi Ed,

Some quick figures, from PI1000, using mode 7 and 1000 digits.

Master 128 - 160.51s

Master 128 with *FX 16,0 - 158.88s

Beeb FPGA Master 128 - 160.54s

Beeb FPGA Master 128 with *FX 16,0 - 158.82s

*TAPE didn't seem to make any difference.

Dave
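
For scale, the effect of *FX 16,0 in the figures above works out to roughly 1%. A small Python calculation using only the four timings quoted in this post (the helper name `saving_percent` is illustrative):

```python
def saving_percent(before, after):
    """Relative time saved, as a percentage of the original time."""
    return (before - after) / before * 100.0


# Seconds for 1000 digits in mode 7: (default, with *FX 16,0)
runs = {
    "Master 128": (160.51, 158.88),
    "Beeb FPGA Master 128": (160.54, 158.82),
}

for name, (plain, fx16) in runs.items():
    print(f"{name}: {saving_percent(plain, fx16):.2f}% less time with *FX 16,0")
```

The real Master and the FPGA version agree with each other to within a few hundredths of a second in both configurations.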

Post by litwr » Tue Jul 19, 2016 11:37 am

A big thanks to hoglet.
BigEd wrote:On my Master and your v12 program (from the ssd in the tar file on your site) 1000 digits reports taking
177.74
(using my serial port for I/O) and then
176.23
running it more conventionally from the keyboard. And same again for a second run.
What were these numbers?! :shock: I was trying to use them... I do not see any difference in speeds for BBC Micro and Master. They give about 160 sec for 1000 digits.
BTW I've just attached the ARM1 version sources. Could anybody find a way to optimize them further? There are only about 50 lines of code to analyze: the main loop (16 lines) and the division macro (40 lines). It is my first ARM code. The power of ARM is astonishing. Even without hardware division and multiplication it outperforms the 32-bit 68000, which has division and multiplication. So the ARM2 should show up to 3 times better results.
I have to report that DEBUG from the disc #3 of ARM Evaluation System disc set doesn't work for me. :( Did anybody have more success with it?
Attachment: pi-aes-src.zip (2.79 KiB)

Post by BigEd » Tue Jul 19, 2016 12:04 pm

It's possible that my Master is capable of running slow - still haven't understood that. Late last night it was capable of running at nominal speed, compared to JSBeeb.

[Thanks Dave. Just checking against the current JSBeeb Master 128, we have 158.74 as-is, with *TAPE 159.05 (or 174.09 which is a new source of bafflement. Or 158.74... goodness me.)]

Post by litwr » Tue Jul 19, 2016 7:36 pm

The tables are updated and Pipack 13 is ready (http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html). I've added FX16,0 to the startup script. Are there other ways to increase speed? The Beeb's result is only 0.14% slower than C128/NTSC... BTW what is the true CPU frequency of the BBC Micro? The C128 wiki page shows 2 MHz but it is closer to 1.9 MHz.
Last edited by litwr on Tue Jul 19, 2016 8:03 pm, edited 1 time in total.

Post by hoglet » Tue Jul 19, 2016 7:49 pm

litwr wrote: what is the true CPU frequency of the BBC Micro? The C128 wiki page shows 2 MHz but it is closer to 1.9 MHz.
It's 2MHz most of the time, but then drops down to 1MHz when accessing some peripheral devices.

Dave
