I was talking with Kieran last year at the Leicester ABUG about 2nd processor acceleration for games and the limit of what is practicable. The ultimate, I suggested, was to execute code directly from the Tube memory window at &FEE0-&FEFF. This is a proof of concept for that method.
A few back of the envelope calculations and it seemed like it might be plausible to do Bad Apple in a graphics mode.
The machine is a stock B. I'm using a PC as a 'copro' on the Tube via a USB board (the one from the 6502decode thread).
Before anyone cries foul - this is exactly what the Tube port is for... hardware acceleration which Acorn did not conceive of ahead of time. I also added 8K of extra buffer to prevent USB buffer underflow/underruns.
I started out with only the Tube snoop board connected to the Tube. Now you might realise why every address in the Tube window can be read from in comms mode.
Using BeebASM I created a short bit of code which plotted a string on the screen repeatedly - a hello world to see if it would work reliably. Since all bus accesses come from the Tube I padded the instructions with extra bytes for the double reads the 6502 performs (every cycle does a bus access on the NMOS 6502).
This is the actual source I used:
\ Simple example illustrating use of BeebAsm

oswrch = &FFEE
osasci = &FFE3
addr = &70

MACRO PUTS val
    LDA #val
    JSR &FFEE
    NOP \ dead bus cycle
ENDMACRO

ORG &0000 ; code origin (like P%=&2000)

.start
    SEI:NOP
    LDA #1
    STA &FEFF
    JMP &FEE0

FOR n, 1, 512
    PUTS 'c'
    PUTS 'f'
    PUTS 'm'
    PUTS ' '
    JMP &FEE0
    PUTS 'r'
    PUTS 'u'
    PUTS 'l'
    PUTS 'e'
    PUTS 'z'
    JMP &FEE0
    PUTS ' '
    PUTS '0'+(n DIV 100) MOD 10
    PUTS '0'+(n DIV 10) MOD 10
    PUTS '0'+n MOD 10
    PUTS ' '
    JMP &FEE0
    INC &70
    LDA &70
    AND #7
    ORA #&30
    JSR &FFEE
    NOP
    PUTS 10
    PUTS 13
    JMP &FEE0
NEXT
    RTS
\.mytext EQUS "Hello world!", 13, 0
.end

SAVE "test.bin", start, end
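For anyone who wants to see the raw bytes the Beeb ends up fetching, here is a rough Python sketch of what the PUTS macro and the JMP back to the window assemble to. The opcode values are the standard 6502 encodings; the helper names (`puts`, `jmp_window`, `segment`) are made up for illustration and are not part of the tooling described in this post. The trailing NOP is the pad byte the source marks as a dead bus cycle.

```python
# Standard 6502 opcodes used by the test program above.
LDA_IMM = 0xA9
JSR_ABS = 0x20
JMP_ABS = 0x4C
NOP = 0xEA

OSWRCH = 0xFFEE       # same address the source JSRs to
TUBE_WINDOW = 0xFEE0  # target of every JMP in the source


def le16(addr):
    """Split a 16-bit address into little-endian bytes."""
    return bytes([addr & 0xFF, addr >> 8])


def puts(char_code):
    """Bytes for one PUTS: LDA #char, JSR OSWRCH, then the NOP pad byte."""
    return bytes([LDA_IMM, char_code & 0xFF, JSR_ABS]) + le16(OSWRCH) + bytes([NOP])


def jmp_window():
    """Bytes for JMP &FEE0, which pulls the PC back to the start of the window."""
    return bytes([JMP_ABS]) + le16(TUBE_WINDOW)


def segment(text):
    """One run of PUTS calls followed by the jump back, as in the FOR loop."""
    return b"".join(puts(ord(c)) for c in text) + jmp_window()


stream = segment("cfm ")
print(stream.hex(" "))
```

Each segment ends with the jump so the program counter never leaves the Tube window, while the PC-side feed just keeps supplying the next byte.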
It looked like the technique was going to work so I started work on the Bad Apple video...
The bulk of the work was writing the instruction stream compiler. It runs on a PC and I wrote it in C#. The program is task based and each task creates a stream of timed actions. The instruction compiler/packer then tries to create an instruction sequence which keeps looping in the Tube memory area and completes the actions by the time the task has set.
While I was developing/testing the 6502 instruction scheduler I found the USB chip couldn't keep up with the data rate (2MB/s) with its piddly 1K internal buffer. I used the Tube snoop and 6502decode to work out that the buffer was underflowing, so I had to hook up extra hardware to get some more buffering. I used my 2nd processor prototype as it had the necessary memory and already had a level shifted and debugged Tube interface.
Initially I tested the scheduler writing blocks of text to mode 7. Then I moved on to the audio. The scheduler stays in sync with the CPU and counts cycle stretching too. PCM playback of the Bad Apple audio stream sounded best, although it is really quiet.
Once the instruction packer was working it was simple to add the audio. A task just requests actions periodically and leaves it up to the instruction packer to make them happen!
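As a rough illustration of "a task just requests actions periodically", here is a minimal Python sketch of an audio task emitting one timed action per sample. This is not the C# code; the function name, the tuple layout and the exact 22,000 Hz figure are assumptions for the example, and it ignores cycle stretching, which the real scheduler accounts for.

```python
CPU_HZ = 2_000_000  # BBC Micro 6502 clock, ignoring 1 MHz cycle stretching


def audio_actions(sample_rate_hz, samples):
    """Yield (deadline_in_cycles, sample) pairs, one per PCM sample."""
    for i, sample in enumerate(samples):
        # Deriving each deadline from the sample index (not by adding a
        # rounded period) stops rounding errors accumulating over time.
        yield (i * CPU_HZ) // sample_rate_hz, sample


actions = list(audio_actions(22_000, [8, 9, 10, 9]))
print(actions)
```

The task only states when each write has to happen; working out which instructions fit in between is left to the packer.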
Next step was some form of video output. I created a boot loader which syncs with the video field, then always launches the main instruction stream with a known time delay after the vsync on the odd field. I tested this and scan line counting by setting blocks of colour on the screen by altering the palette. The video on the BBC uses the same clock as the processor so perfect cycle counting by my instruction packer means I can issue arbitrary code at arbitrary times... want something 10 scanlines down? Set the action deadline to 10*64us.
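The scanline arithmetic is simple enough to sketch in a few lines of Python. The 64us line time is from the text above and 2 MHz is the Beeb's CPU clock; the function names are just for illustration, and the sketch measures time from wherever the stream is launched and ignores cycle stretching.

```python
LINE_US = 64       # one scanline, as used in the text
CYCLES_PER_US = 2  # 2 MHz 6502


def scanline_deadline_us(lines_down):
    """Microseconds from the reference point to the start of a given scanline."""
    return lines_down * LINE_US


def us_to_cycles(us):
    """Convert a deadline in microseconds to CPU cycles."""
    return us * CYCLES_PER_US


deadline = scanline_deadline_us(10)
print(deadline, us_to_cycles(deadline))
```

So a palette change 10 lines down is an action due 640us, or 1280 cycles, after the reference point.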
After more instruction packer debugging I tried adding the video. First I set a task to dump a frame every few seconds in mode 0. Digital audio playing by now of course. Next 5 fps full frames. Next calculate the difference between adjacent fields... and only send the changes.
By counting microseconds and having a known raster position I schedule the screen updates as actions with an earliest time and a latest time. This 'chases the raster' as the image is composed. It tries to write the next field data just behind the display of the current field. In frames where there is high motion (lots of bytes to stuff) the raster starts outrunning the fill window... but it has to catch and overtake to get a visual artefact. In a low motion portion of the video the byte stuffing catches the raster up again and sits just behind it.
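Here is a hedged Python sketch of the two ideas combined: diff two fields, then give every changed byte an earliest/latest window based on when the raster passes its line. It assumes a simple row-major buffer of 80 bytes per line (the real mode 0 screen memory is laid out in character cells), measures time from the first displayed line, and ignores interlace; all names are invented for the example.

```python
BYTES_PER_ROW = 80  # mode 0: 640 pixels / 8 pixels per byte
LINE_US = 64        # one scanline
FIELD_US = 20_000   # 50 fields per second


def diff_field(prev, cur):
    """Return (line, column, value) for every byte that changed between fields."""
    changes = []
    for i, (old, new) in enumerate(zip(prev, cur)):
        if old != new:
            changes.append((i // BYTES_PER_ROW, i % BYTES_PER_ROW, new))
    return changes


def schedule(changes, field_start_us):
    """Window each change: after the raster has drawn the line in this field,
    and before it reaches the same line again in the next field."""
    actions = []
    for line, column, value in changes:
        earliest = field_start_us + (line + 1) * LINE_US
        latest = field_start_us + FIELD_US + line * LINE_US
        actions.append({"line": line, "column": column, "value": value,
                        "earliest_us": earliest, "latest_us": latest})
    # Raster order keeps the writes just behind the beam.
    actions.sort(key=lambda a: a["earliest_us"])
    return actions


prev = bytes(2 * BYTES_PER_ROW)
cur = bytearray(prev)
cur[81] = 0xFF  # one changed byte on line 1, column 1
print(schedule(diff_field(prev, cur), field_start_us=0))
```

A write that lands anywhere inside its window is invisible; an artefact only appears if the write slips past the latest time.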
This worked unbelievably well the first go and I dumped my list of optimisations I had yet to implement in the bin! (e.g. use iny,dey,inx,dex to save a byte for less looping). A free aspect of this is I have to update the video at 50fps so I get double vertical resolution for free! Hence 640x512 in mode 0.
The audio was running at 22kHz... it runs at 44kHz too but doesn't sound any better on the BBC's limited hardware.... so I saved the cycles.
I tried adding dithering in the video task. This made some portions of the video look significantly better (star field when she's on the broomstick) but some bits worse (some of the shadows). 2x2 ordered dither was the best... I tried 4x4 but that removed too much detail.
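For reference, a minimal Python sketch of 2x2 ordered dithering down to 1 bit per pixel. It uses the standard 2x2 Bayer matrix; the exact matrix and thresholds used for the video are not stated here, so treat those as assumptions.

```python
BAYER_2X2 = [[0, 2],
             [3, 1]]


def dither_2x2(gray):
    """Threshold a 2-D list of 0..255 grey levels to 0/1 with a 2x2 Bayer matrix."""
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, level in enumerate(row):
            # Thresholds sit at 1/8, 3/8, 5/8 and 7/8 of full scale.
            threshold = (BAYER_2X2[y % 2][x % 2] + 0.5) * 256 / 4
            out_row.append(1 if level > threshold else 0)
        out.append(out_row)
    return out


print(dither_2x2([[128, 128], [128, 128]]))
```

With only a 2x2 tile there are five distinct shades (0 to 4 pixels lit per tile), which is why it keeps more detail than a 4x4 pattern at the cost of fewer grey levels.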
There is a problem with the ordered dither and a TV in mode 0 though. Mode 0 dot pitch is higher than PAL... so some of the patterns become a bit flickery. I'll try it on a CUB monitor at some point and see how that looks... it didn't look good on Tricky's TV at ABUG and an LCD TV's temporal filters make a hash of decoding it. The flicker is what anyone who used high res modes on a TV (e.g. Amiga) BITD will have forgotten occurred!
I gave up at that point because it is just a proof of concept.
It is just that easy
Seriously though the smarts is in the instruction scheduler... what you do with that is simple. I could have added copper bars... vertical copper bars!... all sorts... a task just needs to specify an action, timestamp and priority and the scheduler will do it. The scheduler tries to pack similar operations together within the constraints of higher priority actions and the small loop window (JMP &FEE0). If it overshoots it unwinds and runs the high priority action. It pads absolute time actions so they end _on_ the timestamp specified (used for 6522/sound chip) and tries to do as much of the other actions as it can (with the windowed time constraints). So the packer can even do useful work updating the screen in the sound chip 8us write enable time - instead of NOPs. It uses A, X & Y with an LRU policy to do memory writes - which reduces the number of immediate loads (LDA/LDX/LDY #n). As a last resort if it really has nothing to do it pads the stream with NOPs or, more usefully, JMP &FEE0.
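The A/X/Y LRU idea can be sketched in a few lines of Python. This is a toy version under stated assumptions: it only handles a plain list of (address, value) writes and emits mnemonics as text, with no deadlines, priorities or window looping, and the function name is made up.

```python
def pack_writes(writes):
    """Turn (address, value) writes into 6502 mnemonics, reusing A/X/Y values."""
    regs = {"A": None, "X": None, "Y": None}
    lru = ["A", "X", "Y"]  # least recently used first
    out = []
    for addr, value in writes:
        # Reuse a register that already holds the value, if there is one.
        reg = next((r for r in lru if regs[r] == value), None)
        if reg is None:
            reg = lru[0]  # evict the least recently used register
            regs[reg] = value
            out.append(f"LD{reg} #&{value:02X}")
        lru.remove(reg)
        lru.append(reg)  # mark as most recently used
        out.append(f"ST{reg} &{addr:04X}")
    return out


code = pack_writes([(0x3000, 0xFF), (0x3008, 0x00), (0x3010, 0xFF),
                    (0x3018, 0x0F), (0x3020, 0x00), (0x3028, 0x55)])
print("\n".join(code))
```

In this example six writes need only four immediate loads, because &FF and &00 are still sitting in A and X when they come round again.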
I didn't tell Kieran about it ahead of time.. and waited for Sunday for the reveal...
640x512 50 fps Bad Apple on a B with 22kHz digital audio. That's full PAL resolution. LCD TVs make a pig's ear of the dithered shading in mode 0 for some of the patterns but hey, it's a computer from 1981.
It could be improved for CRT - but I don't have one here to test on... it could be improved for LCD too... perhaps an error propagation dither would do better... but as I'd sunk enough time into it I stopped once 2x2 ordered was working.