So here's alpha 2.8: https://github.com/ZornsLemma/ozmoo/rel ... -alpha-2.8
The main change is that when using a second processor, spare RAM in the host is now used as a second level of cache to store game data. Main RAM (between host OSHWM/"PAGE" and the bottom of screen RAM) and sideways RAM can both be used, up to a total of (just under) 128K, in addition to the second processor's own RAM.
Using b-em to emulate a Master 128 with a 3MHz second processor and drive noises on (the best I can do to simulate a real floppy), the benchmark completes 32% faster with the cache than without it.
I'm pretty chuffed with this. Up until now it hasn't always been a clear win to play a game using a second processor - yes, you (almost always) have a faster CPU and that's helpful, but you also only have about 47K of RAM for the game so there might be a lot more disc access. On a Master 128 the CPU isn't as fast, but you get about 79K for game data. With this change, a Master 128 with second processor will have about 125K for game data and still benefit from the faster CPU in the second processor. [The numbers here are approximately correct but estimated, don't take them too seriously.]
The only other change is that I am now using a crude bit of crunching code in make-acorn.py to crunch the BASIC loader code by shortening procedure and variable names. This saves about 1.5K - not huge, but it all helps to speed loading, and it may ease the tension I feel when working on the loader code between my natural verbosity and the fact that it's an interpreted language where those long names actively take up space.
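For the curious, the idea is roughly the following Python sketch. This is not the actual code in make-acorn.py (which has to be rather more careful about BASIC keywords, string literals, clashes with existing names and so on), and the example program line is made up:

[code]
import itertools, re, string

def crunch_names(source):
    """Shorten PROC/FN names in a BASIC listing (simplified sketch only)."""
    long_names = sorted(set(re.findall(r"\b(?:PROC|FN)(\w{2,})", source)),
                        key=len, reverse=True)
    # Hand out short replacement names: a, b, ..., z, aa, ab, ...
    def short_names():
        for n in itertools.count(1):
            for t in itertools.product(string.ascii_lowercase, repeat=n):
                yield "".join(t)
    new_name = dict(zip(long_names, short_names()))
    for old in long_names:  # process the longest names first
        source = re.sub(r"\b(PROC|FN)" + re.escape(old) + r"\b",
                        lambda m, old=old: m.group(1) + new_name[old], source)
    return source

print(crunch_names("10 PROCinitialise_screen:A%=FNread_game_block(3)"))
# e.g. -> 10 PROCa:A%=FNb(3)
[/code]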
If this causes you problems, you can disable it by specifying the "--no-loader-crunch" option to make-acorn.py. If you need to do this for any reason other than just debugging/tweaking the loader code on your own machine with proper names, that's a bug so please report it.
The fly in the ointment
While it's not doing anything particularly exotic, the tube host cache code is the most "advanced" second processor code I've ever written, since it has to ship 512-byte blocks of data back and forth between the host and the second processor.
I developed it on b-em in Master 128 mode. I've subsequently tested it on various other emulated machines with mixed results:
- On MAME 0.224 emulating an Electron with an AP5 and ReCo6502 second processor, the tube build doesn't work properly. I *am* (I believe) taking account of the different address of the tube hardware in the host on the Electron, as well as the different sideways RAM paging. If anyone can try this on real hardware (any 6502 second processor; I mention ReCo6502 here just because it's the only one that works with the Electron in MAME 0.224) I'd appreciate it. I have vague hopes this is an emulator issue (mentioned in passing here) and when I get my hands on a newer MAME it will work, but I'm not too confident - as I say, this is new territory for me and I've probably done something wrong. Edit: Nigel has sent me a build of the latest MAME (to be officially released at the end of the month) and the new tube cache now works perfectly (well, the benchmark completes successfully, anyway) with all the 6502-based second processors, which is great news all round. Tests on real hardware still greatly appreciated!
- On b-em it seems to work fine on a B/B+/Master *as long as you have a 1770 instead of an 8271*. A model B with DFS 1.2 almost (but not quite) always locks up at exactly the same point right at the end of the benchmark (in mode 3, at least). But BeebEm 4.15 emulating that same model B with DFS 1.2 successfully completes the benchmark every single time. Again, this *might* be an emulator bug but it's more likely to be something I've done wrong. (Depending on the feedback I get from this alpha, I might get in touch with Coeus about this, but I'm reluctant to do that just yet.)
It's this which causes me to only be "pretty chuffed" instead of outright chuffed.
I'd really appreciate any test reports (successful or otherwise) for the new tube cache code. It's used by default, but if it's causing you problems you can disable it by passing the "--no-tube-cache" option to make-acorn.py.
Testing with any game is valuable - in many ways more valuable than running the benchmark - but if you could also try the benchmark I'd appreciate it. This just involves getting hold of a copy of Infocom's "Hollywood Hijinx" and passing "--benchmark" to make-acorn.py when building the game. This causes a built-in walkthrough to be typed automatically and disables the press-SHIFT-to-scroll stuff.
The benchmark is obviously a bit of a spoiler for the game, so you may want to avoid it.
From a correctness point of view the main things are that a) the benchmark completes OK and b) (much less likely to go wrong, but worth watching for) you don't notice any subtle corruption in the text output during the benchmark.
From a performance point of view, the benchmark outputs a timestamp in the form "$xxxxxx" at the start of the run and again at the end. Those are hexadecimal counts in fiftieths of a second; the first shows the time taken for the initial loading and the second includes that and also the time taken to run the game. Those timestamps are only valid if nothing stops the OS timer for long; on an Electron they tend to be useless, on the BBC series I tend to trust them but it's possible I am fooling myself by doing so.
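To save anyone doing the hex arithmetic by hand, here's how a pair of timestamps turns into elapsed times; the "$xxxxxx" values below are entirely made up, not real results:

[code]
def seconds(stamp):
    # "$xxxxxx" is a hexadecimal count of fiftieths of a second
    return int(stamp.lstrip("$"), 16) / 50.0

load_done = seconds("$0004B0")    # printed at the start of the run (made-up value)
run_done  = seconds("$014C08")    # printed at the end of the run (made-up value)
print(f"initial load: {load_done:.1f}s, load + game: {run_done:.1f}s, "
      f"game alone: {run_done - load_done:.1f}s")
[/code]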
If anyone feels like giving the code a look over to see if I'm doing anything wrong, the host-side cache code is here. The tube-using code is bracketed by "jsr claim_tube" and "jsr release_tube".
How it works
This section is just me waffling about the technical details of the implementation; if that floats your boat, great, but if you just want to play adventure games you should stop reading now.
There's a new binary CACHE2P which is run by the loader. It loads and executes in the host, loading high but relocating itself down to host OSHWM to get the maximum amount of free RAM for the cache. It installs itself on USERV and claims OSBYTE &88 and OSWORD &E0.
OSBYTE &88 is used by the Ozmoo executable to initialise the cache and query its size. It needs to know the size so it can preload additional blocks of game data into the host cache as well as the second processor's own memory. Doing this preload (I hope!) correctly and efficiently is actually the biggest part of the change needed to support the cache, as I wanted to a) keep the drive head moving smoothly in one direction only during the load and b) put preload blocks with younger timestamps in the second processor's own memory and older ones in the host cache.
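Stripped of all the 6502 detail, the preload decision amounts to something like this Python sketch; the block numbers, timestamps and capacities are made up, and the real code obviously works in terms of Z-machine virtual memory blocks and sideways RAM banks rather than Python lists:

[code]
def plan_preload(blocks, local_slots, host_cache_slots):
    """Decide where each preloaded block should end up (simplified sketch).

    'blocks' is a list of (disc_block, timestamp) pairs; a larger timestamp
    means "wanted more recently". We still read in disc order, so the drive
    head sweeps smoothly in one direction, but the youngest blocks are
    destined for the second processor and older ones for the host cache.
    """
    by_age = sorted(blocks, key=lambda b: b[1], reverse=True)
    local = {blk for blk, _ in by_age[:local_slots]}
    host = {blk for blk, _ in by_age[local_slots:local_slots + host_cache_slots]}
    plan = []
    for blk, _ in sorted(blocks):             # disc order, one smooth pass
        if blk in local:
            plan.append((blk, "second processor"))
        elif blk in host:
            plan.append((blk, "host cache"))
    return plan

# Made-up example: 6 candidate blocks, 3 local slots, 2 host cache slots.
print(plan_preload([(10, 5), (11, 1), (12, 6), (13, 2), (14, 4), (15, 3)], 3, 2))
[/code]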
OSWORD &E0 allows code running on the second processor to say two related things:
- (optional) I have block n at address a which I'm about to overwrite, you can copy it first if you want.
- I'd like to have block m at address a; can you do that for me?
Ignoring the preload support (which is nice but not essential), there are only about 30-40 lines of code added to the Ozmoo code to use the cache. Whenever the virtual memory subsystem is about to go to disc, we make an OSWORD &E0 call to hand over the existing game block we're about to overwrite and to ask if it has the block we're trying to load from disc. If it has the block, we're done, otherwise we just fall through to the existing code to load from disc.
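In Python-flavoured pseudocode (the real thing is 6502, and the actual OSWORD &E0 parameter block layout isn't shown here - the function and field names below are made up for illustration), those extra lines boil down to something like:

[code]
def fetch_game_block(wanted, slot, host_cache_swap, read_from_disc):
    """Roughly what Ozmoo's extra ~30-40 lines do on a virtual memory miss.

    host_cache_swap(offer, offer_data, want) stands in for the OSWORD &E0
    call: we offer the block we're about to overwrite and ask for the one
    we want; it returns the data on a hit, or None on a miss.
    """
    data = host_cache_swap(offer=slot["block"], offer_data=slot["data"],
                           want=wanted)
    if data is None:
        # Cache miss: fall through to the existing load-from-disc code.
        data = read_from_disc(wanted)
    slot["block"], slot["data"] = wanted, data
    return slot
[/code]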
The implementation of OSWORD &E0 is relatively simple. It could potentially be improved, but a fair chunk of the CPU time it takes goes on copying blocks of data to and fro across the tube, and I don't think that can be significantly improved, so any performance penalty from the naive caching algorithm is relatively small. It's also the case that any cache hit avoids going to the disc for the data, so even if the cache performance is worse than it could be, it's still much better than not having it. (This assumes an actual floppy drive. Solid-state storage may be so fast there's hardly any benefit from having the cache; if anyone feels inclined to play around timing this on real hardware, that would be interesting.)
I originally thought the cache OSWORD would go and fetch the blocks from disc itself if they weren't in the cache, but luckily I spent a bit of time mulling it over before charging ahead with the implementation and realised it would be a lot less awkward (less code duplication and/or rearrangement) to just leave the disc access in the main Ozmoo binary and make the cache a purely RAM-based affair.
The host cache contains a completely disjoint set of blocks from those held in the Ozmoo virtual memory cache in the second processor. When Ozmoo hands a block over to the cache, it's doing so because it wants that virtual memory slot for a different block. When the cache hands a block over to Ozmoo, it deletes its local copy: it knows Ozmoo in the second processor now has it, that it has the youngest possible timestamp there and won't be evicted any time soon, and that when it *is* evicted it will be handed back and reinserted into the host cache with the youngest possible timestamp. This means that in the host cache, blocks don't get their timestamps updated because they've been accessed - as soon as they're accessed they are evicted from the cache anyway. The timestamps are just used to decide which block to discard if (as is likely) the cache is full when Ozmoo hands a new block over.
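As a rough model (in Python rather than 6502, with made-up names and a Python dict standing in for the real cache data structures), the host side behaves something like this:

[code]
import itertools

class HostCacheSketch:
    """Simplified model of the host cache's behaviour (not the real code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}                 # block number -> (timestamp, data)
        self.clock = itertools.count()   # ever-increasing timestamp source

    def swap(self, offer, offer_data, want):
        # If we hold the wanted block, hand it over and forget our copy -
        # Ozmoo now holds it, implicitly younger than anything left here.
        found = self.blocks.pop(want, None)
        # Take the block Ozmoo is discarding, evicting our oldest-timestamped
        # block if the cache is full; timestamps exist only for this choice.
        if offer is not None and self.capacity > 0:
            if len(self.blocks) >= self.capacity:
                victim = min(self.blocks, key=lambda b: self.blocks[b][0])
                del self.blocks[victim]
            self.blocks[offer] = (next(self.clock), offer_data)
        return None if found is None else found[1]
[/code]

(Plugging an instance's swap method in as host_cache_swap in the earlier fetch sketch ties the two halves together.)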
Except during the preload, the timestamps in the two caches are not synchronised in any way. The way blocks move back and forth means there's no need for this; blocks in the Ozmoo cache in the second processor are all implicitly newer than blocks in the host cache.