Cowgol achieves first light!

Cowgol achieves first light!

Postby hjalfi » Fri Aug 18, 2017 9:19 pm

My compiler lives! It lives, I tell you!

So far it only does a bit of 8-bit arithmetic, the 16-bit stuff probably needs throwing away and rewriting, and you have to build it with the bootstrap compiler and run it on a PC, but it will turn (some) programs into code and they run...

https://github.com/davidgiven/cowgol

(Yes, this is why I've been asking questions about doing arithmetic on the 6502.)

It's woefully, woefully unfinished, but it'll turn this:

Code: Select all

sub print_char(char: uint8)
    @bytes 0xAD;   # LDA abs
    @words &char;
    @bytes 0x4C;   # JMP abs
    @words 0xFFE3; # OSASCII
end sub;

sub print(ptr: [int8])
    var index: uint8 := 0;
    loop
        var c: int8 := ptr[index];
        if c == 0 then
            return;
        end if;
        print_char(c);
        index := index + 1;
    end loop;
end sub;

print("Hello, world!\r");


...into this. I've annotated it to explain what's going on:

Code: Select all

L0E00:  lda     $0E47
        jmp     LFFE3
        rts

L0E07:  lda     #$00
        sta     $0E48
L0E0C:  ldy     $0E48 -- values can't be retained in registers across labels
        lda     ($00),y -- the pointer index is a uint8, so we get to use an indexed op here
        sta     $0E49 -- write barrier before the conditional, could be improved
        cmp     #$00 -- it can't remember flag state between instructions, so we check again
        bne     L0E19
        rts
L0E19:  lda     $0E49
        sta     $0E47
        jsr     L0E00
        ldx     $0E48 -- haven't taught it how to increment values in memory yet
        inx
        stx     $0E48
        jmp     L0E0C

        lda     #$38
        sta     $00 -- it noticed that ptr is used as a pointer and put it in zero page
        lda     #$0E
        sta     $01
        jsr     L0E07
        rts


The internal architecture is a memory-memory machine with no stack frames. This means that all variables are static and recursion is forbidden (for now). It walks the call tree and attempts to assign overlapping addresses for variables where it can prove that the two subroutines will never be used at the same time; this is astonishingly effective at reducing memory use (or, possibly, just broken). Pointers get assigned to zero page, all other data goes in main memory.
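
As a rough illustration of the overlaying rule (a hypothetical sketch in C, not Cowgol's actual implementation): with recursion forbidden, two subroutines can only be active at the same time if one of them can reach the other through the call graph, so any pair where neither reaches the other may share variable storage.

Code: Select all

/* Hypothetical sketch of call-graph variable overlaying (illustrative only). */
#include <stdbool.h>

#define MAX_SUBS 64

bool calls[MAX_SUBS][MAX_SUBS];  /* calls[a][b]: subroutine a calls b directly    */
bool reach[MAX_SUBS][MAX_SUBS];  /* reach[a][b]: a can reach b via the call graph */
int  nsubs;

void compute_reachability(void)
{
    /* Transitive closure of the call graph (Floyd-Warshall style). */
    for (int a = 0; a < nsubs; a++)
        for (int b = 0; b < nsubs; b++)
            reach[a][b] = calls[a][b];
    for (int k = 0; k < nsubs; k++)
        for (int a = 0; a < nsubs; a++)
            for (int b = 0; b < nsubs; b++)
                if (reach[a][k] && reach[k][b])
                    reach[a][b] = true;
}

bool may_share_storage(int a, int b)
{
    /* If neither subroutine is ever on the other's call chain, they can never
       be active at the same time, so their static variables can be overlaid
       at the same addresses. */
    return !reach[a][b] && !reach[b][a];
}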

The compiler itself is a seven-stage behemoth. The stages are:

- tokeniser --- takes text and produces a token stream and a string table
- parser --- takes the token stream and produces a stack-based front-end bytecode stream
- typechecker --- enforces the Cowgol type rules on the FE bytecode and emits a memory-based backend bytecode stream
- classifier --- scans the call graph and assigns variables to locations in memory
- codegen --- turns BE bytecode stream into actual 6502 opcodes
- placer --- calculates the size of subroutines, resolves labels, and places them in memory
- emitter --- throws the result into an output file

Eventually I want to get it self-hosting, but I don't know yet how big the binaries are going to be. Building the parser with itself, it looks like it needs 20kB of in-memory storage, plus the binary. That's quite a lot on a machine the size of a BBC Micro. The bytecode stream, of course, doesn't use up memory.

It should be portable to other architectures. Right now it assumes little-endian, unaligned, 16-bit address space. If I ever make this work for the 6502, I'll try the Z80 next.
Attachments
B-em v-2.2_005.png

Re: Cowgol achieves first light!

Postby danielj » Sat Aug 19, 2017 5:38 am

=D> :D
Excellent work! :D

d.

Re: Cowgol achieves first light!

Postby dp11 » Sat Aug 19, 2017 8:03 am

This is excellent work. As you have found, static memory allocation instead of a stack often shows just how little space a program needs. For me the backend is the important part. Peephole optimisation should help quickly: e.g. jsr label / rts becomes jmp label. Value-flow analysis will save lots of extra loads. Automatic inlining of small and single-use functions helps too. Have a look at SDCC for peephole ideas if needed.
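
To make the jsr/rts suggestion concrete, a minimal peephole pass might look something like the sketch below (illustrative C, not Cowgol's or SDCC's code); a real pass would also have to check that the rts isn't itself a branch target before deleting it.

Code: Select all

/* Illustrative peephole rewrite: jsr X : rts  ->  jmp X (tail-call). */
#include <stdint.h>

enum opcode { OP_JSR = 0x20, OP_RTS = 0x60, OP_JMP = 0x4C };

struct insn {
    enum opcode op;
    uint16_t    target;   /* operand of jsr/jmp; unused for rts */
};

/* Rewrites in place and returns the new instruction count. */
int peephole_tailcall(struct insn *code, int n)
{
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n && code[i].op == OP_JSR && code[i + 1].op == OP_RTS) {
            code[out].op = OP_JMP;              /* saves a byte and the stack traffic */
            code[out].target = code[i].target;
            out++;
            i++;                                /* consume the rts as well */
        } else {
            code[out++] = code[i];
        }
    }
    return out;
}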

Re: Cowgol achieves first light!

Postby tricky » Sat Aug 19, 2017 10:08 am

Great start, well done.

When you are looking at peephole optimisations, it might be worth comparing your output with that produced by Daniel Dallmann's (aka Poldi) 6502 optimiser (attached). I'm not saying you should use it, although I guess you could; I'm just saying it gives you a minimum target to hit ;)
Attachments
6502PeepHoleOptimiser.zip

Re: Cowgol achieves first light!

Postby hjalfi » Sat Aug 19, 2017 5:16 pm

A peephole optimiser would be relatively straightforward to add --- it'd go between the codegen phase and the placer phase (so it still has access to labels). However, the output of the codegen phase doesn't preserve instruction boundaries; it looks like this:

Code: Select all

000a: 07 BYTES 0xa9 0xff
0015: 07 BYTES 0xa2 0x01
0020: 07 BYTES 0xa0 0x00
002b: 07 BYTES 0x8d
0036: 08 ADDRESS thing_id=0x07c1 offset=0x0000
003c: 07 BYTES 0xa9
0047: 0b ADDRESSLO thing_id=0x021f offset=0x0000
004d: 07 BYTES 0x85
0058: 0b ADDRESSLO thing_id=0x03c3 offset=0x0000
005e: 07 BYTES 0xa9
0069: 0c ADDRESSHI thing_id=0x021f offset=0x0000
(and so on, and so on)


So either the optimiser would need to reconstruct the instructions, or else I'd need to change the internal representation so that each output iop was a 6502 instruction. Not hard, but a thing for later.

The big win, though, is better variable memory assignment. Because cowgol's based around a memory machine, it does lots of copies, which is expensive on the 6502 (well, everything is expensive on the 6502). This:

Code: Select all

sub print_with_newline(string: [int8])
   print(string);
   print_newline();
end sub;


...turns into this:

Code: Select all

        ldy     #$01
LFFD5:  lda     $0E61,y
        sta     $00,y -- the code generator actually produced sta ($00), y here; that's a bug
        dey
        bmi     LFFD5
        jsr     L0E03
        jsr     L0E30
LFFE3:  rts


(Although I notice I'm using a loop --- because I'm copying to zero page here, the unrolled version is exactly the same size, so I might as well unroll it...)

If I could assign the parameter of print_with_newline() to the same memory location as the parameter of print(), then I wouldn't need the copy at all. That'd be a really big win and would save huge amounts of space. But figuring out when it's safe to do this is essentially a global register allocation problem, and if you go look up intractable NP-hard problems in a computer science text, they'll frequently say "an example of such a problem is global register allocation"... to do a good job you'd need to split variable live ranges and, probably, go into SSA form, and I really don't think a 6502-grade machine has enough memory.

Anyway, right now I'm more concentrating on making it produce correct code. I've already found a tonne of bugs. Next stop: proper unit tests. And a build system. Ugh.

(Incidentally, if I ever get this self-hosting, it'd be an interesting exercise to write a C frontend for it.)

Re: Cowgol achieves first light!

Postby BigEd » Sun Aug 20, 2017 7:03 pm

Well done!

Re: Cowgol achieves first light!

Postby Lardo Boffin » Sun Aug 20, 2017 7:07 pm

Looks awesome already. I've always wanted to do something similar, but time is not my friend at the moment (talent questions aside).
BBC model B 32k issue 4, 16k sideways RAM, Watford 12 ROM board, Retroclinic Datacentre + HDD, matchbox co-proc, Viglen twin 40/80 5.25" discs, acorn cassette
BBC model B 32k issue 7, turboMMC, Opus Challenger 3 512k, Pi 3 coproc, Acorn 6502 coproc

Re: Cowgol achieves first light!

Postby hjalfi » Fri Sep 01, 2017 7:41 pm

Bit of a break, as I had to go back to the UK and be a fantasy Viking for a while...

So, a question for the house. What do you expect this to do?

Code: Select all

var index: uint8 := 0xFF;
print(array[index+1]);

Re: Cowgol achieves first light!

Postby dp11 » Fri Sep 01, 2017 7:54 pm

If I understand the question, the options are:

Code: Select all

1) print(array[0]);
2) print(array[256]);
3) error index overflow


It might depend on whether array[] is defined to have more than 256 elements.

Re: Cowgol achieves first light!

Postby Rich Talbot-Watkins » Fri Sep 01, 2017 9:10 pm

Probably needs some way of specifying the type of literals. Then you can define that addition always yields a result of the longest type (or whatever rule suits best). Are arrays always fixed size (so they can use indexed addressing as a shortcut in the case of them being <256), or are they interchangeable with pointer types?

Re: Cowgol achieves first light!

Postby hjalfi » Fri Sep 01, 2017 11:35 pm

Unfortunately it's not the literals that are the problem here... it's the operator. The compiler's reasoning is that because index is a uint8 and 1 is a magic constant, the result of the + must also be a uint8. So it uses an 8-bit computation and the result overflows.

This would be totally reasonable if you were doing this:

Code: Select all

var result: uint8 := index + 1;  # overflow acceptable here


But I think it's weird when it happens in an array dereference.

C doesn't suffer from this because all arithmetic is done in a machine word type --- int or long, usually. So in the equivalent example, index is promoted from uint8 to an int, then 1 is added to it (which fits), then the array is indexed. So it has the expected behaviour. But the 6502 doesn't have a machine word type of a useful size.
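
For comparison, here's that C behaviour spelled out as a standalone example (standard C; nothing here is Cowgol):

Code: Select all

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    static int array[512];
    uint8_t index = 0xFF;

    array[256] = 42;                    /* the element C will actually reach */
    printf("%d\n", array[index + 1]);   /* index is promoted to int, so the sum
                                           is 256 and this prints 42; no 8-bit
                                           wrap-around occurs */
    return 0;
}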

Consider this:

Code: Select all

record ThreeBytes
  a: int8; b: int8; c: int8;
end record;

var array: ThreeBytes[1000];
array[index+1].a := 0;


That last statement turns into something like:

Code: Select all

var t1: uint8 := index + 1;
var t2: uint16 := zero_extend(t1);
var t3: uint16 := t2*3;
# now t3 is a byte offset into the array


Trouble is, making the type rules for the evaluation of t1 depend on the context of the expression is really grim...

Re: Cowgol achieves first light!

Postby Rich Talbot-Watkins » Sat Sep 02, 2017 8:40 am

My suggestion was that operators return a result the same length as their longest operand. If you can then specify a length for literals, you can specify precisely what you mean.

e.g. suppose literals default to 8-bit values.

Code: Select all

index + 1;  # uint8 + uint8 = uint8

Code: Select all

index + 1L;  # uint8 + uint16 = uint16

Code: Select all

index + uint16(1);  # uint8 + uint16 = uint16


This is different to C, which always promotes to at least int, but is probably more useful in this case.

You could probably come up with a similar rule about how signedness propagates.

Re: Cowgol achieves first light!

Postby hjalfi » Sat Sep 02, 2017 7:19 pm

That's actually what it currently does (with the proviso that numeric constants are magic and have a length of 0, so they're always promoted to the same width as the other parameter). And that's what leads to the unexpected behaviour --- index+1 is evaluated using uint8 arithmetic, and therefore rolls over from 0xFF to 0x00...

So after thinking about it, and writing various replies and then deleting them again (I don't need a rubber duck, I have forums!), my proposal is:

* remove numeric promotion entirely --- if you use mismatched types, you get a compilation error. (Except for numeric constants, which are still silently promoted.) Given that on this scale of device, you're going to want to use as small a type as you can manage, I think I'd rather be verbose but predictable. Incidentally: this also forbids mixed signed/unsigned arithmetic without an explicit cast.

Code: Select all

var left: int8 := 5;
var right: int16 := 9;
var dest1: int16 := left + right; # error
var dest2: int16 := (left as int16) + right; # success


* the type which array indexing requires varies according to the bounds of the array.

Code: Select all

var smallarray: int8[50]; # index type is uint8
print(smallarray[5]); # succeeds
print(smallarray[5 as uint8]); # succeeds
print(smallarray[5 as int8]); # fails --- uint8 is not int8!
print(smallarray[5 as int16]); # fails

var bigarray: int8[500]; # index type is uint16
...examples are obvious...


* add a way to retrieve the index type, in order to shield the user from the details.

Code: Select all

var smallarray: int8[50];
var smallindex: smallarray@index := 0; # is a uint8

var bigarray: int8[500];
var bigindex: bigarray@index := 0; # is a uint16

var ptr: [int8];
var ptrindex: ptr@index := 0; # is an int16


This rather abandons being able to index a pointer with a uint8, but pointer indexing is pretty rare and I'm of half a mind to abandon it completely.

Does this still sound like a language people might want to use?

Re: Cowgol achieves first light!

Postby Rich Talbot-Watkins » Sat Sep 02, 2017 9:09 pm

That all sounds pretty sensible and predictable. On a platform like the 6502, telling the compiler as much as you can is normally a good thing.

If pointer types exist, I think there's definitely a case for being able to index from them with both uint8 and uint16, if anything just because a uint8 index translates really nicely into a 6502 instruction. I can't imagine getting rid of pointer types completely (or do you mean just disallowing dereferencing a pointer at an index != 0) - how else would you perform a write to screen memory, for example?

Many, many years ago, I had the idea to revolutionise the world of type-in games in the Micro User by creating my own language which had built-in sprite commands and produced a streamlined bytecode which was to be interpreted. Knew nothing of formal language / compiler development back then, and this little project didn't get very far, but it would've been a great way to improve the quality of the games and reduce the amount of typing, had it actually come to something.

The problem with BASIC compilers is the difficulty in inferring types of variables, so your approach seems like a much better way to get performance. How are you thinking of dividing up language vs library features?

Re: Cowgol achieves first light!

Postby hjalfi » Mon Sep 04, 2017 12:03 pm

You can do arithmetic on pointers, although unlike C the offsets are always in bytes, so you'd be able to do *(ptr+number). That would internally do a 16-bit add so it'd end up something like:

ldx #1
ldy #0
clc
.loop
lda ptr, y
adc number, y
sta temp, y
iny
dex
bpl loop
ldy #0
lda (temp), y


22 bytes, ugh. I'm not wedded to this idea. I agree, for 8-bit displacements it'd be so much nicer to end up with:

ldy number
lda (ptr), y


Of course the Z80 can't do variable indexing at all and would require the full add; but at least on the Z80 it's shorter...

ld hl, ptr
ld bc, number
add hl, bc
ld a, (hl)


8 bytes.

Bytecode's a time-honoured technique for trading speed for space: you end up with much denser but slower code. On these really small systems that's frequently an entirely acceptable tradeoff. The interpreter's complex, but with care you can get more done in the same space. Forth's a pretty good compromise, with a tiny, very fast interpreter and reasonably dense bytecode; I've also been looking with great interest at SWEET16, which has an impressively efficient 300-byte interpreter. In both, my example above would be:

Forth:
ptr num + @ ( eight bytes, usually )

SWEET16:
set r0, ptr
set r1, num
add r1
ld @r0
-- also 8 bytes


Couldn't find a SWEET16 license, though...

I seem to be going in the other direction and generating pretty verbose real machine code. I'm getting really concerned about code size, and I've actually given up on a lot of 16-bit operations and am generating threaded code instead. E.g. to do a multiplication dest=left*right, right now I have:

jsr __load6bytes
equw left
equw right
equw dest
jsr __mul16


...12 bytes, which isn't brilliant but is a lot better than it could be. I suspect the implementation of __load6bytes is going to end up being a bottleneck, but I won't really know until I start being able to compile non-trivial programs. But even with this, I have a nasty suspicion that the compiler's going to be too big to allow true self-hosting.

The Cowgol runtime in its current state is here, BTW. In all its hand-assembled glory. https://github.com/davidgiven/cowgol/bl ... untime.cow

Re: Cowgol achieves first light!

Postby hjalfi » Fri Sep 08, 2017 8:34 pm

I've just compiled and run my first ever not-quite-trivial Cowgol program!

...and it kinda blows. The generated code is pretty big and very, very slow. There are a lot of inefficiencies, but it's still real machine code; how can it be this bad? Do I have a bug somewhere that's causing a busy loop inside a division routine or something?

Anyway, the demo program is: https://github.com/davidgiven/cowgol/bl ... bounce.cow
The compiled disassembly, and your eyes will bleed, is: https://pastebin.com/jhxcPGJv
The symbol table, without which the disassembly won't make sense, is: https://pastebin.com/UmXcgwpW
And the zipped SSD which you can load into jsbeeb is: https://github.com/davidgiven/cowgol/fi ... mo.ssd.zip

(Anyone know where I can upload the SSD itself so I can paste a jsbeeb link here?)

(Also, before anyone gets too excited, this is still all cross-compiled from Linux. Lots of work to do before I try compiling the compiler.)

The program itself bounces ten mode 7 balls around the screen, using teletext block graphics. It manages maybe four frames a second. That's just not right. It's direct screen memory access so there's no OS overhead, and it's a total of 80 pixels. Drawing each pixel uses three 8-bit divisions and one 8-bit multiply. Is this reasonable? It's been a long time since I've used a 2MHz 6502...

(Note that since putting together the above links I have optimised my program down to just one 8-bit division and one multiply per pixel. It didn't make much difference to performance.)

There's a whole pile of trivial optimisations, including stuff like turning LDA addr / INC A / STA addr into INC addr, and using as many zero page variables as possible, which will save 30% of space at a stroke, but right now I'm focusing on actual functionality. I think the next step, unfortunately, is to rewrite the stack-machine-to-memory-machine converter; it's getting cumbersome...

Re: Cowgol achieves first light!

Postby SteveF » Fri Sep 08, 2017 11:07 pm

It's still a nice milestone to compile a non-trivial program! Congratulations!

FWIW, I rewrote this program in PLASMA to see how it worked there. Do you have any objections to me including it in my git repo as sample PLASMA code?

If I take out the vsync() call and add some ad-hoc timing logic, on an emulated Master 128 in b-em the PLASMA version takes 8.84 seconds to do 20 frames, which I make to be 2.3 frames/second. I'd expect the Cowgol version to be much faster as it's compiling to native code (the PLASMA code is compiled to a bytecode which is executed by a VM at runtime), so 4 frames/second doesn't seem unreasonable from this point of view - you're getting roughly double the performance.

I agree the actual performance is a bit disappointing for what seems quite a simple task, but I guess plotting and unplotting each pixel with all this calculation adds up. I'm sure one of our resident games programmers will be along with some advice on that aspect of the thing soon. :-)

In the past I have built a basic profiling tool by hacking an emulator to do (roughly) 'profile_count[pc]++' each time round the CPU emulation loop, then dumping the profile_count array out on emulator shutdown. In the absence of anything better you could maybe try that and see if you can identify a particular bottleneck.
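
Something along these lines, for instance (a hedged sketch; the hook point and the names profile_tick/profile_dump are made up, and the details depend entirely on the emulator being hacked):

Code: Select all

#include <stdint.h>
#include <stdio.h>

static uint32_t profile_count[65536];

/* Call once per emulated instruction, just before it executes. */
void profile_tick(uint16_t pc)
{
    profile_count[pc]++;
}

/* Dump the non-zero counters on emulator shutdown; sort offline against
   the symbol table to find the hot spots. */
void profile_dump(const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return;
    for (uint32_t pc = 0; pc < 65536; pc++)
        if (profile_count[pc])
            fprintf(f, "%04X %lu\n", (unsigned)pc, (unsigned long)profile_count[pc]);
    fclose(f);
}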

It might also be interesting to rewrite this in BBC BASIC and see how it performs there.

Re: Cowgol achieves first light!

Postby hjalfi » Sat Sep 09, 2017 4:50 pm

I hacked in a timer (a single frame takes 200ms) and found the problem.

The eight-bit division is pretty cheap --- about 125us. What's causing the immense slowdown is the 16-bit multiplication used to calculate the screen row address, which takes very nearly a millisecond a shot. Removing it takes the total draw time down to 50ms a frame (although of course it then doesn't actually draw anything).

Changing the address calculation to this:

Code: Select all

    var rowoffset: int16 := yq as int16;
    rowoffset := (rowoffset<<5) + (rowoffset<<3);
    var ptr: [uint8] := screen + 1 + rowoffset + (xq as int16);


...reduces the draw time to 130ms. That's better, but not as much as I was hoping. The trouble here is that my 16-bit lsl routine is O(wordsize*shiftamount) --- as far as I know the only way to do it is to shift the entire 16-bit value (actually, the routine is generic and will work with any word size <256 bytes) one bit at a time.
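
In other words, the generic routine behaves like this C model (an assumption based on the description above, not the actual runtime code):

Code: Select all

#include <stdint.h>

/* Generic multi-byte shift-left: the whole value moves one bit per pass,
   so the cost is O(width * amount). `value` is little-endian, `width` bytes. */
void lsl_generic(uint8_t *value, uint8_t width, uint8_t amount)
{
    while (amount--) {
        uint8_t carry = 0;
        for (uint8_t i = 0; i < width; i++) {
            uint8_t next_carry = value[i] >> 7;
            value[i] = (uint8_t)((value[i] << 1) | carry);
            carry = next_carry;
        }
    }
}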

I'm not sure there's anything really I can do about this --- 16-bit arithmetic is simply slow. The only really meaningful answer to 'how do I do multiplication quickly on an 8-bit machine?' is 'don't try to do multiplication on an 8-bit machine'.

So the program now uses a lookup table of screen rows. Time per frame: 60ms. https://github.com/davidgiven/cowgol/co ... 885ebd3477
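
The lookup-table idea, sketched in C (hedged: the real demo's table will differ; the MODE 7 layout assumed here is 25 rows of 40 bytes starting at &7C00):

Code: Select all

#include <stdint.h>

#define SCREEN_BASE 0x7C00u   /* MODE 7 screen start */
#define ROWS        25
#define COLS        40

static uint16_t row_address[ROWS];

/* Build the table once with cheap adds; after that, finding a character
   cell needs no multiplication at all. */
void build_row_table(void)
{
    uint16_t addr = SCREEN_BASE;
    for (int y = 0; y < ROWS; y++) {
        row_address[y] = addr;
        addr += COLS;
    }
}

uint16_t cell_address(uint8_t x, uint8_t y)
{
    return (uint16_t)(row_address[y] + x);   /* table lookup plus column offset */
}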

BTW, it'd totally be possible to compile Cowgol to PLASMA. The Cowgol frontend is stack-based, so it should be straightforward. Right now the typechecker stage both does the type checking and generates the memory-machine backend opcodes which the code generator consumes; but I'm about to start splitting that up, so that the typechecker emits frontend opcodes and there's an extra pass to generate backend opcodes. Converting to PLASMA would consume the typechecked frontend opcodes and skip the Cowgol code generator.

What you wouldn't get is the smart variable placement (Cowgol analyses the call graph and assigns variables to the same address if it can prove that they'll never need to be set at the same time), but you might not want that for PLASMA anyway.

Re: Cowgol achieves first light!

Postby SteveF » Sat Sep 09, 2017 6:13 pm

Nice one!

I'm not sure if it's helpful, but could a more efficient left shift be done by repeated adding? So you'd implement x=x<<3 as 'x=x+x; x=x+x; x=x+x'. The more I think about it, the less likely it seems to be a good idea, but I'll mention it anyway on the off-chance.

It might be interesting to take a look at generating PLASMA bytecode once this is finished, although if you can crack the code size problem there probably wouldn't be any point given the excellent performance you're getting. Still, it would be a cool hack all the same. :-)

Re: Cowgol achieves first light!

Postby Rich Talbot-Watkins » Sun Sep 10, 2017 11:11 am

How are you determining the return type of multiplies? The natural result length is the sum of the operand lengths, so you'd naturally expect a uint8*uint8 to yield a uint16. If you're currently applying the same rules as for an add, i.e. both operands need to be the same size as the result, you're making more work for yourself by insisting that an 8-bit input is expanded to a 16-bit result (and performing a uint16*uint16 -> uint16 multiply).

I think to improve performance, you might need to recognise specific inputs and generate different code as a result. For example, your LSL16 could be improved by loading the LSB into A and then performing ASL A : ROL msb as many times as necessary. Multiplying (or dividing) by a constant should always attempt to degenerate into an LSL/LSR/ASR where possible, or even a sum of them if it makes sense (e.g. *40 = *32 + *8). If you have any provision for 'inlining' code snippets, that would help the optimisation possibilities.
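
The strength-reduction idea in C form (illustrative only; how it would look inside Cowgol's code generator is another matter):

Code: Select all

#include <stdint.h>

/* *40 decomposed into shifts and adds: 40n = 32n + 8n. */
uint16_t mul40(uint16_t n)
{
    return (uint16_t)((n << 5) + (n << 3));
}

/* General pattern for any constant: add n<<i for every set bit i of k.
   A code generator would emit only the shifts/adds for the bits actually set. */
uint16_t mul_const(uint16_t n, uint16_t k)
{
    uint16_t result = 0;
    for (int i = 0; i < 16; i++)
        if (k & (1u << i))
            result += (uint16_t)(n << i);
    return result;
}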

I saw you replaced the separate / and % operations with a divmod function, which is definitely a good thing (even C has such a thing), as that can be a hard optimization to spot. Remember that there are also various hacks for dividing by small constants, e.g.

Divide by 3 is:

Code: Select all

LDA byte:LSR A:LSR A
LDX #4:.loop
ADC byte:ROR A:LSR A
DEX:BNE loop


Divide by 7 is:

Code: Select all

LDA byte:LSR A:LSR A:LSR A
ADC byte:ROR A:LSR A:LSR A
ADC byte:ROR A:LSR A:LSR A


Divide by 15 is:

Code: Select all

LDA byte:LSR A:LSR A:LSR A:LSR A
ADC byte:ROR A:LSR A:LSR A:LSR A
ADC byte:ROR A:LSR A:LSR A:LSR A


Re: Cowgol achieves first light!

Postby hjalfi » Sun Sep 10, 2017 6:26 pm

Cowgol's type rules for arithmetic operators are essentially T = T op T, so in this case it's doing a 16-bit x 16-bit = 32-bit multiply and discarding the top two bytes of the result. That was a deliberate choice --- it keeps the type rules simple, and it means I can use the same helper routine for signed and unsigned values. The routine itself takes the word width as an input and will work on any size of value. This does make it slower than it could be, but right now I'm more concerned with code size than speed (unexpected pathological behaviour notwithstanding).
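
Modelled in C (and ignoring the fact that the real helper is width-generic), those semantics amount to something like this --- an assumption from the description above, not the actual routine:

Code: Select all

#include <stdint.h>

/* T = T op T for multiply: compute the full product, keep only the low half. */
uint16_t mul16(uint16_t left, uint16_t right)
{
    uint32_t full = (uint32_t)left * right;  /* 16-bit x 16-bit -> 32-bit product */
    return (uint16_t)full;                   /* top two bytes are discarded */
}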

Eventually I suppose I'd want helper routines which specifically work on 16-bit values, as this would allow much more optimised implementations.

More exotic routines like fast division-by-constant and 8x8=16 multiply probably belong in a library rather than the language core. There's nothing to stop someone doing:

Code: Select all

var offset: uint16 := mul8x8x16(left_8_bits, right_8_bits);


...after all.

