Space-efficient arithmetic on the 6502

Discuss all aspects of programming here. From 8-bit through to modern architectures.
User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Space-efficient arithmetic on the 6502

Postby hjalfi » Sat Aug 05, 2017 9:38 am

I'm working on a Secret Project™ which involves running really big 6502 binaries. Code size is a real issue.

Trouble is, the 6502 is pretty wordy when it come to non-eight-bit arithmetic. Take a simple 3op add:

Code: Select all

CLC
LDA left
ADC right
STA dest
LDA left+1
ADC right+1
STA dest+1


Worst case (assuming 16-bit addresses): 19 bytes.

I can shrink it a bit using a loop.

Code: Select all

LDY #0
CLC
.loop
LDA left, Y
ADC right, Y
STA dest, Y
INY
CPY #2
BNE loop


17 bytes.

If I were willing to use big-endian number representation, I can get that to 15 bytes, because that lets the loop index decrement rather than increment, so I save the CPY at the end.

The absolute best case I can think of is to cheat:

Code: Select all

JSR add16_3op
EQUW left
EQUW right
EQUB dest


...that's 9 bytes, but it's effectively turning my code into a threaded interpreter. It'll work, but the majority of the time is going to be spent shuffling parameters and the return address rather than actually adding numbers. Also, it's really limited as to what addressing modes can be supported --- I'd need distinct helper functions for every combination.

Am I missing anything obvious here?

User avatar
BigEd
Posts: 1499
Joined: Sun Jan 24, 2010 10:24 am
Location: West
Contact:

Re: Space-efficient arithmetic on the 6502

Postby BigEd » Sat Aug 05, 2017 9:45 am

One approach is to operate on an accumulator or stack in zero page: use one or two values in zp as your working set, and load and store from general memory only as necessary. Which is to say, instead of working in terms of 3-ops, use 2-ops, 1-ops (accumulator) or 0-ops (stack-based)

User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: Space-efficient arithmetic on the 6502

Postby hjalfi » Sat Aug 05, 2017 2:30 pm

Hmm.

A 16-bit accumulator would be doable, just. Assuming that it's in AX, a 2op AX+mem->AX add becomes:

Code: Select all

tay
txa
clc
adc mem+0
tax
tya
adc mem+1


...11 bytes. left+right->AX becomes:

Code: Select all

clc
lda left+0
adc right+0
tax
lda left+1
adc right+1


...14 bytes. In the real world, most of the time the next operation would be to write the accumulator back to memory, which is six bytes; this makes the mem+mem->mem case 20 bytes, which is worse than the existing worst case, but the best case is quite a lot better... I'll have to think about whether I have enough lookahead to decide which strategy to use (I'm not sure I do). It also uses all three registers, so I don't get any indexed addressing modes.

Using an accumulator in zero page doesn't save much --- one byte per parameter. The 6502's not friendly to 2op memory instructions, as the only arithmetic operations which modify memory are inc and dec. Having to compute the value in A and then write it back means that I'm effectively writing 3op instructions anyway.

dp11
Posts: 708
Joined: Sun Aug 12, 2012 8:47 pm

Re: Space-efficient arithmetic on the 6502

Postby dp11 » Sat Aug 05, 2017 3:13 pm

Code: Select all

LDY #0
CLC
.loop
LDA left, Y
ADC right, Y
STA dest, Y
INY
CPY #2
BNE loop


can be a bit better

Code: Select all

LDY #&FE
CLC
.loop
LDA left-&FE, Y
ADC right-&FE, Y
STA dest-&FE, Y
INY
BNE loop


15bytes now

User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: Space-efficient arithmetic on the 6502

Postby hjalfi » Sat Aug 05, 2017 5:04 pm

Ooh, nice! Can't use it for pointer indexing, but I'm totally using that. Cheers!

User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: Space-efficient arithmetic on the 6502

Postby hjalfi » Mon Aug 14, 2017 10:26 pm

Next question: signed comparisons. Oh boy, this turned out to be a can of worms.

I've seen a lot of places saying you can use the N bit to do a signed comparison:

LDA left
CMP right
BMI left_is_less_than_right

However, this page says you can't do that: http://www.6502.org/tutorials/compare_beyond.html#5.1

It's recommending this:

LDA left
SEC
SBC right
BVC 1f
EOR #0x80
1:
BMI left_is_less_than_right

That's an extra five bytes! Is this really necessary, or is there an shorter way?

User avatar
BigEd
Posts: 1499
Joined: Sun Jan 24, 2010 10:24 am
Location: West
Contact:

Re: Space-efficient arithmetic on the 6502

Postby BigEd » Tue Aug 15, 2017 5:31 am

That's a Bruce Clark document - it's possible that he's missed a trick, but very unlikely.

The problem with signed comparison is this: the inputs vary from -128 to 127, so the implied subtraction when you compare them ranges from -255 to +255 - that's a 9-bit result. It takes extra steps to construct the sign bit of that result.

User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: Space-efficient arithmetic on the 6502

Postby hjalfi » Tue Aug 15, 2017 10:00 pm

Well, blast. I was kinda hoping the author would turn out to be a crank.

Looks like the Z80 and the 6809 are similarly inflicted. Apparently people didn't do signed arithmetic back then?

The article's extended-comparison-by-subtraction tricks look very cool, but not amenable to looping --- it looks hard to maintain the flags needed for the comparison while also testing the loop counter at the same time. So I need to use the other approach:

Code: Select all

    LDY #0
.loop
    LDA (left), Y
    CMP (right), Y
    BMI less_than
    INY
    CPY #4
    BNE loop
.not_less_than


...which is, blech, 12 bytes (14 if I use an abs addressing mode). The signed version requires the sign adjustment, so that'll be 17 bytes (19 bytes). Ugh.

User avatar
Rich Talbot-Watkins
Posts: 1121
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Re: Space-efficient arithmetic on the 6502

Postby Rich Talbot-Watkins » Tue Aug 15, 2017 10:47 pm

Is that right? That should be a BCC if you're doing an unsigned comparison left < right. What does BMI actually do it this context? Comparing 1 < 254 would definitely yield N clear, which is a false negative.

Also, that short-circuit out isn't right is it? You can only perform the flag check at the end after looping through all the bytes, or you can get false positives.

Also also: CMP doesn't take the carry flag as an input, so you'd need to use SBC there, and SEC at the beginning to initialize the carry flag for the first time. I think that's everything.

(Disclaimer: it's late)

dp11
Posts: 708
Joined: Sun Aug 12, 2012 8:47 pm

Re: Space-efficient arithmetic on the 6502

Postby dp11 » Wed Aug 16, 2017 8:54 am

untested but you should get the idea.

Code: Select all

// unsigned 32bit  compare
// &xxxx LSB
// &xxxx+1
// &xxxx+2
// &xxxx+3 MSB
   
   LDY #3
.loop
       LDA (left), Y
       CMP (right), Y
       BNE   cmp_done
       DEY
       BPL loop
.cmp_done
   // flags NZ corrupt
   // Carry clear left < right
   // Carry set left >= right

User avatar
Rich Talbot-Watkins
Posts: 1121
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Re: Space-efficient arithmetic on the 6502

Postby Rich Talbot-Watkins » Wed Aug 16, 2017 9:46 am

Again, the same problems apply there:
  • you're going through the bytes in MSB->LSB order (needs to start at addr+0 and finish at addr+3, or hold the value big-endian).
  • that needs to be SBC so that the C flag is used as an input in subsequent iterations, and so you also need a SEC at the start (or take the initial LDA/CMP out of the loop).
  • you can't short-circuit out with a BNE mid-loop as this will leave C in an incorrect state (imagine A < B, but lsb(A) > lsb(B))

An unsigned 32-bit compare needs to be something like this:

Code: Select all

SEC
LDY #0
.loop
LDA (left),Y
SBC (right),Y
INY
CPY #4
BNE loop
; Z=1: left == right
; C=0: left < right
; C=1: left >= right


As discussed already, there are optimizations: e.g. hold the value big endian and loop from Y=3 to Y=0 with BPL loop; or loop from LDY #&FC and reference (left-&FC) and (right-&FC), at the cost of two extra cycles per iteration in most cases.

User avatar
Rich Talbot-Watkins
Posts: 1121
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Re: Space-efficient arithmetic on the 6502

Postby Rich Talbot-Watkins » Wed Aug 16, 2017 9:50 am

For completeness, a signed 32-bit compare would be like this:

Code: Select all

SEC
LDY #0
.loop
LDA (left),Y
SBC (right),Y
INY
CPY #4
BNE loop
; Z=1: left == right
BVC vclear
EOR #&80
.vclear
; N=0: left >= right
; N=1: left < right
; Z now meaningless

dp11
Posts: 708
Joined: Sun Aug 12, 2012 8:47 pm

Re: Space-efficient arithmetic on the 6502

Postby dp11 » Wed Aug 16, 2017 9:53 am

If you don't need the equal case and just less than or >=. Then you can work MSB to LSB I think

User avatar
Rich Talbot-Watkins
Posts: 1121
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Re: Space-efficient arithmetic on the 6502

Postby Rich Talbot-Watkins » Wed Aug 16, 2017 10:14 am

Ahhh, that's an interesting approach I don't remember seeing before. Just worked through a few examples and it does indeed seem to work, and even gives the right result in Z (I think).

Was trying out whether a similar approach is viable for signed values, but it seems to get a bit hairy!

User avatar
Rich Talbot-Watkins
Posts: 1121
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Re: Space-efficient arithmetic on the 6502

Postby Rich Talbot-Watkins » Wed Aug 16, 2017 10:17 am

I guess another thing we need to know is whether the speed of the routines is important - are we optimizing for size at the expense of speed, or looking for the best compromise between the two? Going MSB->LSB can provide a nice speed optimization by exiting early, but in the signed case, seems to lead to much bulkier code.

dp11
Posts: 708
Joined: Sun Aug 12, 2012 8:47 pm

Re: Space-efficient arithmetic on the 6502

Postby dp11 » Wed Aug 16, 2017 10:19 am

You can extract the Z case if required, but that usually takes more cycles. Yep signed is another whole bags of worms.

User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: Space-efficient arithmetic on the 6502

Postby hjalfi » Wed Aug 16, 2017 12:06 pm

I should know better than to post these things after midnight --- my example managed to mangle the two different approaches. dp11's code is what I was failing to think of. It's only eleven bytes, too; you don't need a final comparison.

Being able to count backwards saves an explicit index comparison. Luckily since the short-circuiting form of the comparisons wants to go MSB to LSB, we get to do this naturally...

So, when doing a signed comparison, we need to look at the result of V and N, but we only need to preserve C from one byte to the next, that should allow, assuming I haven't made any more stupid mistakes:

Code: Select all

  LDY #3
  SEC
.loop
  LDA (left), Y
  SBC (right), Y ; sets V and N (plus flags we don't care about)
  BVC vclear
  EOR #&80
.vclear
  BMI left_is_less_than_right
  DEY ; clobbers N
  BPL loop
.left_is_greater_than_or_equal_to_right


= 15 bytes.

(I'm considering only magnitude comparisons. Equality comparisons are easier as we don't care about signedness.)

User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: Space-efficient arithmetic on the 6502

Postby hjalfi » Sun Aug 20, 2017 11:50 am

Important safety tip:

As I have just discovered, and should have realised ahead of time, you can't invert a short-circuited comparison by simply flipping the 0x10 bit on the main conditional instruction. You end up with gibberish. This:

Code: Select all

ldy #3
.loop
lda left,y
cmp right,y
bne left_is_not_right # this line works
dey
bpl loop
.left_is_right


...is not equivalent to this:

Code: Select all

ldy #3
.loop
lda left,y
cmp right,y
beq left_is_right # swapped bne to beq and reversed the two labels; this doesn't work
dey
bpl loop
.left_is_not_right


Because, of course, since the conditional is being short-circuited it's not symmetrical.

Because of reasons, I don't have control of the order of the code, so I have to support both cases. The fix is, of course, to add an extra JMP after the final BPL, but that bumps the size of to 15 bytes worst case. Still better than the 18-byte SBC method, though. (And the same applies to magnitude comparisons.)

User avatar
hjalfi
Posts: 74
Joined: Sat May 13, 2017 10:17 pm
Location: Zürich, Switzelrand
Contact:

Re: Space-efficient arithmetic on the 6502

Postby hjalfi » Sun Aug 20, 2017 4:43 pm

Also, my original looped add doesn't work:

Code: Select all

LDY #0
CLC
.loop
LDA (left), Y
ADC (right), Y
STA (dest), Y
INY
CPY #4
BNE loop


...because, of course, CPY clobbers C. The quick fix is to save P on the stack so that we can test Y without causing problems, but that adds another four bytes, to make it 17 bytes (best case) to 20 bytes (worst case).

This is better, if a bit evil:

Code: Select all

LDY #0
LDX #3 -- word size minus one
CLC
.loop
LDA (left), Y
ADC (right), Y
STA (dest), Y
INY
DEX
BPL loop


15 bytes min, 18 max, depending on the address mode. But I hate having to have two loop counters going in opposite directions. Have I missed anything?

(If all three parameters are absolute addresses, I get to use dp11's counting-from-0xFC trick, which reduces the worst case to 15 bytes. But it's pretty specialist.)

AFAICT the best unrolled 16-bit add is 12 bytes; the worst is 21 (where all three addressing modes are pointers, requiring indexing with Y).

dp11
Posts: 708
Joined: Sun Aug 12, 2012 8:47 pm

Re: Space-efficient arithmetic on the 6502

Postby dp11 » Sun Aug 20, 2017 5:05 pm

swap the endianess of the data ?


Return to “programming”

Who is online

Users browsing this forum: No registered users and 1 guest