Programming help required re: Huffman coding

bbc micro/electron/atom/risc os coding queries and routines
fuzzel
Posts: 798
Joined: Sun Jan 02, 2005 1:16 pm
Location: Cullercoats, North Tyneside
Contact:

Programming help required re: Huffman coding

Post by fuzzel »

I'm busy writing Urban Upstart, a conversion of a text adventure with graphics originally for the C64 and Spectrum and because of the amount of graphic screens I'd like to compress the game text as much as possible. I've looked at three options as follows:

No compressed text: Decompressor 0 bytes, data 11,500 bytes, compression table+lookups: 0 bytes
TOTAL 11,500 bytes
Compressed text using dictionary of common "words": Decompressor code 100 bytes, compressed data 5,050 bytes, compression table+lookups 1,400 bytes
TOTAL 6,550 bytes
Huffman compression: Decompressor code: 1,800 bytes, compressed data 6,000 bytes, compression table+lookups 0 bytes
TOTAL 7,800 bytes
(Level 9 used Huffman for their games btw)

In comparison, option 2 is the best bet because my Huffman code is very inefficient. Because Huffman code consists of a long list of bits to represent the text it achieves good compression without a table (eg the character space is represented by 3 bits: 000 or 3/8 of a byte) but to do the compression my program starts at bit 1 and works its way through a tree of 0 and 1 branches until it finds a letter, prints it and starts again. Somehow I need to find a way to shorten my Huffman code to less than 550 bytes so it can beat option 2.

I'm wondering if someone out there can point me in the right direction in terms of shortening my code. Here's an example:

The Huffman code for "a test" based on my table is as follows: 0100000001111010100011 (in memory this will be byte 1=64, byte 2=122, byte 3=140 - with 2 zeros added on the end)
a=0100
t=0011
e=110
s=1010
space=000
So you can see from the Huffman code for "a test" that there's no way of determining how many bits are in the first letter, my program will look at 0, branch as there's no letter for 0, then branch again as there's no letter for 01, then branch again as there's no letter for 010 then it comes to 0100 or "a" and prints it. Then it starts again at bit 5. A further complication is that in my Huffman tree the least common letters ( ) - 4 are represented by 14 consecutive bits so they're longer than a byte but this isn't a problem in terms of space because each only occurs once in the entire text. But I need to branch 14 times in my code before I reach them. Very convoluted. Hopefully someone knows a shortcut I can use, perhaps based on the above details write a short program that prints "a test" (in assembly language ideally but I suppose a basic program could be converted).
User avatar
roland
Posts: 4356
Joined: Thu Aug 29, 2013 9:29 pm
Location: Born (NL)
Contact:

Re: Programming help required re: Huffman coding

Post by roland »

I don't know if it helps you .... I wrote a Huffman compression and decompression in November 1995 and I published an article in Atom News. It's online, in Dutch but maybe the code gives you a start. It's written in Atom Basic with the P-Charme extension ROM. It uses some procedures.

The article can be read at https://site.acornatom.nl/documentatie/ ... 954018.gif
FPGAtom: 512 KB RAM, Real Time Clock and 64 colours
MAN WOMAN :shock:
User avatar
Richard Russell
Posts: 2337
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: Programming help required re: Huffman coding

Post by Richard Russell »

fuzzel wrote:
Sun May 30, 2021 4:13 pm
I'm wondering if someone out there can point me in the right direction in terms of shortening my code.
Have you considered LZW rather than Huffman? The apLib compression library comes with a decoder written in assembly language; admittedly it's x86 code rather than 6502 but it's astonishingly short so if it could be translated it might beat your current best solution:

Code: Select all

;;
;; aPLib compression library  -  the smaller the better :)
;;
;; fasm assembler depacker
;;
;; Copyright (c) 1998-2014 Joergen Ibsen
;; All Rights Reserved
;;
;; http://www.ibsensoftware.com/
;;

format MS COFF

public aP_depack_asm as '_aP_depack_asm'

; =============================================================

section '.text' code readable executable

aP_depack_asm:
    ; aP_depack_asm(const void *source, void *destination)

    _ret$  equ 7*4
    _src$  equ 8*4 + 4
    _dst$  equ 8*4 + 8

    pushad

    mov    esi, [esp + _src$] ; C calling convention
    mov    edi, [esp + _dst$]

    cld
    mov    dl, 80h
    xor    ebx,ebx

literal:
    movsb
    mov    bl, 2
nexttag:
    call   getbit
    jnc    literal

    xor    ecx, ecx
    call   getbit
    jnc    codepair
    xor    eax, eax
    call   getbit
    jnc    shortmatch
    mov    bl, 2
    inc    ecx
    mov    al, 10h
  .getmorebits:
    call   getbit
    adc    al, al
    jnc    .getmorebits
    jnz    domatch
    stosb
    jmp    nexttag
codepair:
    call   getgamma_no_ecx
    sub    ecx, ebx
    jnz    normalcodepair
    call   getgamma
    jmp    domatch_lastpos

shortmatch:
    lodsb
    shr    eax, 1
    jz     donedepacking
    adc    ecx, ecx
    jmp    domatch_with_2inc

normalcodepair:
    xchg   eax, ecx
    dec    eax
    shl    eax, 8
    lodsb
    call   getgamma

    cmp    eax, 32000
    jae    domatch_with_2inc
    cmp    ah, 5
    jae    domatch_with_inc
    cmp    eax, 7fh
    ja     domatch_new_lastpos

domatch_with_2inc:
    inc    ecx

domatch_with_inc:
    inc    ecx

domatch_new_lastpos:
    xchg   eax, ebp
domatch_lastpos:
    mov    eax, ebp

    mov    bl, 1

domatch:
    push   esi
    mov    esi, edi
    sub    esi, eax
    rep    movsb
    pop    esi
    jmp    nexttag

getbit:
    add    dl, dl
    jnz    .stillbitsleft
    mov    dl, [esi]
    inc    esi
    adc    dl, dl
  .stillbitsleft:
    ret

getgamma:
    xor    ecx, ecx
getgamma_no_ecx:
    inc    ecx
  .getgammaloop:
    call   getbit
    adc    ecx, ecx
    call   getbit
    jc     .getgammaloop
    ret

donedepacking:
    sub    edi, [esp + _dst$]
    mov    [esp + _ret$], edi ; return unpacked length in eax

    popad

    ret

; =============================================================
I am suffering from 'cognitive decline' and depression. If you have a comment about the style or tone of this message please report it to the moderators by clicking the exclamation mark icon, rather than complaining on the public forum.
fuzzel
Posts: 798
Joined: Sun Jan 02, 2005 1:16 pm
Location: Cullercoats, North Tyneside
Contact:

Re: Programming help required re: Huffman coding

Post by fuzzel »

Thanks for the advice guys. Roland, my Dutch isn't up to much unfortunately, I hear it's possibly the closest language to English but not close enough for me! There's always google translate though... Had a quick look and it's given me the idea to write something in Basic for ease of debugging then rewrite in assembly language. One thing I have in mind would be to compare the left-hand-most bit of the compressed text with a table of the individual letters each stored as 16 bits / 2 bytes as I need up to 14, using ROL A to keep shifting to the next bit. Richard, I'll read up on LZW, it may be easier to get my head around than Huffman and possibly offer better compression. One thing that intrigues me is whether there are compression techniques which involve a trade off between processing time and level of compression achieved, I notice that Level 9's text printing isn't super fast so it may be that they use a higher level of compression than the one I've looked at somehow.
User avatar
TobyLobster
Posts: 74
Joined: Sat Aug 31, 2019 7:58 am
Contact:

Re: Programming help required re: Huffman coding

Post by TobyLobster »

I've not used this but maybe this is of use: https://github.com/clbr/NTC
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

If you are working under emulation, you can hand over the compression duties to the host. You only ever need to decompress on the target. And the more complex the compression scheme, the longer the decompression code will have to grow, and therefore the smaller the benefit. For the same total size of code plus data, lighter compression with a smaller decompressor probably will be subjectively better than heavier compression with a larger, slower decompressor.

For a first effort, I'd probably use a byte-oriented, dictionary-based scheme; where ASCII codes &00-&7F are taken literally and &80-&8F are treated as tokens indicating common sequences of characters (possibly complete words, but also could just be digraphs such as EA, NG and so forth). Maybe even steal a code from the low range &00-&1F or &7F to indicate one of a second set of 256 tokens (2 bytes is still a saving on a word 3 letters or longer). Then you need a table of where in memory the expansion of each token begins; you can store the expansions in a simple flat list, with bit 8 set on the last character. (Depending on the length of this table, you might even have some spare bits in the base addresses to indicate something.)

The host-side script you will need to generate the dictionary (don't even think about doing this by hand -- you will miss loads) can easily be modified into a tool for spotting repeated sequences of instructions that can be abstracted out with JSR / RTS -- anything 8 bytes or longer gives a net saving when moved out into a subroutine.
User avatar
Richard Russell
Posts: 2337
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: Programming help required re: Huffman coding

Post by Richard Russell »

julie_m wrote:
Mon May 31, 2021 2:35 am
And the more complex the compression scheme, the longer the decompression code will have to grow, and therefore the smaller the benefit.
That's not always true. For any given decompression scheme it is often possible to have different degrees of compression, depending on how complex the compressor is and how quickly it needs to run. For example in a dictionary-based approach the degree of compression may depend on how deep the dictionary search is, without the decompressor necessarily being affected at all.

That's one reason why I suggested LZW (as for example implemented in apLib) which suits a situation in which the compression happens only once, and can be as complex and time-consuming as you like, whereas the decompression code can be very small and fast. Another consideration is that Huffman coding/decoding involves handling variable-length binary values, which doesn't suit a CPU that is not rich in large integer and bit-twiddling operations (e.g. bit scans). On the other hand LZW is largely byte-oriented and therefore more likely to suit a CPU like the 6502.
I am suffering from 'cognitive decline' and depression. If you have a comment about the style or tone of this message please report it to the moderators by clicking the exclamation mark icon, rather than complaining on the public forum.
User avatar
BigEd
Posts: 4107
Joined: Sun Jan 24, 2010 10:24 am
Location: West Country
Contact:

Re: Programming help required re: Huffman coding

Post by BigEd »

There's a link to a rather handy compression library and some comparison of different libraries here:
https://github.com/emmanuel-marty/lzsa
with code for 6502, z80, 6809...
(as mentioned by ChrisB over in the Urban Upstart - Spectrum to BBC Conversion thread)

I had been following various pointers to compression schemes and thought LZ4 sounded quite good, but this LZSA page says we can do better. It's handy to have a page which concentrates on 8 bit code - compression schemes which are good for modern CPUs may be painful in 6502 land.

So while I'm here, and somewhat OT, here are some links I'd found:
How LZ4 works
Difference: LZ77 vs. LZ4 vs. LZ4HC (compression algorithms)? (stackoverflow.com)
LZSS vs. LZ77 compression difference (stackoverflow.com)

(tricky has described his scheme as LZSS)
User avatar
Richard Russell
Posts: 2337
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: Programming help required re: Huffman coding

Post by Richard Russell »

BigEd wrote:
Mon May 31, 2021 2:36 pm
I had been following various pointers to compression schemes and thought LZ4 sounded quite good, but this LZSA page says we can do better.
Define "better"; it very much depends on the application! For example in a situation in which the compression takes place just once, and not even necessarily on the same platform (e.g. to compress the resources of a game), speed, size and 8-bit friendliness of the compression routine are irrelevant. LZ4 is unlikely to be optimum in that scenario.
I am suffering from 'cognitive decline' and depression. If you have a comment about the style or tone of this message please report it to the moderators by clicking the exclamation mark icon, rather than complaining on the public forum.
User avatar
ChrisB
Posts: 145
Joined: Wed Oct 05, 2011 10:37 pm
Location: Surrey
Contact:

Re: Programming help required re: Huffman coding

Post by ChrisB »

Right - I've spent far too much time on this today. I'll tidy up the code and post in a bit.

So attached is an SSD with my efforts.

Run *huff to display some text then CALL &3000 for successive lines.

Decompression code is 86 bytes. It just writes each character as it decompresses with no effort to word wrap. Text is 1097 bytes that compresses down to 614 bytes with a decode tree of 166 bytes.
huff.ssd
(1.5 KiB) Downloaded 5 times
User avatar
BigEd
Posts: 4107
Joined: Sun Jan 24, 2010 10:24 am
Location: West Country
Contact:

Re: Programming help required re: Huffman coding

Post by BigEd »

(Well done - I didn't mean to get us too far away from Huffman. It always struck me as a fine scheme. )
User avatar
ChrisB
Posts: 145
Joined: Wed Oct 05, 2011 10:37 pm
Location: Surrey
Contact:

Re: Programming help required re: Huffman coding

Post by ChrisB »

OK - As promised.
huff.ssd
(9 KiB) Downloaded 7 times
Files as follows:
huff - new smaller version ;) I've reduced the code size to 74 bytes but I'm sure someone could do better.
encode - the encode routine. I'm sure this is very suboptimal - but it's a one time operation. I ran it in BBCBasic for SDL.
source - beebasm source file.

Hope this helps.
User avatar
Richard Russell
Posts: 2337
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: Programming help required re: Huffman coding

Post by Richard Russell »

ChrisB wrote:
Mon May 31, 2021 4:07 pm
Text is 1097 bytes that compresses down to 614 bytes with a decode tree of 166 bytes.
For comparison, that same text compresses to 715 bytes using apLib, so Huffman wins in this case.
I am suffering from 'cognitive decline' and depression. If you have a comment about the style or tone of this message please report it to the moderators by clicking the exclamation mark icon, rather than complaining on the public forum.
User avatar
ChrisB
Posts: 145
Joined: Wed Oct 05, 2011 10:37 pm
Location: Surrey
Contact:

Re: Programming help required re: Huffman coding

Post by ChrisB »

Is that the whole text - or each sentence individually? As I understand the requirement here this is a for an adventure game where there are many individual pieces of text and as such simply compressing the whole block is not suitable. When I looked at LZSA in the other thread small blocks of data (<512 bytes) did not compress at all well.
User avatar
Richard Russell
Posts: 2337
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: Programming help required re: Huffman coding

Post by Richard Russell »

ChrisB wrote:
Mon May 31, 2021 6:21 pm
Is that the whole text - or each sentence individually?
It was the whole 1097 bytes. As you say dictionary-based systems perform poorly with short segments.
I am suffering from 'cognitive decline' and depression. If you have a comment about the style or tone of this message please report it to the moderators by clicking the exclamation mark icon, rather than complaining on the public forum.
fuzzel
Posts: 798
Joined: Sun Jan 02, 2005 1:16 pm
Location: Cullercoats, North Tyneside
Contact:

Re: Programming help required re: Huffman coding

Post by fuzzel »

Before getting down to work this morning I was pondering julie_m's posting from very early on and it got me thinking about the way I currently access my data tables for compressed words, messages, locations, objects and even the graphics, the intention for this topic being not only data compression but other memory saving techniques so I can cram as many pics in my game as possible. I'd fallen into the habit, not sure why, of reserving a table of 2 pages of memory pointing to the high byte and low byte of each item in the table (catering in each case for 256 items). So for example, to print message 12, the program would look up the high byte and low byte for message 12's location in memory and then print it. Totally unnecessary! All I have to do to print the nth message is have a subroutine search for the (n-1)th message end indicator and the next byte will be the start of the nth message to print. So now I can remove all my lookup tables which total 5x512=2,560 bytes which will free up space for approximately 13 extra graphic screens. Nice.
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

Linear searching for end-of-message markers every time sounds ever so slightly sub-optimal. Certainly wherever I have variable-length records in BCP, I am keeping track of offsets (relative to the start of the list, so it does not matter where in memory the design data is sitting; this seemed the best way of proofing against code growth) and lengths of each entry. Also, I like to keep my low and high bytes together -- but all my tables either have few enough entries to fit the whole lot within 256 bytes, or fixed-length records long enough to use an existing multiply routine to calculate the start address in a zero page pointer.

On the other hand, storing the start of -- say -- just every 8th or 16th message shrinks the table of offsets dramatically, and you only ever have to search for at most 7 or 15 end-of-message markers. So you can pick the prime locations for the most commonly-accessed messages, and keep the slowest ones for game-over scenarios and long descriptions of rarely-visited rooms.
User avatar
ChrisB
Posts: 145
Joined: Wed Oct 05, 2011 10:37 pm
Location: Surrey
Contact:

Re: Programming help required re: Huffman coding

Post by ChrisB »

I realise that this issue has now been "solved" however I've continued to play around with this.

I have a longer piece of text (4430 bytes). My code as above compresses this down to 2517 bytes + 222 bytes for the tree and 76 for the decoder. Total 2815 bytes.

The tree is quite large - so I was wondering how to reduce that. (Obviously you could store the tree another way but this would require more code to unpack). First thought - if you make the start of sentences lower case then you should reduce the potential number of items to tokenise. There is however a bit more code to put these capitals back. It sort of works. New compressed size is 2494 bytes, with a 218 byte tree and 98 bytes of code. Total 2810.

My next thought was for a combined tokenisation and encode. Taking the top x most used words and allocating them to ascii values 1-30. The idea is to trade off a larger tree for less data to compress. Seems that also isn't great. :(

I chose to allocate the weight for the tree as the number of repetitions times their length. So 3 occurrences of a 10 letter word would be higher weighted than 6 three letter words. Even without the extra code to decompress this the best I could manage was only 2713 bytes compared to 2739 as the straight Huffman- and that was only with the top 4 words ("the","was","of","and"). Anything more and the total size of the compression started to get bigger....

Interesting experiment nonetheless.
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

Here's my go at a dictionary-based uncompressor, in BeebEm:

Code: Select all

\  DICTIONARY-BASED TEXT UNCOMPRESSION ROUTINE
\  JULIE KIRSTY LOUISE MONTOYA
\  PUBLIC DOMAIN / NO RIGHTS RESERVED

\  VARIABLES

comp_text_ptr=&70   \  pointer to text to uncompress
dict_ptr=&72        \  pointer to dictionary word
spc_hi=&A0          \  space, with high bit set
CR=&D
osasci=&FFE3

\  THE DICTIONARY

\  THIS IS A VERY SIMPLE DICTIONARY JUST TO SHOW HOW IT WORKS.
\  ALL THIS WILL BE COMPUTER-GENERATED FROM THE INPUT TEXT IN REAL LIFE!

ORG &7000

.dictionary
\  A table of offsets to the beginning of each word in the dictionary.
EQUW 0
EQUW 2
EQUW 4
EQUW 6
EQUW 8

ORG dictionary+&100

.dict_words
\  The actual dictionary words. The last character of each word has
\  its high bit set.
EQUS"l":EQUS'l'OR&80
EQUS"i":EQUS's'OR&80
EQUS"s":EQUS's'OR&80
EQUS"e":EQUS'd'OR&80
EQUS"wibbl":EQUS'e'OR&80

.dict_end

\  THE UNCOMPRESSION CODE

ORG &7800

._begin

\  Print text starting at [comp_text_ptr] until &00 is reached
\  Values &80-&FF are expanded by replacing them with dictionary words

.uncompress
    LDY #0
.uncomp_loop
    LDA (comp_text_ptr),Y
    BMI expand_token
    BEQ uncomp_done
    JSR osasci
.uncomp_next
    INY
    BNE uncomp_loop
    INC comp_text_ptr+1
    BNE uncomp_loop
    \  NB; we are not expecting to fall through here
    
.uncomp_done
    RTS
    
\  Expand a token from the dictionary

.expand_token
    ASL A   \  now C=1
    TAX
    \  Advance the compressed text pointer so we can reset Y
    \  No need to clear the carry, because we really want the
    \  address of the *next* character anyway.
    TYA
    ADC comp_text_ptr
    STA comp_text_ptr
    BCC get_word_base		\  check if or not we need to ...
    INC comp_text_ptr+1     \  ... increase the high byte
    CLC
.get_word_base
    LDA dictionary,X
    ADC #dict_words MOD256
    STA dict_ptr
    LDA dictionary+1,X
    ADC #dict_words DIV256
    STA dict_ptr+1
    LDY #0
.display_word
    LDA (dict_ptr),Y
    BPL _disp_word1
    AND #&7F                \  Clear high bit
    JSR osasci
    LDY #0
    BEQ uncomp_loop         \  Back to characters again
._disp_word1                \  Not the last character
    JSR osasci
	INY
	BNE display_word
    \  NB; we are not expecting to fall through here
._rts
    RTS

ALIGN &100    

\  Messages contain a mix of ASCII characters and tokens, which will
\  be expanded automatically, and end with CR and CHR$(0).

.message

EQUS "He":EQUB &80:EQUS "o World!"
          \ &80 => ll
EQUB CR
BRK

.message1
EQUS "Th":EQUB &81:EQUS " ":EQUB &81:EQUS " compre": EQUB &82:EQUB &83
          \ is              \ is                     \ ss     \ ed
EQUB CR
BRK

.message2
EQUB &84 \ &84 => wibble
EQUB CR
BRK

._end

SAVE "DICT", dictionary, dict_end
SAVE "UNCMP0", _begin, _end, _rts	
I've stuck an extremely simple dictionary in there just to make it work, with just "ll", "is", "ss", "ed" and "wibble"; but the real dictionary would have to be generated by host-side scripts. The dictionary can hold up to 128 entries in this version, each of which can be any length up to 256 characters and must have bit 7 of its last character set; then, characters &80-&FF in the message will be replaced by the corresponding dictionary entry. Messages end with BRK. Place the start address of the message you want to uncompress in comp_text_ptr and comp_text_ptr+1 and call uncompress to display it.

This version doesn't do anything to stop words from going past the end of a line, but we can solve one problem at a time :)
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

And as soon as I had that, I came up with a couple of improvements. Sometimes I just have to do something not quite right in order to see a way to do it better .....

First I noticed a gap where 32 of the possible "end of dictionary word" marker characters did not map onto anything printable. And I figured the most likely thing you might want at the end of a word is a lower-case letter and a trailing space. So I made the ending codes &80-&9F map onto codes &60-&7F and print a trailing space after the character.

Then I noticed another gap in that the non-printing characters &00-&1F never occur within a word. So I found a way to map these others wasted positions onto tokens &80-&FF, figuring even a limited range of digraphs and trigraphs within longer words is some saving. This has meant the token expansion routine is now properly re-entrant and, as a bonus, we can actually save two bytes of zero page.

I also took the liberty of mapping the characters {, |, } and ~ -- which display as 1/4, ||, 3/4 and ÷ in MODE 7 -- onto the (more useful, IMHO) sequences ", ", "- ", ". " and "? " (i.e. comma, minus sign, full stop and question mark all with trailing spaces). The 20 bytes this adds will easily be recouped from instances of those character pairs in the compressed text.

Code: Select all

\  DICTIONARY-BASED TEXT UNCOMPRESSION ROUTINE

\  VARIABLES

text_ptr=&70    \  pointer to text to uncompress
spc_hi=&A0      \  space, with high bit set
CR=&D           \  carriage return
osasci=&FFE3    \  OS routine to display character, adding LF with CR

messages=&6000

\  THE DICTIONARY

\  THIS IS A VERY SIMPLE DICTIONARY JUST TO SHOW HOW IT WORKS.
\  ALL THIS WILL BE COMPUTER-GENERATED FROM THE INPUT TEXT IN REAL LIFE!

ORG &7000

.dictionary
\  A table of offsets to the beginning of each word in the dictionary.
EQUW 0
EQUW 2
EQUW 4
EQUW 6
EQUW 8
EQUW 10
EQUW 12
EQUW 14

ORG dictionary+&100

.dict_words
\  The actual dictionary words. The last character of each word is
\  indicated by setting its high bit.
\  Within a word, codes &00-&1F map onto tokens &80-&9F.
\  If the last character of a word is between &00-&1F, it will be
\  rendered as a lower case letter &60-&7F with a trailing space.
EQUS "l":EQUS 'l'OR&80
EQUS "i":EQUS 's'OR&80
EQUS "s":EQUS 's'OR&80
EQUS "e":EQUS 'd'OR&80
EQUS "o":EQUS 'o'OR&80
EQUS "d":EQUS 'o'OR&80
EQUS "o":EQUS 'r'OR&80
EQUB 5:EQUB 6:EQUS "wa":EQUS 'y'AND&1F OR&80
\    do     or                  \ trailing space
.dict_end

\  THE UNCOMPRESSION CODE

ORG &7800

._begin

.select_msg
    ASL A
    TAX
    CLC
    LDA msg_table,X
    ADC #messages MOD256
    STA text_ptr
    LDA msg_table+1,X
    ADC #messages DIV256
    STA text_ptr+1

\  Print text starting at [text_ptr] until &00 is reached
\  Values &80-&FF are expanded by replacing them with dictionary words

.uncompress
    LDY #0
.uncomp_loop
    LDA (text_ptr),Y
    BEQ uncomp_done
    BPL uncomp_not_token
    JSR expand_token
    TYA
    LDY #0
    SEC
    ADC text_ptr
    STA text_ptr
    BCC uncomp_loop		\  check if or not we need to ...
    INC text_ptr+1     \  ... increase the high byte
    BCS uncomp_loop        \  (INC did not affect carry)
.uncomp_not_token
    JSR newwrch
.uncomp_next
    INY
    BNE uncomp_loop
    INC text_ptr+1
    BNE uncomp_loop
    \  NB; we are not expecting to fall through here
    
.uncomp_done
    RTS

\  Expand a token from the dictionary

.expand_token
    ASL A   \  now C=1
    TAX    
    \  Stash Y and the old pointer
    TYA
    PHA
    LDA text_ptr+1
    PHA
    LDA text_ptr
    PHA
.get_word_base
    CLC
    LDA dictionary,X
    ADC #dict_words MOD256
    STA text_ptr
    LDA dictionary+1,X
    ADC #dict_words DIV256
    STA text_ptr+1
    LDY #0
.display_word
    LDA (text_ptr),Y
    BMI expand_done
._disp_word1                \  Not the last character
    CMP #32
    BCS disp_word2

\  Codes &00-&1F in middle of a word are expanded as
\  tokens &80-&9F.

    JSR expand_token
    JMP disp_word3

.disp_word2
    JSR newwrch
.disp_word3
	INY
	BNE display_word
    \  NB; we are not expecting to fall through here

\  Display last character of word and return.

.expand_done
    AND #&7F                \  Clear high bit
    CMP #32
    BCS expand_done1
    
\  Codes &00-&1F at end of word are displayed as a lower
\  case letter with a trailing space.
    
    ORA #&60
    JSR newwrch1
    JMP expand_done2
      
.expand_done1
    JSR newwrch
.expand_done2
    \  Retrieve old pointer and Y
    PLA
    STA text_ptr
    PLA
    STA text_ptr+1
    PLA
    TAY
._rts
    RTS

\  For an additional space saving, the characters { | } ~
\  (which would ordinarily display as 1/4 || 3/4 ÷ in MODE 7)
\  are mapped to display as , - . ? with a trailing space.

.newwrch
    CMP #&7B        \ opening posh bracket or 1/4
    BCC newwrch2
    SBC #79 \ we know C=1 because BCC didn't branch
    CMP #47         \ ~ would ordinarily give /
    BNE newwrch1
    LDA #63         \ change to ?

\  Display a character with a trailing space

.newwrch1
    JSR osasci
    LDA #32
.newwrch2
    JMP osasci

ALIGN &100    

.msg_table
EQUW msg0-messages
EQUW msg1-messages
EQUW msg2-messages
EQUW msg3-messages
EQUW msg4-messages
EQUW msg5-messages
EQUW msg6-messages
EQUW msg7-messages
EQUW msg8-messages
EQUW msg9-messages
EQUW msg10-messages
EQUW msg11-messages
EQUW msg12-messages
EQUW msg13-messages
EQUW msg14-messages

._code_end

ORG messages   

\  Messages contain a mix of ASCII characters and tokens, which will
\  be expanded automatically, and end with CR and CHR$(0).

.msg0
EQUS "You are in the front room.  Through a window to the North you can see out into the street.  A door leads South into another room."
EQUB CR
BRK
.msg1
EQUS "You are in the street outside number 32, which stands to your South.  The street runs East and West."
EQUB CR
BRK
.msg2
EQUS "You are in the back room.  There are exits to the North and South, and a staircase to an upper floor."
EQUB CR
BRK
.msg3
EQUS "You are in the kitchen, which an estate agent might describe as compact and bijou.  There is a room to the North and a garden to the West."
EQUB CR
BRK
.msg4
EQUS "You are in the back yard.  To the East is a rather pokey kitchen.  A door leads North into a narrow alleyway."
EQUB CR
BRK
.msg5
EQUS "You are on a landing at the top of the stairs.  To the North is the master bedroom.  A corridor leads South."
EQUB CR
BRK
.msg6
EQUS "You are in the Master bedroom.  A window to the North commands an imposing view of the houses on the other side of the street.  There is a door to the South."
EQUB CR
BRK
.msg7
EQUS "You are in a narrow corridor which runs from the front of the house (North) to the back.  There are doors South and West, and a hatch above you."
EQUB CR
BRK
.msg8
EQUS "You are in the back bedroom.  Through a window to the South you can see the back garden below.  There are doors East and North."
EQUB CR
BRK
.msg9
EQUS "You are in a little cupboard.  The door takes up most of the South wall."
EQUB CR
BRK
.msg10
EQUS "You are in the bathroom.  The bathtub stands to the West.  To the North, a step leads up to a door."
EQUB CR
BRK
.msg11
EQUS "You are standing under the shower.  The bathroom proper is to the East."
EQUB CR
BRK
.msg12
EQUS "You are in the attic.  There is plenty of standing room.  Below you is a corridor."
EQUB CR
BRK
.msg13
EQUS "You are in an alleyway which runs North-South alongside the house."
EQUB CR
BRK
.msg14
EQUS "You are in the street between numbers 30 and 32.  To your South is a large wooden door.  The street runs East and West."
EQUB CR
BRK
._msgs_end

._end

SAVE "DICT", dictionary, dict_end
SAVE "UNCMP1", _begin, _code_end, _rts	
SAVE "MSGS", messages, _msgs_end
I've also added some more dictionary entries, some text from an unfinished game I had lying around, and a "select message" function which expects an index into the message table in A. Note that all the text above is uncompressed! My next job is to generate a dictionary for it and see how much the saving is .....
User avatar
tricky
Posts: 5599
Joined: Tue Jun 21, 2011 9:25 am
Contact:

Re: Programming help required re: Huffman coding

Post by tricky »

I'm not sure what the name of the algorithm is that I use as I just made it up and then someone told me what it was from my description.
I have a different compression scheme that I use for the text of my menus, which works per title and publisher. It is a kind of recursive dictionary system and I think gets about 2:1 Inc the dictionary for my games menu.
User avatar
ChrisB
Posts: 145
Joined: Wed Oct 05, 2011 10:37 pm
Location: Surrey
Contact:

Re: Programming help required re: Huffman coding

Post by ChrisB »

Julie - I expect that text will compress very well if you can build the right dictionary. There are only about 120 individual words in total but I'd imagine encoding - for example - "You are in " as a word might help significantly.

For reference my Huffman gives 863 + 154 + 94 = 1111 bytes to encode the 1637 you have here.
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

Yes, it's come out quite nicely already, just tokenising entire words; and it will even go a little bit smaller once I get the next of my compression scripts up and running (notice it's not making any use of tokens &80-&9F yet). I added another hard-coded shortcut, once I saw my word table and realised something; so now, @ will display as "a ".

Code: Select all

\  DICTIONARY-BASED TEXT UNCOMPRESSION ROUTINE

\  VARIABLES

text_ptr=&70    \  pointer to text to uncompress
spc_hi=&A0      \  space, with high bit set
CR=&D           \  carriage return
osasci=&FFE3    \  OS routine to display character, adding LF with CR

messages=&6000

\  THE DICTIONARY

\  THIS IS A VERY SIMPLE DICTIONARY JUST TO SHOW HOW IT WORKS.
\  ALL THIS WILL BE COMPUTER-GENERATED FROM THE INPUT TEXT IN REAL LIFE!

ORG &7000

.dictionary
\  A table of offsets to the beginning of each word in the dictionary.
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW 0
    EQUW word160-dict_words
    EQUW word161-dict_words
    EQUW word162-dict_words
    EQUW word163-dict_words
    EQUW word164-dict_words
    EQUW word165-dict_words
    EQUW word166-dict_words
    EQUW word167-dict_words
    EQUW word168-dict_words
    EQUW word169-dict_words
    EQUW word170-dict_words
    EQUW word171-dict_words
    EQUW word172-dict_words
    EQUW word173-dict_words
    EQUW word174-dict_words
    EQUW word175-dict_words
    EQUW word176-dict_words
    EQUW word177-dict_words
    EQUW word178-dict_words
    EQUW word179-dict_words
    EQUW word180-dict_words
    EQUW word181-dict_words
    EQUW word182-dict_words
    EQUW word183-dict_words
    EQUW word184-dict_words
    EQUW word185-dict_words
    EQUW word186-dict_words
    EQUW word187-dict_words
    EQUW word188-dict_words
    EQUW word189-dict_words
    EQUW word190-dict_words
    EQUW word191-dict_words
    EQUW word192-dict_words
    EQUW word193-dict_words
    EQUW word194-dict_words
    EQUW word195-dict_words
    EQUW word196-dict_words
    EQUW word197-dict_words
    EQUW word198-dict_words
    EQUW word199-dict_words
    EQUW word200-dict_words
    EQUW word201-dict_words
    EQUW word202-dict_words
    EQUW word203-dict_words
    EQUW word204-dict_words
    EQUW word205-dict_words
    EQUW word206-dict_words

ORG dictionary+&100

.dict_words
\  The actual dictionary words. The last character of each word is
\  indicated by setting its high bit.
\  Within a word, codes &00-&1F map onto tokens &80-&9F.
\  If the last character of a word is between &00-&1F, it will be
\  rendered as a lower case letter &60-&7F with a trailing space.
.word128
.word160
    EQUS "bathroo":EQUB &ED
.word161
    EQUS "standin":EQUB &87
.word162
    EQUS "corrido":EQUB &F2
.word163
    EQUS "alleywa":EQUB &F9
.word164
    EQUS "Throug":EQUB &88
.word165
    EQUS "bedroo":EQUB &ED
.word166
    EQUS "kitche":EQUB &EE
.word167
    EQUS "narro":EQUB &97
.word168
    EQUS "windo":EQUB &97
.word169
    EQUS "stree":EQUB &F4
.word170
    EQUS "numbe":EQUB &F2
.word171
    EQUS "garde":EQUB &8E
.word172
    EQUS "stand":EQUB &93
.word173
    EQUS "door":EQUB &93
.word174
    EQUS "hous":EQUB &E5
.word175
    EQUS "Sout":EQUB &E8
.word176
    EQUS "fron":EQUB &94
.word177
    EQUS "Ther":EQUB &85
.word178
    EQUS "whic":EQUB &88
.word179
    EQUS "othe":EQUB &92
.word180
    EQUS "Nort":EQUB &E8
.word181
    EQUS "lead":EQUB &93
.word182
    EQUS "doo":EQUB &F2
.word183
    EQUS "bac":EQUB &EB
.word184
    EQUS "int":EQUB &8F
.word185
    EQUS "run":EQUB &93
.word186
    EQUS "sid":EQUB &85
.word187
    EQUS "Wes":EQUB &F4
.word188
    EQUS "Eas":EQUB &F4
.word189
    EQUS "roo":EQUB &ED
.word190
    EQUS "you":EQUB &92
.word191
    EQUS "yo":EQUB &F5
.word192
    EQUS "an":EQUB &E4
.word193
    EQUS "ou":EQUB &F4
.word194
    EQUS "ar":EQUB &85
.word195
    EQUS "Yo":EQUB &95
.word196
    EQUS "th":EQUB &E5
.word197
    EQUS "Th":EQUB &E5
.word198
    EQUS "a":EQUB &F4
.word199
    EQUS "a":EQUB &F3
.word200
    EQUS "a":EQUB &EE
.word201
    EQUS "T":EQUB &8F
.word202
    EQUS "i":EQUB &EE
.word203
    EQUS "t":EQUB &EF
.word204
    EQUS "o":EQUB &EE
.word205
    EQUS "o":EQUB &86
.word206
    EQUS "i":EQUB &93
.word207
.dict_end

\  THE UNCOMPRESSION CODE

ORG &7800

._begin

.select_msg
    ASL A
    TAX
    CLC
    LDA msg_table,X
    ADC #messages MOD256
    STA text_ptr
    LDA msg_table+1,X
    ADC #messages DIV256
    STA text_ptr+1

\  Print text starting at [text_ptr] until &00 is reached
\  Values &80-&FF are expanded by replacing them with dictionary words

.uncompress
    LDY #0
.uncomp_loop
    LDA (text_ptr),Y
    BEQ uncomp_done
    BPL uncomp_not_token
    JSR expand_token
    TYA
    LDY #0
    SEC
    ADC text_ptr
    STA text_ptr
    BCC uncomp_loop		\  check if or not we need to ...
    INC text_ptr+1     \  ... increase the high byte
    BCS uncomp_loop        \  (INC did not affect carry)
.uncomp_not_token
    JSR newwrch
.uncomp_next
    INY
    BNE uncomp_loop
    INC text_ptr+1
    BNE uncomp_loop
    \  NB; we are not expecting to fall through here
    
.uncomp_done
    RTS

\  Expand a token from the dictionary

.expand_token
    ASL A   \  now C=1
    TAX    
    \  Stash Y and the old pointer
    TYA
    PHA
    LDA text_ptr+1
    PHA
    LDA text_ptr
    PHA
.get_word_base
    CLC
    LDA dictionary,X
    ADC #dict_words MOD256
    STA text_ptr
    LDA dictionary+1,X
    ADC #dict_words DIV256
    STA text_ptr+1
    LDY #0
.display_word
    LDA (text_ptr),Y
    BMI expand_done
._disp_word1                \  Not the last character
    CMP #32
    BCS disp_word2

\  Codes &00-&1F in middle of a word are expanded as
\  tokens &80-&9F.

    JSR expand_token
    JMP disp_word3

.disp_word2
    JSR newwrch
.disp_word3
	INY
	BNE display_word
    \  NB; we are not expecting to fall through here

\  Display last character of word and return.

.expand_done
    AND #&7F                \  Clear high bit
    CMP #32
    BCS expand_done1
    
\  Codes &00-&1F at end of word are displayed as a lower
\  case letter with a trailing space.
    
    ORA #&60
    JSR newwrch1
    JMP expand_done2
      
.expand_done1
    JSR newwrch
.expand_done2
    \  Retrieve old pointer and Y
    PLA
    STA text_ptr
    PLA
    STA text_ptr+1
    PLA
    TAY
._rts
    RTS

\  For an additional space saving, the characters { | } ~
\  (which would ordinarily display as 1/4 || 3/4 ÷ in MODE 7)
\  are mapped to display as , - . ? with a trailing space.

.newwrch
    CMP #64
    BNE newwrch0
    LDA #97
    BNE newwrch1
.newwrch0
    CMP #&7B        \ opening posh bracket or 1/4
    BCC newwrch2
    SBC #79 \ we know C=1 because BCC didn't branch
    CMP #47         \ ~ would ordinarily give /
    BNE newwrch1
    LDA #63         \ change to ?

\  Display a character with a trailing space

.newwrch1
    JSR osasci
    LDA #32
.newwrch2
    JMP osasci

ALIGN &100    

.msg_table
    EQUW msg0-msg0
    EQUW msg1-msg0
    EQUW msg2-msg0
    EQUW msg3-msg0
    EQUW msg4-msg0
    EQUW msg5-msg0
    EQUW msg6-msg0
    EQUW msg7-msg0
    EQUW msg8-msg0
    EQUW msg9-msg0
    EQUW msg10-msg0
    EQUW msg11-msg0
    EQUW msg12-msg0
    EQUW msg13-msg0
    EQUW msg14-msg0

._code_end

ORG messages   

\  Messages contain a mix of ASCII characters and tokens, which will
\  be expanded automatically, and end with CR and CHR$(0).

.msg0
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &B0:EQUB &BD:EQUS "} "
    EQUB &A4:EQUS "@":EQUB &A8:EQUB &CB:EQUS " ":EQUB &C4:EQUS " ":EQUB &B4:EQUS " "
    EQUB &BF:EQUS " c":EQUB &C8:EQUS " see ":EQUB &C1:EQUS " ":EQUB &B8:EQUB &C4
    EQUS " ":EQUB &A9:EQUS "} A ":EQUB &B6:EQUS " ":EQUB &B5:EQUB &AF:EQUS " "
    EQUB &B8:EQUB &C8:EQUB &B3:EQUB &BD:EQUS "."
    EQUB CR
    BRK
.msg1
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &A9:EQUS " ":EQUB &C1
    EQUB &BA:EQUB &AA:EQUS " 32{":EQUB &B2:EQUB &AC:EQUB &CB:EQUS " ":EQUB &BE
    EQUB &AF:EQUS "} ":EQUB &C5:EQUS " ":EQUB &A9:EQUS " ":EQUB &B9:EQUB &BC:EQUS " "
    EQUB &C0:EQUS " ":EQUB &BB:EQUS "."
    EQUB CR
    BRK
.msg2
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &B7:EQUS " ":EQUB &BD
    EQUS "} ":EQUB &B1:EQUB &C2:EQUS "exits ":EQUB &CB:EQUS " ":EQUB &C4:EQUS " "
    EQUB &B4:EQUS " ":EQUB &C0:EQUS " ":EQUB &AF:EQUS "{":EQUB &C0:EQUS " @stairc"
    EQUB &C7:EQUS "e ":EQUB &CB:EQUS " ":EQUB &C8:EQUS " upper floor."
    EQUB CR
    BRK
.msg3
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &A6:EQUS "{":EQUB &B2
    EQUB &C8:EQUS " est":EQUB &C6:EQUS "e agent might describe ":EQUB &C7
    EQUS " compact ":EQUB &C0:EQUS " bijou} ":EQUB &B1:EQUB &CE:EQUS "@":EQUB &BD
    EQUS " ":EQUB &CB:EQUS " ":EQUB &C4:EQUS " ":EQUB &B4:EQUS " ":EQUB &C0:EQUS " @"
    EQUB &AB:EQUB &CB:EQUS " ":EQUB &C4:EQUS " ":EQUB &BB:EQUS "."
    EQUB CR
    BRK
.msg4
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &B7:EQUS " yard} "
    EQUB &C9:EQUB &C4:EQUS " ":EQUB &BC:EQUS " ":EQUB &CE:EQUS "@ra":EQUB &C4
    EQUS "r pokey ":EQUB &A6:EQUS "} A ":EQUB &B6:EQUS " ":EQUB &B5:EQUB &B4
    EQUS " ":EQUB &B8:EQUS "@":EQUB &A7:EQUB &A3:EQUS "."
    EQUB CR
    BRK
.msg5
    EQUB &C3:EQUB &C2:EQUB &CC:EQUS " @l":EQUB &C0:EQUB &CA:EQUS "g ":EQUB &C6
    EQUS " ":EQUB &C4:EQUS " ":EQUB &CB:EQUS "p ":EQUB &CD:EQUB &C4:EQUS " stairs} "
    EQUB &C9:EQUB &C4:EQUS " ":EQUB &B4:EQUS " ":EQUB &CE:EQUB &C4:EQUS " m":EQUB &C7
    EQUS "ter ":EQUB &A5:EQUS "} A ":EQUB &A2:EQUS " ":EQUB &B5:EQUB &AF:EQUS "."
    EQUB CR
    BRK
.msg6
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " M":EQUB &C7:EQUS "ter "
    EQUB &A5:EQUS "} A ":EQUB &A8:EQUB &CB:EQUS " ":EQUB &C4:EQUS " ":EQUB &B4
    EQUS " comm":EQUB &C0:EQUS "s ":EQUB &C8:EQUS " impos":EQUB &CA:EQUS "g view "
    EQUB &CD:EQUB &C4:EQUS " ":EQUB &AE:EQUS "s ":EQUB &CC:EQUS " ":EQUB &C4
    EQUS " ":EQUB &B3:EQUB &BA:EQUB &CD:EQUB &C4:EQUS " ":EQUB &A9:EQUS "} "
    EQUB &B1:EQUB &CE:EQUS "@":EQUB &B6:EQUS " ":EQUB &CB:EQUS " ":EQUB &C4
    EQUS " ":EQUB &AF:EQUS "."
    EQUB CR
    BRK
.msg7
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " @":EQUB &A7:EQUB &A2:EQUS " ":EQUB &B2:EQUB &B9
    EQUS "from ":EQUB &C4:EQUS " ":EQUB &B0:EQUB &CD:EQUB &C4:EQUS " ":EQUB &AE
    EQUS " (":EQUB &B4:EQUS ") ":EQUB &CB:EQUS " ":EQUB &C4:EQUS " ":EQUB &B7
    EQUS "} ":EQUB &B1:EQUB &C2:EQUB &AD:EQUB &AF:EQUS " ":EQUB &C0:EQUS " ":EQUB &BB
    EQUS "{":EQUB &C0:EQUS " @h":EQUB &C6:EQUS "ch above ":EQUB &BF:EQUS "."
    EQUB CR
    BRK
.msg8
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &B7:EQUS " ":EQUB &A5
    EQUS "} ":EQUB &A4:EQUS "@":EQUB &A8:EQUB &CB:EQUS " ":EQUB &C4:EQUS " ":EQUB &AF
    EQUS " ":EQUB &BF:EQUS " c":EQUB &C8:EQUS " see ":EQUB &C4:EQUS " ":EQUB &B7
    EQUS " ":EQUB &AB:EQUS "below} ":EQUB &B1:EQUB &C2:EQUB &AD:EQUB &BC:EQUS " "
    EQUB &C0:EQUS " ":EQUB &B4:EQUS "."
    EQUB CR
    BRK
.msg9
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " @little cupboard} ":EQUB &C5:EQUS " ":EQUB &B6
    EQUS " takes up most ":EQUB &CD:EQUB &C4:EQUS " ":EQUB &AF:EQUS " wall."
    EQUB CR
    BRK
.msg10
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &A0:EQUS "} ":EQUB &C5
    EQUS " b":EQUB &C6:EQUS "htub ":EQUB &AC:EQUB &CB:EQUS " ":EQUB &C4:EQUS " "
    EQUB &BB:EQUS "} ":EQUB &C9:EQUB &C4:EQUS " ":EQUB &B4:EQUS "{@step ":EQUB &B5
    EQUS "up ":EQUB &CB:EQUS " @":EQUB &B6:EQUS "."
    EQUB CR
    BRK
.msg11
    EQUB &C3:EQUB &C2:EQUB &A1:EQUS "under ":EQUB &C4:EQUS " shower} ":EQUB &C5
    EQUS " ":EQUB &A0:EQUS " proper ":EQUB &CE:EQUB &CB:EQUS " ":EQUB &C4:EQUS " "
    EQUB &BC:EQUS "."
    EQUB CR
    BRK
.msg12
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &C6:EQUS "tic} "
    EQUB &B1:EQUB &CE:EQUS "plenty ":EQUB &CD:EQUB &A1:EQUB &BD:EQUS "} Below "
    EQUB &BF:EQUS " ":EQUB &CE:EQUS "@":EQUB &A2:EQUS "."
    EQUB CR
    BRK
.msg13
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C8:EQUS " ":EQUB &A3:EQUS " ":EQUB &B2
    EQUB &B9:EQUB &B4:EQUS "-":EQUB &AF:EQUS " al":EQUB &CC:EQUS "g":EQUB &BA:EQUB &C4
    EQUS " ":EQUB &AE:EQUS "."
    EQUB CR
    BRK
.msg14
    EQUB &C3:EQUB &C2:EQUB &CA:EQUS " ":EQUB &C4:EQUS " ":EQUB &A9:EQUS " between "
    EQUB &AA:EQUS "s 30 ":EQUB &C0:EQUS " 32} ":EQUB &C9:EQUB &BE:EQUB &AF:EQUS " "
    EQUB &CE:EQUS "@large wooden ":EQUB &B6:EQUS "} ":EQUB &C5:EQUS " ":EQUB &A9
    EQUS " ":EQUB &B9:EQUB &BC:EQUS " ":EQUB &C0:EQUS " ":EQUB &BB:EQUS "."
    EQUB CR
    BRK
._msgs_end

._end

SAVE "DICT", dictionary, dict_end
SAVE "UNCMP1", _begin, _code_end, _rts	
SAVE "MSGS", messages, _msgs_end
The dictionary occupies &1D1 = 465 bytes, and the messages occupy &311 = 785 bytes. So that's 1250, plus another 28 for the message index table. That ought to shrink a little bit more once I have pulled out some di- and trigraphs, and made a few more tweaks ..... Then I'll post the dictionary build scripts.
User avatar
ChrisB
Posts: 145
Joined: Wed Oct 05, 2011 10:37 pm
Location: Surrey
Contact:

Re: Programming help required re: Huffman coding

Post by ChrisB »

I'd be interested to see the dictionary building scripts. Slightly surprised by the numbers you're getting - I was expecting it to be smaller...
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

Yeah, I'm still working on the scripts myself -- I keep seeing ways in which I can do better, like looking for shared prefixes and suffixes.
User avatar
tricky
Posts: 5599
Joined: Tue Jun 21, 2011 9:25 am
Contact:

Re: Programming help required re: Huffman coding

Post by tricky »

I wanted to see how it compares to my GOTEK/MMC menu text compression, which gets:
data: 452 dict: 318 total: 770 + the start of string pointers and 8 bytes of workspace. (code is ~56 bytes).
If the whole text is compressed as one, so you would have to seek to the end of string, that reduces by 24 bytes.
I have also just noticed that I have CR+LF line endings, so am compressing a little extra, but it won't make much difference.
Again, I don't know the proper name for my compression scheme, but it is sort of a binary tree of nodes.

PS The code is actually 42 bytes if just OSWRCHing and the 8 bytes of workspace are in ZP (44 if not in ZP).
It is all byte aligned allowing it to be fast too ;)
Last edited by tricky on Thu Jun 10, 2021 11:16 am, edited 2 times in total.
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

Sorry for the delay -- I went down a bit of a rabbit hole trying out some tree-based ideas. And came up with this:

Code: Select all

\\  HUFFMAN TREE UNCOMPRESSION CODE

\  VARIABLES

stream_ptr = &70
stream_bit = &72
dest = &73
dest_bit = &75

codes_length = 129

stream = &7000

osasci = &FFE3


ORG &7800

._code_begin

.get_stream_bit
    JMP real_get_stream_bit
.reset_dest
    JMP real_reset_dest
.get_char
    JMP real_get_char
.disp_char
    JMP real_disp_char
.select_msg

\  SELECT AND DISPLAY A MESSAGE

.real_select_msg
    ASL A
    TAX
    CLC
    LDA msg_table,X
    ADC #stream MOD256
    STA stream_ptr
    LDA msg_table+1,X
    ADC #stream DIV256
    STA stream_ptr+1
    LDA #0
    STA stream_bit

.real_disp_msg
    JSR real_get_char
    JSR osasci
    CMP #32
    BCS real_disp_msg
.end_of_msg
    RTS

\  DISPLAY A CHARACTER FROM THE STREAM

.real_disp_char
    RTS

\  GET A CHARACTER

.real_get_char

    JSR real_reset_dest
.gc1
    JSR real_get_stream_bit
.gc2
    LDX #0
.gc_test1
    LDA valid_codes,X
    CMP dest
    BNE gc_no_match
    LDA valid_codes+1,X
    CMP dest+1
    BEQ gc_match
.gc_no_match
   INX
   INX
   INX
   CPX #codes_length
   BCC gc_test1
   BCS gc1

.gc_match
    LDA valid_codes+2,X        
    \  return with C=0
    CLC
    RTS

\  GET A BIT FROM THE STREAM

.real_get_stream_bit
    TYA
    PHA
    LDY #0
    LDA (stream_ptr),Y
    LDX stream_bit
    AND bits_LR,X
    TAY
    INX
    STX stream_bit
    CPX #8
    BCC gsb1
    LDX #0
    STX stream_bit
    INC stream_ptr
    BNE gsb1
    INC stream_ptr+1

.gsb1
    LDX dest_bit
    TYA
    BEQ gsb3
    \ See which byte we need to affect
    CPX #8
    BCC gsb2_hi
    
.gsb2_lo
    \  Set the relevant bit in dest
    LDA dest
    ORA bits_LR-8,X
    STA dest
    BCS gsb3 \ always branches
    
.gsb2_hi
    \  Set the relevant bit in dest+1
    LDA dest+1
    ORA bits_LR,X
    STA dest+1

.gsb3
    INX
    CPX #16
    BCC gsb4
    LDX #0
    
.gsb4
    STX dest_bit
    LDA dest
    AND #&F0
    ORA dest_bit
    STA dest
    PLA
    TAY

._rts
    RTS
    
\  RESET THE DESTINATION BYTE AND BIT

.real_reset_dest
    LDA #0
    STA dest
    STA dest+1
    STA dest_bit
    RTS

\  BIT POSITIONS

.bits_LR    \ bits in order left to right
	EQUB &80
	EQUB &40
	EQUB &20
	EQUB &10
	EQUB &08
	EQUB &04
	EQUB &02
.bits_RL    \ bits in order right to left
	EQUB &01
	EQUB &02
	EQUB &04
	EQUB &08
	EQUB &10
	EQUB &20
	EQUB &40
	EQUB &80

\  MESSAGE TABLE

.msg_table
    EQUW msg0-msg0
    EQUW msg1-msg0
    EQUW msg2-msg0
    EQUW msg3-msg0
    EQUW msg4-msg0
    EQUW msg5-msg0
    EQUW msg6-msg0
    EQUW msg7-msg0
    EQUW msg8-msg0
    EQUW msg9-msg0
    EQUW msg10-msg0
    EQUW msg11-msg0
    EQUW msg12-msg0
    EQUW msg13-msg0
    EQUW msg14-msg0
._mtable_end

ALIGN &100

.valid_codes
    EQUW &C002  \  1100000000000010
    EQUB &20    \  &20
    EQUW &8004  \  1000000000000100
    EQUB &68    \  h
    EQUW &6004  \  0110000000000100
    EQUB &61    \  a
    EQUW &5004  \  0101000000000100
    EQUB &72    \  r
    EQUW &3004  \  0011000000000100
    EQUB &74    \  t
    EQUW &1004  \  0001000000000100
    EQUB &65    \  e
    EQUW &0004  \  0000000000000100
    EQUB &6F    \  o
    EQUW &B005  \  1011000000000101
    EQUB &2E    \  .
    EQUW &9005  \  1001000000000101
    EQUB &64    \  d
    EQUW &7005  \  0111000000000101
    EQUB &75    \  u
    EQUW &4805  \  0100100000000101
    EQUB &69    \  i
    EQUW &2805  \  0010100000000101
    EQUB &73    \  s
    EQUW &2005  \  0010000000000101
    EQUB &6E    \  n
    EQUW &BC06  \  1011110000000110
    EQUB &54    \  T
    EQUW &AC06  \  1010110000000110
    EQUB &6C    \  l
    EQUW &A806  \  1010100000000110
    EQUB &6D    \  m
    EQUW &A406  \  1010010000000110
    EQUB &62    \  b
    EQUW &9806  \  1001100000000110
    EQUB &77    \  w
    EQUW &7C06  \  0111110000000110
    EQUB &63    \  c
    EQUW &BA07  \  1011101000000111
    EQUB &6B    \  k
    EQUW &A207  \  1010001000000111
    EQUB &66    \  f
    EQUW &A007  \  1010000000000111
    EQUB &53    \  S
    EQUW &9E07  \  1001111000000111
    EQUB &4E    \  N
    EQUW &7A07  \  0111101000000111
    EQUB &67    \  g
    EQUW &7807  \  0111100000000111
    EQUB &79    \  y
    EQUW &4607  \  0100011000000111
    EQUB &70    \  p
    EQUW &4407  \  0100010000000111
    EQUB &59    \  Y
    EQUW &4207  \  0100001000000111
    EQUB &0D    \  newline
    EQUW &B808  \  1011100000001000
    EQUB &57    \  W
    EQUW &9D08  \  1001110100001000
    EQUB &45    \  E
    EQUW &9C08  \  1001110000001000
    EQUB &2C    \  ,
    EQUW &4189  \  0100000110001001
    EQUB &33    \  3
    EQUW &4109  \  0100000100001001
    EQUB &41    \  A
    EQUW &B9CA  \  1011100111001010
    EQUB &29    \  )
    EQUW &B98A  \  1011100110001010
    EQUB &28    \  (
    EQUW &B94A  \  1011100101001010
    EQUB &42    \  B
    EQUW &B90A  \  1011100100001010
    EQUB &6A    \  j
    EQUW &40CA  \  0100000011001010
    EQUB &76    \  v
    EQUW &408A  \  0100000010001010
    EQUB &32    \  2
    EQUW &406B  \  0100000001101011
    EQUB &78    \  x
    EQUW &404B  \  0100000001001011
    EQUB &30    \  0
    EQUW &402B  \  0100000000101011
    EQUB &2D    \  -
    EQUW &400B  \  0100000000001011
    EQUB &4D    \  M

._code_end

ORG stream

.msg0
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &E8
    EQUB &A8:EQUB &10:EQUB &F5:EQUB &00:EQUB &AA:EQUB &DF:EQUB &7C:EQUB &28
    EQUB &39:EQUB &EC:EQUB &6D:EQUB &CC:EQUB &92:EQUB &48:EQUB &26:EQUB &CC
    EQUB &33:EQUB &81:EQUB &E7:EQUB &82:EQUB &9C:EQUB &6F:EQUB &01:EQUB &DB
    EQUB &EC:EQUB &4C:EQUB &A2:EQUB &38:EQUB &38:EQUB &F4:EQUB &90:EQUB &C3
    EQUB &38:EQUB &1C:EQUB &A6:EQUB &A2:EQUB &27:EQUB &6F:EQUB &41:EQUB &72
    EQUB &00:EQUB &5E:EQUB &B1:EQUB &69:EQUB &17:EQUB &A0:EQUB &0E:EQUB &38
    EQUB &D2:EQUB &43:EQUB &0D:EQUB &88:EQUB &07:EQUB &02:EQUB &BA:EQUB &80
    EQUB &55:EQUB &64:EQUB &20
.msg1
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &CA
    EQUB &6A:EQUB &22:EQUB &78:EQUB &38:EQUB &CA:EQUB &99:EQUB &0E:EQUB &47
    EQUB &55:EQUB &48:EQUB &AE:EQUB &83:EQUB &40:EQUB &A7:EQUB &39:EQUB &A1
    EQUB &2F:EQUB &C6:EQUB &53:EQUB &62:EQUB &48:EQUB &B9:EQUB &86:EQUB &F0
    EQUB &1C:EQUB &BD:EQUB &00:EQUB &71:EQUB &C5:EQUB &BE:EQUB &F8:EQUB &1C
    EQUB &A6:EQUB &A2:EQUB &27:EQUB &AB:EQUB &88:EQUB &5E:EQUB &75:EQUB &8A
    EQUB &7B:EQUB &12:EQUB &5D:EQUB &C0:EQUB &94:EQUB &EC:EQUB &84
.msg2
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &E9
    EQUB &67:EQUB &EE:EQUB &EA:EQUB &01:EQUB &55:EQUB &BE:EQUB &F8:EQUB &15
    EQUB &1D:EQUB &94:EQUB &71:EQUB &40:EQUB &69:EQUB &32:EQUB &E6:EQUB &19
    EQUB &C0:EQUB &F3:EQUB &C1:EQUB &4E:EQUB &36:EQUB &24:EQUB &BA:EQUB &00
    EQUB &E3:EQUB &89:EQUB &CD:EQUB &89:EQUB &2D:EQUB &B2:EQUB &9B:EQUB &25
    EQUB &5F:EQUB &62:EQUB &8E:EQUB &61:EQUB &B1:EQUB &37:EQUB &23:EQUB &46
    EQUB &2B:EQUB &D1:EQUB &AC:EQUB &01:EQUB &6C:EQUB &84
.msg3
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &EE
    EQUB &A4:EQUB &DF:EQUB &81:EQUB &24:EQUB &E7:EQUB &34:EQUB &25:EQUB &F8
    EQUB &D8:EQUB &98:EQUB &94:EQUB &D8:EQUB &C7:EQUB &67:EQUB &A2:EQUB &43
    EQUB &EA:EQUB &4B:EQUB &D8:EQUB &3E:EQUB &42:EQUB &57:EQUB &D5:EQUB &34
    EQUB &8E:EQUB &C5:EQUB &DF:EQUB &0A:EQUB &91:EQUB &B3:EQUB &E7:EQUB &B1
    EQUB &25:EQUB &D2:EQUB &9B:EQUB &90:EQUB &1D:EQUB &6F:EQUB &BE:EQUB &05
    EQUB &47:EQUB &49:EQUB &76:EQUB &D4:EQUB &02:EQUB &AC:EQUB &C3:EQUB &38
    EQUB &1E:EQUB &78:EQUB &29:EQUB &C6:EQUB &C4:EQUB &96:EQUB &DB:EQUB &D6
    EQUB &59:EQUB &09:EQUB &33:EQUB &0C:EQUB &E0:EQUB &7B:EQUB &81:EQUB &29
    EQUB &D9:EQUB &08
.msg4
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &E9
    EQUB &67:EQUB &EE:EQUB &EF:EQUB &19:EQUB &65:EQUB &6F:EQUB &BC:EQUB &33
    EQUB &81:EQUB &E7:EQUB &58:EQUB &A7:EQUB &A4:EQUB &BB:EQUB &6A:EQUB &C7
    EQUB &02:EQUB &BA:EQUB &30:EQUB &BA:EQUB &2F:EQUB &3B:EQUB &A9:EQUB &37
    EQUB &E0:EQUB &49:EQUB &6F:EQUB &41:EQUB &72:EQUB &00:EQUB &5E:EQUB &B1
    EQUB &69:EQUB &17:EQUB &9E:EQUB &0A:EQUB &71:EQUB &A4:EQUB &86:EQUB &1B
    EQUB &64:EQUB &65:EQUB &50:EQUB &9B:EQUB &6A:EQUB &EB:EQUB &17:EQUB &93
    EQUB &33:EQUB &CB:EQUB &21
.msg5
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &02:EQUB &6D:EQUB &D6:EQUB &C4
    EQUB &92:EQUB &48:EQUB &F7:EQUB &63:EQUB &CE:EQUB &07:EQUB &30:EQUB &47
    EQUB &85:EQUB &1C:EQUB &E0:EQUB &72:EQUB &9B:EQUB &25:EQUB &4B:EQUB &6F
    EQUB &BC:EQUB &33:EQUB &81:EQUB &E7:EQUB &82:EQUB &9C:EQUB &69:EQUB &2E
    EQUB &70:EQUB &3D:EQUB &4C:EQUB &53:EQUB &15:EQUB &E9:EQUB &19:EQUB &28
    EQUB &05:EQUB &56:EQUB &F4:EQUB &16:EQUB &F8:EQUB &2A:EQUB &A6:EQUB &40
    EQUB &BD:EQUB &62:EQUB &D2:EQUB &2F:EQUB &40:EQUB &1C:EQUB &71:EQUB &64
    EQUB &20
.msg6
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &D0
    EQUB &03:EQUB &14:EQUB &C5:EQUB &7A:EQUB &46:EQUB &4A:EQUB &01:EQUB &55
    EQUB &BD:EQUB &05:EQUB &CC:EQUB &92:EQUB &48:EQUB &26:EQUB &CC:EQUB &33
    EQUB &81:EQUB &E7:EQUB &82:EQUB &9C:EQUB &6F:EQUB &85:EQUB &55:EQUB &31
    EQUB &24:EQUB &5D:EQUB &89:EQUB &A6:EQUB &A4:EQUB &60:EQUB &54:EQUB &91
    EQUB &EE:EQUB &81:EQUB &A4:EQUB &66:EQUB &C2:EQUB &8E:EQUB &70:EQUB &3C
    EQUB &03:EQUB &8A:EQUB &25:EQUB &C0:EQUB &99:EQUB &C0:EQUB &E0:EQUB &70
    EQUB &2B:EQUB &95:EQUB &32:EQUB &1C:EQUB &28:EQUB &E7:EQUB &03:EQUB &94
    EQUB &D4:EQUB &44:EQUB &ED:EQUB &F7:EQUB &C0:EQUB &A8:EQUB &E9:EQUB &2E
    EQUB &DC:EQUB &80:EQUB &17:EQUB &30:EQUB &CE:EQUB &07:EQUB &A0:EQUB &0E
    EQUB &38:EQUB &B2:EQUB &10
.msg7
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &36:EQUB &C8:EQUB &CA
    EQUB &A1:EQUB &36:EQUB &F8:EQUB &2A:EQUB &A6:EQUB &40:EQUB &BC:EQUB &D0
    EQUB &97:EQUB &E3:EQUB &57:EQUB &10:EQUB &BD:EQUB &15:EQUB &0A:EQUB &B3
    EQUB &81:EQUB &E8:EQUB &A8:EQUB &10:EQUB &F0:EQUB &A3:EQUB &9C:EQUB &0F
    EQUB &00:EQUB &E2:EQUB &8F:EQUB &73:EQUB &4F:EQUB &05:EQUB &38:EQUB &B9
    EQUB &F3:EQUB &0C:EQUB &E0:EQUB &7A:EQUB &59:EQUB &FB:EQUB &B6:EQUB &FB
    EQUB &E0:EQUB &54:EQUB &76:EQUB &51:EQUB &E4:EQUB &00:EQUB &A5:EQUB &E8
    EQUB &03:EQUB &8E:EQUB &36:EQUB &24:EQUB &BB:EQUB &81:EQUB &29:EQUB &CE
    EQUB &6C:EQUB &49:EQUB &6D:EQUB &C3:EQUB &1B:EQUB &F1:EQUB &B5:EQUB &20
    EQUB &81:EQUB &8E:EQUB &F0:EQUB &1D:EQUB &64:EQUB &20
.msg8
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &E9
    EQUB &67:EQUB &EE:EQUB &F4:EQUB &8C:EQUB &94:EQUB &02:EQUB &AB:EQUB &7D
    EQUB &F0:EQUB &A0:EQUB &E7:EQUB &B1:EQUB &B7:EQUB &32:EQUB &49:EQUB &20
    EQUB &9B:EQUB &30:EQUB &CE:EQUB &07:EQUB &A0:EQUB &0E:EQUB &38:EQUB &DE
    EQUB &03:EQUB &B7:EQUB &D8:EQUB &99:EQUB &44:EQUB &73:EQUB &81:EQUB &E9
    EQUB &67:EQUB &EE:EQUB &EF:EQUB &59:EQUB &64:EQUB &24:EQUB &E9:EQUB &1A
    EQUB &C2:EQUB &6B:EQUB &7D:EQUB &F0:EQUB &2A:EQUB &3B:EQUB &28:EQUB &F2
    EQUB &00:EQUB &52:EQUB &F3:EQUB &AC:EQUB &53:EQUB &D8:EQUB &92:EQUB &E7
    EQUB &82:EQUB &9C:EQUB &59:EQUB &08
.msg9
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &36:EQUB &EB:EQUB &49
    EQUB &9D:EQUB &63:EQUB &BE:EQUB &E4:EQUB &74:EQUB &83:EQUB &2C:EQUB &AD
    EQUB &F7:EQUB &C0:EQUB &F2:EQUB &00:EQUB &5C:EQUB &DA:EQUB &E8:EQUB &97
    EQUB &72:EQUB &3E:EQUB &A0:EQUB &29:EQUB &E1:EQUB &47:EQUB &38:EQUB &1E
    EQUB &80:EQUB &38:EQUB &E3:EQUB &99:EQUB &AB:EQUB &AE:EQUB &C8:EQUB &40
.msg10
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &E9
    EQUB &63:EQUB &85:EQUB &00:EQUB &AA:EQUB &DF:EQUB &7C:EQUB &0F:EQUB &4B
    EQUB &1C:EQUB &1B:EQUB &A9:EQUB &CA:EQUB &6C:EQUB &49:EQUB &17:EQUB &30
    EQUB &CE:EQUB &07:EQUB &B8:EQUB &12:EQUB &9D:EQUB &BE:EQUB &F0:EQUB &CE
    EQUB &07:EQUB &9E:EQUB &0A:EQUB &71:EQUB &39:EQUB &B6:EQUB &53:EQUB &14
    EQUB &7D:EQUB &62:EQUB &D2:EQUB &2E:EQUB &E4:EQUB &79:EQUB &86:EQUB &DC
    EQUB &80:EQUB &16:EQUB &C8:EQUB &40
.msg11
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &29:EQUB &B1:EQUB &24:EQUB &92
    EQUB &3D:EQUB &DC:EQUB &49:EQUB &0A:EQUB &E7:EQUB &03:EQUB &96:EQUB &02
    EQUB &61:EQUB &5B:EQUB &7D:EQUB &F0:EQUB &3D:EQUB &2C:EQUB &70:EQUB &A0
    EQUB &15:EQUB &68:EQUB &D4:EQUB &11:EQUB &8A:EQUB &E9:EQUB &2E:EQUB &61
    EQUB &9C:EQUB &0F:EQUB &3A:EQUB &C5:EQUB &3B:EQUB &21
.msg12
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &D8
    EQUB &CD:EQUB &2F:EQUB &DB:EQUB &EF:EQUB &81:EQUB &51:EQUB &D2:EQUB &5D
    EQUB &1D:EQUB &62:EQUB &43:EQUB &79:EQUB &85:EQUB &1C:EQUB &A6:EQUB &C4
    EQUB &92:EQUB &48:EQUB &F7:EQUB &50:EQUB &0A:EQUB &AD:EQUB &F7:EQUB &28
    EQUB &D6:EQUB &13:EQUB &6F:EQUB &01:EQUB &DA:EQUB &4B:EQUB &B6:EQUB &F8
    EQUB &2A:EQUB &A6:EQUB &40:EQUB &B6:EQUB &42
.msg13
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &36:EQUB &26:EQUB &D5
    EQUB &D6:EQUB &2F:EQUB &26:EQUB &67:EQUB &9C:EQUB &D0:EQUB &97:EQUB &E3
    EQUB &57:EQUB &10:EQUB &BC:EQUB &F0:EQUB &53:EQUB &84:EQUB &03:EQUB &40
    EQUB &1C:EQUB &71:EQUB &B5:EQUB &60:EQUB &47:EQUB &A5:EQUB &4C:EQUB &87
    EQUB &38:EQUB &1E:EQUB &01:EQUB &C5:EQUB &1B:EQUB &21
.msg14
    EQUB &44:EQUB &0E:EQUB &D9:EQUB &47:EQUB &49:EQUB &33:EQUB &81:EQUB &CA
    EQUB &6A:EQUB &22:EQUB &7D:EQUB &22:EQUB &73:EQUB &08:EQUB &93:EQUB &23
    EQUB &AA:EQUB &A4:EQUB &54:EQUB &BA:EQUB &0D:EQUB &01:EQUB &6C:EQUB &49
    EQUB &68:EQUB &34:EQUB &0A:EQUB &DF:EQUB &78:EQUB &6F:EQUB &01:EQUB &CB
    EQUB &D0:EQUB &07:EQUB &1C:EQUB &69:EQUB &2E:EQUB &DD:EQUB &6C:EQUB &AF
    EQUB &47:EQUB &98:EQUB &02:EQUB &42:EQUB &4E:EQUB &40:EQUB &0B:EQUB &6F
    EQUB &BE:EQUB &07:EQUB &29:EQUB &A8:EQUB &89:EQUB &EA:EQUB &E2:EQUB &17
    EQUB &9D:EQUB &62:EQUB &9E:EQUB &C4:EQUB &97:EQUB &70:EQUB &25:EQUB &3B
    EQUB &21

._stream_end

SAVE "STRM3", _code_begin, _code_end, _rts
SAVE "BITS", stream, _stream_end

PRINT "Length of valid codes =", ~_code_end-valid_codes
PRINT "get_stream_bit =", ~get_stream_bit
PRINT "reset_dest =", ~reset_dest
PRINT "get_char =", ~get_char
PRINT "disp_char =", ~disp_char
PRINT "select_msg =", ~select_msg

\\  DEDICATED TO THE PUBLIC DOMAIN 2021 BY JULIE KIRSTY LOUISE MONTOYA
\\  <bluerizlagirl@gmail.com>  NO RIGHTS RESERVED
\\
\\  USE IT - ABUSE IT - ENJOY IT - DESTROY IT - STUDY IT - SHARE IT - ADAPT IT
\\
The tree has 43 leaves and a maximum depth of 11. The code as written can support a tree with up to 85 leaves (could go up to 256 with different memory usage) with up to 12 levels of depth. I got my test messages down to 873 bytes, plus another 385 for the uncompression code (this does include the message start points, the table of valid codes and a 47-byte gap in order to ensure the code table is page-aligned).

It's already not the world's fastest, and a larger alphabet will only end up slowing it down more .....

I'll open a GitHub, and post my tree and dictionary generating scripts there once I have tidied them up enough to dare show them to anyone else :D
Coeus
Posts: 2226
Joined: Mon Jul 25, 2016 12:05 pm
Contact:

Re: Programming help required re: Huffman coding

Post by Coeus »

I'm with Richard on this that dictionary based compression for text is a better idea than huffman coding. Not only is the decompression simpler it seems to get better results, at least on typical text.

Given you were talking of dropping the index into the table of messages/tokens and searching it from the beginning there are a couple of ways you could avoid the speed penalty whilst still saving space.

1. Allow forward references only, i.e. essentially you have a message table where there are messages you want to print and they can in turn refer to other messages which can be words or groups of words, or even groups of letters that are used in more than one top-level message, as determined by the compressor. If the high level messages are at the start of the table and the fragments towards the end so fragments never refer backwards to something nearer the start of the table that means when expanding any given message you don't have to search from the start of the table, only from the current position.

2. Instead of delimiting messages, have a table of lengths. As there will be one length per message instead of one terminator per message it will not take up any more space. To search for message n you simply add all the lengths up to n-1. The lengths could be in a separate index or you could store them in front of each message, i.e. kind of linked list. That does mean any one message cannot be more than 256 bytes but that's only in its compressed representation. Once fully expanded it could be.

The Lancaster Assembler does some of this message compression stuff.
julie_m
Posts: 311
Joined: Wed Jul 24, 2019 9:53 pm
Location: Derby, UK
Contact:

Re: Programming help required re: Huffman coding

Post by julie_m »

Storing just the lengths of each message, as opposed to storing the start addresses, is actually not a bad idea for squeezing out a bit more space! You only need one byte per message, not two; and you only need to search linearly over up to as many bytes as you have messages. If you are really cunning, you can even store the messages backwards in memory. That way, as you access each byte of the compressed message with LDA(short),Y you can use DEY/BNE and save a CPY.
Post Reply

Return to “programming”