English wordlist checker


English wordlist checker

Post by joachim » Sun Jun 17, 2018 1:40 am

BITD "everyone knew" that if you implemented a game like Scrabble or Boggle, it wasn't practical to include a full English wordlist — there just isn't memory to have a list that's complete enough to be convincing.

I started thinking about how well you could solve this problem with Bloom filters — the idea being that words in the wordlist would always be accepted, words not in the wordlist would *usually* be rejected. That's good enough for an informal word game — certainly better than what we expected of implementations on 8-bit micros.

If I understand the mathematics in the Wikipedia article correctly, you can pick an acceptable false positive rate p and then the memory required for your Bloom filter is linear in your wordlist size, approximately (1/ln 2) * ceil( log2 (1/p) ) bits per word. So an 8 kbyte filter with a 1/16 false positive rate (k = 4 hash functions), or a 10 kbyte filter with a 1/32 rate (k = 5), comes out the same either way: the filter size divided by k is 2 kbyte, and 2 kbyte * 8 bits/byte * ln 2 ≈ 11356.5 words … enough for a playable game.
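Here's a very rough sketch of what I mean, in C for clarity (the real thing would be 6502, of course; the FNV-style hash and the bloom_* names are just placeholders I've made up, not a tuned choice):

#include <stdint.h>

#define FILTER_BYTES 8192u              /* 8 kbyte = 65536 bits        */
#define FILTER_BITS  (FILTER_BYTES * 8u)
#define NUM_HASHES   4u                 /* k = log2(1/p) for p = 1/16  */

static uint8_t filter[FILTER_BYTES];

/* FNV-1a style hash with the seed mixed in, so each of the k hash
   functions behaves independently enough for this purpose. */
static uint32_t hash_word(const char *word, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    while (*word) {
        h ^= (uint8_t)*word++;
        h *= 16777619u;
    }
    return h % FILTER_BITS;
}

/* Build step: set k bits per word. */
void bloom_add(const char *word)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = hash_word(word, i);
        filter[bit >> 3] |= (uint8_t)(1u << (bit & 7));
    }
}

/* Lookup: accept only if all k bits are set.  Words that were added
   always pass; anything else passes with probability about 1/16. */
int bloom_maybe_contains(const char *word)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = hash_word(word, i);
        if (!(filter[bit >> 3] & (1u << (bit & 7))))
            return 0;   /* definitely not in the wordlist */
    }
    return 1;           /* probably in the wordlist */
}

The build step would run off-line on the full wordlist; only the 8 kbyte array and the lookup routine need to live on the micro.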

With a 128k Master you'd be able to think about a Scrabble dictionary accurate enough for club-level play.
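(Back-of-envelope, assuming you could spare, say, 64 kbyte of it for the filter alone: 64 * 1024 * 8 bits * ln 2 / 5 ≈ 72,700 words at a 1/32 false positive rate.)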


Re: English wordlist checker

Post by BigEd » Sun Jun 17, 2018 8:17 am

I like it! A cromulent idea. With some code for stemming you could do a bit better, I think: forming plurals and gerunds, with a list of irregular cases, so your filter needn't deal with a full vocabulary as-spelt-out.


Re: English wordlist checker

Post by joachim » Sun Jun 17, 2018 4:47 pm

Yes! I initially didn't see how to combine stemming with the Bloom filter lookup without using more memory, but I think it's workable. You would have entries in the wordlist with special characters, for example "FILTER*" meaning filter, filters, filtering and filtered. Then when the lookup code encounters a stemmable-looking word, say SUBSTRING, it has to do two lookups: one for "SUBSTR*" and one for "SUBSTRING" itself.

This makes the lookup code larger and slower, but you at least don't have to do multiple lookups on every word, only on words that look as if they came from a stem.
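To make that concrete, here's a sketch of the lookup side in C (the word_ok name and the suffix list are just placeholders; a real version would also want BigEd's list of irregular cases):

#include <string.h>

int bloom_maybe_contains(const char *word);   /* the filter lookup from the first post */

/* Suffixes worth trying to strip -- just the obvious ones for now. */
static const char *suffixes[] = { "S", "ED", "ING", NULL };

int word_ok(const char *word)
{
    char stem[32];
    size_t len = strlen(word);

    /* First lookup: the word exactly as spelt. */
    if (bloom_maybe_contains(word))
        return 1;

    /* Second lookup, only if the word looks stemmable: strip the
       suffix and try "STEM*", e.g. SUBSTRING -> "SUBSTR*". */
    for (int i = 0; suffixes[i] != NULL; i++) {
        size_t slen = strlen(suffixes[i]);
        if (len > slen && len - slen < sizeof(stem) - 1 &&
            strcmp(word + len - slen, suffixes[i]) == 0) {
            memcpy(stem, word, len - slen);
            stem[len - slen] = '*';
            stem[len - slen + 1] = '\0';
            if (bloom_maybe_contains(stem))
                return 1;
        }
    }
    return 0;
}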
