OCR of magazine listings

SimonSideburns
Posts: 257
Joined: Mon Aug 26, 2013 8:09 pm
Location: Purbrook, Hampshire

OCR of magazine listings

Post by SimonSideburns » Tue Jul 18, 2017 4:08 pm

I'm in a few Speccy-scene groups on Facebook, and I was musing about the possibility of OCRing listings from magazines to avoid typing them in by hand. It may be of interest to members here too, so I thought I'd mention it.

I have tried a couple of free OCR programs for Windows but they have been totally hopeless. They assume the page being scanned is in a particular written language and not a computer listing. Also they do struggle to cope with the many styles of computer printout that these listings have been printed in.

One of my scanners, part of a multi-function device (an A3 Brother MFC-J6910DW), has a multi-page scanning facility. I separated out all the pages of the magazine (Best of Sinclair Programs '84), fed the whole lot into the multi-page feeder and tried to scan the pages into separate files (one picture per page or something), but my laptop wasn't happy with that and crashed, telling me it had run out of memory without saving a single image. Looks like I might have to scan each page one at a time. So much for that!

The vocabulary of a listing (apart from variable names) is going to be limited to keywords, symbols and possibly graphical characters, so we would need to find some way of training the OCR software to recognise those, but the problem is finding (or writing) OCR software that learns as it goes along and is 'happy' working in this manner.
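One way to exploit that limited vocabulary without retraining the OCR engine is to clean up its output afterwards, snapping misread words to the nearest known keyword. The following is only a minimal sketch of that idea: the keyword list is a small illustrative subset, the digit-to-letter confusion table and the 0.75 cutoff are guesses, and the function names are made up for this example. Variable names that happen to resemble a keyword could be wrongly "corrected", so the result would still need proofreading.

```python
import difflib
import re

# Illustrative subset of Sinclair BASIC keywords; a real list would be longer.
# Two-letter keywords (IF, TO) are left out because short matches are too risky.
KEYWORDS = ["PRINT", "INPUT", "GOTO", "GOSUB", "RETURN", "NEXT", "THEN",
            "STEP", "PAUSE", "BORDER", "PAPER", "FOR", "LET", "INK"]

# Digits that OCR might produce in place of letters (an assumed mapping).
CONFUSABLE = str.maketrans({"0": "O", "1": "I", "5": "S", "8": "B"})


def snap_to_keyword(token, cutoff=0.75):
    """Return the closest keyword for a token, or the token unchanged."""
    if token.isdigit():
        return token  # line numbers and numeric constants are left alone
    candidate = token.upper().translate(CONFUSABLE)
    if candidate in KEYWORDS:
        return candidate
    matches = difflib.get_close_matches(candidate, KEYWORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line):
    """Correct keyword-like words in one listing line, skipping string literals."""
    # After splitting on double quotes, even-indexed parts are outside strings.
    parts = line.split('"')
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"\b[A-Za-z][A-Za-z0-9]{2,}\b",
                          lambda m: snap_to_keyword(m.group()),
                          parts[i])
    return '"'.join(parts)
```

For example, `correct_line('10 PR1NT "PR1NT"')` repairs the keyword but leaves the quoted text as scanned.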

I'm looking for recommendations, suggestions, solutions, etc.

sydney
Posts: 1986
Joined: Wed May 18, 2005 9:09 am
Location: Newcastle upon Tyne

Re: OCR of magazine listings

Post by sydney » Tue Jul 18, 2017 4:29 pm

Could you load the games from the cover disc images and then print them out?


Re: OCR of magazine listings

Post by sydney » Tue Jul 18, 2017 4:47 pm

That wasn't as helpful as I thought. Feel free to ignore it, as I'd just woken from a nap.

1024MAK
Posts: 6791
Joined: Mon Apr 18, 2011 4:46 pm
Location: Looking forward to summer in Somerset, UK...

Re: OCR of magazine listings

Post by 1024MAK » Tue Jul 18, 2017 6:39 pm

Erm, what cover disk?
Only the later magazines had cover disks, and a lot of home computer magazines never had cover disks at all. True, some had cover tapes. But even where these exist, that still leaves vast numbers of magazines where only the printed listing was published.

When I try OCR on documents where either the characters are different to what it expects, or the wording is different to what it expects, it makes a complete hash of it.

As Simon indicates, the listings were often printed in various typefaces and sizes, and sometimes as a dot matrix printout. And some magazines liked to jazz them up (different colours for both the type and the background). Or worse, print them on top of a picture.

All of which makes OCR recognition a nightmare :(

Even a human eye cannot always work out what has been printed :twisted:

Mark
For a "Complete BBC Games Archive" visit www.bbcmicro.co.uk NOW!
BeebWiki - for answers to many questions...


Re: OCR of magazine listings

Post by 1024MAK » Tue Jul 18, 2017 6:39 pm

sydney wrote: That wasn't as helpful as I thought. Feel free to ignore it, as I'd just woken from a nap.

I hope you had a better snooze than I did...

Mark


Re: OCR of magazine listings

Post by sydney » Tue Jul 18, 2017 7:24 pm

1024MAK wrote: Erm, what cover disk?
...
Mark


8bs.com has lots (all?) of the cover disks/tapes, and even where a cover disk or tape was not available, the listings have been typed in for you!

http://8bs.com/catalogue.htm

A&B
Beebug
Electron User
The Micro User

Whether or not this is actually useful is another matter altogether. :lol:

lurkio
Posts: 1290
Joined: Tue Apr 09, 2013 11:30 pm
Location: Doomawangara

Re: OCR of magazine listings

Post by lurkio » Tue Jul 18, 2017 7:49 pm

davidb seemed to have got some good results with the Tesseract OCR program recently:

davidb wrote: I ran the pages through Tesseract and cleaned up the output. It's not really designed for this kind of text. Feel free to fix any errors I've introduced ...
:?:

davidb
Posts: 1901
Joined: Sun Nov 11, 2007 10:11 pm

Re: OCR of magazine listings

Post by davidb » Tue Jul 18, 2017 8:36 pm

lurkio wrote: davidb seemed to have got some good results with the Tesseract OCR program recently:

I think it does use a dictionary sometimes, however, so the programs might get "translated" a bit. It's worth a try. If anyone wants me to try a few other OCR tools available in Debian then just let me know. :)
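If the dictionary is what "translates" the programs, Tesseract's word lists can be switched off through its `load_system_dawg` and `load_freq_dawg` configuration variables. The sketch below only builds and runs such a command line; the function names and file names are invented for illustration, `--psm 6` (treat the page as one uniform block of text) is a guess at a sensible layout mode for listings, and whether any of this actually improves results on magazine scans has not been tested here.

```python
import subprocess


def tesseract_command(image_path, output_base):
    """Build a Tesseract command line with its dictionaries disabled."""
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",                # assume a single uniform block of text
        "-c", "load_system_dawg=0",  # do not load the main language word list
        "-c", "load_freq_dawg=0",    # do not load the frequent-words list
    ]


def ocr_listing(image_path, output_base):
    """Run Tesseract (must be installed and on PATH); returns the text file name."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    return output_base + ".txt"
```

Calling `ocr_listing("page01.tif", "page01")` would, under those assumptions, leave the recognised text in `page01.txt`.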

Wouter Scholten
Posts: 182
Joined: Wed May 02, 2001 10:14 pm
Location: NL

Re: OCR of magazine listings

Post by Wouter Scholten » Tue Jul 18, 2017 10:43 pm

lurkio wrote: davidb seemed to have got some good results with the Tesseract OCR program recently:

davidb wrote: I ran the pages through Tesseract and cleaned up the output. It's not really designed for this kind of text. Feel free to fix any errors I've introduced ...
:?:



I used Tesseract many years ago to create, for example, the text of the tubelink ad, which I included with the Advanced BASIC disk image to give a little information, as there was no manual (and none has surfaced as yet) nor any other information on this 6502 2p BASIC. It worked well (I saved files from the scanner as uncompressed TIFF), but only with clean, high-resolution scans. I tried it recently on some lower-resolution scans (e.g. the Beebug PDFs from 8bs.com, converted to TIFF by grabbing the application window with xv), and the result was quite poor, even after some changes to the image that might help the OCR program.

