Educational Archive status

pau1ie
Posts: 525
Joined: Thu May 10, 2012 9:48 pm
Location: Bedford

Educational Archive status

Post by pau1ie » Mon Mar 12, 2018 2:13 pm

I am making steady progress. I have completed John's "Anita Straker" directory, and she has around 50 programs out of the just over 100 entries I have created so far. There were a few more that were not written by her but were distributed by similar means.

I am now on the "Arnold" directory, which contains two programs. Most of the time here goes on assembling the documentation into a PDF from the raw scans that are available. Since I will have to do this several times, I suspect I ought to develop a script to do so.

I want to keep the PDFs as detailed as possible, but also searchable. I use OCRFeeder, which is pretty basic, but more or less works to create searchable pdfs.

Booklets tended to have a colour cover, then black and white typed (or possibly daisy-wheel printed) insides. I find it saves a massive amount of space to use indexed images (in Gimp, Image->Mode->Indexed and select "Use black and white (1-bit) palette"). For the cover, while it was probably printed using four colours, the scan will have picked up the texture of the paper, so I use more colours. The images are then saved as PNGs, which are much smaller than JPEGs.

At present I load all the images into Gimp, crop them into individual pages, rotate if necessary, fix the colours, change the mode as above then save.
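The mode change and save might be scriptable once the pages have been cropped and rotated by hand. The following is only a sketch: it swaps Gimp's indexed mode for an ImageMagick threshold, assumes the pages are named page_*.tiff, and picks a 50% threshold arbitrarily. The helper names `png_name` and `to_bilevel` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: convert already-cropped inner pages to 1-bit black and white PNGs.
# Assumptions (not from the thread): ImageMagick is installed, pages are
# named page_*.tiff, and a 50% threshold suits the scans.

# Print the output name for an input file: page_03.tiff -> page_03.png
png_name() {
    local src="$1"
    printf '%s.png\n' "${src%.*}"
}

# Convert one page to greyscale, threshold it, and save as a bilevel PNG.
to_bilevel() {
    local src="$1"
    convert "$src" -colorspace Gray -threshold 50% -type bilevel "$(png_name "$src")"
}

# Usage (run in the directory holding the cropped pages):
#   for f in page_*.tiff; do to_bilevel "$f"; done
```

Light or uneven scans would probably need a different threshold per booklet, and the colour covers would still need handling separately.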

Then I load them into OCRFeeder. I find the automatic recognition to be rather poor, so I select each text area individually. The OCR is generally not too bad, but it is never perfect (especially since the digit 1 and lower-case l are often identical, as are 0 and O), and page numbers often don't come out, so the text needs correcting.

Then I make it generate a PDF using OCRFeeder.

I am not sure if any of this can be automated. It takes me 2-3 evenings to do a manual, so it is slow going, but not all titles have documentation, so some can be done much quicker.
I'm working on http://bbcmicro.co.uk

flaxcottage
Posts: 3091
Joined: Thu Dec 13, 2012 8:46 pm
Location: Derbyshire

Re: Educational Archive status

Post by flaxcottage » Mon Mar 12, 2018 8:02 pm

Wow! That is intense and definitely a work of love. =D> =D> =D>

Well done so far.
- John

Why do I keep collecting Acorn gear? I'm going to need a considerably bigger man-cave. :?

richardtoohey
Posts: 3563
Joined: Thu Dec 29, 2011 5:13 am
Location: Tauranga, New Zealand

Re: Educational Archive status

Post by richardtoohey » Tue Mar 13, 2018 5:23 am

I work quite a bit with automated PDFs, but more generation or overlaying text/simple objects onto existing PDFs.

What you are describing seems to need a lot of human oversight and skill, so I can't see what could easily be automated? :-k

I know AI is meant to be taking over the world, but don't think we are there yet?

If there are definitely one or two steps that you always need to do, without even looking at the PDF, then what are they?

You can also do macros in Gimp, so there might be something doable there?

Thanks for your efforts. =D>

pau1ie
Posts: 525
Joined: Thu May 10, 2012 9:48 pm
Location: Bedford

Re: Educational Archive status

Post by pau1ie » Tue Mar 13, 2018 7:29 pm

OK, OCRFeeder broke halfway through (that has happened to me before), so maybe it isn't the best tool to use.

I created a file using LibreOffice Draw in the end. I did it on Linux, so I'm not sure how the fonts will look on Windows - I will have a look at work tomorrow. I think I will try another program to do the OCR stuff. I saw a reference to PDF Sandwich. Maybe that will be better... I will try that next, but first I need to check the state of the SSD.
Attachments
BeatTheClock.zip
(4.44 MiB) Downloaded 13 times

lurkio
Posts: 1611
Joined: Tue Apr 09, 2013 11:30 pm
Location: Doomawangara

Re: Educational Archive status

Post by lurkio » Tue Mar 13, 2018 7:37 pm

pau1ie wrote:I created a file using libreoffice draw in the end.
Nice! Looks good on a Mac. (I notice that the penultimate page is blank though. Intentional?)

Btw, dunno if this is helpful or relevant to you, but I've found that the ImageMagick suite of command-line tools can be useful for rotating and splitting a double-page scan into two portrait-oriented single pages:
  • Code:

    convert -rotate "90" double-page.jpg outdir/out.tiff
    convert -crop 50%x100% +repage outdir/out.tiff outdir/cropped_%d.tiff
    rm outdir/out.tiff
    
:idea:

flaxcottage
Posts: 3091
Joined: Thu Dec 13, 2012 8:46 pm
Location: Derbyshire

Re: Educational Archive status

Post by flaxcottage » Tue Mar 13, 2018 8:21 pm

Looks good on Windows too. :D
- John


pau1ie
Posts: 525
Joined: Thu May 10, 2012 9:48 pm
Location: Bedford

Re: Educational Archive status

Post by pau1ie » Thu Mar 15, 2018 2:42 pm

Thanks for checking!
lurkio wrote:commandline tools can be useful for rotating and splitting
Thanks. It always takes me ages to work out what the options I want are!

pau1ie
Posts: 525
Joined: Thu May 10, 2012 9:48 pm
Location: Bedford

Re: Educational Archive status

Post by pau1ie » Sun Mar 18, 2018 3:13 pm

The disc image for Beat the Clock was fine, though the menu was different from the suggested one. I don't think it matters.

Here is the manual for Genetics with Budgerigars. I used lurkio's command in a bash for loop to generate the pages, then created a PDF with ImageMagick.

First I edited some of the page images to remove shadows at the edges of the pages (though pdfsandwich is supposed to do this) and also to darken some very light text. Then I used

Code:

convert *.tiff out.pdf
to create a PDF
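
The for loop mentioned above is not shown in the thread. A minimal sketch of what it might look like follows; the function names and the directory arguments are hypothetical, and the `CONVERT` variable is an addition so the commands can be previewed with `CONVERT=echo` before running them for real.

```shell
#!/usr/bin/env bash
# Sketch: rotate and split every double-page scan in a directory.
# Hypothetical names: split_scan, split_all and the directories passed in.
# Set CONVERT=echo to print the commands instead of running ImageMagick.

split_scan() {
    local src="$1" outdir="$2"
    local base
    base="$(basename "${src%.*}")"
    # Rotate the scan upright, then cut it into left and right halves.
    "${CONVERT:-convert}" -rotate "90" "$src" "$outdir/$base.tiff"
    "${CONVERT:-convert}" -crop 50%x100% +repage "$outdir/$base.tiff" "$outdir/${base}_%d.tiff"
    rm -f "$outdir/$base.tiff"   # drop the intermediate rotated image
}

split_all() {
    local indir="$1" outdir="$2" src
    mkdir -p "$outdir"
    for src in "$indir"/*.jpg; do
        [ -e "$src" ] || continue   # nothing matched the pattern
        split_scan "$src" "$outdir"
    done
}

# Usage: split_all scans pages
```

Because `convert *.tiff out.pdf` takes the files in shell glob order, zero-padded scan names (scan01.jpg, scan02.jpg, ...) would help keep the pages in sequence.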

Then I ran pdfsandwich. It is actually supposed to be able to do many of the previous steps, including rotating. I might try that later.

Code:

pdfsandwich out.pdf -o GeneticsWithBudgerigars.pdf
I was happy with the result, except that the covers ended up pretty much solid white (previously they were white on dark green). When I edited them with LibreOffice Draw and saved, the images all turned black. I assume this is a bug in LibreOffice, or maybe an incompatibility between LibreOffice and ImageMagick in their handling of TIFFs. I think I will try using the defaults for the document apart from the covers, and add the covers on at the end.

pdfsandwich complained that the images were too big. According to the documentation it does this if the images are larger than A3. Maybe they should be scaled to be smaller - I assume the booklet is A5 size. This would reduce the file size.
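
If the booklet really is A5 (148 mm x 210 mm), the target pixel size for a given resolution is simple arithmetic: pixels = mm x dpi / 25.4. The sketch below works that out in integer shell arithmetic. The 300 DPI default is an assumption, since the thread does not give the scan resolution, and the function names are made up.

```shell
#!/usr/bin/env bash
# Sketch: pixel dimensions of an A5 page (148 mm x 210 mm) at a given DPI.
# The 300 DPI default is an assumption, not a figure from the thread.

# Convert millimetres to pixels, rounded to the nearest pixel.
# 1 inch = 25.4 mm, so pixels = mm * dpi / 25.4 (scaled by 10 to stay in integers).
mm_to_px() {
    local mm="$1" dpi="$2"
    echo $(( (mm * dpi * 10 + 127) / 254 ))
}

# Print WIDTHxHEIGHT for A5 at the given DPI.
a5_geometry() {
    local dpi="${1:-300}"
    echo "$(mm_to_px 148 "$dpi")x$(mm_to_px 210 "$dpi")"
}

# a5_geometry 300 prints 1748x2480
# Possible use with ImageMagick (scales to fit, keeping the aspect ratio):
#   convert page.tiff -resize "$(a5_geometry 300)" page_a5.tiff
```

Whether this silences the pdfsandwich warning may also depend on the resolution recorded in the image files, which the thread does not mention.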

Lurkio - you seemed to be suggesting I use tiffs in your convert command. Is there a reason for this? Is there a better way to edit pdfs than using libreoffice draw? I could run pdfsandwich in debug mode which leaves the ghostscript, then edit that to use the covers I want.

lurkio
Posts: 1611
Joined: Tue Apr 09, 2013 11:30 pm
Location: Doomawangara

Re: Educational Archive status

Post by lurkio » Mon Mar 19, 2018 12:47 pm

pau1ie wrote:Lurkio - you seemed to be suggesting I use tiffs in your convert command. Is there a reason for this?
It's just because TIFFs use lossless compression, whereas JPEGs are lossy and can visibly reduce the fidelity of the scanned image by introducing compression artefacts. I always scan to TIFF too (except when I forget and accidentally save as "double-page.jpg" or something).

:idea:

pau1ie
Posts: 525
Joined: Thu May 10, 2012 9:48 pm
Location: Bedford

Re: Educational Archive status

Post by pau1ie » Mon Mar 19, 2018 11:06 pm

Are TIFFs any better than PNGs? I normally use PNG out of habit. Anyhow, I have found a way to avoid LibreOffice Draw, which I think is too memory intensive for my laptop (pretty retro itself now, with a mere 4 GB of RAM).

Anyway, I tried the following. Create a pdf with all pages apart from the cover, and also create pdfs of the cover pages:

Code:

convert frontcover.tiff frontcover.png
convert backcover.tiff backcover.png
convert p*.tiff out.pdf
Where the pages are named p*.tiff

Code:

pdfsandwich out.pdf -o out2.pdf
To OCR and format the pages.

Then I downloaded and used pdfmod to insert the covers at the beginning and end. They aren't OCRed, but I don't think that matters too much.

Another thing I learned is that the text inside the searchable PDF doesn't have glyphs. This doesn't matter, because the image provides the shape of the letters. However, the Evince PDF viewer displays boxes when you highlight the text, which is what happens when there isn't a glyph for the character. The Evince developers say this is expected behaviour, but they seem to be in the minority in this interpretation: other PDF readers don't try to display the font, they just highlight the image.

So the Arnold directory is pretty much done, which means I am finished with the letter A. Next up is BBC Soft, which should go quicker as there is no documentation in there!
Attachments
GeneticsWithBudgerigars.zip
(2.98 MiB) Downloaded 9 times