Things I am doing. Desktop application

feedback, questions and discussion relating to the Complete BBC Games Archive (beta site now open!)
pau1ie
Posts: 634
Joined: Thu May 10, 2012 9:48 pm
Location: Bedford

Post by pau1ie » Thu Mar 01, 2018 11:14 pm

I am using an HTML table. That is why I put "spreadsheet" in quotes. I use a Python library (html.parser) to parse it (into an sqlite database, as it happens); Lee opens it with a spreadsheet program. I think HTML is more flexible than sqlite: you could parse it with grep and sed, or even Basic. To save myself work I would like to have one database extract if at all possible.
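
For anyone following along, here is a minimal sketch of that approach: parse an HTML table with html.parser and load the rows into sqlite. The column names (author, game, filename) are borrowed from the example query later in the thread, and the sample row and table layout are made up; the real ss.php page will have different columns.

```python
import sqlite3
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def load_table(html_text, db):
    """Parse the table and insert every row after the header into sqlite."""
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    _header, *body = parser.rows
    # Hypothetical three-column schema, for illustration only.
    db.execute("CREATE TABLE games (author TEXT, game TEXT, filename TEXT)")
    db.executemany("INSERT INTO games VALUES (?, ?, ?)", body)
    return len(body)


# Made-up sample data standing in for the real page.
SAMPLE = """<table>
<tr><th>author</th><th>game</th><th>filename</th></tr>
<tr><td>Example Author</td><td>Example Game &amp; Sequel</td><td>example.ssd</td></tr>
</table>"""

db = sqlite3.connect(":memory:")
count = load_table(SAMPLE, db)
rows = db.execute("SELECT author, game, filename FROM games").fetchall()
print(count, rows)
```

The parser only keeps cell text, so links or formatting inside cells are dropped; that is a simplification for the sketch.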
I'm working on http://bbcmicro.co.uk

crj
Posts: 837
Joined: Thu May 02, 2013 4:58 pm

Post by crj » Fri Mar 02, 2018 12:20 am

pau1ie wrote:I am using an HTML table. That is why I put "spreadsheet" in quotes. I use a Python library (html.parser) to parse it (into an sqlite database, as it happens); Lee opens it with a spreadsheet program. I think HTML is more flexible than sqlite: you could parse it with grep and sed, or even Basic. To save myself work I would like to have one database extract if at all possible.
Normally, by choosing an HTML table you'd be heading for a world of hurt with content encodings. But I guess almost everything to do with old Acorn kit will be flat ASCII. Just make sure pound signs survive every translation and transmission step. (-8

sqlite is rather better at giving you back the bytes you put in.

Post by crj » Fri Mar 02, 2018 12:30 am

pau1ie wrote:You could parse it with grep and sed, or even Basic.
Hmm. That's set me thinking that if you had something more modern attached to a Beeb it would be pretty easy to give BBC BASIC a nice interface to sqlite...

Code:

*attach mydb path.to.my.database/sqlite
X=OPENIN("sqlite:SELECT author,game,filename FROM mydb.games")
REPEAT
  INPUT#X,author$,game$,filename$
UNTIL EOF#X
CLOSE#X
*detach mydb
...or something like that. (-8

Post by pau1ie » Fri Mar 02, 2018 1:06 pm

crj wrote:by choosing an HTML table you'd be heading for a world of hurt with content encodings.
The bbcmicro.co.uk/ss.php page is in UTF-8 according to my browser. As I say, Lee already uses it, so it has to work. The only title I had issues with was:

Secret Diary Of Adrian Mole Aged 13¾, The

That works fine, though I think I had to tell the browser the page is UTF-8 to get it to work properly. The ¾ is UTF-8 code C2BE. I suppose that an 8 bit computer would have problems with that, so I would have to consider how to deal with it if I ever get round to building a menu: use } in mode 7 (which displays as ¾ there), or change it to 3/4, or just lose any non 7 bit ASCII characters. But that is way in the future.
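
A rough sketch of the "change it to 3/4, or just lose any non 7 bit ASCII characters" options in Python. The function name to_7bit is made up for illustration. It relies on Unicode compatibility decomposition (NFKD), which expands ¾ into 3, a fraction slash (U+2044) and 4; the extra space after a digit is my own addition so that 13¾ does not turn into 133/4.

```python
import re
import unicodedata


def to_7bit(title):
    """Return a 7-bit ASCII approximation of a title (hypothetical helper)."""
    # Put a space between a digit and a following vulgar fraction,
    # otherwise "13¾" would read as "133/4" after expansion.
    spaced = re.sub(r"(?<=\d)(?=[\u00bc\u00bd\u00be])", " ", title)
    # NFKD expands compatibility characters: "¾" -> "3" + U+2044 + "4".
    expanded = unicodedata.normalize("NFKD", spaced)
    # Swap the fraction slash for a plain ASCII slash.
    expanded = expanded.replace("\u2044", "/")
    # Drop anything that is still outside 7-bit ASCII.
    return expanded.encode("ascii", "ignore").decode("ascii")


print(to_7bit("Secret Diary Of Adrian Mole Aged 13¾, The"))
# Secret Diary Of Adrian Mole Aged 13 3/4, The
```

Characters with no ASCII decomposition, such as the pound sign, are silently dropped by the last step, so a real menu builder would probably want an explicit mapping for those.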

Sitting at work as I am now, the HTML table wins because it can be opened in Excel. I am not sure this would be possible with sqlite. Also, both Excel and LibreOffice Calc can open the table directly from the web page, which is nice.

Post by crj » Fri Mar 02, 2018 4:14 pm

As I say, you might be OK in this application.

However, I would ask: is this the same title? "Secret Diary Of Adrian Mole Aged 13³⁄₄, The"

Also, I would ask how you cope if you fetch the page in a hotel room and the hotel's "transparent" proxy transcodes to iso-8859-1.

Also, do any titles contain double-spaces? If so, it would be prudent to transform the extra spaces into en spaces, then convert them back afterwards. Giving consideration to whether it's necessary to cope with the &ensp; HTML entity as well as the bare Unicode character.

Also, in a retrocomputing context, how confident are you that any data you store as HTML will be interpreted the same way by tools available in 2048?

HTML is a format for presenting information to a user, not for robust, durable storage. Personally, I'd want to be sure of getting all my bits back, so would use sqlite or similar. /-8

Post by pau1ie » Fri Mar 02, 2018 5:00 pm

crj wrote:is this the same title?
That is a philosophical question. I think I am going to say I don't care. I suspect Lee typed 3/4 into a spreadsheet and it was auto-corrected to ¾. Does that matter? Would it have been better to leave it? Would ³⁄₄ have been better? I don't think I mind. For my purposes they are the same. (For those following along, the original has one character, crj's example has three. Try highlighting it.)
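
If anyone does want to treat the two spellings as the same title in code, Unicode compatibility normalisation (NFKC) is one way. This is only a sketch: it assumes crj's three characters are superscript three (U+00B3), fraction slash (U+2044) and subscript four (U+2084), and same_title is a made-up helper name.

```python
import unicodedata

one_char = "13\u00be"                 # 13¾, a single fraction character
three_chars = "13\u00b3\u2044\u2084"  # 13³⁄₄, three separate characters


def same_title(a, b):
    """Compare two titles after NFKC compatibility normalisation."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(one_char == three_chars)            # False: the raw strings differ
print(same_title(one_char, three_chars))  # True: both normalise to 133⁄4
```

Note that the normalised form runs the 13 and the 3 together, so it is useful for comparing titles but not for displaying them.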
crj wrote:iso-8859-1
I expect your point is that (Depending on how the proxy mangles things) it will end up looking like this.

Secret Diary Of Adrian Mole Aged 13¾, The

You often see this type of problem. I would prefer to use https to stop the proxy being able to mess with the content I am delivering.
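
A small sketch of the two ways this can go wrong, depending on which side gets the encoding wrong. It does not model any particular proxy; it just shows what the same title looks like when UTF-8 bytes are read as iso-8859-1, and when iso-8859-1 bytes are read as UTF-8.

```python
title = "Secret Diary Of Adrian Mole Aged 13\u00be, The"  # ...13¾, The

# Case 1: the page is sent as UTF-8 but read as iso-8859-1.
# The two bytes C2 BE show up as two separate characters.
misread = title.encode("utf-8").decode("iso-8859-1")
print(misread)  # Secret Diary Of Adrian Mole Aged 13Â¾, The

# Case 2: something transcodes to iso-8859-1 but the reader still expects
# UTF-8. The lone byte BE is invalid, so it becomes U+FFFD.
garbled = title.encode("iso-8859-1").decode("utf-8", errors="replace")
print(garbled)
```

Case 1 is reversible (encode as iso-8859-1, decode as UTF-8 again); case 2 loses the character once the replacement has been made.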
crj wrote:Also, in a retrocomputing context, how confident are you that any data you store as HTML will be interpreted the same way by tools available in 2048?
Extremely confident. I think it is similar to my being able to open an ASCII text file from my BBC Micro in Notepad now. HTML is pretty much the most common file format; it is widely documented, open and easily accessible. UTF-8 is the most common encoding. The sqlite database format is likely to have changed between now and then, so you will have to find an old version of the code and compile it to be able to read it. Doable, but I am confident that UTF-8 encoded HTML will "just work".

Post by crj » Fri Mar 02, 2018 5:42 pm

pau1ie wrote:Extremely confident. [...] notepad
Funny you should mention Notepad... that's natively UTF-16 and can cause a lot of damage in transcoding UTF-8 back and forth.
pau1ie wrote:I suspect Lee typed 3/4 into a spreadsheet and it was auto-corrected to ¾. Does that matter?
To me it does, because my hope is that somebody, somewhere, will have a definitive archival-grade list. It would be a shame to have something that only got 99% of the way there, because a lot of effort would have to be duplicated before anybody could start working on that last 1%.

Post by pau1ie » Fri Mar 02, 2018 8:43 pm

crj wrote: definitive archival-grade list
This is absolutely not what we are trying to achieve. I do try to be as open with what we are doing as possible. In practice I think the HTML table isn't as bad as you are concerned it might be. Lee uses it to maintain the site, so it feeds back on itself.

The other thing that might help is the mysql dump which I update by hand periodically. This is probably closer to what you want. If you build something that uses it, I will investigate generating it on demand.

It is interesting (though somewhat academic, as I am not interested in going down that route) to wonder what an archival grade list would mean. Is the title what was written on the front of the cassette, or what was displayed when the program was run, or what was displayed in the adverts? Presumably, where they differ, all of these would have to be logged. I think anyone doing this would find they have more than 1% to do, though Lee has done amazing work with the bbcmicro site metadata.

Post by crj » Sat Mar 03, 2018 12:01 am

Being enough of a Douglas Adams fan to have seen the Hitch-Hiker's v. Hitch Hiker's v. Hitchhiker's confusion, I do appreciate that difficulty. Though IMDb does try to preserve all variant names of films, and BoardGameGeek does the same for board games. It would be good if someone, somewhere was doing the same for computer games.

Then again, I'm not volunteering, so I know I can't complain. (-8

Post by pau1ie » Sun Jan 13, 2019 11:28 am

Must be something about winter that made me think about this again, and the fact that as things stand it is impossible to link the screen shots in the archive download file with a game entry. So I am revisiting this, and have more or less unilaterally decided that I will use the following scheme for file names inside the zip file:

Code:

t/title-id.ext
Where:
  • t is the first character of the title, upper-cased (or zero if it is a digit)
  • title is the title of the game without the article, truncated at the first "(", "[" or ",", with all non-alphanumeric characters removed.
  • id is the id. This is what I really want to have in the file name so they can definitely be uniquely identified.
  • ext is the extension
I know some people wanted other stuff, in particular publishers, but I don't understand, given that we have the spreadsheet (ss.php), why this would be useful. It would lead to long filenames and the names would be mangled by applying the above rules anyway.
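
For illustration, the naming rules above could be sketched in Python like this. It is not the site's actual code: archive_name, the list of leading articles, the ids and the file extensions are all made up, and "alphanumeric" is assumed to mean ASCII letters and digits only. Titles stored as "Name, The" lose their article through the truncation at the comma; leading articles are stripped as an extra assumption.

```python
import re

# Assumption: these are the only articles worth stripping from the front.
LEADING_ARTICLES = ("The ", "A ", "An ")


def archive_name(title, game_id, ext):
    """Build a t/title-id.ext path from a game title (hypothetical helper)."""
    # Truncate at the first "(", "[" or ",".
    base = re.split(r"[(\[,]", title, maxsplit=1)[0]
    # Drop a leading article, if there is one.
    for article in LEADING_ARTICLES:
        if base.startswith(article):
            base = base[len(article):]
            break
    # Remove everything that is not an ASCII letter or digit.
    base = re.sub(r"[^A-Za-z0-9]", "", base)
    # Directory is the upper-cased first character, or "0" for digits.
    first = base[:1].upper()
    if not first or first.isdigit():
        first = "0"
    return f"{first}/{base}-{game_id}.{ext}"


print(archive_name("Secret Diary Of Adrian Mole Aged 13¾, The", 42, "ssd"))
# S/SecretDiaryOfAdrianMoleAged13-42.ssd
```

Because the id is always part of the name, two titles that collapse to the same text after mangling still get distinct file names.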

Speaking of the spreadsheet, I intend to include that in the file as well to make sure that whenever it is downloaded all the metadata comes with it, and all the work that goes into producing the site won't be lost if it drops off the internet. I'll probably include a readme with some waffle and the generated date as well. The spreadsheet currently has a file name column, but those are the names on the server not in the archive, so I will include another column containing the file name in the archive.
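
As a sketch of what bundling the metadata could look like, here is one way to write the game files, the spreadsheet and a readme into a single zip with Python's zipfile module. The member names ss.html and README.txt, the function name build_archive and the readme wording are assumptions, not what the site actually generates.

```python
import io
import zipfile
from datetime import datetime, timezone


def build_archive(files, spreadsheet_html):
    """Return zip bytes holding the files, the spreadsheet and a readme.

    files maps archive names (for example "S/Example-1.ssd") to bytes.
    """
    buf = io.BytesIO()
    generated = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
        # Hypothetical member names for the metadata and the readme.
        zf.writestr("ss.html", spreadsheet_html.encode("utf-8"))
        zf.writestr("README.txt", f"Generated: {generated}\n")
    return buf.getvalue()


archive = build_archive({"S/Example-1.ssd": b"\x00"}, "<table></table>")
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    print(zf.namelist())
```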

I have more or less coded this. If it will cause a problem to you in any project you have, please let me know and I will try to accommodate you, but I am keen not to let the perfect become the enemy of the good.

Post by pau1ie » Thu Jan 17, 2019 12:42 pm

That is all done now.

leenew
Posts: 3780
Joined: Wed Jul 04, 2012 3:27 pm
Location: Doncaster, Yorkshire

Post by leenew » Mon Jan 21, 2019 9:58 pm

Thanks Paul,
Looks to be working well 🙂

Lee.
