Things I am doing. Desktop application

feedback, questions and discussion relating to the Complete BBC Games Archive (beta site now open!)
pau1ie
Posts: 470
Joined: Thu May 10, 2012 9:48 pm
Location: Bedford
Contact:

Re: Things I am doing. Desktop application

Postby pau1ie » Thu Mar 01, 2018 11:14 pm

I am using an HTML table. That is why I put "spreadsheet" in quotes. I use a Python library (html.parser) to parse it (into an sqlite database, as it happens); Lee opens it with a spreadsheet program. I think HTML is more flexible than sqlite: you could parse it with grep and sed, or even BASIC. To save myself work, I would like to have one database extract if at all possible.
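For anyone following along, here is a minimal sketch of that kind of pipeline, using only the standard library's html.parser and sqlite3. It is not the actual code behind the site: the class name TableRows, the sample table and the column names are all made up for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample; the real page's columns are not shown in this thread.
SAMPLE = """
<table>
<tr><th>Title</th><th>Publisher</th></tr>
<tr><td>Example Game, The</td><td>Example Soft</td></tr>
<tr><td>Fish &amp; Chips</td><td>Example Soft</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
parser.close()
header, *records = parser.rows

# Load the data rows into an in-memory sqlite database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE games (title TEXT, publisher TEXT)")
db.executemany("INSERT INTO games VALUES (?, ?)", records)
titles = [row[0] for row in db.execute("SELECT title FROM games ORDER BY title")]
print(titles)
```

The parser decodes character references such as &amp;amp; itself, so the database ends up holding the plain text of each cell.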
I'm working on http://bbcmicro.co.uk

crj
Posts: 830
Joined: Thu May 02, 2013 4:58 pm
Contact:

Postby crj » Fri Mar 02, 2018 12:20 am

pau1ie wrote:I am using an HTML table. That is why I put "spreadsheet" in quotes. I use a Python library (html.parser) to parse it (into an sqlite database, as it happens); Lee opens it with a spreadsheet program. I think HTML is more flexible than sqlite: you could parse it with grep and sed, or even BASIC. To save myself work, I would like to have one database extract if at all possible.

Normally, by choosing an HTML table you'd be heading for a world of hurt with content encodings. But I guess almost everything to do with old Acorn kit will be flat ASCII. Just make sure pound signs survive every translation and transmission step. (-8

sqlite is rather better at giving you back the bytes you put in.

Postby crj » Fri Mar 02, 2018 12:30 am

pau1ie wrote:You could parse it with grep and sed, or even Basic.

Hmm. That's set me thinking that if you had something more modern attached to a Beeb it would be pretty easy to give BBC BASIC a nice interface to sqlite...

Code:

*attach mydb path.to.my.database/sqlite
X=OPENIN("sqlite:SELECT author,game,filename FROM mydb.games")
REPEAT
  INPUT#X,author$,game$,filename$
UNTIL EOF#X
CLOSE#X
*detach mydb
...or something like that. (-8
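The BASIC above is imagined syntax, but the "something more modern" on the other end could be quite small. As a rough sketch, assuming Python's sqlite3 on the host side: ATTACH and DETACH are real SQLite statements, while read_records, the sample row and the use of an in-memory database in place of a file path are made up for illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so no
# transaction is left open when the database is detached at the end.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Counterpart of "*attach mydb ..."; ':memory:' stands in for a real file path.
conn.execute("ATTACH DATABASE ':memory:' AS mydb")
conn.execute("CREATE TABLE mydb.games (author TEXT, game TEXT, filename TEXT)")
conn.execute(
    "INSERT INTO mydb.games VALUES (?, ?, ?)",
    ("A. N. Author", "Example Game", "EXAMPLE.SSD"),  # invented sample row
)


def read_records(connection, query):
    """Yield one result row at a time, like successive INPUT# reads."""
    yield from connection.execute(query)


rows = list(read_records(conn, "SELECT author,game,filename FROM mydb.games"))
for author, game, filename in rows:
    print(author, game, filename)

# Counterpart of "*detach mydb".
conn.execute("DETACH DATABASE mydb")
```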

Postby pau1ie » Fri Mar 02, 2018 1:06 pm

crj wrote:by choosing an HTML table you'd be heading for a world of hurt with content encodings.

The bbcmicro.co.uk/ss.php page is in UTF-8 according to my browser. As I say, Lee already uses it, so it has to work. The only title I had issues with was:

Secret Diary Of Adrian Mole Aged 13¾, The

That works fine, though I think I had to tell the browser the page is UTF-8 to get it to work properly. The ¾ is encoded in UTF-8 as the bytes C2 BE. I suppose that an 8-bit computer would have problems with that, so I would have to consider how to deal with it if I ever get round to building a menu: use } in Mode 7 (which displays as ¾ there), change it to 3/4, or just lose any non-7-bit-ASCII characters. But that is way in the future.
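To make those options concrete, here is a small Python sketch. The function names are invented for illustration, and the Mode 7 variant rests on the assumption stated above that code 0x7D is shown as ¾ in that screen mode.

```python
import re

FRACTION = "\u00be"  # the single character ¾ (U+00BE)
title = "Secret Diary Of Adrian Mole Aged 13" + FRACTION + ", The"

# U+00BE is encoded by UTF-8 as the two bytes C2 BE.
three_quarters_hex = FRACTION.encode("utf-8").hex()


def spell_out(text):
    """Replace the fraction with ASCII '3/4'.

    A space is added after a digit so that 13¾ does not become 133/4.
    """
    text = re.sub(r"(?<=\d)" + FRACTION, " 3/4", text)
    return text.replace(FRACTION, "3/4")


def drop_non_ascii(text):
    """Silently lose every character outside 7-bit ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def mode7_bytes(text):
    """Send '}' (0x7D) wherever the title has ¾ (assumed Mode 7 glyph).

    Raises UnicodeEncodeError if any other non-ASCII character remains.
    """
    return text.replace(FRACTION, "}").encode("ascii")


print(three_quarters_hex)
print(spell_out(title))
print(drop_non_ascii(title))
print(mode7_bytes(title))
```

Note that dropping the character outright turns "13¾" into "13", which changes the title rather than just its appearance.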

Sitting at work as I am now, the HTML table wins because it can be opened in Excel. I am not sure this would be possible with sqlite. Also, both Excel and LibreOffice Calc can open the table directly from the web page, which is nice.

Postby crj » Fri Mar 02, 2018 4:14 pm

As I say, you might be OK in this application.

However, I would ask: is this the same title? "Secret Diary Of Adrian Mole Aged 13³⁄₄, The"

Also, I would ask how you cope if you fetch the page in a hotel room and the hotel's "transparent" proxy transcodes to iso-8859-1.

Also, do any titles contain double-spaces? If so, it would be prudent to transform the extra spaces into en spaces, then convert them back afterwards, giving consideration to whether it's necessary to cope with the &ensp; HTML entity as well as the bare Unicode character.
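A minimal Python sketch of that round trip, under the assumption that no genuine title contains an en space of its own. The function names are made up; html.unescape is the standard-library call that decodes entities such as &ensp;.

```python
import html
import re

EN_SPACE = "\u2002"


def protect_spaces(text):
    """Turn every space that directly follows another space into an en space."""
    return re.sub(r"(?<= ) ", EN_SPACE, text)


def restore_spaces(raw_cell):
    """Undo protect_spaces on raw (still entity-encoded) cell text.

    html.unescape decodes &ensp; (and any other entity) first, so both the
    entity and the bare U+2002 character end up as ordinary spaces.
    """
    return html.unescape(raw_cell).replace(EN_SPACE, " ")


original = "Two  spaces   here"
protected = protect_spaces(original)
print(repr(protected))
print(restore_spaces(protected) == original)
```

If the cell text has already been entity-decoded by a parser, restore_spaces should not be run on it unchanged, because unescaping twice would corrupt a title that literally contains something like "&amp;amp;".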

Also, in a retrocomputing context, how confident are you that any data you store as HTML will be interpreted the same way by tools available in 2048?

HTML is a format for presenting information to a user, not for robust, durable storage. Personally, I'd want to be sure of getting all my bits back, so would use sqlite or similar. /-8

Postby pau1ie » Fri Mar 02, 2018 5:00 pm

crj wrote:is this the same title?


That is a philosophical question. I think I am going to say I don't care. I suspect Lee typed 3/4 into a spreadsheet and it was auto-corrected to ¾. Does that matter? Would it have been better to leave it? Would ³⁄₄ have been better? I don't think I mind. For my purposes they are the same. (For those following along, the original has one character, crj's example has three. Try highlighting it.)
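If "the same for my purposes" ever needs to be checked by a program, Unicode compatibility normalisation is one way to do it. A short Python sketch follows; same_title is a made-up helper, and the three-character form is assumed to be superscript three, fraction slash, subscript four.

```python
import unicodedata

one_char = "13\u00be"                # 13¾: VULGAR FRACTION THREE QUARTERS
three_char = "13\u00b3\u2044\u2084"  # 13³⁄₄: superscript 3, fraction slash, subscript 4


def same_title(a, b):
    """Compare two titles after NFKC compatibility normalisation."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(len(one_char), len(three_char))
print(one_char == three_char)
print(same_title(one_char, three_char))
```

NFKC throws away the distinction between the two spellings, so it suits matching and searching rather than storing the title as originally typed.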

crj wrote:iso-8859-1


I expect your point is that (depending on how the proxy mangles things) it will end up looking like this:

Secret Diary Of Adrian Mole Aged 13Â¾, The

You often see this type of problem. I would prefer to use HTTPS to stop the proxy being able to mess with the content I am delivering.
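One common way this goes wrong is UTF-8 bytes being read as ISO-8859-1. A minimal Python sketch of that particular failure, which is only one of the ways a proxy could mangle the page:

```python
original = "Secret Diary Of Adrian Mole Aged 13\u00be, The"  # \u00be is ¾

# Encode correctly as UTF-8, then decode with the wrong charset:
# the two bytes C2 BE come out as two separate characters.
mangled = original.encode("utf-8").decode("iso-8859-1")

# The damage is reversible as long as nothing else has touched the bytes.
repaired = mangled.encode("iso-8859-1").decode("utf-8")

print(mangled)
print(repaired == original)
```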

crj wrote:Also, in a retrocomputing context, how confident are you that any data you store as HTML will be interpreted the same way by tools available in 2048?


Extremely confident. I think it is similar to my being able to open an ASCII text file from my BBC Micro in Notepad now. HTML is pretty much the most common file format; it is widely documented, open and easily accessible. UTF-8 is the most common encoding. The sqlite database format is likely to have changed between now and then, so you would have to find an old version of the code and compile it to be able to read it. Doable, but I am confident that UTF-8 encoded HTML will "just work".

Postby crj » Fri Mar 02, 2018 5:42 pm

pau1ie wrote:Extremely confident. [...] notepad

Funny you should mention Notepad... that's natively UTF-16 and can cause a lot of damage in transcoding UTF-8 back and forth.

pau1ie wrote:I suspect Lee typed 3/4 into a spreadsheet and it was auto-corrected to ¾. Does that matter?


To me it does, because my hope is that somebody, somewhere, will have a definitive archival-grade list. It would be a shame to have something that only got 99% of the way there, because a lot of effort would have to be duplicated before anybody could start working on that last 1%.

Postby pau1ie » Fri Mar 02, 2018 8:43 pm

crj wrote: definitive archival-grade list


This is absolutely not what we are trying to achieve. I do try to be as open with what we are doing as possible. In practice I think the HTML table isn't as bad as you are concerned it might be. Lee uses it to maintain the site, so it feeds back on itself.

The other thing that might help is the mysql dump which I update by hand periodically. This is probably closer to what you want. If you build something that uses it, I will investigate generating it on demand.

It is interesting (though somewhat academic, as I am not interested in going down that route) to wonder what an archival-grade list would mean. Is the title what was written on the front of the cassette, what was displayed when the program was run, or what was displayed in the adverts? Presumably, where they differ, all of these would have to be logged. I think anyone doing this would find they have more than 1% to do, though Lee has done amazing work with the bbcmicro site metadata.

Postby crj » Sat Mar 03, 2018 12:01 am

Being enough of a Douglas Adams fan to have seen the Hitch-Hiker's v. Hitch Hiker's v. Hitchhiker's confusion, I do appreciate that difficulty. Though IMDb does try to preserve all variant names of films, and BoardGameGeek does the same for board games. It would be good if someone, somewhere was doing the same for computer games.

Then again, I'm not volunteering, so I know I can't complain. (-8