Unicode déjà vu (not important)

feedback, comments and suggestions pertaining to the stardot forum
BigEd
Posts: 3431
Joined: Sun Jan 24, 2010 10:24 am
Location: West

Post by BigEd » Tue Sep 15, 2020 6:13 pm

The thread below has an e-acute in its post titles, which is mis-rendered in three places (title, headline, search result title) and looks very much like a Unicode encoding problem:
(I hesitated about posting such a small observation, but Dave Arcadian put me up to it...)

Diminished
Posts: 519
Joined: Fri Dec 08, 2017 9:47 pm

Re: Unicode déjà vu (not important)

Post by Diminished » Tue Sep 15, 2020 6:55 pm

I think it's being UTF-8 encoded ... three times, instead of just once?!


Diminished
Posts: 519
Joined: Fri Dec 08, 2017 9:47 pm

Re: Unicode déjà vu (not important)

Post by Diminished » Tue Sep 15, 2020 7:34 pm

e-acute is Unicode code point U+00E9 ...

If you UTF-8 encode it three times, each time treating the previous output bytes as code points, you get an eight-byte UTF-8 sequence:


Unicode code points                     -> UTF-8
----------------------------------------------------------------------------------
0xe9                                    -> 0xc3 0xa9
0xc3 0xa9                               -> 0xc3 0x83 0xc2 0xa9
0xc3 0x83 0xc2 0xa9                     -> 0xc3 0x83 0xc2 0x83 0xc3 0x82 0xc2 0xa9
Then, if you go here, select "Input type: Freeform numeric", and paste the sequence "0xc3 0x83 0xc2 0x83 0xc3 0x82 0xc2 0xa9", you get:

As raw characters:

é

Which is the corrupted version I'm seeing in my browser.

This explaineth the how, but not the why ...
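
The chain in the table above can be reproduced with a few lines of Python. This is only a sketch of the suspected mechanism, not a statement of what the forum software actually did. It assumes that each extra round misreads the previous bytes as Latin-1 (ISO-8859-1), where every byte value maps to the code point of the same number, which is what the table implies. The function name `misencode` is made up for the illustration.

```python
def misencode(text: str, rounds: int) -> bytes:
    """UTF-8 encode `text`, then re-encode it `rounds - 1` more times,
    each time misreading the previous bytes as Latin-1 code points
    (an assumption about how the damage happened)."""
    data = text.encode("utf-8")
    for _ in range(rounds - 1):
        # Misread every byte as one code point, then encode again.
        data = data.decode("latin-1").encode("utf-8")
    return data


if __name__ == "__main__":
    for n in (1, 2, 3):
        # bytes.hex() with a separator needs Python 3.8 or later.
        print(n, misencode("\u00e9", n).hex(" "))
```

Three rounds on U+00E9 give the same eight bytes as the last row of the table.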

BigEd
Posts: 3431
Joined: Sun Jan 24, 2010 10:24 am
Location: West

Re: Unicode déjà vu (not important)

Post by BigEd » Tue Sep 15, 2020 7:41 pm

There was a big forum migration (maybe this one?) which left a bit of Unicode wreckage, which was then, I think, found and cleaned up. But just possibly only post texts were fixed, not title texts.
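
For what it's worth, a cleanup pass over leftover titles could look roughly like the sketch below. It is a guess at the general technique, not the script that was used in the migration cleanup. It assumes the damage is repeated UTF-8 encoding with the bytes misread as Latin-1 in between, and the name `unmangle` and the `max_rounds` limit are invented for the example.

```python
def unmangle(text: str, max_rounds: int = 5) -> str:
    """Undo repeated 'UTF-8 bytes misread as Latin-1' damage.

    Peels off one layer per round and stops as soon as a round fails
    or changes nothing. Assumes Latin-1 was the wrong decoding."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not (or no longer) mis-decoded UTF-8: leave it alone.
            break
        if candidate == text:
            # Pure ASCII round-trips unchanged, nothing to repair.
            break
        text = candidate
    return text


if __name__ == "__main__":
    # Build a triple-encoded title, as it would appear when the stored
    # bytes are finally decoded as UTF-8 by the browser.
    raw = "d\u00e9j\u00e0 vu".encode("utf-8")
    for _ in range(2):
        raw = raw.decode("latin-1").encode("utf-8")
    damaged = raw.decode("utf-8")
    print(unmangle(damaged) == "d\u00e9j\u00e0 vu")
```

One caveat: a title that legitimately contains a sequence such as Ã followed by © would be "repaired" by mistake, so a real run would want a human to check the changes.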

Diminished
Posts: 519
Joined: Fri Dec 08, 2017 9:47 pm

Re: Unicode déjà vu (not important)

Post by Diminished » Tue Sep 15, 2020 7:45 pm

BigEd wrote:
Tue Sep 15, 2020 7:41 pm
There was a big forum migration (maybe this one?) which left a bit of Unicode wreckage, which was then, I think, found and cleaned up. But just possibly only post texts were fixed, not title texts.
Ah yes, that would do it.

If you searched for Ã, maybe you could find all of them, but the search function (understandably) prohibits me from doing that.

Edit: ah, I think you used Google. So that's probably all of them.

Carry on!

BeebMaster
Posts: 3626
Joined: Sun Aug 02, 2009 5:59 pm
Location: Lost in the BeebVault!

Re: Unicode déjà vu (not important)

Post by BeebMaster » Sun Sep 20, 2020 2:51 pm

It is important though, because character encoding is such a problem in so many ways! I have all my website HTML pages using ISO-8859-15 so that "quarter" and "half" symbols and pound signs all display correctly on the same page.

And on Linux there must be a difference between the character encodings used by Thunar and Dolphin, because I saved a file with a quarter symbol in the filename and it displayed OK in the directory window, but I couldn't open it any more in Dolphin, which claimed that the file didn't exist!

There's something wrong with "watch" in Pi Linux which displays a funny symbol instead of a blank character:
Screenshot_2020-09-20_14-47-30.png
I had to do quite a bit of jiggery-pokery in my startup sequence to get it to resize and move to top right automatically:
Screenshot_2020-09-20_14-47-17.png