Unicode shenanigans:
Martine Ã©crit en UTF-8
On my feed aggregator haskell.pl-a.net, I occasionally saw posts with broken titles like this (from ezyang’s blog):
Whatâ€™s different this time? LLM edition
Yesterday I decided to do something about it.
Locating the problem
Tracing back where it came from, that title was already broken when Planet Haskell sent it; Planet Haskell is itself a feed aggregator for blogs. The blog it comes from produces a good, unbroken title, so the blame lies with Planet Haskell. It’s probably a misconfigured locale. Maybe someone will fix it. It seems to be running archaic software on an old machine, the kind of stuff I wouldn’t deal with myself, so I won’t ask someone else to.
In any case, this mistake can be fixed after the fact. Mis-encoded text is such a ubiquitous issue that there are nicely packaged solutions out there, like ftfy.
ftfy has been used as a data processing step in major NLP research, including OpenAI’s original GPT.
But my hobby site is written in OCaml and I would rather have fun solving this encoding problem than figure out how to install a Python program and call it from OCaml.
Explaining the problem
This is the typical situation where a program is assuming the wrong text encoding.
Text encodings
A quick summary for those who don’t know about text encodings.
Humans read and write sequences of characters, while computers talk to each other in sequences of bytes. If Alice writes a blog and Bob wants to read it from across the world, the characters that Alice writes must be encoded into bytes so her computer can send them over the internet to Bob’s computer, and Bob’s computer must decode those bytes to display them on his screen. The mapping between sequences of characters and sequences of bytes is called an encoding.
Multiple encodings are possible, but it’s not always obvious which encoding to use to decode a given byte string. There are good and bad reasons for this, but the net effect is that many text-processing programs arbitrarily guess and assume the encoding in use, and sometimes they assume wrong.
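To make this concrete, here is a tiny OCaml sketch of my own (mis_decode_as_latin1 is a made-up name) that gets the encoding wrong on purpose: it reads each byte of UTF-8 text as one Latin-1 character, whose code point is exactly the byte’s value, then re-encodes the result in UTF-8.

(* Decode each byte as Latin-1 (code point = byte value), then
   re-encode the result in UTF-8. *)
let mis_decode_as_latin1 (bytes : string) : string =
  let buf = Buffer.create (2 * String.length bytes) in
  String.iter (fun b -> Buffer.add_utf_8_uchar buf (Uchar.of_char b)) bytes;
  Buffer.contents buf

let () = print_endline (mis_decode_as_latin1 "\xC3\xA9")
(* é is bytes 0xC3 0xA9 in UTF-8; this prints Ã© *)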
Back to the problem
UTF-8 is the most prevalent encoding nowadays [1]. I’d be surprised if any of the blogs aggregated by Planet Haskell used anything else, which is ironic considering the issue we’re dealing with.
- A blog using UTF-8 encodes the right single quote [2] “’” as three consecutive bytes (226, 128, 153) in its RSS or Atom feed.
- The culprit, Planet Haskell, read those bytes but wrongly assumed an encoding, different from UTF-8, in which each byte corresponds to one character.
- It applied some transformations to the decoded text (extracting the title and body, and putting them on a webpage with other blogs).
- It encoded the final result in UTF-8.
The final encoding doesn’t really matter, as long as everyone else downstream agrees with it. The point is that Planet Haskell outputs the three characters “â€™” in place of the right single quote “’”, all because UTF-8 represents “’” with three bytes.
In spite of their differences, most encodings in practice agree at least about ASCII characters, in the range 0-127, which is sufficient for the majority of English-language writing if you can compromise on details such as conflating the apostrophe with the single quote. That’s why, in the title “What’s different this time?”, everything but one character was transferred fine.
Solving the problem
The fix is simple: replace “â€™” with “’”. Of course, we also want to do that for all the other characters that were mis-encoded the same way: those are exactly the non-ASCII Unicode characters. The more general fix is to invert Planet Haskell’s decoding logic. Thank the world that this mistake can be reversed to begin with. If information had been lost in the mis-encoding, I might have been forced to use one of those dreadful LLMs to reconstruct titles [3].
1. Decode Planet Haskell’s output in UTF-8.
2. Encode each character as a byte to recover the original output from the blog.
3. Decode the original output correctly, in UTF-8.
There is one missing detail: what encoding to use in step 2? I first tried the naive thing: each character is canonically a Unicode code point, which is a number between 0 and 1114111, and I just hoped that the ones that occurred would fit in the range 0-255. That amounts to the hypothesis that Planet Haskell decodes blog posts as Latin-1, in which byte n encodes exactly code point n. That seemed likely enough, but you will have guessed correctly that the naive thing did not reconstruct the right single quote in this case. The Latin-1 hypothesis was proven false.
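For the record, the naive version of step 2 is a one-liner (my reconstruction of that first attempt; latin1_hack is a made-up name):

(* Latin-1 hypothesis: a character's byte is its code point,
   which had better fit in 0-255. *)
let latin1_hack (c : Uchar.t) : int = Uchar.to_int c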
As it turns out, the euro sign “€” and the trademark symbol “™” are not in the Latin-1 alphabet. They are code points 8364 and 8482 in Unicode, well outside the range 0-255. Planet Haskell has to be using an encoding that features these two symbols. I needed to find which one.
Faffing about, I came across the Wikipedia article on Western Latin character sets which lists a comparison table. How convenient. I looked up the two symbols to find what encoding had them, if any. There were two candidates: Windows-1252 and Macintosh. Flip a coin. It was Windows-1252.
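As a sanity check, decoding the UTF-8 bytes of “What’s” (226, 128, 153 is 0xE2 0x80 0x99 in hexadecimal) one byte at a time with Windows-1252 reproduces the broken title exactly. This snippet is mine, not Planet Haskell’s; cp1252_char is a stand-in covering only the bytes this example needs.

let cp1252_char (b : char) : string =
  match Char.code b with
  | 0x80 -> "€"                        (* Windows-1252; a control character in Latin-1 *)
  | 0x99 -> "™"                        (* likewise *)
  | 0xE2 -> "â"                        (* same as Latin-1 *)
  | n when n < 0x80 -> String.make 1 b (* ASCII: every candidate encoding agrees *)
  | _ -> assert false                  (* bytes not needed for this example *)

let () =
  "What\xE2\x80\x99s"                  (* What’s, UTF-8-encoded *)
  |> String.to_seq
  |> Seq.map cp1252_char
  |> List.of_seq
  |> String.concat ""
  |> print_endline                     (* prints: Whatâ€™s *)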
Windows-1252 differs from Latin-1 (and thus from Unicode) only in the range whose bytes start with 8 or 9 in hexadecimal: 32 positions holding 27 printable characters plus 5 unused slots. That’s 27 characters I had to map manually back to the range 0-255 according to the Windows-1252 encoding; the remaining characters are mapped for free by Unicode. This data entry task was autocompleted halfway through by Copilot, because of course GPT-* knows Windows-1252 by heart.
(* Map a Unicode character back to the Windows-1252 byte it was decoded
   from. The 27 cases below are the printable characters in the range
   0x80-0x9F, where Windows-1252 differs from Latin-1; for every other
   character the code point is already the byte value. *)
let windows1252_hack (c : Uchar.t) : int =
  match Uchar.to_int c with
  | 0x20AC -> 0x80 (* € *)
  | 0x201A -> 0x82 (* ‚ *)
  | 0x0192 -> 0x83 (* ƒ *)
  | 0x201E -> 0x84 (* „ *)
  | 0x2026 -> 0x85 (* … *)
  | 0x2020 -> 0x86 (* † *)
  | 0x2021 -> 0x87 (* ‡ *)
  | 0x02C6 -> 0x88 (* ˆ *)
  | 0x2030 -> 0x89 (* ‰ *)
  | 0x0160 -> 0x8A (* Š *)
  | 0x2039 -> 0x8B (* ‹ *)
  | 0x0152 -> 0x8C (* Œ *)
  | 0x017D -> 0x8E (* Ž *)
  | 0x2018 -> 0x91 (* ‘ *)
  | 0x2019 -> 0x92 (* ’ *)
  | 0x201C -> 0x93 (* “ *)
  | 0x201D -> 0x94 (* ” *)
  | 0x2022 -> 0x95 (* • *)
  | 0x2013 -> 0x96 (* – *)
  | 0x2014 -> 0x97 (* — *)
  | 0x02DC -> 0x98 (* ˜ *)
  | 0x2122 -> 0x99 (* ™ *)
  | 0x0161 -> 0x9A (* š *)
  | 0x203A -> 0x9B (* › *)
  | 0x0153 -> 0x9C (* œ *)
  | 0x017E -> 0x9E (* ž *)
  | 0x0178 -> 0x9F (* Ÿ *)
  | n -> n
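Putting the three steps together: a minimal sketch of the whole repair, assuming OCaml 4.14 or later for String.get_utf_8_uchar (fix_mojibake is my name for it). Characters that map to no Windows-1252 byte are kept as they are rather than failing.

let fix_mojibake (s : string) : string =
  let buf = Buffer.create (String.length s) in
  let i = ref 0 in
  while !i < String.length s do
    (* Step 1: decode Planet Haskell's output as UTF-8. *)
    let d = String.get_utf_8_uchar s !i in
    let u = Uchar.utf_decode_uchar d in
    (* Step 2: recover the Windows-1252 byte the character was decoded from. *)
    let b = windows1252_hack u in
    if b < 0x100 then Buffer.add_char buf (Char.chr b)
    else Buffer.add_utf_8_uchar buf u; (* no such byte: keep the character *)
    i := !i + Uchar.utf_decode_length d
  done;
  (* Step 3 is implicit: OCaml strings are byte strings, and the recovered
     bytes are the blog's original UTF-8. *)
  Buffer.contents buf

On the example title, fix_mojibake turns “Whatâ€™s different this time? LLM edition” back into “What’s different this time? LLM edition”.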
And that’s how I restored the quotes, apostrophes, guillemets, accents, et autres in my feed.
See also
- Mojibake, anyone? from BASHing data 2
Update: When Planet Haskell picked up this post, it fixed the intentional mojibake in the title.
There is no room for this in my mental model. Planet Haskell is doing something wild to parse blog titles.
[1] As of September 2024, UTF-8 is used by 98.3% of surveyed web sites.
[2] The Unicode right single quote is sometimes used as an apostrophe, to much disapproval.
[3] Or I could just query the blogs directly for their titles.