Unicode shenanigans:
Martine Ã©crit en UTF-8
On my feed aggregator haskell.pl-a.net, I occasionally saw posts with broken titles like this (from ezyang’s blog):
Whatâ€™s different this time? LLM edition
Yesterday I decided to do something about it.
Locating the problem
Tracing back where it came from, that title was already broken when Planet Haskell sent it; Planet Haskell is itself a feed aggregator for blogs. The blog it comes from produces a good, unbroken title, so the blame lies with Planet Haskell. It’s probably a misconfigured locale. Maybe someone will fix it. It seems to be running archaic software on an old machine, the kind of stuff I wouldn’t deal with myself, so I won’t ask someone else to.
In any case, this mistake can be fixed after the fact. Mis-encoded text is such a ubiquitous issue that there are nicely packaged solutions out there, like ftfy.
ftfy has been used as a data processing step in major NLP research, including OpenAI’s original GPT.
But my hobby site is written in OCaml and I would rather have fun solving this encoding problem than figure out how to install a Python program and call it from OCaml.
Explaining the problem
This is the typical situation where a program is assuming the wrong text encoding.
Text encodings
A quick summary for those who don’t know about text encodings.
Humans read and write sequences of characters, while computers talk to each other in sequences of bytes. If Alice writes a blog and Bob wants to read it from across the world, the characters that Alice writes must be encoded into bytes so her computer can send them over the internet to Bob’s computer, and Bob’s computer must decode those bytes to display them on his screen. The mapping between sequences of characters and sequences of bytes is called an encoding.
Multiple encodings are possible, but it’s not always obvious which encoding to use to decode a given byte string. There are good and bad reasons for this, but the net effect is that many text-processing programs arbitrarily guess and assume the encoding in use, and sometimes they assume wrong.
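To make this concrete, here is a tiny OCaml sketch of my own (mis_decode_as_latin1 is a made-up name) that gets the encoding wrong on purpose: it reads each byte of UTF-8 text as one Latin-1 character, whose code point is exactly the byte’s value, then re-encodes the result in UTF-8.

(* Decode each byte as Latin-1 (code point = byte value), then
   re-encode the result in UTF-8. *)
let mis_decode_as_latin1 (bytes : string) : string =
  let buf = Buffer.create (2 * String.length bytes) in
  String.iter (fun b -> Buffer.add_utf_8_uchar buf (Uchar.of_char b)) bytes;
  Buffer.contents buf

let () = print_endline (mis_decode_as_latin1 "\xC3\xA9")
(* é is bytes 0xC3 0xA9 in UTF-8; this prints Ã© *)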
Back to the problem
UTF-8 is the most prevalent encoding nowadays [1]. I’d be surprised if any of the blogs aggregated by Planet Haskell used anything else, which is ironic considering the issue we’re dealing with.
- A blog using UTF-8 encodes the right single quote [2] “’” as three consecutive bytes (226, 128, 153) in its RSS or Atom feed.
- The culprit, Planet Haskell, read those bytes but wrongly assumed an encoding, different from UTF-8, in which each byte corresponds to one character.
- It applied some transformations to the decoded text (extracting the title and body, and putting them on a webpage with other blogs).
- It encoded the final result in UTF-8.
The final encoding doesn’t really matter, as long as everyone else downstream agrees with it. The point is that Planet Haskell outputs the three characters “â€™” in place of the right single quote “’”, all because UTF-8 represents “’” with three bytes.
In spite of their differences, most encodings in practice agree at least about ASCII characters, in the range 0-127, which is sufficient for the majority of English-language writing if you can compromise on details such as conflating the apostrophe with the single quote. That’s why, in the title “What’s different this time?”, everything but one character was transferred fine.
Solving the problem
The fix is simple: replace “â€™” with “’”. Of course, we also want to do that for all the other characters that were mis-encoded the same way: those are exactly the non-ASCII Unicode characters. The more general fix is to invert Planet Haskell’s decoding logic. Thank the world that this mistake can be reversed to begin with. If information had been lost in the mis-encoding, I might have been forced to use one of those dreadful LLMs to reconstruct titles [3].
1. Decode Planet Haskell’s output in UTF-8.
2. Encode each character as a byte to recover the original output from the blog.
3. Decode the original output correctly, in UTF-8.
There is one missing detail: what encoding to use in step 2? I first tried the naive thing: each character is canonically a Unicode code point, which is a number between 0 and 1114111, and I just hoped that the ones that occurred would fit in the range 0-255. That amounts to the hypothesis that Planet Haskell decodes blog posts as Latin-1, in which byte n encodes exactly code point n. That seemed likely enough, but you will have guessed correctly that the naive thing did not reconstruct the right single quote in this case. The Latin-1 hypothesis was proven false.
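For the record, the naive version of step 2 is a one-liner (my reconstruction of that first attempt; latin1_hack is a made-up name):

(* Latin-1 hypothesis: a character's byte is its code point,
   which had better fit in 0-255. *)
let latin1_hack (c : Uchar.t) : int = Uchar.to_int c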
As it turns out, the euro sign “€” and the trademark symbol “™” are not in the Latin-1 alphabet. They are code points 8364 and 8482 in Unicode, well outside the range 0-255. Planet Haskell has to be using an encoding that features these two symbols. I needed to find which one.
Faffing about, I came across the Wikipedia article on Western Latin character sets which lists a comparison table. How convenient. I looked up the two symbols to find what encoding had them, if any. There were two candidates: Windows-1252 and Macintosh. Flip a coin. It was Windows-1252.
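As a sanity check, decoding the UTF-8 bytes of “What’s” (226, 128, 153 is 0xE2 0x80 0x99 in hexadecimal) one byte at a time with Windows-1252 reproduces the broken title exactly. This snippet is mine, not Planet Haskell’s; cp1252_char is a stand-in covering only the bytes this example needs.

let cp1252_char (b : char) : string =
  match Char.code b with
  | 0x80 -> "€"                        (* Windows-1252; a control character in Latin-1 *)
  | 0x99 -> "™"                        (* likewise *)
  | 0xE2 -> "â"                        (* same as Latin-1 *)
  | n when n < 0x80 -> String.make 1 b (* ASCII: every candidate encoding agrees *)
  | _ -> assert false                  (* bytes not needed for this example *)

let () =
  "What\xE2\x80\x99s"                  (* What’s, UTF-8-encoded *)
  |> String.to_seq
  |> Seq.map cp1252_char
  |> List.of_seq
  |> String.concat ""
  |> print_endline                     (* prints: Whatâ€™s *)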
Windows-1252 differs from Latin-1 (and thus from Unicode) only in the range whose bytes start with 8 or 9 in hexadecimal: 32 positions holding 27 printable characters plus 5 unused slots. That’s 27 characters I had to map manually back to the range 0-255 according to the Windows-1252 encoding; the remaining characters are mapped for free by Unicode. This data entry task was autocompleted halfway through by Copilot, because of course GPT-* knows Windows-1252 by heart.
(* Map a Unicode character back to the Windows-1252 byte it was decoded
   from. The 27 cases below are the printable characters in the range
   0x80-0x9F, where Windows-1252 differs from Latin-1; for every other
   character the code point is already the byte value. *)
let windows1252_hack (c : Uchar.t) : int =
  match Uchar.to_int c with
  | 0x20AC -> 0x80 (* € *)
  | 0x201A -> 0x82 (* ‚ *)
  | 0x0192 -> 0x83 (* ƒ *)
  | 0x201E -> 0x84 (* „ *)
  | 0x2026 -> 0x85 (* … *)
  | 0x2020 -> 0x86 (* † *)
  | 0x2021 -> 0x87 (* ‡ *)
  | 0x02C6 -> 0x88 (* ˆ *)
  | 0x2030 -> 0x89 (* ‰ *)
  | 0x0160 -> 0x8A (* Š *)
  | 0x2039 -> 0x8B (* ‹ *)
  | 0x0152 -> 0x8C (* Œ *)
  | 0x017D -> 0x8E (* Ž *)
  | 0x2018 -> 0x91 (* ‘ *)
  | 0x2019 -> 0x92 (* ’ *)
  | 0x201C -> 0x93 (* “ *)
  | 0x201D -> 0x94 (* ” *)
  | 0x2022 -> 0x95 (* • *)
  | 0x2013 -> 0x96 (* – *)
  | 0x2014 -> 0x97 (* — *)
  | 0x02DC -> 0x98 (* ˜ *)
  | 0x2122 -> 0x99 (* ™ *)
  | 0x0161 -> 0x9A (* š *)
  | 0x203A -> 0x9B (* › *)
  | 0x0153 -> 0x9C (* œ *)
  | 0x017E -> 0x9E (* ž *)
  | 0x0178 -> 0x9F (* Ÿ *)
  | n -> n
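Putting the three steps together: a minimal sketch of the whole repair, assuming OCaml 4.14 or later for String.get_utf_8_uchar (fix_mojibake is my name for it). Characters that map to no Windows-1252 byte are kept as they are rather than failing.

let fix_mojibake (s : string) : string =
  let buf = Buffer.create (String.length s) in
  let i = ref 0 in
  while !i < String.length s do
    (* Step 1: decode Planet Haskell's output as UTF-8. *)
    let d = String.get_utf_8_uchar s !i in
    let u = Uchar.utf_decode_uchar d in
    (* Step 2: recover the Windows-1252 byte the character was decoded from. *)
    let b = windows1252_hack u in
    if b < 0x100 then Buffer.add_char buf (Char.chr b)
    else Buffer.add_utf_8_uchar buf u; (* no such byte: keep the character *)
    i := !i + Uchar.utf_decode_length d
  done;
  (* Step 3 is implicit: OCaml strings are byte strings, and the recovered
     bytes are the blog's original UTF-8. *)
  Buffer.contents buf

On the example title, fix_mojibake turns “Whatâ€™s different this time? LLM edition” back into “What’s different this time? LLM edition”.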
And that’s how I restored the quotes, apostrophes, guillemets, accents, et autres in my feed.
See also
- Mojibake, anyone? from BASHing data 2
Update: When Planet Haskell picked up this post, it fixed the intentional mojibake in the title.
There is no room for this in my mental model. Planet Haskell is doing something wild to parse blog titles.
[1] As of September 2024, UTF-8 is used by 98.3% of surveyed web sites.
[2] The Unicode right single quote is sometimes used as an apostrophe, to much disapproval.
[3] Or I could just query the blogs directly for their titles.