A Little Program to fix one particular type of mojibake


Has this ever happened to you? You're downloading your daughter's Chinese homework assignment, but the file name gets all up in your mojibake, and the results are nonsense.

Time to do some reverse-mojibake.

The first step in reversing mojibake is figuring out what wrong turn the encoding went through. I took an educated guess and assumed that the file name was encoded in UTF-8, which was then misinterpreted as ANSI. I suspect this type of error is pretty common, so it was my first stab.

To reverse it, therefore, we need to take the Unicode file name, convert it to ANSI bytes, then reinterpret those bytes as UTF-8. Let's try it:

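using System.Text;

class Program
{
  public static void Main(string[] args)
  {
    foreach (var file in args)
    {
      // Undo the wrong decoding: turn the garbled name back into the bytes
      // it came from. Encoding.Default is the system ANSI code page on
      // .NET Framework; on .NET Core and later it is always UTF-8, so there
      // you would have to ask for code page 1252 explicitly.
      var bytes = Encoding.Default.GetBytes(file);
      // Reinterpret those bytes as what they really were: UTF-8.
      var s = Encoding.UTF8.GetString(bytes);
      // Rename the file to the recovered name.
      System.IO.File.Move(file, s);
    }
  }
}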

I take each file name from the command line, convert it to bytes via the default system code page, and then convert those bytes back into a string by reinterpreting them as UTF-8. I then rename the file to the "fixed" name.

Fortunately, this worked. The file name got unscrambled. Here is the garbled name, first as Unicode code points and then as the characters they display as:

U+00E5 U+00AE U+00B6 U+00E5 U+00BA U+00AD U+00E8 U+0081 U+00AF U+00E7 U+00B5 U+00A1 U+00E5 U+2013 U+00AE U+002E U+0070 U+0064 U+0066
å ® ¶ å º (SHY) è (U+0081) ¯ ç µ ¡ å – ® . p d f

Converted to bytes via code page 1252 (Windows Western European, Latin 1), which is the default ANSI code page for the United States:

E5 AE B6 E5 BA AD E8 81 AF E7 B5 A1 E5 96 AE 2E 70 64 66

And then converted back to Unicode via UTF-8:

U+5BB6 U+5EAD U+806F U+7D61 U+55AE U+002E U+0070 U+0064 U+0066
家 庭 聯 絡 單 . p d f

Et voilà.
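The same round trip can be sketched outside .NET. What follows is a minimal Python version (Python rather than the C# used above, and a sketch rather than the program from this article) that reproduces the byte and code point rows shown earlier. The names `ansi_bytes`, `unscramble` and `C1_GAPS` are made up for illustration. One assumption is baked in: Python's `cp1252` codec leaves five byte values unassigned, whereas Windows maps them to the control characters with the same number, so the helper passes characters such as U+0081 through by hand.

```python
# Sketch only: undo "UTF-8 misread as code page 1252" for one file name.

# Byte values that Python's cp1252 codec leaves unassigned. Windows maps
# them to the control characters with the same number (U+0081 and so on).
C1_GAPS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def ansi_bytes(text):
    """Encode text as code page 1252 the way Windows would (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        if ord(ch) in C1_GAPS:
            out.append(ord(ch))         # pass the control character through
        else:
            out += ch.encode("cp1252")  # raises UnicodeEncodeError if not in 1252
    return bytes(out)


def unscramble(name):
    """Return the repaired name; raises UnicodeError if the guess was wrong."""
    return ansi_bytes(name).decode("utf-8")


# The garbled name from the article, written as escapes because two of its
# characters (U+00AD and U+0081) are invisible.
garbled = (
    "\u00e5\u00ae\u00b6\u00e5\u00ba\u00ad\u00e8\u0081\u00af"
    "\u00e7\u00b5\u00a1\u00e5\u2013\u00ae.pdf"
)

fixed = unscramble(garbled)

# Print the intermediate bytes and the recovered code points.
print(ansi_bytes(garbled).hex(" ").upper())
print(" ".join(f"U+{ord(c):04X}" for c in fixed))
```

If the guess about the encoding is wrong, `unscramble` raises instead of returning a half-fixed name, which is a convenient place to skip the rename.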

Comments (28)
  1. Ray Koopa says:

    Can it be that ZIPs created with File Explorer have a similar problem? A South Korean guy sent me a huge zipped package of files. Opening them here resulted in completely garbled file names all over. I had to set my computer’s language for non-Unicode programs to Korean to make it work.

    That also gave me this weird struck-through W for backslashes in paths *insert oldnewthing link here* so you can tell I didn’t leave it like this for long.

    1. pc says:

      I believe ZIP predates Unicode by quite a bit, and while they have finally added extensions to the format to allow for Unicode filenames (added to the Spec in 2006 according to Wikipedia), it wouldn’t surprise me if not all ZIP programs implemented them correctly, if at all.

      1. cheong00 says:

        And even if they implemented Unicode support for ZIP files, there is more than one way to store a Unicode filename in a ZIP, so the software has to support decoding both in order to extract filenames correctly.

        1) Introduced in v4.1.0, set field 0x0008 Reserved for extended language encoding data (PFS), then treat the normal filename fields as in that encoding.
        2) Introduced in v6.3.0, store unicode filenames in InfoZip Unicode Path Extra Field 0x7075

    2. Darran Rowe says:

      The weird strike-through W symbol should be the currency symbol. https://en.wikipedia.org/wiki/Won_sign is what you saw, right?
      The East Asian code pages made the decision to use the \ for the currency symbol.

      1. Ray Koopa says:

        Yep, I was jokingly just describing it as a weird W. Ray wrote an article about it; https://blogs.msdn.microsoft.com/oldnewthing/20051014-20/?p=33753

      2. DWalker07 says:

        A backslash for the currency symbol? (That’s what I see in your message, on Windows 7 Enterprise, IE 11.)

        1. DWalker07 says:

          In Darran’s message. :-)

  2. PastorGL says:

    Кракозябры are fairly common for Cyrillic users. Even these days Microsoft Rewards greets me as -É-+-¦-¦-ü-¦-¦ in their emails.

    1. ender says:

      Microsoft somehow managed to transcribe the č’s in my last name to e’s (with nothing else), which would suggest they somehow converted the input encoding to CP1250, read that as CP1252, and then removed all diacritics.

  3. guest says:

    A hint is that the mojibake characters look like they come in groups of 3: for example, 0xE5 (å) appears in positions 0, 3, and 12 (all multiples of 3). For East Asian languages this pattern usually indicates UTF-8. If it were UTF-16, they would be predominantly 2 bytes per character. If it were some other encoding (e.g. GB2312), then there would be no easily discernible pattern.

    1. Seeing an accented ‘a’ as the first character in the group is a pretty good sign that you’re looking at UTF-8 that’s being parsed as code page 1252. It’s one of the patterns that you recognize after seeing mojibake enough times.

    2. Dan says:

      Though, for European languages (or to be exact, characters U+0080 thru U+07FF), the mojibake will come in groups of 2 instead of 3.

      1. French Guy says:

        The first character in each sequence gives you a narrower range. You won’t get the same starting characters for Latin (U+00C2 to U+00C9), Greek (U+00CD to U+00CF) or Cyrillic (U+00D0 to U+00D3).

  4. Al go says:

    Is this CLR week?

  5. Yuri Khan says:

    On non-Windows systems, for this kind of task, we write not Little Programs but shell one-liners.

    $ for i in *; do mv “$i” “$(echo $i | iconv -f utf-8 -t windows-1252 | iconv -f utf-8)”; done

    (Remember, Shell One-Liners do little to no error checking.)

    1. Yuri Khan says:

      (The blog engine broke that one-liner by auto-typographing the quotation marks.)

    2. Ray Koopa says:

      Tsk tsk, you could’ve posted a stupidly looking Powershell command for it, but yet you decide to use this weird free operating system ;D (sarcasm intended)

    3. xcomcmdr says:

      s/non-Windows/GNU/Linux and compatible

      Non-Windows can also mean AROS, Amiga OS, OS/2, etc…

      1. Yuri Khan says:

        Okay, nitpicker’s corner: the technically correct name is POSIX.

  6. Jan De Kock says:

    I remember at an old job of mine, I had to fix erroneous entries in the database. It was almost the same issue, but repeated multiple times: UTF-8 strings (common in French) interpreted as ANSI, then saved as UTF-8, again and again.

    That gave cool strings like “é”. Basically, I had to look for all the text fields in the database, loop over them, correct the content, and show the difference before allowing a correction.

    Now that I remember, I still have it on github somewhere: https://gist.github.com/jandk/9093978

  7. Chris Chilvers says:

    Another very common variation of this is using PHP with MySQL. It’s really easy to set up the connector incorrectly so that it thinks you have Latin-1 text (when really it’s UTF-8 from the browser), which results in UTF-8 being interpreted as Latin-1 and then converted to UTF-8.

    It made me appreciate MSSQL and .Net where pretty much all the libraries have agreed on the character encoding and you don’t have to spend all your time making sure every step in the chain has been configured to use the right encoding.

  8. smf says:

    As an English only speaker/reader/writer, I probably wouldn’t notice if the file name was corrupt anyway. I’m probably not alone, which is one of the reasons these problems persist.

    1. Boris says:

      Anyone would know enough to realize that this is not a file name in any language:

      å ® ¶ å º ­ è  ¯ ç µ ¡ å – ® . p

      1. Joshua says:

        My mind immediately went Korean. Too much StarCraft. When the Koreans chat back and forth it looks like that. I can’t tell it’s not a single byte charset though, just that it’s not mine.

        1. Boris says:

          Sorry, I’m not sure what you mean. Are you saying you can’t tell the difference between Hangul and gibberish?

          https://en.wikipedia.org/wiki/Hangul

          If you mean that you’re seeing corrupted Hangul, then nobody is expecting you to know what the original script was without more context, like Raymond here knew that it was Chinese. Smf’s assertion was that speakers of only one language can’t distinguish foreign-language writing from a sequence of random-looking characters, which isn’t something I can imagine.

          1. He’s just saying that this garbled string reminds him of how chat text coming from Korean StarCraft players would end up looking on his screen, due to whatever mojibake StarCraft did.

          2. Joshua says:

            The mojibake of StarCraft is to treat foreign strings as though they’re in your own locale. So I was seeing whatever the Korean codes look like in my own locale. Since my “solution” for Windows 8+ crashing StarCraft randomly (they’re fighting over the graphics card) was to run the thing under Wine, I’m not absolutely confident my code page is Windows-1252.

  9. For a while, a certain email client had a mojibake bug where it would inexplicably make the following transformations:

    — (U+2014 EM DASH) became ‹ (U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK)
    ‘ (U+2018 LEFT SINGLE QUOTATION MARK) became Œ (U+0152 LATIN CAPITAL LIGATURE OE)
    ’ (U+2019 RIGHT SINGLE QUOTATION MARK) became ¹ (U+00B9 SUPERSCRIPT ONE)
    “ (U+201C LEFT DOUBLE QUOTATION MARK) became ³ (U+00B3 SUPERSCRIPT THREE)
    ” (U+201D RIGHT DOUBLE QUOTATION MARK) became ² (U+00B2 SUPERSCRIPT TWO)
    … (U+2026 HORIZONTAL ELLIPSIS) became Š (U+0160 LATIN CAPITAL LETTER S WITH CARON)

    I was never able to figure out the type of mojibake error that would cause those transformations. The only thing they obviously had in common was that the source characters were all in the 0x80-0x9F block of Windows-1252. But it seems that more recent versions of that email client have fixed that bug, as I haven’t seen those particular mojibake in a while.
