The crazy world of stripping diacritics


Today's Little Program strips diacritics from a Unicode string. Why? Hey, I said that Little Programs require little to no motivation. It might come in handy in a spam filter, since it was popular, at least for a time, to put random accent marks on spam subject lines in order to sneak past keyword filters. (It doesn't seem to be popular any more.)

This is basically a C-ization of the C# code originally written by Michael Kaplan. Don't forget to read the follow-up discussion, which notes that this technique can produce strange results.

First, let's create our dialog box. Note that I intentionally give it a huge font so that the diacritics are easier to see.

// scratch.h

#define IDD_SCRATCH 1
#define IDC_SOURCE 100
#define IDC_SOURCEPOINTS 101
#define IDC_DEST 102
#define IDC_DESTPOINTS 103

// scratch.rc

#include <windows.h>
#include "scratch.h"

IDD_SCRATCH DIALOGEX 0, 0, 320, 88
STYLE DS_MODALFRAME | WS_POPUP | WS_CAPTION | WS_SYSMENU
Caption "Stripping diacritics"
FONT 20, "MS Shell Dlg"
BEGIN
    LTEXT "Original:", -1, 4, 8, 38, 10
    EDITTEXT IDC_SOURCE, 46, 6, 270, 12, ES_AUTOHSCROLL
    LTEXT "", IDC_SOURCEPOINTS, 46, 22, 270, 12
    LTEXT "Modified:", -1, 4, 40, 38, 10
    EDITTEXT IDC_DEST, 46, 38, 270, 12, ES_AUTOHSCROLL
    LTEXT "", IDC_DESTPOINTS, 46, 54, 270, 12
    DEFPUSHBUTTON "OK", IDOK, 266, 70, 50, 14
END

Now the program that uses the dialog box.

// scratch.cpp

#define STRICT
#define UNICODE
#define _UNICODE
#include <windows.h>
#include <windowsx.h>
#include <strsafe.h>
#include "scratch.h"

#define MAXSOURCE 64

void SetDlgItemCodePoints(HWND hwnd, int idc, PCWSTR psz)
{
  wchar_t szResult[MAXSOURCE * 4 * 5];
  szResult[0] = 0;
  PWSTR pszResult = szResult;
  size_t cchResult = ARRAYSIZE(szResult);
  HRESULT hr = S_OK;
  for (; SUCCEEDED(hr) && *psz; psz++) {
    wchar_t szPoint[6];
    hr = StringCchPrintf(szPoint, ARRAYSIZE(szPoint), L"%04x ", *psz);
    if (SUCCEEDED(hr)) {
      hr = StringCchCatEx(pszResult, cchResult, szPoint, &pszResult, &cchResult, 0);
    }
  }
  SetDlgItemText(hwnd, idc, szResult);
}

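The SetDlgItemCodePoints function takes a UTF-16 string and prints all the code units. These are the same as the code points except for characters outside the Basic Multilingual Plane, which show up as a surrogate pair. This is just to help visualize the result; it's not part of the actual diacritic-removal algorithm.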
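
To see what that output looks like without building the dialog, here is a minimal portable sketch of the same formatting loop. It swaps the Win32 string helpers for std::snprintf and std::string, and the name FormatCodeUnits is made up for illustration.

```cpp
#include <cstdio>
#include <string>

// Portable stand-in for SetDlgItemCodePoints (hypothetical name):
// formats each UTF-16 code unit as four hex digits plus a space.
std::string FormatCodeUnits(const std::u16string& text)
{
  std::string result;
  for (char16_t unit : text) {
    char point[8]; // 4 hex digits + space + terminator, with room to spare
    std::snprintf(point, sizeof(point), "%04x ", static_cast<unsigned>(unit));
    result += point;
  }
  return result;
}
```

For example, FormatCodeUnits(u"\U0001D595") produces "d835 dd95 ", because UTF-16 stores that character as two surrogate code units.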

void OnUpdate(HWND hwnd)
{
  wchar_t szSource[MAXSOURCE];
  GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));
  wchar_t szDest[MAXSOURCE * 4];

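  // Decompose into Normalization Form KD so that each diacritic
  // becomes its own combining character. (Link with normaliz.lib.)
  int cchActual = NormalizeString(NormalizationKD,
                                  szSource, -1,
                                  szDest, ARRAYSIZE(szDest));
  if (cchActual <= 0) szDest[0] = 0;

  // Get the CT_CTYPE3 classification of each decomposed character.
  WORD rgType[ARRAYSIZE(szDest)];
  GetStringTypeW(CT_CTYPE3, szDest, -1, rgType);

  // Copy the string over itself, skipping the nonspacing characters.
  PWSTR pszWrite = szDest;
  for (int i = 0; szDest[i]; i++) {
    if (!(rgType[i] & C3_NONSPACING)) {
      *pszWrite++ = szDest[i];
    }
  }
  *pszWrite = 0;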

  SetDlgItemText(hwnd, IDC_DEST, szDest);
  SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);
  SetDlgItemCodePoints(hwnd, IDC_DESTPOINTS, szDest);
}

Okay, here's where the actual work happens. We put the source string into Normalization Form KD. This decomposes the diacritics so that we can identify them with GetStringTypeW and then strip them out.
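
The stripping step on its own can be sketched in portable C++, with two loud assumptions: the input is already decomposed (standard C++ has no normalization function), and "nonspacing" is approximated by the Combining Diacritical Marks block U+0300 through U+036F rather than by asking GetStringTypeW. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumption: the Combining Diacritical Marks block stands in for
// "nonspacing". GetStringTypeW classifies many more characters than this.
bool IsCombiningDiacritic(char16_t unit)
{
  return unit >= 0x0300 && unit <= 0x036F;
}

// Removes combining marks from text that is ALREADY decomposed.
// Same read/write compaction as the loop in OnUpdate.
std::u16string StripCombiningMarks(std::u16string text)
{
  std::size_t write = 0;
  for (std::size_t read = 0; read < text.size(); read++) {
    if (!IsCombiningDiacritic(text[read])) {
      text[write++] = text[read];
    }
  }
  text.resize(write);
  return text;
}
```

Under these assumptions, StripCombiningMarks(u"A\u030A") yields u"A", while a precomposed U+00C5 passes through untouched because nothing decomposed it first.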

Of course, in real life, you wouldn't hard-code the array sizes like I did here, but this is just a Little Program, and Little Programs are allowed to take shortcuts.

The rest of the program is just a framework to get into that function.

INT_PTR CALLBACK DlgProc(HWND hwnd, UINT wm,
                         WPARAM wParam, LPARAM lParam)
{
  switch (wm)
  {
  case WM_INITDIALOG:
    return TRUE;

  case WM_COMMAND:
    switch (GET_WM_COMMAND_ID(wParam, lParam)) {
    case IDC_SOURCE:
      switch (GET_WM_COMMAND_CMD(wParam, lParam)) {
      case EN_UPDATE:
        OnUpdate(hwnd);
        break;
      }
      break;
    case IDOK:
      EndDialog(hwnd, 0);
      return TRUE;
    }
    break;

  case WM_CLOSE:
    EndDialog(hwnd, 0);
    return TRUE;
  }

  return FALSE;
}

int WINAPI wWinMain(HINSTANCE hinst, HINSTANCE hinstPrev,
                   LPWSTR lpCmdLine, int nShowCmd)
{
  DialogBox(hinst, MAKEINTRESOURCE(IDD_SCRATCH), nullptr, DlgProc);
  return 0;
}

Okay, let's take this program for a spin. Here are some interesting characters to try:

Original character | Resulting character
ª 00AA Feminine ordinal indicator | a 0061 Latin small letter a
¹ 00B9 Superscript one | 1 0031 Digit one
½ 00BD Vulgar fraction one half | 1⁄2 0031 2044 0032 Digit one + Fraction slash + Digit two
ı 0131 Latin small letter dotless i | ı 0131 Latin small letter dotless i
Ø 00D8 Latin capital letter O with stroke | Disappears!
ł 0142 Latin small letter l with stroke | ł 0142 Latin small letter l with stroke
ŀ 0140 Latin small letter l with middle dot | 006C 00B7 Latin small letter l + middle dot
æ 00E6 Latin small letter ae | æ 00E6 Latin small letter ae
Ή 0389 Greek capital letter Eta with tonos | Η 0397 Greek capital letter Eta
А 0410 Cyrillic capital letter А | А 0410 Cyrillic capital letter А
Å 00C5 Latin capital letter A with ring above | A 0041 Latin capital letter A
FF21 Fullwidth Latin capital letter A | A 0041 Latin capital letter A
2460 Circled digit one | 1 0031 Digit one
2780 Dingbat circled sans-serif digit one | 2780 Dingbat circled sans-serif digit one
® 00AE Registered sign | ® 00AE Registered sign
24C7 Circled Latin capital letter R | R 0052 Latin capital letter R
𝖕 D835 DD95 Mathematical bold Fraktur small p | p 0070 Latin small letter p
FF6C Halfwidth Katakana letter small Ya | 30E3 Katakana letter small Ya
30E3 Katakana letter small Ya | 30E3 Katakana letter small Ya
30B4 Katakana letter Go | 30B3 Katakana letter Ko
201C Left double quotation mark | 201C Left double quotation mark
201D Right double quotation mark | 201D Right double quotation mark
201E Double low-9 quotation mark | 201E Double low-9 quotation mark
201F Double high-reversed-9 quotation mark | 201F Double high-reversed-9 quotation mark
2033 Double prime | ′′ 2032 2032 Prime + Prime
2035 Reverse prime | 2035 Reverse prime
2039 Single left-pointing angle quotation mark | 2039 Single left-pointing angle quotation mark
« 00AB Left-pointing double angle quotation mark | « 00AB Left-pointing double angle quotation mark
2014 Em-dash | 2014 Em-dash
203C Double exclamation mark | !! 0021 0021 Exclamation mark + Exclamation mark

There are some interesting quirks here. Mind you, this is what the Unicode Consortium says, so if you think they are wrong, you can take it up with them.

The superscript-like characters are converted to their plain versions. Enclosed alphabetics are also converted, but not the ® symbol. Fullwidth forms of Latin letters are converted to their halfwidth equivalents. On the other hand, halfwidth Katakana characters are expanded to their fullwidth equivalents. But small Katakana characters do not convert to their large equivalents.

The Ø disappears completely! What's up with that? The character type for Ø is reported as C3_ALPHA | C3_NONSPACING | C3_DIACRITIC, and since we are removing nonspacing characters, it gets removed. (Why is Ø nonspacing? It occupies space!) For whatever reason, it does not decompose into O + Combining Solidus Overlay. On the other hand, the Polish ł remains intact because it is reported as C3_ALPHA | C3_DIACRITIC. Poland wins and Norway loses?

The diacritic removal ignores linguistic rules. The Swedish Å decomposes into a capital A and a combining ring above, even though in Swedish, the character is considered nondecomposable. (Just like the capital letter Q in English does not decompose into an O and a tail.) Katakana Go suffers a similar ignoble fate, converting to Katakana Ko, which is linguistically nonsensical. But then again, removing diacritics is already linguistically nonsensical. Nonsensical operation is nonsensical.

There is no attempt to unify look-alike characters from different scripts. Look-alike characters in the Greek and Cyrillic alphabets are not mapped to their Latin doppelgängers.

The infamous Turkish dotless i does not turn into a dotted i. (And the lowercase Latin i does not decompose into a combining dot and a dotless i.)

Finally, I tried a selection of punctuation marks. Most of them pass through unchanged, with the exception of the double prime and double exclamation mark which each decompose into a pair of singles. (But double quotation marks do not decompose into a pair of singles.)

Okay, but the goal of this exercise was spam detection, so we are actually interested in mapping as far as possible all the way down to plain ASCII. We'd like to convert, for example, the look-alike characters in the Cyrillic and Greek alphabets to the Latin characters they resemble.

So let's try something else. If we want to convert to ASCII, then just convert to ASCII!

#define CP_ASCII 20127
void OnUpdate(HWND hwnd)
{
  wchar_t szSource[MAXSOURCE];
  GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));
  char szDest[MAXSOURCE * 2];
  int cchActual = WideCharToMultiByte(CP_ASCII, 0, szSource, -1,
                              szDest, ARRAYSIZE(szDest), 0, 0);
  if (cchActual <= 0) szDest[0] = 0;

  SetDlgItemTextA(hwnd, IDC_DEST, szDest);
  SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);
}

We can extend the table above with a new column.

Original character | KD character | ASCII character
ª 00AA Feminine ordinal indicator | a 0061 Latin small letter a | a 0061 Latin small letter a
¹ 00B9 Superscript one | 1 0031 Digit one | 1 0031 Digit one
½ 00BD Vulgar fraction one half | 1⁄2 0031 2044 0032 Digit one + Fraction slash + Digit two | ? No conversion
ı 0131 Latin small letter dotless i | ı 0131 Latin small letter dotless i | i 0069 Latin small letter i
Ø 00D8 Latin capital letter O with stroke | Disappears! | O 004F Latin capital letter O
ł 0142 Latin small letter l with stroke | ł 0142 Latin small letter l with stroke | l 006C Latin small letter l
ŀ 0140 Latin small letter l with middle dot | 006C 00B7 Latin small letter l + middle dot | ? No conversion
æ 00E6 Latin small letter ae | æ 00E6 Latin small letter ae | a 0061 Latin small letter a
Ή 0389 Greek capital letter Eta with tonos | Η 0397 Greek capital letter Eta | ? No conversion
А 0410 Cyrillic capital letter А | А 0410 Cyrillic capital letter А | ? No conversion
Å 00C5 Latin capital letter A with ring above | A 0041 Latin capital letter A | A 0041 Latin capital letter A
FF21 Fullwidth Latin capital letter A | A 0041 Latin capital letter A | A 0041 Latin capital letter A
2460 Circled digit one | 1 0031 Digit one | ? No conversion
2780 Dingbat circled sans-serif digit one | 2780 Dingbat circled sans-serif digit one | ? No conversion
® 00AE Registered sign | ® 00AE Registered sign | R 0052 Latin capital letter R
24C7 Circled Latin capital letter R | R 0052 Latin capital letter R | ? No conversion
𝖕 D835 DD95 Mathematical bold Fraktur small p | p 0070 Latin small letter p | ?? No conversion
FF6C Halfwidth Katakana letter small Ya | 30E3 Katakana letter small Ya | ? No conversion
30E3 Katakana letter small Ya | 30E3 Katakana letter small Ya | ? No conversion
30B4 Katakana letter Go | 30B3 Katakana letter Ko | ? No conversion
201C Left double quotation mark | 201C Left double quotation mark | " 0022 Quotation mark
201D Right double quotation mark | 201D Right double quotation mark | " 0022 Quotation mark
201E Double low-9 quotation mark | 201E Double low-9 quotation mark | " 0022 Quotation mark
201F Double high-reversed-9 quotation mark | 201F Double high-reversed-9 quotation mark | ? No conversion
2033 Double prime | ′′ 2032 2032 Prime + Prime | ? No conversion
2032 Prime | 2032 Prime | ' 0027 Apostrophe
2035 Reverse prime | 2035 Reverse prime | ` 0060 Grave accent
2039 Single left-pointing angle quotation mark | 2039 Single left-pointing angle quotation mark | < 003C Less-than sign
« 00AB Left-pointing double angle quotation mark | « 00AB Left-pointing double angle quotation mark | < 003C Less-than sign
2014 Em-dash | 2014 Em-dash | - 002D Hyphen-minus
203C Double exclamation mark | !! 0021 0021 Exclamation mark + Exclamation mark | ? No conversion
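
The ? in the ASCII column is the conversion's stand-in for a character it could not map. A minimal sketch of just that fallback is below. It is a hypothetical helper, not what WideCharToMultiByte does internally: it performs none of the look-alike substitutions that the table shows (such as ł to l) and simply replaces every non-ASCII UTF-16 code unit with a question mark.

```cpp
#include <string>

// Hypothetical helper: keeps ASCII code units and replaces everything
// else with '?'. No best-fit or look-alike mapping is attempted.
std::string ToStrictAscii(const std::u16string& text)
{
  std::string result;
  for (char16_t unit : text) {
    result += (unit < 0x80) ? static_cast<char>(unit) : '?';
  }
  return result;
}
```

Since the substitution happens per code unit, a character that occupies a surrogate pair comes out as two question marks here, which would be consistent with the ?? shown for the Fraktur p.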

There are some interesting differences here.

Some characters fail to convert to ASCII outright. This is not unexpected for the Japanese characters and is mildly unexpected for the look-alikes in the Cyrillic and Greek alphabets. It is surprising for characters like double prime, double exclamation mark, enclosed alphanumerics, and vulgar fractions, because Normalization Form KD decomposed them into ASCII or near-ASCII characters, but the direct conversion to ASCII refused to use those decompositions.

But the dotless i gets its dot back.

Another weird thing you might notice is that the æ converts to just the a. This goes contrary to the expectations of American English, because words which historically use the æ and œ are largely respelled in American English to use just the e. (Encyclopædia → encyclopedia, fœtus → fetus.) Mysteries abound.

If your real goal is to map every character to its nearest ASCII look-alike, then all these code page games are just beating around the bush. The way to go is to use the Unicode Confusables database. There is a huge data file and instructions on how to use it. There's also a nice Web site that lets you explore the confusables database interactively.
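
As a rough sketch of what a confusables lookup might look like: the table below contains three entries picked by eye for illustration, not copied from the Unicode data file, and the function names are hypothetical. A real implementation would load the full mapping from the published data and follow its instructions.

```cpp
#include <map>
#include <string>

// Hypothetical, hand-picked sample entries. NOT the real confusables data.
const std::map<char32_t, char>& SampleConfusables()
{
  static const std::map<char32_t, char> table = {
    { 0x0410, 'A' }, // Cyrillic capital letter A
    { 0x0397, 'H' }, // Greek capital letter Eta
    { 0x0435, 'e' }, // Cyrillic small letter ie
  };
  return table;
}

// Maps each code point to ASCII: ASCII passes through, known look-alikes
// are replaced, and anything else becomes '?'.
std::string ToAsciiLookalike(const std::u32string& text)
{
  const std::map<char32_t, char>& table = SampleConfusables();
  std::string result;
  for (char32_t cp : text) {
    if (cp < 0x80) {
      result += static_cast<char>(cp);
      continue;
    }
    auto it = table.find(cp);
    result += (it != table.end()) ? it->second : '?';
  }
  return result;
}
```

The sketch works on UTF-32 so that each element is a whole code point and surrogate pairs never enter the picture.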

Or you could just take the sledgehammer approach: If there are a significant number of characters outside the Latin alphabet and punctuation and you are expecting English text, then just reject it as likely spam.
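
The sledgehammer could be as small as this sketch. The function name and the 30% threshold are assumptions made up for the example, and it counts anything outside ASCII as suspicious, which is cruder than "outside the Latin alphabet and punctuation" since it also penalizes accented Latin letters and curly quotes.

```cpp
#include <cstddef>
#include <string>

// Returns true when more than maxFraction of the UTF-16 code units fall
// outside ASCII. The default threshold is a hypothetical tuning knob.
bool LooksLikeNonEnglishSpam(const std::u16string& text,
                             double maxFraction = 0.3)
{
  if (text.empty()) return false;
  std::size_t outside = 0;
  for (char16_t unit : text) {
    if (unit >= 0x80) outside++;
  }
  return static_cast<double>(outside) / text.size() > maxFraction;
}
```

An occasional accented letter in otherwise ASCII text stays under the threshold, while a subject line that is mostly Cyrillic or Greek look-alikes goes over it.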

ಠ_ಠ

Comments (25)
  1. ಠ_ಠ says:

    ಠ_ಠ

  2. Joshua says:

    Looks vaguely like my downmapper from Windows-1252 to ASCII-7 (I forget why I needed it). Oh, and that trailing mark in the article looks like an owl's eyes.

  3. Mark says:

    Joshua: you need to spend more time on the internet.

  4. alegr1 says:

    But how about "coöperation". Does it get converted to "cooperation"?

  5. poizan42 says:

    The categories for 'Ø' seems to be a bug in Windows. Here is the entry from UnicodeData.txt:

    00D8;LATIN CAPITAL LETTER O WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER O SLASH;;;00F8;

    That is U+00D8:

    1 Name: LATIN CAPITAL LETTER O WITH STROKE

    2 General_Category: Lu (Uppercase_Letter)

    3 Canonical_Combining_Class: 0

    4 Bidi_Class: L (Left_To_Right)

    5 Decomposition_Type/Decomposition_Mapping: (none)

    6,7,8 Numeric_Type/Numeric_Value: (none)

    9 Bidi_Mirrored: N (No)

    10 Unicode_1_Name: LATIN CAPITAL LETTER O SLASH

    11 ISO_Comment: (none)

    12 Simple_Uppercase_Mapping: (none)

    13 Simple_Lowercase_Mapping: U+00F8 ('ø')

    14 Simple_Titlecase_Mapping: (none)

    I don't think The Unicode Consortium has ever claimed that 'Ø' was nonspacing nor a diacritic.

  6. Macrosofter says:

    Or as more general sledgehammer approach you could use any good language detection library and reject texts in undetected or unknown to you languages. Of course you will find out that most of such libraries are based on Bayesian filtering which you plan to use for further spam rejection anyway, so language detection becomes sort of redundant.

  7. Nick says:

    "... C-ization of the C# code..." would that be "flattening," to remove the sharp-ness? I'd go with "dulling" but the source of "Sharp" is music notation.

  8. Katie says:

    @Nick, "naturalization" would be a good word for it too. Or maybe C is just the result of stripping the diacritic from C#.

    [+1 for "naturalization." It's a nice double-pun. -Raymond]
  9. alegr1 says:

    "... C-ization of the C# code..."

    That's melodic D-major scale for you.

  10. j b says:

    Re that Norwegian ø:

    There has been, since the days of ISO 646, a Norwegian standard for converting ISO 646-60 to ISO 646 IRV (aka ASCII), saying to convert Æ to E, Ø to O and Å to A. (There is another standard for contexts allowing a change in string length, with the conversions AE, OE and AA.)

    First time I encountered this conversion was when a colleague reported from an international conference where only English keyboards were available (we didn't have personal portables at that time), telling "Jeg horer på en dame ....". The bad thing is that "hore" means "whore" in Norwegian, while "høre" means "listen". So rather than telling "I am listening to a lady..." he told us "I am whoring on a lady...".

  11. Jonathan says:

    In Hebrew, diacritics are optional, so removing them actually does make sense. I think Arabic is the same.

  12. Useful and interesting, thanks!

  13. IanBoyd says:

    I recently had the fortune to write this function (`RemoveDiacritics(str)`). I used an initial similar step:

    - Normalize string into Decomposed (D) form: "Unicode normalization form D, canonical decomposition. Transforms each precomposed character to its canonical decomposed equivalent. For example, Ä becomes A + ¨"

    I then also iterated every code point. But instead i checked if the WCHAR was in the unicode block Combining Diacritical Marks (U+0300..U+036F).

    Of the four Unicode normalization forms:

    - NormalizationC = 1; //canonical composition. For example, A + ¨ becomes Ä  

    - NormalizationD = 2; //canonical decomposition. For example, Ä becomes A + ¨  

    - NormalizationKC = 5; //compatibility composition. For example, A + ¨ + fi + n becomes Ä + f + i + n  

    - NormalizationKD = 6; //compatibility decomposition. For example, Ä + fi + n becomes A + ¨ + f + i + n  

    i chose the "Canonical Decomposition", as i didn't need ligatures decomposed.

  14. Spelling Nat'l Socialist says:

    Fetus wasn't the best example because it was never correctly spelled using the œ character; that was a mistake introduced sometime between 600 and 1600 AD. I don't know if it was corrected intentionally or due to Webster's Americanization efforts, but thankfully the medical world got back to the correct form of the word. Meant more as a fun fact than an actual nitpick.

  15. mikeb says:

    Interesting, but I hope to never have to be responsible for dealing with this kind of functionality. It seems much harder than having to deal correctly with floating point, and I've long since given up on that (about once a year I'll make an attempt to answer what appears to be an easy floating point question on Stackoverflow, and I always end up regretting that decision).

  16. Kevin says:

    @mikeb: Floating point logic is easy: Everything is always wrong unless you've conclusively proved it correct or used an arbitrary-precision library throughout the entire computation.

  17. SomeGuyOnTheInternet says:

    Could you please write a Little Program to remove smart quotes from every computer everywhere? I am sick of my customers searching for O'Brien and not finding O’Brien.

  18. Beldantazar says:

    @ SomeGuyOnTheInternet  How about instead you fix your search system to actually work correctly and understand unicode apostrophes.

  19. Boris says:

    But I don't see the need to blunt any code, seeing as you've been adding more and more C# examples in what started out as "not actually a .NET blog".

    [This was a case of parallel evolution. After I wrote the program, I realized that Michael Kaplan already wrote it. -Raymond]
  20. morlamweb says:

    @Beldantazar: smart quotes have other problems not related to search.  See Raymond's post on the topic: blogs.msdn.com/.../9443404.aspx

  21. Viila says:

    "But how about "coöperation". Does it get converted to "cooperation"?"

    That's a good question. What should it convert into? In English ¨ is a diacritic that only modifies the pronunciation, but in Finnish and Swedish Ö and Ä are non-decomposable letters, just like Å. Either way, you're going to make somebody angry.

  22. SomeGuyOnTheInternet says:

    @Beldantazar - you're assuming I have source code access to every system the user interacts with. I don't. The user presses the U+0027 key on the keyboard but does not get a U+0027 character. In fact, user presses the U+0027 key on the keyboard and will get one of three different characters! The user has typed the same sequence of keystrokes in two different apps and gets a different series of characters each time.

  23. Boris says:

    Coöperation simplifies down to co-erperation, obviously.

  24. Marek says:

    Yeah, it can be fun with unicode: twitter.com/.../367557195186970624

    .

  25. Chris says:

    I am fully of the opinion that this was written simply to be able to end it with that smiley.

Comments are closed.
