Diacritics, Accents, Tashkeel

12/19/2010

So what are diacritics anyway? A diacritic is a mark typically used in combination with another character. Diacritics are sometimes called 'accents'. In Latin script, for example, the mark on top of the “a” in “â” is a diacritic. In Hebrew and Arabic these marks denote vowels. In Arabic they are also called ‘tashkeel’, for example “ًَُِ”.

There are two main issues that arise around diacritics. The first is display: because diacritics occupy the same space as another glyph, the font has to render them correctly, and you need to make sure that your control doesn’t truncate them. The second is string manipulation: how you search, store, and sort strings that contain diacritics. In this post we’ll explore how to remove diacritics and how to search for words while ignoring the diacritics.

How to remove diacritics?

There are two steps. First, normalize the string whose diacritics you want to remove. This decomposes each combined character into its base character followed by separate combining marks. Second, remove the diacritics and reconstruct your string.

Seems simple, so let’s explore the code:

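using System.Globalization;
using System.Text;

private static string RemoveDiacritics(string InputStr)
{
    // Step 1: decompose combined characters into base characters plus combining marks.
    string BasicStr = InputStr.Normalize(NormalizationForm.FormD);
    StringBuilder TempStr = new StringBuilder(BasicStr.Length);

    // Step 2: keep every character that is not a nonspacing (combining) mark.
    for (int i = 0; i < BasicStr.Length; i++)
    {
        if (char.GetUnicodeCategory(BasicStr[i]) != UnicodeCategory.NonSpacingMark)
            TempStr.Append(BasicStr[i]);
    }

    return TempStr.ToString();
}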
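To see why the normalization step matters, the following minimal sketch reports the Unicode category of each character after decomposition. The class and method names (NormalizationDemo, DecomposedCategories) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: decomposes the input (FormD) and returns the
    // Unicode category of every resulting UTF-16 code unit.
    public static UnicodeCategory[] DecomposedCategories(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        string decomposed = input.Normalize(NormalizationForm.FormD);
        var categories = new UnicodeCategory[decomposed.Length];
        for (int i = 0; i < decomposed.Length; i++)
        {
            categories[i] = char.GetUnicodeCategory(decomposed[i]);
        }
        return categories;
    }
}
```

For the single precomposed character “â” (U+00E2), the method returns two categories: LowercaseLetter for the base “a” and NonSpacingMark for the combining circumflex. The second entry is the one the removal loop filters out.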

Another approach is to define the diacritics you care about in an array and remove every character of the input that appears in that array.
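A minimal sketch of the array approach could look like the following. The class name, the method name, and the choice of marks are assumptions made for illustration: the array lists only the eight common Arabic tashkeel marks (U+064B to U+0652), so extend it for other marks or scripts.

```csharp
using System;
using System.Text;

public static class DiacriticArrayFilter
{
    // Assumption: a hand-picked list of Arabic tashkeel marks, not a complete set.
    private static readonly char[] ArabicDiacritics =
    {
        '\u064B', // fathatan
        '\u064C', // dammatan
        '\u064D', // kasratan
        '\u064E', // fatha
        '\u064F', // damma
        '\u0650', // kasra
        '\u0651', // shadda
        '\u0652'  // sukun
    };

    // Copies the input, skipping every character found in the array above.
    public static string RemoveListedDiacritics(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (Array.IndexOf(ArabicDiacritics, c) < 0)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

Compared with the normalization approach, this one removes only the marks you list, which is useful when some nonspacing marks must be preserved. The trade-off is that the list has to be maintained by hand.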

How to search, but ignore diacritics?

The .NET Framework has powerful string comparison features. The most commonly used are String.Compare and String.CompareOrdinal. However, if you need more advanced culture-aware functionality, you have to use CultureInfo.CompareInfo together with the CompareOptions enumeration. Three CompareOptions values matter most when deciding how to search. Let’s examine them in more detail:

· IgnoreNonSpace: Ignores nonspacing characters, including diacritics, but in the case of Arabic it also ignores the Alef-hamza.

· IgnoreSymbols: Ignores symbols, such as white-space characters, punctuation, currency symbols, the percent sign, mathematical symbols, the ampersand, and so on. In the case of Arabic, it also ignores diacritics.

· Ordinal: Indicates that the string comparison must be done using the Unicode values of each character, which is a fast comparison but is culture-insensitive. This is important if you want to disable any culture-specific behavior, such as ignoring the kashida. Ordinal must be used alone and cannot be combined with the other options.
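The effect of these options can be sketched with a small helper. The names below (CompareOptionsDemo, CompareWith, OrdinalCanBeCombined) are hypothetical, and the sketch assumes the invariant culture rather than an Arabic one so that it does not depend on installed culture data.

```csharp
using System;
using System.Globalization;

public static class CompareOptionsDemo
{
    // Hypothetical helper: compares two strings with the invariant culture
    // and the given options. A result of 0 means "equal under these options".
    public static int CompareWith(string a, string b, CompareOptions options)
    {
        CompareInfo comparer = CultureInfo.InvariantCulture.CompareInfo;
        return comparer.Compare(a, b, options);
    }

    // Ordinal has to be used on its own; combining it with another option
    // makes Compare throw an ArgumentException.
    public static bool OrdinalCanBeCombined()
    {
        try
        {
            CompareWith("a", "b", CompareOptions.Ordinal | CompareOptions.IgnoreNonSpace);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

For example, CompareWith("résumé", "resume", CompareOptions.IgnoreNonSpace) treats the two words as equal, while the same call with CompareOptions.Ordinal reports a difference because the code units differ.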

So in order to ignore all diacritics, I’ll use the following code:

using System;
using System.Globalization;
using System.Windows.Forms;

private static int IgnoreDiacritics(string InputStr1, string InputStr2)
{
    CultureInfo ArabicCult = new CultureInfo("ar");

    // IgnoreNonSpace skips the diacritics; IgnoreSymbols skips punctuation and white space.
    int Result = ArabicCult.CompareInfo.Compare(InputStr1, InputStr2, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreSymbols);

    string ResultStr = "The result of comparing text with CompareOptions ";
    ResultStr += InputStr1 + " and " + InputStr2 + " is : " + Convert.ToString(Result);
    MessageBox.Show(ResultStr);

    // 0 means the two strings are equal under these options.
    return Result;
}
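The function above compares two whole strings. To find a word inside a longer text while ignoring diacritics, the same options can be passed to CompareInfo.IndexOf. The following is a minimal sketch: the class and method names are hypothetical, and the culture is a parameter so the caller can pass new CultureInfo("ar") as in the code above, or CultureInfo.InvariantCulture.

```csharp
using System;
using System.Globalization;

public static class DiacriticSearch
{
    // Hypothetical helper: returns the index of the first match of 'word'
    // in 'text', ignoring nonspacing marks, or -1 when there is no match.
    public static int IndexOfIgnoringDiacritics(string text, string word, CultureInfo culture)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (word == null) throw new ArgumentNullException(nameof(word));
        if (culture == null) throw new ArgumentNullException(nameof(culture));

        return culture.CompareInfo.IndexOf(text, word, CompareOptions.IgnoreNonSpace);
    }
}
```

Unlike RemoveDiacritics, this leaves the original text untouched, so the returned index refers to a position in the string as the user typed it.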

That wraps up our discussion about diacritics, but we’ll have more posts soon. I hope you enjoyed this post. Until next time!
