Char.IsLetter() and Ascii


Interesting question from inside the firewall:


I expected Char.IsLetter() to return true only if presented with ‘A’..’Z’ or ‘a’..’z’ in my current (US English) locale. I find that I also get true returned for alphabetic characters above the ANSI range, such as Chinese characters.


 


Is this because we are using Unicode 3.0 here? Where can I find the Unicode 3.0 spec?


 


By the way, is there a method which returns true only if presented with ‘A’..’Z’ or ‘a’..’z’?


 


And the answer, from a developer on the globalization team:


It is indeed using Unicode properties (updated for Unicode 3.2 in a future release, and then eventually to 4.0 after that). Info on Unicode character properties can be found in the Unicode Character Database at http://unicode.org/.


 


If you need to get just ASCII then you can look for (Char.IsLetter(c) && c <= 0x007a).
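To make that concrete, here is a minimal sketch of that check wrapped in a helper method (the wrapper and its name are mine, not part of the answer). Within U+0000..U+007A (‘z’), the only characters Char.IsLetter accepts are exactly ‘A’..’Z’ and ‘a’..’z’, so the combined test yields ASCII letters only:

    static bool IsAsciiLetter(char c)
    {
        // Char.IsLetter filters to letters; capping the value at 0x007a ('z')
        // then restricts the result to the ASCII letters.
        return Char.IsLetter(c) && c <= 0x007a;
    }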


 

Comments (13)

  1. Jerry Pisk says:

    I don’t think there is a method but you can easily write your own:

    bool IsAsciiLetter(char c)
    {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    The Char.IsLetter docs say "Indicates whether a Unicode character is categorized as an alphabetic letter." – not just in the current locale.

  2. Kevin Daly says:

    It would be nice if there were an overload that let you pass a CultureInfo to it, but I can see difficulties in that.

    By the way, for an English-language scenario you would probably want to decide on a case-by-case basis whether you wanted to return "true" for accented letters (less clear cut than a different script like Chinese), depending on the point of the test, since these are certainly letters in the Roman alphabet (particularly in a validation scenario…telling people their names are invalid is just a teensy bit culturally insensitive).

  3. Jerry Pisk says:

    Well, you can always P/Invoke GetStringTypeEx.
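    A rough sketch of what that P/Invoke might look like (the constants come from the Win32 headers; the wrapper and its naming are mine, so treat this as illustrative rather than production code):

    using System;
    using System.Runtime.InteropServices;

    class CharType
    {
        const uint CT_CTYPE1 = 0x0001;          // request CTYPE1 classification
        const ushort C1_ALPHA = 0x0100;         // "any linguistic character"
        const uint LOCALE_USER_DEFAULT = 0x0400;

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
        static extern bool GetStringTypeEx(uint locale, uint dwInfoType,
            string lpSrcStr, int cchSrc, [Out] ushort[] lpCharType);

        // Returns true if the user default locale classifies c as alphabetic.
        public static bool IsAlpha(char c)
        {
            ushort[] types = new ushort[1];
            if (!GetStringTypeEx(LOCALE_USER_DEFAULT, CT_CTYPE1,
                                 c.ToString(), 1, types))
            {
                throw new System.ComponentModel.Win32Exception();
            }
            return (types[0] & C1_ALPHA) != 0;
        }
    }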

  4. Michael Entin says:

    > Is this because that we are using Unicode 3.0 in here? Where can I find Unicode 3.0 spec?

    I don’t think this method makes any sense at all in the scope of Unicode 3.0.

    Unicode allows 4-byte characters (which are represented as surrogate pairs in the CLR’s UTF-16 encoding), so you can’t reliably determine whether an individual 2-byte CLR Char (which could be half of a character) is a letter.
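    A small illustration of the surrogate problem (this uses helper methods from later .NET versions, and the specific character is my own example):

    using System;

    class SurrogateDemo
    {
        static void Main()
        {
            // U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the BMP,
            // so UTF-16 stores it as a surrogate pair of two System.Char values.
            string s = char.ConvertFromUtf32(0x10400);      // "\uD801\uDC00"
            Console.WriteLine(char.IsLetter(s[0]));         // False: a lone high surrogate
            Console.WriteLine(char.IsSurrogatePair(s, 0));  // True
            Console.WriteLine(char.IsLetter(s, 0));         // True: this overload sees the full pair
        }
    }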

  5. You have been Taken Out! Thanks for the good post.

  6. Norman Diamond says:

    I agree with Kevin Daly that it would be nice if CultureInfo could be passed to it, but I do not agree with his second paragraph. English writing does not have accented letters. If a person or program is testing whether an accented letter is a letter, then in English it is not, and in other languages it depends on precisely what their alphabets are.

    Sure it can be considered culturally insensitive for the Japanese government to require my wife to misspell her name in official documents. But Japanese Ro-maji do not include the Spanish letter which comes after n, so we wrote it as ny. Interestingly there is no trouble writing the pronunciation correctly in katakana.

    Surely it’s a bit more culturally insensitive for the Japanese government to prohibit visiting Chinese scholars from writing their names in Chinese characters on official documents, requiring Chinese to write their names in Ro-maji and/or katakana, saying that only Japanese citizens are allowed to write their names in Chinese characters. Sometimes this might have a purpose, if names are being entered in databases that have been encoded in SJIS or EUC for decades and are not being converted to Unicode, and Chinese characters that were not copied into Japanese cannot be represented. But sometimes it’s just offensive, as when it involves a name stamp (hanko).

    Back to English as discussed by Mr. Daly, sure it’s culturally insensitive to tell a Swede or Chinese that their name isn’t a valid name, but it’s not culturally insensitive to tell them that their scripting isn’t English.

  7. Keith Hill says:

    OK, I’m an admitted character set simpleton but wouldn’t a Char.IsAscii() method come in handy?
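    A trivial stand-in is easy enough to write in the meantime (this sketch is mine; no such method exists in the framework as of this writing):

    static bool IsAscii(char c)
    {
        // ASCII occupies exactly U+0000..U+007F in Unicode.
        return c <= 0x007F;
    }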

  8. Norman Diamond says:

    If the encoding is ASCII then every value from 0 to 127 is ASCII. Some values represent printable characters and some values represent control codes. If the encoding is JIS Ro-maji or SJIS or EUC then every value from 0 to 127 is JIS Ro-maji. Some values represent printable characters and some values represent control codes. Up to this point, the return value of an IsAscii() method would have to depend solely on cultural information, not on the value of the Char.

    If the encoding is an extension of ASCII (I don’t know if there are standards for this, though Microsoft has code page 437) then values from 128 to 255 are in this extension of ASCII. I think IsAscii() should return False for these values, but would understand if it returned True, if code page 437 is near-standard.

    If the encoding is Latin-1, ISO-8859-1, then values from 128 to 255 are in this character set. But this encoding is not ASCII, so surely IsAscii() should return False. Up to this point, the return value of an IsAscii() method might depend on the value of the Char, but even if it does, so what? It still depends on cultural information as well. The Char is not enough.

    In SJIS, a few of the values in the range 128 to 255 are characters (half-width katakana, which no one likes but everyone has to contend with because they exist). In EUC this does not happen; nothing in the range 128 to 255 is a character. In both SJIS and EUC, there are various ranges inside the range 32768 (0x8000) to 65535 (0xFFFF) that are characters, and various ranges that are not characters. Of course the codings are completely independent; a few values happen to overlap (though they represent different characters in SJIS and EUC), but mostly they’re different. Actually I’ve seen some 3-byte characters in EUC too. Anyway, these values obviously are not Ascii.

    But nearly all non-Ascii characters are letters.

    If you allow IsAscii to use cultural information together with the value of the Char then you can compute a return value for IsAscii, but what would you use it for?

  9. When answering issues like this, I begin by questioning the question itself: Why do you need a method which can return true only if presented with ‘A’..’Z’ or ‘a’..’z’? Is it because you’re going to use the result for something that may not allow other characters? It is usually much better to test that usage itself.

    For example, I remember an issue where accented characters were causing problems when used in file paths. The path was stored in the registry and later used to call functions using the ACP in some cases (most often, from user mode) and functions using the OEMCP in others (a few instances, but in a kernel-mode driver where I was told that codepage conversion was quite difficult to handle; I didn’t find out until much later that in kernel mode there are flags that can be set for handling path codepages).

    We could have checked the paths for A..Z and a..z, but the real problem wasn’t there; it was more effective to check whether the path converted to the ACP and the path converted to the OEMCP were identical. This allowed many characters that are not A..Z or a..z but still work. For example, an e with an acute accent would cause problems (it is 0x82 on CP437, CP850 and a few others, but 0xE9 on CP1252), while a Japanese character on a Japanese system would be OK (because the ACP and the primary OEMCP on Japanese systems are both 932). A minimal sketch of that comparison follows (the codepage numbers are hardcoded here for illustration; a real program would query the system’s ACP and OEMCP, e.g. via GetACP/GetOEMCP):
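    using System;
    using System.Text;

    class PathCheck
    {
        // A path is "safe" in the sense described above if encoding it with
        // the ANSI codepage and with the OEM codepage yields the same bytes.
        static bool SameUnderAcpAndOem(string path, int acp, int oemcp)
        {
            byte[] a = Encoding.GetEncoding(acp).GetBytes(path);
            byte[] o = Encoding.GetEncoding(oemcp).GetBytes(path);
            if (a.Length != o.Length) return false;
            for (int i = 0; i < a.Length; i++)
                if (a[i] != o[i]) return false;
            return true;
        }

        static void Main()
        {
            // On a US English system: ACP 1252, OEM 437.
            Console.WriteLine(SameUnderAcpAndOem(@"C:\data\report.txt", 1252, 437)); // True
            Console.WriteLine(SameUnderAcpAndOem(@"C:\data\résumé.txt", 1252, 437)); // False: é encodes differently
        }
    }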

  10. Kevin Daly says:

    In reply to Norman Diamond on allowing for accented characters in English: the reality is that people writing in English *do* have to take accented characters into account. And there are countries where the normal day to day language may be English, but names in a non-English language may be common and there may be very good social, cultural or political reasons for taking the care to spell them correctly. Ireland is a good example: names in Irish (surnames, Christian names, place names and names of government departments and institutions) are in common usage, they make extensive use of the acute accent, and it is simply incorrect to ignore those accents (it can also create confusion).

    So frankly it is not enough to rest in the comfort of English linguistic chauvinism.

    And hell, it’s hardly a biggie to deal with…and the ability to do so was after all one of the reasons for everybody moving to Unicode.

  11. Norman Diamond says:

    Replying to Kevin Daly’s posting of 3/15/2004 8:07 PM.

    Language issues:

    > the reality is that people writing in English *do* have to take accented characters

    > into account.

    Then people writing in English *do* have to take Chinese characters into account, for exactly the same reasons you stated, with possibly one exception:

    > names in Irish (surnames, Christian names, place names and names of

    > government departments and institutions) are in common usage, they make

    > extensive use of the acute accent,

    You say those are names in Irish. If you meant English, then I have just learned from you that English uses the acute accent, and people writing in English do have to take acute accented characters into account but don’t have to take Chinese characters into account. But if you meant Irish exactly as you wrote, then we’re back to the situation of accented characters and Chinese characters being equally not part of English while equally needing some non-English treatment on occasion.

    Encoding issue:

    > and the ability to do so was after all one of the reasons for everybody moving to

    > Unicode.

    Not exactly. A lot of people did not already use 16-bit character sets, so they didn’t have to worry about backwards compatibility. Suppose Unicode required you to replace all your databases, filenames, e-mail archives, etc., with 8-bit encodings that differed from your legacy 8-bit encodings. You’d refuse, wouldn’t you? You’d keep using your legacy 8-bit encoding because you have so many legacy files using it. Now notice how many people have legacy files using legacy 16-bit encodings, and the lack of backwards compatibility isn’t helping them. Not everyone is moving to Unicode.