Char.IsDigit() matches more than just “0” through “9”


Warning: .NET content ahead!

Yesterday, Brad Abrams noted that Char.IsLetter() matches more than just "A" through "Z".

What people might not realize is that Char.IsDigit() matches more than just "0" through "9".

Valid digits are members of the following category in UnicodeCategory: DecimalDigitNumber.

But what exactly is a DecimalDigitNumber?

DecimalDigitNumber

Indicates that the character is a decimal digit; that is, in the range 0 through 9. Signified by the Unicode designation "Nd" (number, decimal digit). The value is 8.

At this point you have to go to the Unicode Standard Committee to see exactly what qualifies as "Nd", and then you get lost in a twisty maze of specifications and documents, all different.

So let's run an experiment.

class Program {
  public static void Main(string[] args) {
    System.Console.WriteLine(
      System.Text.RegularExpressions.Regex.Match(
        "\x0661\x0662\x0663", // "١٢٣"
        "^\\d+$").Success);
    System.Console.WriteLine(
      System.Char.IsDigit('\x0661'));
  }
}

The characters in the string are Arabic digits, but they are still digits, as evidenced by the program output:

True
True
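
The Arabic-Indic digits are not a special case. As a rough sketch (the class and method names below are made up for illustration), you can ask the runtime which UTF-16 code units Char.IsDigit accepts and print the zero of each digit set it finds. The count depends on the Unicode data that ships with your runtime, so no particular number is promised here. Characters outside the Basic Multilingual Plane would need the string-based overloads and are not covered.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class DigitSurvey
{
    // Hypothetical helper: returns every UTF-16 code unit that Char.IsDigit accepts.
    public static List<char> FindDigits()
    {
        var digits = new List<char>();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (char.IsDigit(c))
            {
                digits.Add(c);
            }
        }
        return digits;
    }

    public static void Main()
    {
        List<char> digits = FindDigits();
        Console.WriteLine("Char.IsDigit accepts {0} characters", digits.Count);
        foreach (char c in digits)
        {
            // Print only the characters whose decimal value is zero,
            // which keeps the listing to roughly one line per script.
            if (CharUnicodeInfo.GetDecimalDigitValue(c) == 0)
            {
                Console.WriteLine("U+{0:X4}", (int)c);
            }
        }
    }
}
```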

Uh-oh. Do you have this bug in your parameter validation? (More examples..) If you use a pattern like @"^\d$" to validate that you receive only digits, and then later use System.Int32.Parse() to parse it, then I can hand you some Arabic digits and sit back and watch the fireworks. The Arabic digits will pass your validation expression, but when you get around to using it, boom, you throw a System.FormatException and die.
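
One hedged way to close the gap is to validate with the same rules the parser uses. The sketch below assumes you want plain ASCII digits, with no sign or whitespace, that fit in an Int32. TryParseAsciiInt is a hypothetical helper name, not a framework method. It leans on Int32.TryParse with NumberStyles.None so that validation and parsing cannot disagree, and the Main method contrasts it with the regular expression variants.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class DigitValidation
{
    // Hypothetical helper: succeeds only for a run of ASCII digits that fits in an Int32.
    // NumberStyles.None rejects signs, whitespace, and thousands separators.
    public static bool TryParseAsciiInt(string s, out int value)
    {
        return int.TryParse(s, NumberStyles.None, CultureInfo.InvariantCulture, out value);
    }

    public static void Main()
    {
        string arabic = "\u0661\u0662\u0663";
        int n;

        // \d matches any Unicode decimal digit by default.
        Console.WriteLine(Regex.IsMatch(arabic, @"^\d+$"));                           // True
        // An explicit character class matches ASCII digits only.
        Console.WriteLine(Regex.IsMatch(arabic, "^[0-9]+$"));                         // False
        // ECMAScript mode narrows \d to [0-9], along with other behavior changes.
        Console.WriteLine(Regex.IsMatch(arabic, @"^\d+$", RegexOptions.ECMAScript));  // False
        // Parsing is the final word: no exception, just a failure result.
        Console.WriteLine(TryParseAsciiInt(arabic, out n));                           // False
        Console.WriteLine(TryParseAsciiInt("123", out n));                            // True
    }
}
```

Parsing also catches values that a digits-only pattern lets through, such as a digit string too long to fit in an Int32.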

Comments (38)
  1. Anonymous says:

    Aren’t the parse methods culture-dependent?

  2. Anonymous says:

    So what’s the solution?

  3. Anonymous says:

    As a regular expression: "^[0-9]+$". Alternatively, you can turn on ECMA mode via RegexOptions.ECMAScript, but make sure you understand everything that ECMA mode changes; it’s more than just the way \d is interpreted.

    bool IsAsciiDigit(char c)
    {
      return c >= '0' && c <= '9';
    }

  4. Anonymous says:

    Ahem, in this case, if everything’s becoming so internationalized, shouldn’t in the perfect world the System.Int32.Parse() be able to handle the numbers in the arabic notation as well? ;)

  5. Anonymous says:

    Incidentally, I tried this in Java as well, translating your C# code into this:

    import java.util.regex.*;

    public class TestDigit {
      public static void main(String[] args) {
        System.out.println(Pattern.matches("^\\d+$", "\u0661\u0662\u0663"));
        System.out.println(Character.isDigit('\u0661'));
      }
    }

    The output this produced was:

    false
    true

    The difference is that Java 1.4’s regexp engine interprets "\d" as "[0-9]", not as "\p{Nd}", which seems to be how C#’s regexp engine interprets it.

    Of course, which behavior is "correct"? It could be argued both ways: C# is more "internationally" correct by default, while Java is more "intuitively" correct, at least for US/European programmers. And there are certainly ways to force either behavior in either engine, just by writing a different regexp or setting engine options. Ah well, I’ll just leave that question to the language/library lawyers.

  6. Anonymous says:

    This demonstrates what Bruce Schneier has said: that Unicode is a huge security disaster waiting to happen.

  7. Anonymous says:

    It just shows that you need to actually parse the input to validate it. If Int32.Parse throws an exception, then it’s not a valid number; if it doesn’t, then it is. Using regular expressions to validate input is pointless. This digits issue is just the beginning: even if you manage to include only symbols you can parse, you’re still not validating the actual value, which can be less than the minimum or more than the maximum Int32 value. So you may as well save yourself the regular expression testing and just try to parse the input.

  8. Anonymous says:

    Following up from what Jerry said: use Double.TryParse() to save yourself having to catch an exception.

  9. Anonymous says:

    Catching exceptions is slow. RegEx isn’t slow.

    If it’s time critical code you’d be far better off using a RegEx test for [0-9] then wrap the parse in a Try…Catch for numbers too big or too small for your type. At least that way you could avoid most of the Exceptions.

  10. Anonymous says:

    Steve, you would have to create a regex for each locale your code would run under, which is going to be somewhat of an issue. For example, US numbers can include multiple commas (thousands separators), while most European number formats can have only one (since it’s used as the decimal separator) but can have spaces in the number as thousands separators. Of course, you could just not allow your users to use their own number format; this whole localization fad is going nowhere anyway ;)

  11. Anonymous says:

    I find the Snippet Compiler app useful for trying out code samples such as the experiment in this post. http://www.sliver.com/dotnet/snippetcompiler/

    Does anyone know if there is a similar program for Java?

  12. Anonymous says:

    This reminds me of the case-sensitivity dilemma. In English it’s trivial to check letter cases, but if you were parsing Chinese I’d hope you would consider Traditional and Simplified characters as "case-equivalent" (this is a somewhat arbitrary many-to-many mapping, with thousands of entries).

    If you do a search on a Japanese filesystem, are katakana and hiragana considered "case-equivalent?" What about kana and (equivalent) kanji?

  13. Anonymous says:

    The obvious fix, to me, would be to make system.int32.parse Do The Right Thing with non-ASCII digits. That is, consider the value of each character based off of the unicode character database, not from a naive ‘0’-c. (There are various ways of optimizing this, of course.)

    The definition of Nd and the Unicode decimal property seem to imply that the characters should function as exact duplicates of 0-9; that is, be used in decimal place-value systems, with the most significant digit first.

    Of course, the previous paragraph is in direct disagreement with the fact that there are two sets of Nd characters that do not include a zero digit, Ethiopic and Tamil. (I’m using the UCD 4.0.0 for this.) There are a total of 268 Nd characters.

    BTW, this info is from a quick perl program:

    perl -MUnicode::UCD=charinfo -le 'while (++$n<0x1FFFF) {my $c=chr($n); my $i=charinfo($n); printf "U+%x (%s): %s = %d\n", $n, $i->{name}, $c, $i->{decimal} if $c=~/\p{Nd}/}'

    (PS — I don’t mean to start a language war, but perl is my language of choice. It does get this similarly wrong, though it also gives your choice of how to get it wrong.)

  14. Anonymous says:

    All the more evidence that we should get everyone to learn English because it would be much easier to do that than get every program to speak every language.

    Plus your strings take half the memory that way, because we can go back to 8 bit strings rather than 16 bit ones.

    Everybody wins! ;)

  15. Anonymous says:

    Replies here to both the base note and Dan Maas.

    I think this quotation in the base note probably comes from the Unicode standard:

    > DecimalDigitNumber

    > Indicates that the character is a decimal

    > digit; that is, in the range 0 through 9.

    Gads. Then the following characters are decimal digits:

    ???????????????????

    (not a word, ichi, ni, san, yon, go, roku, shichi, hachi, kyuu)

    (zero, one, two, three, four, five, six, seven, eight, nine)

    but the following characters are not decimal digits:

    ??????????etc.

    (juu, hyaku, sen, man, oku, etc.)

    (ten, hundred, thousand, ten thousand, hundred million, etc.)

    (and I’ll have to find one of my old Rubik’s Revenges in order to be reminded of the words for larger powers of ten thousand ^0^)

    There are even more reasons than I thought there were, to not adopt Unicode.

    3/9/2004 2:54 PM Dan Maas:

    > If you do a search on a Japanese filesystem,

    > are katakana and hiragana considered "case-

    > equivalent?"

    If it’s a filesystem, generally no. Even a full-width Ro-maji and equivalent half-width Ro-maji do not match. One time I was looking at a folder in Windows Explorer, looking at two files with identical names, trying to figure out how it could happen. Then suddenly I noticed that one letter was shaped slightly differently in one name than in the other. Windows Explorer uses a font which has proportional spacing for Roman letters, so the difference between the full-width and half-width forms of that particular letter was almost invisible. (By the way, half-width Ro-maji use the same character encodings as ASCII, single byte values in two ranges below 127.)

    However, upper-case and lower-case are considered equivalent if both are half-width (as you expected) or if both are full-width (which you probably wondered about but didn’t ask).

    In Microsoft Word, there are options to consider various forms as either identical or different, and I’m pretty sure these include hiragana vs. katakana, full-width vs. half-width, and upper-case vs. lower-case.

    I think search engines can match kanji against various possible readings of the kanji.

  16. Anonymous says:

    Don’t get me wrong, I love Unicode. Having all languages in a single character set is a great step forward. But it shows the true difficulty of multi-lingual programming. (and allows for good retorts when young whippersnappers suggest making a filesystem that matches upper and lower case inside the kernel :)

    (I think the only viable alternative to Unicode would be mandating everything English-only or Esperanto-only or Mandarin-only… not very attractive options for most people in the world!)

  17. Anonymous says:

    In Microsoft Word, you can set the language property for each portion of a document as you type it in.

    In some e-mail systems, you can set the language and character encoding of each message. In some, you can set it on portions of a message.

    It is possible for web pages too. It isn’t always necessary for American and European web pages to display garbage Kanji all over the place and make it impossible to read what was intended. It is possible to add a tag saying that the encoding is ISO-8859-1. Some designers do that and then some browsers can display it properly if you have suitable fonts. A few years ago I used Internet Explorer to mostly read and then save some documents, and when saving them I specified the encoding as Western European, but when I opened the saved files Internet Explorer still displayed them with garbage Kanji instead of the authors’ intended texts. That can be avoided even without Unicode.

    Unicode itself is not a bad idea. If it had been invented a few decades earlier than it was, it would be even more wonderful. The biggest problem is that it isn’t backward compatible. There already exist billions of files in SJIS and EUC which are not going to be converted. No matter how much further improvement and/or debugging gets made to Unicode, it will always be necessary to allow additional cultural or technical specifications.

  18. Anonymous says:

    I’ve just read a bug in the MSDN page cited in the base note (and now hope that someone else will post another followup before this so I won’t have two in a row, sigh).

    > Overload List

    > Indicates whether the specified Unicode

    > character is categorized as a decimal digit.

    […]

    > [C++] public: static bool IsDigit(__wchar_t);

    […]

    > int main() {

    > char ch = '8';

    >

    > Console::WriteLine(Char::IsDigit(ch));

    ‘8’ and ch are not Unicode characters. The example code gets the correct output but the example code didn’t get there by design. The person who wrote the declaration should talk to the person who wrote the example code.

  19. Anonymous says:

    Note that this works due to integer promotion rules, so the error is stylistic and not technical. Fortunately there is a "Feedback" link at the bottom, which I have had much success with. I’ve found that being polite helps.

  20. Anonymous says:

    Dan> How about Klingon-only? ;)

  21. Anonymous says:

    I see Chinese users moving (at a snail’s pace) away from the old 16-bit character sets. Sort of a chicken-and-egg problem given the lack of Unicode-aware applications. (it would be a major step forward for everyone to use Unicode for "Cut"/"Paste" between programs!)

  22. Anonymous says:

    re Steve Hiner:

    > Catching exceptions is slow. RegEx isn’t

    > slow.

    >

    > If it’s time critical code you’d be far

    > better off using a RegEx test for [0-9] then

    > wrap the parse in a Try…Catch for numbers

    > too big or too small for your type. At least

    > that way you could avoid most of the

    > Exceptions.

    How does that avoid most of the exceptions? How do you know that there will be more exceptions due to chars outside the range [0-9] than due to absolute values outside the int range being passed?

    Unless you know all the ways in which the code is going to be called and what values they are likely to pass in, you simply don’t know this.

  23. Anonymous says:

    Catching exceptions is slow.

    Exceptions are supposed to be exceptional. I can’t speak for all exception handling mechanisms, but good ones have minimal overhead for the non-exceptional case. If you’re trying to parse random data for integer strings, this would throw a lot of exceptions, but how realistic is that?

  24. Anonymous says:

    A slow exception is preferable to crashing or giving the malicious user control of your program!

  25. Anonymous says:

    Unicode is a big mess, but it wouldn’t work any other way because natural languages themselves are so messy.

    Someone mentioned problems with case-insensitive matching and Unicode. (I think Raymond even wrote about it once.) To me, this shows how horribly broken the Windows file system is. I’m sure it seemed like a good idea in the 1980s.

  26. Anonymous says:

    This is just as amusing as good old VB6/VBA.

    I live and work in Sweden where we (don’t laugh) use comma instead of dot when writing real values. For example you might think that Pi=3.1415… but in Sweden it is 3,1415…

    This is not a big problem, if I use the Format function I get a comma (or dot, depending on the regional settings of Windows). The big problem appears when I want to convert it to a value again. The val() function does not care about the regional settings so "1,55" = 1 but "1.55" = 1.55. This means that I had to define RVal() (Regional Val) that finds and replaces the first (if any) comma with a dot before calling val(). Not much of a problem, but an annoying work around since Format honours the regional settings.

  27. Anonymous says:

    Johan,

    I think the CCur, CDbl, CInt, CSng, etc. functions are all locale-aware in VB, whereas Val, as you’ve discovered, always uses ".".

  28. Anonymous says:

    So has anybody come up with a best practice for validating UNICODE input?

    Seems like the problem stems from needing to know the natural-language (NL) context of the input creator.

    I don’t recall. Do common user agents provide this in a trustworthy fashion?

    Regardless, your code either has to trust the UA provided NL context, decide an arbitrary NL context, or parse the value for the entire unicode encoding (ick).

    I’m just getting up to speed with the framework classes. So another question I have is: do the regexes in System.Text or wherever have flags for language contexts, to make this easier for the non-multilingual programmer? For example, I’m Bob in the U.S., I just want ASP.NET to work for my blog, and I will reject any digits that I don’t know how to validate with regexes.

  29. Anonymous says:

    If you set the ECMA flag, then everything is just English. So digits are "0" through "9" and nothing else, letters are "a" through "z" and "A" through "Z" and nothing else, etc.

  30. Anonymous says:

    FoldString(MAP_FOLDDIGITS) will convert all the digits.

  31. Anonymous says:

    The RegexOptions.ECMAScript flag changes the behavior of .NET regular expressions.

  32. Anonymous says:

    No, you don’t do that.

Comments are closed.