Mapping all those “strange” digits to “0” through “9”


In an earlier article, I discussed how the Char.IsDigit() method and its Win32 counterpart, GetStringTypeEx, report things to be digits that aren't just "0" through "9".

If you really care just about "0" through "9", then you can test for them explicitly. For example, as a regular expression, use [0-9] instead of \d. Alternatively, for a regular expression, you can enable ECMA mode via RegexOptions.ECMAScript. Note that this controls much more than just the interpretation of the \d character class, so read the documentation carefully to make sure that you really want all the ECMA behavior.
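The difference between \d and [0-9] is easy to see in a small sketch. This one uses Python's re module rather than .NET, purely as an illustration, and the sample string is made up. Python's re.ASCII flag plays a role loosely similar to ECMA mode in that it also changes more than just \d.

```python
import re

# Made-up sample: ASCII "1" and "2" followed by Arabic-Indic three and four.
sample = "12\u0663\u0664"

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r"\d", sample)

# An explicit range matches only "0" through "9".
explicit_digits = re.findall(r"[0-9]", sample)

# re.ASCII restricts \d to ASCII, but it also affects \w, \s and \b.
ascii_mode_digits = re.findall(r"\d", sample, re.ASCII)

print(len(unicode_digits), explicit_digits, ascii_mode_digits)
```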

It has been pointed out to me that there is a way to convert all those "strange" digits to the "0" through "9" range, namely by calling the FoldString function with the MAP_FOLDDIGITS flag.

(I put the word "strange" in quotation marks because of course they aren't strange at all. Just different.)
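To make the idea of folding digits concrete, here is a minimal sketch in Python. It is not a call to FoldString; it is a stand-in built on the standard unicodedata module, and the helper name fold_digits is invented for this example. The set of characters it converts is whatever Python's Unicode tables classify as decimal digits, which may not match FoldString exactly.

```python
import unicodedata

def fold_digits(text: str) -> str:
    """Replace each Unicode decimal digit with its ASCII counterpart.

    Hypothetical helper for illustration; not the Win32 FoldString function.
    """
    out = []
    for ch in text:
        # decimal() returns the digit value, or the default (None) for
        # characters that are not decimal digits.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)

print(fold_digits("\u0661\u0662\u0663"))    # Arabic-Indic one, two, three
print(fold_digits("\uff14\uff15 apples"))   # fullwidth four, five
```

Characters that are not decimal digits, including letters, punctuation and spaces, pass through unchanged.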

This converts digits but doesn't help with decimal points, so you still have to deal with correctly interpreting "1,500" as either "one thousand five hundred" (as it would be in the United States) or "one and a half" (as it would be in most of Europe). For that, you need to call GetLocaleInfo to get the LOCALE_SDECIMAL and LOCALE_STHOUSAND strings.
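The "1,500" ambiguity can be sketched as follows. This is an illustration in Python under stated assumptions: the helper parse_number is hypothetical, and the two separator arguments stand in for the decimal and thousands separator strings that GetLocaleInfo would return for the locale in question. Here they are simply passed in by hand.

```python
from decimal import Decimal

def parse_number(text: str, decimal_sep: str, thousands_sep: str) -> Decimal:
    """Parse text using caller-supplied separators (hypothetical helper)."""
    # Drop grouping separators first, then normalize the decimal separator.
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)

# United States convention: "," groups thousands, "." marks the decimal point.
us_value = parse_number("1,500", decimal_sep=".", thousands_sep=",")

# Convention in much of Europe: the roles are swapped.
eu_value = parse_number("1,500", decimal_sep=",", thousands_sep=".")

print(us_value, eu_value)
```

The same input string yields one thousand five hundred in the first call and one and a half in the second, which is why the separators have to come from the locale rather than from the string itself.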

Comments (14)
  1. Wouldn’t it be a lot simpler to just have a ‘ParseNumber(String x)’ function that did all this work for you? Adding all this complexity does nobody any favours, and as you said in your earlier article, makes it a lot easier to inadvertently add bugs. "Oh, we forgot to test combination #187523874 of input, which causes a crash / buffer overflow vulnerability / whatever."

  2. Spong says:

    If anyone knows of a relatively intermediate-level tutorial on Regular Expressions, please point the way.

    Most I’ve seen seem to either go "here’s the syntax" or "here’s a complex example", and the learning curve goes from null to thermonuclear in under one page.

  3. Tim Smith says:

    Having one simple routine that does it all sounds great, but life just isn’t that simple.

    Imagine I am in Europe and my software is talking to a U.S.-made device. This device only spits out values in a U.S. format (very common for even European devices). How would this simple routine know whether the value I need to convert is European or U.S.? Having those flags lets me tell the software exactly how to parse those numbers.

  4. Spong, There is a nice book on regular expressions here:

    http://www.amazon.com/exec/obidos/tg/detail/-/0596002890/ref=sib_rdr_dp/102-4360522-9929718?%5Fencoding=UTF8&no=283155&me=ATVPDKIKX0DER&st=books

    It introduces regular expressions in a fairly simple way and then works up to more advanced stuff. Amazon seem to have a ‘look inside’ feature for this book but it wasn’t working when I looked there.

    Jonathan

  5. Norman Diamond says:

    > This converts digits but doesn’t help with

    > decimal points

    It also doesn’t help with other numerics. For example if a number is written using Chinese characters with these meanings:

    one two three four five

    then presumably it is folded to 12345

    but if it is written normally in this manner:

    one ten-thousand two thousand three hundred four ten five

    or if it is written shorthand in this manner:

    one ten-thousand two three four five

    then it doesn’t get folded properly.

  6. Raymond Chen says:

    "one ten-thousand two thousand three hundred four ten five" is not a number in decimal form. According to the Unicode definition, a decimal digit is one that can be used to form decimal-radix numbers. But "one ten-thousand two thousand three hundred four ten five" is not decimal radix. It’s just a number written out in words.

  7. Johan Thelin says:

    TrollTech has a nice introduction to their RegEx class. Since regexes work the same way in both .Net and Qt, the intro is still valid:

    http://doc.trolltech.com/3.3/qregexp.html#details

    /Johan

  8. Norman Diamond says:

    4/18/2004 5:36 PM Raymond Chen:

    > But "one ten-thousand two thousand three

    > hundred four ten five" is not decimal radix.

    > It’s just a number written out in words.

    True. I hadn’t noticed in either the base note or the FoldString page that the restriction was intended to be related to the way the radix is used.

    OK, what happens then if two strings are involved, the first being the five Chinese characters for "one two three four five", and the second being the full form with nine Chinese characters. FoldString should fold the first string because it’s using the radix (like 12345) but should not fold the second string because it’s words (like the English twelve thousand, three hundred forty-five). Is FoldString smart enough to distinguish these cases? Is the Unicode definition smart enough to distinguish these cases?

    (Incidentally I misstated the shorthand form of "one ten-thousand two three four five". I don’t think I’ve seen this form written purely in Chinese characters; I’ve mostly seen it as a mix of European digits and the Chinese character for ten-thousand. So FoldString should probably leave this alone. Of course the location of the decimal point doesn’t help FoldString either, but we already know that that’s cultural. For example a typical real estate advertisement would list my monthly rent as 6 . 7 ten-thousand (meaning 67,000) (this might look like it’s missing a digit, but it’s not, I’m 2 hours from central Tokyo).)

  9. Raymond Chen says:

    FoldString doesn’t try to distinguish the cases; it doesn’t speak Chinese. It just converts characters.

  10. Norman Diamond says:

    4/19/2004 7:27 AM Raymond Chen:

    > FoldString doesn’t try to distinguish the

    > cases; it doesn’t speak Chinese.

    And surely it doesn’t speak any language. But then its actions do not depend on whether the decimal radix system is being used by the actual string that it’s told to fold.

    > It just converts characters.

    Then "one ten-thousand two thousand three hundred four ten five" must be converted to "1 ten-thousand 2 thousand 3 hundred 4 ten 5". Argh, this abuse of quotation marks and your server’s refusal to display either SJIS or Unicode as submitted doesn’t help any. In the original "quoted" string all words are their single-character Chinese numeric words. In the result "quoted" string the European numerics are as quoted and the words remain Chinese as in the original. And there are no blank spaces.

  11. Raymond Chen says:

    It’s not my server. I’m just a guest here. If you don’t like the server software, you can talk to Scott. http://scottwater.com/Blog/

    My point is that FoldString converts decimal characters to 0-9. It doesn’t try to interpret them in any way. If that’s not what you want, then don’t use it.

  12. Norman Diamond says:

    "Your server" meant the server you use, not that you own. "My employer" doesn’t mean that I own the company.

  13. Alex says:

    You can download EditPad pro or PowerGREP from here http://www.just-great-software.com . Both come with very good help files with thorough explanation of RegExp.

Comments are closed.
