Why are leading digits converted to language-specific digit shapes, but not trailing digits, and how do I suppress the conversion entirely?


If you have a string like 12345ABCDE67890, and you render it on an Arabic system, you might get ٠١٢٣٤ABCDE67890. The leading digits are rendered as Arabic-Indic digits, but the trailing digits are rendered as European digits. What's going on here?

This is a feature known as contextual digit substitution. You can specify whether European digits are replaced with native equivalents by going to the Region control panel (formerly known as Regional and Language Options), clicking on the Formats tab, going to Additional settings (formerly known as Customize this format), and looking at the options under Use native digits. The three options there correspond to the three values for LOCALE_IDIGITSUBSTITUTION.

Programmatically, you can override the user preference (if you know that you are in a special case, like an IP address) by following the instructions in MSDN.

  • Uniscribe: ScriptApplyDigitSubstitution
  • DWrite: IDWriteTextAnalysisSink::SetNumberSubstitution
  • GDI: ETO_NUMERICSLATIN or ETO_NUMERICSLOCAL.

As a last resort, you can stick a Unicode NODS (U+206F) at the beginning of the string to force European digits, or a Unicode NADS (U+206E) to force national digits.
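As a rough illustration of that last-resort trick, here is a minimal Python sketch that prepends the control character to a string. The helper names are made up for this example. The sketch only builds the string: whether the digits then render as European or national shapes depends on the text renderer honoring these (deprecated) characters.

```python
import unicodedata

NODS = "\u206F"  # NOMINAL DIGIT SHAPES: ask for European digits
NADS = "\u206E"  # NATIONAL DIGIT SHAPES: ask for national digits


def force_european_digits(text):
    """Hypothetical helper: prefix text with NODS (U+206F)."""
    return NODS + text


def force_national_digits(text):
    """Hypothetical helper: prefix text with NADS (U+206E)."""
    return NADS + text


if __name__ == "__main__":
    s = force_european_digits("12345ABCDE67890")
    # The stored digits are unchanged; only an invisible prefix is added.
    print(unicodedata.name(s[0]), repr(s[1:]))
```

The prefix is an invisible format character, so the string grows by one code point. Strip it again before parsing or comparing the text.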

Bonus chatter: What's the point of contextual digit substitution anyway?

Suppose you have the string "there are 3 items remaining." (Let's say that all text in lowercase is in Arabic.) You want this 3 to be rendered in Arabic-Indic digits because it is part of an Arabic sentence. On the other hand, if you have the string "that's a really nice BMW 350." you want the 350 to be in European digits since it is part of the brand name "BMW 350".

Contextual digit substitution chooses whether to use Arabic-Indic digits or European digits by matching them to the characters that immediately precede them. (And if no character precedes them, then it uses the ambient language.)
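The rule just described can be modeled in a few lines of Python. This is a simplified sketch, not the algorithm Windows actually uses. It assumes the ambient language is Arabic, that the relevant preceding character is the nearest preceding letter (spaces and punctuation are skipped), and that letters are classified by their Unicode bidirectional class. The function name and the `ambient_is_arabic` parameter are invented for the example.

```python
import unicodedata

# Arabic-Indic digits U+0660..U+0669, indexed by numeric value.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))


def substitute_digits_contextually(text, ambient_is_arabic=True):
    """Toy model: each ASCII digit follows the script of the nearest
    preceding letter, or the ambient language if there is none."""
    use_native = ambient_is_arabic
    out = []
    for ch in text:
        if "0" <= ch <= "9":
            out.append(ARABIC_INDIC[ord(ch) - ord("0")] if use_native else ch)
            continue
        bidi = unicodedata.bidirectional(ch)
        if bidi == "AL":          # Arabic letter: switch to native digits
            use_native = True
        elif bidi in ("L", "R"):  # other strong letter: European digits
            use_native = False
        # Spaces, punctuation, etc. leave the current context unchanged.
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(substitute_digits_contextually("12345ABCDE67890"))
    print(substitute_digits_contextually("BMW 350"))
```

Under this model, the leading 12345 has no preceding letter and takes the ambient (Arabic-Indic) shapes, while 67890 follows Latin letters and stays European, which matches the behavior in the original question.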

Comments (18)
  1. Anon says:

    I think the real question here was "Why are they being substituted in the first place?" What context could cause SOME of the digits to be substituted?

    [I added some bonus chatter for you. -Raymond]
  2. Adam Rosenfield says:

    The Unicode standard says that U+206A through U+206F are deprecated and that their use is strongly discouraged.  That's consistent with "As a last resort", so the other 3 methods should definitely be preferred.

    What sorts of text are these substitutions made on?  Any text in a standard Windows control like a STATIC or BUTTON?  Any text rendered with TextOut()?  Something different?

  3. Adam Rosenfield says:

    (Answer: Any text rendered using the Uniscribe, GDI, or DirectWrite APIs mentioned in the MSDN article Raymond linked to, which I ashamedly didn't read before posting that; that includes ExtTextOut() for GDI)

  4. DWalker says:

    The bonus chatter doesn't address why any number that is longer than a couple of digits doesn't get ENTIRELY left alone. That is, it's fine to convert a single digit, but for a longer string, is it that hard to leave the entire string in "European digits"?

    ["there are 314159 items remaining" presumably should also use Arabic-Indic digits. -Raymond]
  5. Anon says:

    @Raymond

    Thanks for the bonus chatter! That explains it perfectly.

  6. Myria says:

    Is there such a thing as rendering hexadecimal numbers in national digits? =)

    If Marco Polo had remembered that they write in the opposite direction and flipped the significance order when importing the system to Europe, maybe people wouldn't be so terribly confused by little-endian computer systems today.  Little-endian really is more natural.

  7. Brian_EE says:

    @Myria: Marco Polo was too busy hiding from his blind-folded friends in the swimming pool to think about that….

  8. Ben says:

    I'm surprised no one has pointed out that apparently the conversion subtracts 1 from each digit…

    [Heh. Nice one. -Raymond]
  9. Yes, it is possible to programmatically suppress this behavior. Word and OneNote do that. Word is okay but OneNote is a pain in the abdomen with its stupidity.

  10. Lev says:

    I think any alteration of the string while rendering it violates the principle of least surprise. It would be more proper to require each string to be generated with the proper digits. To this end, there could either be an extra flag for %d, something like %Ld, or %d could work in the current locale by default.

    This way,

    "there are %d items", 3 -> "there are ٣ items"

    "This is a %s", "BMW 350" -> "This is a BMW 350"

  11. Daniel says:

    Didn't you mean "BMW350"? because I honestly don't see how a computer can distinguish between "This is a BMW 350" and "You have to add 350"

  12. Jim says:

    @Myria I don't usually write at a low enough level for endianness to matter at all, so I'm shooting in the dark a bit here, but I can see at least one reason why little endian is worse than big endian. If you interpret a little-endian number at a memory location but with the wrong size (e.g. cast to an int32_t* instead of an int64_t*), then for small positive numbers you will read the correct result. Many numeric values are almost always small positive numbers, so this bug might slip through testing.

  13. Sam says:

    @ Daniel: Imagine his string was "這是一個 BMW 350" and "有3樣東西"

  14. Daniel says:

    @Sam: What I'm talking about is the space between BMW and 350. As I understand, if there is a space before the number, local digits are used but if the number is directly appended to other characters the original digits are used.

    The point is, that in his example, there is no difference between the 350 and the 3 (–> So either it's a mistake in the example, or I didn't get the point)

  15. Marcel says:

    @Daniel: you understood wrong, it has nothing to do with spaces. Local digits are used when preceded by local script, "European" digits if preceded by Latin characters. Look again at Sam's example to see how this looks when not written in English (so you can understand the sentences), but with a different script.

  16. Daniel says:

    Thanks for the clarification. I think I've got it now (e.g. Use the same Alphabet as the previous character. If at the start of the text: use local digits)

  17. Matt says:

    I think the first glyph in the ٠١٢٣٤ABCDE67890 is actually an Arabic 0 (i.e. what you have is 01234ABCDE67890). The 5 is ٥, so it should be ١٢٣٤٥ABCDE67890.

  18. Alex Cohn says:

    Another last resort to prevent automatic substitution: use the U+FF1x characters instead of U+003x

Comments are closed.
