When does 0 + 4 = 1?

I received a good question today from a developer who asked:

Why should one not use pointer arithmetic for string length calculation, access to string elements, or string manipulation?

The main reason pointer arithmetic is not a good idea is that not all characters are represented by the same number of bytes.  For example, in some code pages (such as the Japanese ones), some characters (a, b, c, ... and half-width katakana) are represented by one byte, while kanji characters are represented by two bytes.  Thus you cannot find the fifth character in a string by simply adding 4 to the pointer to the string's first byte.  The fifth character could be offset by as much as 8 bytes from that pointer (four preceding kanji characters, 2 bytes each).
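To make the offset arithmetic concrete, here is a minimal Python sketch. It assumes the code page is Shift-JIS (Python's "shift_jis" codec); the sample string and the helper name byte_offset are made up for illustration.

```python
# Four kanji followed by one ASCII letter: five characters in total.
text = "日本語漢a"
data = text.encode("shift_jis")  # assumed code page for this sketch


def byte_offset(s: str, index: int, encoding: str = "shift_jis") -> int:
    """Return the byte offset of the character at `index` in the encoded string."""
    return len(s[:index].encode(encoding))


fifth = byte_offset(text, 4)  # where the fifth character really starts

print(len(text))   # number of characters
print(len(data))   # number of bytes
print(fifth)       # byte offset of the fifth character
print(data[4:5])   # "pointer + 4" lands inside the third kanji instead
```

Under these assumptions, the byte at offset 4 is the first half of a kanji, while the letter "a" sits at offset 8.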


Even with Unicode you cannot assume that each character is 16 bits (one word).  In the UTF-16 encoding, some characters (supplementary characters) are represented by surrogate pairs (two 16-bit words).  Some single characters are represented as a base character (like "a") plus one or more combining characters (such as a diacritic like "^").  And if you are using UTF-8, a character can be 1, 2, 3, or 4 bytes long.
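The sizes described above can be checked with a short Python sketch. The sample characters and the helper names utf16_units and utf8_bytes are chosen only for illustration.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store `s` in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s: str) -> int:
    """Number of bytes needed to store `s` in UTF-8."""
    return len(s.encode("utf-8"))


samples = {
    "ASCII letter": "a",
    "letter with diacritic": "\u00e9",         # precomposed e with acute
    "kanji": "\u65e5",
    "supplementary character": "\U00020000",   # outside the 16-bit range
    "base + combining circumflex": "a\u0302",  # one visible character
}

for label, s in samples.items():
    print(f"{label}: UTF-16 units={utf16_units(s)}, UTF-8 bytes={utf8_bytes(s)}")
```

The supplementary character needs two UTF-16 code units (4 bytes), and the "a" plus combining circumflex also needs two code units even though it displays as a single character.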


That is why it is better to walk through a string with APIs that are aware of these differences than with pointer arithmetic. (See: StringInfo Class)  In globalization best practices, you never assume that all characters are the same size in bytes.
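As a rough illustration of what such an API does, the Python sketch below walks a string one text element at a time. It is not the StringInfo implementation: the function name text_elements is hypothetical, and the rule it uses (attach each combining mark to the preceding base character) is a deliberate simplification of full Unicode text segmentation.

```python
import unicodedata


def text_elements(s: str):
    """Yield each base character together with the combining marks that follow it.

    Simplified sketch: only characters with a nonzero canonical combining
    class are treated as belonging to the previous element.
    """
    element = ""
    for ch in s:
        if element and unicodedata.combining(ch):
            element += ch          # combining mark: extend the current element
        else:
            if element:
                yield element      # emit the finished element
            element = ch           # start a new element
    if element:
        yield element


sample = "a\u0302b\U00020000"      # a + circumflex, b, supplementary character
for index, element in enumerate(text_elements(sample)):
    print(index, [f"U+{ord(c):04X}" for c in element])
```

The caller indexes by element rather than by byte or code unit, so the combining sequence and the supplementary character each count as one step.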


Because sometimes, 0 (beginning pointer to a string array) + 4 (a high surrogate and a low surrogate [4 bytes]) = 1 (a single Unicode character).



