What curious property does this string have?


There are all kinds of interesting things in the Unicode standard. For example, the block of characters from U+A000 to U+A48F is for representing syllables in the “Yi script”. Apparently it is a Chinese language writing system developed during the Tang Dynasty.

A string drawn from this block has an unusual property; the string consists of just two characters, both the same: a repetition of character U+A0A2:

string s = “ꂢꂢ”;

Or, if your browser can’t hack the Yi script, that’s the equivalent of the C# program fragment:

string s = “\uA0A2\uA0A2”;

What curious property does this string have?

I’ll leave some hints in the comments, and post the answer next time..

UPDATE: A couple people have figured it out, so don’t read the comments too far if you don’t want to be spoiled. I’ll post a follow-up article on Friday.

Comments (33)

  1. Eric Lippert says:

    Hint #1: The curious property is platform-dependent; you'll want to be using a 32 bit version of CLR v4.

  2. Eric Lippert says:

    Hint #2: The curious property is also a property of a much more commonly-used string.

  3. Eyal says:

    It has the same hash code of String.Empty?

    Wow, you are fast!

    Hint #3: You are on the right track; there is an even more curious property this string has which is related to its hash code being equal to that of String.Empty. — Eric

  4. [ICR] says:

    "s.ToUpper() == s.ToLower()" is true. Though that's not that curious.

    Indeed, I think all strings in Chinese-style languages have this property. – Eric

  5. Scott Hellewell says:

    I am not seeing any curious property either.  Although I did notice that it doesn't have a case difference lice [ICR].  

  6. John says:

    Is it something to do with byte order marks?  Something like it matches an empty string in a different encoding with byte order marks?

  7. Paul Irwin says:

    Bingo to Eyal – in a 32-bit process, s.GetHashCode() == string.Empty.GetHashCode() but not on an x64 process. I would guess this is a lesson in depending on hash codes? 🙂

  8. Gabriel says:

    I'm at work right now and only have access (easily) to a 2008 R2, ergo a 64 bit CLR. I'll give it a look when I get home.

  9. James Hart says:

    Well Eyal is correct, on the x86 v4 CLR, "uA0A2uA0A2".GetHashcode() == "".GetHashcode(). Though technically that doesn't meet the criterion of hint number 2. Unless the curious property that "uA0A2uA0A2" shares with a much more commonly used string (i.e. string.Empty) is 'having the hashcode 757602046'. But I don't know, I just don't find that property all that curious.

  10. henk says:

    As far as I can tell the Hashcode also match using x64.

  11. iCe says:

    Colisions like this happen in real system everytime when using GetHashCode(). However a lot of comparing and sorting infraestructure of the framework, like LINQ for example, depend on it.

    For me this model of equality is (on some scenarios) broken. I would like to see a much improven hashing algortihm, with less colision probabilty, be implemented in future versiones of dotnet.

  12. Diego F. says:

    @Paul Irwin

    Hashcodes are meant to collide! Not a problem depending on them! 🙂

  13. Brett says:

    @James Hart

    Maybe the curious property is "sharing a hash code with string.Empty."  Interestingly, Eric only seems to claim that the property is curious on s, but not on the "much more commonly-used string."

  14. Diego F. says:

    @iCe

    Hashcodes are not meant to express equality — a good reason why it would be a broken equality model –, but it surely can express inequality.

  15. Shuggy says:

    guessing here:

    it's the shortest (in code points) string whose hashcode matches? or the only 2 codepoint string to match?

    The smallest (if treated as an unsigned number) legal UTF-16 string which shares the hashcode?

    Or is the fact it's a palindrome (and string.Empty is essentially a palindrome in root case) and the only such palindrom to share the hashcode?

  16. gsvick@gmail.com says:

    Could it be that s.GetHashCode() == (s + s).GetHashCode()?

    You're so close! — Eric

  17. shawnhar says:

    > Could it be that s.GetHashCode() == (s + s).GetHashCode()?

    More than that – any number of repetitions of this string have the same hashcode.

    Give the man a cigar! Nicely done. — Eric

  18. Yuriy Guts says:

    I wonder if it has anything to do with string interning? Although I played around with String.Intern(…) on the 32-bit CLR v4 and didn't come to any definitive conclusion.

  19. df says:

    Eric, is it that if s is concatenated with itself any number of times (>= 0), the hash code of the resulting string will have the same hash code as s?  If so, indeed an interesting property.

    Yes! – Eric

  20. James C-S says:

    Is this special property of this string (hash code equals empty string no matter how many times it is concatenated to itself) a freaky happenstance? Or is it by design? Does the underlying hashing function some how use "uA0A2uA0A2" in the hash code calculation? Or does it happen to be that these values somehow have fallen into an algorithmic crack?

  21. Eugene says:

    @James

    This is hardly by design. Hash algorithm likely just reads the string in 4 byte pieces (2 chars), and 0xA0A2A0A2 is a fixpoint of the function it applies.

  22. Rasmus Faber says:

    > > Could it be that s.GetHashCode() == (s + s).GetHashCode()?

    > More than that – any number of repetitions of this string have the same hashcode.

    And the much more common string that shares this property is of course String.Empty (where it is much less interesting).

  23. Luc Bos says:

    UPDATE: A couple people have figured it out, so don't read the comments too far if you don't want to be spoiled. I'll post a follow-up article on Friday.

    => the first thing I did was look at the answer 🙁

  24. Ramon Smit says:

    @Diego: It can express inequality it BOTH hashcode are generated on the same machine with the same .net framework version in the same mode (x84/x64).

  25. Eamon Nerbonne says:

    @Eugene: the hashcode function clearly contains some kind of internal state, but if you include that state it's not a general fixpoint.  For instance "abcd".GetHashCode() != "abcdꂢꂢ".GetHashCode().  On the other hand, for any 2-letter string s it appears that  (s+"ꂢꂢ").GetHashCode() == s.GetHashCode(); even though s+"ꂢꂢꂢꂢ" does not; which seems to indicate that some parts of the internal state don't immediately affect the output but do after later iterations.

    This kind of hashcode property might be exploitable to form a denial of service attack; it's a bit far-fetched, but vaguely reminiscent (though not nearly as easily exploitable) of the php/java floating point parsing bug of a while back…

  26. Shuggy says:

    @Eamon, James

    The interesting behavior is (chiefly) function of two things, one is the specific magic number used to intialize the hash computation (352654597 or 0x15051505) and the second is that the hash calculation is formed of two such calculations, both entirely independent (one looking at even pairs of characters one looking at odd pairs) which are then merged at the end (independent of length).

    The function involved for the "shuffle existing bits, bring in more bits from the source" is such that there is a magic input pair where the shuffle:  (((num << 5) + num) + (num >> 0x1b))  followed by the bringing in of the bits (a simple xor) results in a no-op.

    For any reasonable implementation this is an offshoot of the very desirable property that you do not want to bias the hash function output while wanting to keep the function fast.

    The string.Empty case of course doesn't go through this shuffle then augment process at all, so it it a no-op by virtue of not actually happening :).

    Thus, for strings with a sequence of 4 special characters from the start of the string till the suffix that matters will result in a collision.

    This does indeed allow for targeted performance attacks (if you worry about this use a hash function not vulnerable to this that trades a little speed for being much harder to attack (or use a cryptographic hash like SHA-1 or better if you really care).

    simply including the length in the calculation directly (rather than as a by product of the iteration count) would recuce the simplicity of the attack, but at the cost of bias in the output bits without quite a strong avalanche phase (since the string lengths encountered in the real world will tend to be very clustered and small). Essentially the better the avalanche the less easy it would be to target too.

  27. Shuggy says:

    the phonetic pronunciation of this is (IPA) m̥o pronounced (very roughly) like hmo (but with the hm bit voiceless)

    Exploiting this would thus be a "hmOhmO" attack. Not really catchy so I can't imagine it would be popular 🙂

  28. Brian says:

    @Luc Bos  : Me, too.  The second thing I did was peek at the reference source out of curiosity.  I think my favorite line in the reference source is hash1 ^= ThisAssembly.DailyBuildNumber .  For context:

    #if DEBUG

                       // We want to ensure we can change our hash function daily.

                       // This is perfectly fine as long as you don't persist the

                       // value from GetHashCode to disk or count on String A

                       // hashing before string B.  Those are bugs in your code.

                       hash1 ^= ThisAssembly.DailyBuildNumber;

    #endif

    That makes me smile.

  29. Brian says:

    I suppose another curious property is that it looks like an M.C. Escher drawing of impossible eyeglasses.

  30. Anonymous Coward says:

    This is not an interesting property of the string, but of the hash function.

  31. CodeInChaos says:

    @Ramon

    GetHashCode in general can only express inequality withing a single AppDomain.

    @iCe

    Code usually relies on a low number of collisions to get a large performance gain. But any code that relies on `GetHashCode()` being unique for correctness is broken.

  32. Alex says:

    Have there been any DOS attacks on IIS by passing in a bunch of these strings of varying lengths? Non-uniform hashes sometimes lead to nasty problems like that if data structures rely on a good distribution of hash values.

  33. Diego F. says:

    @Alex

    Yes! http://arst.ch/rz0

    Remembered this post while reading the above article.