On the ambiguity of uniqueness


The MSDN documentation for System.Object.GetHashCode says

[T]he implementation of GetHashCode provided by the String class returns unique hash codes for unique string values.

This is another case of ambiguous use of the word "unique". The intended meaning is "for each string value, the same hash code is returned".

Even though "unique" means "one and only one", the domain in which the term applies is often left unsaid, as here, where the domain of comparison is "all the hash codes returned for a specific string value". If you instead misinterpreted the domain as "all the hash codes returned for all string values", then you end up erroneously concluding that no two strings hash to the same value.

Another conflicting sense of "unique" is "you get the same one each time" as opposed to "you get a different one each time".

  • GetCurrentProcessId returns a unique value that identifies the process. You get the same one each time.
  • CoCreateGuid returns a unique GUID. You get a different one each time.

In the original C standard, malloc(0) is permitted to return NULL or "a unique pointer". What does "unique" mean here? Does it mean that the non-NULL return value is always the same? Can I assert(malloc(0) == malloc(0))? Or does it mean that the non-NULL return value is a value distinct from any other return value from malloc()?

In Technical Corrigendum 1, this ambiguity was resolved by removing the word "unique". Instead, the specification says "as if the size were some nonzero value" which makes it clear that it is the second interpretation that is intended.

My suggestion: Don't use the word "unique". It's too ambiguous.

Comments (27)
  1. Anonymous says:

    ‘In the original C standard, malloc(0) is permitted to return NULL or "a unique pointer". … Can I assert(malloc(0) == malloc(0))?’

    How I read that (unless you’ve included more ambiguity) is malloc(0) could return NULL one time and "a unique pointer" another, so even if "a unique pointer" was always the same value, you couldn’t ‘assert(malloc(0) == malloc(0))’.

    Or am I over quibbling?

  2. Anonymous says:

    How about finding someone who can write?

  3. Kuwanger: Yes, it could return NULL sometimes and a unique pointer sometimes, but try capturing that in a single expression that doesn’t ruin the flow of the paragraph. "Can I void *p1=malloc(0), p2=malloc(0), assert(!p1 || !p2 || p1==p2)?" doesn’t quite have the same ring to it. Everybody knew what I meant, no need to get bogged down in details.

  4. Anonymous says:

    My suggestion: Don’t use the word "unique".

    >It’s too ambiguous.

    That’s a reasonable default :-)

  5. Anonymous says:

    What a unique way of putting things. :: grins :: Sorry, I couldn’t resist.

  6. Yeesh. Like I said, "No need to get bogged down in details". Did somebody declare today to be "nitpick day" without telling me? (Besides, I already set the context to "malloc(0) returns a non-NULL value" in the previous paragraph and re-emphasized it in the previous sentence. Do I have to worry about every single sentence being taken out of context?)

  7. Anonymous says:

    I hereby declare December 14th, 2005 as Worldwide Nitpick Day!

    Sorry you didn’t get that memo, Raymond.

    And Aaron: Leave the comedy to DJ Noname and the 107.7 The End morning show.

  8. Anonymous says:

    Aaron, there are those who appreciate cheesy puns and those who don’t. I do.

  9. Anonymous says:

    I’d write something like "GetHashCode provided by the String class returns a deterministic hash code when used on a string value."

  10. Anonymous says:

    Did the MSDN documentation even really need to say "unique hash codes"? Why not just say "hash codes"? It seems like a standard definition of hashing to me, and the extra verbiage just makes it confusing. Do people ever make hashing implementations that /aren’t/ deterministic?

  11. Anonymous says:

    It’s very annoying that GetHashCode() doesn’t return a unique hash code! What if I wanted to be able to uniquely identify a string via its hash code??? Could you please add a GetHashCodeUnique() method into the next version of .NET please??? Thank you!!!

  12. Anonymous says:

    "Could you please add a GetHashCodeUnique()"

    The very nature of hashes makes this impossible – you’re mapping a string of arbitrary length to a number of fixed size. You just can’t give every string a unique numbers, you don’t have enough.

    A point to start further research: http://en.wikipedia.org/wiki/Hashing

  13. Anonymous says:

    Here’s one for you: http://en.wikipedia.org/wiki/Irony :)

  14. Anonymous says:

    http://burtleburtle.net/bob/hash/perfect.html

    It is possible to produce unique hashes for all valid inputs to a function – provided, of course, that you choose a suitable subset of inputs to define as ‘valid’.

    Not a general purpose tool, obviously, but still interesting.

    (And it’s always nitpick day on the internet.)

  15. Anonymous says:

    Here’s how I would write it:

    [T]he implementation of GetHashCode provided by the String class returns hash codes for string values. Identical string values are guaranteed to return identical hash codes.

  16. Anonymous says:

    "Could you please add a GetHashCodeUnique()"

    I once coded this function for a minicomputer operating system. My implementation worked without error but was rarely used due to its poor static efficiency.

    However for those developers who could afford the space, my function could generate a domain-unique hash key for strings up to 2^10 bytes long using only 1024 bytes for the key.

    -Wang-Lo.

  17. Anonymous says:

    OldNewThing: Make it "void *p1=malloc(0), *p2=malloc(0); assert(!p1 || !p2 || p1==p2)" to do what you most probably intended (note the missing "*" before p2 and the replacement of ";" for "," before the assert().

  18. MSDN has a wide audience. Usually I get complaints from people that MSDN doesn’t state *enough* obvious things, not that it states too many…

  19. Anonymous says:

    But the word "unique" isn’t ambiguous enough. If you want it too ambiguous, you need uniques. And if you have get even more of them you have multiques.

    Reinder: malloc(10) isn’t malloc(0) so the particular question here doesn’t arise. Though now I feel compelled to reread that section of the standard to see if there’s another ambique statement along the lines you mentioned.

  20. Anonymous says:

    Oracle databases have UNIQUE indexes. But people have a very hard time understanding that an Oracle UNIQUE index can still contain *multiple* NULL values.

  21. Anonymous says:

    It is good to hear that it is "Worldwide Nitpick Day".

    "Or does it mean that the non-NULL return value is a value distinct from any other return value from malloc()?"

    Regardless of the parameter passed to it, malloc never guarantees that its return value is distinct from any other return value from malloc. Consider the following:

    void * p = malloc( 10);

    free(p);

    void * q = malloc( 10);

    I doubt that it was ever the intention of the C standards committee to require p and q to be different here.

  22. Anonymous says:

    Why would you want to call malloc(0)? It doesn’t seem to do anything useful, unless it’s to coerce your implementation of malloc() to do something weird behind the scenes stuff.

  23. Anonymous says:

    Ritchie: Why would you want to call malloc(0)?

    So you can realloc() later…

  24. Anonymous says:

    James Schend wrote:

    > I hereby declare December 14th, 2005 as

    > Worldwide Nitpick Day!

    Don’t be so light-hearted about that day. 66 years ago, on December 14 Soviet Union was expelled from League of Nations because violations of treaties with Finland and attacking it. That was a heck of a lot of nitpicking!

  25. Anonymous says:

    "Oracle databases have UNIQUE indexes. But people have a very hard time understanding that an Oracle UNIQUE index can still contain *multiple* NULL values."

    This makes perfect sense. NULL is not a value, but the absence of value. Once people properly understand this, SQL’s NULL handling is sensible.

  26. Anonymous says:

    "Oracle databases have UNIQUE indexes. But people have a very hard time understanding that an Oracle UNIQUE index can still contain *multiple* NULL values."

    "This makes perfect sense. NULL is not a value, but the absence of value. Once people properly understand this, SQL’s NULL handling is sensible."

    Sql Server doesn’t allow multiple NULL values in a unique constraint…and the reasoning is the same. NULL is not only the abscense of a value, but is an UNKNOWN value. Therefore, it is UNKNOWN whether 2 NULLs are equal. So, 2 (possibly) equal NULL values can’t be allowed into a unique constraint.

    For those solid reasons that you want Oracle behavior in Sql Server 2000, create an indexed view on the table that excludes nulls and enforce a unique constraint on the view.

  27. Anonymous says:

    null in oracle sux, can’t save an empty string without it becoming null.

Comments are closed.