Performance Quiz #12 — The Cost of a Good Hash


This quiz is a little bit different from some of my others. Here is System.String.GetHashCode:

public override unsafe int GetHashCode()
{
      fixed (char* text1 = ((char*) this))
      {
            char* chPtr1 = text1;
            int num1 = 0x15051505;
            int num2 = num1;
            int* numPtr1 = (int*) chPtr1;
            for (int num3 = this.Length; num3 > 0; num3 -= 4)
            {
                  num1 = (((num1 << 5) + num1) + (num1 >> 0x1b)) ^ numPtr1[0];
                  if (num3 <= 2)
                  {
                        break;
                  }
                  num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr1[1];
                  numPtr1 += 2;
            }
            return (num1 + (num2 * 0x5d588b65));
      }
}

It’s a fine little hash function that does pretty much the sorts of things you’d expect a hash function to do.
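
For readers who would rather not decode the pointer arithmetic, here is a rough managed sketch of the same two-accumulator shift-and-xor pattern – illustrative only, not the framework’s actual code (which, as above, reads the characters two at a time as 32-bit ints):

// Rough managed sketch of the pattern above; results will differ from String.GetHashCode.
static int SketchHash(string s)
{
    int h1 = 0x15051505;
    int h2 = h1;
    for (int i = 0; i < s.Length; i += 2)
    {
        h1 = ((h1 << 5) + h1 + (h1 >> 27)) ^ s[i];
        if (i + 1 < s.Length)
            h2 = ((h2 << 5) + h2 + (h2 >> 27)) ^ s[i + 1];
    }
    return h1 + h2 * 0x5d588b65;
}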

So now the quiz.  It’s a bit of trivia but I’m going somewhere with it.

Can you tell me 5 implementations of GetHashCode in the framework that do things that you wouldn’t expect a hashing function to do?

Comments (28)

  1. Peter Ritchie says:

    I don’t know what I’d "expect" a hash method to do; but some interesting GetHashCode methods in comparison to each other:

    Char/Int16.GetHashCode returns value | (value << 16)

    while Byte/UInt16.GetHashCode simply returns value.

    and SByte.GetHashCode returns value ^ (value << 8)

    … not to mention, any class that doesn’t override GetHashCode doesn’t behave as expected (at least in terms of the impossible properties that a GetHashCode override is expected to exhibit).

  2. evin says:

    According to http://wesnerm.blogs.com/net_undocumented/2004/09/enums_and_perfo.html enums will use reflection in GetHashCode, giving terrible performance.

  3. Alois Kraus says:

    XmlQualifiedName does some caching.

    GacInstalled     simply returns 0.

    Cookie           concatenates all its strings and calls GetHashCode on the combined string.

    HashMembershipCondition alters the object state quite heavily before it calculates the hash value.

    Many *Condition classes have this behaviour.

    Yours,

     Alois Kraus

  4. Expanding on Peter Ritchie’s point above, even those that do override GetHashCode don’t always get it right after factoring in mutability and Equals overrides:

    http://weblogs.asp.net/bleroy/archive/2004/12/15/316601.aspx

    See for instance System.Web.UI.WebControls.ListItem.

    In terms of other unexpected behavior, I was surprised by System.Uri’s GetHashCode method, which goes to a lot of trouble to establish the completeness of the URI infoset and to distinguish remote URLs for non-absolute URIs.

  5. Rico,

    Somewhat unrelated: I see a lot of classes that implement GetHashCode from constituent members in a pretty clumsy way. Why don’t you guys provide a CombineHash function in order to create a pit of success here?

    That way somebody implementing GetHashCode for, say, struct Point, would simply do:

    return CombineHash (X.GetHashCode (), Y.GetHashCode ());

    Thanks,

    Dejan
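
    (Something like the following is presumably what is being asked for – a hypothetical helper, not an existing framework API; the constants are arbitrary small primes.)

    // Hypothetical CombineHash helper – a simple order-sensitive multiply/add mix.
    static int CombineHash(int h1, int h2)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + h1;
            hash = hash * 31 + h2;
            return hash;
        }
    }

    // Usage, for the Point example above:
    //   public override int GetHashCode()
    //   {
    //       return CombineHash(X.GetHashCode(), Y.GetHashCode());
    //   }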

  6. shuggy says:

    Dejan: the actual implementation for combining objects should handle the null check (if you want a pit of success). Obviously this then gives you unfortunate boxing for value types, which is a pain (or a combinatorial set of all int/object overloads for three-parameter methods, which is a lot of binary to lump everyone with)…

    Enumeration Equals and GetHashCode are appalling, and while they do not use reflection when used in Dictionary<foo,bar>, the default EqualityComparer is rubbish, causing (IIRC) two boxes for a hash code check and another two for the equals check.

    Unfortunately, writing your own fast version via generics *requires* the use of MC++ or C++/CLI; there is no fast way of doing it in C# that we could find (and we tried a lot of ways – even an unsafe block won’t let you cast a T to a void*, but C++/CLI will).

    It would be really nice if the runtime could magic in for any enum MyEnum based on integral Type I:

    enum MyEnum : I becomes

    enum MyEnum : I, IEquatable<MyEnum>, IEquatable<I>

    Enums are a LOT slower than using raw ints in a depressingly large proportion of the framework*

    * Especially if memory allocation is a performance nono for your code path

  7. Pent says:

    IMHO there is no need for a special CombineHash function.

    Combining a hash code from multiple fields should be a simple XOR, right?

    The code for Point.GetHashCode() would be "return X.GetHashCode()^Y.GetHashCode();" or maybe even "return (X^Y).GetHashCode();"?

  8. Dejan, Shuggy,

    CombineHash is more than a convenience. It is a necessary function if you want to have a good hash. The function used to combine a hash with the incrementally computed hash for the other fields in a class/struct should be closely tied to the computation of a hash value for a primitive type (int etc.), and to the finalization of the hash computation, where the bits of the hash value are "avalanched".

    What we need is a CombineHash function and a FinalizeHash function, both of which should be part of MSIL specifically to avoid boxing of any value types.

    If you have a poor quality hash value, one way to try and "improve" it is to mod by a prime number. Any hash table using prime number mod indicates there is a problem computing the hash values themselves — they should be sufficiently scrambled that no such mod is needed, only a truncation of bits. The fact that Hashtable uses the archaic prime number mod trick indicates general hashing problems in .net.

    We found that implementing our own hash functions and table class even for strings ran significantly faster than using the built in hash functions in .net.

    For more information on good hash functions:

    http://www.azillionmonkeys.com/qed/hash.html

    http://burtleburtle.net/bob/hash/
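
    To make the combining point concrete with the Point example and the "simple XOR" suggestion above – a tiny illustration, not framework code:

    // Why a bare XOR combiner is a weak mix for something like a Point with int X and Y.
    static int XorCombine(int x, int y)
    {
        return x ^ y;
    }

    // XorCombine(1, 2) == XorCombine(2, 1)   // (1,2) and (2,1) always collide
    // XorCombine(7, 7) == 0                  // every point with X == Y hashes to 0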

  9. shuggy says:

    Frank – I wasn’t suggesting it wasn’t – just that doing it pit-of-success style requires doing something about nulls, which then makes handling value types expensive given the current model.

    As you say, adding MSIL instructions to do it nicely is a ‘clean’ solution*, but how that would be cleanly represented in code is another matter.

    At the moment you could bodge something together using the null coalescing operator but it would look very messy.

    As to how the ‘avalanching’ is performed, I think you have some real issues percolating the info on how to do this across composite types where different subclasses of a member may have different behaviours.

    In essence this does not need perfection – since no single perfect solution exists for all possible domains. It needs to work well in cases where it ‘looks’ like it should. If the framework encourages you to treat an enum as just a set of fancy named aliases for integral types with some static type safety thrown in, plus some reflection-based discovery, then the bits that are not obviously reflective should do their best to be as fast as the integral cases. They don’t. And I mean they don’t *badly*.

    How many people expected Dictionary<MyEnum, foo> to box repeatedly on every access?

    * not commenting on codebloat in IL here at all

  10. Shuggy,

    The avalanching, or finalization, only needs to be done by the final user of the hash value, the hash table or other dictionary class. In all other places we only need the intermediate, in-progress value. So I don’t see the issues you are implying.

    I was thinking CombineHash, FinalizeHash, would be low level CLR operators, not much different from an operator like "add". In source code they would look like method calls.

    Ideally a GetHashCode implementation would receive special treatment by a compiler and possibly verifier, banning any allocation of memory. It seems these things all need better support at a lower level. Hashtables and other dictionaries are very important.

  11. shuggy says:

    Frank.

    Forgive me if I am being dense since this is not my field, but say I have, for example, this struct:

    struct Pair<T,U> : IEquatable<Pair<T,U>>
    {
        T first;
        U second;

        public bool Equals(Pair<T,U> other)
        {
            // skip the null checks and the like, you get the idea
            return this.first.Equals(other.first) && this.second.Equals(other.second);
        }

        public override int GetHashCode()
        {
            // again ignoring null checks
            return CombineHash(first, second);  // the hypothetical combiner under discussion
        }
    }

    By CombineHash I assumed you meant the combination of multiple independent hash values into one single value.

    What takes responsibility for combining the bits from first’s and second’s hash codes? At compile time the Pair class has no idea how the bits in the hashes of first and second may be optimally combined.

    There is no performant way to make that happen…

    That is why I am saying that anything trying to do any kind of halfway reasonable general combination should try to avoid needless boxing of the passed-in parameters.

    Sadly, if the CombineHash action is made to look syntactically like a method, that may confuse people, since you merge two very different actions into the same syntax – something I view as a bad idea in a language like C#. (I think the multi-(ab)use of using is definitely bad; indexers perhaps not.)

    For example, people looking at it might think "Oh, this will box or create an array – I’ll work around it", as the ‘signature’ of that sort of ‘multiple arguments of any type you like’ use case would imply if it were a function.

    Without meaning to denigrate your work, I would suggest that this step is more important for most, since writing a decent hash for a type is something that you can do. Providing an EqualityComparer for a non-value type is easy, and doing it for a primitive is normally unnecessary as the default one is fine.

    Doing it for your own structs or (worse) enums is a PITA, requiring a drop down to C++ for non-boxed maximum performance.

  12. Hi Shuggy,

    Let’s distinguish between intermediate hash values and final hash values. Only the dictionary/hash table class needs the final hash values, so only such a class would call FinalizeHash. Let’s assume GetHashCode only produces an intermediate hash value.

    The fast (performant) way to make CombineHash happen is no different from the fast way the integer addition operator "happens" — CombineHash becomes an operator in the CLR. Then the compiler and CLR can deal with making sure there is no boxing. GetHashCode should also never box. As far as syntax is concerned, I leave that to the language people.

    Hashing is extremely important… and right now we have lots of poor GetHashCode methods that should never have been allowed to do more than GetHashCode and CombineHash operations. With those restrictions, the "good" hashing is all taken care of at the lowest level, and the class implementor has only to decide what fields, or portions of fields (in the case of primitives), should be hashed and what should not.

    Writing a good hash function, and a good combiner, is very hard, and most developers will not get it right.

  13. Haiyuan Pan says:

    I don’t know why he used unsafe code to generate the hash code, but that is one thing I didn’t expect.

    He used two "^" in a loop; I think it will cost lots of CPU time. Can we change the algorithm to avoid it? (It would be better to avoid the "*", too.)

  14. shuggy says:

    Frank – I see where you are going.

    Though I should note that addition in .net is not fast because it has an explicit IL operator; it is fast because the resulting instructions are targeted at the underlying architecture’s addition operation. If no such operation exists for combine or finalize, it makes no difference how simple the IL operation was.

    One way to do it would be to try going with the ‘Intentional’ programming style and use compiler-aware attributes.

    e.g. Allowing the provision of

    EqualitySignificantAttribute (perhaps with properties that indicate behaviour on nulls and the like)

    The presence of this attribute on any field would cause the compiler to override object.Equals() (and perhaps implement IEquatable<T>.Equals, though how to handle this for T other than the type itself may be rather tricky) as well as GetHashCode, doing the right thing with no boxing (and this could just use normal MSIL; there would be no need for new instructions).

    If you were willing to have a hashcode which was only a subset of the equatable fields you could attribute them with an [IncludeInHash(false)]

    Back-of-an-envelope stuff here, but you get the idea (a rough sketch of the attribute declarations follows below). FxCop rules could spot explicit implementations that would be better as intentional, auto-created ones.

    Obviously the cost here is that compilation time and complexity go up, as well as much more complex debugging…
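
    Purely as a sketch of the shape (nothing like this exists in the framework, and the compiler support is imagined), such attributes might be declared as:

    // Hypothetical attribute declarations matching the idea above – only the
    // shape is shown; a compiler or code generator would have to honour them.
    using System;

    [AttributeUsage(AttributeTargets.Field)]
    sealed class EqualitySignificantAttribute : Attribute
    {
        // example knob for "behaviour on nulls and the like"
        public bool TreatNullAsDefault { get; set; }
    }

    [AttributeUsage(AttributeTargets.Field)]
    sealed class IncludeInHashAttribute : Attribute
    {
        public IncludeInHashAttribute(bool include) { Include = include; }
        public bool Include { get; private set; }
    }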

  15. shuggy says:

    Frank – Having read the informative articles you linked to I noted the line

    "(Incidentally, Bob Jenkins overly chastizes CRCs for their lack of avalanching — CRCs are not supposed to be truncated to fewer bits as other more general hash functions are; you are supposed to construct a custom CRC for each number of bits you require."

    I would have thought that, for the general-purpose use required by any framework API, the use of int as the result of GetHashCode makes it clear that the custom number of bits required is in all cases 32. If you want to do something that would benefit from fewer (or more) bits you should roll your own.

    I seriously doubt there are many people who would benefit from that being in the framework.

    The number of developers who have even the faintest conception of the difference between open and closed addressing is scary. MS has no option but to make the framework usable to those people too. This is why I dislike the performance of enums: they fail to perform as expected, even against those low expectations (as you say – not allocating memory being a big one!).

  16. Olivier Coanet says:

    @Shuggy

    You can imagine a helper class to hide the complexity of object/value type combinations:

    public class HashCodeBuilder
    {
        public void AddObject(object o)
        {
            //…
        }

        public void AddHash(int hash)
        {
            //…
        }

        public override int GetHashCode()
        {
            //…
        }
    }

    public class Foo
    {
        object _bar1;
        object _bar2;
        uint _bar3;

        public override int GetHashCode()
        {
            HashCodeBuilder hashCodeBuilder = new HashCodeBuilder();
            hashCodeBuilder.AddObject(_bar1);
            hashCodeBuilder.AddObject(_bar2);
            hashCodeBuilder.AddHash(_bar3.GetHashCode());
            return hashCodeBuilder.GetHashCode();
        }
    }
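
    One possible filling-in of the elided bodies, purely as a sketch (the seed and multiplier are arbitrary, not anything the framework prescribes):

    // Hypothetical bodies for the skeleton above – a simple running
    // multiply/add mix, not a framework class.
    public class HashCodeBuilder
    {
        private int _hash = 17;                       // arbitrary non-zero seed

        public void AddObject(object o)
        {
            AddHash(o == null ? 0 : o.GetHashCode()); // nulls contribute a fixed value
        }

        public void AddHash(int hash)
        {
            unchecked { _hash = _hash * 31 + hash; }  // order-sensitive mix
        }

        public override int GetHashCode()
        {
            return _hash;
        }
    }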

  17. shuggy says:

    Olivier

    This approach has some problems regarding performance:

    1) you create a new object each time! (even if you cache it you still have the hit in space terms of caching it somewhere)

    2) lots of function calls – if you have structs then they will not be inlined. This then means you must add the invocation of GetHashCode(), which hardly helps you in the usability stakes.

    3) it doesn’t help with enums at all – they still behave poorly

    If you were willing to sacrifice thread safety and accept the risk of not ‘resetting’ the accumulator (which is fine for lots of cases), then a series of static functions in C++/CLI could do what both Frank and I wanted*. You just can’t in C# unless you explicitly provide a method for every single enumeration you care about (and thus an explicit EqualityComparer for every enum you care about – sketched below).

    C# has unsafe blocks and I would like to be able to use them fully if I wish. The restrictions placed on generic parameter types in them are (to me) annoying. Perhaps the design team felt that discouraging unsafe casts on generic types was worth it.

    * Plus Frank could put in asm if he wanted to target the architecture since this clearly makes a big difference if you are amalgamating a lot of data which is not enregistered and you care about access alignment.
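
    For reference, the per-enum C# workaround mentioned above might look something like this (MyEnum is just a hypothetical example; the same comparer has to be hand-written for every enum you care about):

    // Hand-written comparer for one specific enum, avoiding the boxing
    // described above for the default comparer.
    using System.Collections.Generic;

    enum MyEnum { A, B, C }

    sealed class MyEnumComparer : IEqualityComparer<MyEnum>
    {
        public bool Equals(MyEnum x, MyEnum y)
        {
            return x == y;          // plain integral compare, no boxing
        }

        public int GetHashCode(MyEnum value)
        {
            return (int)value;      // the underlying value already serves as a hash
        }
    }

    // Usage:
    //   var map = new Dictionary<MyEnum, string>(new MyEnumComparer());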

  18. Here is an alternative solution. My goal is to abstract away the hash value computation and combining functions, which should not be written by the ordinary class developer. Since the hash value combining algorithm and hash value "avalanche" algorithm are intimately tied mathematically, it is impossible for developers to write good hash functions.

    I am not solving here the enum boxing problem, which is more general and applies to all inherited enum virtual methods.

    See the proposal at the link below. We have created very fast hash tables using Paul Hsieh’s "super fast hash" algorithms.

    http://docs.google.com/View?docid=dhmdqxdn_4hp8zqx

  19. One more point… the purpose of the proposal is not to increase the number of bits in the result, though you could vary that by platform. The purpose is to fix the terrible hash code computations currently performed in many .net applications, and banish once and for all the "mod prime" hack which is no real solution.

  20. I continue to be astounded by what my readers can come up with. As usual I had a purpose for posing my…

  21. Thomas says:

    The biggest question I have for this topic:

    Why does an immutable string calculate its hash value over and over again? Why is it not cached in the string itself?

  22. shuggy says:

    Thomas

    What about people who don’t need the string’s hash code – you have made all their strings 4 bytes longer for no reason. If those strings are small (say, averaging about 8 characters) you add nearly 10% to the memory required to store them. This has a direct knock-on effect in speed terms.

    Try firing up SoS sometime and query the size of your heap by object type. Highly likely that string is quite high in the list…

  23. ricom says:

    >>The biggest question I have for this topic: Why does an immutable string calculate its hash value over and over again? Why is it not cached in the string itself?

    That would make an excellent perf quiz!  It’s no accident 🙂

  24. shuggy says:

    At a guess:

    pReuse = probability of HashCode reuse

    bCalc  = benefit of not performing calculation

    bTrav  = benefit of not traversing the character array again due to cache misses/mem reads

    cL1    = cost of increased level 1 cache miss

    cL2    = as above level 2

    cClr   = cost of clearing the extra 4 bytes of memory

    there will be more but I would guess those are the biggies

    if

    pReuse * (benefits) < numStrings * costs

    it isn’t worth it.

    I would guess, based on most code that I have seen, that many people create an *awful* lot of temporary strings compared to how many times they are used as keys in a hash table.

    You can do an awful lot of calculations to avoid hitting memory…

    Oh yeah and I forgot to say that if you lazily evaluate the hash (if you don’t then pReuse had better be *really* high) then the calculation gets an if statement (which I doubt will be very well predicted).  

    If you want to use a string as a key in a hash you can very easily roll your own key and cache the value.
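
    For example, a roll-your-own cached key might look roughly like this (CachedStringKey is a hypothetical type, not anything in the framework):

    // Wrap the string once, pay for the hash once, then reuse it on every lookup.
    using System;

    struct CachedStringKey : IEquatable<CachedStringKey>
    {
        private readonly string _value;
        private readonly int _hash;

        public CachedStringKey(string value)
        {
            _value = value;
            _hash = value == null ? 0 : value.GetHashCode();  // computed exactly once
        }

        public bool Equals(CachedStringKey other)
        {
            return _value == other._value;
        }

        public override bool Equals(object obj)
        {
            return obj is CachedStringKey && Equals((CachedStringKey)obj);
        }

        public override int GetHashCode()
        {
            return _hash;                                     // no re-hashing per lookup
        }
    }

    // Usage:
    //   var table = new Dictionary<CachedStringKey, int>();
    //   table[new CachedStringKey("some key")] = 42;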

    Of course since this is Rico we are talking about, this would be a case of measure, measure, measure.

  25. ricom says:

    >>Of course since this is Rico we are talking about, this would be a case of measure, measure, measure.

    *GRIN*

  26. In our hashtables we always store the hash along with the value… any hash value storage in a string or other CLR object is wasted space. And we measured to see it was faster.

  27. Thomas says:

    OK, got it. And Dictionary<K,V> probably does the caching exactly for this reason. But look what happens in the System.Collections.Hashtable implementation…

  28. Well once again there have been many thoughtful replies…