Guidelines and rules for GetHashCode

"The code is more what you'd call guidelines than actual rules" - truer words were never spoken. It's important when writing code to understand what are vague "guidelines" that should be followed but can be broken or fudged, and what are crisp "rules" that have serious negative consequences for correctness and robustness. I often get questions about the rules and guidelines for GetHashCode, so I thought I might summarize them here.

What is GetHashCode used for?

It is by design useful for only one thing: putting an object in a hash table. Hence the name.

Why do we have this method on Object in the first place?

It makes perfect sense that every object in the type system should provide a GetType method; data's ability to describe itself is a key feature of the CLR type system. And it makes sense that every object should have a ToString, so that it is able to print out a representation of itself as a string, for debugging purposes. It seems plausible that objects should be able to compare themselves to other objects for equality. But why should it be the case that every object should be able to hash itself for insertion into a hash table? Seems like an odd thing to require every object to be able to do.

I think if we were redesigning the type system from scratch today, hashing might be done differently, perhaps with an IHashable interface. But when the CLR type system was designed there were no generic types and therefore a general-purpose hash table needed to be able to store any object.

How do hash tables and similar data structures use GetHashCode?

Consider a "set" abstract data type. Though there are many operations you might want to perform on a set, the two basic ones are insert a new item into the set, and check to see whether a given item is in the set. We would like these operations to be fast even if the set is large. Consider, for example, using a list as an implementation detail of a set:

class Set<T>
{
  private List<T> list = new List<T>();

  public void Insert(T item)
  {
    if (!Contains(item))
      list.Add(item);
  }

  public bool Contains(T item)
  {
    foreach(T member in list)
      if (member.Equals(item))
        return true;
    return false;
  }
}

(I've omitted any error checking throughout this article; we probably want to make sure that the item is not null. We probably want to implement some interfaces, and so on. I'm keeping things simple so that we concentrate on the hashing part.)

The test for containment here is linear; if there are ten thousand items in the list then we have to look at all ten thousand of them to determine that the object is not in the list. This does not scale well.

The trick is to trade a small amount of increased memory burden for a huge amount of increased speed. The idea is to make many shorter lists, called "buckets", and then be clever about quickly working out which bucket we're looking at:

class Set<T>
{
  private List<T>[] buckets = new List<T>[100];

  public void Insert(T item)
  {
    int bucket = GetBucket(item.GetHashCode());
    if (Contains(item, bucket))
      return;
    if (buckets[bucket] == null)
      buckets[bucket] = new List<T>();
    buckets[bucket].Add(item);
  }

  public bool Contains(T item)
  {
    return Contains(item, GetBucket(item.GetHashCode()));
  }

  private int GetBucket(int hashcode)
  {
    // A hash code can be negative, and thus its remainder can be negative also.
    // Do the math in unsigned ints to be sure we stay positive.
    return (int)((uint)hashcode % (uint)buckets.Length);
  }

  private bool Contains(T item, int bucket)
  {
    if (buckets[bucket] != null)
      foreach(T member in buckets[bucket])
        if (member.Equals(item))
          return true;
    return false;
  }
}

Now if we have ten thousand items in the set then we are looking through one of a hundred buckets, each with on average a hundred items; the Contains operation just got a hundred times cheaper.

On average.

We hope.

We could be even more clever here; just as a List<T> resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.
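As a quick sanity check of the unsigned arithmetic in GetBucket above, this standalone sketch (a hypothetical parameterized version of the same helper, not part of the original implementation) shows that negative hash codes still map into a valid bucket index:

```csharp
Console.WriteLine(GetBucket(5, 100));            // 5
Console.WriteLine(GetBucket(-7, 100));           // 89, not -7
Console.WriteLine(GetBucket(int.MinValue, 100)); // 48, still in range

// Same logic as the GetBucket in the sketch above, with the
// bucket count passed in as a parameter for illustration.
static int GetBucket(int hashCode, int bucketCount) =>
    (int)((uint)hashCode % (uint)bucketCount);
```

Had the method used a plain `hashCode % bucketCount`, a negative hash code would produce a negative remainder and an out-of-range array index.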

By starting with the position that this code should work, we can deduce what the rules and guidelines must be for GetHashCode:

Rule: equal items have equal hashes

If two objects are equal then they must have the same hash code; or, equivalently, if two objects have different hash codes then they must be unequal.

The reasoning here is straightforward. Suppose two objects were equal but had different hash codes. If you put the first object in the set then it might be put into bucket #12. If you then ask the set whether the second object is a member, it might search bucket #67, and not find it.

Note that it is NOT a rule that if two objects have the same hash code, then they must be equal. There are only four billion or so possible hash codes, but obviously there are more than four billion possible objects. There are far more than four billion ten-character strings alone. Therefore there must be at least two unequal objects that share the same hash code, by the Pigeonhole Principle. (*)
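To make the rule concrete, here is a sketch of a hypothetical immutable Point type (my example, not from the hash table above) whose GetHashCode is computed from exactly the fields that participate in Equals, so equal points agree on their hash by construction:

```csharp
struct Point : IEquatable<Point>
{
    public readonly int X, Y;
    public Point(int x, int y) { X = x; Y = y; }

    public bool Equals(Point other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Point p && Equals(p);

    // Derived from the same fields as Equals, so the
    // "equal items have equal hashes" rule holds automatically.
    // The multiplier 397 is an arbitrary illustrative choice.
    public override int GetHashCode() => unchecked(X * 397 + Y);
}
```

Two distinct Point values with the same coordinates compare equal and therefore return the same hash code, which is all the rule requires.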

Guideline: the integer returned by GetHashCode should never change

Ideally, the hash code of a mutable object should be computed from only fields which cannot mutate, and therefore the hash value of an object is the same for its entire lifetime.

However, this is only an ideal-situation guideline; the actual rule is:

Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable

It is permissible, though dangerous, to make an object whose hash code value can mutate as the fields of the object mutate. If you have such an object and you put it in a hash table then the code which mutates the object and the code which maintains the hash table are required to have some agreed-upon protocol that ensures that the object is not mutated while it is in the hash table. What that protocol looks like is up to you.

If an object's hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn't find it.

Remember, objects can be put into hash tables in ways that you didn't expect. A lot of the LINQ sequence operators use hash tables internally. Don't go dangerously mutating objects while enumerating a LINQ query that returns them!
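The failure mode is easy to reproduce. This deliberately broken sketch (a hypothetical class of mine, not from the article's hash table) hashes a mutable field; mutating the object while it sits in a HashSet makes the set lose it:

```csharp
var item = new Mutable { Value = 5 };
var set = new HashSet<Mutable> { item };
Console.WriteLine(set.Contains(item)); // True

item.Value = 6; // the hash code changes while the object is in the set...
Console.WriteLine(set.Contains(item)); // False: the set searches the wrong bucket

class Mutable
{
    public int Value;
    // Dangerous: the hash depends on mutable state.
    public override int GetHashCode() => Value;
    public override bool Equals(object obj) => obj is Mutable m && m.Value == Value;
}
```

The object is still physically in the set; Contains simply computes the new hash code, looks in the bucket for 6, and never visits the bucket where the object was originally filed under 5.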

Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains

Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.

This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.

Rule: GetHashCode must never throw an exception, and must return

Getting a hash code simply calculates an integer; there's no reason why it should ever fail. An implementation of GetHashCode should be able to handle any legal configuration of the object.

I occasionally get the response "but I want to throw NotImplementedException in my GetHashCode to ensure that the object is never put into a hash table; I don't intend for this object to ever be put into a hash table."  Well, OK, but the last sentence of the previous guideline applies; this means that your object cannot be a result in many LINQ-to-objects queries that use hash tables internally for performance reasons.

Since it doesn't throw an exception, it has to return a value eventually. It's not legal or smart to make an implementation of GetHashCode that goes into an infinite loop.

This is particularly important when hashing objects that might be recursively defined and contain circular references. If hashing object Alpha hashes the value of property Beta, and hashing Beta turns right around and hashes Alpha, then you're going to either loop forever (if you're on an architecture that can optimize tail calls) or run out of stack and crash the process.

Guideline: the implementation of GetHashCode must be extremely fast

The whole point of GetHashCode is to optimize a lookup operation; if the call to GetHashCode is slower than looking through those ten thousand items one at a time, then you haven't made a performance gain.

I classify this as a "guideline" and not a "rule" because it is so vague. How slow is too slow? That's up to you to decide.

Guideline: the distribution of hash codes must be "random"

By a "random distribution" I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be "clustered"; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values then that might decrease the number of buckets used and cause a performance problem when the bucket gets really big.
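Here is a hypothetical sketch of the clustering problem. Hashing a location by truncating its coordinates sends every point in the same city block to a single hash value, while a combiner that mixes the bits of the full doubles spreads them out (the multiplier 31 is just an illustrative choice):

```csharp
var distinctNaive = new HashSet<int>();
var distinctBetter = new HashSet<int>();
var rng = new Random(12345);
for (int i = 0; i < 1000; i++)
{
    // One thousand points clustered inside a single city block.
    double lat = 47.60 + rng.NextDouble() * 0.01;
    double lon = -122.33 + rng.NextDouble() * 0.01;
    distinctNaive.Add(NaiveHash(lat, lon));
    distinctBetter.Add(BetterHash(lat, lon));
}
Console.WriteLine(distinctNaive.Count);  // 1: every clustered point collides
Console.WriteLine(distinctBetter.Count); // close to 1000: near-perfect spread

// Truncation throws away exactly the bits that vary within a cluster.
static int NaiveHash(double lat, double lon) => (int)lat ^ (int)lon;

static int BetterHash(double lat, double lon) =>
    unchecked(lat.GetHashCode() * 31 + lon.GetHashCode());
```

With the naive hash, a ten-thousand-item set of city locations degenerates back into the single linear list we started with.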

Again, I list this as a guideline rather than a rule because it is somewhat vague, not because it is unimportant. It's very important. But since good distribution and good speed can be opposites, it's important to find a good balance between the two.

I know this from deep, personal, painful experience. Over a decade ago I wrote a string hash algorithm for a table used by the backend servers. I thought it was a reasonably randomly distributed algorithm, but I made a mistake and it was not. It turned out that every one of the hundred thousand five-character strings containing only digits hashed to one of just five buckets, instead of being spread across the six hundred or so buckets that were available. The guys were using my table to attempt to do fast lookups of tens of thousands of US postal codes, all of which are strings of five digits. Between that and a threading bug in the same code, I wrecked the performance of an important page; this was both costly and embarrassing. Data is sometimes heavily clustered, and a good hash algorithm will take that into account.

In particular, be careful of "xor". It is very common to combine hash codes together by xoring them, but that is not necessarily a good thing. Suppose you have a data structure that contains strings for shipping address and home address. Even if the hash algorithm on the individual strings is really good, if the two strings are frequently the same then xoring their hashes together is frequently going to produce zero. "xor" can create or exacerbate distribution problems when there is redundancy in data structures.
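A tiny sketch of the pitfall: x xor x is always zero, so every record whose shipping and home addresses match collides in the same bucket, whereas a multiply-and-add combiner (again, 31 is just an illustrative multiplier) keeps the two fields' contributions apart:

```csharp
int home = "123 Main St".GetHashCode();
int shipping = "123 Main St".GetHashCode(); // same string, same hash

// xor collapses equal inputs to zero, for *every* such record:
Console.WriteLine(home ^ shipping);                 // always 0
// multiply-and-add preserves both contributions:
Console.WriteLine(unchecked(home * 31 + shipping)); // almost never 0
```

However good the string hash is, the xor combiner maps the entire "same address twice" population onto a single hash value.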

Security issue: if the hashed data can be chosen by an attacker then you might have a problem on your hands

When I wrecked that page, it was an accident that the chosen data interacted poorly with my algorithm. But suppose the page had in fact been collecting data from users and storing it in a hash table for server-side analysis. If the user is hostile and can deliberately craft huge amounts of data that always hashes to the same bucket then they can mount a denial-of-service attack against the server by making the server waste a lot of time looking through an unbalanced hash table. If that's the situation you are in, consult an expert. It is possible to build hostile-data-resistant implementations of GetHashCode, but doing so correctly and safely is a job for an expert in that field.

Security issue: do not use GetHashCode "off label"

GetHashCode is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:

* It does not provide a unique key for an object; the probability of collision is extremely high.
* It is not of cryptographic strength, so do not use it as part of a digital signature or as a password equivalent.
* It does not necessarily have the error-detection properties needed for checksums.

and so on.

Getting all this stuff right is surprisingly tricky.

(*) If you have ten pigeons kept in nine pigeonholes, then at least one pigeonhole has more than one pigeon in it.

Comments (37)
  1. Ben says:

    Thanks for this blog post.  It's good to see an authoritative source provide guidelines for GetHashCode.  There's such a hodgepodge of misinformation out there right now.  Sadly I must report that I'm not currently in adherence with your guidelines.

    Thanks also for the great blog 🙂

  2. pete.d says:

    "…perhaps with an IHashable interface. But when the CLR type system was designed there were no generic types and therefore a general-purpose hash table needed to be able to store any object."

    Hindsight's 20/20 of course. But I don't see why, given that there _were_ interfaces when the CLR type system was designed, and the general-purpose hash table could simply have stored instances of IHashable instead of instances of Object. Either way, client code would have to cast back anyway.

  3. LaughingJohn says:

    Thanks Eric, very informative.

    Given that we now have the guidelines and rules can you or anyone give us any clue about actually writing a good hash code. Lets say I have a class that has properties which only use built in types. Is there a "best practice" for that scenario?

    Is there anything that could be provided in the framework to either automate or help with the provision of Hash codes?

    Good question. What we do when hashing fields of an anonymous type is roughly:

    hash = initialValue;
    hash = hash * multiplier + (field1 == null ? 0 : field1.GetHashCode());
    hash = hash * multiplier + (field2 == null ? 0 : field2.GetHashCode());

    where "initialValue" is a value chosen by the compiler based on the names of the anonymous type fields, and multiplier is a large prime number.

    That gives us a good balance between speed and good distribution. However, if you know more about how the data is clustered then you might be able to do a better job. — Eric

  4. DanyX says:

    Great blog post! One thing that follows from the "stable hash codes" rule and the "equal objects equal hashcodes" rule is that it is difficult to write a correct implementation for GetHashCode for mutable objects. Could you outline the pros and cons of different approaches to handle this kind of situation? We currently work with a guid like data structure for our application where we need deep-copy functionality to handle equality and hash codes in a predictable way with mutable data structures.

  5. Arun Mahendrakar says:

    This is just amazing. I've always been curious (and have tried various articles) to understand the GetHashCode() logic. None of the blog articles come close to this with regards to the explanation. Thanks a lot Eric.

  6. Shuggy says:

    As a side note it can be very hard to get this right in all cases as the BCL writers may have found:

    var a = BitConverter.Int64BitsToDouble(BitConverter.ToInt64(BitConverter.GetBytes(0xFFF8000000000000), 0));

    var b = BitConverter.Int64BitsToDouble(BitConverter.ToInt64(BitConverter.GetBytes(0xFFF8000000000001), 0));

    Console.WriteLine(a == b);

    Console.WriteLine(a.GetHashCode() == b.GetHashCode());

    Then again you shouldn't be using doubles as keys in a hash table anyway.

    Given this I never did understand why the Equals() method was special-cased to consider NaNs equal (even different NaNs) when they didn't ensure that GetHashCode() maintained the contract.

    I'd raise it as a bug on Connect, but by now I am sure such behaviour is in the ChangeRiskTooHigh(tm) bucket

  7. Dave Gladfelter says:

    Excellent article, thanks.  Your point about the risk of hashing algorithms' susceptibility to denial of service was especially enlightening.

  8. Gabe says:

    I recently used GetHashCode for seeding a random number generator to create demo data. I needed to create random number generators for many objects at once, but wanted them all to have different sequences. By using hash codes I was able to get repeatable results across runs for each object.

  9. Yann Trevin says:

    Might be worth to have a look at Gallio/MbUnit's hash code contract verifier:…/doku.php

  10. Michael Burge says:

    If you can't have an IHashable interface, why not a HashableObject class? I understand that multiple inheritance problems can creep up, and interfaces were introduced to solve this; but surely they can't be worse than putting it directly in Object.

  11. configurator says:

    Why does the multiplier in your anonymous object example need to be prime? And if it does, why does the Tuple implementation use 33?

    There's some black magic here. First off, note that multiplication is nothing more than repeated bit shifts and adds; multiplying by 33 is just shifting by five bits and adding. Basically this means "mess up the top 27 bits and keep the bottom 5 the same" in the hopes that the subsequent add will mess up the lower 5 bits. Multiplying by a largish prime has the nice property that it messes up all the bits. I'm not sure where the number 33 comes from though.

    I suspect that prime numbers turn up in hash algorithms as much by tradition and superstition as by science. Using a prime number as the modulus and a different one as the multiplier can apparently help avoid clustering in some scenarios. But basically there's no substitute for actually trying out the algorithm with a lot of real-world data and seeing whether it gives you a good distribution. – Eric

  12. Olivier says:

    Yeah, getting GetHashCode() right is really important…

    At work we ran into a problem once, where a Dictionary had linear lookup time (which ruined everything performance-wise). Of course, GetHashCode() was the culprit.

    It turns out it wasn't our fault: the keys of the dictionary were binary data serialized as a string, and the .Net runtime for 64bit has that bug in String.GetHashCode() where it doesn't look past the first '\0' byte (and our keys mostly began with such bytes).

    What we did is, we just XOR'ed all our keys with some constant that minimized the probability of '\0' bytes occurring.

  13. configurator says:

    By the way, from the class System.Tuple:

    internal static int CombineHashCodes(int h1, int h2)
    {
        return (((h1 << 5) + h1) ^ h2);
    }
    Possibly, 33 is used because it is slightly quicker than actually multiplying and is 'good enough'.

    I usually use something like 15 or 31 when I implement it myself; this messes up lots of bits, I think, and it's worked well for me so far.

    Of course, there's no substitute for type-specific hashes; if you have a boolean, only make it change one bit, and if you have an enum with 10 possible values, there's no reason to give it 8 bits of space.

  14. Shuggy says:

    For those asking about the multiplication (and ways you can do better) have a read of…/inthash.htm

    For some other very useful discussion see…/getting-hash-of-a-list-of-strings and the discussion

  15. My "Simple Guide to writing good Equals and GetHashCode" ™:

    1. Write a simple program that creates an anonymous type with the same fields that you believe should participate in the definition of equality for your object.

    2. Decompile said program with Reflector, and copy the bodies of Equals and GetHashCode over to your real code, correcting types as needed for Equals.

    Of course, this still places the burden on you to pick the fields…

    It would be nice to have something like C++0x "=default"  to auto-generate good sensible implementation of Equals & GetHashCode given a list of fields. Maybe something like:

    class Xyzzy {
      // Equals & GetHashCode are generated automatically, same as for anonymous
      // types, taking into account fields Foo and Bar but not Baz.
      [DefaultEquality] int Foo;
      [DefaultEquality] int Bar;
      int Baz;
    }

  16. paul_sh says:

    Thanks for article, Eric!

    Unfortunately, the default implementations of GetHashCode for some value types violate the rules as well. For example, the decimal type (see Connect ID 314630 for details). I've also experienced such a violation with a simple struct of primitive value types.

    So the guideline is the following: if you want your type to play well with LINQ, take care of its GetHashCode method.

  17. Shuggy says:

    @paul_sh thanks.

    this is a bit silly:…/single-double-break-contract-for-equals-and-gethashcode

    is marked as fixed when it's still broken for NaN's

    Connect will not let me re-open a bug so I actually think I'll raise another

  18. Erik says:

    "I think if we were redesigning the type system from scratch today…"

    Please do! And get rid of null-references while you're at it. Use Option<T> or Maybe<T> instead. I hate checking for null.

    P.S. I'm not really serious. I just hate nulls.

  19. Ferdinand Swaters says:

    How is it with the default GetHashCode of a struct, can we use that or should we write our own?

  20. paul_sh says:

    @Ferdinand Swaters How is it with the default GetHashCode of a struct, can we use that or should we write our own?

    You must write your own. Even if it consists of just a bunch of primitive value types. Even the MSDN agrees on this: " Value types **must** override this method to provide a hash function…". (…/system.object.gethashcode.aspx)

  21. Bill P. Godfrey says:

    I recall back in the .NET1 days, the hash code of a Process object was the only way to get something like a process ID.

  22. Chris B says:

    Why are hash codes allowed to be negative, i.e., why not return uint instead of int?  Is there a situation where a negative hash is useful?

    The Common Language Specification does not require a compliant .NET language to be able to understand uints, and therefore forbids a CLS-compliant object model from exposing a method that returns a uint. It is desirable that GetHashCode be compliant with the CLS rules. — Eric

  23. Alex Lo says:

    I've never seen a good reason to implement your own GetHashCode – thus my objects always have the default Object implementation (as a corollary, never override default Equals() either).  As a follow up, could you explain when you might want to do this?

  24. >> when you might want to do this

    My take on that would be:

    For structs, you should implement Equals and GetHashCode as a matter of efficiency (the default implementation supplied by ValueType uses reflection and is slow).

    For classes, you should implement Equals and GetHashCode when your class is an "reference value type" (i.e. the interesting part about it is solely the data that it represents, not its object identity; this should generally also imply immutability). Good examples from the Framework itself are Tuple, String, Uri and XName.

  25. Simon Buchan says:

    I knew a lot of this beforehand, though I'm guilty of ^ing a set of fields together; I was more worried about just keeping the Equals() rule than hash table performance (not likely to be more than 1000 names used in a source file, right?)

    So the default Object.GetHashCode() is derived from the object identity, right? Does the BCL give a method to get back that value or similar for debug purposes at least? An equivalent to Object.ReferenceEquals()?

  26. says:

    > But why should it be the case that every object should be able to hash itself for insertion into a hash table? Seems like an odd thing to require every object to be able to do.

    I want to point out that I'm glad the CLR supports hashing on reference equality for all objects! This comes in handy often. Fortunately, the object does not need to "hash itself." In this case, the CLR provides exactly what is needed. One reason this is useful is that it provides an efficient way to add extra fields to any instance of a reference type. In the hash table, use the main object as the key and an object with the extra fields as the value.

    The Object base class implementation of GetHashCode calls System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode. For reference types, this always returns hash codes consistent with reference equality. That is, if Object.ReferenceEquals(a, b) then RuntimeHelpers.GetHashCode(a) == RuntimeHelpers.GetHashCode(b).

    Sometimes you find a class that overrides GetHashCode to provide value equality but you want to insert it into a hash table based on reference equality. In these cases you can implement an IEqualityComparer<T> that calls RuntimeHelpers.GetHashCode. (I previously argued that the .NET Framework should implement this for you:…/provide-an-implementation-of-iequalitycomparer-based-on-object-identity.)

    I agree that IHashable would have been better than Object.GetHashCode–so long as some way to hash instances of all reference types on reference equality still existed.

  27. Simon Buchan says:

    @BinaryCode: Wow, that was fast, thanks!

  28. Bill P. Godfrey says:

    Correction. I was thinking of the Thread ID, not the Process ID.

  29. Paul G says:

    Good article, thanks. A good part two would expand on some of the comments with strategies for creating hash values.

    In the past I've been stuck in the situation of not fully understanding how the hash value was used internally (only that equal objects should return equal hashes). Therefore when implementing for objects with mixtures of strings and values I've often found myself concatenating the values into a string (e.g. name + "," + age) and then returning .GetHashCode() of the string.

    It seems too easy. Is there a downside to this approach?

  30. Alex Lo says:


    So for structs it is a matter of the ValueType.GetHashCode() being inefficient – ok – why isn't this fixed in the CLR?

    But what is the advantage of what you propose for classes that 'is an "reference value type" (i.e. the interesting part about it is solely the data that it represents, not its object identity; this should generally also imply immutability).'?  What does that add?

  31. Random832 says:

    "so for structs it is a matter of the ValueType.GetHashCode() being inefficient – ok – why isnt this fixed in the CLR?"

    What makes you think a fully general implementation of this could be more efficient than it is? The reason you're supposed to override it is to provide an implementation that's _not_ fully general and doing lots of extra checking that your type doesn't need [e.g. iterating over the fields with reflection to see if any of them override GetHashCode]

    "But what is the advantage of what you propose for classes that 'is an "reference value type" (i.e. the interesting part about it is solely the data that it represents, not its object identity; this should generally also imply immutability).'?  What does that add?"

    Well, the most common example of a "reference value type" is String. How would you feel if you were unable to reliably use a String as a key in a hash table?

  32. Alex Lo says:

    @Random / Pavel

    I think I understand your points better now.  For structs, as they are value types you'd expect Equals() to work when two have the same content, therefore the GetHashCode should be the same between two with equal content.  Likewise for strings, "a" and "a" should be equal (even though the object ref may not be) – Tuple, String, Uri and XName, I see.  I think then you could say that you ought to implement GetHashCode when you want the Equals() operator to be a logical Equals and not a ReferenceEquals?  is that a fair characterization?

  33. >> What makes you think a fully general implementation of this could be more efficient than it is?

    Given that ValueType is one of the "magical" CLR types about which the runtime has special knowledge, it's quite possible – for example, the runtime could detect that your struct does not override Equals/GetHashCode, and generate its own (separately for every struct) that accesses fields directly.

    >> Likewise for strings, "a" and "a" should be equal (even though the object ref may not be) – Tuple, String, Uri and XName, I see.  I think then you could say that you ought to implement GetHashCode when you want the Equals() operator to be a logical Equals and not a ReferenceEquals?  is that a fair characterization?

    Pretty much, yes. Equals and GetHashCode always go hand in hand – if you're not overriding Equals, the default GetHashCode is good as it is. If I remember correctly, C# compiler will even warn if you override one but not the other.

    Ultimately, the point is that for all those types like String, reference equality is mostly meaningless since you don't derive anything useful out of knowing whether it is the same object or not. Hence why you'd want to default to value comparisons. If anyone actually does care about object identity (e.g. for strings, this is sometimes used as an optimization, either with interning or with a local symbol table, like XmlNameTable in System.Xml), they can always use Object.ReferenceEqual.

    Usually, for those kinds of types, operator== is also overloaded: out of those listed, for String this is done on language level, Uri and XName actually do overload it, and Tuple doesn't overload it because there is no clear way to propagate language-dependent equality semantics of tuple components (which can differ e.g. between C#/VB).

  34. Oh, and here's one more very good reason why you should always override Equals and GetHashCode on your structs in practice:…/valuetype-equals-is-incorrect

  35. Marc Brooks says:

    Olivier, yes, I reported that bug here: (and as usual, Microsoft decided to close as Won't Fix)

    I've written extensively about hashing here:

  36. Martijn says:

    Thank you very much for this post, which was pointed out to me on stackoverflow:…/11411100

    The corollary of

    "Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable" (and thus equality comparison should never change while the object is contained in a data structure that depends on equality – other analogies might also be appropriate) is the following rule, which is in fact more or less equivalent to the one above:

    "GetHashCode should never depend on values that are mutable"

Comments are closed.
