System.Collections.Generic Dictionary Capacity, Hashing, and Collisions.

Somebody asked me a question how to set the initial capacity of System.Collections.Generic.Dictionary<K,V> if one knows that the Dictionary will contain 1000 entries.

The short answer

Dictionary<int,string> numbers = new Dictionary<int,string>(1000);

The long answer

Dictionary<K,V> is a generic type almost identical to the non-generic System.Collections.Hashtable. Both types are implemented as hashtables, but they are slightly different implementations. In particular, they use different algorithms to deal with collisions. Hashtable uses probing and Dictionary<K,V> uses chaining.

Probing means that when a collision is detected, the algorithm tries to store the new entry in the next free bucket. When a hashtable gets close to full (saturated), finding the next free bucket becomes difficult. For this reason, hashtables that use probing grow their size way before they get completely saturated to minimize collisions. In our Hashtable, the threshold of acceptable saturation is controlled by the loadFactor parameter the constructor, but this is a topic for another time. Hashtables using probing always have more buckets allocated than the number of entries they store.

Now, the interesting question is whether the capacity specifies the number of buckets we allocate or the algorithm is actually smarter. The good news is that we take the loadFactor into account when allocating the buckets. So, if you specify capacity of let’s say 1000, we actually allocate more buckets. In fact we allocate enough buckets to ensure that we don’t have to grow the number of buckets before more entries are added than the specified capacity.

Chaining means that all entries with the same hash code are stored in a list associated with one bucket corresponding to the hash code. Therefore the Dictionary<K,V> can be completely saturated, we don’t use loadFactor, and we simply take the next prime number after the capacity to determine how many buckets to allocate.