Performance Quiz #12 — The Cost of a Good Hash — Solution




Well once again there have  been many thoughtful replies to both the original question as well as the followup with hints.


 


Recall the original question asked a bit of trivia:  could you name 5 implementations of GetHashCode in the framework that do things you might not expect to see in a hash function?  It’s a little bit of a vague question and also requires a good deal of arcane knowledge of the CLR. The most interesting thing about the discussion I think is just what people considered “unusual.”  But as always there was a method to my madness.


So without further ado here’s my list of “unusual” hash functions, there are 53 of them.


 



































































































































































GetHashCode Methods that Allocate
java.lang.reflect.Method.GetHashCode 4.2
com.ms.lang.MulticastDelegate.GetHashCode 4.2
com.ms.lang.Delegate.GetHashCode 4.2
System.Uri.GetHashCode 3.8
System.Security.Policy.HashMembershipCondition.GetHashCode 3.8
java.lang.reflect.Field.GetHashCode 3.5
com.ms.wfc.ui.Region.GetHashCode 3.5
com.ms.wfc.ui.Pen.GetHashCode 3.5
com.ms.wfc.ui.Font.GetHashCode 3.5
com.ms.wfc.ui.Color.GetHashCode 3.5
com.ms.wfc.ui.Brush.GetHashCode 3.5
System.Security.Policy.NetCodeGroup.GetHashCode 3.5
System.Security.Policy.FileCodeGroup.GetHashCode 3.5
System.Security.Policy.CodeGroup.GetHashCode 3.5
System.Security.PermissionSet.GetHashCode 3.3
System.Security.NamedPermissionSet.GetHashCode 3.3
System.Configuration.SettingElement.GetHashCode 3.1
System.Diagnostics.ListenerElement.GetHashCode 3.0
System.Configuration.ConfigurationElement.GetHashCode 3.0
java.math.BigDecimal.GetHashCode 2.8
System.Web.Configuration.TagPrefixInfo.GetHashCode 2.8
System.Web.Configuration.TransformerInfo.GetHashCode 2.4
System.Web.Configuration.TagMapInfo.GetHashCode 2.4
System.Web.Configuration.ProfileGroupSettings.GetHashCode 2.4
System.Web.Configuration.CustomError.GetHashCode 2.4
System.Web.Configuration.BuildProvider.GetHashCode 2.4
System.Web.Configuration.NamespaceInfo.GetHashCode 2.1
System.Security.Policy.StrongNameMembershipCondition.GetHashCode 2.0
System.Xml.Serialization.CaseInsensitiveKeyComparer.System.Collections.IEqualityComparer.GetHashCode(System.Object) 1.9
System.Security.Policy.PublisherMembershipCondition.GetHashCode 1.7
System.Xml.Schema.KeySequence.GetHashCode 1.6
System.Data.SqlTypes.SqlString.GetHashCode 1.6
System.Data.SqlTypes.SqlDecimal.GetHashCode 1.6
java.text.CollationKey.GetHashCode 1.5
System.Security.Policy.UrlMembershipCondition.GetHashCode 1.5
System.Security.Policy.SiteMembershipCondition.GetHashCode 1.5
System.Globalization.SortKey.GetHashCode 1.5
System.Net.Cookie.GetHashCode 1.3
System.Management.ManagementBaseObject.GetHashCode 1.3
java.math.BigInteger.GetHashCode 1.0
System.Windows.Forms.TableLayoutPanelCellPosition.GetHashCode 1.0
System.Windows.Forms.DataGridViewCellStyle.GetHashCode 1.0
System.Windows.Forms.DataGridViewAdvancedBorderStyle.GetHashCode 1.0
System.Windows.Forms.DataGridView+HitTestInfo.GetHashCode 1.0
System.Web.UI.ControlCachedVary.GetHashCode 1.0
System.Web.UI.AttributeCollection.GetHashCode 1.0
System.Web.FileSecurity+DaclComparer.System.Collections.IEqualityComparer.GetHashCode(System.Object) 1.0
System.Web.Configuration.RootProfilePropertySettingsCollection.GetHashCode 1.0
System.Web.Caching.CachedVary.GetHashCode 1.0
System.Security.Cryptography.X509Certificates.X509CertificateCollection.GetHashCode 1.0
System.Security.AccessControl.GenericAce.GetHashCode 1.0
System.Drawing.FontFamily.GetHashCode 1.0
System.Configuration.ConfigurationElementCollection.GetHashCode 1.0


And you can already see why I think they are unusual:  They allocate.  In my opinion any “normal” hash function should be able to do its job without creating any temporary objects.  Given the frequency with which things can be hashed I think that’s important.  That doesn’t mean that I’m going to go to Red Alert about any of the above but it does mean that we should think more carefully about creating big hash tables of the above.


OK, so, great, an interesting observation.  What’s the point?


Well, I while I believe that the particulars of this question are not profound, the notion that you can have a cost metric applied universally to the runtime — even one that is as simple minded as the one I just proposed — *is* profound.


Even though my “allocation complexity” metric may be feeble it is already useful and actually it is my hope that many people find it feeble and write about how their metric is much better and by the way here it is the much better metric in text format etc.  etc.  Imagine if we had the methods scored in a variety of ways all readily available and pluggable into intellisense.  You could have scores relating to synchronization, i/o, memory, algorithmic complexity, even measured data where appropriate.  Other engineering disciplines have abundant information about their raw materials but for those of us in the software business its often either guesswork or our own hard-won data.


Let’s look at what you can do with even my feeble metric.  In my very first quiz I asked (in part) whether it would be better to call Write three times on a stream or create a format string and call it once.  Can we answer that question with the published metric?


Here are the relevant two lines from my costs file:



System.IO.TextWriter.Write(System.String) 0
System.IO.TextWriter.Write(System.String,System.Object[]) 1.5


That’s not as good as the actual measurement but, wow, that’s something now isn’t it?  Rough guidance to give you a clue!  A great “tip” that the var-args formatting option has some underlying cost.

Even these rough costs are very powerful in terms of helping you write code.  Suppose you’re writing a new “Foo” and you want it to have cost that is roughly the same as existing “Foo”.  You could go and look at how the existing “Foo” methods score which gives you some idea what costs you can afford in your new “Foo”. 

Importantly you must observe Rico’s Rule:  if you are writing a new “Foo” method and you want that method to have a cost of X (on any metric) then you must not use any methods whose cost is known to be greater than X.  Just that simple rule can save you from many costly mistakes.

Finally you have some way to answer the question:  Is it even remotely reasonable to use a given method in a given context.  Certainly this particular metric is imperfect, maybe even feeble, but it’s something.  And it will only get better if we all work on it.

Comments (6)

  1. JoelMartinez says:

    Wow, this is pretty awesome Rico.  I really think that this can evolve to be a very useful tool in the ol’ toolbelt

  2. Ihar Bruy says:

    TextWriter.Write with format string and three parameters is System.IO.TextWriter.Write(System.String format, Object arg0, Object arg1, Object arg2) but not System.IO.TextWriter.Write(System.String, System.Object[])

  3. Norman Diamond says:

    Maybe .Net (and maybe Win32) need two completely separate sets of hashing functions.  One set to be cryptographically secure, in which security is more important than cost.  One set to be fast, to make the use of hash tables faster than simple linked lists.  The second set will do no memory allocations ever, but might sometimes do a single division by a prime number.

  4. Rico is doing some very interesting work to get some sort of idea about the allocation overhead of using

  5. Jeffrey Sax says:

    Hidden allocations is one of two main reasons why I feel the ‘default’ implementation of compound assignment operators in C# and VB is so unfortunate. These operators are commonly used deep in loop constructs, but they are less than optimal:

    If your operands are a reference type, then you have at least one allocation and one dead block of memory in each iteration.

    If your operands are a value type, then you have the overhead of a function call, because as far as I know, methods with value type parameters still don’t get inlined.

    Why bother having the feature at all if you can’t really use it in decently performing code?

  6. A quick graphical view of how the framework measures up. This graph shows the number of methods of any