The problem with being second

First, hats off to the Mono team.  I think they’ve done a great job at writing some great software, but also at proving that evil MS really does know how to produce a truly common language runtime that can be properly standardized and ported to other platforms and languages.

Now for the meat of my blog.  Working on the AMD64 JIT for the runtime is really fun, but it has a few downsides.  You we are the ‘second’ JIT that MS has written.  As such the knee-jerk reaction of testers is not to consult the spec and see if what we’re doing is different, but acceptable.  Instead they blindly compare it to the first JIT (the x86 JIT) and if it’s different and it breaks their test, it must be wrong.  As we all know, 2 different implementations of the same spec are going to behave differently.

The biggest difference I see is that the AMD64 JIT has borrowed a lot of code and optimizations from the native AMD64 compiler.  Because of this, we perform some optimizations that are 100% legal according to the CLR spec, but that are certainly unexpected by those who have been using the x86 JIT for a while.  Almost all of these examples stem from end users not writing proper thread safe code.

Here’s a real example from System.Collections.Hashtable:

private bucket[] buckets;

public virtual bool ContainsKey(Object key) {
    if (key == null) {
        throw new ArgumentNullException("key", Environment.GetResourceString("ArgumentNull_Key"));

    uint seed;
    uint incr;
    // Take a snapshot of buckets, in case another thread resizes table
    bucket[] lbuckets = buckets;
    uint hashcode = InitHash(key, lbuckets.Length, out seed, out incr);
    int  ntry = 0;
    bucket b;
    int bucketNumber = (int) (seed % (uint)lbuckets.Length);            
    do {
        b = lbuckets[bucketNumber];
        if (b.key == null) {
            return false;
        if (((b.hash_coll &  0x7FFFFFFF) == hashcode) && 
            KeyEquals (b.key, key))
            return true;
        bucketNumber = (int) (((long)bucketNumber + incr)% (uint)lbuckets.Length);                              
    } while (b.hash_coll < 0 && ++ntry < lbuckets.Length);
    return false;

I added the emphasis.  Notice the comment that indicates the code is assuming that this.buckets will change on another thread. With the x86 JIT, simply copying the member into a local was enough.  However, the AMD64 JIT applied some simple copy-propagation to this code and realized that since the this pointer was already enregistered, it was easy to access this.buckets as a simple indirection. So the AMD64 JIT completely ignores the local copy. At this point some of you might be thinking that was an illegal optimization.  It’s not.  Go read the specs and study your compiler theory.

There is exactly one thing needed to correct this method.  Its the volatile keyword.  By simply marking the buckets member as being volatile, then the JIT cannot add or remove reads or writes to it, thus it cannot copy-prop.

As a side note, I believe this code really is fine even if the buckets array does get bigger on another thread.  It would only have a problem it buckets got smaller after bucketNumber was calculated, but before it was used to index into the array.  Don’t quote me on this, but I vaguely remember that Hashtable never does resize smaller, so this really is a total non-issue, but still a good example.


Comments (28)

  1. Jim Arnold says:

    Er, what’s the point in having a portable runtime if I have to make my high-level code aware of the processor I’m running on? Are you saying that the behaviour of my code varies between x86 and AMD64?

  2. Grant says:

    No you shouldn’t have to be aware of the processor you’re running on, but yes, there will be differences. And specifically if you’re writing multi-threaded code and you don’t follow the spec for all the things the runtime is allowed to do (but may not actually be doing right now on x86 version 1.0 or 1.1) you will likely get bitten.

    My point is that that x86 is allowed to do the same thing, but they don’t (right now). It is tough being second because people find these differences and naturally assume it’s a bug in the new stuff, rather than their own mis-use or misunderstanding.


  3. Joe says:

    I agree, hats off to the Mono team, but suspect that over time it will be difficult to keep up with the evolution of the .NET Framework. And a minimal CLR won’t be enough for many users – .NET 2.0 contains lots of stuff which will quickly become "must have".

    I remember people taking their hats off to IBM when OS/2 was a "better Windows than Windows". But IBM couldn’t keep up with the evolution to Win32.

    I already know of one subtle difference which could mean an app that runs in Mono might break in MS .NET.

  4. What about the spot CLR, or the compact CLR?

    Aren’t you fourth? 🙂

  5. Grant says:

    I’ve never heard of the spot CLR.

    In the case of the compact CLR, they weren’t trying to duplicate the desktop CLR, but simply mimic it. Thus certain differences were expected. Also due to device constraints the compact CLR is unlikely to perform more optimizations than the desktop CLR.

    People expect the desktop CLR and compact CLR to be slightly different. People won’t expect the CLR to be different between their server running on a Xeon and their server running on an Opteron.

    Yes, technically you are right about us not being the second JIT (there was even the Rotor JITs). But we are the second desktop JIT built by Microsoft that was expected to perform as good as the existing JIT, but on a new architecture.


  6. Ah, the Spot CLR is the never confirmed but widely rumoured CLR for the Spot watches.

    So is there an Intel 64bit CLR out there as well?

    Is the AMD64 CLR tied to XP-64bit, or will it be optimising in a 32 bit environment?

  7. Grant says:

    Yes, the AMD64 JIT will support the Intel processors that implement the x64 extensions, as well as the Itanium family. These 64-bit JITs will only work on a 64-bit CLR on a 64-bit OS (Windows Server 2003 for x64 is in beta IIRC).

    Also don’t get me wrong, the x86 JIT is working on a fair number of their own optimizations to speed things up.


  8. anonymous says:

    Top wish for this JIT (and the next version of the 32-bit one):

    Allow inlining of functions that operate on value types. 3D vector math would be soo much faster and more convienent if you could use a vector class without worrying that every little vec3+vec3 will turn into a full function call..

  9. Craig Thrall says:

    Here’s a question for you. If the spec allows both behaviors, does that indicate a problem in the spec that should be revisited? Make a change to specifically say "local references will change across threads?"

  10. Grant says:

    That not what’s happening. Look at the spec for the C# keyword volatile or the volatile IL opcode prefix. Unless those are present the JIT/runtime is free to assume that only 1 thread is modifying the variable at a time, and therefore can perform agressive optimizations that make your app run faster (like putting things in registers). That’s all in the spec. The problem is that most people don’t read the spec, they just try it and see if it works. 🙁


  11. I can’t for the life of me figure out why you think this optimization is legal.

    If I understand you correctly, you are effectively saying the contents of a local variable can be affected by another thread, right? But when Craig said this you reply that that is not the case, but your explanation of volatile is totally incorrect. Volatile loads/stores act as a memory barrier, but that’s basically irrelevant here, we’re looking at the behavior of a single thread and your optimization clearly violates the spec:


    Conforming implementations of the CLI are free to execute programs using any technology that guarantees, within a single thread of execution, that side-effects and exceptions generated by a thread are visible in the order specified by the CIL.


  12. Grant says:

    OK. Show me how replacing all occurances of the local ‘lbuckets’ with the member variable ‘buckets’ violates that statement. The replacement will not change the order of exceptions, nor any of the side-effects on a *single* thread.

    On multple threads, yes the side-effects definitely change, but the spec is clear about only applying to a single thread.

    You are correct about volatile acting as memory barriers. That basically means that reading and writing suddenly become side-effects. The original code has only one read of the member variable ‘buckets’. If it was a volatile read, the JIT would be forced to only have one read and thus force the JIT to use that read to copy into a non-volatile local (as the user intended). If it didn’t and instead tried to replace all occurances of ‘lbuckets’ with ‘buckets’ it would be violating the rule because now there are more memory barriers. Hence the need for volatile.


  13. "To a amateur, if it works, it’s right. To a professional, if it’s right, it works".

    I welcome any new optimizations, even if they happen to break someone’s buggy code.

  14. I just hope when MS guys will realize how much code is broken by those optimization they won’t simply remove it, but rather make it an opt-in atrribute. Maybe add another flag to CompilationRelaxations

  15. I gave this some thought, and I really don’t think very much code would be broken. Relatively few .NET applications use threading, and those that do invariably do it to improve the speed or responsiveness of their program… any optimizations would be good news.

    Not to mention, we’re talking about a Major New Release–this is 2.0 coming up, not 1.2. This is the time for riskier changes.

    CompilationRelaxations is worth considering, but I don’t think using it to support the small amount of flawed code affected by this issue is the right direction.

  16. I disagree with your (and apparently most of the CLR team’s) reading of the spec (and even if I’d agree, I’d say the spec is broken), but let’s assume for a moment that I agree with your reading, in that case the only place where things can go wrong is the "lbuckets.Length" that is passed into the InitHash method. The subsequent ones that you highlighted are not allowed to be read from the field, because InitHash contains a virtual method call to GetHash and that effectively prevents the JIT from seeing if anyone else on this thread is going to write to the buckets field.

  17. Is this really an issue in this case? I thought that Hashtables weren’t intended to be thread-safe, and if you’re using them in two threads and are modifying the data, it’s up to the caller to use monitors or other synchronization techniques. If one thread calls Add and another calls ContainsKey without locking, that’s not the fault of the code or the JIT, is it?

  18. How to be more volatile.

  19. How to be more volatile.

  20. A while ago, I was directed to a disturbing (in my view) post on GrantRi’s blog, to do with the .NET