Memory Model


One of the suggestions for a blog entry was the managed memory model. This is timely, because we’ve just been revising our overall approach to this confusing topic. For the most part, I write about product decisions that have already been made and shipped. In this note, I’m talking about future directions. Be skeptical.


So what
is a memory model?  It’s the
abstraction that makes the reality of today’s exotic hardware comprehensible to
software developers.


The reality of hardware is that CPUs are renaming registers, performing speculative and out-of-order execution, and fixing up the world during retirement. Memory state is cached at various levels in the system (L0 thru L3 on modern X86 boxes, presumably with more levels on the way). Some levels of cache are shared between particular CPUs but not others. For example, L0 is typically per-CPU, but a hyper-threaded CPU may share L0 between the logical CPUs of a single physical CPU. Or an 8-way box may split the system into two hemispheres with cache controllers performing an elaborate coherency protocol between these separate hemispheres. If you consider caching effects, at some level all MP (multi-processor) computers are NUMA (non-uniform memory access). But there’s enough magic going on that even a Unisys 32-way can generally be considered as UMA by developers.


It’s reasonable for the CLR to know as much as possible about the cache architecture of your hardware so that it can exploit any imbalances. For example, the developers on our performance team have experimented with a scalable rendezvous for phases of the GC. The idea was that each CPU establishes a rendezvous with the CPU that is “closest” to it in the cache hierarchy, and then one of this pair cascades up a tree to its closest neighbor until we reach a single root CPU. At that point, the rendezvous is complete. I think the jury is still out on this particular technique, but they have found some other techniques that really pay off on the larger systems.


Of course, it’s absolutely unreasonable for any managed developer (or 99.99% of unmanaged developers) to ever concern themselves with these imbalances. Instead, software developers want to treat all computers as equivalent. For managed developers, the CLR is the computer and it better work consistently regardless of the underlying machine.


Although managed developers shouldn’t know the difference between a 4-way AMD server and an Intel P4 hyper-threaded dual proc, they still need to face the realities of today’s hardware. Today, I think the penalty of a CPU cache miss that goes all the way to main memory is about 1/10th the penalty of a memory miss that goes all the way to disk. And the trend is clear.


If
you wanted good performance on a virtual memory system, you’ve always been
responsible for relieving the paging system by getting good page density and
locality in your data structures and access patterns.




In
a similar vein, if you want good performance on today’s hardware, where
accessing main memory is a small disaster, you must pack your data into cache
lines and limit indirections.  If
you are building shared data structures, consider separating any data that’s
subject to false sharing.
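
For instance, here is a minimal sketch of that last point (the 64-byte cache line size and all of the names are my assumptions, not anything the platform promises): two counters that are hammered by different threads can be forced onto separate cache lines with an explicit layout.

    using System.Runtime.InteropServices;

    // Two counters written by different threads. Laid out naturally they could
    // share a cache line, and every write would then invalidate the other CPU's
    // copy. Keeping them at least a cache line apart (64 bytes assumed here)
    // avoids that false sharing.
    [StructLayout(LayoutKind.Explicit)]
    struct PaddedCounters
    {
        [FieldOffset(0)]  public long CounterForThreadA;
        [FieldOffset(64)] public long CounterForThreadB;
    }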


To some extent, the CLR can help you here. On MP machines, we use lock-free allocators which (statistically) guarantee locality for each thread’s allocations. Any compaction will (statistically) preserve that locality. Looking into the very far future – perhaps after our sun explodes – you could imagine a CLR that can reorganize your data structures to achieve even better performance.


This means that if you are writing single-threaded managed code to process a server request, and if you can avoid writing to any shared state, you are probably going to be pretty scalable without even trying.


Getting back to memory models, what is the abstraction that will make sense of current hardware? It’s a simplifying model where all the cache levels disappear. We pretend that all the CPUs are attached to a single shared memory. Now we just need to know whether all the CPUs see the same state in that memory, or if it’s possible for some of them to see reordering in the loads and stores that occur on other CPUs.


At one extreme, we have a world where all the CPUs see a single consistent memory. All the loads and stores expressed in programs are performed in a serialized manner and nobody perceives a particular thread’s loads or stores being reordered. That’s a wonderfully sane model which is easy for software developers to comprehend and program to. Unfortunately, it is far too slow and non-scalable. Nobody builds this.


At the other extreme, we have a world where CPUs operate almost entirely out of private cache. If another CPU ever sees anything my CPU is doing, it’s a total accident of timing. Because loads and stores can propagate to other CPUs in any random order, performance and scaling are great. But it is impossible for humans to program to this model.


In between those extremes are a lot of different possibilities. Those possibilities are explained in terms of acquire and release semantics (there’s a small code sketch after this list):


  • A normal load or store can be freely reordered with respect to other normal load or store operations.

  • A load with acquire semantics creates a downwards fence. This means that normal loads and stores can be moved down past the load.acquire, but nothing can be moved up above the load.acquire.

  • A store with release semantics creates an upwards fence. This means that normal loads and stores can be moved up above the store.release, but nothing can be moved down below the store.release.

  • A full fence is effectively an upwards and downwards fence. Nothing can move in either direction across a full fence.
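
To make that vocabulary concrete, here is a small sketch (the types and names are mine). In C#, declaring the flag volatile is how you ask for a store.release on the write side and a load.acquire on the read side:

    class Message
    {
        public int Payload;          // ordinary field, accessed with normal loads and stores
        public volatile bool Ready;  // volatile: its stores are store.release, its loads are load.acquire
    }

    class HandOff
    {
        static Message m = new Message();

        static void Producer()
        {
            m.Payload = 42;   // normal store
            m.Ready = true;   // store.release: the Payload store cannot sink below this
        }

        static void Consumer()
        {
            if (m.Ready)              // load.acquire: later loads cannot float above this
            {
                int p = m.Payload;    // therefore sees 42 rather than a stale value
            }
        }
    }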

A super-strong extreme model puts a full fence after every load or store. A super-weak extreme model uses normal loads and stores everywhere, with no fencing.


The most familiar model is X86. It’s a relatively strong model. Stores are never reordered with respect to other stores. But, in the absence of data dependence, loads can be reordered with respect to other loads and stores. Many X86 developers don’t realize that this reordering is possible, though it can lead to some nasty failures under stress on big MP machines.


In terms
of the above, the memory model for X86 can be described as:


  1. All stores are actually store.release.

  2. All loads are normal loads.

  3. Any use of the LOCK prefix (e.g. ‘LOCK CMPXCHG’ or ‘LOCK INC’) creates a full fence (there’s a small sketch of this after the list).
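
From managed code, the usual way to get that LOCK-prefixed full fence is through the Interlocked family; a tiny sketch (the names are mine):

    using System.Threading;

    class HitCounter
    {
        static int hits;

        public static void Record()
        {
            // On X86 this typically JITs to a LOCK-prefixed instruction, so it is
            // both an atomic read-modify-write and a full fence.
            Interlocked.Increment(ref hits);
        }
    }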

Historically, Windows NT has run on Alpha and MIPS computers.


Looking forwards, Microsoft has announced that Windows will support Intel’s IA64 and AMD’s AMD64 processors. Eventually, we need to port the CLR to wherever Windows runs. You can draw an obvious conclusion from these facts.


AMD64
has the same memory model as X86.


IA64 specifies a weaker memory model than X86. Specifically, all loads and stores are normal loads and stores. The application must use special ld.acq and st.rel instructions to achieve acquire and release semantics. There’s also a full fence instruction, though I can’t remember the opcode (mf?).


Be
especially skeptical when you read the next paragraph:




There’s
some reason to believe that current IA64 hardware actually implements a stronger
model than is specified.  Based on
informed hearsay and lots of experimental evidence, it looks like normal store
instructions on current IA64 hardware are retired in order with release
semantics.




If this
is indeed the case, why would Intel specify something weaker than what they have
built?  Presumably they would do
this to leave the door open for a weaker (i.e. faster and more scalable)
implementation in the future.


In fact, the CLR has done exactly the same thing. Section 12.6 of Partition I of the ECMA CLI specification explains our memory model. It covers the alignment rules, byte ordering, the atomicity of loads and stores, volatile semantics, locking behavior, etc. According to that specification, an application must use volatile loads and volatile stores to achieve acquire and release semantics. Normal loads and stores can be freely reordered, as seen by other CPUs.


What is the practical implication of this? Consider the standard double-checked locking protocol:


    if (a == null)
    {
        lock(obj)
        {
            if (a == null) a = new A();
        }
    }


This is a common technique for avoiding a lock on the read of ‘a’ in the typical case. It works just fine on X86. But it would be broken by a legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA spec, acquiring a lock has acquire semantics and releasing a lock has release semantics.


However, we have to assume that a series of stores have taken place during construction of ‘a’. Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’. At that point, there is a small window before the store.release implied by leaving the lock. Inside that window, other CPUs can navigate through the reference ‘a’ and see a partially constructed instance.


We could
fix this code in various ways.  For
example, we could insert a memory barrier of some sort after construction and
before assignment to ‘a’.  Or – if
construction of ‘a’ has no side effects – we could move the assignment outside
the lock, and use an Interlocked.CompareExchange to ensure that assignment only
happens once.  The GC would collect
any extra ‘A’ instances created by this race.
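
Here is a minimal sketch of that second fix (the class and member names are mine, and the field is typed as object so it matches the Interlocked.CompareExchange(ref object, object, object) overload):

    using System.Threading;

    class A { /* assumed to have a side-effect-free constructor */ }

    class LazyA
    {
        static object a;   // object, to match the CompareExchange overload above

        public static A Get()
        {
            if (a == null)
            {
                // Construct outside any lock and publish only if nobody beat us to it.
                // The interlocked operation acts as a full fence, so the constructor's
                // stores cannot drift past the publishing store. A losing candidate
                // simply becomes garbage for the GC to collect.
                Interlocked.CompareExchange(ref a, new A(), null);
            }
            return (A)a;
        }
    }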




I hope
that this example has convinced you that you don’t want to try writing reliable
code against the documented CLI model.


I wrote a fair amount of “clever” lock-free thread-safe code in version 1 of the CLR. This included techniques like lock-free synchronization between the class loader, the prestub (which traps first calls on methods so it can generate code for them), and AppDomain unloading, so that I could back-patch MethodTable slots efficiently. But I have no desire to write any kind of code on a system that’s as weak as the ECMA CLI spec.


Even if
I tried to write code that is robust under that memory model, I have no hardware
that I could test it on.  X86, AMD64
and (presumably) IA64 are stronger than what we specified.


In my opinion, we screwed up when we specified the ECMA memory model. That model is unreasonable because:


  • All stores to shared memory really require a volatile prefix.

  • This is not a productive way to code.

  • Developers will often make mistakes as they follow this onerous discipline.

  • These mistakes cannot be discovered through testing, because the hardware is too strong.


So what
would make a sensible memory model for the CLR?




Well,
first we would want to have a consistent model across all CLI
implementations.  This would include
the CLR, Rotor, the Compact Frameworks, SPOT, and – ideally – non-Microsoft
implementations like Mono.  So
putting a common memory model into an ECMA spec was definitely a good
idea.




It goes
without saying that this model should be consistent across all possible
CPUs.  We’re in big trouble if
everyone is testing on X86 but then deploying on Alpha (which had a notoriously
weak model).




We would
also want to have a consistent model between the native code generator (JIT or
NGEN) and the CPU.  It doesn’t make
sense to constrain the JIT or NGEN to order stores, but then allow the CPU to
reorder those stores.  Or vice
versa.


Ideally, the IL generator would also follow the same model. In other words, your C# compiler should be allowed to reorder whatever the native code generator and CPU are allowed to reorder. There’s some debate whether the converse is true. Arguably, it is okay for an IL generator to apply more aggressive optimizations than the native code generator and CPU are permitted, because IL generation occurs on the developer’s box and is subject to testing.


Ultimately, that last point is a language decision rather than a CLR decision. Some IL generators, like ILASM, will rigorously emit IL in the sequence specified by the source code. Other IL generators, like Managed C++, might pursue aggressive reordering based on their own language rules and compiler optimization switches. If I had to guess, IL generators like the Microsoft compilers for C# and VB.NET would decide to respect the CLR’s memory model.


We’ve spent a lot of time thinking about what the correct memory model for the CLR should be. If I had to guess, we’re going to switch from the ECMA model to the following model; a small sketch of its practical effect follows the list. I think that we will try to persuade other CLI implementations to adopt this same model, and that we will try to change the ECMA specification to reflect this.


  1. Memory ordering only applies to locations which can be globally visible or locations that are marked volatile. Any locals that are not address exposed can be optimized without using memory ordering as a constraint, since these locations cannot be touched by multiple threads in parallel.

  2. Non-volatile loads can be reordered freely.

  3. Every store (regardless of volatile marking) is considered a release.

  4. Volatile loads are considered acquire.

  5. Device oriented software may need special programmer care. Volatile stores are still required for any access of device memory. This is typically not a concern for the managed developer.
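
Here is a short sketch of what that asymmetry means in practice, under my reading of the rules above (the names are mine): the writer needs nothing special because every store is already a release, but a reader that checks a flag still wants a volatile (acquire) load so that its later loads cannot float above the check.

    using System.Threading;

    class Shared
    {
        public int Data;
        public int Ready;   // 0 = not yet published, 1 = published
    }

    class ProposedModelExample
    {
        static Shared s = new Shared();

        static void Writer()
        {
            s.Data = 123;
            s.Ready = 1;    // rule 3: even this plain store is a release
        }

        static void Reader()
        {
            // Rule 2 would let a plain load of Ready be reordered with the load of
            // Data, so read the flag with acquire semantics (rule 4).
            if (Thread.VolatileRead(ref s.Ready) == 1)
            {
                int d = s.Data;   // cannot be hoisted above the acquire load
            }
        }
    }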



If
you’re thinking this looks an awful lot like X86, AMD64 and (presumably) IA64,
you are right.  We also think it
hits the sweet spots for compilers. 
Reordering loads is much more important for enabling optimizations than
reordering stores.


So what happens in 10 years when these architectures are gone and we’re all using futuristic Starbucks computers with an ultra-weak model? Well, hopefully I’ll be living the good life in retirement on Maui. But the CLR’s native code generators will generate whatever instructions are necessary to keep stores ordered when executing your existing programs. Obviously this will sacrifice some performance.


The trade-off between developer productivity and computer performance is really an economic one. If there’s sufficient incentive to write code to a weak memory model so it can execute efficiently on future computers, then developers will do so. At that point, we will allow them to mark their assemblies (or individual methods) to indicate that they are “weak model clean”. This will permit the native code generator to emit normal stores rather than store.release instructions. You’ll be able to achieve high performance on weak machines, but this will always be “opt in”. And we won’t build this capability until there’s a real demand for it.


I
personally believe that for mainstream computing, weak memory models will never
catch on with human developers. 
Human productivity and software reliability are more important than the
increment of performance and scaling these models provide.


Finally, I think the person asking about memory models was really interested in where he should use volatile and fences in his code. Here’s my advice:


  • Use managed locks like Monitor.Enter (C# lock / VB.NET synclock) for synchronization, except where performance really requires you to be “clever”. (There’s a plain-lock sketch right after this list.)

  • When you’re being “clever”, assume the relatively strong model I described above. Only loads are subject to re-ordering.

  • If you have more than a few places that you are using volatile, you’re probably being too clever. Consider backing off and using managed locks instead.

  • Realize that synchronization is expensive. The full fence implied by Interlocked.Increment can be many 100’s of cycles on modern hardware. That penalty may continue to grow, in relative terms.

  • Consider locality and caching effects like hot spots due to false sharing.

  • Stress test for days with the biggest MP box you can get your hands on.

  • Take everything I said with a grain of salt.
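
As a plain illustration of that first bullet (the class and names are mine), a shared hit-count table protected by an ordinary lock needs no volatile fields and no explicit fences at all:

    using System.Collections;

    class RequestStats
    {
        readonly Hashtable counts = new Hashtable();
        readonly object gate = new object();

        public void Record(string url)
        {
            lock (gate)   // Monitor.Enter/Exit under the covers
            {
                object current = counts[url];
                counts[url] = (current == null) ? 1 : (int)current + 1;
            }
        }

        public int CountFor(string url)
        {
            lock (gate)   // readers take the same lock
            {
                object current = counts[url];
                return (current == null) ? 0 : (int)current;
            }
        }
    }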

Comments (38)

  1. Chris you’re my hero 😉 I believe that specifying a stronger memory model is absolutely the right thing to do. The disconnect between the memory model subtleties and the relatively high-level worldview that the CLR otherwise provides is just too big, especially since this doesn’t even result in better performance on today’s (and probably tomorrow’s) mainstream hardware.

  2. Ian Ringrose says:

    There is one common case when I would like the full speed benefit of stores being reordered. This is when I have code that needs to be fast and ALL objects the method uses are only accessed from one thread. E.g. think about inverting a matrix: I may call some other methods from within my method, and then the CLR will not know that the array is only accessed by a single thread.

    MAYBE what I am asking for is to be able to mark an object instance as being only used by a single thread, and have the CLR give me an error if the object is used by another thread when running in debug mode.

    Remember that for most people, a single thread on a single CPU is the normal case. It is only a few low-level types like us that ever have to use more than one thread.

    Is this a new version of the C/Fortran array rules? E.g. due to the fact that arrays cannot overlap in Fortran, some maths code is a LOT faster in Fortran.

  3. Chris Brumme says:

    Ian,

    Presumably you are talking about store reordering performed by the native code generator. The IL generator can follow its language rules in reordering and the CPUs that we care about won’t reorder anyway.

    We would have to enable store reordering in the native generator at the assembly, type or method level. We don’t have a good way to enable this per-instance.

    I’m not opposed to our adding this relaxation in the future. But I’m not convinced that the performance benefits will justify the risk and extra programming effort for anyone. Finalization, subtyping, resurrection, statics and many other constructs often leave you exposed to multi-threaded access, even in the "single-threaded" server request case.

    You would have to be absolutely sure that none of your instances could escape before you would be able to safely enable this behavior.

  4. Did you really mean it when you said that hitting main memory is only 1/10th the speed of hitting disk? I find this extremely hard to believe. While I’m totally prepared to believe that the relative cost of a L0 cache hit, and a main memory hit is several orders of magnitude, I can’t believe that there is only a single order of magnitude difference between main memory and disk.

    Disk takes milliseconds to find and load data. Main memory looks pretty slow from the inside of a 3GHz CPU, but it’s not going to take milliseconds is it?

    Or is there something I’m missing?

  5. Chris Brumme says:

    Ian,

    I tried a few different ways to write that sentence, but still failed to make my point.

    Say that the ratio of an L0 hit to a main memory access on your computer is X. Then the ratio of a main memory access to a disk fetch on your computer might be about 10X.

    Developers work hard to avoid paging on virtual memory systems. Increasingly, they should also work hard to avoid cache misses that cause them to access memory. And if that memory is in a cache line that is held by a different CPU, the penalty is even worse.

  6. Phil Hochstetler says:

    Chris,

    Great info on an issue that careful programmers need. I might add that the issue on multiprocessors is more complicated than even your description may suggest. I was at Sequent Computer Systems in the 80’s and 90’s when we built large (30 proc) multiprocessor systems. The bus protocol was a "copy back" cache protocol where a particular "cache line" of memory could be owned by a cache. When an access to memory occurred, a CPU would snoop the bus and abort the memory transfer from main memory and send a copy of its cache line instead (and mark it "write shared") iff its cache owned it (this could only occur if the CPU had written to it but had not yet written it back to memory).

    You can imagine all the cache to cache traffic that can be generated by having two heavily contested locks in the same cache line! In any case, to get good performance, one had to allocate memory locks at least a cache line apart so they would appear in separate units of memory cache. Issues like this are subtle details of the underlying hardware that make it hard to create portable code that performs well across multiple implementations. Some of these hw (mis)features can be detected at runtime.

    I’m sure lots of the folks who worked on the NT HAL are aware of such issues. I know Ken Reneris (kenr@microsoft.com once upon a time), who was Mr HAL, and we talked about such issues in the mid 90’s. Some of this goes back to the old "Code Locking" versus "Data Locking" discussion, but that is another discussion entirely.

    Keep up the good work!

  7. Dmitriy Zaslavskiy says:

    Chris, what about the Interlocked.XXX functions? They come in different flavors (i.e. acquire/release semantics). Can I take your comments to mean that the CLR will always use a full barrier on all those functions?

  8. Chris Brumme says:

    I’m not sure what the official word is here. It seems to me that each of the various Interlocked operations is a combination of a load.acquire and a store.release. The combination of these two operations would effectively form a full fence. If the official word seems to be heading in a different direction, I’ll post here to warn you.

  9. Tim Sweeney says:

    .NET took a VERY conservative route with these sorts of decisions; the shared heap model is already showing signs of weakness: Getting multithreading right in C++ or C# in real-world programs is just as hard as getting static type safety right in assembly language programs of comparable line counts.

    At some point, you realize that synchronization has become the limiting factor in development and debugging, and you move to a deterministic and statically safe model, accepting a 2X or so performance hit for synchronization.

    The necessity of this move isn’t obvious with today’s single-CPU, single-threaded PCs. But in a few years, when Intel is shipping quad-core, quad-hyperthreaded CPUs and you’re trying to manage 16 simultaneous threads in a client application with complex thread communication dependencies, you are just not going to do that kind of thing in C#.

  10. Chris Brumme says:

    Tim,

    It stands to reason that affinitizing threads and even processes (like web gardens) to specific CPUs will be important in getting good scaling in the future. SQL Server uses fibers to avoid a trip through the kernel on a context switch, and these fibers are effectively affinitized to CPUs. But it’s still the case that their warmed cache will need refilling when they non-preemptively context switch to a completely different request. A pipelined server can address this for constrained applications, but we still have a lot to learn on how to efficiently execute general-purpose application code on boxes with 128+ CPUs.

    Having said that, we’re seeing scaling numbers in the low 20’s for a 32-way box, running managed C# code on a simulated business work load. This includes some XML transforms, so you can be sure the GC is kicking in.

    I think that C# is an excellent language for writing most apps, including highly concurrent ones. And I think we have some opportunities with a managed execution environment like the CLR to achieve high degrees of scaling on behalf of naively written applications. But we’ve got a very long way to go.

  11. Chris Brumme says:

    Since I wrote this, some folks at Intel have shown me a piece of code that reveals out-of-order stores on IA64. (In fact, the piece of code was a spin lock from some code at Microsoft). The out-of-order execution was very evident on a 32-way box from one manufacturer — though it never occurred on a 32-way box from a different manufacturer. Presumably the difference is in the chip set used for memory management in the caching hierarchy. Thank goodness that managed developers don’t have to be aware of this sort of thing!

  12. Hans Boehm says:

    I have two concerns with Chris Brumme’s suggestion to program to a
    stronger memory model than is required by the current language standard:

    1) I think that when you look closer, the "simplified" stronger memory model may actually complicate matters.

    As Chris stated, if two threads may concurrently access a variable,
    and one of them writes it, in almost all cases the accesses should be
    protected by a lock. If you follow this simple rule the detailed
    memory model doesn’t matter.

    Any violations of the above rule are inherently dangerous. If they are
    necessary, I think any reader of the source code deserves at least a hint
    that something subtle is going on. Purely from a legibility perspective,
    I don’t think that a volatile declaration is unreasonable if a field
    is used for thread synchronization.

    Just in case you weren’t already convinced of the danger of shared
    variables without locking, I’ll point out some further subtleties with
    Chris’ double-checked locking example, even with the stronger memory
    model. Consider the slightly completed code:

    if (a == null)
    {
        lock(obj)
        {
            if (a == null) a = new A();
        }
    }
    x = a.field;

    (Clearly we’re performing the initialization because we’re about to
    access a field. Thus I added the field reference.)

    With absolutely no ordering requirement on loads, thread 1 may first
    load a.field, and then perform the load of a to do the a == null check.
    This will fail if a second thread acquires the lock and initializes a
    in between the two operations.

    Admittedly, it seems very counterintuitive to load a.field before loading
    a. But some processors may in fact do such "data dependent" reordering.
    (Itanium doesn’t. Alpha may. Compilers might, if they have a profile
    or some other ways to guess the expected value of a pointer.)

    Based on an email discussion with Chris, and on earlier parts of the article,
    I believe Chris intends to allow reordering only if there are no
    data dependencies. But now consider the slightly modified example:

    if (!a_is_initialized)
    {
        lock(obj)
        {
            if (!a_is_initialized)
            { a = new A(); a_is_initialized = true; }
        }
    }
    x = a.field;

    This is unambiguously incorrect with Chris’ proposal, since the reads of
    a_is_initialized may be reordered with the read of a.field, and again
    the actual initialization may occur between the two in another thread.
    Furthermore, no modern processor of which I’m aware enforces ordering
    due to an intervening conditional branch. If loads may be reordered
    at all, they can generally be reordered even if there is an intervening
    branch as in this case. (Modern processors generally predict the
    outcome of the branch, and recover if they guessed wrong. If they guessed
    right execution proceeds basically as if the branch weren’t there.)
    Thus this is likely to fail on many kinds of multiprocessors, Itanium
    among them.

    Is it really easier for a programmer to understand the distinction between
    these two examples than to declare a or a_is_initialized volatile
    because they are used for thread communication?

    Consider also that it’s quite tricky to define "data dependent", as you
    would have to in Chris’ proposal. Are the loads in the first example still
    data dependent if the compiler fails to combine the two references to "a"
    into a single load instruction, e.g. if you turn off optimization?

    2) Portable Performance.

    Many newer processor designs have well-defined, but weaker, memory models, which would make it appreciably more expensive to implement the memory model that Chris proposes. It’s hard to quantify the cost, but in many
    cases a slowdown of at least a factor of two seems plausible. (This is likely to occur for two reasons: Explicit memory barriers are expensive on some processors. And such a memory model is likely to seriously inhibit compiler instruction scheduling, which is again important on some processors and not others.)

    These slowdowns are likely to impact even code that accesses no shared data, simply because the compiler will often have a hard time proving that. As far as I can tell, among the major processor families only X86 variants may get away with a relatively small penalty.

    Although it appears that such a slowdown is not necessary for current
    X86 processors, even there the official spec seems to state differently.
    (See IA-32 Intel Architecture Software Developer’s Manual, Volume 3,
    section 7.2.2: "Reads can be carried out speculatively and in any order",
    no mention of data dependence, i.e. no guarantee that the original example
    works without barriers.)

    At best this is likely to make X86 the only really viable execution platform
    for C# code. It is unclear in my mind where this leaves efforts such as Mono.

    I agree that the apparently stronger memory ordering properties of X86
    make it harder to test code against robustness with respect to a weaker
    memory model. But I believe that can be addressed with a combination
    of static tools (notably for race detection) and a debugging interpreter that
    tries to reorder aggressively. There is no real way to avoid the need
    for such tools, since even X86 allows some reordering of loads and stores,
    and not every runtime implementation will do the same reordering in the
    compiler, no matter how strong the ordering properties of the hardware.

  13. Jon Skeet says:

    I’d like to take some issue with Chris’s presentation of the double-check locking, followed by:

    [quote]
    I hope that this example has convinced you that you don’t want to try writing reliable code against the documented CLI model.
    [/quote]

    It doesn’t convince me of that at all – it convinces me that I shouldn’t try to use any "tricks" to attempt to get lock-free performance unless I’m absolutely sure that I need it, and in that case I need to think very, very hard about it.

    It doesn’t take much to convince me of that though – I’ve thought that way for a long time.

    I believe that most developers aren’t going to suffer significant performance loss due to synchronization operations themselves, compared with how much time it takes to (say) query a database, load a file, write some data to the network, etc. Even fewer developers will see much performance loss due to not being able to access single variables reliably – most of the time multi-threaded applications need effectively "transactional" behaviour – see many changes or none of them, and that’s exactly what synchronization gives, expensive as it is. I can’t see how that would be helped significantly by a stronger memory model.

    I’m not saying the current ECMA model is perfect by any means, but I don’t agree that it’s unrealistic to write to it – not for most developers, writing applications rather than device drivers etc, where every tiny, tiny bit of performance can make a significant difference.

    I realise I’m amongst rather more experienced developers in this discussion, however, so I’m quite prepared to be swayed by argument :)

    Jon Skeet

  14. Chris Brumme says:

    Hans & I have been over this a few times in private emails. We’re unlikely to persuade each other. When I look at the typical managed developer, and particularly a typical JScript or VB.NET developer, I cannot envision asking them to sprinkle ‘volatile’ through their code at the appropriate spots. This would presumably include after constructing objects but before publishing them. For general-purpose components that might be used in single-threaded applications or multi-threaded applications, this would also include any updates to potentially shared objects in the GC heap that aren’t protected by locks on the read and write paths.

    If it were necessary to sprinkle ‘volatile’ for correct execution, the default behavior must be that the system (e.g. compiler, CLR or CPU) does that sprinkling. Particularly if the developer cannot be expected or trusted to adequately test on what today is exotic hardware.

    It’s reasonable for developers to opt-in to the more stringent model that Hans prefers. Such developers are trading off productivity against performance (on some architectures) and reach to those architectures.

    This remains a controversial subject.

  15. Arch Robison says:

    I believe the problem is not the memory model, but the way it is explained. The explanations usually concentrate on the detailed reordering rules, and don’t give the programmer a simple rule of thumb on where to put "volatile".

    My article "Memory Consistency & .NET" in Dr. Dobb’s Journal, Apr2003, Vol. 28 Issue 4, p46-50 gives a simple rule of thumb. When using shared memory, you are effectively sending a message from one thread to another. The last write of the message by the sender, and the first read of the message by the receiver, must both be volatile. That simple rule covers most cases in my experience.

    The double-check idiom is simple to write in CLI – just declare "a" volatile in the example and it is fixed. This follows immediately from the rule of thumb: the first thread through the double-check region is the sender; the other threads are the receivers.

    A typical JScript or VB.NET developer should use locks, and not conjure their own synchronization primitives. In that case, the locks in CLI already imply the necessary fences.

  16. RichardH says:

    Hi All,
    I am a VB6 application developer and a beginner with C#/.NET. What I am wondering about is why we need the CLR in the first place. Is it like 90% for garbage collection only? I mean, is the CLR mainly there to give us relief from allocation & deallocation of heap memory, and to prevent ‘illegal memory’ access like reading outside array bounds, etc.?

  17. William Stacey says:

    Related to memory barriers and concurrency, can anyone say if I implemented this non-blocking queue correctly?

    /// <summary>
    /// Summary description for NonBlockingQueue.
    /// Modeled after: http://www.cs.rochester.edu/u/scott/synchronization/pseudocode/queues.html
    /// </summary>
    public class NonBlockingQueue
    {
        Queue Q;
        int count;

        public int Count
        {
            get { return count; }
        }

        internal class Node
        {
            public object value; // User object.
            public object next;  // Next Node.

            public Node(object value, Node next)
            {
                this.value = value;
                this.next = next;
            }
        }

        internal class Queue
        {
            public object Head = null; // Dequeue from head.
            public object Tail = null; // Enqueue to tail.
        }

        public NonBlockingQueue()
        {
            Q = new Queue();
            Node node = new Node(null, null);
            Q.Head = node;
            Q.Tail = node;
        }

        public void Enqueue(object value)
        {
            object node = new Node(value, null);
            object tail;
            object next;

            while (true)
            {
                tail = Q.Tail;
                next = ((Node)tail).next;

                if (tail == Q.Tail) // Does tail equal Queue Tail.
                {
                    if (next == null)
                    {
                        // Try to link node at the end of the linked list.
                        if (next == Interlocked.CompareExchange(ref ((Node)tail).next, node, next))
                            break; // enqueue is done; exit.
                    }
                    else // Tail was not pointing to null; Swing tail to next node.
                        Interlocked.CompareExchange(ref Q.Tail, next, tail);
                }
            } // End loop

            // Enqueue is done. Try to swing Tail to the inserted node.
            Interlocked.CompareExchange(ref Q.Tail, node, tail);
            Interlocked.Increment(ref count);
        }

        public object Dequeue()
        {
            object value;
            object head;
            object tail;
            object next;

            while (true)
            {
                head = Q.Head;
                tail = Q.Tail;
                next = ((Node)head).next;

                if (head == Q.Head)
                {
                    if (head == tail) // Is queue empty or Tail falling behind?
                    {
                        if (next == null) // Is queue empty?
                            return null;  // Queue is empty.
                        Interlocked.CompareExchange(ref Q.Tail, next, tail);
                    }
                    else // No need to deal with Tail.
                    {
                        // Read user value before exchange.
                        // Otherwise, another dequeue might free the next node.
                        value = ((Node)next).value;
                        if (Interlocked.CompareExchange(ref Q.Head, next, head) == head)
                        {
                            Interlocked.Decrement(ref count);
                            return value;
                        }
                    }
                }
            } // End loop.
        } // End Dequeue
    } // End Class

    — William

  18. William Stacey says:

    Does this spin version work? Why or why not? Cheers!

    public sealed class Singleton
    {
        private static int spinLock = 0;         // lock not owned.
        private static Singleton value = null;

        private Singleton() {}

        public static Singleton Value()
        {
            // Get spin lock.
            while (Interlocked.Exchange(ref spinLock, 1) != 0)
                Thread.Sleep(0);

            // Do we have any mbarrier issues?
            if (value == null)
                value = new Singleton();

            Interlocked.Exchange(ref spinLock, 0);
            return value;
        }
    }

    This would help answer a few related questions for me on how Interlocked works with mem barriers and cache, etc. TIA — William

  19. Chris Brumme says:

    Doing busy waiting on an Interlocked operation can be a problem. If multiple threads are piled up on the spin lock, they will cause a lot of bus traffic as they fight for ownership of the cache line containing the spin lock. In your case, the construction of the singleton object is fairly fast, so perhaps this isn’t an issue.

    In this particular case, there is no strong reason to delay construction of the Singleton beyond first use of the class. So you might prefer to create the Singleton inside the class constructor — in which case the CLR is responsible for synchronization.

    Theoretically you should yield the CPU inside your busy loop in case you are executing on a hyper-threaded CPU. Thread.SpinWait can be used to ensure this happens. If you call Interlocked.Exchange inside the loop as you currently do, the SpinWait call may not be necessary. I would have to check the Intel manuals to be sure. But if you perform busy waiting without stealing the bus, as I suggested in the first paragraph, then SpinWait will become important.

    Since you have a trivial constructor for Singleton, no memory barrier is required before the publishing write to ‘value’ inside the lock. But if you had a non-trivial constructor, IA64 would require a barrier in unmanaged code. That’s because assignments you performed inside the constructor might be deferred until after your publishing write. The memory barrier would prevent this.

    In managed code the IA64 memory barrier issue is more subtle. I hope that in Whidbey the memory barrier isn’t required, even for IA64, for all the reasons I discussed in this blog.

    You are asking whether your spinlock sample works or not. I tried to answer that above. Beyond the issue of whether it actually works is the issue of whether it’s a good way to solve this particular problem. It’s not how I would have coded it.
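
    For example, a minimal sketch of that class-constructor route (not the only way to write it) might look like:

    public sealed class Singleton
    {
        // The CLR runs the type's initialization exactly once and handles the
        // synchronization itself, so no hand-rolled spin lock is needed.
        public static readonly Singleton Value = new Singleton();

        private Singleton() {}
    }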

  20. In this article , Vance Morrison describes some of the issues involved in writing managed multithreaded
