Reliability


I’ve been putting off writing this blog, not just because I’m on vacation in Maui and have far more tempting things to do.  It’s because one of my blogs has already been used on Slashdot as evidence that Windows cannot scale and won’t support distributed transactions (http://slashdot.org/comments.pl?sid=66598&cid=6122733), despite the fact that Windows scales well and does support distributed transactions.  So I’m nervous that this new blog will be quoted somewhere as evidence that managed applications cannot be reliable.


The fact is that there are a lot of robust managed applications out there.  During V1 of .NET, we all ran a peer-to-peer application called Terrarium.  This could run as a background application and would become a screen saver during periods of inactivity.  Towards the end of the release, I was running a week at a time without incident.  The only reason for recycling the process after a week was to switch to a newer CLR so it could be tested.


Each team ran its own stress suite in its own lab on dedicated machines.  And many teams would “borrow” machines from team members at night, to get even more machine-hours of stress coverage.  These stress runs are generally more stressful than normal applications.  For example, the ASP.NET team would put machines under high load with a mix of applications in a single worker process.  They would then recycle various AppDomains in that worker process every two or three minutes.  The worker process was required to keep running indefinitely with no memory growth, no change in response time for servicing requests, and no process recycling.  This simulates what happens if you update individual applications of a web site over an extended period of time.


Another example of stress was our GC Stress runs.  We would run our normal test suites in a mode where a GC was triggered on every thread at every point that a GC would normally be tolerated.  This includes a GC at every machine instruction of JITted code.  Each GC was forced to compact so that stale references could be detected.


We also have a “threaded stress” which tries to flush out as many race conditions in the CLR as possible by throwing many threads at each operation.


Our stress efforts are getting better over time.  The dev lead for the 64-bit effort recently told me that we’ve already had more 64-bit process starts in our lab than all of the 32-bit process starts during the 5 years of building V1 and V1.1.  We’ve got a very large (and very noisy) lab that is crammed with 64-bit boxes, constantly running every test that we can throw at them.


Having said all that, CLR reliability in V1 and V1.1 still falls far short of where we would like it to be.  When these releases of the CLR encounter a serious problem, we “fail fast” and terminate the process.  Examples of serious problems include:


  1. ExecutionEngineException
  2. An Access Violation inside mscorwks.dll or mscoree.dll (except in a few specific bits of code, like the write barrier code, where AVs are converted into normal NullReferenceExceptions)
  3. A corrupt GC heap
  4. Stack overflow
  5. Out of memory

The first three of the above examples are legitimate reasons for the process to FailFast.  They represent serious bugs in the CLR, or serious bugs in (highly trusted / unsafe) portions of the frameworks and the application.  It’s probably a security risk to continue execution under these circumstances, because it’s easy to imagine cases where type safety or other invariants have been violated.


But the last two cases (stack overflow and memory exhaustion) are a different matter.  In a perfect world, i.e. some future version of our platform, these resource errors should be handled more gracefully.  This means that the CLR should be hardened to tolerate them, and the managed application should be able to trap these conditions and handle them.
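The distinction matters because memory exhaustion at an explicit allocation site is already catchable today; it’s the implicit allocation sites that make OutOfMemoryException effectively asynchronous. A minimal sketch (the allocation size is illustrative, and on most machines this request will fail):

```csharp
using System;

class OomSketch
{
    static void Main()
    {
        try
        {
            // An explicit allocation site: the developer can see this one
            // coming and wrap it in a try/catch.
            long[] huge = new long[int.MaxValue];
            GC.KeepAlive(huge);
        }
        catch (OutOfMemoryException)
        {
            // Recovery is only safe if the failed operation left no shared
            // state inconsistent.
            Console.WriteLine("allocation failed; backing off");
        }
    }
}
```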


There’s a lot of work involved in getting there.  I want to explain some of the complexity that’s involved.  But first I would like to point out why the current situation isn’t as bad as it seems.


For example, ASP.NET is able to pursue 100% reliability without worrying too much about resource errors.  That’s because ASP.NET can use AppDomain recycling and process recycling to avoid gradual “decay” of the server process.  Even if the server is leaking resources at a moderate rate, recycling will reclaim those resources.  And, if a server application is highly stack intensive, it can be run with a larger stack or it can be rewritten to use iteration rather than recursion.


And for client processes, it’s historically been the case that excessive paging occurs before actual memory exhaustion.  Since performance completely tanks when we thrash virtual memory, the user often kills the process before the CLR’s FailFast logic even kicks in.  In a sense, the human is proactively recycling the process the same way ASP.NET does.


This stopped being the case for server boxes some time ago.  It’s often the case that server boxes have enough physical memory to back the entire 2 or 3 GB of user address space in the server process.  Even when there isn’t quite this much memory, memory exhaustion often means address space exhaustion, and it occurs before any significant paging has occurred.


This is increasingly the case for client machines, too.  It’s clear that many customers are bumping up against the hard limits of 32-bit processing.


Anyway, back to stack overflow and memory exhaustion errors.  These are both resource errors, similar to the inability to open another socket, create another brush, or connect to another database.  However, on the CLR team we categorize them as “asynchronous exceptions” rather than resource errors.  The other common asynchronous exception is ThreadAbortException.  It’s clear why ThreadAbortException is asynchronous: if you abort another thread, this could cause it to throw a ThreadAbortException at any machine instruction in JITted code and at various other places inside the CLR.  In that sense, the exception occurs asynchronously to the normal execution of the thread.
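A minimal sketch of that asynchrony, using the V1-era APIs: the victim can be interrupted at essentially any instruction in its loop, yet its finally block still runs.

```csharp
using System;
using System.Threading;

class AbortSketch
{
    static void Victim()
    {
        try
        {
            while (true) { }      // the abort can arrive at any instruction here
        }
        finally
        {
            // Backout clauses are reliably executed during abort processing.
            Console.WriteLine("finally ran");
        }
    }

    static void Main()
    {
        Thread t = new Thread(new ThreadStart(Victim));
        t.Start();
        Thread.Sleep(100);        // let the victim enter its loop
        t.Abort();                // induces ThreadAbortException asynchronously
        t.Join();
        Console.WriteLine("victim aborted");
    }
}
```

(Thread.Abort was later deprecated and is unsupported on modern .NET; the sketch reflects the V1.x behavior this post describes.)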


size=2> 


size=2>(In fact, the CLR will currently induce a ThreadAbortException while you
are executing exception backout code like catch clauses and finally blocks. style="mso-spacerun: yes">  Sure, we’ll reliably execute your
backout clauses during the processing of a ThreadAbortException – but we’ll
interrupt an existing exception backout in order to induce a
ThreadAbortException.  This has been
a source of much confusion.  Please
don’t nest extra backout clauses in order to protect your code from this
behavior.  Instead, you should
assume a future version of the CLR will stop inducing aborts so
aggressively.)


Now why would we consider stack overflow and memory exhaustion to be asynchronous?  Surely they only occur when the application calls deeper into its execution or when it attempts to allocate memory?  Well, that’s true.  But unfortunately the extreme virtualization of execution that occurs with managed code works against us here.  It’s not really possible for the application to predict all the places that the stack will be grown or a memory allocation will be attempted.  Even if it were possible, those predictions would be version-brittle.  A different version of the CLR (or an independent implementation like Mono, SPOT, Rotor or the Compact Frameworks) will certainly behave differently.


Here are some examples of memory allocations that might not be obvious to a managed developer:


  • Implicit boxing occurs in some languages, causing value types to be instantiated on the heap.
  • Class constructor (.cctor) methods are executed by the CLR prior to the first use of a class.
  • In the future, JITting might occur at a finer granularity than a method.  For example, a rarely executed basic block containing a ‘throw’ clause might not be JITted until first use.
  • Although we chose to remove this from our V1 product, the CLR used to discard code and re-JIT it, even during return paths.
  • Class loading can be delayed until the first use of a class.
  • For domain-neutral code, the storage for static fields is duplicated in each AppDomain.  Some versions of the CLR have allocated this storage lazily, on first access.
  • Operations on MarshalByRefObjects must sometimes be remoted.  This requires us to allocate during marshaling and unmarshaling.  Along the same lines, casting a ComObject might cause a QueryInterface call to unmanaged code.
  • Accessing the Type of an instance, or accessing the current Thread, or accessing various other environmental state might cause us to lazily materialize that state.
  • Security checks are implicit to many operations.  These generally involve instantiation.
  • Strings are immutable.  Many “simple” string operations must allocate as a consequence.
  • Future versions of the CLR might delay allocation of portions of an instance for better cache effects.  This already happens for some state, like an instance’s Monitor lock and, in some versions and circumstances, its hash code.
  • VTables are a space-inefficient mechanism for dispatching virtual calls and interface calls.  Other popular techniques involve caching dispatch stubs which are lazily created.

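To make a few of these concrete, here is some innocent-looking C# where nearly every line can allocate; the comments note the hidden allocation (exact behavior varies by compiler and CLR version):

```csharp
using System;
using System.Collections;

class HiddenAllocations
{
    static void Main()
    {
        int i = 42;
        ArrayList list = new ArrayList();
        list.Add(i);                 // implicit boxing: the int is copied to the heap

        string s = "value = " + i;   // boxes i and builds a brand new string
        s = s.ToUpper();             // strings are immutable: another new string

        Type t = i.GetType();        // may lazily materialize reflection state

        Console.WriteLine(s + " " + t.Name);
    }
}
```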
The above is a very partial list, just to give a sense of how unpredictable this sort of thing is.  Also, any dynamic memory allocation attempt might be the one that drives the system over the edge, because the developer doesn’t know the total memory available on the target system, and because other threads and other processes are asynchronously competing for that same unknown pool.


But a developer doesn’t have to worry about other threads and processes when he’s considering stack space.  The total extent of the stack is reserved for him when the thread is created.  And he can control how much of that reservation is actually committed at that time.


It should be obvious that it’s inadequate to only reserve some address space for the stack.  If the developer doesn’t also commit the space up front, then any subsequent attempt to commit a page of stack could fail because memory is unavailable.  In fact, Windows has the unfortunate behavior that committing a page of stack can fail even if plenty of memory is available.  This happens if the swap file needs to be extended on disk.  Depending on the speed of your disk, extending the swap file can take a while.  The attempt to commit your page of stack can actually time out during this period, giving you a spurious fault.  If you want a robust application, you should always commit your stack reservations eagerly.
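For an executable, the main thread’s reserve and commit sizes live in the PE header.  One way to set them after the fact, assuming the Platform SDK’s editbin tool, is (sizes are illustrative; the arguments are reserve bytes, then commit bytes):

```shell
# Reserve 1 MB of stack and commit all of it eagerly.
editbin /STACK:1048576,1048576 app.exe
```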


So StackOverflowException is more tractable than OutOfMemoryException, in the sense that we can avoid asynchronous competition for the resource.  But stack overflows have their own special problems.  These include:


  • The difficulty of predicting how much stack is enough.

This is similar to the difficulty of predicting where and how large the memory allocations are in a managed environment.  For example, how much stack is used if a GC is triggered while you are executing managed code?  Well, if the code you are executing is “while (true) continue;”, the current version of the CLR might need several pages of your stack.  That’s because we take control of your thread (which is executing an infinite loop) by rewriting its context and vectoring it to some code that throws an exception.  I have no idea whether the Compact Frameworks would require more or less stack for this situation.


  • The difficulty of presenting an exception via SEH (Structured Exception Handling) when the thread is low on stack space.

If you are familiar with stack overflow handling in Windows, you know that there is a guard bit.  This bit is set on all reserved but uncommitted stack pages.  The application must touch stack pages a page at a time, so that these uncommitted pages can be committed in order.  (Don’t worry: the JIT ensures that we never skip a page.)  There are 3 interesting pages at the end of the stack reservation.  The very last one is always unmapped.  If you ever get that far, the process will be torn down by the operating system.  The one before that is the application’s buffer.  The application is allowed to execute all its stack-overflow backout using this reserve.  Of course, one page of reserve is inadequate for many modern scenarios.  In particular, managed code has great difficulty in restricting itself to a single page.


The page before the backout reserve page is the one on which the application generates the stack overflow exception.  So you might think that the application gets two pages in which to perform backout, and indeed this can sometimes be the case.  But the memory access that triggers the StackOverflowException might occur in the very last bits of a page.  So you can really only rely on 1 page for your backout.


A further requirement is that the guard bit must be restored on the final pages as the thread unwinds out of its handling.  However, without resorting to hijacking some return addresses on the stack, it can be difficult to guarantee that the guard bits can be restored.  Failure to restore the guard bit means that stack overflow exceptions won’t be reliably generated on subsequent recursions.


  • The inability of the CLR (and parts of the operating system!) to tolerate stack overflows at arbitrary places.

I’m told that a stack overflow exception at just the wrong place in EnterCriticalSection or LeaveCriticalSection (I forget which) will leave the critical section in a corrupt state.  Whether or not this is true, I would be amazed if the user-mode portion of the operating system is completely hardened to stack overflow conditions.  And I know for a fact that the CLR has not been.


In order for us to report all GC references, perform security checks, process exceptions and perform other operations, we need the stack to be crawlable at all times.  Unfortunately, some of the constructs we use for crawling the stack are allocated on the stack.  If we should take a stack overflow exception while erecting one of these constructs, we cannot continue managed execution.  Today, this situation drives us to our FailFast behavior.  In the future, we need to tighten up some invariants between the JIT and the execution engine to avoid this catastrophe.  Part of the solution involves adding stack probes throughout the execution engine.  This will be tedious to build and maintain.
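The idea behind a stack probe is to touch enough stack up front, at a point where failure is still recoverable, so that the critical region that follows cannot fault partway through.  A hypothetical managed sketch of the concept (the real probes live in the unmanaged execution engine; compile with /unsafe):

```csharp
class StackProbeSketch
{
    // Touch 'bytes' of stack now.  If this overflows, it does so here,
    // before the code that follows has broken any invariants.
    static unsafe void Probe(int bytes)
    {
        byte* buffer = stackalloc byte[bytes];
        // Touch one byte per 4 KB page so the pages are committed in order,
        // as the guard-page mechanism requires.
        for (int i = 0; i < bytes; i += 4096)
            buffer[i] = 0;
    }

    static void CriticalOperation()
    {
        Probe(32 * 1024);   // ensure 32 KB of stack before entering
        // ... code that must not take a stack overflow mid-operation ...
    }
}
```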


  • Unwinding issues

Managed exception handling is largely on the Windows SEH plan.  This means that filter clauses are called during the first pass, before any unwinding of the stack has occurred.  We can cheat a little here:  if there isn’t enough stack to call the managed filter safely, we can pretend that it took a nested stack overflow exception when we called it.  Then we can interpret this failure as “No, I don’t want to handle this exception.”


When the first pass completes, we know where the exception will be caught (or whether it will remain unhandled).  The finally and fault blocks and the terminating catch clause are executed during the second pass.  By the end of the second pass, we want to unwind the stack.  But should we unwind the stack aggressively, giving ourselves more and more stack for subsequent clauses to execute in?  If we are unwinding a StackOverflowException, we would like to be as aggressive as possible.  But when we are interoperating with C++ exceptions, we must delay the unwind.  That’s because the C++ exception object is allocated on the stack.  If we unwind and reuse that portion of the stack, we will corrupt the exception state.


size=2> 


size=2>(In fact, we’re reaching the edges of my understanding here. style="mso-spacerun: yes">  I style="mso-bidi-font-style: normal">think that a C++ rethrow effectively
continues the first pass of the initial exception, looking for a new handler
further up the stack.  And I style="mso-bidi-font-style: normal">think that this means the stack cannot
be unwound until we reach the end of a C++ catch clause. style="mso-spacerun: yes">  But I’m constantly surprised by the
subtleties of exception handling).


So it’s hard to predict where OutOfMemoryException, StackOverflowException and ThreadAbortException might occur.


But one day the CLR will harden itself so it can tolerate these exceptions without requiring a FailFast escape hatch.  And the CLR will do some (unspecified) fancy work with stack reserves and unwinding so that there’s enough stack available for managed code to process StackOverflowExceptions more like regular exceptions.


At that point, managed code could process asynchronous exceptions just like normal exceptions.


Where does that leave the application?  Unfortunately, it leaves it in a rather bad spot.  Consider what would happen if all application code had to be hardened against these asynchronous exceptions.  We already know that they can occur pretty much anywhere.  There’s no way that the application can pinpoint exactly where additional stack or memory might be required, across all versions and implementations of the CLI.


As part of hardening, any updates the application makes to shared state must be transacted.  Before any protecting locks are released via exception backout processing, the application must guarantee that all shared state has been returned to a consistent state.  This means that the application must guarantee it can make either forward or backward progress with respect to that state, without requiring new stack or memory resources.
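A sketch of that “forward or backward progress” discipline, using a hypothetical pair of account balances.  Note that the rollback path itself performs no allocation:

```csharp
using System;

class Accounts
{
    static object balanceLock = new object();
    static int from = 100, to = 0;

    static void Transfer(int amount)
    {
        lock (balanceLock)
        {
            from -= amount;
            try
            {
                to += amount;   // imagine an asynchronous exception landing here
            }
            catch
            {
                from += amount; // backward progress: restore the invariant
                throw;          // ...without needing new stack or memory
            }
        }
    }
}
```

Even this is only safe if no further abort is induced while the catch clause runs, which is exactly the guarantee a hardened CLR would have to provide.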


For example, any .cctor method must preserve the invariant that either the class is fully initialized when the .cctor terminates, or that an exception escapes.  Since the CLR doesn’t support restartable .cctors, any exception that escapes will indicate that the class is “off limits” in this AppDomain.  This means that any attempt to use the class will receive a TypeInitializationException.  The inner exception indicates what went wrong with initializing this class in this AppDomain.  Since this might mean that the String class is unavailable in the Default AppDomain (which would be disastrous), we’re highly motivated to add support for restartable .cctors in a future release of the CLR.
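A minimal illustration of that “off limits” behavior; the failure in the static constructor here is simulated:

```csharp
using System;

class Fragile
{
    static Fragile()
    {
        // Suppose an allocation here failed once due to memory pressure.
        throw new OutOfMemoryException();
    }
    public static int Value = 1;
}

class Demo
{
    static void Main()
    {
        for (int i = 0; i < 2; i++)
        {
            try
            {
                Console.WriteLine(Fragile.Value);
            }
            catch (TypeInitializationException e)
            {
                // Both iterations land here: the class stays off limits in
                // this AppDomain even though memory may be plentiful now.
                Console.WriteLine(e.InnerException.GetType().Name);
            }
        }
    }
}
```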


Let’s forget about managed code for a moment, because we know that the way we virtualize execution makes it very difficult to predict where stack or memory resources might be required.  Instead, imagine that you are writing this “guaranteed forward or backward progress” code in unmanaged code.  I’ve done it, and I find it is very difficult.  To do it right, you need strict coding rules.  You need static analysis tools to check for conformance to those rules.  You need a harness that performs fault injection.  You need hours of directed code reviews with your brightest peers.  And you need many machine-years of stress runs.


It’s a lot of work, which is only warranted for those few pieces of code that need to be 100% robust.  Frankly, most applications don’t justify this level of effort.  And this really isn’t the sweet spot for managed development, which targets extremely high productivity.


So today it’s not possible to write managed code and still be 100% reliable, and most developers shouldn’t even try.  But our team has a broad goal of eventually supporting all unmanaged coding techniques in managed code (with some small epsilon of performance loss).  Furthermore, the CLR and frameworks teams already have a need for writing some small chunks of reliable managed code.  For example, we are trying to build some managed abstractions that guarantee resource cleanup.  We would like to build those abstractions in managed code, so we can get automatic support for GC reporting, managed exceptions, and all that other good stuff.  We’re prepared to invest all the care and effort necessary to make this code reliable; we just need it to be possible.


So I think you’ll see us delivering on this capability in the next release or two.  If I had to guess, I think you’ll see a way of declaring that some portion of code must be reliable.  Within that portion of code, the CLR won’t induce ThreadAbortExceptions asynchronously.  As for stack overflow and memory exhaustion, the CLR will need to ensure that sufficient stack and memory resources are available to execute that portion of code.  In other words, we’ll ensure that all the code has been JITted, all the classes loaded and initialized, all the storage for static fields pre-allocated, etc.  Obviously the existence of indirections and polymorphism, like virtual calls and interface calls, makes it difficult for the CLR to deduce exactly what resources you might need.  We will need the developer to help out by indicating exactly what indirected resources must be prepared.  This will make the technique onerous to use and somewhat expensive in terms of working set.  In some ways, this is a good thing.  Only very limited sections of managed code should be hardened in this manner.  And most of the hardened code will be in the frameworks, where it belongs.


Here are my recommendations:


  • Application code is responsible for dealing with synchronous application exceptions.  It’s everyone’s responsibility to deal with a FileNotFoundException when opening a file on disk.

  • Application code is not responsible for dealing with asynchronous exceptions.  There’s no way we can make all our code bullet-proof when exceptions can be triggered at every machine instruction, in such a highly virtualized execution environment.

  • Perhaps 0.01% of code will be especially hardened using not-yet-available techniques.  This code will be used to guarantee that all resources are cleaned up during an AppDomain unload, or to ensure that pending asynchronous operations never experience races when unloads occur.  We’re talking tricky “systems” operations.
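The first recommendation in code form: a synchronous, anticipatable exception deserves a handler right at the call site (the file name here is illustrative):

```csharp
using System;
using System.IO;

class Config
{
    static string Load(string path)
    {
        try
        {
            using (StreamReader r = new StreamReader(path))
            {
                return r.ReadToEnd();
            }
        }
        catch (FileNotFoundException)
        {
            // A synchronous exception the caller can anticipate and handle.
            return "";   // fall back to defaults
        }
    }

    static void Main()
    {
        Console.WriteLine(Load("settings.cfg").Length);
    }
}
```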

Well, if the application isn’t responsible for dealing with asynchronous exceptions, who is?


That’s the job of the process host.  In the case of ASP.NET, it’s the piece of code that decides to recycle the process when memory utilization hits some threshold.  In the case of SQL Server, it’s the piece of code that decides whether to abort a transaction, unload an AppDomain, or even suspend all managed activity.  In the case of a random console application, it’s the [default] policy that might retain the FailFast behavior that you’ve seen in V1 and V1.1 of the CLR.  And if Office ever builds a managed version, it’s the piece of code I would expect to see saving edits and unloading documents when memory is low or exhausted.  (Don’t read anything into that last example.  I have no idea when/if there will be a managed Excel.)  In other words, the process host knows how important the process is and what pieces can be discarded to achieve a consistent application state.  In the case of ASP.NET, the only reason to keep the process running is to avoid a pause while we spin up a new process.  In the case of SQL Server, the process is vitally important.  Those guys are chasing 5 9’s of availability.  In the case of a random console application, there’s a ton of state in the application that will be lost if the process goes down.  But there probably isn’t a unit of execution that we can discard to get back to a consistent application state.  If the process is corrupt, it must be discarded.


In V1 and V1.1, it’s quite difficult for a process host to specify an appropriate reaction or set of reactions when an asynchronous exception occurs.  This will get much easier in our next release.


As usual, I don’t want to get into any specifics on exactly what we’re going to ship in the future.  But I do hope that I’ve painted a picture where the next release’s features will make sense as part of a larger strategy for dealing with reliability.

Comments (28)

  1. Dmitriy Zaslavskiy says:

    Welcome back.

  2. Chris Brumme says:

    I still have a couple of weeks vacation left.

  3. Ricky Datta says:

    Chris,

    Do not worry about "Anonymous Coward" – his knowledge is only skin deep just like most of his fellow "sysadmin" types who hang out at slashdot – a mostly anti-ms bunch.

    Keep up the good work.

    Ricky

  4. Yosi Taguri says:

    hi chris,
    never stop writing here, there will always be stupid cowards that will twist what you say for their own profit/interests. you could post a disclaimer at the top of the blog and still write all those great things…..

    your blog is one of the most insightful one that there is outside of MS.

  5. Mike Dimmick says:

    Chris,

    Thanks for the material, it’s very interesting, if not always obviously and immediately useful 😉

    In response to what Anonymous Coward wrote on Slashdot, could you explain why you consider that Windows threads and processes are heavier weight than Unix processes and (kernel) threads, and which features of Windows wouldn’t be possible without the (presumably) extra features that the Windows thread and process model provides?

    Keep up the good work (both with the CLR and this blog)!

  6. Ian Ringrose says:

    Running out of memory and paging is not good. Often I have a cache in an application that can be freed if need be. A more powerful version of the “WeakReference” class would help.

    E.g. I wish to be able to say:
    “Do not collect this object in a generation 0 collection.”
    “Only collect this object if the system is getting low on RAM”
    “On any given garbage collection cycle, collect this object with a probability of ..”

    It would also be useful to have an event on the WeakReference object that is called when its target is collected.

    A weak hash table would also be useful; however, given the above I could write my own.

  7. Ian Ringrose says:

    On a related note.

    Is there any chance of getting weak delegates? They would make a lot of code that uses the observer pattern more robust.

  8. Gwyn Cole says:

    Chris, very interesting and like Mike says, this is "immediately useful". A couple of questions… 1) is there anything special that goes on behind the scenes in Debug builds? and 2) What advice do you have in resolving ExecutionEngineException’s?

  9. Sudhakar says:

    Chris,

    Please don’t stop writing this blog…Its a daily page for so many .net guys. Pleeeeaz.

  10. MH says:

    Don’t know much about UNIX, but at a guess:

    • Linux threads usually have a smaller stack?
    • Linux threads under NPTL are entirely kernel switched iirc (no userland element anymore)
    • Context switches on Linux are much cheaper than on Windows. I’m not sure why this is the case.
    • Processes take longer to start on Windows, hence Chris’ assertion that a well designed Windows app will create and destroy processes at a low rate.

    To see that problem in action, try running a GNU style configure script in Cygwin (ie a very large script that involves lots of programs starting and stopping constantly) – on Windows it takes several times longer to complete than on Linux.

    It’s worth noting that Linux threading is currently experiencing major changes, which have broken a few applications (for instance RealPlayer). NPTL has been designed with extreme scalability in mind – using it, it’s possible to start and stop 100,000 threads in about 2 seconds (however that requires a gig of RAM), compared to the 15 minutes it would have taken previously. That test was conducted on a PII/450Mhz Xeon.

    IIRC before the introduction of NPTL, the Wine (windows emulation) project found it was not possible to correctly emulate Win32 threading upon pthreads, so they did the reverse and implemented Win32 threads on top of kernel threads, then added a simplistic pthread emulation on top of their Win32 thread system.

    Since the introduction of NPTL, which first happened in the Red Hat 9 distribution a few months ago, Wine was forced to reimplement Win32 threads on top of NPTL threads, so presumably they are now somewhat closer in terms of feature parity. For instance inter-process synchronization primitives are now available.

    I don’t know enough about the two systems to adequately say what features Win32 threads have that Linux pthreads do not.

    However, bear in mind that there is no analog to Win32 message passing on Linux. IPC takes place either through custom protocols on domain sockets, CORBA, or potentially in the future DBUS which is a light weight message passing/IPC system (but not implemented in kernel for obvious reasons). I have no idea what the overheads of the message system might be.

    From memory, Windows lets you do things like create threads in remote address spaces. I don’t think you can do that on Linux, though I’m having a hard time thinking of reasons you might want to.

  11. "May the force be with you, use the force Chris"…

    "Anonymous Coward" doesn’t know about the internals of Windows, kernel debugging and the CLR…

    This blog rocks!

  12. Chris Brumme says:

    Ian,

    Weak hashtable and weak delegates are popular and rational requests. Cache management and eventing have specific challenges in a GC world. We will make it easier to manage a cache in the next release of the CLR. But your post is a good reminder that we’ll still be lacking some constructs.

  13. A customer recently asked a good question about some of our new reliability features in Whidbey:

    There…