Resilience is NOT necessarily a good thing


I just ran into this post by Eric Brechner who is the director of Microsoft’s Engineering Excellence center.

What really caught my eye was his opening paragraph:

I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it’s better to crash and let Watson report the error than it is to catch the exception and try to correct it.

Wow.  I’m not going to mince words: What a profoundly stupid assertion to make.  Of course it’s better to crash and let the OS handle the exception than to try to continue after an exception.

 

I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

I have absolutely no problems with fail fast (which is what Eric suggests with his “Restart” option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes unresponsive.  This concept also shows up in the service restart options in the ChangeServiceConfig2 API (if a service crashes, the OS will restart it if you’ve configured the OS to restart it).
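
As a rough sketch of what opting in looks like (the command-line switch and flag choices here are purely illustrative, not something the post prescribes):

    #include <windows.h>   // RegisterApplicationRestart (winbase.h, Vista and later)

    // Minimal sketch: ask the OS to restart this process if it crashes or hangs.
    // The "/restarted" switch is a hypothetical flag the new instance would check
    // on startup to know it was relaunched by the Restart Manager.
    void OptInToAutomaticRestart()
    {
        HRESULT hr = RegisterApplicationRestart(
            L"/restarted",                          // command line for the relaunched instance
            RESTART_NO_PATCH | RESTART_NO_REBOOT);  // skip restart after patching or during reboot
        if (FAILED(hr))
        {
            // Pre-Vista Windows or bad arguments: the app simply won't be auto-restarted.
        }
    }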

I also agree with Eric’s comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that there’s someone who is going to actually look at the log and can understand the contents of the log; otherwise the logs just consume disk space).

 

But I simply can’t wrap my head around the idea that it’s ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any thoughts that you could continue to run after an exception has been thrown should have been removed forever.

The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous – you literally have no idea what’s going on in your program.  Your best bet is to let the OS exception handler dump core and hopefully your customers will submit those crash dumps to you so you can post-mortem debug the problem.  Any other attempt at continuing is a recipe for disaster.

 

——-

[1] To be clear: I’m not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it’s ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don’t know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a “Tree Corrupt” exception, you really shouldn’t continue to run, but if opening a file throws a “file not found” exception, it’s likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.
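
To make the footnote concrete, here’s a minimal C++ sketch of the distinction (the function and exception types are illustrative): catch the one failure you fully understand and can recover from, and let everything else propagate and kill the process.

    #include <fstream>
    #include <iterator>
    #include <stdexcept>
    #include <string>

    // "File not found" is just an error delivered through the exception mechanism,
    // so handling it is fine.  An exception signaling corrupted internal state
    // (the "Tree Corrupt" case) is deliberately NOT caught here.
    std::string LoadConfigOrDefault(const char* path)
    {
        try
        {
            std::ifstream file(path);
            if (!file)
                throw std::runtime_error("config file not found");   // understood, recoverable
            return std::string(std::istreambuf_iterator<char>(file),
                               std::istreambuf_iterator<char>());
        }
        catch (const std::runtime_error&)
        {
            return "{}";   // fall back to defaults; the root cause is fully understood
        }
        // Anything else propagates out and takes the process down, which is the point.
    }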

 

Edit: Cleaned up wording in the footnote.

Comments (66)

  1. Anonymous says:

    I like the principle: "You should handle an exception only if you know what to do with it."

  2. Doug: Works for me, but only for C++ exceptions (and RPC exceptions, which are essentially the same as C++ exceptions except they’re propagated by SEH).

  3. JamesNT says:

    Larry,

    I think I may be a little unclear so I ask for your help.

    In your example of the binary search tree, if it throws a tree corrupt exception, what would be wrong with wiping out the tree, making a new tree, and starting over?

    Also, I assume that other exceptions are not what you are talking about – such as an exception thrown because the Access database you are trying to connect to doesn’t exist, in which case you tell the user to either enter a new path or give them the option to close the program gracefully.

    Thank you for your assistance in helping me understand.

    P.S.

    I like the idea of "handle an exception if and only if you know what to do with it."  But I would extend that to languages such as those of .Net and Java.  Then again, you may have managed environments in a whole new category.  

    JamesNT

  4. Anonymous says:

    Maybe I’m not reading you right, but are you saying that if someone powers down Google’s datacenter and my C#-implemented browser gets some sort of TimeoutException from the TCP stack, the correct thing is for my browser to crash?  

  5. JamesNT: That might be ok, IF you can guarantee that the only cause of the tree corruption failure is that the tree’s internal state is corrupt.

    But if the tree corruption error is thrown because of something else (I don’t know, maybe it was because of an error in an underlying heap manager that was rethrown as a tree corruption error), you can’t.

    And that’s exactly my point.  When you encounter an exception you don’t FULLY understand, you can make NO assumptions about the state of the process.  And the only safe action to take at that point is to die and let the OS restart you if possible.

    The "access database you are trying to connect to doesn’t exist" scenario is analogous to my "file not found" example – in that case, the exception really isn’t "exceptional", it’s just a mechanism used by the database library to communicate an error and you handle it just like you handle any other error.

  6. Anonymous says:

    Reliability is a complicated thing.  There’s a tradeoff between availability and integrity,  and that tradeoff becomes more severe as a system becomes larger and more distributed.  UNIX tends to choose availability over integrity,  and Windows does the opposite.

    You’re more likely to find some funny characters at the end of a file on a UNIX system after a crash,  and more likely to have a Windows machine give up the ghost or let a badly written application lock up your desktop for a few minutes.

    Life-critical systems can’t shut down just because something unexpected happened.  Neither can large scale web sites or e-commerce systems.  There’s a whole art of system recovery,  partitioning of corruption,  and having the system stay in a ‘sane’ state that isn’t necessarily correct.

    People have different expectations for desktop apps:  people expect to have them crash and lose their work.  That’s one of the reasons why the world is giving up on desktop apps.

  7. John: You’re confusing exceptions and errors (it’s really easy to confuse the two).

    Exceptions are supposed to be used to handle <i>exceptional</i> events (like corrupted internal state).  They’re not the same as errors (which are used to express "normal" failures).  

    The only kind of exception handling that is unilaterally bad is structured exception handling (except in VERY limited circumstances like handling RPC failures and kernel mode probes of user mode addresses).  

    See my footnote: C++ and C# and Java exceptions <i>might</i> be ok IF you can guarantee you know the reason for the failure.

    I’m not aware of any networking stacks that use SEH to represent network failures.

  8. Anonymous says:

    It’s not that hard: if it’s a known exception that can be handled, handle it.  If it’s an unknown/unexpected exception, crash and report.

  9. Anonymous says:

    From that article, I didn’t get the idea that continuing from exceptions was considered a good practice. I only got the idea that using Watson alone to handle crashes is not sufficient.

    MSN

  10. Anonymous says:

    It depends on your application domain.  In my desktop, userland world, crashing is a wonderful option.

    In my brother’s medical device world, crashing means a kid stops breathing.  The FDA kinda insists that software in such devices fails in a safe way.  Crashing isn’t a safe way if it’s providing life support to the user.  Restarting the process may or may not be depending on the situation.

    Aside from that, I agree that assertions only belong in debug builds.  But if you did have something assertion-like in a release build, then it should be treated just as critically as an exception.  If your assertion failed, then your program is in an unknown, illegal, or improper state–just as it would be if an exception is thrown.  Even if the assertion itself is the bug, others who wrote the code that follows may be counting on the assumption it represents.  Report and bail out.

  11. Tanveer Badar says:

    I agree with the statement "You should handle an exception only if you know what to do with it.".

    Somehow, it seems to me that the more you try, the harder you fall. If you intend to make a system more reliable, the few failures that do occur will be even bigger headaches.

  12. JamesNT says:

    I do believe I see where Larry is coming from now.  Exceptions for things such as missing files, incorrect database passwords, and things of that nature you can handle yourself since either you know the answer, can give the user a chance to answer what needs to be done (i.e. enter the correct password or path), or can allow the program to exit gracefully.

    But for those exceptions where you don’t have the slightest idea as to what could have happened, don’t try to continue since that is analogous to ignoring there is a problem.  Let the program die in flames, then open up a formal investigation to see what happened.

    Programs that attempt to continue after an unknown exception actually sound dubious when you think about it.

    JamesNT

  13. Anonymous says:

    I think Eric just picked a really bad quote to start off his article with.  His article doesn’t really advocate "catching the exception and try to correct it" as the quote may suggest.  The closest thing to it that was advocated was "retry"ing an operation, and the examples he described have nothing to do w/ catching an exception and correcting it.

    The overall point of the article is really to make error recovery less disruptive to the user experience, and that applications need to be written with that in mind.

    This kinda reminds me of the MobileSafari browser on the iPhone and iPod Touch.  There’s been several times where it clearly crashed, but what happens is that the OS simply closes the browser without telling the user anything.  It builds up the crash dumps silently on the device and those get sent to Apple when you sync the device thru iTunes (of course they don’t call them crash dumps, but something like "customer data to improve the software").  I honestly don’t think this business of "hiding" the fact that it crashed is really that much of an improvement, but I can see users getting fooled into thinking that things are working better than they actually do, and well, user perception is king.  (As an anecdote, this doesn’t always work anyway; once my iPod Touch actually wound up in a hard freeze that required a full power off/power on to reset the device.)

  14. Anonymous says:

    Hang on….under Windows I thought structured exceptions were the basic exception type, and that C++ exceptions were implemented as structured exceptions.

    But you’re saying that C++ exceptions are not implemented as SEs?

  15. Anonymous says:

    Exceptions should be used for their original intended purpose: exceptional circumstances. All those C++ exceptions for "errors" like file not found are just unnecessary complexity; they should be replaced by error codes.

    When an application catches an exception, it should save the work in progress as much as possible, and then exit and let error reporting take over.

  16. Karellen, on some platforms C++ exceptions are implemented with SEH.  But that’s one of those "ok" scenarios (as long as you never catch the SEH exception).

  17. Anonymous says:

    I spent a decade or so working on a large server application (one that you might remember).  It was written in C, meaning that C++ exceptions were not available, and used SEH to handle rare but survivable events: out of memory, database update errors, etc.  We very carefully filtered exceptions so that we only caught exceptions that we had raised.  System raised exceptions (e.g., bad pointer deref) we very carefully would not handle.

    By your argument above what we did was "profoundly stupid" because we dared to catch an SEH exception.  That doesn’t seem right.

    I think you’re conflating SEH exceptions with system exceptions, and C++ exceptions with app ones, but that isn’t always the case.

    I prefer the "don’t catch what you don’t understand" rule above, with the proviso that most times you don’t understand as much as you think.  If an app is adhering to the very strict rule "only catch what you raised yourself", why is it "stupid" to do so using SEH and perfectly OK using C++ exceptions?  Is it still stupid when SEH is the only exception mechanism available?

  18. Anonymous says:

    While I 100% agree with the main topic of your article, the footnote is somewhat erroneous. I assume that you are familiar with C++ exception safety guarantees theory. If C++ code provides at least the basic exception safety guarantee (and if it doesn’t you shouldn’t be using it, just like you don’t use code with known buffer overflows!) then catching any C++ exception is always safe. Whether it is wise to continue as if nothing had happened is another matter, but you will never get to the broken-invariants state that is possible with plain SEH. In C# and Java, where exceptions are used for both SEH-like and C++-like purposes, you are right: only <i>some</i> exceptions are safe.

  19. Don, did you see my footnote?  You did exactly what NTFS did (and what RPC does, and what the C++ compiler does).

    My reading of Eric’s post is that he’s saying you catch everything and attempt to muddle on; that’s what I’m saying is profoundly stupid.  If your exception filter only catches a very limited set of exceptions, you MIGHT be ok (in other words, it only catches the 2 or 3 exceptions that you know you throw).
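
    For what it’s worth, that "only catch the 2 or 3 exceptions you know you throw" discipline usually looks something like this sketch (the exception code and names here are made up for illustration):

    #include <windows.h>

    // Hypothetical application-defined exception code: setting bit 29 marks it
    // "customer defined", so it can never collide with a system code such as
    // STATUS_ACCESS_VIOLATION.
    #define MYAPP_EXCEPTION_OUT_OF_RESOURCES ((DWORD)0xE0001001)

    // Filter that handles ONLY exceptions this application raised itself;
    // everything else (AVs, stack overflows, ...) continues the search and crashes.
    static int FilterOwnExceptions(DWORD code)
    {
        return (code == MYAPP_EXCEPTION_OUT_OF_RESOURCES)
            ? EXCEPTION_EXECUTE_HANDLER
            : EXCEPTION_CONTINUE_SEARCH;
    }

    BOOL DoServerOperation()
    {
        __try
        {
            // ... work that may RaiseException(MYAPP_EXCEPTION_OUT_OF_RESOURCES, 0, 0, NULL) ...
            return TRUE;
        }
        __except (FilterOwnExceptions(GetExceptionCode()))
        {
            return FALSE;   // known, survivable condition that we raised ourselves
        }
    }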

    The experience of the Windows division in the 15 years since that large server application (on which we both worked) is that SEH as an error propagation mechanism is fraught with peril.  I’m not saying you shouldn’t do it, but you need to be REALLY careful.  Next time you’re on campus, I’ll introduce you to KK (the guy who owns the top level exception handler in Windows) and he can tell you what he thinks of code that does what your large server component did.

    I still remember an issue on that same project where a particular piece of code worked perfectly EXCEPT it reliably failed when run on PPC machines – it turns out that the problem was that the RPC exception filter was trapping unaligned access errors and turning them into RPC_E_CALL_FAILED errors.  It wasn’t until I stepped into the code that I figured it out (again taking hours to chase down what should have been a 30 second bug fix).

  20. Anonymous says:

    I definitely agree that catching and eating something like an unexpected access violation is a really bad idea, and that the best course of action is to eventually crash out. What I have to take specific exception to, is the idea that the app should just defer to the OS dump mechanism. This is useless to those of us who can’t get WinQual accounts and is fairly inefficient if you need to get information that isn’t covered by a normal minidump. It’s often very valuable to log application-level information and to present an enhanced explanation to the user, and for typical user-level applications I don’t think this unreasonably impacts security. The WER dialog is pretty much useless to the user, whereas in an app-customized report I can often give some indication as to what might have triggered the crash and how to avoid it.

    What I would love is the ability to have the OS auto-launch a second process whenever it sees a crash in my app’s process, with the app frozen so the second app can analyze it safely. Unfortunately, unless I’m mistaken, the main choices only seem to be in-process handling (SEH or WER callback), or outright termination in case of severe failures.

  21. Phaeron: I’ve forwarded your suggestion to the dev lead for the Watson team, it’s an interesting idea.

  22. Anonymous says:

    I think you’ve misread Eric’s post and set up a bit of a strawman here. His point is that instead of crashing out to the OS exception handling, you should inform the user that an error has occurred and (at the limit, his 5th R) offer to get the user a new device.

    So fair enough; he has some fairly odd ideas about how to handle errors (perhaps open a browser window at dell.com to buy a new PC?), but his point is that you should let the user know an error has occurred; log it; return the user to a known state; and continue from there, rather than just crashing with a useless error dialog that most users will have no idea how to respond to. (Well, they do; just click "Don’t Send")

    You’ve based your post on the idea that to "catch the exception and try to correct it" means to continue from an unknown state; which is not what he said.

  23. Anonymous says:

    Phaeron: Isn’t that what happens if you install WinDbg as the system post mortem debugger instead of Watson, or have I misunderstood?

  24. Phaeron: The Watson lead indicated to me that there is no cost to setting up a Winqual account, so he was confused about why a developer (or team of developers) wouldn’t be able to get one.  Is it the effort of getting a code signing cert?

    Steve: My point is that if you "let the user know an error has occurred", you’re running code while your process is in an unknown state (the code to let the user know and log the error).  Let the OS let the user know an error has occurred and restart your app.

    The Watson team has literally spent years working on figuring out how to reliably and safely dump core from a corrupted process, it’s an extraordinarily hard problem that should not be re-invented.

  25. Anonymous says:

    IIRC, one valid use of catching and continuing SEH exceptions was when using VirtualAlloc to *reserve* a huge array and then only *commit* a page or so at a time.
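
    That reserve/commit pattern looks roughly like the following sketch (sizes and names are illustrative); the filter commits the faulting page and resumes the faulting instruction, which is one of the rare legitimate uses of continuing after an access violation:

    #include <windows.h>

    static char*        g_base;                       // base of the reserved region
    static const SIZE_T kReserveSize = 1 << 28;       // 256 MB reserved up front

    // Commit the page that was touched, then retry the faulting instruction.
    static int CommitFaultingPage(EXCEPTION_POINTERS* ep)
    {
        if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
            return EXCEPTION_CONTINUE_SEARCH;

        char* addr = (char*)ep->ExceptionRecord->ExceptionInformation[1];
        if (addr < g_base || addr >= g_base + kReserveSize)
            return EXCEPTION_CONTINUE_SEARCH;          // not our region: let it crash

        if (!VirtualAlloc(addr, 1, MEM_COMMIT, PAGE_READWRITE))
            return EXCEPTION_CONTINUE_SEARCH;          // commit failed: genuinely fatal
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    int main()
    {
        g_base = (char*)VirtualAlloc(NULL, kReserveSize, MEM_RESERVE, PAGE_NOACCESS);
        __try
        {
            g_base[12345] = 42;                        // faults once, page gets committed, retried
        }
        __except (CommitFaultingPage(GetExceptionInformation()))
        {
            // Only reached if the commit itself failed.
        }
        return 0;
    }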

  26. Anonymous says:

    Upon re-reading Eric’s article, I think Larry has misread his intent.  Eric is talking about error handling in environments that must be resilient.  Many of his examples are of non-exceptional errors.

    And other than his first R (Retry to get past a transient condition), the rest of his advice (Restart, Reboot, Reimage, Replace) is about getting the process and system back into a valid state.

    I don’t think the points of view you two are arguing are that different.  I think he starts out provocatively, and that’s what a lot of people are reacting to.  He’s not saying you can trust the state.  He’s saying you have to keep your service up, so find a way to get back to a known good state.

  27. Anonymous says:

    "let the OS exception handler dump core"

    Or have your own exception handler dump core.

  28. MSDN Archive says:

    Larry, your linked post ("Structured Exception Handling Considered Harmful") contains a description of what you consider (and I agree) a proper use of exception handling in the NT file system: each function cleaning up properly by understanding and restoring its state.

    I fundamentally believe it’s conceivable for that proper use of exception handling to move up the stack into application layers.  Like any programming tool, SEH can be used well or abused.

    What’s fundamentally lacking is the art of "failure modeling": for each possible failure, map out what can be done and what cannot.  It is absolutely true that there are fates worse than death (where death==crash) including data loss and security vulnerabilities.  Proper failure modeling — which is also state modeling — will help you figure out which case is which.

    As for debuggability/diagnosability, Eric was pretty clear that *everything* needs to be logged.  I’d never catch an exception without logging it and providing a feedback pipe back to development.  

    "Only catch what you can understand."  Also good advice.  But we should understand more.  Again, feedback loops and failure modeling help us do that.

  29. Anonymous says:

    >I also agree with Eric’s comment that asserts that

    >cause crashes have no business living in production code,

    Uh, I don’t agree with that one.

    When I code an assertion, I am saying "I, the programmer, believe that the stated condition is absolutely always true, and if it’s not true, then the subsequent code isn’t going to work, because my design constraints have been violated".

    Which is to say, if the assertion fails, I’ve lost control over the code, I don’t know what it’s doing, and I want it to stop running now before I make things worse.

    In other words, production code is where I want my assertions in line, because otherwise I’m going to damage production data.

    Naturally (1) I need to have extensive test coverage to validate that those assertions never ever fail, (2) I need the discipline to ensure I don’t write code that gets an assertion failure over something outside my control, like receiving a malformed network message, (3) I understand the difference between ‘my code is broken’ and ‘the user did something wrong’.
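
    A minimal sketch of what such a ship-build assertion might look like (the macro name, logging, and exit code are illustrative); the point is that it stops the process immediately instead of limping on with violated design constraints:

    #include <windows.h>
    #include <stdio.h>

    // Illustrative release-build assertion: log the violated condition, then
    // terminate at once.  TerminateProcess runs no unwinding or exception
    // handling in this process, so no further damage is done to production data.
    #define SHIP_ASSERT(cond)                                               \
        do {                                                                \
            if (!(cond)) {                                                  \
                fprintf(stderr, "Assertion failed: %s (%s:%d)\n",           \
                        #cond, __FILE__, __LINE__);                         \
                TerminateProcess(GetCurrentProcess(), 0xDEAD);              \
            }                                                               \
        } while (0)

    void TransferFunds(int fromBalance, int amount)
    {
        // Design constraint: callers must already have validated the amount.
        SHIP_ASSERT(amount >= 0 && amount <= fromBalance);
        // ... proceed, relying on the invariant ...
    }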

  30. Anonymous says:

    Catch an exception, continue running, but only to save the current document – and then exit. For that to work you need, of course, to keep the document in a consistent state in memory…

  31. Tony: That’s the rub.  How do you know that the document is in a consistent state in memory?

    It’s not an easy challenge.  That’s why I’m continuing to state that crashing is the right thing to do.

  32. Anonymous says:

    Tom M:

    Not quite… the idea is that I want a customized application to act as the "debugger." Actually, what I probably would do is just have the SAME application launched a second time, and have it use the regular Win32 debug APIs to analyze the crashed instance and recover data. It’s mostly the same as an in-process exception handler, but the process separation would make it much more robust (although more difficult to write).

    Larry:

    Yes, the certificate is the blocking point. I’ve never been able to find very good information on what exactly is required for OCA access for individuals like me. The WinQual site itself offers little information, and I’ve seen conflicting information in various places. Cost is one issue, although it looks like a single $99 certificate is sufficient. The other big issue is that everything I’ve seen regarding WinQual and the Class 3 Verisign certificate required to sign up for it only refers to companies — it looks like individual developers not associated with an official business entity aren’t eligible. All of the OCA literature also refers to companies, which doesn’t encourage me to spend money in an experiment.

    The uncertainty as to whether I could participate in OCA wouldn’t bother me except that there seems to be a recent trend toward blocking non-OCA diagnosis methods, unintentional or not. What really pissed me off was when I found out that the Visual C++ library team stuck code in the 8.0 CRT that explicitly tears off any existing exception handler and calls Watson directly. I think that unless the Windows and WinQual teams ensure that small ISVs can participate and provide clear directions for doing so, it isn’t appropriate to assume that everyone can use Watson + WER + OCA.

    Don’t get me wrong, I’d love to get OCA reports for my application, even if the ones that fell through that path didn’t have all of the diagnostic information that my app’s normal exception handler dumps. I take all crash reports seriously. There’s just too much ambiguity and uncertainty involved in getting set up. I haven’t found any report from a Microsoft employee along the lines of, "yes, we’ve successfully had individuals not associated with a business sign up to WinQual for crash reports with just certificate X."

    On a side note, I just realized… regarding your comment about asserts that cause crashes in release code: doesn’t the NT kernel do exactly that?

  33. MSDN Archive says:

    >[Larry]: How do you know that the document is in a consistent state in memory?  It’s not an easy challenge.  That’s why I’m continuing to state that crashing is the right thing to do.

    I read that as "it’s hard, therefore we shouldn’t try".  But I know that’s not what you mean, because you have some great examples of exception handling done right.

    I think many of the commenters on both blogs (Eric’s and yours) are looking at the problem too coarsely; too black and white.  How do you know that the document is in a consistent state in memory?  By designing in consistency checks.  Little extra validation routines that take advantage of a little extra redundancy built into your document’s memory representation.

    This isn’t really novel work.  It’s just work that hasn’t traditionally been a priority at Microsoft outside of specialized teams.  But I believe, and I think this is Eric’s point as well, that Trustworthy Computing includes Highly Available Software and that we need to tackle the hard challenges associated with that.

  34. Tanveer Badar says:

    One question.

    When the JIT translates callvirt into x86 assembly, it turns it into

        mov eax, [ecx]
        call whatever

    The mov will raise an exception if ‘this’ is null, because the processor will attempt to read from address 0 into eax. However, this exception is caught and eventually turned into a NullReferenceException.

    But [ecx] could equally reside outside committed memory, and that would be turned into a NullReferenceException too.

  35. Anonymous says:

    Badar: I think that it operates on the assumption that in managed code all pointers to objects must be valid or null, because you have no way to generate a pointer to an object which points into some random memory.

  36. Anonymous says:

    How does Watson decide what memory to include in the dump sent to Winqual? Sometimes I needed to check a structure whose reference was passed as a parameter a little higher up the call stack. Unfortunately, most of the time that memory was not included in the dump.

  37. Alan, I’m willing to concede that for some exceptions it may be possible to catch them and terminate after saving state.  But for many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process.  We have far too many examples of security vulnerabilities caused by people trying to be resilient in the face of an access violation error for that kind of practice to be considered safe.

    Jiri, by default Watson generates a minidump, which consists of the stack and thread context for each thread in the process.  You may add additional data to the dump with WerRegisterMemoryBlock.  See here for more details: http://msdn2.microsoft.com/en-us/library/ms678713(VS.85).aspx
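
    For Jiri’s scenario, the usage is roughly this sketch (the structure and names are hypothetical): register the memory you care about ahead of time, and Watson will include it when it writes the dump.

    #include <windows.h>
    #include <werapi.h>    // WerRegisterMemoryBlock, Vista and later

    // Hypothetical per-request state we'd want to see in crash dumps.
    struct RequestState
    {
        DWORD lastRequestId;
        char  peerName[64];
    };

    RequestState g_request;

    void RegisterStateForWatson()
    {
        // Ask WER to include this block in any dump it writes for the process.
        HRESULT hr = WerRegisterMemoryBlock(&g_request, (DWORD)sizeof(g_request));
        if (FAILED(hr))
        {
            // Non-fatal: the dump just won't contain the extra data.
        }
    }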

    Phaeron: As far as I know, the NT kernel doesn’t have many asserts that are live.  Some of them (page fault at raised IRQL) are sort-of asserts, but they exist because (a) in all circumstances, a page fault at IRQL2 is a bug, and more importantly (b) there is no way to satisfy the page fault.

  38. Anonymous says:

    > You may add additional data to the dump with WerRegisterMemoryBlock.

    What I would like is equivalent of MiniDumpWithIndirectlyReferencedMemory. While that might be doable with the WerRegisterMemoryBlock and manual stack walk, I do not think that attempting to do that from the exception handler or crashing process is a good idea.

  39. Anonymous says:

    Phaeron:

    The main reason the NT Kernel bluescreens (i.e. ‘asserts’) is to protect the disk data and metadata from corruption.

    I disagree with Larry’s point that assertions should not be in released code.  I just think we don’t do it for performance reasons.  

  40. MSDN Archive says:

    >[Larry]: But for the many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process.

    I think that statement also is too absolute (too black and white).  Even Dave LeBlanc points out (in a post inked above) that it’s not about risk avoidance, it’s about risk management — shades of grey.

    But you yourself point out the perfect counterexample: catching access violations while probing parameters across an untrusted->trusted boundary.  In other words, there are patterns and practices where even catching access violations, in a controlled way, can increase both resiliency and security.

  41. Alan, here’s the problem.  Let’s say you have all sorts of internal consistency checks and you KNOW that your data structures are likely to be intact.  So you install a top level exception handler wrapped around all your code that saves your state and exits.

    How do you know that the exception handler wasn’t called because some attacker exploited a flaw in a validation check in your code (see http://www.matasano.com/log/1032/this-new-vulnerability-dowds-inhuman-flash-exploit/ for the classic example of this), enabling him to exploit an error in your exception handler?

    By their very nature, exception handlers are less tested than the rest of your code, and they’re vastly more dangerous.  As I mentioned above, the Watson team has literally spent years refining the built-in exception handler code (which does nothing but dump core) to reduce the likelihood of vulnerabilities in the code.  You’re proposing that not only should the exception handler be application specific, but also that it invoke the save file handler and potentially put up UI.  That means that you have even MORE code that is being run when the application is in an unknown state.

    I think a far better solution is to do what Office does: it installs an application restart handler and checkpoints its state periodically.  When it recovers from a crash, it looks for one of the checkpoint files and attempts to recover it.  You should also add a bunch of validation checks to the recovery process because you don’t know the full state of the file (I don’t know if Office does this).  It means that you get the resiliency you desire WITHOUT the threats associated with running code after an access violation.
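
    A rough sketch of that checkpoint-and-recover-on-restart shape (the file name, command-line flag, and validation steps are all illustrative, not how Office actually does it); note that no recovery code ever runs inside the crashed process:

    #include <windows.h>
    #include <wchar.h>

    static const wchar_t kCheckpointPath[] = L"app.checkpoint";   // illustrative

    // Called from normal, healthy code paths (e.g. on a timer) - never from a crash handler.
    void SaveCheckpoint(const void* doc, DWORD size)
    {
        HANDLE h = CreateFileW(kCheckpointPath, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return;
        DWORD written = 0;
        WriteFile(h, doc, size, &written, NULL);
        CloseHandle(h);
    }

    // Run at startup of the NEW instance when the restart flag is present.
    // The checkpoint may predate the crash by minutes, so treat it as untrusted
    // input and validate (magic numbers, checksums, invariants) before loading it.
    bool TryRecoverCheckpoint()
    {
        HANDLE h = CreateFileW(kCheckpointPath, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return false;
        // ... read and validate, then rebuild the document from it ...
        CloseHandle(h);
        return true;
    }

    int wmain(int argc, wchar_t** argv)
    {
        RegisterApplicationRestart(L"/recover", 0);   // OS relaunches us with this flag after a crash
        if (argc > 1 && wcscmp(argv[1], L"/recover") == 0)
            TryRecoverCheckpoint();
        // ... normal run; call SaveCheckpoint periodically ...
        return 0;
    }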

    IE has a similar solution (and they’re improving it for IE8).  In IE8, if an IExplore instance crashes, it doesn’t affect the IE hosting application, which will restart the instance in-frame.

  42. Anonymous says:

    > How do you know that the reason that the exception handler was called was because some attacker has exploited a validation check in your code that has enabled him to exploit an error in your exception handler?

    > I think a far better solution is to do what Office does: it installs an application restart handler and checkpoints it’s state periodically. […] It means that you get the resiliancy you desire WITHOUT the threats associated with running code after an access violation.

    I don’t think this is a good justification by itself for pushing the recovery process out of process. If the vulnerability that was exploited was due to malicious data, and the exceptional path is less reliable due to being exercised less, then I’d say the vulnerability could hit the recovery code just as well as the main code. Given that save routines are often agnostic to the data being serialized, the autosave routine may just push the corrupted data to disk without itself crashing.

    Pushing the recovery handler out of process does greatly reduce the risk of recursive crashes due to general process badness, but I’d say it’s weak security-wise unless there’s something fundamentally different than the original process, such as it runs with extremely limited process privileges IE7+Vista style, or it’s written in a different language such that the original vulnerability is impossible. I’m not a fan of .NET for mainstream desktop applications, but I could see using it for the recovery app.

  43. MSDN Archive says:

    >[Larry]: …you install a top level exception handler …

    Let me make a clarification.  I think many readers are assuming (and a lot of reaction is coming from this) that Eric and I are talking about recovery from top-level (global unhandled) exception handlers.  Speaking for myself, I am not.  The only thing I’ve ever done in a global unhandled exception handler (in a managed-code hosted service) is log and die.

    Since state management is really the hard problem we’re discussing, the scope of any exception handling I’m proposing must be limited and controlled such that state is manageable.  

    (Yes, I read the Flash story.  Truly amazing.)

  44. Alan, ah, that makes a lot more sense to me – I use locally scoped exception handlers a lot (I live in an error-code based world where exceptions are evil).  As I’ve said before, in certain limited scenarios (kernel/user parameter probes, RPC error handling, etc) locally scoped handlers can be quite useful.  

    I HAD assumed (based on the recovery behaviors that Eric was proposing) that you and he were promoting the idea of global top level exception handlers.

  45. Anonymous says:

    "At best, it takes an easily debuggable problem into one that takes hours of debugging to resolve."

    If a crash happens on an end user’s PC, then it is extremely likely NOT EASILY DEBUGGABLE. Most end users are not developers. So, where is the benefit?

    Programs should try to recover as much as they safely can – but not more. And they should tell – in simple words – what went wrong.

    Did you notice that Vista does not just abort a copy operation because of a full drive, but actually allows you to free up space and retry and finish the rest of the operation? (Handy for that memory stick that’s always full.)

    "The bottom line is that when an exception is thrown, your program is in an unknown state."

    Agreed. But what about all the _known_ states that a programmer could, but did not handle?

    I can actually do without a functioning spell checker if all I want to do is print a page of a document. But I can not do with an app dying or even a BSOD, only because some data file – not necessary for the task at hand – was not found.

    If I understood Eric’s article correctly, he focused on the user experience. Not the developer experience. 🙂

  46. Anonymous says:

    @HagenP:

    Microsoft typically can debug problems that happen on end-user PCs if the markers are clear enough (i.e. the crash happens close enough to the cause of the corruption) through the Watson facility.  From the perspective of a developer trying to diagnose crashes in the field, having something crash early is good.

    And by the time the product ships, there should be no known states that the programmer coulda, wouda, shoulda covered, but didn’t.  If the feature in question is not of that quality level, then it should be pulled out or the release delayed.  At least this is the case in the commercial world.

  47. Anonymous says:

    @nksingh:

    "And by the time the product ships, there should be no known states that the programmer coulda, wouda, shoulda covered, but didn’t."

    Of course we must apply the statement to all software involved, including OS, drivers, etc.

    If all these are covered, all known states handled, then the only thing LEFT to cause a crash is faulty hardware, correct?

    "If the feature in question is not of that quality level, then it should be pulled out or the release delayed.  At least this is the case in the commercial world."

    I totally agree with you here. The key word here being "should".

    Unfortunately, modern software is so complex that with this precondition nothing could be shipped anymore.

    Including Operating Systems.

  48. Anonymous says:

    Tell me about it.  I’ve just recently started working with an internal project maintained by another group of developers.  All of their public APIs are wrapped in code like this:

    __declspec(dllexport) BOOL SomeFunction(…)
    {
        BOOL bResult = FALSE;
        __try
        {
            __try
            {
                // do stuff
            }
            __finally
            {
                // clean up any resources
            }
        }
        __except(EXCEPTION_EXECUTE_HANDLER)
        {
            // log some generic exception message
        }
        return bResult;
    }

    And they validate all of their parameters with IsValidWritePtr().  Needless to say I hate my life.

  49. Anonymous says:

    Hey Larry… long time no comment on your blog 😉

    I think there’s two different classes of application here.

    Most apps, you really do want to have them crash as soon and as hard as possible.

    Other apps, you want to recover. Eg. any kind of compiler. It’s much better to get as much data on the errors in the dataset you’re handing to it, than to have it abort on the first one.

    I’ve seen some horrible apps recently (naming no names – and not MS ones either) which choose to explicitly crash whenever they hit issues… which makes debugging the bad data one hell of a chore.

  50. Wow Simon, long time no hear :).

  51. Anonymous says:

    I think that, like many other issues in writing commercial software, this one is compounded by conflicting interests from the business side and the software side:

    Business side wants software that never crashes, ever. Creates a better customer experience.

    Software side wants software that is easier to debug, since it will be more maintainable and improve faster.

    However, speaking from a purely theoretical standpoint, is it not possible to have an exception handler that is located in read-only memory, allocates its own memory for storing variables to avoid corruption, and that examines the data included with the exception and the entire program state to determine exactly what is inconsistent about it, and possibly restore it to a consistent state? Is this something the Watson team has considered? Or would it be deemed a potential security vulnerability?

  52. Anonymous says:

    Hi programmers,

    if you’ll let a user chime in with a comment:

    Larry, one sentence caught my attention (about Office resiliency). I’ve been using and providing kinda technical support (as the most computer-literate person in a large organisation in my country) for 10 years and all versions of Office in between, and never succeeded in (auto)recovering any usable data after an application crash. Or were you talking about application resiliency? But that’s not what we users care for…

    Bye

  53. Anonymous says:

    I understand the appeal to authority that Larry is using ("trust us, we’ve been bitten so many times that we know what NOT to do") and in fact I agree with what he means to say.

    But it’s a horrible argument to make. You have to present the case for benefit to the end user, because the psychology you are fighting against is a conscious developer (or developer management) decision to ease a pain that the user faces by taking a shortcut that will be in almost all cases seen only by a developer as a bad thing.

    And Larry, you’re too smart to ignore that when that non-showstopper bug sneaks in near release, you are going to get a suboptimal solution to the problem. In addition to this, tracking down the specific exceptions that will happen requires some rigor in testing approach and collection of failure data. Only since perhaps 2002 has Microsoft even approached the point where this would make it past triage.

    I’m just saying that a flippant response to people acting in good faith (and decent knowledge of the consequence) is distasteful and reflects badly on you.  It may feel pedagogical for you to challenge the assumptions underlying these practices, but you’re not backing it up with a real strong argument addressing the root cause.

    Boiling this down to security in Windows OS code or assumptions about invariants is just bolstering the real reaction to your talk – which is "well, yeah. But that security issue in the top-level exception handler is way more important to you poor saps at Microsoft who have deliberately ignored security issues for years. I’ll make mine more robust in the 99% case and worry about global consequences once I’ve made my billions, thanks very much. That’s what you did."  Your appeal to authority falls hard on the mea culpas that have been given multiple times by your authority, even if no one doubts your current sincerity or ability. Your "OK maybe back in Windows 3.1 it was OK" comment seems to really show your own evolution and coming of age more than any useful guidepost for others (IE made horrible decisions even leading into Win2k, well after 3.1, and I know you know about them).

    But it’s still fun to see you openly sniping at Engineering Excellence publicly.

  54. Jay: I’m not on the Watson team, so I can’t explain the problems.  But the Watson folks have been trying to solve the problem (generating reliable crash dumps from corrupted processes) for a long time and it’s still not perfect.  It’s a very hard problem, and thus not one to be undertaken lightly.  

    Friday: Interesting, I’ve never had a problem with Word’s autorecovery.  Go figure.

    Triangle: You can’t make any assumptions when an exception happened.  You might be able to write out the minidump, but how would you "fix" the state of memory?

  55. Anonymous says:

    ‘You can’t make any assumptions when an exception happened’

    Are you sure of this? For example, would it be wrong to assume that code is stored in read-only memory, and that though the state of data may be inconsistent, the code will not be? Or that there must be addresses that are hard-coded into the code that reference static or thread-local data? Can I not assume that, if one knows about the internal structure of the program, it would be possible to inspect data values in it and determine whether or not they lie inside the range of valid values for that program? How can you make such a sweeping generalization when there is so much read-only state in a program?

  56. Triangle: On platforms with DEP (or NX or W^X) enabled, you might (not always, but sometimes – it depends on what was happening at the time of the crash).  The problem is that Windows runs on platforms where DEP is not supported, and DEP is optional – not every process running on the machine has NX enabled.

    I’m not an area expert, but I have talked to the area expert, and he’s pretty adamant that you can’t trust the state of the process.

  57. Anonymous says:

    Triangle: I suspect one of the problems is that it isn’t just your exception handling code that has to work.  If your handler is going to actually do anything, it has to call API functions, and (it is my understanding that) the API libraries also keep data in the process memory space.

    If the system heap is corrupt, or the handle table, or whatever, most API functions aren’t going to be safe to call, and it is likely to be difficult to work out which ones are.

    Microsoft might be able to address this by creating a separate limited API for this specific purpose, but probably the only way to get very much done safely would be to launch a new process to examine the crashed one, as Phaeron suggested.

  58. Harry, it’s my understanding that the watson guys solved it by writing an app (werfault) that generates the dump information and processes it.  Essentially the same idea but much more reliable.

  59. Anonymous says:

    Larry, I did read your footnote, and it said "For some C++ and C# exceptions, it’s ok to catch the exception and continue", but that "For structured exceptions, I know of NO circumstance under which it is appropriate to continue running."

    Given that we were writing in C, the C++ and C# exception handling mechanisms were not available.  If we wanted to use exceptions at all, SEH was our only option.  You know as well as I that "it will all be fine if you re-write your entire app in {Lisp|Prolog|C#|today’s new language}" is rarely, if ever, a feasible alternative.

    We were as careful as possible (given the bugs in the SEH mechanism at the time!) to catch only those exceptions that we had raised, and what we did still doesn’t seem "profoundly stupid".

    You could argue that C++ & C# exceptions are safer than SEH, for the standard reasons that separate address spaces are better, in that they avoid accidental collisions between different bodies of code.

    I think it’s better, though, to argue that you should only handle your own exceptions.  C++ and C# might make that automatic, but when they’re unavailable it’s still possible to use SEH properly.  That’s not what you argued, though.

    Don

    P.S.  In the 14 years I worked on that code, the Windows guys never once complained to me about exception handling, although possibly only because they were too busy complaining about how I used heaps.

  60. Don, 14 years ago, you probably made the right decision.  And if your exception filter is carefully constructed, it’s possible to do it right.  

    It’s also possible to ride a motorcycle without a helmet or to walk a tightrope from one 100 story building to another or to strap a jet engine on your back and rollerskate.

    But I wouldn’t advise any of the above unless you were REALLY careful and knew exactly what you were doing.

  61. Anonymous says:

    Larry, at the UK Vista launch last year, one of the presentations was about the Vista application recovery "feature".

    The guy demonstrating this explained that this allowed an application that has *crashed* to have a chance to be called back to save data. I must admit my jaw dropped – we were being told that it was a good idea to grovel around our data structures *after* the application has entered an unknown state – and then write things to disk.

    What are your feelings about the app recovery feature? Surely this must be just as bad as doing things after you have an unknown exception.

  62. Julian, I mentioned the Vista application recovery feature above – the restart manager restarts the application that’s crashed with a command line parameter you set.  It does nothing to the crashed process.

  63. Anonymous says:

    Larry, from MSDN, for RegisterApplicationRecoveryCallback:

    "If the application encounters an unhandled exception or becomes unresponsive, Windows Error Reporting (WER) calls the specified recovery callback."

    To me, that looks as if the *crashed* application is called.  I understand that *another* instance can be created automatically by the manager later.

    My problem (and yours, if I understand your article) is with the idea that it can ever be a good idea to try to continue when an *unknown* error has occurred.

    Restarting the app is ok, the application recovery callback is the stupid idea.

  64. Igor Levicki says:

    So if I write a kernel mode driver to access the CPU MSR registers it is OK to BSOD the OS because someone using the driver has attempted to access the MSR register which doesn’t exist on a certain CPU?

    Bear in mind that new MSRs are added as new CPUs are made, and driver cannot be responsible for validating the input. The only thing driver _can_ do is to catch the exception and return error just like it does at present:

    __try {
        __writemsr(reg, value);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        return GetExceptionCode();
    }

    I also prefer the early crash, but sometimes it is not necessary.

    As for the Restart Manager, it is a nice idea but useless in its current incarnation. Once the application crashes, the context has been lost. If there was no (auto)save and thus the work has been lost, then the user can restart the application on his own; there is no benefit in offering to restart automatically.

    What would be useful, however, is if the Restart Manager could periodically take snapshots of the registered and running application and restore the snapshot on a crash.

  65. Anonymous says:

    The dilemma:

    When an unexpected exception is caught at a high-level layer of the application, what should we do?

    Approach 1. Add an entry in the event log, send a Watson report and then crash the application because we don’t know and we can’t predict the state of the application accurately. Therefore it is unsafe, and continuing may result in "incorrect" and "unexpected results".

    Approach 2.  Add an entry in the event log, send a Watson report and continue the process. This way the application is more available and reliable, and if the user or admin sees a problem they can terminate the application and restart it.

    What’s common in approaches 1 and 2:

    1. log + send Watson  

    2. Predict the nature of the error, i.e. is it fatal or not fatal? Is it safe to continue?

    3. Use approach 1 or 2 based on some heuristic.

    Some comments for Approach 1

    Taking approach 1 "blindly" is fatal for server-based applications. Server apps need availability and reliability. Suppose there is a bug in the server code, i.e. a client can crash the server by sending a bad request which was not expected by the server.  Due to a coding bug we might get a NullRef; should we crash the process if it is caught at a higher-level exception handler layer?

    So in this context, submitting the error report and continuing the process may seem like the right decision.  Because it is a coding bug, crashing the process means a DoS attack. In addition, if you crash the process you can also put other ongoing requests in an unstable state.

    However, I can come up with a different example which may portray approach 1 as good, because with context it is easy; without context it is a guessing game.

    In reality there is no right answer. You have to compromise and predict. You have the following information about the exception:

    1. Type of Exception

    2. Where it occurred.

    Here is my heuristic list.  Your list may vary depending on the nature of the application.

    1. If there is any known fatal exception such as an OutOfMemory exception, then CRASH the server. Never mess with OOM, even though some people who work with .NET and COM a lot may have a different opinion, because COM sometimes returns OOM (E_OUTOFMEMORY) but the process is not OOM.

    2. NullReferenceException, ArgumentException, CastException, or ArrayOutOfBoundException are usually considered less fatal exceptions and MAY be due to a coding bug. We should not crash the server. Statement 2 depends on the statefulness of the application: the more state you have, the more likely it is that the exception is fatal. Corruption of in-memory state can result in unpredictable results, which is even worse than incorrect results.

    3. Give exceptions more weight based on the location (exception.TargetSite.DeclaringType) where they occurred. Keep a list of all high-profile classes (which have state and synchronization logic), and crash the process if an unexpected exception occurs there.

    I am sure someone can come up with a better heuristic list.

  66. Anonymous says:

    Monday, May 12, 2008 10:23 AM by LarryOsterman

    "Friday: Interesting, I’ve never had a problem with Word’s autorecovery.  Go figure."

    Interesting.  I guess that means it works in English language versions of Word.  I’ve never seen it work though, just like Friday.
