Hey, let’s report errors only when nothing is at stake!


Only an idiot would have parameter validation, and only an idiot would not have it. In an attempt to resolve this paradox, commenter Gabe suggested, “When running for your QA department, it should crash immediately; when running for your customer, it should silently keep going.” A similar opinion was expressed by commenter Koro and some others.

This replaces one paradox with another. Under the new regime, your program reports errors only when nothing is at stake. “Report problems when running in test mode, and ignore problems when running on live data.” Isn’t this backwards? Shouldn’t we be more sensitive to problems with live data than problems with test data? Who cares if test data gets corrupted? That’s why it’s test data. But live data—we should get really concerned when there’s a problem with live data. Allowing execution to continue means that you’re attempting to reason about a total breakdown of normal functioning.

Now, if your program is mission-critical, you probably have some recovery code that attempts to reset your data structures to a “last known good” state or which attempts to salvage what information it can, like how those space probes have a safe mode. And that’s great. But silently ignoring the condition means that your program is going to skip happily along, unaware that what it’s doing is probably taking a bad situation and subtly making it slightly worse. Eventually, things will get so bad that something catastrophic happens, and when you go to debug the catastrophic failure, you’ll have no idea how it got that way.
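
As a minimal illustration of the contrast (the helper and its names are hypothetical, not from any particular codebase), the fail-fast alternative reports the corruption the moment it is detected instead of pressing on:

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical internal consistency check. If the data structure is
    // corrupted, report it and stop immediately rather than letting the
    // program keep running and quietly make things worse.
    void CheckListIsConsistent(bool consistent, const char* where)
    {
        if (!consistent) {
            std::fprintf(stderr, "Internal error: corrupted list detected in %s\n", where);
            std::abort();   // fail fast: a crash dump here beats silent corruption later
        }
    }

    int main()
    {
        CheckListIsConsistent(true, "startup");       // passes silently
        CheckListIsConsistent(false, "after merge");  // reports the problem and aborts
    }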

Comments (37)
  1. Blake says:

    Always mount a scratch monkey.

    (And more seriously – yes, fail fast, always.)

  2. Ian says:

    I think this (yet again) illustrates the failure of the 'one size fits all' approach. An intelligent solution needs to look at what exactly is at stake, and what (if anything) could be done about the error.

    An application that presented a message like 'A fatal error has occurred. Press OK to quit. All unsaved changes will be lost.' when all you did was try to save over a read-only file would be quite annoying. On the other hand, in the kind of scenario Raymond refers to above it is obvious that blindly continuing is the last thing you want to do. Context is everything.

  3. Dan Bugglin says:

    Log files would help with the "you'll have no idea how it got that way" scenario.  Silently ignore errors for the user, but log them.

    Of course, behavior should be based on the severity of the error. When in QA, make every error act like the most severe one. When shipping, only errors that would leave the application in an unpredictable or dangerous state should result in an application crash… otherwise, just clean up what malfunctioned and try to give the user back control so they can restart the failed task.
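
    For illustration, a rough sketch of that severity-gated approach; the Severity levels, the g_runningUnderQA flag, and ReportInternalError are made-up names, not an existing API:

        #include <cstdio>
        #include <cstdlib>

        enum class Severity { Minor, Recoverable, Fatal };

        bool g_runningUnderQA = false;   // would come from configuration in a real program

        void ReportInternalError(Severity sev, const char* message)
        {
            std::fprintf(stderr, "internal error: %s\n", message);   // always log it

            // Under QA, escalate every error to the most severe behavior so it gets noticed.
            // In the shipping configuration, only truly dangerous states take the process down.
            if (g_runningUnderQA || sev == Severity::Fatal) {
                std::abort();
            }
            // Otherwise: clean up whatever malfunctioned and give control back to the user.
        }

        int main()
        {
            ReportInternalError(Severity::Minor, "thumbnail cache refresh failed");  // logged; execution continues
        }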

  4. Roger Lipscombe says:

    And why would I want the build that QA's testing to be different from the one I ship? If they're different, you're not actually testing what the customer gets. Sure, I'd want the ability to get more diagnostics out of it, but I might need that in a production environment anyway.

  5. CarlD says:

    Surely "the right thing" also varies with the nature of the program, and the ability of the user to ascertain the impact of an error.  For example, a simple bitmap editor like MSPaint has little "live data" other than a bitmap – and that entire bitmap is visible to the user.  If a program error caused, for example, a circle to be drawn when a rectangle was requested, the user will recognize this and act according to their own needs and desires.  They might try invoking the "Undo" operation – which might actually succeed.  

    I'm generally not in favor of continuing after an error without at least giving the user a heads-up that something might be amiss, but I'm willing to accept that there may be circumstances where doing so would in fact be acceptable behavior.  I suspect those circumstances are a tiny minority.

  6. Someone else says:

    "Isn't this backwards? Shouldn't we be more sensitive to problems with live data than problems with test data?"

    I don't see why you mention that here. Are you saying that assert() should not be used? Or that every kind of verification code must also be part of the released version?

  7. creaothceann says:

    This is why I leave overflow checking etc. enabled even for release builds.

  8. Danny Moules says:

    @CarlD "If a program error caused, for example, a circle to be drawn when a rectangle was requested" And then tries to put data related to a rectangle into a circle object which causes an overflow causing junk to spill into unprotected memory and possibly corrupting the machine's permanent state meaning it won't boot…

    I'm not saying your point is moot but your example is certainly flawed.

  9. bobmiddlbury1 says:

    I always test the same application I am releasing, but that's just me… :)

  10. JW says:

    @Gabe:

    You are aware that if your error can affect your document, saving said document might very well persist the corruption already incurred? Remember: users are idiots. Rather than letting your users save it over the older non-corrupt copy, I'd hope you'd save to a special crashed-document file and implement a crash-recovery mode that at least prevents users from nuking their own documents. (Word does this, for example.) If you don't, and just say 'AppX has encountered an unknown error and needs to be restarted. Please save your documents, yadayada.', users will blame you for telling them to save… or if you were to leave that sentence out, they would blame you for giving them the chance to save a corrupted file over their good one.

    If your car engine is very likely to blow up any minute now, would you want the car to just muzzle it up and play Russian roulette without your knowing? Even if you could freeze to death outside the car in a matter of hours, I am sure it would be preferable to getting blown to pieces in an instant. The former gives one a chance of survival at least.
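
    A sketch of the recovery-copy idea JW describes: write the possibly-damaged state to a separate file rather than over the user's last good copy. SerializeDocument and the ".recovered" suffix are placeholders:

        #include <fstream>
        #include <string>

        // Stand-in for real document serialization.
        std::string SerializeDocument()
        {
            return "document contents";
        }

        // Save a recovery copy next to the original instead of overwriting it,
        // so the last known-good file survives even if the in-memory data is corrupt.
        bool SaveRecoveryCopy(const std::string& originalPath)
        {
            std::ofstream out(originalPath + ".recovered", std::ios::binary | std::ios::trunc);
            if (!out) {
                return false;
            }
            out << SerializeDocument();
            return static_cast<bool>(out);
        }

        int main()
        {
            return SaveRecoveryCopy("report.doc") ? 0 : 1;   // writes report.doc.recovered
        }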

  11. Crescens2k says:

    Gabe:

    Testing an application which is different from the one the users get introduces some uncertainty. It is like how sometimes bugs occur only in release builds because the debug CRT and Windows manage to stop them from manifesting in debug builds. The same could happen with this situation. If the code written for QA happens to move the stack variables just ever so slightly, and a stack corruption bug occurs, then it could very well corrupt a different stack variable in the QA build than in the end user's build. This could completely hide the problem if it is a QA-only variable or if it is a variable which gets set to a sane value after it is corrupted.

    This is why I am for giving the QA team and the end users the exact same program.

  12. I've seen a few people now advocating that the 'debug' build of an application should, at least in most respects, be the released version as well. I'm inclined to agree, too: if you put an assert(x!=NULL) in at some point, shouldn't you be checking for x being null in the released build too? Maybe it "shouldn't" happen – but then, it shouldn't have happened in the debug version either! Obviously you strip symbols out, things like that – but I would say error checking should very much stay in place in the release version too.

  13. Miles Archer says:

    Power Plant Control Room software error strategies might need to be a little bit different from web browser error strategies, to state the obvious. (what Ian said)

  14. Gabe says:

    I should point out that I'm advocating only that something NOT crash if it doesn't HAVE to. That's why your car has a "limp home" mode. The limp home mode doesn't keep the engine working when it's about to blow up; it keeps the engine working within safe parameters when something unexpected but noncatastrophic happens. If you're driving down a narrow mountain road and your engine decides to shut down because of a bad sensor reading, you'll probably die because your power steering and power brakes stop working.

    I'm merely suggesting that programs should also have a limp home mode. In fact, VS and IE frequently pop up error messages saying that some part of the program has caused an exception. I'm always relieved when I can click the OK button and keep working!

    [The difference is that with physical objects, you can perform physical isolation so that one broken part cannot affect another. But if somebody crashed with a bad pointer, you have no idea what else got corrupted before they crashed. (The software version of physical isolation is process isolation, which is what IE uses.) -Raymond]
  15. zondrac says:

    are people still using windows? lol

  16. Gabe says:

    I would suggest that it's not actually backwards. The QA department is generally running your program for the express purpose of finding bugs, not for using the program for its intended purpose. If your program didn't crash for QA, there's no point in even having that department. On the other hand, your users are presumably using your program to get work done, not for finding bugs. Since crashing is generally the opposite of getting work done, I don't think there is much value in crashing for a user who is trying to actually *use* your program.

    Of course, there's nothing wrong with taking action (logging the exception, showing an error dialog), but once the program has gotten to the user, you can't debug it anymore. If you find yourself actually debugging on a user's system then by all means put it in QA mode where it crashes instantly, but otherwise there's no point in giving a user an error that they can't do anything about.

    Consider a car, for example. A car's engine computer detects a misfire, indicating a potential problem — who knows, maybe the engine is about to blow up. As a manufacturer, you can say that the car isn't mission critical; if it fails, a driver can use their cell phone to call for a tow truck to get the car looked at and in the meantime use a cab or rent a car. In reality, though, the car may be mission-critical to the driver. A driver could be driving along a dark country road where there's no help available, or driving in the middle of winter when it's so cold that an engine failure could mean the driver dies within hours, or driving down an evacuation route such that the stopped car causes thousands of people behind it to be unable to evacuate a natural disaster. So wouldn't you say that it's much better to log the exception, maybe turn on a "check engine" light, and keep going?

    Similarly, you may not consider your program mission-critical, but your users might. If I'm a reporter trying to get a story in, I'll be pretty upset if the word processor crashes just before deadline! I could even lose my job as a result. You may think that an error in the spell checker could mean that anything is corrupted and it's safer to just immediately shut down the whole program. I would say that I don't even care about the spell checker, so just shut that down and let me keep working. Now, that *could* mean that the error will somehow affect my document and cause untold problems, but crashing immediately will *also* affect my document and cause untold problems! If the problem could affect my data, just let me know so I can save my document and restart, at least.

    [If the car detects a misfire, then it knows that it encountered a misfire, and the engineers know the scope of damage a misfire can cause and can perform appropriate recovery. (Which might be "log an error and keep going, but if it keeps happening, turn on the Check Engine light and go into safe mode"). But what does a car do when the fuel injectors crash? -Raymond]
  17. Jim Lyon says:

    I always advise people to design for errors with steps like the following:

    1. Figure out how the user recovers if the system crashes. Implement it.

    2. Figure out how the user recovers if your process crashes. For machines intended to run a single app, crashing the system may be the right answer.

    3. Check the return code from *every* API call. Crash if someone returns an error code that you don't understand.

    4. Having done (1) through (3), it's now your job to make sure that your program understands those error returns that are likely to be encountered in real life, if you can do something more useful than crashing.

    (4) means that you'll probably code for ERROR_FILE_NOT_FOUND from an open call when the user typed a file name. It means that you almost certainly won't try to recover from an ERROR_INVALID_HANDLE return from close.
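
    For illustration, a minimal sketch of steps (3) and (4) against a real Win32 call. OpenUserFile is a hypothetical wrapper; CreateFileW, GetLastError, and the ERROR_* codes are the documented ones:

        #include <windows.h>
        #include <cstdio>
        #include <cstdlib>

        // Open a file the user asked for by name. Handle the errors we expect
        // in real life (a mistyped name); fail fast on anything we don't understand.
        HANDLE OpenUserFile(const wchar_t* path)
        {
            HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                   OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
            if (h != INVALID_HANDLE_VALUE) {
                return h;
            }

            DWORD error = GetLastError();
            if (error == ERROR_FILE_NOT_FOUND || error == ERROR_PATH_NOT_FOUND) {
                // Expected when the user typed the name: report it and carry on.
                std::fprintf(stderr, "Cannot find the requested file.\n");
                return INVALID_HANDLE_VALUE;
            }

            // An error code we don't understand: crash rather than guess.
            std::fprintf(stderr, "Unexpected error %lu opening file; aborting.\n", error);
            std::abort();
        }

        int main()
        {
            HANDLE h = OpenUserFile(L"C:\\temp\\notes.txt");   // path is just an example
            if (h != INVALID_HANDLE_VALUE) {
                CloseHandle(h);
            }
        }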

  18. Joshua says:

    Incidentally, I've managed some rather serious memory corruption in kernel mode before. It's surprising how well some things limp along despite damage to kernel file tables. I'm convinced that most of the time you can limp along on a bad pointer for quite a while. But yeah, there's always that one.

  19. Cheong says:

    Btw, in a 64-bit environment, does Windows reuse handles as frequently as before?

    Say, if there's a good chance that Windows will not reuse the handle within a day or even a month, wouldn't it make sense to allow the application to walk past the error and just prompt the user to save things (as recovery copies, like what JW said) as fast as possible and restart?

    I recall that two months ago, when my company's AD failed and could not restart, I found that since no new account records had been added for a long time, it might be safe to delete the transaction log and rebuild it, and it worked! Sometimes, if your application can give us the remains of whatever we were working on before the crash, we might be able to figure out how to recover the file, or even the application state, from before things went bad.

  20. Fritz says:

    I actually get pushback from testers when I let unexpected exceptions go and let the program crash.  Test automation is easier to write if the program doesn't crash. :)  Of course, just because I get pushback from testers does not mean I put in blanket exception handling.

  21. Crescens2k says:

    zondrac:

    And what was the point of that post? It is well known that the top three operating systems by usage are Windows at number 1, Mac OS X at number 2, and iOS (iPhone) at number 3. I suppose you are just a fanboy who doesn't like this.

    Cheong:

    How would stepping past the error help when the nature of the error is an AV because somehow the pointer to your data got corrupted? How would stepping past the error help when the error is that the data itself is corrupted? The fact is, at the point of error the program is in an inconsistent state and you just don't know what is good and what isn't. So if you do allow saving, you just can't guarantee that you can save, or that the data you are saving is good.

  22. Gabe says:

    Crescens2k: Do you have any statistical data to show whether data loss is more likely to occur by crashing or by continuing? If not, then why assume that crashing immediately is going to lead to less corruption?

  23. Christian says:

    The principle of crashing quickly once an access violation has happened, or once random address space is about to be overwritten, is fine. But it's not good if programmers apply this as a hard rule to .NET or similar environments: I find that most often these programs will continue to run fine after an exception. So every button handler or menu entry should simply wrap everything in a try/catch/log&MsgBox.

    And why would a native program need to crash for every unexpected error code from any API function? It should just show a message and continue, or abort the current function. The programmer cannot anticipate every error code, and it's nice to be able to run a program in an environment it was not meant for. For example, an old app which does stuff only admins can do, but which is not really necessary, would continue to work; or it might even work if some API is absent (think of a CD-burning program which fails somewhere because the codecs or the CD-labeling third-party malware was uninstalled).

    I would not like to see programs which force crashes upon me unless strictly necessary, but it really depends on the context.

    [You're confusing external errors (which you need to defend against) with internal logic errors ("The program should never have gotten into this state"). -Raymond]
  24. Spock says:

    I continue to be amazed at the number of people who propose the "let's keep running in an unplanned corrupted state" approach. As if somehow crashing is worse than corrupting your data, or pouring molten steel on someone's head. You can be a little more robust to unforeseen errors, but "ignore them and continue" is not one of the ways. Creating redundant processes, for example, is one such method. Now, an external factor may trigger the same bug in both processes, but you improve your reliability. Put those processes on two separate bits of hardware, and you have hardware redundancy too. Break your application up into separate processes depending on task and you improve your robustness further. Once something has gone bang in a process, though, it's time to stop. When something falls off from under your car, it's not a good idea to just keep driving hoping everything will be OK.

  25. Steve says:

    @Gabe: The second law of thermodynamics.

  26. Cheong says:

    My point (and I think some others' too) is that if resources (handles, memory regions, threads, etc.) are well compartmentalized, an unexpected state in one region of the application shouldn't require the whole application to restart. In the worst case your application should be able to just destroy that part of the application (or quarantine any suspect handles and prevent them from being reused) and, at the user's choice, restart that part so it can continue to run. Those suspect, unreliable resources can be marked "defunct" and be reclaimed at the next reboot.

    It's good to let the user know which resources are still in "known good" condition and allow them to save the reliable work if they think it is appropriate.

  27. Spock says:

    @Cheong

    There is a simple, practical way to compartmentalize your resources. The name for it in Windows is a "process". If you receive an unexpected system exception in one of these compartments, it is perfectly reasonable to destroy that component and start it up again. In fact, the Windows operating system is very kind in giving you all sorts of guarantees about what happens to those resources in this case. To try to do the same from within a process is simply foolish, especially when you are re-inventing the wheel the operating system already provides.

    As a simple example: if my process receives a memory access violation exception, is there any way I can safely continue? Do I have any way to know what parts of my heap/stack are now valid or invalid? No I do not! If the exception has occurred I must crash that process, or risk completely random behaviour. Well OK then, you might say, I will crash on memory access violations, but continue to run on others. Continue down this path and where does it logically lead? To creating a whitelist of exceptions from which you can return to a known good state. This is the only sane way to handle exceptions. Other than that, let the process crash. If you need to be robust in the face of crashes, then divide your application up into relevant processes and restart them on a crash.
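
    A rough sketch of that division of labor: a small supervisor runs a hypothetical worker.exe and restarts it whenever it exits abnormally, letting the operating system reclaim everything the dead process owned:

        #include <windows.h>
        #include <cstdio>

        int main()
        {
            for (;;) {
                wchar_t cmd[] = L"worker.exe";     // placeholder; CreateProcessW may modify this buffer
                STARTUPINFOW si = {};
                si.cb = sizeof(si);
                PROCESS_INFORMATION pi = {};

                if (!CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE, 0,
                                    nullptr, nullptr, &si, &pi)) {
                    std::fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
                    return 1;
                }

                WaitForSingleObject(pi.hProcess, INFINITE);

                DWORD exitCode = 0;
                GetExitCodeProcess(pi.hProcess, &exitCode);
                CloseHandle(pi.hThread);
                CloseHandle(pi.hProcess);

                if (exitCode == 0) {
                    return 0;        // the worker finished normally
                }
                std::fprintf(stderr, "Worker exited with code 0x%08lX; restarting it.\n", exitCode);
            }
        }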

  28. 640k says:

    @Spock: That wouldn't be a problem if Windows wasn't 1000x slower than Unix at starting processes.

  29. Gechurch says:

    @640k

    Windows takes, what – a couple of milliseconds to create a process. Even if that was, as you incorrectly state, 1000x slower than Unix, it still wouldn't matter because it is so far below the threshold that humans can perceive. I guess if your process was crashing several times per second this could conceivably be an issue, but if your process is crashing several times per second you've got a lot bigger problems than how quickly Windows can restart your process.

  30. Jim Lyon says:

    Re FailFast:

    My suggestions are based on two observations over the long term:

    1. If you die when an unexpected error occurs, the odds are much higher that the underlying bug will get fixed before you ship, thereby sparing your customers the pain of finding it.
    2. The "muddle on regardless" philosophy too often leads to systems entering the "all lights on but nobody home" state. It hasn't crashed, but isn't doing anything useful either. This usually results in a long delay before somebody reboots/restarts it, with the corresponding hit to availability.

    Remember, availability = MTBF / (MTBF + MTTR). One can argue that FailFast decreases MTBF (I don't believe it, but I understand the argument). However, in real life FailFast substantially decreases MTTR.
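
    To put illustrative numbers on that formula (made up purely to show its shape): with an MTBF of 100 hours and an MTTR of 10 hours, availability is 100/110, about 91%. If failing fast cuts MTTR to 1 hour, availability becomes 100/101, about 99%, even with MTBF unchanged; MTBF would have to fall below roughly 10 hours before the fail-fast system did any worse than the original.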

  31. jader3rd says:

    Always crash immediately (and get the proper report to the proper channels).

  32. Gabe says:

    Imagine what would happen if, whenever a mutation was found in a strand of DNA, the organism were immediately killed. You could easily argue that any particular corrupted gene could cause untold damage — it might cause an animal to grow a second head, or even make a person turn Evil! Any mutation must be stopped before the corrupt DNA gets a chance to propagate. If you need another such organism, just create a new one from scratch with fresh DNA, right?

    Well, it turns out that mutations happen all the time. There are so many bits of DNA and each one is largely inconsequential. What are the odds that a mutation will actually express itself in a noticeable way? They're actually pretty slim. So if you killed all organisms with mutations, you'd have essentially no organisms. The fact that there are so many organisms on the planet indicates that most mutations can be safely ignored.

    In the same way, your process's address space has billions of bits in it. What are the odds that a few bad ones will cause corruption of user data without killing it, or that the corruption will propagate as the process continues to live? Remember, most bugs in your program are inconsequential — if they weren't, you'd fix them. Heck, most bugs in your program probably haven't even been detected yet.

    It's no longer kosher to keep running after an AV (like accessing a null pointer) due to the risk that the corrupted bits could cause your process to turn Evil, but even so, most null pointer problems are probably harmless. I base my claim on the frequency with which unhandled exception dialogs in JavaScript and .NET apps can be dismissed without causing problems.

    [JavaScript and .NET contain frameworks which are designed to contain the damage of a bad pointer. (For example, a "pointer to freed memory" is impossible to generate in those environments. Well, okay, if you use interop you can cause arbitrary damage.) -Raymond]
  33. Cheong says:

    @Gabe: I'd have to disagree with your argument too. If a mutation to your DNA happens, most of the time your immune system will kill the mutated cells first (i.e., our body chooses a "fail fast" strategy by default).

    If a mutated cell somehow survives the attack from the immune system, you either mutate successfully, have deformed organs (not necessarily fatal), or have cancer. Just remember that the probability of a successful mutation is much, much lower than the other two.

    On the other hand, killing an abnormal process in a multiprocess application is like modern surgery – cutting the abnormal parts out to save the normal parts.

    What I'm unsure of is whether there's a way to implement thread-based resource isolation, because Windows favours multithreading more than multiprocessing.

  34. Dave says:

    This reminds me of some code I ran into years ago.  It looked kind of like this:

     Dim SQL As String = "Delete from Orders "
     SQL &= GetWhereClause()

     Private Function GetWhereClause() As String
         Dim rv As String = ""
         Try
             'lots of code here.
         Catch ex As Exception
             MsgBox(ex.ToString)   ' the exception is swallowed here...
         End Try
         Return rv                 ' ...so rv is still "" and the DELETE runs with no WHERE clause
     End Function

  35. Spock says:

    @Gabe

    For an interesting study comparing the Linux kernel with the E. coli bacterium's genome, see: http://www.physorg.com/news192128818.html. The quick take-home is that DNA is more reliable due to massive specialization (i.e., zero code re-use) and enormous numbers of iterative cycles that have weeded out the issues. The specialization leaves the organism robust to mutation (as there are no generic routines), and obviously the "billions" of users (organisms) over billions of iterations (death/reproduction) weed out the problems. Neither of these things is remotely practical from a software development viewpoint. I'm always amazed at people who think "evolution" should be copied as a software development process. We don't have millions of years to get our products right, and our customers are none too pleased to take on the role of improving "fitness".

  36. Gabe says:

    Raymond: As I said, continuing after an AV is no longer kosher like it was in Win3.x, but when you talk about "errors" or "problems" in general, things are different.

    Cheong: Your immune system will only attack the mutated cell if it can detect the problem (i.e. the mutation is such that the cell appears to not be a part of you). Your immune system doesn't do a DNA comparison; it does something much more complicated like binding to proteins. The mutation would have to affect the proteins that your immune system recognizes in order for it to attack. The cell will only turn into cancer if the mutation is something that causes the cell to divide unchecked without triggering your immune system. A deformed organ is only going to happen if the right mutation happens before the organ is formed.

    In reality, mutations happen all the time. You're constantly bombarded with radiation (cosmic rays, sunlight, the decay of isotopes like C14 and K40) that can cause DNA mutations. Most of those mutations will be in genes that have already expressed themselves, will never express themselves (most recessive traits), or will express themselves in harmless ways.

    An interesting experiment would be to write a program that flips random bits in the address space of some running app (say, a word processor), then see how long it takes before you notice. Then do it many, many times to see how often it results in a crash, how often it results in corruption of the file, etc. I'm fairly confident that you will rarely notice it.
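
    A rough sketch of that experiment, strictly for a throwaway target process; the process id is a placeholder, and the rest uses the documented Win32 memory APIs to flip one random bit in a private, read-write page of another process:

        #include <windows.h>
        #include <cstdio>
        #include <cstdlib>
        #include <ctime>
        #include <vector>

        int main()
        {
            DWORD pid = 1234;   // placeholder: the id of a process you can afford to damage
            HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ |
                                      PROCESS_VM_WRITE | PROCESS_VM_OPERATION, FALSE, pid);
            if (!proc) {
                std::fprintf(stderr, "OpenProcess failed: %lu\n", GetLastError());
                return 1;
            }

            // Collect the target's committed, private, read-write regions.
            std::vector<MEMORY_BASIC_INFORMATION> regions;
            MEMORY_BASIC_INFORMATION mbi;
            for (unsigned char* addr = nullptr;
                 VirtualQueryEx(proc, addr, &mbi, sizeof(mbi)) == sizeof(mbi);
                 addr = static_cast<unsigned char*>(mbi.BaseAddress) + mbi.RegionSize) {
                if (mbi.State == MEM_COMMIT && mbi.Type == MEM_PRIVATE &&
                    mbi.Protect == PAGE_READWRITE) {
                    regions.push_back(mbi);
                }
            }
            if (regions.empty()) {
                std::fprintf(stderr, "No suitable regions found.\n");
                return 1;
            }

            // Pick a random byte in a random region and flip one random bit in it.
            std::srand(static_cast<unsigned>(std::time(nullptr)));
            const MEMORY_BASIC_INFORMATION& r = regions[std::rand() % regions.size()];
            unsigned char* target = static_cast<unsigned char*>(r.BaseAddress) +
                                    (std::rand() % r.RegionSize);
            unsigned char value = 0;
            SIZE_T transferred = 0;
            if (ReadProcessMemory(proc, target, &value, 1, &transferred) && transferred == 1) {
                value ^= static_cast<unsigned char>(1u << (std::rand() % 8));
                WriteProcessMemory(proc, target, &value, 1, &transferred);
                std::printf("Flipped one bit at %p in process %lu.\n",
                            static_cast<void*>(target), pid);
            }
            CloseHandle(proc);
            return 0;
        }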

  37. Joe says:

    Most people here are speculating only about bad pointers / access violations. OK, it is very bad that C compilers don't support all the array-index checking, overflow checking, and value-range checking that Delphi can compile into the code. With these, many things are checked at runtime just as in managed code.

    (Side note: Because "string" is a built-in managed type in Delphi, pointers for string-handling can be completely avoided. This is a very big advantage.)

    Because all of this constant checking of every indexed access and every assignment makes the programs slower and bigger, you can disable these checks in the release build.

    This changes the point of view: during development, I get AVs very seldom. Objects are referenced by pointers, so AVs *are* possible, but most of the time the errors are index or range violations, or (because most of our programs are database-centric) database-related errors.

    All these non-AVs don't leave the program in an unknown global state. If there is some bad array index, it's really safe to let the resulting ERangeError exception propagate to the top-level window procedure, where it is automatically caught and displayed.

    (Of course, you must use try-finally constructs everywhere to free up allocated resources. Try-catch is used seldom, and the exception should be re-raised at the end of the exception-handling code.)

    Because AVs are extremely rare, and most of the data is stored in a database (which uses referential and check constraints as much as possible), the policy is to let every exception propagate to the top-level window. In most cases, the exception text is a big help in locating the bug (development) or in troubleshooting (customer support).

    So it's a fail-fast strategy, but not a crash-fast one.

    [That's a really big "of course" you stuck in parentheses there. Because it applies not only to you but also to all the libraries you consume. -Raymond]

Comments are closed.