Crashes in the I/O stack tend to occur in programs which do the most I/O

A customer was diagnosing repeated blue screen errors on their system. They shared a few crash dumps, and they all had a similar profile: The crash occurred in the file system filter stack as the I/O request passed through the anti-virus software.

Some of the crashes reported PROCESS_NAME: ngen.exe. "Could ngen.exe be the problem?"

As a general rule, user-mode code cannot be responsible for blue-screen failures. It's the job of the kernel to be resistant to misbehavior in user mode. Failures of the form IRQL_NOT_LESS_OR_EQUAL and PAGE_FAULT_IN_NONPAGED_AREA are typically driver bugs or faulty hardware (for example, due to overheating or overclocking).

The application that happened to be active at the time of the failure is not typically interesting in and of itself, although it can give a clue as to what part of the kernel is misbehaving. The fact that ngen appears is more an indication that ngen performs a lot of disk I/O, so if there's a problem in the I/O stack, there's a good chance that ngen was involved, simply because ngen is involved in a lot of I/O requests.

  • Bob goes to the beach very frequently.
  • Every time there is a shark attack, Bob is at the beach.
  • Conclusion: Bob causes shark attacks.

Blaming ngen for the kernel crash is like blaming Bob for the shark attacks.
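
To make the Bob argument concrete, here is a minimal simulation sketch. It assumes a hypothetical buggy driver that fails on a randomly chosen I/O request, no matter which process issued it. The process names other than ngen.exe and all of the I/O volumes are made up for illustration; they do not come from the customer's crash dumps.

```python
import random
from collections import Counter


def simulate_crash_attribution(io_rates, crashes, seed=0):
    """Record which process issued the request that was in flight at each crash.

    The simulated driver bug strikes a random request, independent of the
    issuer, so each process is "blamed" in proportion to its share of the
    total I/O traffic.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    names = list(io_rates)
    weights = [io_rates[name] for name in names]
    blamed = rng.choices(names, weights=weights, k=crashes)
    return Counter(blamed)


# Hypothetical relative I/O volumes (assumed numbers, for illustration only).
io_rates = {"ngen.exe": 70, "explorer.exe": 20, "notepad.exe": 10}

counts = simulate_crash_attribution(io_rates, crashes=1000)
for name, count in counts.most_common():
    print(f"{name}: {count} of 1000 simulated crashes")
```

In this sketch no process does anything wrong, yet the heaviest I/O user shows up in most of the simulated crash reports, simply because it spends the most time at the beach.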

Bonus chatter: Some of my colleagues came to different conclusions:

  • Conclusion: Bob should stop going to the beach.
  • Conclusion: Bob must be the shark.

Comments (29)

  1. Joshua says:

    Therefore if you found a program without admin rights that can blue screen reliably and it's not a hardware problem, contact Microsoft. You found a kernel bug.

    (All bets are off if you installed in out.sys though. That driver is an example of how to write a security bug.)

  2. Medinoc says:

    Joshua: "in out.sys" is a thing like the infamous inpout32.dll? Or inpout32.dll's no-rights-check-whatsoever driver?

  3. Joshua says:

    @Medinoc: Yeah I think that's it. I'm referring to that dll's driver.

  4. Brian says:

    @Joshua: Generally when I see a blue screen (which, other than one machine I haven't used in 4 years, is pretty rare), it's not a kernel bug, it's someone's crappy driver bug. As Raymond notes, "Failures of the form IRQL_NOT_LESS_OR_EQUAL and PAGE_FAULT_IN_NONPAGED_AREA are typically driver bugs or faulty hardware (for example, due to overheating or overclocking)." Both of those (particularly the former) are causes of most of the BSODs I've ever seen.

  5. NotThisMind says:

    That's curious, i just got two BSOD's today and you bring this up, sadly mine was KERNEL_DATA_INPAGE_ERROR, STOP: 0x0000007A

  6. Cesar says:

    In my experience, it's almost always a bad memory module. The first thing I do is to boot into memtest86+ (or equivalent) and see what it finds. Often removing one of the memory modules cures it, then you just have to buy a new one.

    But in this case,

    > as the I/O request passed through the anti-virus software

    I'd blame the anti-virus.

  7. AC says:

    To paraphrase some internet memes:

    The problem is in the hardware. Unless AV is involved, then it's always AV.

  8. Alexey says:

    That Bob! I always thought there was something fishy about him!

  9. Gabe says:

    NotThisMind: KERNEL_DATA_INPAGE_ERROR is usually the result of a bad hard drive. I had a computer that failed with that error regularly. I swapped out the HD and it hasn't failed since.

  10. NotThisMind says:

    @Gabe , yea, i'm doing the recommended steps to pinpoint this, although the computer seems to be running fine now, i want to know what's causing this, but if it continues to BSOD, it's probably the hdd yea, although i'd hope it isn't :(

  11. Katie says:

    If the beach-goers work like many users, they'll probably attempt to fix the shark problem by replacing Bob with a random Bob they find on some sketchy website. Then they'll act surprised when he doesn't interact well with old Bob's friends and family, and won't think to blame him when someone locks up their personal belongings for a Bitcoin ransom shortly after he shows up.

    [See, I told you Bob was the problem! -Raymond]

  12. Scott Brickey says:

    > overclocking (and its related article)

    came to say something about CPU identifiers being much better, and CPUZ showing "rated" clock speeds… also noticed that the desktop I'm using (which *shouldn't* be OC'ed, as far as I'm aware) is bouncing above the "rated", which I assume is attributed to Intel's speed boosting… so gave up on thinking that the situation has improved since 2005.

    But what about just tracking CPU temp? any chance that there's a standard way to query the BIOS, and at least use some sort of "safe range"? (or if nothing else, log it in WER so that you can more easily assume the cause, as you discard the report)

  13. Harold H20 says:

    Conclusion: Bob must be the shark.

    I think I saw that on an episode of CSI.

  14. David Totzke says:

    @Katie – that's awesome :)

  15. Hugh Gleaves says:

    How timely, we (yesterday) just resolved exactly this kind of issue, experienced on Server 2008 R2 machines. After some investigation we determined the cause to be Microsoft's driver for SMBv1 (mrxsmb10.sys in function MRxSmbDeferredCreate) when doing operations that entail shadow loopback.

    Once we knew this it was straightforward to take corrective action. But we are puzzled about why Microsoft never fixed the problem in SMBv1. I suspect the bug is probably simple too, given how easy it is to reproduce.

  16. JM says:

    @Hugh Gleaves: whenever I see someone state something along the lines of "this must be a simple bug, I wonder why they haven't fixed it yet" (or worse, "this should take no more than 5 minutes to fix, why haven't they done so yet") I cringe. In general, there is nothing you can say about how "obvious" a bug is to spot or fix if you have no access to the code base and every single regression test. It's great that you found a bug that admits a simple reproduction (that's a great help when debugging), but it says nothing about how simple the bug is.

    Even in the rare case where you find a bug that's obvious to spot, obvious to trip *and* obvious to fix, you still have to see if fixing it won't happen to make other things worse because they were in some subtle way depending on the existing behavior. Oh yes, you better believe that can happen — no good deed goes unpunished.

  17. Dave Bacher says:

    Things to ask Cortana tonight when I have a better connection:

    * Are you Bob?

    * Do you cause shark attacks?

    * Should I stop taking you to the beach?

    * Are you a Shark?

    * How's Clippie?

    I still cringe when I ask her for voice navigation or when she recommends places.  Following her instructions in Halo never ended well. :P

    Anyway, in addition to the issues you raise, I also see this a lot with programs that do a lot of work in the video driver (OpenCL, CUDA, DirectCompute, or just certain games).  In many cases, I see it on all three major manufacturers across multiple, non-overcooked, devices.  It could be a temperature issue, but I'm more apt to blame the GPU for not being as nice for the kernel to deal with.

  18. @JM That pretty much summed up my day :-(

  19. Ken in NH says:

    It's still worth asking if Bob is often seen carrying a bucket of chum when he frequents the beach.

  20. Henri Hein says:

    Second the kudos to Katie.

  21. Muzer says:

    I thought you removed all the Bob stories from the queue?

  22. cheong00 says:

    I can't believe no one draws the following conclusion:

    •Conclusion: Sharks must like Bob (perhaps as food).

    But then I think about it again, it's the same as these two:

    •Conclusion: Bob causes shark attacks.

    •Conclusion: Bob should stop going to the beach.

    Perhaps you can't say the conclusions are wrong. :P

  23. Scarlet Manuka says:

    @Dave Bacher: Love "overcooked". I'm going to use that instead of "overclocked" from now on!

  24. Falcon says:

    @Scarlet Manuka:

    Yeah, that's some nice wordplay, but the term does seem to imply that temperature is the main (or only) problem. You could have a cooling system that keeps the overclocked hardware around 273K and it could still misbehave at a sufficiently high clock rate.

  25. Dave says:

    >If the beach-goers work like many users, they'll probably attempt to fix the shark problem by

    >replacing Bob with a random Bob they find on some sketchy website.

    And if they're like a company that's suffered a data breach, they'll offer two years' free sharkbite monitoring to the relatives of the deceased.

    This sounds like a cue for a "how X deals with sharks" list…

  26. Dave says:

    >That's curious, i just got two BSOD's today

    The single biggest cause of BSOD's that I've found, across multiple machines, is RAMdisk software.  In fact I've had problems finding anything that doesn't randomly cause BSODs on startup, or, worse, BSOD-reboot loops.  Maybe this has some connection to Raymond's comment about BSODs through I/O, that I/O to a RAMdisk device is a great way to trigger BSODs.

  27. Dave says:

    >Things to ask Cortana tonight when I have a better connection:


    >* Are you Bob?

    "I haven't gone by that name since the final surgery five years ago.  Everyone knows me as Cortana now, even my mother".

  28. Jerome says:

    The Bob-is-a-shark logic is a fine example of the post hoc ergo propter hoc fallacy. The example on Wikipedia, still one of my favourites, is:

    The rooster crows immediately before sunrise, therefore the rooster causes the sun to rise.

  29. Sean Liming says:

    I have found that it is good to run ngen.exe to compile .NET libraries first. The challenge comes when an application touches libraries that ngen doesn't compile, which adds to the I/O and fills up disk space. Outside of the fish story, what was the answer to the problem?

Comments are closed.
