Hi NTDebuggers, something rarely talked about is the odds of a problem being in one piece of code vs. another. From time to time we see some very strange debugs or symptoms reported by customers. The problems can be associated with an internally written application, a Microsoft product running on Windows, or an application written by a 3rd party vendor. In fact, we are often engaged to help our customers or vendors troubleshoot or debug their applications.
One of the first things we do is assess the situation. We ask questions like:
· Where is the program crashing?
· What binaries comprise the program?
· How often are those various binaries used worldwide?
Let’s use the following pseudo call stack and binaries as an example.
NTDLL!VeryCommonFunction << Crash happens in this function.
If I see a crash in NTDLL!VeryCommonFunction I’m going to make some assumptions as I assess the domain of the problem. The code that runs more than any other code is, by its nature, effectively tested more because it runs more. Therefore it is less likely to be the root cause of the fault, and in some cases it is simply the victim. This holds true for any operating system (UNIX, Mac OS, Windows...), product, or software in general: core code tends to be less buggy.
Let’s look at a real-world example of some very common code in Windows: NTDLL!RtlAllocateHeap and NTDLL!RtlFreeHeap. For those of you not familiar with NTDLL, it’s loaded in just about every process on every machine running a modern copy of Windows, worldwide. The average machine has ~40-200+ processes (applications and miscellaneous services running), and there are hundreds of millions of PCs worldwide running Windows, so that gives us ~billions of processes running NTDLL, give or take a few billion. Collectively, those processes are going to call RtlFreeHeap or RtlAllocateHeap millions of times in the next second.
So what are the odds? Is it likely that this core API used by billions of processes is crashing because of a bug in the core API? Or is it more likely that a smaller vertical market or custom application running on ~500 machines worldwide did something to destabilize one of the process heaps?
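To put rough numbers on that comparison, here is a back-of-the-envelope sketch in C. The 300 million figure is only an assumed stand-in for "hundreds of millions" of PCs, and the function and macro names are made up for illustration. The per-machine process range and the 500-machine install base are the figures quoted above.

```c
#include <assert.h>

/* Illustrative inputs only. "Hundreds of millions" of PCs is taken as
   300 million here (an assumption); the per-machine range and the
   500-machine install base are the figures quoted in the text. */
#define ASSUMED_WINDOWS_PCS  300000000ULL
#define MIN_PROCS_PER_PC     40ULL
#define MAX_PROCS_PER_PC     200ULL
#define CUSTOM_APP_MACHINES  500ULL

/* Number of processes with NTDLL loaded for a given per-machine count. */
static unsigned long long ntdll_processes(unsigned long long procs_per_pc)
{
    assert(procs_per_pc > 0);
    return ASSUMED_WINDOWS_PCS * procs_per_pc;
}

/* How many times larger the NTDLL install base is than the custom app's. */
static unsigned long long exposure_ratio(void)
{
    return ASSUMED_WINDOWS_PCS / CUSTOM_APP_MACHINES;
}
```

Under these assumptions, ntdll_processes(MIN_PROCS_PER_PC) and ntdll_processes(MAX_PROCS_PER_PC) both land in the tens of billions, and exposure_ratio() puts the core code on several hundred thousand times as many machines as the custom application. That ratio is not a probability, but it does show how lopsided the exposure is.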
Typically, when an application crashes in a heap function inside NTDLL, support engineers become suspicious of other activity in the process space, and the more likely cause is heap corruption. Code running in the host process that has NTDLL loaded has probably corrupted one of the heaps by overwriting a buffer, doing a double free, or making some similar mistake. Then, when a call is made into the Microsoft heap API, NTDLL has to traverse the heap structures that were corrupted by the host application, and the process crashes. And yes, the crash is in NTDLL. In this case, I typically ask the customer to enable full page heap via gflags (this puts an additional page marked with the PAGE_NOACCESS attribute at the end of each allocation, so an overrun faults at the moment it happens). We then wait for the next crash and analyze it. Enabling full page heap helps you catch the corruptor with “their hand in the cookie jar”.
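The "victim" idea can be sketched with a toy heap in portable C. This is a simplified model, not the NT heap: the BlockHeader layout, the magic value, and every function name here are made up for illustration. The unchecked write plays the buggy application, the heap walk plays the core code that later trips over the damage, and the guarded write is a rough analogue of what full page heap buys you.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARENA_SIZE  128
#define BLOCK_MAGIC 0xB10C

/* Hypothetical block header; the real NT heap layout is different. */
typedef struct {
    uint16_t magic;  /* marker the toy allocator expects to find */
    uint16_t size;   /* number of payload bytes after the header */
} BlockHeader;

static unsigned char arena[ARENA_SIZE];
static int arena_used = 0;

/* Empties the toy heap. */
static void toy_reset(void)
{
    memset(arena, 0, sizeof arena);
    arena_used = 0;
}

/* Bump allocator: writes a header, returns the payload offset (-1 if full). */
static int toy_alloc(uint16_t size)
{
    BlockHeader h = { BLOCK_MAGIC, size };
    assert(size > 0);
    if (arena_used + (int)sizeof h + size > ARENA_SIZE)
        return -1;
    memcpy(arena + arena_used, &h, sizeof h);
    arena_used += (int)sizeof h + size;
    return arena_used - size;
}

/* The "core" code: walks every header, as a heap API must.
   Returns the offset of the first damaged header, or -1 if all are intact. */
static int toy_heap_walk(void)
{
    int off = 0;
    while (off < arena_used) {
        BlockHeader h;
        if (off + (int)sizeof h > arena_used)
            return off;
        memcpy(&h, arena + off, sizeof h);
        if (h.magic != BLOCK_MAGIC)
            return off;  /* this is where the core code would fall over */
        off += (int)sizeof h + h.size;
    }
    return -1;
}

/* The "application" bug: copies len bytes with no regard for block size.
   Clamped to the arena so the demo itself stays well defined. */
static void unchecked_write(int payload_off, const void *src, int len)
{
    if (payload_off + len > ARENA_SIZE)
        len = ARENA_SIZE - payload_off;
    memcpy(arena + payload_off, src, (size_t)len);
}

/* Rough analogue of full page heap: the overrun is refused at the moment
   of the write, so the failure points at the writer. Returns 0 on success. */
static int guarded_write(int payload_off, const void *src, int len)
{
    BlockHeader h;
    memcpy(&h, arena + payload_off - (int)sizeof h, sizeof h);
    if (len > h.size)
        return -1;  /* caught in the act */
    memcpy(arena + payload_off, src, (size_t)len);
    return 0;
}
```

Allocate two 8-byte blocks and push 12 bytes into the first with unchecked_write. Nothing fails at that point. The damage only shows up later, when toy_heap_walk reaches the second block's header, which is the toy version of a crash inside NTDLL. With guarded_write, the same 12-byte copy is rejected on the spot and the heap stays intact. On a real system, full page heap is typically turned on with something like gflags /p /enable MyApp.exe /full, where MyApp.exe is a placeholder image name.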
The same scenario holds true for other problems that surface in core functionality, such as kernel pool corruption, invalid handles, and leaks. Again, core code tends to be rock solid because of sheer volume of use and exposure to a variety of environments. This being the case, it also tends to change less over time. Of course, there is code in the OS or other components that is not used as much and is therefore more likely to have problems. We always take that into consideration when scoping an issue.
The good news is we are always happy to dig in and help our customers isolate these types of problems.
Please feel free to chime in and share your stories.
Good luck and happy debugging.