High memory utilization in dbghost.exe when using Debugdiag

I have seen lately many issues where Administrators and developers alike are claiming that Debugdiag is causing the whole server to become unresponsive.

In some situation, the debugger host (dbghost.exe) grows uncontrollably in memory consumption causing it to shutdown which will bring the target process the debugger was attached to to terminate as well. 

In other situations, the debugger host (dbghost.exe) along with the target process seem to just be consuming all CPU cycles rendering the whole server unresponsive.

I looked into these issues to try to determine the root cause of these problems. And what I have noticed was something very interesting.

The first issue in which dbghost.exe seems to be leaking memory was actually an issue with .net 2.0 debugger extension (sos.dll).  Here is a description of the problem and how to resolve it.

As you may have already known, Debugdiag relies mainly on sos.dll for .net debugging. So when you create a crash rule to catch a .net exception and run actions, dbghost.exe loads sos.dll and runs the "!dumpobj" command to determine the type of exception that was throws every time a .net exception is thrown. Below how the script for a crash rule determines the type of .net exception thrown.

    1: Function GetCLRExceptionType(ByVal ExceptionObjHexAddr, ByVal bInnerException)
    2:     Dim Output, Lines, i
    3:  
    4:     If Debugger.IsClrExtensionMissing Then
    5:         WriteToLog "Unable to determine CLR exception type - extension dll could not be loaded."
    6:     Else
    7:         Output = Debugger.Execute("!DumpObj " & ExceptionObjHexAddr)
    8:         Lines = Split(Output, Chr(10))
    9:         For i = 0 To UBound(Lines)        
   10:             If bInnerException Then
   11:                 If InStr(Lines(i), "_innerException") <> 0 Then
   12:                     Tokens = Split(Lines(i), " ")
   13:                     For j = 0 To UBound(Tokens)
   14:                         If Len(Tokens(j)) = 8 Then                                
   15:                             GetCLRExceptionType = GetCLRExceptionType(Tokens(j), False)
   16:                             Exit For
   17:                         End If
   18:                     Next
   19:                 End If
   20:             ElseIf Len(Lines(i)) >= 7 Then
   21:                 If InStr(Lines(i), "Name: ") = 1 Then
   22:                     GetCLRExceptionType = Mid(Lines(i), 7)
   23:                     Exit For
   24:                 End If
   25:             End If
   26:         Next
   27:     End If
   28: End Function

Line 7 is where the "dumpobj" command is ran against the exception object to determine the type.

There is a known issue with the "dumpobj" command that it leaks 4k every time it is ran. This leads to a situation where dbghost.exe runs out of memory and terminates.

There is an updated sos.dll that fixes a different issue but contains a fix  for this problem. (https://support.microsoft.com/kb/967812). So all what needs to be done is to replace the existing sos.dll, and reactivate the crash rule.

The second issue has to do basically with the high rate of handled exceptions thrown by the target process (1st chance exceptions. It does not matter of the exceptions are managed or native.

For instance, if you look at the below log file generated for a crash rule, you see that there is a high number of exceptions thrown per sec.

 

    1: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 2200
    2: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.Threading.ThreadAbortException'
    3: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 2200
    4: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.Web.HttpContext'
    5: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
    6: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
    7: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
    8: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
    9: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   10: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   11: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   12: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   13: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   14: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   15: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   16: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   17: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   18: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   19: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   20: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   21: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   22: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   23: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   24: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   25: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   26: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   27: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   28: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   29: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   30: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   31: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   32: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   33: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   34: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   35: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   36: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   37: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   38: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   39: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 2656
   40: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'
   41: [1/27/2010 5:48:08 AM] First chance exception - 0xe0434f4d caused by thread with system id 6780
   42: [1/27/2010 5:48:08 AM] CLR Exception Type - 'System.InvalidOperationException'

 

An average of around 22 exceptions/sec.  Let look closely at what happens when an exception is raised: 

From https://support.microsoft.com/default.aspx?scid=kb;EN-US;105675...

.... if the application is being debugged, the debugger sees all exceptions before the program does. This is the distinction between the first and second chance exception: the debugger gets the first chance to see the exception (hence the name). If the debugger allows the program execution to continue and does not handle the exception, the program will see the exception as usual. If the program does not handle the exception, the debugger gets a second chance to see the exception. In this latter case, the program normally would crash if the debugger were not present.

...

In other words, when an exception is raised, control is passed to the debugger, the debuggers acts on that exception and passes control back to the program to handle the exception. This is an expensive operation and involve a lot of context switching. 

So, in this particular example, things are worse because for every CLR exception raised, debugdiag will run the command (!dumpObj ExceptionAddress ) to determine the type  "GetCLRExceptionType(GetCLRExceptionPtr(CausingThread), False)".  If sos.dll being used is the one with the leak issue, you will notice a high CPU utilization in both the debugger (dbghost.exe) and the target process  along with a quick increase in memory utilization in dbghost.exe. This will eventually lead to an unexpected termination of the debugger and the target process.

In the case of native exceptions, the issue will manifest itself as a high CPU that may lead to an unresponsive server as a whole! To be able to debug such environment, you would need to know what are the exceptions that are thrown in a high rate and avoid them.