Debugging ISV Application Crashing Dynamics

Patrick Roth - Click for blog homepageI recently had a case that I thought I'd share with you.

A well known ISV opened an incident with Developer Support to help solve an issue at their customer site.

This ISV has an application that encrypts certain sensitive data in the custom application in the ISV's custom table in their own Dexterity based window.

The complaint given: After opening the custom window and making the application perform the decrypted data, when the window closes and the Dynamics navigation shell has focus - Dynamics GP crashes with the "Dynamics Runtime has encountered an error and needs to close.." dialog.

After discussing the issue with partner & customer, I was able to discover a few facts:

  1. The customer has been running this application for years as a strict Dexterity based application.  However around Christmas this new version was installed.  Immediately several machines started crashing - apparently randomly.  With vacations and the time required pinpointing exactly the sequence of events to cause this issue, it took until now to determine what app what causing this. Unfortunately only 2 machines worked OK, all the rest ran into this issue.  Dynamics 10.0 SP4 was installed already so there wasn't a Dynamics version change muddying the waters.
     
  2. The OS's used were across the board.  Some WinXP SP2, WinXP SP3, Vista, and one Citrix box.  There wasn't a pattern between working/non-working machines from an OS perspective.
     
  3. After some experimentation, it turns out that it wasn't "focus to the navigation shell" that causes the crash.  If the user selects the menus, the toolbars, or a list (like the Salesperson list) then Dynamics does not crash.  It is only when a web page such as the Home Page or an area page is displayed when Dynamics crashes.
     
  4. The next thought was to check the version of IE loaded on a good and bad machine.  Both were on version 8.0.6001.18702 so that didn't seem to be the issue.
     
  5. After discussing their application with the ISV, he revealed that Dexterity is making a COM interop call to a .NET assembly.  That assembly is then making a .NET Remoting call to SQL Server.  The ISV was "looking into" getting me source for their application.
     
  6. A DEXSQL.LOG and Script.log were captured after the ISV window was closed but before the navigation shell regained focus.  Both were empty meaning that the navigation pane wasn't hitting any SQL tables nor were any Dexterity scripts running when it got focus.   And we noted that this wasn't a timing issue of some sort since the crash didn't happen when the ISV form closed - only when the IE page gained focus.  If instead you had another window such as Customer Maintenance open and used that window Dynamics wouldn't crash.  So it wouldn't seem that the ISV app would to be the culprit.  Except that this only happens after the .NET call.  If the window is just opened and closed without invoking that bit of code Dynamics does not crash.

After launching Dynamics GP, we used Windbg to attach to the Dynamics.exe process.  We then replicated the error and examined the stack trace.

What the stack dump showed as the cause of the error was a stack overflow.  Odd - can't say I've ran across that before.

Looking at the stack dump, if you have the proper symbols, we can see the area of the issue.

0:000> kn
# ChildEBP RetAddr
00 0004385c 7c90df5a ntdll!KiFastSystemCallRet
01 00043860 7c8025cb ntdll!NtWaitForSingleObject+0xc
02 000438c4 7c802532 kernel32!WaitForSingleObjectEx+0xa8
03 000438d8 7a0454ad kernel32!WaitForSingleObject+0x12
04 00043908 7a0459f7 mscorwks!ClrWaitForSingleObject+0x24
05 00043dc4 7a048194 mscorwks!RunWatson+0x1df
06 00044508 7a04885a mscorwks!DoFaultReportWorker+0xb59
07 00044544 7a097542 mscorwks!DoFaultReport+0xc3
08 00044568 7a09618a mscorwks!WatsonLastChance+0x3f
09 000445d0 7a0962d1 mscorwks!EEPolicy::HandleFatalStackOverflow+0xdb
0a 000445f4 7a00b70f mscorwks!EEPolicy::HandleStackOverflow+0x173
0b 00044610 7c9032a8 mscorwks!COMPlusFrameHandler+0x10b
0c 00044634 7c90327a ntdll!ExecuteHandler2+0x26
0d 000446e4 7c90e48a ntdll!ExecuteHandler+0x24
0e 000446e4 7c812a6b ntdll!KiUserExceptionDispatcher+0xe
0f 00044a30 7a0364b4 kernel32!RaiseException+0x53
10 00044a4c 7a0c878a mscorwks!ReportStackOverflow+0x61      <------------here is the stack overflow
11 00044a5c 79e7cc34 mscorwks!Alloc+0x3b
12 00044a9c 79ee1e9d mscorwks!AllocateObject+0x59
13 00044cfc 79ee220e mscorwks!AssemblyNameNative::AssemblyNameInit+0x95
14 00044e40 792efdb3 mscorwks!AssemblyNameNative::Init+0x196
15 00044e64 792d36ed mscorlib_ni+0x22fdb3
16 00044e88 792d1a52 mscorlib_ni+0x2136ed
17 00044eb0 79e71b4c mscorlib_ni+0x211a52
18 00044ec0 79e821f9 mscorwks!CallDescrWorker+0x33
19 00044f40 79e96591 mscorwks!CallDescrWorkerWithHandler+0xa3
1a 00045084 79e965c4 mscorwks!MethodDesc::CallDescr+0x19c
1b 000450a0 79f27ac3 mscorwks!MethodDesc::CallTargetWorker+0x1f
1c 000453f4 79fae617 mscorwks!AppDomain::RaiseAssemblyResolveEvent+0x17e
1d 00045414 79fae6a0 mscorwks!AssemblySpec::ResolveAssemblyFile+0x24
1e 0004549c 79f27cbe mscorwks!AppDomain::TryResolveAssembly+0x76
1f 0004610c 79ea15c0 mscorwks!AppDomain::BindAssemblySpec+0x5a3
20 00046174 79eddefa mscorwks!AssemblySpec::LoadDomainAssembly+0x114
21 00046198 79ee268b mscorwks!AssemblySpec::LoadAssembly+0x1d
22 000462e4 792f11ec mscorwks!AssemblyNative::Load+0x240
23 0004630c 792efc00 mscorlib_ni+0x2311ec
24 00046340 792efdcf mscorlib_ni+0x22fc00
25 00046364 792d36ed mscorlib_ni+0x22fdcf
26 00046388 792d1a52 mscorlib_ni+0x2136ed
27 000463b0 79e71b4c mscorlib_ni+0x211a52

<Below we repeat this block for the length of the dump--------------------------------------------->

28 000463c0 79e821f9 mscorwks!CallDescrWorker+0x33
29 00046440 79e96591 mscorwks!CallDescrWorkerWithHandler+0xa3
2a 00046584 79e965c4 mscorwks!MethodDesc::CallDescr+0x19c
2b 000465a0 79f27ac3 mscorwks!MethodDesc::CallTargetWorker+0x1f
2c 000468f4 79fae617 mscorwks!AppDomain::RaiseAssemblyResolveEvent+0x17e
2d 00046914 79fae6a0 mscorwks!AssemblySpec::ResolveAssemblyFile+0x24
2e 0004699c 79f27cbe mscorwks!AppDomain::TryResolveAssembly+0x76
2f 0004760c 79ea15c0 mscorwks!AppDomain::BindAssemblySpec+0x5a3
30 00047674 79eddefa mscorwks!AssemblySpec::LoadDomainAssembly+0x114
31 00047698 79ee268b mscorwks!AssemblySpec::LoadAssembly+0x1d
32 000477e4 792f11ec mscorwks!AssemblyNative::Load+0x240
33 0004780c 792efc00 mscorlib_ni+0x2311ec
34 00047840 792efdcf mscorlib_ni+0x22fc00
35 00047864 792d36ed mscorlib_ni+0x22fdcf
36 00047888 792d1a52 mscorlib_ni+0x2136ed
37 000478b0 79e71b4c mscorlib_ni+0x211a52

On line 10, we get the stack overflow and it bubbles up from there and Dynamics crashes.

Why is there a stack overflow?  Good question and the reason for it is the section from line 28 to 37 which repeats over and over again - about 200 times until the stack overflows.  Here something is trying to load an assembly.  But unfortunately, we cannot determine exactly what that assembly might be - even looking in the some of the memory locations referenced in the stack trace.  Perhaps I might need a bit more practice or perhaps it is just not there. 

The next thing I did was detach from the crash and then reset Dynamics and Windbg to reproduce the issue.  I noticed this time that Windbg reported "C++ EH exception - code e06d7363 (first chance)." being thrown many times.  This wasn't a big surprise to me as I had deduced that from the stack trace above.  An assembly is trying to be loaded and an error is thrown (mscorwks!AppDomain::RaiseAssemblyResolveEvent).  And then it must be handled in a try block and retried.  Again and again and again.  We did use the Windbg command "sxe eh" to see if a first chance exception break would show us any further information however it did not.  OK - dead end there as best I can tell.

Edit: 3/5/2010: Still trying to pinpoint the assembly trying to be loaded, I tried the FUSLOGVW.exe application -The Assembly Binding Log Viewer.  This app will log .NET assembling binds (success or failures) done by all applications.  Surely this will tell me the name of the missing assembly.

After testing this a bit with a few .NET apps and even Dynamics GP starting up, I disabled the app and went to the point of reproducing the issue and then enabled the Assembly Binding Log Viewer.  After reproducing the issue, I checked the application and it seemed to not have logged anything!  I was surprised at the time.  On retrospect, that might be because the assembly was already "bound" by Dynamics GP.  Or perhaps I just wasn't logging correctly although my other tests did show the successful binds happening.  In either case, this didn't give me the name of the assembly.

Running out of ideas, the next thought would be to to run Process Monitor as that perhaps would pinpoint the file trying to be loaded.

We did a Process Monitor trace and I noticed that Dynamics.exe is attempting to find the Microsoft.Mshtml.dll in the GP folder and the results are "Name Not found".  Then again and again, Dynamics.exe is trying to load that assembly in the GP folder repeatedly and not looking in the typical search path.

Here obviously is the file that is trying to be loaded - the one that the crash dump didn't reveal the name of.

Finding that puzzling but having a good lead on the resolution, we did a search on the machine having the issue.  The Microsoft.Mshtml.dll couldn't be found anywhere on the local machine.

From the IT persons machine we did find the file Microsoft.MSHtml.dll located in the \Primary Interop Assembly folder which I believe was installed with Visual Studio.  Looking in the Dynamics GP folder, this assembly didn't exist there either.  However this machine didn't crash like the other one did.

Since Dynamics.exe was insisting on looking in the GP folder for this file, we copied the file from the IT persons machine into the GP folder of the machine that was crashing and tried to repro the issue.

Happily, we could not!  Everything worked fine.  We then went to a 2nd machine that was crashing and searched for the Microsoft.MSHtml.dll and it also didn't have it on the machine.  After copying the Microsoft.Mshtml.dll to the GP folder, we could once again no longer reproduce the issue and declared victory.

A partial victory anyway.  This isn't the real solution but the customer was quite happy since we now had a way to make all the machines workable.

There are several questions still outstanding that we don't know about.  I could have checked further but after quite a few hours we were all happy that everything was working and called it good.

The unanswered questions are:

  1. Why was Dexterity looking for this file in the GP folder only?
     
  2. Why is it even looking at all?  Obviously that file didn't exist when it displayed the area page BEFORE opening the ISV window so why does it need to find it AFTER it made the COM call?
     
  3. If we exit GP and rename the file and then use the Process Monitor trace while the navigation shell displayed the IE frame, would we see Dynamics.exe looking for the dll?  Or not?  We didn't go back and retry this scenario.
     
  4. What exactly was the ISV code doing?  Was it using some kind of call to the .NET equivalent and since this all happened in the context of Dynamics.exe it caused IE to start trying to use this dll.  I probably won't be able to figure that out as after I notified the ISV we had a resolution he didn't respond back to my requests for source.  So there obviously is an 'issue' here somewhere - very likely in the Dynamics runtime.  But without the source code to the ISV app we likely won't know what that is.

Edit 3/5/2010  I was thinking about this a bit more the last few days.  I have a feeling that I can answer #2 above.  Even though we did a search of the machine for the Microsoft.Mshtml.dll and didn't find it, I don't believe that the search looks in the GAC.  I just tested that theory on my own machine and that appears true.  So in #2 above, I had assumed that the assembly didn't exist on the machine at all and therefore GP wasn't using it at all.  I don't believe that was a correct assumption.  I tested on my own Dynamics GP 10.0 just now by attaching Windbg to my Dynamics.exe process.  Looking in the loaded modules list I see:

C:\WINNT\assembly\GAC\Microsoft.mshtml\7.0.3300.0__b03f5f7f11d50a3a\Microsoft.mshtml.dll

So indeed on my machine Dynamics.exe found it in the GAC and is using it.  I suspect that we'd see that as well as well on the customer machine before opening the custom window.  After using the custom code, Dynamics.exe/.NET somehow loses track of the assembly and insists on only looking in the GP folder.  So I think that answers questions #2 & #3 of mine above, but the rest remain.

Hope this helps,

Patrick
Developer Support

Edit 3/5/2010 - Additional thoughts added to the end.  Also added that we used the Assembly Binding Log Viewer tool to troubleshoot.