Investigating an app compat problem: Part 1: The initial plunge

Today we're going to look at an application compatiblity problem. Actually, today, we'll just look at the crash that is the reason why we have an application compatibility problem. We'll then spend the next few days trying to figure out what went wrong. Today's installment will be rather long because I want to go through the entire process of making the initial diagnosis, so that we won't have to spend each day re-establishing context.

I also want to point out that even though this is presented as a straightforward narrative, the actual analysis involved a good number of dead ends and sitting around thinking, "Great, what am I supposed to do now?"

We start with the actual crash.

eax=00000001 ebx=31410000 ecx=00000000 edx=40000000 esi=001345a8 edi=1fee0a9c
eip=314259d3 esp=0013443c ebp=0013444c iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
314259d3 mov     eax,dword ptr [ecx+eax*4]            ds:0023:00000004=????????

(Obviously, the symbols are wrong. Which makes sense because we don't have symbols for contoso.dll.)

What we have here is an array index operation into an array, but the array is a null pointer. Let's see where that null pointer came from.

314259a0 push    ebp
314259a1 mov     ebp, esp
314259a3 sub     esp, 10h
314259a6 mov     dword ptr [ebp-10h], ecx
314259a9 mov     eax, dword ptr [ebp+8]
314259ac mov     dword ptr [ebp-8], eax
314259af lea     ecx, [ebp-0Ch]
314259b2 push    ecx
314259b3 lea     edx, [ebp-4]
314259b6 push    edx
314259b7 mov     eax, dword ptr [ebp-8]
314259ba push    eax
314259bb call    contoso!ContosoInitialize+0x4db0 (31425a10)
314259c0 add     esp, 0Ch
314259c3 mov     edx, 1
314259c8 mov     ecx, dword ptr [ebp-0Ch]
314259cb shl     edx, cl
314259cd mov     eax, dword ptr [ebp-4]
314259d0 mov     ecx, dword ptr [ebp-10h]
314259d3 mov     eax, dword ptr [ecx+eax*4]

Ah, it came from the this pointer. the object itself has an array as its first member, and we're trying to index into it, but the this pointer is null, so there is nothing to index into. Where did that null pointer come from? We'll have to go up the stack to see what the caller is passing as this.

Here's the stack trace, or at least part of it.

ChildEBP RetAddr
0013444c 31426201 contoso!ContosoInitialize+0x4d73
00134478 31427401 contoso!ContosoInitialize+0x55a1
00134484 3142b030 contoso!ContosoInitialize+0x67a1
00134494 3142afdd contoso!ContosoInitialize+0xa3d0
001344c4 314279e0 contoso!ContosoInitialize+0xa37d
001344dc 31427e86 contoso!ContosoInitialize+0x6d80
001344f8 31427c2f contoso!ContosoInitialize+0x7226
00134504 31427bd3 contoso!ContosoInitialize+0x6fcf
00134530 314332b8 contoso!ContosoInitialize+0x6f73
00134554 31432278 contoso!ContosoInitialize+0x12658
00134578 3142dd7a contoso!ContosoInitialize+0x11618
001345a8 3142e326 contoso!ContosoInitialize+0xd11a
00134668 3142e4b5 contoso!ContosoInitialize+0xd6c6
0013468c 3142e5b6 contoso!ContosoInitialize+0xd855
001346b0 31424766 contoso!ContosoInitialize+0xd956
001346bc 31421c0d contoso!ContosoInitialize+0x3b06
001346e0 314c8cba contoso!ContosoInitialize+0xfad
00134790 314b5e08 contoso!ContosoInitialize+0xa805a
001347b0 77c60c6e contoso!ContosoInitialize+0x951a8
001347d0 77c12f17 ntdll!RtlAddAuditAccessAceEx+0x5e
00134820 77c09331 ntdll!RtlActivateActivationContextUnsafeFast+0xd7
001348a0 77c091b4 ntdll!RtlLoadString+0x3c1
001348c4 77c07d68 ntdll!RtlLoadString+0x244
001348e0 77c0bd21 ntdll!LdrGetDllHandleByMapping+0xe8
00134928 77c0b941 ntdll!RtlAcquirePebLock+0x391
00134a74 77c0b615 ntdll!LdrLoadDll+0x3a1
00134af8 751a7f94 ntdll!LdrLoadDll+0x75
00134b3c 751aad06 KERNELBASE!LoadLibraryExW+0x124
00134b64 018aa533 KERNELBASE!LoadLibraryExA+0x26
00134c18 018aa410 setup+0x3a533

At this point, you might choose to shudder in horror, because you can see that this DLL is running a ton of code during DLL_PROCESS_ATTACH

Once you overcome your initial shock, you can pick off the caller (highlighted in the stack trace as the return address) and take a look at the enclosing function.

314261d0 push    ebp
314261d1 mov     ebp, esp
314261d3 push    0FFFFFFFFh
314261d5 push    offset contoso!ContosoInitialize+0xa5798 (314c63f8)
314261da mov     eax, dword ptr fs:[00000000h]
314261e0 push    eax
314261e1 mov     dword ptr fs:[0], esp
314261e8 sub     esp, 14h
314261eb mov     dword ptr [ebp-20h], ecx // "this"
314261ee mov     eax, dword ptr [ebp+8]
314261f1 push    eax
314261f2 mov     ecx, dword ptr [ebp-20h] // "this"
314261f5 call    contoso!ContosoInitialize+0x5620 (31426280)
314261fa mov     ecx eax
314261fc call    contoso!ContosoInitialize+0x4d40 (314259a0)

The null pointer came from the call to mystery member function 31426280. Let's see how that function calculated its return value. Since the function is long and contains branches, we'll follow the code forward. This is not guaranteed to be accurate, because the contents of memory at the time of the crash may not match the contents of memory at the time the code was executed, but we'll take that chance for now. (Remember, debugging is an exercise in optimism.)

31426280 push    ebp
31426281 mov     ebp, esp
31426283 push    0FFFFFFFFh
31426285 push    offset contoso!ContosoInitialize+0xa57c3 (314c6423)
3142628a mov     eax, dword ptr fs:[00000000h]
31426290 push    eax
31426291 mov     dword ptr fs:[0], esp
31426298 sub     esp, 24h
3142629b mov     dword ptr [ebp-2Ch], ecx // "this"
3142629e mov     eax, dword ptr [ebp-2Ch] // "this"
314262a1 mov     ecx, dword ptr [eax+400h] // read member at offset 0x400
314262a7 cmp     ecx, dword ptr [contoso!ContosoInitialize+0xe5d3c (3150699c)]
314262ad jne     contoso!ContosoInitialize+0x56c0 (31426320)

Was this jump taken? Let's find out. First, extract the this pointer passed to the mystery member function 31426280, then look at the member at offset 0x400, then compare it to the value at 3150699c.

0:000> dd 00134478-20 L1
00134458 1fee50f8
0:000> dd 1fee50f8+400 L1
1fee54f8  00000041
0:000> dd 3150699c L1
3150699c  00000000

The values do not match, so let's assume the jump was taken.

31426320 mov     dword ptr [ebp-10h], 0
31426327 lea     eax, [ebp-10h]             // bonus parameter
3142632a push    eax
3142632b mov     ecx, dword ptr [ebp-2Ch]   // "this"
3142632e mov     edx, dword ptr [ecx+400h]  // the value is 0x41
31426334 push    edx
31426335 call    contoso!ContosoInitialize+0x50b0 (31425d10)

Let's follow the money into mystery free function 31425d10.

31425d10 push    ebp
31425d11 mov     ebp, esp
31425d13 push    0FFFFFFFFh
31425d15 push    offset contoso!ContosoInitialize+0xa5738 (314c6398)
31425d1a mov     eax, dword ptr fs:[00000000h]
31425d20 push    eax
31425d21 mov     dword ptr fs:[0], esp
31425d28 sub     esp, 10h
31425d2b call    contoso!ContosoInitialize+0x952f3 (314b5f53)
31425d30 push    eax
31425d31 lea     ecx, [ebp-14h]
31425d34 call    contoso!ContosoInitialize+0xfe0 (31421c40)
31425d39 mov     dword ptr [ebp-4], 0
31425d40 mov     eax, dword ptr [ebp+8]
31425d43 push    eax
31425d44 call    dword ptr [contoso!ContosoInitialize+0xa8590 (314c91f0)] // TlsGetValue -> goes into the variable
31425d4a mov     ecx, dword ptr [ebp+0Ch]
31425d4d mov     dword ptr [ecx], eax
31425d4f mov     edx, dword ptr [ebp+0Ch]
31425d52 cmp     dword ptr [edx], 0
31425d55 jne     contoso!ContosoInitialize+0x5125 (31425d85)
31425d57 call    dword ptr [contoso!ContosoInitialize+0xa840c (314c906c)] // GetLastError
31425d5d push    eax
31425d5e lea     ecx, [ebp-14h]
31425d61 call    contoso!ContosoInitialize+0x5150 (31425db0)
31425d66 test    eax, eax
31425d68 je      contoso!ContosoInitialize+0x5125 (31425d85)
31425d6a mov     dword ptr [ebp-18h], 0FFFFFFFFh
31425d71 mov     dword ptr [ebp-4], 0FFFFFFFFh
31425d78 lea     ecx, [ebp-14h]
31425d7b call    contoso!ContosoInitialize+0xff0 (31421c50)
31425d80 mov     eax, dword ptr [ebp-18h]
31425d83 jmp     contoso!ContosoInitialize+0x513e (31425d9e)
31425d85 mov     dword ptr [ebp-1Ch], 0
31425d8c mov     dword ptr [ebp-4], 0FFFFFFFFh
31425d93 lea     ecx, [ebp-14h]
31425d96 call    contoso!ContosoInitialize+0xff0 (31421c50)
31425d9b mov     eax, dword ptr [ebp-1Ch]
31425d9e mov     ecx, dword ptr [ebp-0Ch]
31425da1 mov     dword ptr fs:[0], ecx
31425da8 mov     esp, ebp
31425daa pop     ebp
31425dab ret

There are two exit paths for this function; one returns -1 and the other returns 0. Let's see if we can figure out which path got executed. The execution is straight-line until we reach this decision:

31425d39 mov     dword ptr [ebp-4], 0
31425d40 mov     eax, dword ptr [ebp+8]     // Parameter 0 = 0x00000041
31425d43 push    eax
31425d44 call    dword ptr [contoso!ContosoInitialize+0xa8590 (314c91f0)] // TlsGetValue
31425d4a mov     ecx, dword ptr [ebp+0Ch]   // Parameter 1 (the bonus parameter)
31425d4d mov     dword ptr [ecx], eax       // Save the TLS value into parameter 1
31425d4f mov     edx, dword ptr [ebp+0Ch]   // Parameter 1
31425d52 cmp     dword ptr [edx], 0         // Was the TLS value zero?
31425d55 jne     contoso!ContosoInitialize+0x5125 (31425d85)

I was able to figure out that the called function was TlsGetValue by looking at the value in the dump file.

0:000> ln poi 314c91f0
7603a1f0 mov     edi,edi

The branch is taken if the value in the TLS slot is nonzero. So is it? Let's manually re-execute the call to TlsGetValue.

751b6ed0 mov     edi, edi
751b6ed2 push    ebp
751b6ed3 mov     ebp, esp
751b6ed5 mov     ecx, dword ptr fs:[18h] // get the TEB
751b6edc mov     eax, dword ptr [ebp+8] // tlsIndex (0x00000041)
751b6edf mov     dword ptr [ecx+34h], 0 // SetLastError(NOERROR);
751b6ee6 cmp     eax, 40h
751b6ee9 jae     KERNELBASE!TlsGetValue+0x26 (751b6ef6) // Jump taken
751b6ef6 cmp     eax,440h
751b6efb jae     KERNELBASE!TlsGetValue+0x42 (751b6f12) // Jump not taken
751b6efd mov     ecx,dword ptr [ecx+0F94h]
751b6f03 test    ecx,ecx
751b6f05 je      KERNELBASE!TlsGetValue+0x4c (751b6f1c)

0:000> dd @$teb+f94 L1
7ffdff94  05f5bf30

The value is nonzero, so the jump is not taken.

751b6f07 mov     eax,dword ptr [ecx+eax*4-100h]
751b6f0e pop     ebp
751b6f0f ret     4
0:000> dd 05f5bf30+41*4-100 L1
05f5bf34  00000000

The TLS value is zero.

Okay, let's go back to the function that called TlsGetValue and see what it does when the TLS value is zero.

31425d52 cmp     dword ptr [edx], 0         // Was the TLS value zero?
31425d55 jne     contoso!ContosoInitialize+0x5125 (31425d85) // Jump not taken
31425d57 call    dword ptr [contoso!ContosoInitialize+0xa840c (314c906c)] // GetLastError
31425d5d push    eax                        // NOERROR
31425d5e lea     ecx,[ebp-14h]
31425d61 call    contoso!ContosoInitialize+0x5150 (31425db0)
31425d66 test    eax,eax
31425d68 je      contoso!ContosoInitialize+0x5125 (31425d85)

We don't pay attention to Get­Last­Error because it doesn't affect flow control. The thing that does affect flow control is function 31425db0, so let's see what it does.

31425db0 push    ebp
31425db1 mov     ebp,esp
31425db3 push    ecx
31425db4 mov     dword ptr [ebp-4],ecx
31425db7 mov     eax,dword ptr [ebp-4]
31425dba mov     ecx,dword ptr [ebp+8]
31425dbd mov     dword ptr [eax+4],ecx
31425dc0 mov     eax, dword ptr [ebp+8]
31425dc3 mov     esp, ebp
31425dc5 pop     ebp
31425dc6 ret     4

The function returns its input parameter. (It does other stuff, but we aren't interested in that part yet.) Therefore, since we passed NOERROR as the error code, it also returns NOERROR, which is zero. We also saw that the TLS value zero is stored into the bonus parameter, which was [ebp-10h]. Armed with this information, we can continue our analysis back to the caller:

31425d66 test    eax,eax
31425d68 je      contoso!ContosoInitialize+0x5125 (31425d85) // jump taken

And we saw from the analysis earlier that once you hit the code path at 31425d85, the function will return zero.

Now we can unwind back to the caller and see what happens when function 31425d10 returns zero.

31426335 call    contoso!ContosoInitialize+0x50b0 (31425d10) // Returns zero
3142633a add     esp, 8
3142633d cmp     eax, 0FFFFFFFFh
31426340 jne     contoso!ContosoInitialize+0x56e9 (31426349) // jump not taken
31426342 xor     eax, eax                           // return zero
31426344 jmp     contoso!ContosoInitialize+0x5783 (314263e3)

314263e3 mov     ecx, dword ptr [ebp-0Ch]
314263e6 mov     dword ptr fs:[0], ecx
314263ed mov     esp, ebp
314263ef pop     ebp
314263f0 ret

Okay, so now we know where the null pointer came from: The null pointer was stored in the TLS slot, and this function returns whatever is in the TLS slot.

That is a superficial analysis of why the program crashed. It doesn't really tell us what happened in the operating system that induced the crash; we'll have to dig in deeper.

The story continues next time.

Comments (16)
  1. Yuri Khan says:

    Fhtagn, Raymond, what’s up with your blog‽ It’s covered in canvas!

    Good news: the arrows drawn on the canvas align with the text correctly. Which is no small feat.

    Bad news: the canvas is opaque with respect to clicking (incl. clicking on the hyperlinks) and text selection (incl. [selection reading][1]). And it blocks the horizontal scroll bar on the widest listing (the one about TlsGetValue -> g… the rest did not fit my half-screen window).


    1. It worked fine in staging, but when published, the canvas vanished! I hacked it by raising the canvas, but yeah, it messes with clicks. I’ll see what I can do, but I don’t have time right now. Probably some bad interaction with the site’s style sheet.

      1. DWalker says:

        Raymond, you used to say “I don’t control the blog software”. Now you have been sucked into maintaining something approximating the blog software. :-)

        1. Joshua says:

          Nope; he’s just maintaining his own HTML.

      2. ChDF T says:

        You could use the css rule pointer-events: none;. According to this should work in all major desktop browsers.

        1. Yay! Thanks for the tip.

  2. Jonathan McDougall says:

    That canvas overlay is broken for font sizes that are not 100% by the way, and kills text selection.

    Apart from that, very interesting read. I’m looking forward to next episode.

  3. DWalker says:

    Small nit: The crash isn’t the reason we have an appcompat problem; the crash is the notification that we have the problem! The appcompat problem is the cause of the crash.

    But, still, very interesting post.

  4. I’m curious: How does one obtain the rights to do something like this? I thought the standard Microsoft EULA banned reverse engineering.

    1. Chris Iverson says:

      One, I suppose it depends on how far you take the definition of “reverse engineering”. I would consider creating source code from binary code to be reverse engineering, but not debugging executable code.

      Two, he mentions dump files a couple times. I’m guessing this is a crash dump from one of the “send crash info to Microsoft” notifications that you can enroll your programs into. I don’t know the agreement for enrollment like that, but I would imagine something in it states Microsoft can examine your crash dumps to improve Windows compatibility.

      1. Karellen says:

        One, I wonder[0] what Microsoft’s own position is on other people debugging their executable code with respect to the anti-reverse-engineering provisions in their EULAs.

        Two, the person sending the crash dump is likely *not* the author of the program, but a user of it, and therefore won’t have the right to authorise Microsoft to reverse-engineer a 3rd-party program.

        [0] ’tis but an idle wonder – am not trying to put OGH on the spot.

        1. Brian says:

          Lots of folks in the Microsoft support organization (and in other Microsoft groups) do post-mortem crash-dump debugging. For the most part, it is done on customer code, using the standard support agreements. Crash-dump debugging is a well known art at Microsoft, and everything is above board legally (as far as I know – the closest to lawyering I ever get is watching one on TV).

        2. Chris Iverson says:

          “Two, the person sending the crash dump is likely *not* the author of the program, but a user of it, and therefore won’t have the right to authorise Microsoft to reverse-engineer a 3rd-party program.”

          Except the “send crash data to Microsoft” option only appears if the software vendor has enrolled in Microsoft’s crash labs(can’t remember what the actual service is called) and already granted permission to MS to gather the data.

          At least, I’m fairly certain that’s the case…that’s why it doesn’t appear on all applications when they crash.

    2. Anon says:

      Debugging and reverse-engineering are only tangentially related. You might need a debugger to do reverse-engineering, but they have nothing to do with each other beyond that.

      If “debugging” was “reverse-engineering,” then editing configuration files by hand would be as well.

    3. Yuhong Bao says:

      Yea, it should be removed. Fun quote from Volume Licensing Product Terms:
      “Licensed User may reverse engineer, decompile or disassemble the Advanced Threat Analytics software, or otherwise attempt to derive the source code for the Advanced Threat Analytics Server software solely to the extent required to debug changes to any libraries licensed under the GNU Lesser General Public License that may be included with and linked to by the software.”

    4. I would greatly appreciate it if you didn’t bring up legal issues in the comments. I don’t want my blog shut down. (Do you post questions like this whenever Mark Russinovich debugs a program?)

Comments are closed.

Skip to main content