What’s wrong with this code, part 21, a psychic debugging example


Over the weekend, one of the developers in my group sent me some mail – he was seeing one of the registers in his code getting corrupted across a procedure call.  He was quite surprised to see this, and asked me for any suggestions.

With the help of the info he gave me, I was able to figure out what had gone wrong with his code, and I realized that it’d make a great “what’s wrong with this code” example.

There are three parts to the code associated with this “what’s wrong”.  The first is an interface definition:

class IPsychicInterface
{
public:
    virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
};

Next, you have a tiny test application:

int _tmain(int argc, _TCHAR* argv[])
{
    register int value1 = 1;
    IPsychicInterface *psychicInterface = GetPsychicInterface();
    register int value2 = 2;

    psychicInterface->DoSomeOperation(argc, argv);
    assert(value1 == 1);
    assert(value2 == 2);
    return 0;
}

The failure  happened when the caller returned from psychicInterface->DoSomeOperation – upon the return, the ESI register, which is supposed to be preserved got trashed.  Further debugging showed that the reason that ESI was trashed was that the stack was imbalanced after the call to DoSomeOperation.

There’s one more piece of information that I was given that let me immediately realize the root cause of the problem.

 

I know that if I include that information, what went wrong should be blindingly obvious, so I’m going to be mean and ask you to tell me what the one additional piece of information was.   The reason that the other developer in my group didn’t find it was simply because he was looking at too much data – if I had pointed out that one piece of additional data, he’d have instantly figured it out too.

 

So the “answer” to this part of the “What’s wrong” problem is “What single additional piece of information was I given that made this problem simple to solve?”

Comments (48)

  1. Anonymous says:

    The additional piece of information may be it works if the developer makes the definition ‘register int value=1’ global. In most of the compilers the stores to local register variables are clobbered because the assemble code in the calling function may use this same register(the availability of the register depends on dataflow analysis of the compiler so local registers are reused) for some other purpose.

    one suggestion would be to find which registers are saved and restored during a function call smilar to %EBP ,%ESP (which are the local frame and stack pointers), don’t store any thing to %EBP,%ESP your stack will be corrupted 🙂

  2. Anonymous says:

    Does the calling convention on the implementation of DoSomeOperation() match the interface definition?

  3. Anonymous says:

    Although slightly off topic, Vamsi’s comment reminded me of a Michael Abrash article at http://www.ddj.com/184405848

    "Many people don’t realize that under Windows it’s safe to use ESP as a general-purpose register because the OS switches to a system stack when fielding interrupts and context switching. (Note, though, that the code is no longer multithread-safe unless thread-local storage is used because ESP must be stored somewhere other than the stack so it can be restored at the end of the routine.)"

    Health warning – messing around with esp is dangerous, and has to be treated with extreme care – however in certain situations (such as what Michael was facing) it may be worthwhile.

    It’s tempting to try this – does anyone have any experience with this approach? Is the behaviour the same across versions of Windows?

  4. Skywing says:

    Sounds like vtable ordering was changed or a function parameter was added to the implementation of an interface without changing the caller, and whatever mechanism that the caller used to locate the interface (i.e. the IID if this is a COM interface) wasn’t changed for the new interface revision.

    For example, this might be a COM interface, and somebody got lazy and didn’t change the IID when they changed the interface, and now instead of failing to create the interface, callers blow up in bad ways (if the call is in-proc) or get mysterious call failures (if the call is marshalled).

    – S

  5. Anonymous says:

    He told you that the class that implemented the interface was exported from a DLL.  That changed the calling convention from __cdecl to __stdcall.

  6. Anonymous says:

    The calling convention.

    It wasn’t specified as part of the method declaration. If either GetPsychicInterface() or main() was compiled using different default calling conventions (e.g. one _cdecl, the other _stdcall) then the stack will go nuts.

  7. Rick Schaut says:

    I’m probably just shooting in the dark on this, but the lack of a prototype for GetPsychicInterface() bothers me.  What if this is a 64-bit system, and the compiler defaults the return type to "int"?

    ‘Course that wouldn’t be the problem here, ’cause you’d very likely crash on the call to DoSomeOperation–unless you just got lucky, or unlucky depending on your point of view.

  8. Anonymous says:

    I’d guess that the piece of information is that GetPsychicInterface() is defined in a different DLL than the application.  Then you would go on to assume that the app was using wide chars and the DLL was not, or vice versa,

  9. Anonymous says:

    "What single additional piece of information was I given that made this problem simple to solve?"

    The threat model? 🙂

  10. dewb: DoSomeOperation uses the thiscall calling convention as do all C++ methods.

  11. Adam: Good answer, unfortunately not right – this isn’t a threat modeling problem :).

    Ben: This all happens in a single executable.

    It’s not a calling convention mismatch, because the calling convention is specified by the fact that this is a pure virtual method on a function.

    Several people are close to the root cause, but I’ll go back to my original question: What was the one piece of information that’s necessary to confirm the diagnosis.

  12. Anonymous says:

    The information that would tip me off is that it only happens in an optimized build or when frame pointer omission is enabled.

    I’m surprised this didn’t expose itself in a debug build, because the enhanced stack checking added in VS2002 will ordinarily flag errors like this.

  13. Anonymous says:

    I’d guess that the IPsychicInterface implementation returned BOOL rather than bool. (Although this requires a C-style cast too and missing the actual inheritance from the implementation.)

  14. James, that wouldn’t cause a stack overrun (the compiler still returns 4 bytes of data when returning a bool, it’s just that only the low 8 bits are used).

    Phaeron: The problem occurs on both optimized and non-optimized builds.  And FPO doesn’t change the outcome.  There was a debug/retail mismatch (the OS was retail, and the binary was debug), but the problem ocurred even with the retail bits.

  15. Anonymous says:

    My guess would be that multiple inheritance is used and there’s an incorrect cast, which causes DoSomeOperation() to call a different method (with a different signature) than actually intended.

  16. Anonymous says:

    This is not necessarily the problem here, but using _TCHAR in an interface definition like this can be risky. Even if _tmain() and the object implementation are in the same executable, they could be compiled with differing ANSI/Unicode modes.

    But again, I don’t think that’s the problem.

  17. Anonymous says:

    The name IPsychicInterface multiply self-documents itself as being an interface, so obviously there is multiple inheritance.

    I’m not sure what piece of information I’d ask for though.  I’d ask to see the code of GetPsychicInterface, and the object that GetPsychicInterface gets the interface interface of — which means I fail my psychics exam.

  18. Anonymous says:

    My guess is that the missing piece is "It runs fine on x86 but it fails on x64"?

  19. Anonymous says:

    I’d place a wild guess at the _TCHAR being the root cause: one part of the code being ANSI and the other being Unicode.

  20. Anonymous says:

    "the stack was imbalanced after the call to DoSomeOperation" is the clue that others have focused on, and I will too. It sounds like a calling convention mismatch. So, the single piece of missing information is "have you had your morning coffee yet?" or "what is the default calling convention". I’d guess that the CC used was __cdecl and the code that implemented the interface was declared as __stdapi.

    As a side note, the code reads like unmanaged C++ (TCHAR)…if this were managed code it never would have happened.

  21. Anonymous says:

    DoSomeOperation had an errant inline assembly routine that left the stack imbalanced for certain values of argc/argv 🙂

  22. Anonymous says:

    Was the program compiled with ANSI or UNICODE defined?

    Was the interface compiled with ANSI or UNICODE defined?

  23. Anonymous says:

    Different Unicode settings resulting in TCHAR #defined as char in one module, and wchar in another?

  24. Anonymous says:

    You’re passing TCHARs to DoSomeOperation. Was this an ANSI build?

  25. Laura: Actually it failed on x86 as well.  

    Charlie: You’re right, but _TCHAR is just what I used in my example – the real version didn’t do that.

    GetPsychicInterface isn’t relevant to the problem.

  26. Mo: _TCHAR won’t cause a stack imbalance.

  27. Most of you are dancing around the right answer (calling convention mismatch).  But what’s the one piece of information that I needed to exactly diagnose the problem?

  28. Anonymous says:

    Is it that this is managed C++ and DoSomeOperation calls a .NET delegate improperly?

  29. Anonymous says:

    My "education" guess is this.

    "static_cast didn’t work so we restored to a "C" style cast".

    Why?

    With a thiscall, the stack must be cleaned up by the callee.  So since the routines are defined properly in the interface, what would cause the callee to improperly clean the stack?  Well, as far as the callee is concerned, the stack was cleaned properly.  Why?  Because due to the "C" style cast, the wrong object was returned resulting in the wrong vtable being used.  This would cause the wrong routine to fail.

    Sadly, that is the best I can do.

  30. Anonymous says:

    Was ESP too low (not enough popped off) or too high (too much popped off) after the call?

  31. nksingh: woah, good question.  In this case, it was too high.

    Jason: Nope, all native.

    Tim: The cast shouldn’t matter, but you’re on the rght track.

  32. Anonymous says:

    Is the correct function actually being called?

  33. nathan_works says:

    I think I see the problem. Your coworker was working on the weekend. Maybe mixed in a little beer for a pre-work tailgate.

    But, if the stack pointer was too high, than was it caused by both caller and callee popping it ?

    or more obscure and probably wrong: was it popped once, and the party popping it had the wrong interface prototype/signature, so it popped "extra" argument entries off that weren’t actually there ?

  34. Anonymous says:

    (since this moderated I have seen up through nksingh’s post and Larry’s response. I also looked up thiscall on wikipedia)

    My guess…

       The function sig. changed from var args so both the caller and callee cleaned up the stack. The server was not recompiled.

       I’m still not happy though since one of the early debugging steps is ‘rebuild all’ and that should be before ‘ask Larry’. Anyway that is my guess for today.

  35. Anonymous says:

    Larry, that cast would matter with multiple inheritance and forward declarations.  Thus the part of "static_cast wouldn’t work".  C casting a between forward declarations would just result in the lhs being the same as the rhs pointer.  But with multiple inheritance, any casting of a base class that isn’t the first class usually results in a different lhs pointer (very compiler specific).

    But I can’t think of another way the wrong routine would be called.

    ponder… ponder… ponder…

    I am assuming the code in question is all being built by the same compiler.  C++ isn’t very inter-operable between DLLs and different compilers.

  36. Anonymous says:

    Does IPsychicInterface have operators overloaded that we should know about? -> ? Just a wild guess…

  37. Anonymous says:

    slightly off topic

    isn’t register deprecated in C?

    I believe all compilers threat it like normal declaration and will probably not put the variable in register?

  38. Anonymous says:

    As I mentioned yesterday , one of the other developers in my group had hit a sticky problem, and he asked

  39. Anonymous says:

    As I mentioned yesterday , one of the other developers in my group had hit a sticky problem, and he asked

  40. dimkaz says:

    Larry, are the classes involved have multiple inheritance and a virtual base?

  41. dimkaz says:

    One usually cannot cast from a pointer to derived class to a virtual base. (unless forced by c style cast). the call usually goes via a thunk which would expect 2 arguments (vtable and offset).

    In this case only one was provided.

  42. Anonymous says:

    What the file extention was. e.g. .cpp or .c

  43. Anonymous says:

    Were you given compiler output that included a warning about inconsistent linkage or something like that?

  44. Nope, there were no warnings at all.  Tomorrow I’ll post both the .cpp files associated with the bug.

  45. Anonymous says:

    Second guess: The one piece of information is “if I change the order of value1/value2/psychicInterface, it works”.

    (Where “works” is “seems to work”).

  46. Anonymous says:

    So for the past couple of posts , I’ve been walking through a psychic debugging experience I had over

  47. Anonymous says:

    So for the past couple of posts , I've been walking through a psychic debugging experience I had