What's wrong with this code, part 21 - A Psychic Debugging Example - The answers.

So for the past couple of posts, I've been walking through a psychic debugging experience I had over the weekend.

As I presented the problem, there were three pieces of information needed to debug the problem.

An interface:

class IPsychicInterface
{
public:
    virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
};

A test application:

int _tmain(int argc, _TCHAR* argv[]){    register int value1 = 1;    IPsychicInterface *psychicInterface = GetPsychicInterface();    register int value2 = 2;

    psychicInterface->DoSomeOperation(argc, argv);
    assert(value1 == 1);
    assert(value2 == 2);
    return 0;
}

and some assembly language code:

0040106B  pop         edi 
0040106C  pop         ebx 
0040106D  mov         al,1
0040106F  pop         esi 
00401070  ret         0Ch 

As I mentioned in my last post, the problem was tracked down to a stack imbalance when calling the DoSomeOperation method, and when I saw the postamble for DoSomeOperation, I quickly realized the answer to the problem.

There are essentially 4 separate calling conventions supported by Microsoft's compilers - it turns out that you can figure out several of them from just looking at the code.  For the stdcall and thiscall calling conventions, input parameters are passed onto the routine, and the callee is responsible for cleaning up the stack (this contrasts with the cdecl calling convention where the caller is responsible for cleaning the stack).  From the postamble, we know that this function is either a stdcall or a thiscall function, since the "ret" instruction adjusts the stack.

I've already stated that this is x86 code, and the RET 0CH indicates that the routine pops off 12 bytes of values off the stack.  This is clearly a problem, because the DoSomeOperation routine only takes two parameters (which would take 8 bytes). The RET 0CH implies that the implementation of DoSomeOperation took 3 parameters!

 

This implies that we're dealing with a violation of the one definition rule (ODR).  The One Definition Rule is a part of the C++ standard (section 3.2) which states: "No translation unit shall contain more than one definition of any variable, function, class type, enumeration type or template.".  In other words, when you declare a function in separate object files, you need to make sure that they all use the same definitions of structures. 

Most commonly ODR violations this show up when you change a header file but don't rebuild all the source files that depend on that file - there's a ton of work that's been done to automatically manage dependencies to avoid this particular issue.

And if you look at the source code for the PsychicInterface logic, you'll see the problem immediately:

class IPsychicInterface {public:    virtual bool DoSomeOperation(int argc, _TCHAR *argv[], _TCHAR *envp[]) = 0;}; bool CPsychicInterface::DoSomeOperation(int argc, _TCHAR *argv[], _TCHAR *envp[]){    int count = argc;    while (count--)    {        printf("%S", argv[count]);    }    return true;}

The PsychicInterface code has it's own private definition of IPsychicInterface which doesn't match the definition in the test application.

Obviously this is an utterly contrived example.  The real problem was much more complicated than this - the violation was in an export from a DLL, and involved external components, which made this more complicated.  In many ways, it was similar to the problem that Raymond talked about here (except in this case, we're in a position to fix the code involved).