What's wrong with this code, Part 21 - A psychic debugging example: The missing piece

As I mentioned yesterday, one of the other developers in my group had hit a sticky problem, and he asked me for my opinion on what was going wrong.

There were 3 pieces of information that I needed to use to diagnose the problem, I gave you two of them yesterday:

The interface:

class IPsychicInterface
{
public:
    virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
};

And the test application:

int _tmain(int argc, _TCHAR* argv[]){    register int value1 = 1;    IPsychicInterface *psychicInterface = GetPsychicInterface();    register int value2 = 2;

    psychicInterface->DoSomeOperation(argc, argv);
    assert(value1 == 1);
    assert(value2 == 2);
    return 0;
}

Originally the problem was that the ESI register was being trashed.  Since the C and C++ calling convention requires that the ESI register be preserved and the ESI register was trashed, that narrowed down the failure to three possible causes to the problem:

  • Somewhere inside DoSomeOperation, there was a stack overflow that caused the saved version of ESI to be corrupted.  This was actually my first thought.
  • Somewhere inside DoSomeOperation, there was a stack imbalance, which would cause garbage to be restored when the ESI register was popped off the stack.  Normally the compiler catches these errors, so I originally discounted this possibility.
  • There was a horrible compiler bug or OS bug which caused the register to be trashed (which is extraordinarily unlikely (but has happened)).

The other developer had chased the problem down further and realized that there was a stack imbalance on the call to DoSomeOperation.  There are basically two things that can cause a stack imbalance, most of the people who left comments in the original post caught one of them, some caught the other:

  • A calling convention mismatch.
  • A parameter declaration mismatch.

But I didn't have enough information to figure out which of the two it was.  That's when he gave me the final piece that let me accurately figure out what was going wrong.

The final piece was the last bit of assembly language in the DoSomeOperation function:

0040106B  pop         edi 
0040106C  pop         ebx 
0040106D  mov         al,1
0040106F  pop         esi 
00401070  ret         0Ch 

Now that you have the last piece, what was the bug?  Be specific - we already know that the problem is a stack imbalance, but what's the root cause?

For a bonus, why didn't the compiler catch it?