Does __fastcall make a difference for C++ classes?


We're running through a routine round of code reviews of the audio engine, and I noticed the following code (obscured):

HRESULT __fastcall CSomeClass::SomeMethod(SomeParameters);  

I looked at it a couple of times, because it seemed like it was wrong. The thing that caught my eye was the "__fastcall" declaration. __fastcall is a Microsoft C++ extension that allows the compiler to put the first 2 DWORD parameters to the routine into the ECX and EDX registers (obviously it only has an effect on 32-bit x86).

But when compiling C++ code, the default calling convention is "thiscall", and in the thiscall convention, the "this" pointer is passed in the ECX register, which seems to collide with the __fastcall declaration.
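
To make the question concrete, here is a minimal sketch of the two kinds of member function declaration being compared. The class name Sample, the method names, and the FCTEST_FASTCALL macro are hypothetical and are not from the code under review. The macro is only there so the sketch still builds on compilers and targets where the keyword does not apply.

```cpp
// __fastcall is an MSVC extension that only matters on 32-bit x86.
// FCTEST_FASTCALL is a hypothetical macro: it expands to the keyword there
// and to nothing everywhere else, so the sketch builds on other toolchains.
#if defined(_MSC_VER) && defined(_M_IX86)
#  define FCTEST_FASTCALL __fastcall
#else
#  define FCTEST_FASTCALL
#endif

class Sample
{
public:
    // Member function with an explicit calling convention.
    int FCTEST_FASTCALL Fast(int *a, int *b) { return *a * *b; }

    // Member function that takes the default convention for member
    // functions (thiscall on 32-bit x86 MSVC).
    int Default(int *a, int *b) { return *a * *b; }
};
```

Both methods compute the same value. Any difference would be in how the caller hands over "this" and the two pointers, which only shows up in the generated code.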

So does it make a difference?  I could have left a code review comment and made the person who owned the code run through the exercise, but I figured why not figure out the answer myself?  And, to be honest, I found the path to the answer almost more interesting than the answer itself.

As I usually do in these cases, I wrote a tiny little test application to test it out:

class fctest
{
    int _member;
public:
    fctest(void);
    ~fctest(void);
    int __fastcall FastcallFunction(int *param1, int *param2)
    {
        return *param1 * *param2;
    }

    int ThiscallFunction(int *param1, int *param2)
    {
        return *param1 * *param2;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    fctest test;
    int param1, param2;
    int result;
    result = test.FastcallFunction(&param1, &param2);
    result = test.ThiscallFunction(&param1, &param2);
    return 0;
}

 

I compiled it for "Retail", and then I looked at the generated output.  Somewhat to my surprise, the code generated was:

main:
    xor eax, eax
    ret

Yup, the compiler had optimized out my entire program.  Crud, back to the drawing board.

Try #2:

int _tmain(int argc, _TCHAR* argv[])
{
    fctest test;
    int param1, param2;
    int result;
    result = test.FastcallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    result = test.ThiscallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    return 0;
}

This one was somewhat better:

main:
    mov eax, [sp]
    imul eax, [sp+4]
    <call to printf #1>
    <call to printf #2>
    xor eax, eax
    ret

Hmm, that wasn't much of an improvement.  The compiler realized that FastcallFunction and ThiscallFunction did the same thing and not only did it inline the call, but it optimized out the 2nd call.

Try #3:

int _tmain(int argc, _TCHAR* argv[])
{
    fctest test;
    int param1, param2;
    int result;
    param1 = rand();
    param2 = rand();
    result = test.FastcallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    param1 = rand();
    param2 = rand();
    result = test.ThiscallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    return 0;
}

 

Try #3's code:

main:
    call rand
    mov [sp], eax
    call rand
    mov [sp+4], eax
    mov eax, [sp]
    imul eax, [sp+4]
    <call to printf #1>
    call rand
    mov [sp], eax
    call rand
    mov [sp+4], eax
    mov eax, [sp]
    imul eax, [sp+4]
    <call to printf #2>
    xor eax, eax
    ret

Much better: now at least both calls show up in the generated code. But the stupid functions are STILL inlined, so I haven't learned anything yet.
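
As an aside, one of the commenters below points out that __declspec(noinline) tells the compiler not to inline a particular function without disabling any other optimization. A minimal sketch of that alternative follows. The names TEST_NOINLINE, NoInlineTest and CallWithRuntimeValues are hypothetical, and the GCC/Clang attribute in the macro is an assumption added only so the sketch builds outside MSVC.

```cpp
#include <cstdlib>

// TEST_NOINLINE is a hypothetical macro wrapping the compiler-specific
// "do not inline this function" annotation.
#if defined(_MSC_VER)
#  define TEST_NOINLINE __declspec(noinline)
#else
#  define TEST_NOINLINE __attribute__((noinline))   // GCC and Clang spelling
#endif

class NoInlineTest
{
public:
    // Marked noinline so a real call should survive in optimized builds.
    TEST_NOINLINE int Multiply(int *param1, int *param2);
};

int NoInlineTest::Multiply(int *param1, int *param2)
{
    return *param1 * *param2;
}

// Feeds the function values the compiler cannot know at compile time,
// in the same spirit as the rand() calls in try #3.
int CallWithRuntimeValues()
{
    NoInlineTest test;
    int param1 = std::rand() % 100;
    int param2 = std::rand() % 100;
    return test.Multiply(&param1, &param2);
}
```

The same commenter notes that with whole program optimization (/GL and /LTCG) the optimizer may still choose its own calling convention for a function that is never called from outside the binary, so this only addresses the inlining part of the problem.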

Try #4: I moved fctest into its own source file (I'm not going to show the source code for this one).

The code for this one finally got it right:

            

            param1 = rand();
00401029 call rand (401131h)
0040102E mov dword ptr [esp+4],eax
            param2 = rand();
00401032 call rand (401131h)
00401037 mov dword ptr [esp],eax
            result = test.FastcallFunction(&param1, &param2);
0040103A lea eax,[esp]
0040103D push eax
0040103E lea edx,[esp+8]
00401042 lea ecx,[esp+0Ch]
00401046 call fctest::FastcallFunction (4010E0h)
            printf("%d: %d: %d", param1, param2, result);
0040104B mov ecx,dword ptr [esp]
            param1 = rand();
00401062 call rand (401131h)
00401067 mov dword ptr [esp+4],eax
            param2 = rand();
0040106B call rand (401131h)
00401070 mov dword ptr [esp],eax
            result = test.ThiscallFunction(&param1, &param2);
00401073 lea eax,[esp]
00401076 push eax
00401077 lea ecx,[esp+8]
0040107B push ecx
0040107C lea ecx,[esp+10h]
00401080 call fctest::ThiscallFunction (4010F0h)

So what's in all this gobbledygook?

Well, the relevant parts are the instructions from 0x40103A to 0x401046 and from 0x401073 to 0x40107C. Side by side, they are:

0040103A lea eax,[esp]
0040103D push eax
0040103E lea edx,[esp+8]
00401042 lea ecx,[esp+0Ch]
00401046 call fctest::FastcallFunction (4010E0h)

00401073 lea eax,[esp]
00401076 push eax
00401077 lea ecx,[esp+8]
0040107B push ecx
0040107C lea ecx,[esp+10h]
00401080 call fctest::ThiscallFunction (4010F0h)

Note that for both functions, the ECX register is loaded with the address of "test". But in the fastcall function, the 1st parameter is loaded into the EDX register, while in the thiscall function it's pushed onto the stack. In both cases the 2nd parameter is pushed onto the stack.

So yes, __fastcall makes a difference for C++ classes.  Not as much as it does for C functions, but it DOES make a difference.
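
The result can be summarized as a small model. This is only a sketch of what the disassembly above shows for 32-bit x86 member functions whose arguments are all DWORD-sized (such as the two pointers in the test). It is not a description of the compiler's full rules, and the names Convention and PlaceArguments are hypothetical.

```cpp
#include <string>
#include <vector>

enum class Convention { Thiscall, Fastcall };

// Describes where 'this' and each declared argument end up for a non-static
// member function, assuming every argument is DWORD-sized.
std::vector<std::string> PlaceArguments(Convention convention, int argumentCount)
{
    std::vector<std::string> placement;

    // In both listings the address of the object is loaded into ECX.
    placement.push_back("this -> ECX");

    for (int i = 0; i < argumentCount; ++i)
    {
        const std::string name = "param" + std::to_string(i + 1);

        // With __fastcall, ECX is already taken by 'this', so only the first
        // declared argument gets a register (EDX). Everything else is pushed.
        if (convention == Convention::Fastcall && i == 0)
            placement.push_back(name + " -> EDX");
        else
            placement.push_back(name + " -> stack");
    }
    return placement;
}
```

For the two-argument test functions, the model gives "this -> ECX, param1 -> EDX, param2 -> stack" for the fastcall version and "this -> ECX, param1 -> stack, param2 -> stack" for the thiscall version, matching the two instruction sequences.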

 

Comments (15)
  1. Anonymous says:

    Maybe next time turn off optimization instead of rewriting your test case over and over again? Or does turning off optimization disable __fastcall?

  2. Joe, that’s a good point, and the answer is "I don’t know". The optimizer introduces a LOT of changes to the system, it might make a difference here, it might not. But without turning the optimizer on, I wouldn’t be certain how the code behaves with the optimizer enabled.

    I might have been able to guess that the compiler wouldn’t optimize the thiscall code but I wouldn’t be sure. Actually, in this case, it probably would be safe, but…

  3. Anonymous says:

    Great article!

    I am wondering how you are going to deal with the code after the review. As you pointed out, the performance gain is not significant. If the code is left untouched, some time later it may confuse another maintainer. Of course this is not important enough, I just want to know if this kind of small optimization is encouraged at MS (personally I am reluctant to do it).

  4. Anonymous says:

    Great article!

    I am wondering how you are going to deal with the code after the review. As you pointed out, the performance gain is not significant. If the code is left untouched, some time later it may confuse another maintainer. Of course this is not important enough, I just want to know if this kind of small optimization is encouraged at MS (personally I am reluctant to do it).

  5. This should make sense if you think about it. C++ member functions take an implicit "this" as the first parameter. The first declared parameter is really the second actual parameter.

  6. Nicholas, yup, I should have pointed that out, but…

    yawl, actually the perf gain CAN be significant, especially on 32-bit x86 machines where the register set is so tightly constrained. For x64 and ia64 machines it’s not as important because there are more general purpose registers.

    IIRC, back in the NT4 days, the entire NT kernel was recompiled with __fastcall and it got something like a 10% overall speedup.

    So being able to save the transfer through memory of even one parameter to a routine can result in huge perf improvements.

  7. Anonymous says:

    Even the small difference you’ve noted goes away if you crank up the amount of optimization. If you use whole program optimization (compile with /GL, link with /LTCG), the optimizer will make up a calling convention for each function based on how that function uses its parameters. When I tried it on your example, both FastcallFunction and ThiscallFunction wound up having ‘param1’ passed in eax, ‘param2’ passed in ecx, and ‘this’ not passed at all, since it was unused.

    BTW, I find that __declspec( noinline ) is a handy way to tell the compiler not to inline something when I’m just playing around looking at the generated code. It doesn’t disable any other optimizations; it just tells the compiler not to inline calls to that particular function.

  8. Anonymous says:

    (Apologies for the duplicate post, if any).

    If you crank up the optimization level even further, you can get to the point where the optimizer just ignores the calling convention you specify and picks a custom convention for each function based on how it uses its parameters. This only happens when you use Whole Program Optimization (compile with /GL, link with /LTCG). When I tried this on your example code, both ThiscallFunction and FastcallFunction wound up with ‘param1’ passed in eax, ‘param2’ passed in ecx, and ‘this’ not passed at all, since it was unused.

    Note that if there’s any way that your function might be called from outside your DLL/EXE, then the custom calling convention optimization won’t take place, since the external code would be expecting the calling convention that you declared in the source code.

    Also, I find that using __declspec( noinline ) is a handy way to keep the optimizer from reducing my test case to two instructions, without having to turn off optimizations entirely. Even if you put the function in a different source file, the optimizer might still decide to inline it if you use WPO.

  9. Anonymous says:

    Since it’s a _calling convention_, the optimizer can’t actually do anything different with it. The only difference the optimizer made was getting in the way of your testing, since it realized everything was local.

    For a function/method that will be visible to callers the optimizer can’t analyze (e.g. exported functions), the calling convention will _always_ be followed to the letter. Anything different would just not work.

    For "everything local" cases, it’s really an exploration of the optimizer, not the calling convention 😉

  10. Anonymous says:

    On a different note, it was fun to see the path around the optimizer. And of course, the conclusion was worth it 🙂

  11. Anonymous says:

    If this method is non-virtual, I would have thought an explicit __fastcall would be frowned upon in favor of inlining or letting LTCG create a custom calling convention. Register pressure is so bad on x86 that I wouldn’t use it without looking at the disassembly.

  12. Anonymous says:

    Any time you have cross-DLL calling you really need to have your calling convention called out clearly in your headers.

    Otherwise maybe you compiled with /Gd (__cdecl default) and someone else compiles /Gz (__stdcall default) and lo and behold, you can’t actually run the code.

  13. Anonymous says:

    I was just about to say that if you put one function in a DLL and call it from another then you’ll test exactly what you were trying for and the optimizer can’t delete it.

    But Michael Grier brought up something that looks like yet another Windows bug, detouring what I was just about to say. When the Setup APIs report that some Setup API was compiled with the wrong calling convention, it looks like Windows is reporting a Windows bug. There’s a chance that some higher level application might be stepping on something so the Setup APIs might just be victims, but it looks like that chance is a rather small one. It looks like some Setup API DLL neglected to have its calling convention called out clearly in its headers, and Windows diagnoses itself. If Microsoft would test its own stuff on checked builds, maybe Microsoft would find the same.

  14. Anonymous says:

    This question might be out of the intended scope but…

    Which convention does .NET use?

  15. Anonymous says:

    Considering the way .NET programs have to declare their uses of unmanaged DLLs, it sure looks like __stdcall and __thiscall.

