A function pointer cast is a bug waiting to happen


A customer reported an application compatibility bug in Windows.

We have some code that manages a Win32 button control. During button creation, we subclass the window by calling Set­Window­Subclass. On the previous version of Windows, the subclass procedure receives the following messages, in order:

  • WM_WINDOWPOSCHANGING
  • WM_NCCALCSIZE
  • WM_WINDOWPOSCHANGED

We do not handle any of these messages and pass them through to Def­Subclass­Proc. On the latest version of Windows, we get only the first two messages, and comctl32 crashes while it's handling the third message before it gets a chance to call us. It looks like it's reading from invalid memory.

The callback function goes like this:

LRESULT ButtonSubclassProc(
    HWND hwnd,
    UINT uMsg,
    WPARAM wParam,
    LPARAM lParam,
    UINT_PTR idSubclass,
    DWORD_PTR dwRefData);

We install the subclass function like this:

SetWindowSubclass(
    hwndButton,
    reinterpret_cast<SUBCLASSPROC>(ButtonSubclassProc),
    id,
    reinterpret_cast<DWORD_PTR>(pInfo));

We found that if we changed the callback function declaration to

LRESULT CALLBACK ButtonSubclassProc(
    HWND hwnd,
    UINT uMsg,
    WPARAM wParam,
    LPARAM lParam,
    UINT_PTR idSubclass,
    DWORD_PTR dwRefData);

and install the subclass function like this:

SetWindowSubclass(
    hwndButton,
    ButtonSubclassProc,
    id,
    reinterpret_cast<DWORD_PTR>(pInfo));

then the problem goes away. It looks like the new version of Windows introduced a compatibility bug; the old code works fine on all previous versions of Windows.

Actually, you had the problem on earlier versions of Windows, too. You were just lucky that the bug wasn't a crashing bug. But now it is.

This is a classic case of mismatching the calling convention. The SUB­CLASS­PROC function is declared as requiring the CALLBACK calling convention (which on x86 maps to __stdcall), but the code declared it without any calling convention at all, and the ambient calling convention was __cdecl. When they went to compile the code, they got a compiler error that said something like this:

error C2664: 'SetWindowSubclass' : cannot convert parameter 2 from 'LRESULT (__cdecl *)(HWND,UINT,WPARAM,LPARAM,UINT_PTR,DWORD_PTR)' to 'SUBCLASSPROC'

"Since the compiler was unable to convert the parameter, let's give it some help and stick a cast in front. There, that shut up the compiler. Those compiler guys are so stupid. They can't even figure out how to convert one function pointer to another. I bet they need help wiping their butts when they go to the bathroom."

And there you go, you inserted a cast to shut up the compiler and masked a bug instead of fixing it.

The only thing you can do with a function pointer after casting it is to cast it back to its original type.¹ If you try to use it as the cast type, you will crash. Maybe not today, maybe not tomorrow, but someday.

In this case, the calling convention mismatch resulted in the stack being mismatched when the function returns. It looks like earlier versions of Windows managed to hobble along long enough before things got resynchronized (by an EBP frame restoration, most likely) so the damage didn't spread very far. But the new version of Windows, possibly one compiled with more aggressive optimizations, ran into trouble before things resynchronized, and thus occurred the crash.

The compiler was yelling at you for a reason.

It so happened that the Windows application compatibility team had already encountered this problem in their test labs, and a shim had already been developed to auto-correct this mistake. (Actually, the shim also corrects another mistake they hadn't noticed yet: They forgot to call Remove­Window­Subclass when they were done.)

¹I refer here to pointers to static functions. Pointers to member functions are entirely different animals.

Comments (35)
  1. Joshua Ganes says:

    I suppose I knew that I could reinterpret any pointer as any type, but it simply never occurred to me to try to cast a function pointer. Imagine trying to cast a functor — now that would really be something.

  2. Random Coder says:

    They even figured out their own problem, yet they still thought it was a Windows bug.  Did they think "CALLBACK" had no meaning?  It boggles the mind.

  3. Joshua says:

    Nice. Since when is "Windows is compiled with frame pointers" part of the ABI?

  4. Rick C says:

    Joshua, your question does not compute.  Casting away compiler errors != making things work.

  5. cast away says:

    Only use cast when you KNOW you have the right type. If you using cast because it doesn't crash when executing, you're doing it wrong.

    But as usual, the real fail is the lousy msdl lib. It doesn't mention the CALLBACK calling convention at all:

    msdn.microsoft.com/…/bb762102.aspx

  6. Joshua says:

    Sure it does. The code works if Windows is compiled with frame pointers. But that's not part of the ABI so don't depend on it.

  7. WndSks says:

    You cannot put of 100% of the blame on these idiots, Visual Studio also used to ship with a wizard that would generate code for you with casts.

    You can still find code samples on MSDN that casts the proc used by CreateDialog and DialogBox: msdn.microsoft.com/…/ms644996%28v=vs.85%29.aspx has two samples with casts: CreateDialog(…(DLGPROC)GoToProc) and DialogBoxIndirect(…(DLGPROC)DialogProc);

  8. blah says:

    "Actually, the shim also corrects another mistake they hadn't noticed yet"

    So MS found this second bug and didn't notify the vendor at the time? Sounds like they masked a bug instead of facilitating a fix.

  9. jader3rd says:

    If only there were a language which would cause an error on an invalid cast…..

  10. Sunil Joshi says:

    @jader3rd

    The compiler raised an error, the programmer told it to buzz off because he knew better; only he really didn't.

    In general having the power to shut the compiler up is an advantage; but you have to know what you're doing.

    A language where you couldn't do that would be limiting.

  11. CarlD says:

    @jader3rd, @Sunil Joshi

    That is exactly why reinterpret_cast was given such a long and ugly name: to call out explicitly that you're doing something inherently dangerous and you'd better know exactly what you're doing.  

  12. prshaw says:

    "Actually, you had the problem on earlier versions of Windows, too. You were just lucky that the bug wasn't a crashing bug."

    I always thought that was unlucky. I was lucky when it did crash on my dev box and I found it then.

  13. Tim says:

    In the next standard for C++, reinterpret_cast<T>() will be renamed reinterpret_cast_i_know_what_i_am_doing_i_promise<T>() ;)

  14. SimonRev says:

    @Tim

    Unfortunately reinterpret_cast_i_know_what_i_am_doing_i_promise<T>() will do us no good as long as the abomination known as C-Style casts exists.

  15. voo says:

    Agree with cast away. msdn.microsoft.com/…/bb776774.aspx actually defines the call exactly as they did. Sure it says "defines the prototype for a callback function" which I'm sure is defined somewhere as having to have the CALLBACK calling convention, but really who thought it'd be a good idea to not mention such an important detail anywhere?

    [It looks like the docs are inconsistent whether they include the calling convention or not. my guess is that at some point there was an editorial decision to remove clutter from the prototypes, so functions added during the "no clutter" era were missing calling convention declarations. In retrospect, I should have chosen an example that didn't involve SDK headers. -Raymond]
  16. Semi Essessi says:

    Actually its kind of cool to cast a char array to a void (*)() function pointer for the sake of executing byte code – although you have to be careful to allocate the memory as executable – and of course make sure your bytes are good code and not a crash waiting to happen. :)

    This might be a lazy way to implement a compiled scripting language for instance… not that I've ever done that… ( code.google.com/…/fridgescript )

  17. Seva says:

    Had a similar issue once. The compiler did not complain, because the code in question started its life in Win16, and there was a GetProcAddress() call.

    Around year 1994, casting callback function pointers back and forth was the norm. The one backed by by full faith and credit of Microsoft Corp., no less.

  18. Burak KALAYCI says:

    So there was a compatibility bug in an unmentioned version of Windows. A customer reported it but the Application Compatibility Team already had a shim. Kudos to Microsoft!

  19. davep says:

    voo: "actually defines the call exactly as they did. Sure it says "defines the prototype for a callback function" which I'm sure is defined somewhere as having to have the CALLBACK calling convention, but really who thought it'd be a good idea to not mention such an important detail anywhere?"

    So, the documentation and the header disagree. Which one is used to compile the program?

  20. Voo says:

    @davep: So your solution is.. "Just ignore the documentation and always look at the header"? Well, while that'll surely work, it somehow makes me doubt the value of the documentation. Sure the compiler error should be obvious, but it's such an easy fix in the documentation and would avoid quite some headaches, not just for developers but for those poor people having to write the comp. shims for that (and I bet there are dozens of functions where such shims do exist).

    I'm sure that there was quite some discussion in the team responsible for the documentation and they surely had valid arguments (less clutter, compiler error [at least in most scenarios] if you get it wrong, ..), I think including the calling convention would lessen such problems and would hardly annoy people who'd get it right anyway.

    Although looking at ThreadProc it seems, there'll always be some ingenious bastards that get around even the best documentation.. I cite:

    "Do not declare this callback function with a void return type and cast the function pointer to LPTHREAD_START_ROUTINE when creating the thread. Code that does this is common, but it can crash on 64-bit Windows."

  21. Joshua says:

    @davep, Voo: If we assume what is implied by old blogs that Windows itself is compiled with the SDK then we should trust the header files over the docs. I'm used to wrong docs but wrong headers is a stretch for my imagination when they're used like this.

    Personally I'd like to know what version of Windows this was from/to. My guess based on other things would be NT4->2000 but that's just a guess.

  22. GWO says:

    If "a function pointer cast is a bug waiting to happen", how is one supposed to use GetProcAddress() in a bug free manner? Only use functions that are "int fn(void)"?

    [Great, now I have to put footnotes in the title? -Raymond]
  23. Bulletmagnet says:

    A reinterpret_cast is a lie to the compiler.

    "If you lie to the compiler, it will get its revenge." –Henry Spencer

  24. Karellen says:

    @SimonRev – that's what -Wold-style-cast is for. :-)

  25. davep says:

    voo: "@davep: So your solution is.. 'Just ignore the documentation and always look at the header'?"

    No. It's your job as a programmer to be -smart- about it. In this case, the compiler was basically telling you the documentation was wrong. It's bad practice to ignore the compiler. And the header -is- documentation.

  26. 640k says:

    Here raymond makes fun of people looking i header files (or tools using them) instead of reading the docs:

    blogs.msdn.com/…/9148951.aspx

    Great work dude!

  27. davep says:

    640k: "Here raymond makes fun of people looking i header files (or tools using them) instead of reading the docs:"

    No, it's another example of not being smart.

    Chen: "But you don't even need to read the documentation to know that Process.Refresh has no chance of working."

    That link is another example (like this post) of "wishful thinking" instead of understanding how things work.

  28. 640 says:

    Ofcourse Process.Refresh doesn't refresh the process in explorer, that's obvious long before even looking in either headers or docs. And that wasn't the point.

    The point was that reading headers, or using tools which read them, was discouraged by raymond. But now he made a 180, by recommending using headers instead of docs.

    [Reading headers and ignoring the documentation is discouraged. Do not interpret this to mean "never read the headers" any more than saying "Don't use an air bag without wearing your seat belt" should be interpreted as "never use an air bag." -Raymond]
  29. voo says:

    @davep: Sure I'm not disputing that you can do without a complete documentation, my point is that it's unnecessary and can only lead to bugs. Or as a quite far-fetched example: Raymond posted about assembly instructions for toys or something a while ago and on some weird level it's similar: Sure they could just tell you to "insert screw there" and let you find out which of the different variants fits, but that's an unnecessary indirection that would only need to some quite strangely looking things. If you have to look at the implementation to understand the documentation, that's a sign that the documentation could be clearer (ah I know how Raymond feels, I can already see people finding some rare edge cases).

  30. Slapout says:

    "The compiler was yelling at you for a reason."   I love that line.

  31. 640k says:

    Reading headers & docs is the easy part. Guessing which is right is the hard part.

  32. AZ says:

    WndSks: "You cannot put of 100% of the blame on these idiots, Visual Studio also used to ship with a wizard that would generate code for you with casts."

    That wasn't due to idiocy. In the old days (before STRICT), you *had* to use casts because the header definitions for DLGPROC etc. didn't include the parameter lists (they were simply aliases for FARPROC).

    msdn.microsoft.com/…/aa280394%28v=VS.60%29.aspx

  33. GWO says:

    [Great, now I have to put footnotes in the title? -Raymond]

    That doesn't answer the question.  Or even address the question. Or even make any point besides an implicit statement of how smart you are (which doesn't wash in the slightest) together with an implicit admission that the answer is "You can't".

    [Would you have preferred the subject line "Calling through a function pointer that has been cast to a type different from the original function prototype is a bug waiting to happen"? Man, you really like to take the fun out of life. (I expect people to be able to apply rules like this one.) -Raymond]
  34. WndSks says:

    @AZ: Even if you define STRICT, the VC6 Hello World sample application will not compile without a cast since the DLGPROC return type is not LRESULT but INT_PTR (or BOOL back then)

  35. Brian G. says:

    @GWO: or maybe it's yet another case where he makes a bold headline which is not to be taken literally as Gospel truth in 100% of all situations without exception.

Comments are closed.