Why is my FormatMessage call crashing trying to read my insertion parameter?


A customer was looking for assistance in debugging a crash in their product. The stack trace looked like this:

ntdll!_woutput_l+0x356
ntdll!_vsnwprintf_l+0x81
ntdll!_vsnwprintf+0x11
ntdll!RtlStringVPrintfWorkerW+0x3c
ntdll!RtlStringCchPrintfExW+0xa8
ntdll!RtlFormatMessageEx+0x324
KERNELBASE!BaseDllFormatMessage+0x18e
KERNELBASE!FormatMessageW+0x37
contoso!FormatWithFallbackLanguage+0x8a
contoso!RecordFailure+0x56

The string being formatted is There was an error processing the object '%1'., and the insertion is a long (but valid) string. A unit test which passes a similarly long object name to Record­Failure does not crash.

What is the problem? There are clues in the stack trace.

The natural place to start is the function that calls Format­Message to see what parameters are being passed in. And that's where you see something strange:

// Code in italics is wrong
BOOL FormatWithFallbackLanguage(
    DWORD dwMessageId, PCTSTR pszBuffer, SIZE_T cchBuffer, ...)
{
 va_list ap;
 va_start(ap, cchBuffer);

 // Format from the user's preferred language.
 DWORD cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_preferredLangId,
               pszBuffer, cchBuffer, &ap);

 // If that didn't work, then use the fallback language.
 if (cchResult == 0) {
  cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_fallbackLangId,
               pszBuffer, cchBuffer, &ap);
 }
 va_end(ap);
 return cchResult != 0;
}

(The clue in the stack trace was the word fallback in the function name, which suggests that if the formatting attempt fails, it'll try again some other way.)

The purpose of this function is to format the message using the specified message ID, first looking in the preferred language message table, and if that fails, by looking in the fallback language message table.

The crash occurred on the second call to Format­Message. The customer said, "I'm guessing that the parameters being passed to Format­Message may be causing this, but I'm not sure how to proceed. Can you provide pointers for further investigation?"

The bug is that code is reusing the ap parameter without resetting it. The documentation for Format­Message says about the Arguments parameter:

The state of the va_list argument is undefined upon return from the function. To use the va_list again, destroy the variable argument list pointer using va_end and reinitialize it with va_start.

The function Format­With­Fallback­Language calls Format­Message, which consumes and renders unusable the ap variable. If the first format attempt fails, the code passes the same (now invalid) va_list to a second Format­Message, which is now operating on undefined data and is therefore within its rights to crash.

In practice, what happens is that the Format­Message function calls va_arg on the va_list to extract the insertions, and since va_lists are single-use, that pretty much renders it useless for anything else. If you want to walk the parameters a second time, you need to clean up the va_list and then reinitialize it.

BOOL FormatWithFallbackLanguage(
    DWORD dwMessageId, PCTSTR pszBuffer, SIZE_T cchBuffer, ...)
{
 va_list ap;
 va_start(ap, cchBuffer);

 // Format from the user's preferred language.
 DWORD cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_preferredLangId,
               pszBuffer, cchBuffer, &ap);

 // If that didn't work, then use the fallback language.
 if (cchResult == 0) {
  va_end(ap);
  va_start(ap, cchBuffer);
  cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_fallbackLangId,
               pszBuffer, cchBuffer, &ap);
 }
 va_end(ap);
 return cchResult != 0;
}

By ending the old argument list and restarting it, the second call to Format­Message has the correct starting point for extracting the variadic parameters. An alternate (and in my opinion better) way to fix the bug would be

BOOL FormatWithFallbackLanguage(
    DWORD dwMessageId, PCTSTR pszBuffer, SIZE_T cchBuffer, ...)
{
 va_list ap;

 // Format from the user's preferred language.
 va_start(ap, cchBuffer);
 DWORD cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_preferredLangId,
               pszBuffer, cchBuffer, &ap);
 va_end(ap);

 // If that didn't work, then use the fallback language.
 if (cchResult == 0) {
  va_start(ap, cchBuffer);
  cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_fallbackLangId,
               pszBuffer, cchBuffer, &ap);
  va_end(ap);
 }
 return cchResult != 0;
}

This avoids the "magic switcheroo" and more clearly scopes the region of validity of the ap variable to "solely for the purpose of the Format­Message function."

Bonus chatter: Suppose the Format­With­Fallback­Language accepted a va_list parameter directly. You might be tempted to implement it like this:

// code in italics is wrong
BOOL FormatWithFallbackLanguage(
    DWORD dwMessageId, PCTSTR pszBuffer, SIZE_T cchBuffer, va_list ap)
{
 va_list apOriginal = ap;

 // Format from the user's preferred language.
 DWORD cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_preferredLangId,
               pszBuffer, cchBuffer, &ap);

 // If that didn't work, then use the fallback language.
 if (cchResult == 0) {
  cchResult = FormatMessage(
               FORMAT_MESSAGE_FROM_HMODULE,
               g_hinst, dwMessageId, g_fallbackLangId,
               pszBuffer, cchBuffer, &apOriginal);
 }
 return cchResult != 0;
}

This is not legal because a va_list is not directly copyable. Some architectures have rather complicated calling conventions that entail memory allocation in order to enumerate the parameters passed to variadic functions, and a bitwise copy will not respect those complexities. You have to use the va_copy macro to make a copy of a va_list.

Exercise: How did this error elude unit testing?

Exercise: What else can go wrong in this function?

Comments (19)
  1. Anon says:

    The error passed unit testing because 'Optimism' is the word of the day. No one would ever dare be so pessimistic as to believe that the message might not exist in the preferred language (And, of course, even if it isn't, we clearly don't support that other language, just tell the customer to use OUR preferred language!).

  2. Brian_EE says:

    A. The unit test was incomplete and didn't test the second conditional call to FormatMessage…

    or

    B. The test string used didn't require any va args

  3. Gabe says:

    You don't go to the hassle of creating a "WithFallback" version of a function and not test the fallback feature, so I'm going to assume that the unit test didn't require any insertion parameters.

    Further, the first thing I assumed would be the problem was that the fallback string was out of sync and had the wrong number of insertions (e.g. including a %2 when only one parameter was passed). So I think that's the other thing that can go wrong with this function.

  4. Medinoc says:

    Unit-testing the fallback mechanism requires purposely de-synching the two languages (adding a string that's present in the fallback language but not the preferred one). AND said extra string must have inserts.

  5. Perhaps better still:

    DWORD g_langIdsInOrderOfPreference[] =

    {

       g_preferredLangId,

       g_fallbackLangId,

    };

    BOOL FormatWithFallbackLanguage(

       DWORD dwMessageId, PCTSTR pszBuffer, SIZE_T cchBuffer, …)

    {

       DWORD cchResult = 0;

       // try the languages in order of preference

       for (int i = 0; i < ARRAYSIZE(g_langIdsInOrderOfPreference); i++)

       {

           va_list ap;

           va_start(ap, cchBuffer);

           cchResult = FormatMessage(

                  FORMAT_MESSAGE_FROM_HMODULE,

                  g_hinst, dwMessageId, g_langIdsInOrderOfPreference[i],

                  pszBuffer, cchBuffer, &ap);

           va_end(ap);

           if (cchResult != 0)

           {

               // got it

               break;

           }

       }

       return (cchResult != 0);

    }

  6. Brian_EE says:

    @Gabe: "You don't go to the hassle of creating a "WithFallback" version of a function and not test the fallback feature,"

    I'm glad to know that you work with all top-notch developers, but don't let that color your vision that there aren't sloppy developers out there. SW has it's share of lazy people just like every other profession.

  7. Mark says:

    Maurits, can't you get your own blog to post all this stuff on?

  8. Adam Rosenfield says:

    The code passed the unit tests because the implementation of va_arg() likely just slurps the next value off of the stack, and the stack frames from the unit test code likely had particular values at the right spots that happened to not crash.  Undefined Behavior can rear its head in any way, including appearing to succeed without any ill effects.

    Whenever I write a variadic function, I always make two versions, one which takes arbitrary arguments with an ellipsis and one which takes a va_list, like the *printf/v*printf family of functions.  With this implementation, it's impossible to forward a va_list from another function into FormatWithFallbackLanguage(), at least not without resorting to fragile platform-specific hacks like the avcall module of the GNU ffcall library.

  9. Yuri Khan says:

    > Exercise: What else can go wrong in this function?

    In the worst case, it will work as intended and give users localized error messages that are impossible to google for.

  10. Gabe says:

    Brian_EE: Don't be ridiculous — the people I work with wouldn't even be doing unit testing!

    To figure out the answer, you just have to think about the question. The real answer to "How did this error elude unit testing?" could easily be "We don't do unit testing", or "The unit test doesn't cover that code path", or "The test is correct but the guy running the test didn't notice that it failed". If we assume incompetent unit testers then any of those answers are equally likely — and equally uninteresting.

    But if the unit testers were that bad, Raymond wouldn't have bothered to ask the question. The reason he mentioned unit testing in the first place is presumably because it was being done correctly (all code paths covered and producing expected results), yet still didn't catch this particular case.

    And that's the only answer that makes it interesting enough to be worth asking.

  11. Joshua says:

    Ah yes someone writes a not-completely-thought-out fallback and it goes bad.

    Seen it too often.

  12. foo says:

    Wow. MSVC 2013 implements va_copy (C99). Learn something new everyday.

  13. Neil says:

    Are you supposed to be able to determine how big the buffer needs to be?

  14. AndreiM says:

    Quote from MSDN: "If this lpSource handle is NULL, the current process's application image file will be searched. "

    My guess is that they had some fallback resources in the main EXE, but different language in another dll. When testing, they had only the exe, which means that the function could exit with error before touching va_list ap.

  15. dave says:

    >Are you supposed to be able to determine how big the buffer needs to be?

    If you want to. Or you can let FormatMessage do the buffer allocation. Both are supported, see the doc.

  16. Pessimistic says:

    What else could go wrong?

    – more replacement strings (%2, %3) than insertion parameters, possibly only in fallback string

    – overlapped buffers of result and insertion parameters

    – required result buffer size is bigger than 64kByte

  17. Pessimistic says:

    And: the whole printf-pandora of non-matching format spec to argument type, if %n!format-string! style is used.

  18. Michael says:

    Why does FormatMessage take a pointer to a va_list anyway? Why not a "value" va_list, like vprintf and friends? If they can do the magic without having a pointer, then so could FormatMessage? That could have "preempted" this crash.

  19. @Michael says:

    "Why does FormatMessage take a pointer to a va_list anyway?" It takes a pointer, not necessary a va_list pointer.

    This is the only way this function can be used from any program which is not compiled with Microsoft C, like Delphi, VisualBasic, etc etc

    MSDN: "If you do not have a pointer of type va_list*, then specify the FORMAT_MESSAGE_ARGUMENT_ARRAY flag and pass a pointer to an array of DWORD_PTR values; those values are input to the message formatted as the insert values. Each insert must have a corresponding element in the array."

Comments are closed.

Skip to main content