How does the MultiByteToWideChar function treat invalid characters?


The MB_ERR_INVALID_CHARS flag controls how the MultiByteToWideChar function treats invalid characters. Some people claim that the following sentences in the documentation are contradictory:

  • "Starting with Windows Vista, the function does not drop illegal code points if the application does not set the flag."

  • "Windows XP: If this flag is not set, the function silently drops illegal code points."

  • "The function fails if MB_ERR_INVALID_CHARS is set and an invalid character is encountered in the source string."

Actually, the three sentences are talking about different cases. The first two talk about what happens if you omit the flag; the third talks about what happens if you include the flag.

Since people seem to like tables, here's a description of the MB_ERR_INVALID_CHARS flag in tabular form:

MB_ERR_INVALID_CHARS set?   Operating system   Treatment of invalid character
Yes                         Any                Function fails
No                          XP and earlier     Character is dropped
No                          Vista and later    Character is not dropped

Here's a sample program that illustrates the possibilities:

#include <windows.h>
#include <ole2.h>
#include <windowsx.h>
#include <commctrl.h>
#include <strsafe.h>
#include <uxtheme.h>
#include <stdio.h> // for printf

void MB2WCTest(DWORD flags)
{
 WCHAR szOut[256];
 int cch = MultiByteToWideChar(CP_UTF8, flags,
                               "\xC0\x41\x42", 3, szOut, 256);
 printf("Called with flags %lu\n", flags); // DWORD is an unsigned long
 printf("Return value is %d\n", cch);
 for (int i = 0; i < cch; i++) {
  printf("Value[%d] = %d\n", i, szOut[i]);
 }
 printf("-----\n");
}

int __cdecl main(int argc, char **argv)
{
 MB2WCTest(0);
 MB2WCTest(MB_ERR_INVALID_CHARS);
 return 0;
}

If you run this on Windows XP, you get

Called with flags 0
Return value is 2
Value[0] = 65
Value[1] = 66
-----
Called with flags 8
Return value is 0
-----

This demonstrates that passing the MB_ERR_INVALID_CHARS flag causes the function to fail, and omitting it causes the invalid character \xC0 to be dropped.

If you run this on Windows Vista, you get

Called with flags 0
Return value is 3
Value[0] = 65533
Value[1] = 65
Value[2] = 66
-----
Called with flags 8
Return value is 0
-----

This demonstrates again that passing the MB_ERR_INVALID_CHARS flag causes the function to fail, but this time, if you omit the flag, the invalid character \xC0 is converted to U+FFFD, which is REPLACEMENT CHARACTER. (Note that it does not appear to be documented precisely what happens to invalid characters, aside from the fact that they are not dropped. Perhaps code pages other than CP_UTF8 convert them to some other default character.)

Comments (9)
  1. Gabe says:

    Now I'm sort of curious as to what prompted the behavior change. This seems like it could be a breaking change for apps that tended to get illegal characters in their inputs and were able to safely ignore them.

    [My guess is security. -Raymond]
  2. dave says:

    Some people claim that the following sentences in the documentation are contradictory

    One might assume that people reading those sentences are, generally speaking, programmers.  One might assume that programmers, who after all work with logical devices, could mentally derive that little table as they were reading the text.   Alas, one would likely be wrong.

    Lawyers, on the other hand … ;-)

  3. 640k says:

    Documentation is overcomplicated with negatives. That's why Raymond now has to spend time on explaining it.

  4. Joshua says:

    Wasn't there some discussion on this in the comments awhile back? Unfortunately the Web 2.0 abuse makes it impossible to use a search engine to search the comments on this blog.

  5. Wayne says:

    The documentation should have Raymond's tables!!  It's so much easier to parse than the text.

  6. Ben says:

    I too am one of the humans who likes tables.

  7. Cheong says:

    [My guess is security. -Raymond]

    Agreed. I see the same potential for abuse as with the "canonical Unicode forms" problem in IIS before Win2k. If an application does its checking on the MBCS string and then converts to Unicode, and the source string contains a discardable character that makes an illegal sequence seem legal, that effectively lets bad people write code to bypass the check.

  8. Cheong says:

    I'll also note that if an application is written for MBCS and not Unicode, all its validation would be done with MBCS data. And if that application passes the data to COM, since COM will transparently do the MBCS to Unicode conversion, the application may unknowingly pass inadequately validated data to the COM component.

    Another reason for not writing non-Unicode applications these days…

  9. KJK::Hyperion says:

    Codepages can specify a replacement character (most commonly "?"). It's documented as the default replacement character used by WideCharToMultiByte; presumably it's now used by MultiByteToWideChar, too. It could be made clearer.
