MultiByteToWideChar() and WideCharToMultiByte() provide some interesting-sounding flags that are actually useless, slow, badly broken, or far worse. This pretty much demonstrates another reason to Use Unicode, but if you do need to use some non-Unicode encoding until you can convert to Unicode, please don’t use these flags.
All of these flags would be expected to behave like Unicode Normalization, so you should instead use NormalizeString() to get the desired behavior, either Form C for composed strings or Form D for decomposed strings.
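If the difference between Form C and Form D is fuzzy, here is a minimal sketch of what the two forms look like. It uses Python's standard `unicodedata` module purely as a platform-neutral stand-in; it is not the Win32 NormalizeString() call, and the sample character is my own choice for illustration.

```python
import unicodedata

# Two spellings of the same visible character "é" (sample data chosen for illustration).
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# Form C composes a base character and its combining marks where it can.
form_c = unicodedata.normalize("NFC", decomposed)

# Form D splits a character into its base character and combining marks.
form_d = unicodedata.normalize("NFD", composed)

# Show the code points that make up each result.
print([hex(ord(ch)) for ch in form_c])
print([hex(ord(ch)) for ch in form_d])
```

Both directions are well defined and reversible for this kind of data, which is exactly what the flags discussed here fail to guarantee.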
MB_PRECOMPOSED is the simplest to address: basically, this flag doesn’t do anything. Nominally it would put data into something like Normalization Form C; however, most code pages are already in a composed form, so there’s little real impact. Just to make sure, the flag is ignored internally 🙂
MB_COMPOSITE is my most hated of these flags. First of all, it nominally pretends to put the data into something like Normalization Form D, decomposed into a base character and combining characters. To me that’s the opposite of “Composite”. Indeed, I’ve seen numerous code examples that seem to be passing MB_COMPOSITE expecting Form C data, and pretty much zero examples expecting Form D data. Windows leans towards Form C internally (though you may use Form D or mixed data), so this flag probably isn’t that helpful. If you really want to decompose your data, then use NormalizeString with Form D instead of this flag.
MB_COMPOSITE is also very slow because it does a lookup in some data tables. NormalizeString() with Form D is probably faster.
MB_COMPOSITE also has some horrible behavior for many code points:
- Several code points will not round trip if this flag is set, even if WC_COMPOSITECHECK is used when converting back to the code page.
- Additionally, its data tables are incomplete and inconsistent with Unicode normalization.
- Worse, some characters are decomposed into nonsensical sequences.
- Lastly some sequences decompose to strange choices, breaking some text. Japanese is particularly impacted.
WC_COMPOSITECHECK basically has all of the problems of MB_COMPOSITE (it’s used in the other direction). Its name isn’t as annoying to me though. Nominally WC_COMPOSITECHECK puts the data into Normalization Form C before encoding. Since most code pages are in a composed form, Normalization Form C isn’t a bad idea; however, please use NormalizeString() with Form C instead of this flag.
WC_COMPOSITECHECK is also very slow because of the way it does lookup. NormalizeString with Form C is probably faster.
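To make the "normalize to Form C first, then encode" idea concrete, here is a rough sketch. It again uses Python's `unicodedata` and the built-in `cp1252` codec as stand-ins for NormalizeString() and WideCharToMultiByte(); the helper name `encode_composed` is hypothetical, and Windows' own conversion (best-fit mappings and so on) may not behave identically to Python's codec.

```python
import unicodedata

def encode_composed(text: str, code_page: str = "cp1252") -> bytes:
    """Compose to Form C, then encode to a legacy code page.

    Hypothetical helper for illustration; unmappable characters become "?".
    """
    composed = unicodedata.normalize("NFC", text)
    return composed.encode(code_page, errors="replace")

decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT

# Without composing first, the combining mark has no slot in the code page.
without_nfc = decomposed.encode("cp1252", errors="replace")

# Composing first yields the single precomposed character the code page does have.
with_nfc = encode_composed(decomposed)

print(without_nfc, with_nfc)
```

The point is only that the composition step is done by real normalization data rather than by a conversion flag.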
WC_COMPOSITECHECK also has horrible behavior for many code points:
- It will convert sensible sequences into a form that, when round tripped by MB_COMPOSITE, will end up in nonsensical forms.
- Sequences of 3 code points created by MB_COMPOSITE aren’t correctly decoded by WC_COMPOSITECHECK back into their single code point form, resulting in extra ? characters when round tripping data.
- Several sequences map to a single code point, which MB_COMPOSITE will map back to a single form, so they won’t round trip. If you really need similar behavior, try Normalization Form C, or KC if you really need the multiple mappings. KC causes data to not round trip, so it might not be appropriate for all applications. (Of course, converting to the code page will also likely cause data to be lost, so that may not matter so much.)
- Again, some sequences are composed in a strange form based on appearance rather than linguistics. This could cause some unexpected behavior.
- Some scripts, like Japanese, are particularly impacted.
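The caveat about Form KC not round tripping can be seen with a small sketch. As before, this uses Python's `unicodedata` as a stand-in for NormalizeString(), and the sample characters and the `survives` helper are my own picks for illustration.

```python
import unicodedata

# Compatibility characters chosen for illustration.
samples = {
    "\uff21": "FULLWIDTH LATIN CAPITAL LETTER A",
    "\ufb01": "LATIN SMALL LIGATURE FI",
}

def survives(form: str, text: str) -> bool:
    """True if normalizing to `form` leaves `text` unchanged (hypothetical helper)."""
    return unicodedata.normalize(form, text) == text

# Map each sample to (unchanged under Form C?, unchanged under Form KC?).
results = {ch: (survives("NFC", ch), survives("NFKC", ch)) for ch in samples}

for ch, (nfc_ok, nfkc_ok) in results.items():
    print(samples[ch], "NFC keeps it:", nfc_ok, "NFKC keeps it:", nfkc_ok)
```

Form C leaves these characters alone, while Form KC folds them into their plain equivalents, and nothing will turn the result back into the original.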
Hopefully I’ve terrified you and you’ll stop using these flags, perhaps using NormalizeString() if you really need similar behavior. Most applications don’t even really need that though. Of course you always have the option of Using Unicode!
’til next time,