This post describes some seemingly anomalous behavior that can happen when you type characters that have Unicode code points above U+00FF, such as Cyrillic and Greek characters, while a SYMBOL_CHARSET font like Wingdings is active. By definition such fonts are not Unicode fonts and don’t have characters with code points above 255 (0xFF in hexadecimal). Different SYMBOL_CHARSET fonts have different characters at a given code point. In contrast, if they have the character at all, Unicode fonts display the same character at a given code point, although possibly with slightly different glyphs. The strange thing is that even though the SYMBOL_CHARSET fonts don’t have characters defined for code points above U+00FF, Microsoft Office applications and WordPad may display characters for such code points! You might wonder how and why these characters are chosen.
To figure out what’s going on, let’s go back to the old, pre-Unicode days when people used character sets defined by code pages and charsets. For example, Russian keyboards generated Cyrillic characters defined in the Windows 1251 code page or in the ISO-8859-5 code page, whose character repertoire is essentially a subset of 1251’s, although at different code positions. Code page 1251 corresponds to the RUSSIAN_CHARSET charset, which is used in creating fonts on Windows using the LOGFONT structure. Similarly, Greek has the Windows 1253 and ISO-8859-7 code pages and the GREEK_CHARSET charset. The Windows code pages 1250 to 1258 and the Thai code page 874 are 8-bit code pages, i.e., their character codes are less than 256. So when a user types using such a code page, it generates a character code that may well have a character defined in a SYMBOL_CHARSET font. Accordingly, when a SYMBOL_CHARSET font was selected, typing with a Russian or Greek keyboard in the old days would display the characters at the code points defined by the corresponding 8-bit code page. For example, if you typed a Щ with an old Russian keyboard and Wingdings was active, you’d see Ù, the Wingdings character at 0x00D9, since Щ has the code point 0x00D9 in the Russian 1251 code page. For some reason, the Firefox browser won't use a SYMBOL_CHARSET font, so it displays Ù instead of the Wingdings fancy up arrow tip and displays the wrong characters for the “Собака” string below too. Case in point!
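The Щ example is easy to check with Python's built-in codecs. This is only an illustration of the code page arithmetic, not of anything Windows itself does; the variable names are mine.

```python
# 8-bit value that Щ has in the Windows-1251 code page.
byte_value = "Щ".encode("cp1251")[0]

# The same 8-bit value interpreted as Latin-1, i.e., the character you see
# at that code point when no SYMBOL_CHARSET font is applied.
latin1_char = bytes([byte_value]).decode("latin-1")

print(hex(byte_value), latin1_char)  # prints: 0xd9 Ù
```

A SYMBOL_CHARSET font such as Wingdings simply draws whatever glyph it has at 0xD9.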
Enter Unicode. People expected the same display behavior even when Unicode keyboards generate character codes above 255, such as for Cyrillic and Greek. To get that behavior with a SYMBOL_CHARSET font, Microsoft Office applications including Word and Excel figure out what script the characters belong to. If the script corresponds to an 8-bit code page, the programs use that code page to convert the characters back into the 0..255 range and voilà! You see what you used to see in the old pre-Unicode days. Nowadays if you type Щ with a Russian keyboard, you enter the Unicode Cyrillic character U+0429, which nevertheless displays as Ù if formatted with Wingdings.
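Here is a minimal Python sketch of that script-to-code-page step. The table `SCRIPT_CODE_PAGES` and the function names are hypothetical, the script test is a crude Unicode block range check, and only Greek and Cyrillic are covered; the actual Office logic is surely more elaborate.

```python
# Hypothetical table: Unicode block range -> 8-bit Windows code page.
SCRIPT_CODE_PAGES = [
    (0x0370, 0x03FF, "cp1253"),  # Greek and Coptic block -> Windows 1253
    (0x0400, 0x04FF, "cp1251"),  # Cyrillic block -> Windows 1251
]

def symbol_font_code(ch):
    """Return the 0..255 code used to pick a symbol-font glyph, or None."""
    cp = ord(ch)
    if cp <= 0xFF:
        return cp  # already in the 8-bit range
    for first, last, code_page in SCRIPT_CODE_PAGES:
        if first <= cp <= last:
            try:
                # These code pages are single-byte, so take the only byte.
                return ch.encode(code_page)[0]
            except UnicodeEncodeError:
                return None  # character is not in the code page
    return None  # script has no 8-bit code page in this sketch

def latin1_view(text):
    """Show the resulting 8-bit codes as Latin-1 characters ('?' if none)."""
    codes = (symbol_font_code(ch) for ch in text)
    return "".join(chr(c) if c is not None else "?" for c in codes)

print(hex(symbol_font_code("Щ")))  # 0xd9
print(latin1_view("Собака"))       # Ñîáàêà
```

The Latin-1 view is just a convenient way to print the 8-bit codes; with Wingdings active you would see the Wingdings glyphs at those codes instead.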
So far everything seems sort of reasonable, but what if you copy text formatted with Wingdings to plain text, such as in Notepad or in plain-text email? If the source is Word or Excel, you see the corresponding characters defined in Unicode, which don’t look anything like the characters in Wingdings. For example, suppose you type in “Собака” using a Russian keyboard. In Word, Excel, and WordPad when Wingdings is the font, you see “Ñîáàêà”. But if you copy this from Word or Excel to Notepad, you see the original “Собака”.
What may seem even more anomalous occurs with RichEdit, which you can try out using WordPad. You can type with Russian or Greek keyboards and see the same Wingdings characters as displayed by Word and Excel. But if you copy the characters to plain text, you see the “high-ANSI” characters in the range 0x00A0..0x00FF instead of the original Unicode Cyrillic or Greek characters. This is because RichEdit converts the Unicode characters to the 8-bit code page values in the memory backing store instead of converting them for display only. This is exactly what happened in the old, pre-Unicode days. But it ends up creating an incompatibility in the formula bar of the immersive version of Excel, which uses RichEdit, while the traditional Win32 Excel uses another editor. The difference surfaces because, in general, the formula bar doesn’t use the fonts specified by the user and, in particular, it doesn’t use SYMBOL_CHARSET fonts. So in desktop Excel, you see “Собака” instead of “Ñîáàêà” and in the current immersive version you see “Ñîáàêà”. While this is incompatible and undesirable, it’s a bit bizarre that the formula bar doesn’t display “Ñîáàêà” for both editors.
The RichEdit implementation difference resulted because I didn’t fully appreciate what Excel and Word were doing when, years ago, I set out to preserve the SYMBOL_CHARSET font input experience for Russian, Greek and other users in the then-new Unicode era. It’s true that converting the characters in the backing store instead of in the display might boost performance slightly since the results are cached, but changing what the user actually types isn’t desirable since other apps don’t do so. RichEdit does remember how to convert back to the original characters if a non-SYMBOL_CHARSET font is applied. And RichEdit’s implementation may change to agree with Word and Excel in the future.
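To make the two approaches concrete, here is a toy Python model. The class and method names are invented for illustration, the Russian code page is assumed throughout, and neither class reflects the real Word, Excel, or RichEdit code; the model only reproduces the copy-to-plain-text behavior described above.

```python
def to_symbol_codes(text, code_page="cp1251"):
    """Convert characters above U+00FF to 8-bit values (assumed code page)."""
    out = []
    for ch in text:
        if ord(ch) > 0xFF:
            # Single-byte code page: one byte per character ('?' if unmapped).
            ch = ch.encode(code_page, errors="replace").decode("latin-1")
        out.append(ch)
    return "".join(out)

class DisplayOnlyEditor:
    """Word/Excel-style: the backing store keeps what the user typed."""
    def __init__(self):
        self.store = ""
    def type_text(self, text):
        self.store += text
    def shown_codes(self):
        return to_symbol_codes(self.store)  # converted for display only
    def copy_plain_text(self):
        return self.store

class BackingStoreEditor:
    """RichEdit-style: the converted 8-bit values are what gets stored."""
    def __init__(self, code_page="cp1251"):
        self.store = ""
        self.code_page = code_page  # remembered so the text can be restored
    def type_text(self, text):
        self.store += to_symbol_codes(text, self.code_page)
    def copy_plain_text(self):
        return self.store
    def restore_original(self):
        # Toy restore: assumes every stored character came from the code page.
        return self.store.encode("latin-1").decode(self.code_page)

word, richedit = DisplayOnlyEditor(), BackingStoreEditor()
word.type_text("Собака")
richedit.type_text("Собака")
print(word.copy_plain_text())       # Собака
print(richedit.copy_plain_text())   # Ñîáàêà
print(richedit.restore_original())  # Собака
```

Both models pick the same glyph codes for display; they differ only in what lands in plain text.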
All characters defined in code pages are included in Unicode and now code pages are no longer used internally to define character codes in mainstream software. Meanwhile SYMBOL_CHARSET fonts such as Wingdings don’t have a code page (sometimes 0042 is used informally) and they don’t have a general Unicode mapping. The characters of some SYMBOL_CHARSET fonts (Wingdings, Webdings, Symbol) have been added to Unicode, so in principle you can use a Unicode symbol font like Segoe UI Symbol instead of those fonts. In contrast, Marlett is a particularly strange SYMBOL_CHARSET font. It contains glyphs for a few icons and carets. Many of Marlett’s code points in the range 0020..00FF, let alone all those above this range, are empty. Some of Marlett’s characters are already in Unicode, but it doesn’t seem likely that all will be.
At the outset of this post, I wrote that SYMBOL_CHARSET fonts only have characters for code points in the 0..255 range. That’s not quite true: the code points for 0x0020..0x00FF are mirrored at the Private Use Area range 0xF020..0xF0FF as explained in the post Weird F020-F0FF Characters in Word's RTF. One good thing about using the latter range is that you know almost for sure in plain text that a SYMBOL_CHARSET font was used; you just don’t know which one!
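The mirroring amounts to adding or subtracting 0xF000. A small Python sketch, with function names of my own choosing:

```python
PUA_BASE = 0xF000  # offset between 0x0020..0x00FF and 0xF020..0xF0FF

def to_symbol_pua(code):
    """Map a symbol-font code in 0x20..0xFF to its Private Use Area mirror."""
    if not 0x20 <= code <= 0xFF:
        raise ValueError("symbol-font codes are expected in 0x20..0xFF")
    return chr(PUA_BASE + code)

def from_symbol_pua(ch):
    """Return the 8-bit symbol-font code for a mirrored character, else None."""
    cp = ord(ch)
    if 0xF020 <= cp <= 0xF0FF:
        return cp - PUA_BASE
    return None

print(f"U+{ord(to_symbol_pua(ord('J'))):04X}")  # U+F04A
print(hex(from_symbol_pua("\uF04A")))           # 0x4a
```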
Here is an interesting coda to this tale featuring the ubiquitous smiley face ☺, which is at the J position (0x004A) in the Wingdings font. If you copy this smiley face to a plain-text context, instead of the smiley face you may see a J or even a missing-glyph box. This happens a lot since by default Microsoft Word autocorrects the emoticon sequence :-) to the smiley face in the Wingdings font. Nevertheless, when you copy this smiley face as plain text to WordPad, you see the Wingdings smiley face. How can this be?! The answer is that Word puts a U+F04A on the clipboard, which WordPad (actually RichEdit) recognizes as a SYMBOL_CHARSET character. Lacking any unambiguous font-binding choice, RichEdit uses Wingdings since that font seems to be the most widely used SYMBOL_CHARSET font. But if you paste U+F04A into desktop Excel, Excel just displays a missing-glyph box, since Excel doesn’t recognize U+F04A as anything special and doesn’t change the currently active font. (This may change in the future…)
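The font-binding fallback can be pictured roughly as follows. This Python sketch is a guess at the general shape only: `bind_fonts`, the run representation, and the Calibri default are all hypothetical, and RichEdit's real font binding considers much more than this one range.

```python
from itertools import groupby

DEFAULT_SYMBOL_FONT = "Wingdings"  # fallback choice described in the text

def is_symbol_pua(ch):
    """True for characters in the mirrored symbol range 0xF020..0xF0FF."""
    return 0xF020 <= ord(ch) <= 0xF0FF

def bind_fonts(text, current_font="Calibri"):
    """Split plain text into (font, run) pairs, binding symbol runs to Wingdings."""
    runs = []
    for is_symbol, chars in groupby(text, key=is_symbol_pua):
        font = DEFAULT_SYMBOL_FONT if is_symbol else current_font
        runs.append((font, "".join(chars)))
    return runs

print(bind_fonts("Hi \uF04A"))
# [('Calibri', 'Hi '), ('Wingdings', '\uf04a')]
```

An editor that skips this step, as desktop Excel does, keeps the current font for U+F04A and so shows a missing-glyph box.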
The smiley face is also given by the Unicode code point U+263A, so you can enter a smiley face into Word by typing 263A Alt+X. In fact, you can edit your autocorrect file to use the Unicode smiley face instead of the Wingdings smiley face. Then if you copy your smiley face to plain text, you see a smiley face with any program and it might even be a colorful emoji-style smiley face! The Segoe UI Symbol font contains a large set of Unicode symbols including the smiley face. One problem is that for a given font height this font displays a somewhat larger glyph for the smiley face than the Calibri font displays for ordinary letters, which ends up with a larger line spacing if you mix the two fonts on a line. So you may want to scale Segoe UI Symbol down about 10% in such scenarios. Interestingly, Segoe UI Emoji displays glyphs the same size as Calibri, so you might want to scale it up if used together with Segoe UI. To illustrate these cases, here’s an image of a Calibri ‘a’ followed by smiley faces formatted with Wingdings, Segoe UI Emoji, and Segoe UI Symbol, respectively.
Happy New Year! ☺