Entering Unicode Characters

As noted in the post Symbols and Emoji, we can input characters in much more powerful ways than were possible before the advent of modern computers and smartphones. We can insert symbols chosen from large galleries (Character Map, the Office Insert Symbol dialog, the Office math ribbon, soft keyboards) to represent words and ideas. We can use Input Method Editors (IMEs) to enter any East Asian character and, for that matter, as we see below, any Unicode character. In Microsoft Office applications, we can use the math linear format to enter arbitrary built-up mathematical expressions. We can use autocorrect to replace symbol names with the corresponding symbols. This post summarizes methods of entering symbols by their character codes, including a couple of methods that probably aren’t familiar to audiences outside China.

First, let’s look at entering symbols by their Unicode code points. The alt+x “input method” is discussed in several of my posts, such as Sans Serif Mathematical Symbols. For this method, you type the hexadecimal character code and then press alt+x to convert the code to the symbol. This works in Word, Outlook, OneNote, and RichEdit-based programs like WordPad, but it doesn’t work in Notepad, for example.
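To make the conversion concrete, here is a minimal Python sketch of what an alt+x style conversion does to the text in front of the caret. It illustrates the idea only and is not how Word or RichEdit implement it; the function name alt_x and the rule of taking the longest run of two to six trailing hex digits are assumptions made for this example.

```python
import re


def alt_x(text: str) -> str:
    """Replace a trailing hexadecimal code with the character it names.

    Hypothetical helper: the real alt+x logic in Word/RichEdit is more
    nuanced (for example, when letters a-f directly precede the code).
    """
    # Look for a run of 2 to 6 hex digits that ends the text.
    match = re.search(r"[0-9A-Fa-f]{2,6}$", text)
    if not match:
        return text
    code_point = int(match.group(), 16)
    # Reject values outside Unicode and the surrogate range.
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        return text
    return text[:match.start()] + chr(code_point)


if __name__ == "__main__":
    print(alt_x("Tai Le 1950"))  # the code 1950 becomes U+1950
```

Because the sketch grabs every trailing hex digit it can, a word ending in the letters a to f right before the code would be swallowed into the code; leaving a space before the code avoids that in this simplified version.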

There are a couple of other ways of entering any character by its code. RichEdit supports arbitrary Unicode entry via alt+numpad digits: the code is entered as a decimal number on the numeric keypad while an alt key is held down. Decimal isn’t very convenient, since the Unicode Standard displays code charts with hexadecimal character codes. Accordingly, alt+x is easier with RichEdit. A curious anomaly came to my attention recently: alt+numpad numbers below 256 use the original IBM PC character set, all of whose characters have counterparts in Unicode. For example, 1 is a smiley face. Windows dutifully translates the codes to the corresponding Unicode characters. One user (at least) wants to include such choices in his password characters. That seems nice and secure, although you do need a numeric keypad, which may limit its utility.
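The decimal and IBM PC conversions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of what Windows does internally: the helper names are made up, the table of low codes lists only three well-known glyphs, and Python's cp437 codec is used as a stand-in for the IBM PC character set for printable ASCII and the range 128 to 255.

```python
# A few of the graphic glyphs the IBM PC character set shows for low codes.
# Only three entries are listed here; the full set covers codes 1 to 31.
LOW_GLYPHS = {1: "\u263A", 2: "\u263B", 3: "\u2665"}  # smiley, dark smiley, heart


def numpad_digits(ch: str) -> str:
    """Decimal digits to type on the numpad (with alt held) for a character."""
    return str(ord(ch))


def ibm_pc_char(code: int) -> str:
    """Unicode counterpart of an IBM PC character code (partial sketch)."""
    if code in LOW_GLYPHS:
        return LOW_GLYPHS[code]
    if 32 <= code <= 126 or 128 <= code <= 255:
        # Python's cp437 codec covers printable ASCII and the upper half.
        return bytes([code]).decode("cp437")
    raise ValueError(f"code {code} is not covered by this sketch")


if __name__ == "__main__":
    print(numpad_digits("\u1950"))  # hexadecimal 1950 is decimal 6480
    print(ibm_pc_char(1))           # smiley face, U+263A
```

The first function shows why decimal entry is awkward: the code chart value 1950 has to be converted to 6480 before it can be typed.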

The Simplified Chinese IME on Windows 8.1 offers two ways of inputting characters by code: the vgb and vuc methods. These approaches have the advantage that they work with all applications that handle East Asian IMEs. To use them, you switch to the Chinese IME and type either vgb or vuc. Immediately the text switches from lowercase to uppercase. Next you type the character code. For vgb, you type the eight-digit hexadecimal character code in the GB18030 code page. For example, to enter ᥐ (U+1950, Tai Le letter KA), you can type vgb8134F434. No space is needed at the end, since the eighth digit automatically terminates the field and replaces the vgb entry with the resulting character. This is handy if you’re familiar with GB18030. Most people would prefer to use Unicode, since it’s so widely accessible and well documented. For this you type vuc followed by the hexadecimal Unicode value and a space. To enter ᥐ with the vuc method, type vuc1950<space>.
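If you know a character's Unicode value but not its GB18030 code, Python's built-in gb18030 codec can produce the digits to type. The following is a small sketch; the helper names are invented for this example, and it only handles characters with four-byte (eight-digit) GB18030 codes, since that is the form described above.

```python
def vgb_keys(ch: str) -> str:
    """Keystrokes for the vgb method: 'vgb' plus the 8-digit GB18030 code."""
    encoded = ch.encode("gb18030")
    if len(encoded) != 4:
        # Assumption of this sketch: only four-byte codes are handled.
        raise ValueError("character does not have a four-byte GB18030 code")
    return "vgb" + encoded.hex().upper()


def vuc_keys(ch: str) -> str:
    """Keystrokes for the vuc method: 'vuc', the hex code point, then a space."""
    return f"vuc{ord(ch):04X} "


def char_from_vgb(keys: str) -> str:
    """Character produced by a 'vgb' + 8-hex-digit entry."""
    return bytes.fromhex(keys[3:]).decode("gb18030")


if __name__ == "__main__":
    tai_le_ka = "\u1950"
    print(vgb_keys(tai_le_ka))   # vgb8134F434
    print(vuc_keys(tai_le_ka))   # vuc1950 followed by a space
```

Running it for U+1950 reproduces the vgb8134F434 and vuc1950<space> entries given above.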

The fact that these IME methods allow a user to enter arbitrary Unicode characters has caused problems for RichEdit’s font binding. Basically, an assumption was made long ago that a single font could handle all characters that a given IME could deliver. Gradually this assumption has had to be relaxed. It turns out that the vuc method can be recognized and font bound just as alt+x is font bound. But the vgb method is sneakier, in that you don’t get a chance to recognize the full vgb code before the resulting character arrives.