Ligatures, Clusters, Combining Marks and Variation Sequences

On the surface, Unicode appears to be just a large collection of characters. But before Unicode text is displayed, substantial “shaping” can occur. Shaping is the process of mapping the Unicode characters to glyphs and placing them correctly on the display. The mapping is, in general, n characters to m glyphs. For most characters n = m = 1, but there are many exceptions. For example, in Arabic, a lam ل (U+0644) followed by an alef ا (U+0627) maps to a lam-alef ligature لا. In English print, you often see the character sequences fi, ff, ffi, and fl displayed as single-glyph ligatures. Sometimes the distinction isn’t obvious unless you look carefully, but it may well be there. This post discusses the user interfaces involved in editing text with ligatures and other n-to-m mappings.

[La]TeX uses the standard English ligatures, so my interest in them was piqued a long time ago. Later on (2006), I decided to implement a feature in RichEdit called default Latin ligatures, which is enabled by sending an EM_SETEDITSTYLE message with wparam = lparam = SES_DEFAULTLATINLIGA. When a font contains the fi ligature, the feature glyphs all text runs with that font. Glyphing a text run automatically uses the default ligatures, kerning, and some kinds of contextual shaping. The feature was active for roughly two years during the development of Office 2007 until a tester discovered that the f and i were somehow connected! Big bug! So, reluctantly, we disabled the feature unless the message above is received.

Living with the feature enabled in my stand-alone builds, I realized that when you have active default ligatures, the arrow keys and the selection need to be handled carefully to avoid user confusion and ire. If you do nothing, pressing the → key appears to bypass an fi ligature, but the program thinks the insertion point is between the f and the i. So if you press the Delete key, the i is deleted instead of the character that follows the i. This can be disconcerting, and the editor appears to be buggy.

The solution is to move the caret 1/n of the way through the fi ligature, where n is the number of characters the ligature represents. In this case, that means halfway through the ligature. In fact, if you don’t look carefully, it seems to be exactly what you’d have if the two characters were displayed instead of the ligature.
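To make the arithmetic concrete, here is a minimal sketch of the caret placement. It is an illustration only, not RichEdit's actual code: the function name caret_x, its parameters, and the sample coordinates are all hypothetical, and it assumes the ligature's width is divided evenly among its characters.

```python
def caret_x(ligature_left: float, ligature_width: float,
            num_chars: int, chars_passed: int) -> float:
    """Return the caret's x coordinate inside a ligature.

    num_chars is how many characters the ligature glyph represents and
    chars_passed is how many of them lie before the insertion point.
    The glyph width is split evenly, which is only an approximation of
    where the individual letters really sit.
    """
    if not 0 <= chars_passed <= num_chars:
        raise ValueError("chars_passed must be between 0 and num_chars")
    return ligature_left + ligature_width * chars_passed / num_chars


# An fi ligature (2 characters) drawn at x = 100 with a width of 12 units:
# after one press of the → key the caret sits halfway through the glyph.
print(caret_x(100.0, 12.0, 2, 1))  # 106.0
```

The even split is the simplest possible choice. A font that supplies its own caret positions for ligatures could give better results, but that is beyond this sketch.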

Typing shift+→ selects a character. If the editing program does nothing special with ligatures, selecting the first character of a ligature will probably appear to select the whole ligature. But hitting the Delete key only deletes the first character of the ligature, once again confusing the user. The solution is similar to partial caret motion. Specifically, the selection highlighting goes 1/n of the way through the ligature, where n is the number of characters the ligature represents. For the fi ligature, this is halfway. It looks as if the f is selected, and this is, in fact, what is actually selected. Most users won’t even realize that a single glyph is used. The user is happy and no confusion arises. This technique is called partial ligature selection.
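Partial ligature selection can be sketched for a whole line of glyphs as follows. Everything here is hypothetical scaffolding for illustration: a line is modeled as a list of (character count, advance width) pairs, and the names char_x and highlight_span are made up. A real editor would get glyph advances from its shaping engine.

```python
from typing import List, Tuple

# Each glyph is (number of characters it represents, advance width).
Glyph = Tuple[int, float]


def char_x(glyphs: List[Glyph], index: int) -> float:
    """Return the x offset of the boundary before character `index`."""
    x = 0.0
    remaining = index
    for count, advance in glyphs:
        if remaining <= count:
            # The boundary falls inside (or at the end of) this glyph,
            # so take a proportional share of its width.
            return x + advance * remaining / count
        x += advance
        remaining -= count
    if remaining == 0:
        return x  # only reached for an empty line
    raise IndexError("character index is past the end of the line")


def highlight_span(glyphs: List[Glyph], start: int, end: int) -> Tuple[float, float]:
    """Return (left, right) x offsets highlighting characters start..end-1."""
    return char_x(glyphs, start), char_x(glyphs, end)


# "fix" shaped as an fi ligature (2 characters, width 12) plus an x (width 6).
line = [(2, 12.0), (1, 6.0)]
print(highlight_span(line, 0, 1))  # (0.0, 6.0): only the f half is highlighted
print(highlight_span(line, 1, 3))  # (6.0, 18.0): the i half plus the x
```

Because the highlight is computed from character indices, what looks selected always matches what the Delete key will remove.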

Generally, English ligatures resemble the layout of the individual characters, so partial ligature selection is unambiguous. But occasionally there are English ligatures that display the component characters more over one another than side by side. For example, note the oo ligature in the logo for a nifty Australian Shiraz.

Partial ligature selection of the first o would go halfway through the oo ligature and is no longer unambiguous, as it is for an fi ligature. The technique may nevertheless be good enough, or conceivably it would be better to treat the ligature as a cluster, that is, as a single unit for selection purposes. If so, trying to select the first o would select both.

This leads one to scripts for which clusters are the norm, such as Thai and Indic scripts like Devanagari. Clusters are combinations of characters that are not displayed side by side. Typically they are displayed above one another or with completely different glyphs. Accordingly, they are treated as multicharacter units by the arrow keys. If the insertion point is at the start of a cluster, the Delete key deletes the whole cluster and shift+→ selects the whole cluster. For both ligatures and clusters (as well as combining-mark sequences), the Backspace key removes one character at a time.
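The difference between Delete and Backspace on a cluster can be sketched as below. This is a deliberately simplified model, not how a real editor finds clusters: it assumes a cluster is a base character followed by any characters whose Unicode general category is a mark (Mn, Mc, or Me), whereas real cluster boundaries come from the shaping engine and also cover cases such as Indic conjuncts. The function names are hypothetical.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def cluster_end(text: str, pos: int) -> int:
    """Return the index just past the simplified cluster starting at pos."""
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return end


def delete_forward(text: str, pos: int) -> str:
    """Delete key: remove the whole cluster that starts at pos."""
    if pos >= len(text):
        return text
    return text[:pos] + text[cluster_end(text, pos):]


def backspace(text: str, pos: int):
    """Backspace key: remove only the single character before pos."""
    if pos == 0:
        return text, 0
    return text[:pos - 1] + text[pos:], pos - 1


# Devanagari KA + VOWEL SIGN I form one cluster, followed by KHA.
text = "\u0915\u093F\u0916"
print(delete_forward(text, 0) == "\u0916")       # True: whole cluster removed
print(backspace(text, 2) == ("\u0915\u0916", 1))  # True: only the vowel sign removed
```

Selecting with shift+→ would use the same cluster_end boundary as Delete, so the selection and the deletion stay in agreement.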

Multiple character codes are also used in a common Unicode encoding called UTF-16. Many characters are represented by a single 16-bit code. Many more are represented by two 16-bit codes, the first in the range U+D800..U+DBFF and the second in the range U+DC00..U+DFFF. Such a combination is called a surrogate pair. It must be treated as a single unit by the arrow, delete, and backspace keys. The same is true for variation sequences, which consist of a base character followed by a variation selector character. The base character may be represented by a surrogate pair and so may the variation selector. These sequences must be treated as single units by the arrow, delete, and backspace keys.
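Here is a rough sketch of how the → key might step over surrogate pairs and variation sequences in a buffer of UTF-16 codes. The helper names are hypothetical, and the variation selector test assumes only the two main ranges, U+FE00..U+FE0F and U+E0100..U+E01EF, leaving out the Mongolian free variation selectors.

```python
from typing import List, Tuple


def to_utf16(text: str) -> List[int]:
    """Return the 16-bit UTF-16 codes for a Python string."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def read_code_point(units: List[int], pos: int) -> Tuple[int, int]:
    """Return (code point, number of 16-bit codes used) at pos."""
    lead = units[pos]
    if 0xD800 <= lead <= 0xDBFF and pos + 1 < len(units):
        trail = units[pos + 1]
        if 0xDC00 <= trail <= 0xDFFF:
            # Combine the surrogate pair into one code point.
            return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00), 2
    return lead, 1


def is_variation_selector(cp: int) -> bool:
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def next_caret(units: List[int], pos: int) -> int:
    """Return the caret position after pressing → at pos."""
    if pos >= len(units):
        return pos
    _, used = read_code_point(units, pos)
    pos += used
    if pos < len(units):
        cp, used = read_code_point(units, pos)
        if is_variation_selector(cp):
            pos += used  # keep the selector attached to its base
    return pos


print([hex(u) for u in to_utf16("\U0001F600")])  # ['0xd83d', '0xde00']
# U+845B followed by variation selector U+E0100 (itself a surrogate pair):
print(next_caret(to_utf16("\u845B\U000E0100x"), 0))  # 3
```

Delete and Backspace would need the same grouping, so that neither key ever leaves half of a surrogate pair or a stray variation selector behind.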

Back in the late 1980s, people dreamed that Unicode would be able to represent all text characters by simple 16-bit units. Well, it turned out to be a lot more complicated than that. Some folks say one should use UTF-32 (32-bit character codes), which at least gets rid of surrogate pairs. But the underlying characters of complex scripts can still consist of multiple codes or can be transformed into glyphs of various shapes. And that’s where much of the real complexity in editing and displaying Unicode occurs.