Math Font Binding

The post RichEdit Font Binding outlines how RichEdit chooses fonts when you paste or otherwise enter plain text into a RichEdit control. But it doesn’t describe how math font binding differs from natural-language font binding. The differences are due to the following:

1) Math styles like math italic, bold, script, Fraktur and double-struck are obtained by character code changes instead of font changes.

2) Inside a math zone, all characters that can be displayed with the math font should be displayed with the math font unless marked as “math ordinary” text.

3) If the default math font is given as a document property, it should be used instead of the regular default math font.

The present post describes the first two of these differences in greater detail. The discussion applies to math zones in general, that is, it’s not restricted to RichEdit.

Math Styles

Math styles are discussed in Section 2.2 of Unicode Technical Report #25, Unicode Support for Mathematics. There, under Semantic Distinctions, it is noted that

Mathematical notation requires a number of Latin and Greek alphabets that initially appear to be mere font variations of one another. For example, the letter H can appear as plain or upright (H), bold (𝐇), italic (𝐻), and script (ℋ). However, in any given document, these characters have distinct, and usually unrelated, mathematical semantics. For example, a normal H represents a different variable from a bold 𝐇, etc. If these attributes are dropped in plain text, the distinctions are lost and the meaning of the text is altered. Without the distinctions, the well-known Hamiltonian formula

 ℋ = ∫ dτ (ϵE² + μH²)

turns into the integral equation in the variable H:

 H = ∫ dτ (ϵE² + μH²)

Mathematicians will object that a properly formatted integral equation requires all the letters in this example (except perhaps for the d) to be in italics. However, because the distinction between ℋ and H has been lost, they would recognize the equation as a fallback representation of an integral equation, and not as a fallback representation of the Hamiltonian. By encoding a separate set of alphabets, it is possible to preserve such distinctions in plain text.

(Actually, I wrote that text for UTR #25 along with similar text in the original Unicode proposal for math styles.) The key is that a single math font, such as Cambria Math, has a variety of math styles including bold, italic, script, Fraktur, double-struck and various sans serif styles. For example, when a user in a math zone selects a character and formats it as math bold via, say, the Ctrl+B hot key, or as math italic via Ctrl+I, the character code is changed, not the font, as it would be in ordinary font binding. This process and examples of the math alphanumeric character codes are given in the post Using Math Italic and Bold in Word 2007 and in Chapter 6 of the book Creating Research and Scientific Documents using Microsoft Word (see the section entitled “Use mathematical bold, italic, and sans serif”).

To summarize the reasons for using the math alphanumerics rather than character-format markup: the primary reason is to preserve math character semantics in plain text. E.g., ℋ is different from H, a difference that is lost in plain text without a character-code change. In addition, the math alphanumerics can have different math spacings and different glyphs than the corresponding ordinary text characters with styling. One example of a different glyph is the math italic 𝑎 of the Cambria Math font, which doesn’t look like the italic a in the Cambria Italic font. Another reason is to limit styled characters in plain text to the set of math alphanumerics: the Unicode Technical Committee didn’t want to endorse a mechanism like italic/bold variation selectors that would allow people to encode general italic and bold in plain text. Finally, while the math alphanumerics complicate math font binding, they simplify some kinds of processing. In particular, you know what the math alphanumerics are from their code points alone; you don’t need to examine the associated character formatting.
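The last point can be illustrated with a short Python sketch. It uses only the standard unicodedata module; the helper name describe is made up for this illustration. It also uses NFKC compatibility normalization as a stand-in for any process that drops the math style, to show how ℋ and H then collapse into the same character.

```python
import unicodedata

def describe(ch):
    """Return (Unicode name, base character) for a single character.

    The base character is read from the <font> compatibility decomposition
    that the Unicode database assigns to math alphanumerics. Characters
    without such a decomposition are returned unchanged.
    """
    decomp = unicodedata.decomposition(ch)  # e.g. '<font> 0061' for U+1D44E
    if decomp.startswith('<font> '):
        base = chr(int(decomp.split()[1], 16))
    else:
        base = ch
    return unicodedata.name(ch), base

# Upright H, math bold H, math italic H, script H
for ch in ('H', '\U0001D407', '\U0001D43B', '\u210B'):
    name, base = describe(ch)
    print(f'U+{ord(ch):04X}  {name}  (base {base})')

# Dropping the style erases the distinction between the script H of the
# Hamiltonian and an ordinary H.
print(unicodedata.normalize('NFKC', '\u210B') == 'H')
```

No character formatting is consulted anywhere: the code point alone identifies both the style (through the character name) and the underlying letter.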

Pasting Text into a Math Zone

If you paste text into a math zone, two things need to happen: the characters need to be bound to a math font if the math font can display them, and the bold and italic properties of the insertion point need to result in the corresponding character-code changes. More specifically, any characters considered to be math characters in Section 2.4 “Locating Mathematical Characters” of UTR #25 need to be font bound to the math font. This is an example of context-dependent font binding. Inside a math zone, math operators should be bound to a math font like Cambria Math. Outside a math zone, math operators might be better bound to a symbol font like Segoe UI Symbol. In all backing stores (Word, OfficeArt, RichEdit), math alphanumerics are stored using their UTF-16 codes. So math italic a (𝑎) is stored as the surrogate pair for U+1D44E, that is, U+D835 U+DC4E. Math italic h (ℎ) is stored as U+210E, etc. This is a different kind of font binding, since ordinarily bold would choose a bold font, but in math zones bold keeps the regular math font and changes the character codes instead.
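The surrogate-pair arithmetic for the two stored forms mentioned above can be checked with a few lines of Python. This is only a sketch of the standard UTF-16 encoding rule; the function name utf16_units is an illustrative choice and not part of any of the backing stores.

```python
def utf16_units(ch):
    """Return the UTF-16 code units used to store a single character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                    # BMP character: one code unit
    cp -= 0x10000                      # 20 bits remain
    return [0xD800 + (cp >> 10),       # high (lead) surrogate: top 10 bits
            0xDC00 + (cp & 0x3FF)]     # low (trail) surrogate: bottom 10 bits

print([f'{u:04X}' for u in utf16_units('\U0001D44E')])  # ['D835', 'DC4E']
print([f'{u:04X}' for u in utf16_units('\u210E')])      # ['210E']
```

Math italic a needs two code units because it lies in the Mathematical Alphanumeric Symbols block above U+FFFF, while math italic h needs only one because it lives in the Letterlike Symbols block.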

Toggling the Math-Zone Attribute

A related but trickier process occurs when you toggle the math-zone property on and off. Consider first toggling the math-zone property on, say by selecting some text not in a math zone and typing the math-zone hot key Alt+=. Each character needs to be examined. If the math font can display the character, its font is changed to the math font, and if the bold and/or italic effects are active, its character code is changed to the corresponding math bold and/or math italic character. For example, the ASCII letter ‘a’ (U+0061) is converted to the math-italic 𝑎 (U+1D44E) if italic is active, which is usually the case in a math zone. In this process, only ASCII alphanumeric characters and Greek alphabetic characters are affected. Operators and other characters remain upright and have normal weight (not bold). If you want bold and/or italic effects for these characters in a math zone, you need to mark the characters as “math ordinary” characters, for which ordinary bold and italic font conventions apply. Word calls “math ordinary” text “normal” text.
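The character-code half of this conversion can be sketched in Python for the italic case. This is a simplified illustration and not RichEdit’s code: the function name to_math_italic is hypothetical, the font change is omitted, and of the Greek letters only the small letters α through ω are handled.

```python
MATH_ITALIC_CAPITAL_A = 0x1D434
MATH_ITALIC_SMALL_A = 0x1D44E
MATH_ITALIC_SMALL_ALPHA = 0x1D6FC
PLANCK_CONSTANT = 0x210E   # math italic h, encoded in Letterlike Symbols

def to_math_italic(text):
    """Map ASCII letters and Greek small alpha..omega to math italic.

    Digits, operators and all other characters pass through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == 'h':
            cp = PLANCK_CONSTANT       # U+1D455 is a reserved hole
        elif 'A' <= ch <= 'Z':
            cp = MATH_ITALIC_CAPITAL_A + (cp - ord('A'))
        elif 'a' <= ch <= 'z':
            cp = MATH_ITALIC_SMALL_A + (cp - ord('a'))
        elif 0x03B1 <= cp <= 0x03C9:   # Greek small alpha..omega
            cp = MATH_ITALIC_SMALL_ALPHA + (cp - 0x03B1)
        out.append(chr(cp))
    return ''.join(out)

# Print code points rather than glyphs so the result is readable anywhere.
print(' '.join(f'U+{ord(c):04X}' for c in to_math_italic('a+b=2h')))
```

In the printed result, the letters move to math italic code points while the operators and the digit keep their original codes, matching the rule that operators and other characters stay upright.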

If you select text in a math zone and toggle the math-zone property off, the reverse process needs to occur. To aid with the conversions, RichEdit exports the function GetMathAlphanumericCode() to convert a math alphanumeric to the corresponding ASCII/Greek character code and to return a code identifying the math style. Similarly, RichEdit exports GetMathAlphanumeric() to get the math alphanumeric character corresponding to an ASCII/Greek character and a specific math style. The implementation is a bit intricate since some math alphabetic symbols were defined in the Unicode Letterlike Symbols block (U+2100..U+214F) before the sets were completed with the addition of the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF). A straightforward offset into the latter block can land on an unassigned code point, and hence a missing glyph, since holes are left where the symbols are defined in the Letterlike Symbols block. Also there are miscellaneous special cases, such as for dotless i and j, and conversions to and from the Arabic Mathematical Alphabetic Symbols.
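To make the role of the holes concrete, here is a minimal Python sketch of a two-way conversion for just two styles, italic and script, restricted to ASCII letters. The names to_math_alnum and from_math_alnum, their signatures, and the string style codes are all hypothetical; they are not the signatures of the RichEdit exports, which this post does not give. Greek, digits, dotless i and j, and the Arabic symbols are left out.

```python
# Code points of capital A and small a for each style handled here.
STYLE_BASES = {
    'italic': (0x1D434, 0x1D44E),
    'script': (0x1D49C, 0x1D4B6),
}

# Letters that were encoded in Letterlike Symbols first; the matching
# slots in U+1D400..U+1D7FF are reserved holes.
LETTERLIKE = {
    ('italic', 'h'): 0x210E,
    ('script', 'B'): 0x212C, ('script', 'E'): 0x2130,
    ('script', 'F'): 0x2131, ('script', 'H'): 0x210B,
    ('script', 'I'): 0x2110, ('script', 'L'): 0x2112,
    ('script', 'M'): 0x2133, ('script', 'R'): 0x211B,
    ('script', 'e'): 0x212F, ('script', 'g'): 0x210A,
    ('script', 'o'): 0x2134,
}

def to_math_alnum(ch, style):
    """Return the math alphanumeric for an ASCII letter in the given style."""
    if (style, ch) in LETTERLIKE:
        return chr(LETTERLIKE[(style, ch)])   # avoid the hole
    upper, lower = STYLE_BASES[style]
    if 'A' <= ch <= 'Z':
        return chr(upper + ord(ch) - ord('A'))
    if 'a' <= ch <= 'z':
        return chr(lower + ord(ch) - ord('a'))
    return ch                                 # not an ASCII letter

# Build the reverse table once by running the forward mapping.
ASCII_LETTERS = ([chr(c) for c in range(ord('A'), ord('Z') + 1)] +
                 [chr(c) for c in range(ord('a'), ord('z') + 1)])
REVERSE = {to_math_alnum(ch, style): (ch, style)
           for style in STYLE_BASES for ch in ASCII_LETTERS}

def from_math_alnum(ch):
    """Return (ASCII letter, style), or (ch, None) for other characters."""
    return REVERSE.get(ch, (ch, None))

print(from_math_alnum('\u210B'))      # script H from Letterlike Symbols
print(from_math_alnum('\U0001D44E'))  # math italic a
```

Deriving the reverse table from the forward mapping keeps the two directions consistent, so the Letterlike exceptions only have to be listed once.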

Conclusions

The bottom line is that natural-language font binding techniques aren’t able to handle most math font binding, so special handling is required. While this complicates things, the resulting math typography looks excellent and plain-text copies retain the original rich-text math symbol semantics. Math font binding is one kind of context-dependent font binding. Context dependence also plays a role in font binding other kinds of text, such as emoji (see also this post), end-user-defined characters, Chinese characters (different fonts for Simplified Chinese, Traditional Chinese, Japanese, etc.), neutral characters of various kinds, and variation-selector sequences. Font binding in math zones still needs natural-language font binding for non-math characters, so math font binding can become intertwined with natural-language font binding.