For the most part, the mappings are straightforward as illustrated in the table below. But due to its generative use of type-form and alphabetic indicators, Nemeth braille encodes some math alphabets not in Unicode, e.g., Greek Script and Russian Script. Meanwhile, Unicode has math double-struck and monospace English alphanumerics, which don’t exist in Nemeth braille. Unicode also has six alphabets that aren’t mentioned in the Nemeth specification but that can be defined unambiguously with Nemeth indicators, namely bold Fraktur (Nemeth calls Fraktur “German”), bold Script, and Sans Serif bold and/or italic. The table below includes unambiguous prefixes for these alphabets chosen such that the Nemeth bold indicator precedes the italic or script indicators, and the Sans Serif indicator precedes the bold indicator. These choices correspond to the orders in which the Unicode math alphabets are named. Changes in this ordering result in alternative prefixes that are also unambiguous, but it seems simpler for implementations and users to standardize on the Unicode name ordering.
The Nemeth specification has Script Greek (in §22) as well as “alternative” Greek letters (in §23). Some of the latter may be referred to as “script”. Specifically, the Unicode math Greek italic letters 𝜃𝜙𝜖𝜌𝜋𝜅 have the alternative counterparts 𝜗𝜑𝜀𝜚𝜛𝜘, respectively. The symbol 𝜗 can be called “script theta”. Since Unicode doesn’t have a math script Greek alphabet, it makes sense to map Nemeth math script Greek letters to the alternative Greek letters, if they exist, on input and use the Nemeth alternative notation on output. In addition, in Unicode the upper-case Θ has the alternative ϴ. In TeX and Office math, the alternative letters are identified by control words with a “var” prefix, as in \varepsilon for 𝜀 as contrasted with \epsilon for ϵ. Interestingly, modern Greek uses 𝜑 and 𝜀 instead of 𝜙 and 𝜖, but math considers the script versions to be the alternatives.
Nemeth braille has several Russian alphabets (see §22 of the Nemeth spec). These alphabets map to characters in the Cyrillic range U+0410..U+044F. Unicode has no math Russian alphabets, but italic and bold Russian alphabets can be emulated using the appropriate Cyrillic characters along with the desired italic and bold formatting. The Unicode Technical Committee, which is responsible for the Unicode Standard, has not received any proposals for adding Russian math alphabets. At least in my experience, technical papers in Russian use English and Greek letters in math zones. In Russian documents, this has the nice advantage of easily distinguishing mathematical variables from normal text.
Unicode has four predefined Hebrew characters in the Letterlike Symbols range U+2135..U+2138: ℵ, ℶ, ℷ, ℸ, respectively. In math contexts, it makes sense to map those Hebrew letters in Nemeth braille to the Letterlike Symbols and to map the other Nemeth Hebrew letters to characters in the Unicode Hebrew range U+05D0..U+05EA. The Unicode Technical Committee has not received any proposals for adding more Hebrew math letters so they probably won’t appear in math zones, except, perhaps, as embedded normal text.
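The Hebrew mapping described above can be sketched in a few lines. This is only an illustration: the function name `hebrew_for_math` is made up for this example, and it assumes the braille input has already been translated to ordinary Unicode Hebrew letters.

```python
# Hebrew alef, bet, gimel, dalet (U+05D0..U+05D3) and their
# Letterlike Symbols counterparts (U+2135..U+2138).
HEBREW_TO_LETTERLIKE = {chr(0x05D0 + i): chr(0x2135 + i) for i in range(4)}


def hebrew_for_math(text: str) -> str:
    """Map alef..dalet to the Letterlike Symbols; leave all other characters alone.

    Hypothetical helper for illustration; not part of any product API.
    """
    return "".join(HEBREW_TO_LETTERLIKE.get(ch, ch) for ch in text)
```

Other Hebrew letters, such as he (U+05D4), pass through unchanged and stay in the U+05D0..U+05EA range.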
The majority of Unicode math digits can be represented by the appropriate type-form indicator sequences in the table above followed by the numeric indicator ⠼ (if necessary) and the corresponding ASCII digits. For example, a math bold 2 (𝟐—U+1D7D0) can be represented by ⠸ ⠼ ⠆ or “_#2”. This works for the bold and/or sans-serif digits, but not for the double-struck and monospace digits, which have no Nemeth counterparts. Meanwhile Nemeth notation supports italic and bold italic digits, which aren’t in Unicode.
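As a rough sketch of the digit mapping, the following Python builds Unicode math digits from ASCII digits and decodes the bold form “_#2” given above. The function names are invented for this example, and only the bold prefix is handled since that is the one spelled out in the text.

```python
# Code points of the digit zero in each Unicode math digit style.
MATH_DIGIT_ZERO = {
    "bold": 0x1D7CE,
    "double-struck": 0x1D7D8,
    "sans-serif": 0x1D7E2,
    "sans-serif bold": 0x1D7EC,
    "monospace": 0x1D7F6,
}


def math_digits(digits: str, style: str) -> str:
    """Convert a string of ASCII digits to the given Unicode math digit style."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected ASCII digits, got {digits!r}")
    base = MATH_DIGIT_ZERO[style]
    return "".join(chr(base + int(d)) for d in digits)


def bold_number_from_ascii_braille(cells: str) -> str:
    """Decode ASCII braille of the form '_#' + digits into math bold digits.

    '_' is the bold type-form indicator and '#' is the numeric indicator.
    Hypothetical helper: a real translator would handle all the indicators.
    """
    if not cells.startswith("_#"):
        raise ValueError(f"expected a '_#' prefix, got {cells!r}")
    return math_digits(cells[2:], "bold")
```

Going the other way, the double-struck and monospace styles have no entry to emit, which is the gap noted above.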
Some math contexts don’t need a numeric indicator, e.g., most digits in fractions, subscripts or superscripts. To optimize common numeric subscript expressions like a_{1}, the numeric indicator and the subscript indicator are omitted. In Nemeth ASCII braille, a_{1} is “A1” and in Nemeth braille it’s ⠁ ⠂. The ASCII braille representation is tantalizing since variables like A1, B2, etc., are used to index spreadsheets and it would be more natural if spreadsheet indices were a_{1}, b_{2}, etc., at least for people with a mathematical background.
In general, Unicode’s math characters are simpler to work with since they can be assigned separate character codes instead of being composed as combinations of 64 braille codes. Unicode has about 2310 math characters (see Math property in DerivedCoreProperties.txt) and to distinguish all of those without indicators would require 12-dot braille, since 11 dots give only 2^11 = 2048 cells while 12 dots give 2^12 = 4096! Such a system would be really hard to learn. LaTeX describes characters using control words consisting of a backslash followed by combinations of the 52 ASCII letters. That approach has mnemonic value, but it’s not as concise as the Nemeth braille character code sequences. When you get a feel for the Nemeth approach, a character’s Nemeth sequence gives a good idea of what a character is even if you haven’t encountered it before. UnicodeMath and Nemeth braille are intended to be read by human beings, whereas LaTeX and MathML are intended to be read by computer programs, notwithstanding that some TeXies can read LaTeX pretty fluently! Considering that Unicode math alphabets like double-struck and monospace aren’t yet defined in Nemeth braille, it would be worthwhile to choose appropriate type-form indicators for them. Nemeth math alphabets not in Unicode probably don’t have to be considered unless they show up in published documents.
First note that Nemeth Braille can be displayed in 6-dot ASCII Braille as shown in this table
The dots are numbered 1..6 starting from the upper left, going down to 3 and continuing with 4..6 in the second column. The letters and numbers look like themselves as do the / and (). The braille cells for 1..9 are the same as those for the letters A..I, but shifted down one row. The cells for the letters K..T are the same as those for A..J but with a lower-left dot (dot 3). Letters are lowercase unless prefixed by a cap prefix code (solo dot 6) or pair of cap prefixes for a span of uppercase letters.
A simple table lookup converts Nemeth braille codes to 8-dot Unicode Braille in the U+2800 block. The braille cells for 6-dot braille are the first 64 characters of the Unicode braille block. With a little practice you can enter braille codes into Word, OneNote, and WordPad by typing 28xx <alt+x>, where xx is the hex code given by the braille dots. To do this, read dots as binary 1’s and missing dots as 0’s, with dot 1 as the least-significant bit and dot 6 as the most-significant bit, that is, sideways from right to left, top to bottom. So ⠮ (dots 2, 3, 4, 6) is 101110_{2} = 2E_{16} and the corresponding Unicode character is U+282E.
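The dots-to-hex rule can be checked with a short Python sketch. The function names are chosen for this example only; the bit layout (dot n is bit n−1 of the offset from U+2800) is the one used by the Unicode braille block.

```python
def braille_from_dots(dots):
    """Return the Unicode braille character for an iterable of dot numbers (1..8)."""
    bits = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"invalid dot number: {dot}")
        bits |= 1 << (dot - 1)  # dot 1 is bit 0, dot 6 is bit 5
    return chr(0x2800 + bits)


def dots_from_braille(ch):
    """Return the sorted list of dot numbers set in a Unicode braille character."""
    bits = ord(ch) - 0x2800
    if not 0 <= bits <= 0xFF:
        raise ValueError(f"not a braille pattern: {ch!r}")
    return [dot for dot in range(1, 9) if bits & (1 << (dot - 1))]
```

For example, `braille_from_dots([2, 3, 4, 6])` gives U+282E, the ⠮ discussed above, and `dots_from_braille("⠘")` gives dots 4 and 5.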
To get a feel for simple Nemeth braille math, consider the expression 12x^{2}+7xy-10y^{2}. In ASCII Braille it displays as
#12x^2"+7xy-10y^2_4
In Nemeth Braille it displays as ⠼⠂⠆⠭⠘⠆⠐⠬⠶⠭⠽⠤⠂⠴⠽⠘⠆⠸⠲
In the linear format and TeX, it displays as 12x^2+7xy-10y^2.
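The ASCII braille and Unicode braille forms are related by a plain table lookup, which can be sketched as follows. The 64-character string is the North American Braille ASCII ordering as I understand it, indexed by the offset from U+2800; the function name is made up for this example, and the cancel code (dot 5) is taken to be the ASCII double quote.

```python
# Braille ASCII characters in Unicode braille order: index i is the ASCII
# character for the cell U+2800 + i (assumed standard Braille ASCII table).
BRAILLE_ASCII = " A1B'K2L@CIF/MSP\"E3H9O6R^DJG>NTQ,*5<-U8V.%[$+X!&;:4\\0Z7(_?W]#Y)="


def ascii_braille_to_unicode(text):
    """Convert ASCII braille to the corresponding 6-dot Unicode braille cells."""
    cells = []
    for ch in text:
        index = BRAILLE_ASCII.find(ch.upper())  # letters are case-insensitive
        if index < 0:
            raise ValueError(f"not an ASCII braille character: {ch!r}")
        cells.append(chr(0x2800 + index))
    return "".join(cells)


# The quadratic expression above, with '"' cancelling the first superscript.
print(ascii_braille_to_unicode('#12x^2"+7xy-10y^2_4'))
```

The lookup is purely character by character; it knows nothing about Nemeth rules such as when a numeric indicator is required.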
It’s tantalizing that the superscript code ⠘ has the ASCII braille code ‘^’ used by the linear format and [La]TeX. But the subscript code is ⠰, which has the ASCII braille code ‘;’ instead of the ‘_’ used by the linear format and TeX. These braille codes also work differently from the linear format and TeX superscript/subscript operators in that they are script level shifters that must be “cancelled” instead of being ended. So in the formula above, the Nemeth ‘^’ for the first square is cancelled by the ‘"’, while the ‘+’ terminates the superscript for the linear format and a TeX superscript consists of a single character or an expression of the form {…}. The following table compares how the three formats handle some nested superscripts and subscripts
Here to keep the Nemeth braille code sequences simple, I’ve omitted the Nemeth math italic, English-letter prefix pair ⠨ ⠰ before each math variable. Hopefully there’s a way to make math italic the default, as it is in the linear format, MathML, and TeX, but I didn’t find such a mode in the full specification. A space before literary text terminates the current script level shift, that is, it initiates base level. This is also true for a space that indicates the next column in a matrix, but it’s not true for a function-argument separator as illustrated in the table below. Spaces can also be used for equation-array alignment (you need to think in terms of a fixed-width font).
Simple fractions are written in a fashion similar to TeX’s {<numerator>\over <denominator>}. For example, one half is ⠹ ⠂ ⠌ ⠆ ⠼ in Nemeth braille, or ?1/2# in ASCII braille. The ⠹ and ⠼ work as the curly braces do in TeX fractions as in {1\over 2}. In the linear format, the fraction is given by 1/2. Fractions can be laid out in a two-dimensional format emulating built-up fractions but using Nemeth braille. Nested fractions require additional prefix codes (solo dot 6). For single-line braille devices it seems worthwhile to use the linear display since the fraction delimiters can be nested to any depth. Stacked, slashed, and linear fractions can be encoded and correspond to those structures in the linear format and in TeX.
The Nemeth alphabets are similar to the Unicode math alphanumerics discussed in Sections 2.1 and 2.2 of Unicode Technical Report #25. One difference is that math script and math italic variants exist for English, Greek, Cyrillic, and German (Fraktur) alphabets, whereas in Unicode math script variants are only available for the English alphabet. We may need to generalize Unicode’s coverage in this area, since TeX also has the ability to represent more math alphabets (see, for example, Unicode Math Calligraphic Alphabets).
At some point, I hope to give a listing of correspondences between the linear format and Nemeth Braille. It’s a long topic, so as a start the following table gives some more examples. Note the spaces needed around the equals sign (and other relational operators), but the lack of a space between the ‘a’ and “sin” in “a sin x”. The Nemeth notation is ambiguous with respect to using asin for arc sine.
The Unified English Braille code can handle some mathematical notation, but it’s not general enough to deal with Office math zones. Some discussion on the differences is given here, and the Accessible Math Editor author Sam Dooley explained to me that more advanced math needs the power of the Nemeth encoding. One possible way to reduce the large number of rules governing Nemeth braille would be to use an 8-dot standard in which math operators could be encoded with the aid of bottom row dots. This would work with current technology since Braille displays let you read and enter all possible 8-dot Braille codes. In fact, dot 7 is sometimes used to change lower case into upper case, thereby not needing an upper-case prefix code (solo dot 6) for upper-case letters.
Understand at the outset that two granularities of math speech are needed: coarse-grained, which speaks math expressions fluently in a natural language, and fine-grained, which speaks the content at the insertion point. The coarse-grained granularity is great for scanning through math zones. It doesn’t pretend to be tightly synchronized with the characters in memory and cannot be used directly for editing. It’s relatively independent of the memory math model used in applications.
In contrast, the fine-grained granularity is tightly synchronized with the characters in memory and is ideal for editing. By its very nature, it depends on the built-up memory math model (described below), which is the same for all Microsoft math-aware products, but may differ from the models of other math products. Coarse-grained navigation between siblings for a given math nesting level can be done with Ctrl+→ and Ctrl+← or Braille equivalents, while fine-grained navigation is done with → and ← or equivalents. The latter allows the user to traverse every character in the display math tree used for a math zone. The coarse- and fine-grained granularities are discussed further in the post Math Accessibility Trees. In addition to granularity, it’s useful to have levels of verbosity. Especially when new to a system, it’s helpful to have more verbiage describing an equation. But with greater familiarity, one can comprehend an equation more quickly with less verbiage.
To represent mathematics linearly and unambiguously, the linear format may introduce parentheses that are removed in built-up form. Speaking the introduced parentheses can get confusing since it may be hard for the listener to track which parentheses go with which part of the expression. In the simple example above of (a+b)/2, it’s more meaningful to say “start numerator a plus b end numerator over 2” than to speak the parentheses. Or to be less verbose, leave out the “start”. This idea applies to expressions that include square roots, boxed formulas and other “envelopes” that use parentheses to define their arguments unambiguously. For the linear format square-root √(a^2-b^2), it’s clearer to say “square root of a squared minus b squared, end square root” instead of “square root of open paren a squared minus b squared close paren”. This is particularly true if the square root is nested inside a denominator as in
\[ \frac{1}{2+\sqrt{a^2-b^2}} \]
which has the linear format 1/(2+√(a^2-b^2)). By saying “end square root” instead of “close paren”, it’s immediately clear where the square root ends. Simple fractions like 2/3 are spoken using ordinals as in “two thirds”. Also when speaking the linear format text ∑_(n=0)^∞, rather than say “sum from open paren n equal 0 close paren to infinity”, one should say “sum from n equal 0 to infinity”, which is unambiguous without the parentheses since the “from” and “to” act as a pair of open and close delimiters. This and similar enhancements are discussed in the ClearSpeak specification and in Significance of Paralinguistic Cues in the Synthesis of Mathematical Equations. Such clearer start-of-unit, end-of-unit vocabulary mirrors what’s in memory. The parentheses introduced by the linear format are not in memory since the memory version uses special delimiters as explained below. Parentheses inserted by the user are spoken as “open paren” and “close paren” provided they are the outermost parentheses. Nested parentheses are spoken together with their parenthesis nesting level as in “open second paren”, “open third paren”, etc.
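The rule for speaking user-entered parentheses with their nesting level might be sketched like this. The function and table names are invented for illustration, and the wording for closing parentheses (“close second paren”) is my assumption, mirroring the “open” forms given above.

```python
# Words inserted between "open"/"close" and "paren" for nesting depths 0, 1, 2, ...
LEVEL_WORDS = ["", "second ", "third ", "fourth ", "fifth "]


def level_word(depth):
    """Return the nesting-level word for a 0-based depth (hypothetical wording)."""
    if depth < len(LEVEL_WORDS):
        return LEVEL_WORDS[depth]
    return f"level {depth + 1} "


def speak_parens(text):
    """Speak parentheses with their nesting level; other characters pass through."""
    words, depth = [], 0
    for ch in text:
        if ch == "(":
            words.append(f"open {level_word(depth)}paren")
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced close parenthesis")
            words.append(f"close {level_word(depth)}paren")
        elif not ch.isspace():
            words.append(ch)
    if depth != 0:
        raise ValueError("unbalanced open parenthesis")
    return " ".join(words)
```

With this sketch, `speak_parens("(a(b))")` yields “open paren a open second paren b close second paren close paren”.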
Such refinements can be made by processing the linear format, but some parsing is needed. It’s easier to examine the built-up version of expressions, since that version is already largely parsed. The built-up format is a display tree as described in the post Math Accessibility Trees. For example, to know that an exponent in the linear format equation a^2+b^2=c^2 is, in fact, a 2 and not part of a larger argument, one must check the character following the 2 to make sure that it’s an operator and not part of the exponent. If the letter z follows the 2 as in a^2z, the z is part of the superscript and the expression should be spoken as “a to the power 2z”. In memory one just checks for a single code, here the end-of-object code U+FDEF. If that code follows the 2, the exponent is 2 alone and “squared” is appropriate, unless exponents are indices as in tensor notation.
The built-up memory format represents mathematical objects like fraction, matrix and superscript by a start delimiter, the first argument, an argument separator if the object has more than one argument, the second argument, etc., with the final argument terminated by the object end delimiter. For example, the linear format fraction a/2 is represented in the built-up format by {_{frac} a|2} where {_{frac} is the start delimiter, | is the argument separator, and } is the end delimiter. Similarly a^2 is represented in the built-up format by {_{sup} a|2 }. Here the start delimiter is the same character for all math objects and is the Unicode character U+FDD0 in RichEdit (Word uses a different character). The type of math object is given by a rich-text object-type property associated with the start delimiter as described in ITextRange2::GetInlineObject(). The RichEdit argument separator is U+FDEE and the object end delimiter is U+FDEF. These Unicode codes are in the U+FDD0..U+FDEF “noncharacters” block reserved for internal use only.
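To make the delimiter scheme concrete, here is a small Python sketch that splits a built-up object into its arguments and uses that to choose between “squared” and “to the power”. It models only the plain-text delimiters; the object type is a rich-text property in RichEdit, so this sketch simply assumes the caller already knows the object is a superscript. The function names are hypothetical.

```python
START, SEP, END = "\uFDD0", "\uFDEE", "\uFDEF"  # RichEdit delimiters named above


def split_object(text, pos):
    """Split the object whose start delimiter is at text[pos] into its arguments.

    Returns (args, next_pos), where next_pos is the index just past the end
    delimiter. Nested objects stay intact inside the argument containing them.
    """
    if text[pos] != START:
        raise ValueError("not at an object start delimiter")
    args, current, depth = [], [], 0
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == START:
            depth += 1
        elif ch == END:
            if depth == 0:  # end of the object we started in
                args.append("".join(current))
                return args, i + 1
            depth -= 1
        elif ch == SEP and depth == 0:  # separator of the outer object
            args.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    raise ValueError("missing object end delimiter")


def speak_superscript(text, pos):
    """Speak a superscript object, assuming the object at pos is one."""
    (base, exponent), _ = split_object(text, pos)
    if exponent == "2":  # the end delimiter directly follows the 2
        return f"{base} squared"
    return f"{base} to the power {exponent}"
```

For the built-up form of a^2 this gives “a squared”, while an exponent of 2z gives “a to the power 2z”. No lookahead past the exponent is needed.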
Another scenario where the built-up format is very useful for speech is in traversing a math zone character by character, allowing editing along the way. Consider the integral
When the insertion point is at the start of the math zone, “math zone” is spoken followed by the speech for the entire math zone. But at any time the user can enter → (or Braille equivalent), which halts the math-zone speech, enters the numerator of the leading fraction, and speaks “1”. Another → and “end of numerator” is spoken. Another → and “2 pi” is spoken. Another → and “end of denominator” is spoken and so forth. In this way, the user knows exactly where the insertion point is and can edit using the usual input methods.
This approach is quite general. Consider matrices. At the start of a matrix, “n × m matrix” is spoken, where n is the number of rows and m is the number of columns. Using →, the user moves into the matrix with one character spoken for each → up until the end of the first element. At that end, “end of element 1 1” is spoken, etc. Up and down arrows can be used to move vertically inside a matrix as elsewhere, in all cases with the target character or end of element being spoken so that the user knows which element the insertion point is in.
Math variables are represented by math alphabetics (see Section 2.2 of Unicode Technical Report #25). This allows variables to be distinguished easily from ordinary text. When converted to speech text, such variables are surrounded by spaces when inserted into the speech text. This causes text-to-speech engines to say the individual letters instead of speaking a span of consecutive letters as a word. In contrast, an equation like rate = distance/time, would be spoken as “rate equals distance over time”. Math italic letters are spoken simply as the corresponding ASCII or Greek letters since in math zones math italic is enabled by default. Other math alphabets need extra words to reveal their differences. For example, ℋ is spoken as “script cap h”. Alternatively, the “cap” can be implied by raising the voice pitch.
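A rough sketch of this speech rule in Python follows. It leans on Unicode character names to find the style and case of a math alphabetic; the function names are invented here, and a real implementation would need localized strings and a fuller table (letters such as dotless i are not handled).

```python
import re
import unicodedata

# Matches names like "MATHEMATICAL BOLD CAPITAL A" or "SCRIPT CAPITAL H".
_MATH_LETTER = re.compile(r"^(?:MATHEMATICAL )?(.*?) ?(CAPITAL|SMALL) (\w+)$")


def speak_math_letter(ch):
    """Return speech text for a math alphabetic, or ch itself if it isn't one."""
    match = _MATH_LETTER.match(unicodedata.name(ch, ""))
    if match is None:
        return ch
    style, case, letter = match.groups()
    words = []
    if style and style != "ITALIC":  # math italic is the default, so unspoken
        words.append(style.lower())
    if case == "CAPITAL":
        words.append("cap")
    words.append(letter.lower())
    return " ".join(words)


def speech_text(math_text):
    """Surround spoken math letters with spaces; leave ordinary text as words."""
    parts = []
    for ch in math_text:
        spoken = speak_math_letter(ch)
        parts.append(ch if spoken == ch else f" {spoken} ")
    return " ".join("".join(parts).split())  # collapse runs of spaces
```

Here `speak_math_letter("ℋ")` gives “script cap h”, two adjacent math italic letters come out as separate letters, and an ordinary-text word like “rate” is left whole for the speech engine.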
Some special cues may be needed to convince text-to-speech engines to say math characters correctly. For example, ‘+’ may need to be given as “plus”, since otherwise it might be spoken as “and”. The letter ‘a’ may need to be enclosed in single quotes, since otherwise it may be spoken as the ‘a’ in “zebra” instead of the ‘a’ in “base”.
Another example of how the two speech granularities differ is in how math text tweaking is revealed. First, let’s define some ways to tweak math text. You can insert extra spaces as described in Sec. 3.15 of the linear format paper. Coarse-grained speech doesn’t mention such space but fine-grained speech does. More special kinds of tweaking are done by inserting phantom objects. Five Boolean flags characterize a phantom object: 1) zero ascent, 2) zero descent, 3) zero width, 4) show, and 5) transparent. Phantom objects insert or remove precise amounts of space. You can read about them in the post on MathML and Ecma Math (OMML) and in Sec. 3.17 of the linear format paper. The π in the upper limit of the integral above is inside an “h smash” phantom, which sets the π’s width to 0 (smashes the horizontal dimension). Notice how the integrand starts at the start of the π. Coarse-grained speech doesn’t mention this and other phantom objects and only includes their contents if the “show” flag is set. Fine-grained speech includes the start and end entities as well as the contents. This allows a user to edit phantom objects just like the 22 other math objects in the LineServices math model.
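The five phantom flags and the difference between the two granularities can be sketched as follows. The class, the `H_SMASH` combination, and the “start phantom”/“end phantom” wording are assumptions made for this example, not the actual product vocabulary.

```python
from enum import Flag, auto


class Phantom(Flag):
    """The five Boolean flags that characterize a phantom object."""
    ZERO_ASCENT = auto()
    ZERO_DESCENT = auto()
    ZERO_WIDTH = auto()
    SHOW = auto()
    TRANSPARENT = auto()


# Assumed flag combination for a visible "h smash" phantom: width smashed to 0.
H_SMASH = Phantom.ZERO_WIDTH | Phantom.SHOW


def phantom_speech(flags, contents, fine_grained):
    """Return speech for a phantom whose contents are already speech text."""
    if fine_grained:
        # Fine-grained speech reveals the start and end entities.
        return f"start phantom {contents} end phantom"
    # Coarse-grained speech ignores the phantom and hides unshown contents.
    return contents if Phantom.SHOW in flags else ""
```

So for the π in the upper limit, coarse-grained speech would just say “pi”, while fine-grained speech would also announce the phantom around it.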
The approaches described here produce automated math speech; the content creator doesn’t need to do anything to enable math speech. But it’s desirable to have override capability, since the heuristics used may not apply or the content author may prefer an alternate phrasing.
As explained in the post Flyweight RichEdit Controls, an important design criterion was to make plain-text editing fast and small. Accordingly, Christian Fortini’s original model for the text pointers into the RichEdit 2.0 backing store gave priority to plain-text controls. Since RichEdit would also be used for rich-text controls, the design had to accommodate rich text as well. The first attempt was the double-diamond, multiple-inheritance, text pointer hierarchy
CTxtSelection → CTxtRange → CTxtPtr
↑ ↑ ↑
CRchTxtSelection → CRchTxtRange → CRchTxtPtr
Here the CTxtPtr class manipulates the Unicode plain text in the memory backing store, the CTxtRange class manipulates ranges of plain text and the CTxtSelection is a CTxtRange that has added user-interface functionality such as keyboard and mouse handling. The rich-text row in the hierarchy adds the ability to manipulate text runs with different character and paragraph formatting. I implemented this hierarchy back in 1995 partly as an exercise in learning C++. Up to then the only major C++ feature not in C that I had used was operator overloading for handling complex arithmetic elegantly and efficiently in quantum optics calculations.
The double-diamond hierarchy worked. Nevertheless, it seemed overly complex, so one weekend Alex Gounares simplified it to the simple single-inheritance model
CTxtSelection → CTxtRange → CRchTxtPtr
in which CRchTxtPtr contains a CTxtPtr text-run pointer along with similar run pointers for character formatting and paragraph formatting. The resulting riched20.dll went from 145KB down to 90KB! (Now it’s 2.5 MB!) There was a bunch of hidden overhead in the multiple inheritance hierarchy. For sufficiently simple text, the single-inheritance model didn’t instantiate any formatting runs, which boosted performance for plain text, a goal of the original model. Ironically the double-diamond inheritance hierarchy turned out to be a bad approach also from a functional point of view, since a multilingual plain-text editor needs multiple fonts to handle multiple scripts and proofing tools need some text run character formatting. As such any international plain-text editor must have at least some degree of richness.
RichEdit 2.0 also shipped with Version 1 of the Text Object Model (TOM). This object model includes the ITextSelection and ITextRange interfaces. CTxtSelection inherits from CTxtRange, since it’s adding UI functionality to a range. Meanwhile ITextSelection inherits from ITextRange. So how can CTxtSelection inherit from ITextSelection without another diamond? For RichEdit up through version 5, we would have had
ITextSelection → ITextRange
↑ ↑
CTxtSelection → CTxtRange → CRchTxtPtr
The single inheritance solution for the ranges was to have CTxtRange inherit from ITextSelection and have it return E_NOTIMPL for the selection-specific UI methods. This gives the simplified inheritance
CTxtSelection → CTxtRange → ITextSelection, CRchTxtPtr
RichEdit 6.0 added several more TOM interfaces including ITextRange2 and ITextSelection2. To avoid diamond inheritance, ITextRange2 inherits from ITextSelection and ITextSelection2 inherits from ITextRange2. Unlike ITextSelection, ITextSelection2 doesn’t add any methods to ITextRange2. Starting with RichEdit 6.0, CTxtRange inherits from ITextSelection2 and CTxtSelection continues to inherit from CTxtRange. CTxtRange also inherits from CRchTxtPtr, which has some virtual methods, but the overhead for switching “this” pointers is substantially less than it would be with diamond inheritance.
There are other C++ areas with surprise overhead. Smart pointers have become popular since they don’t need explicit clean up, even when exceptions are thrown. But despite clever operator overloading, smart pointers involve a new language to learn and result in extra steps in debugging. Smart pointers are built into C++ for Windows Universal apps and use the ^ to indicate the smart pointer. Meanwhile in spite of a plethora of new notation, templates, and classes, C++ operators are still mired in the ASCII world, using <= for ≤ and != for ≠. Why not accept these well-defined operators as aliases for the original ASCII operator sequences?
Some habits don’t result in code bloat, but can slow down reading and code maintenance. Some people treat C++ like Lisp, sticking in quantities of unnecessary parentheses and curly braces. When there’s more syntactic sugar to read, you have to wade through it. Mathematics is successful in part due to its conciseness. (Although one shouldn’t be concise to the point of inscrutability.) One good technique in writing functions is to return as soon as the results or an error are found. Often you find code where the only return is at the end of a function, which may force the code to be deeply nested in hard-to-follow curly braces.
There’s a moral to be learned from the RichEdit text pointer design: keep things simple and easy to read. Avoid multiple inheritance (except for interfaces) unless it dramatically improves your model. And in any event, avoid diamond inheritance!
More than one kind of tree is possible and this post compares two possible kinds using the equation
We label each tree node with its math text in the linear format along with the type of node. The linear format lends itself to being spoken especially if processed a bit to say things like “a^2” as “a squared” in the current natural language. The first kind of tree corresponds to the traditional math layout used in documents, while the second kind corresponds to the mathematical semantics. Accordingly we call the first kind a display tree and the second a semantic tree.
More specifically, the first kind of tree represents the way TeX and Microsoft Office applications display mathematical text. Mathematical layout entities such as fractions, integrals, roots, subscripts and superscripts are represented by nodes in trees. But binary and relational operators that don’t require special typography other than appropriate spacing are included in text nodes. The display tree for the equation above is
Note that the invisible times between the leading fraction and the integral isn’t displayed and the expression a+b sinθ is displayed as a text node a+b followed by a function-apply node sinθ, without explicit nodes for the + and the invisible times.
To navigate through the a+b and into the fractions and integral, one can use the usual text left and right arrows or their braille equivalents. One can navigate through the whole equation with these arrow keys, but it’s helpful also to have tree navigation keys to go between sibling nodes and up to parent nodes. For the sake of discussion, let’s suppose the tree navigation hot keys are those defined in the table
Hot key | Action |
Ctrl+→ | Go to next sibling |
Ctrl+← | Go to previous sibling |
Home | Go to parent position ahead of current child |
End | Go to parent position after current child |
For example starting at the beginning of the equation, Ctrl+→ moves past the leading fraction to the integral, whereas → moves into the numerator of the leading fraction. Starting at the beginning of the upper limit, Home goes to the insertion point between the leading fraction and the integral, while End goes to the insertion point in front of the equal sign. Ctrl+→ and Ctrl+← allow a user to scan an equation rapidly at any level in the hierarchy. After one of these hot keys is pressed, the linear format for the object at the new position can be spoken in a fashion quite similar to ClearSpeak. When the user finds a position of interest, s/he can use the usual input methods to delete and/or insert new math text.
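These navigation keys map naturally onto a tree with parent links. The sketch below uses a made-up `Node` class and a small example tree (a leading fraction, an integral with limits, and a text node holding the equals sign) that only loosely mirrors the equation discussed here. Insertion points are modeled as ("before" or "after", node) pairs, which is an assumption of this example.

```python
class Node:
    """A display-tree node with links to its parent and children (illustrative)."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def _sibling(self, offset):
        if self.parent is None:
            return None
        siblings = self.parent.children
        index = siblings.index(self) + offset
        return siblings[index] if 0 <= index < len(siblings) else None

    def next_sibling(self):  # Ctrl+→
        return self._sibling(+1)

    def previous_sibling(self):  # Ctrl+←
        return self._sibling(-1)


def home(node):
    """Home: the insertion point in the parent's level, ahead of the parent."""
    return ("before", node.parent)


def end(node):
    """End: the insertion point in the parent's level, after the parent."""
    return ("after", node.parent)


# Hypothetical tree: fraction, integral (with limits and integrand), text.
upper = Node("upper limit")
fraction = Node("fraction", [Node("numerator"), Node("denominator")])
integral = Node("integral", [Node("lower limit"), upper, Node("integrand")])
equals = Node("text =")
root = Node("math zone", [fraction, integral, equals])
```

In this toy tree, `fraction.next_sibling()` is the integral, `home(upper)` is the position between the fraction and the integral, and `end(upper)` is the position in front of the equals sign.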
Now consider the semantic tree, which allocates nodes to all binary and relational operators as well as to fractions, integrals, etc.
The semantic tree has two drawbacks: 1) it’s bigger and requires more key strokes to navigate and 2) it requires a Polish-prefix mentality. Some people have such a mentality, perhaps having used HP calculators, and prefer it. But it’s definitely an acquired taste and it doesn’t correspond to the way that mathematics is conventionally displayed and edited. Accordingly the display tree seems significantly better for blind reading and editing, as well as for sighted editing.
Both kinds of trees include nodes defined by the OMML entities listed in the following table along with the corresponding MathML entities
Built-up Office Math Object | OMML tag | MathML |
Accent | acc | mover/munder |
Bar | bar | mover/munder |
Box | box | menclose (approx) |
BoxedFormula | borderBox | menclose |
Delimiters | d | mfenced |
EquationArray | eqArr | mtable (with alignment groups) |
Fraction | f | mfrac |
FunctionApply | func | &ApplyFunction; (binary operator) |
LeftSubSup | sPre | mmultiscripts (special case of) |
LowerLimit | limLow | munder |
Matrix | m | mtable |
Nary | nary | mrow followed by n-ary mo |
Phantom | phant | mphantom and/or mpadded |
Radical | rad | msqrt/mroot |
GroupChar | groupChr | mover/munder |
Subscript | sSub | msub |
SubSup | sSubSup | msubsup |
Superscript | sSup | msup |
UpperLimit | limUpp | mover |
Ordinary text | r | mrow |
MathML has additional nodes, some of which involve infix parsing to recognize, e.g., integrals. The OMML entities were defined for typographic reasons since they require special display handling. Interestingly the OMML entities also include useful semantics, such as identifying integrals and trigonometric functions without special parsing.
In summary, math zones can be made accessible using display trees for which the node contents are spoken using the localized linear format and navigation is accomplished using simple arrow keys, Ctrl arrow keys, and the Home and End keys, or their Braille equivalents. Arriving at any particular insertion point, the user can hear or feel the math text and can edit the text in standard ways.
I’m indebted to many colleagues who helped me understand various accessibility issues and I benefitted a lot from attending the Benetech Math Code Sprint.
This post deals with a problem I’ve had that doesn’t occur with RichEdit font binding, but does happen in Word and Outlook. Often I need to document a particular Unicode character such as ⬚, U+2B1A, which is used as a place holder in empty math objects, or <the new blog editor can’t handle U+20000>, U+20000, which is the first Unicode plane-2 character. To enter such characters, I type the Unicode hex value followed by alt+x as described in the post Entering Unicode Characters. If you do this in WordPad (which uses RichEdit) and continue typing, the font changes from Calibri to Cambria Math for ⬚ and to SimSun-ExtB for U+20000 and then switches back to Calibri for the subsequent text.
But in Outlook and Word the font switches to these other fonts and then continues to use them as long as the new font has the characters you type. The problem is that ASCII letters are supported in the vast majority of fonts, so invoking the rule “stick with the current font as long as it supports the characters” is insufficient for proper font binding. You can work around this error by using the handy Format Painter tool on the Home tab to restore the original font to subsequent text or more easily by typing on both sides of the character’s hex code before typing the alt+x hot key after the hex code.
Interestingly, if you paste a plain-text string containing such characters into Word, e.g., from NotePad, only the fonts for the special characters change. But relying on NotePad for entering such mixed text isn’t practical since NotePad doesn’t support alt+x. PowerPoint doesn’t have the problem since it doesn’t support alt+x either, sigh. (We might add the alt+x hot key to PowerPoint someday…)
There are a couple of ways to avoid this pitfall. RichEdit has the CHARFORMAT2 attribute CFE_FONTBOUND, which marks a run as being font bound when a different font is used to display a character. As such the font-bound font has lower priority for subsequent font binding than the previous font. Also if the font fix up occurs just as the text is input into the RichEdit backing store, it doesn’t change the selection’s current font. Both of these choices result in the font being restored to the previous font after a special character is font bound.
Another problem with Word’s font binding is that it switches to SimSun or if you enter a right arrow like → (U+2192). This is annoying especially since Calibri and most other Latin fonts support the simple arrows ←↑→↓, so no font binding is needed. This font switch occurs for both alt+x entry and pasting. But at least the font switches back to the Latin font after the arrow symbol is stored. Hopefully we’ll fix these problems before too long.
RichEdit font binding is overruled in the XAML edit controls, TextBox and RichEditBox, partly to maintain consistency with the companion TextBlock and RichTextBlock controls. A similar consistency is desired in Excel spreadsheets. A future post will describe how these approaches work.
]]>
First, here’s an example of script and calligraphic F’s being used in the same document:
And here are examples featuring P’s and C’s in which script letters denote infinity categories:
Accordingly the need for both script and calligraphic alphabets is attested.
Let’s turn now to the unfortunate fact that the current math script alphabets may be fancy script in one font and calligraphic in another. Cambria Math, the first widely used Unicode math font, has calligraphic letters at the math script code points, while STIX and the Unicode Standard have fancy script letters at those code points. For example, here’s the upper-case math script H (U+210B) in Cambria Math followed by the one in STIX:
We really can’t change Cambria Math’s math script alphabet choice at this late stage in computing history; too many documents use it. Consequently it is inadequate to add only bold and regular Calligraphic alphabets, expecting the current bold and regular script alphabets to fulfil the need for bold and regular math script alphabets. Unfortunately, the latter are deliberately ambiguous with respect to calligraphic versus script.
There are two unambiguous ways to allow math script and math calligraphic symbols to appear in the same plain text document:
1) Follow a character in the current math script alphabets with one of two variation selectors similar to the way we use variation selectors (U+FE0E, U+FE0F) for emoji to force text and emoji glyphs, respectively. Specifically, to ensure use of the math calligraphic alphabet, follow the current math script letter with U+FE00. To ensure use of the math fancy script alphabet, follow the current math script letters with U+FE01.
2) Add four new unambiguous math alphabets: bold and regular, fancy script and calligraphic, leaving the current math script alphabets as ambiguous.
The variation selector choice has the advantages
a) Contemporary software supports variation sequences for East Asia and emoji, so adding new variation sequences shouldn’t be much of a burden
b) The variation selector U+FE00 is already used with a number of math operators
c) No new code points need to be allocated
d) Typical documents can continue to do what they have been doing: ignore the distinction
e) If a math font doesn’t support the variation sequences, it falls back naturally to the current script/calligraphic letters instead of displaying the missing-glyph box
These advantages together with the fact that the majority of documents don’t require a script/calligraphic distinction seem to make the variation selector approach preferable. Adding two variation selectors for the math script letters may make people ask why the math alphabets weren’t implemented with variation selectors in the first place. They were considered, but the Unicode Technical Committee was concerned that people might misuse them to encode rich-text properties which are not the domain of plain text. Adding two variation selectors seems to solve the present calligraphic quandary quite well, although the use of variation selectors is generally a poor one for situations where symbol shapes need to be used in a contrastive manner. This case should therefore not serve as a general precedent, but should be seen as an exception, tailored to fit this specific case.
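As a rough sketch of the variation-selector approach, the following Python appends U+FE00 or U+FE01 to each math script letter in a string. The helper names are hypothetical, and the code assumes the convention described above (U+FE00 for calligraphic, U+FE01 for fancy script). Whether a given font or renderer honors the sequences is a separate matter.

```python
VS_CALLIGRAPHIC = "\uFE00"   # convention described in the text: calligraphic
VS_FANCY_SCRIPT = "\uFE01"   # convention described in the text: fancy script

# Math script letters that live in the Letterlike Symbols block.
LETTERLIKE_SCRIPT = {
    0x210A, 0x210B, 0x2110, 0x2112, 0x211B,
    0x212C, 0x212F, 0x2130, 0x2131, 0x2133, 0x2134,
}


def is_math_script(ch):
    """True for regular and bold math script letters."""
    cp = ord(ch)
    return (
        0x1D49C <= cp <= 0x1D4CF      # Mathematical Script A..z
        or 0x1D4D0 <= cp <= 0x1D503   # Mathematical Bold Script A..z
        or cp in LETTERLIKE_SCRIPT
    )


def select_script_style(text, style):
    """Append the variation selector for 'calligraphic' or 'fancy' after
    every math script letter; other characters pass through unchanged."""
    selector = {"calligraphic": VS_CALLIGRAPHIC, "fancy": VS_FANCY_SCRIPT}[style]
    out = []
    for ch in text:
        out.append(ch)
        if is_math_script(ch):
            out.append(selector)
    return "".join(out)
```

For example, `select_script_style("\u210B", "fancy")` yields the script H followed by U+FE01, while plain ASCII text is returned untouched.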
In fact, LaTeX has the \mathsf{} and \mathsfit{} control words for math sans serif upright and italic characters, respectively, and they work with Greek letters. Unlike the calligraphic/script distinction which is seldom used contrastively, upright and italic are usually used contrastively in mathematics. Unicode has normal weight upright and italic sans serif math alphabets corresponding to the ASCII letters, but not for the Greek letters. Accordingly, these two math Greek alphabets will probably be added, perhaps in the range U+1D380..U+1D3FF. This range has been reserved for math alphanumeric symbols and immediately precedes the Mathematical Alphanumeric Symbols block at U+1D400..U+1D7FF.
It might also be worthwhile for programs like Word to have a math document-level property that specifies which script/calligraphic alphabet to use for the whole document. Then a user who wants the fancy script glyphs could get them without making any changes except for choosing the desired document property setting. A similar setting could be used for choosing sans serif alphabets as the default. It appears such alphabets are often used in chemical formulas.
The choice of calligraphic glyphs for the math script letters in Cambria Math is partly my fault. I had expected to see fancy script letters in Cambria Math as in the Unicode code charts. In my physics career I used math script letters a lot, starting with my PhD thesis on laser theory (1967) and followed by many published papers in the Physical Review and elsewhere and in my three books on lasers and quantum optics. Occasionally in a review article, calligraphic letters were substituted for the fancy script letters because the publishers didn’t have the latter. And in the early days, the IBM Selectric Script ball and the script daisy wheels only had calligraphic letters. So I kind of got used to this substitution.
In addition, Cambria Math was designed partly to look really good on screens, which didn’t have the resolution to display the narrow stem widths of Times New Roman and fancy script letters well. ClearType rendering certainly helped, but it seemed like a good idea to use less resolution demanding calligraphic letters. (Later Word 2013 disabled ClearType for various reasons and many readers of this blog have complained passionately ever since! With high resolution screens as on my Samsung laptop and the Surface Book, even Times New Roman looks crisp and nice with only gray-scale antialiasing, so hopefully this problem will diminish in time.) In contrast, it’s appropriate that the STIX font, based on Times Roman with its narrow glyph stems, would have the fancy script glyphs. With the mechanism described here, people could use calligraphic and script letters contrastively in the same document (assuming the fonts add the missing glyphs).
]]>To figure out what’s going on, let’s go back to the old, pre-Unicode days when people used character sets defined by code pages and charsets. For example, Russian keyboards generated Cyrillic characters defined in the Windows 1251 code page or in the ISO-8859-5 code page, which is a subset of 1251. Code page 1251 corresponds to the RUSSIAN_CHARSET charset, which is used in creating fonts on Windows using the LOGFONT structure. Similarly Greek has the Windows 1253 and ISO-8859-7 code pages and the GREEK_CHARSET charset. The Windows code pages 1250—1258 and the Thai code page 874 are 8-bit code pages, i.e., their character codes are less than 256. So when a user types using such a code page, it generates a character code that may well have a character defined in a SYMBOL_CHARSET font. Accordingly when a SYMBOL_CHARSET font was selected, typing with a Russian or Greek keyboard in the old days would display the characters at the code points defined by the corresponding 8-bit code page. For example, if you typed a Щ with an old Russian keyboard and Wingdings was active, you’d see Ù, the Wingdings character at 0x00D9, since Щ has the code point 0x00D9 in the Russian 1251 code page. For some reason, the Firefox browser won’t use a SYMBOL_CHARSET font, so it displays Ù instead of the Wingdings fancy up arrow tip and displays the wrong characters for the “Собака” string below too. Case in point!
Enter Unicode. People expected the same display behavior even when Unicode keyboards generate character codes above 255, such as for Cyrillic and Greek. To get that behavior with a SYMBOL_CHARSET font, Microsoft Office applications including Word and Excel figure out what script the characters belong to. If the script corresponds to an 8-bit code page, the programs use that code page to convert the characters back into the 0—255 range and voila! You see what you used to see in the old pre-Unicode days. Nowadays if you type Щ with a Russian keyboard, you enter the Unicode Cyrillic character U+0429, which nevertheless displays as Ù if formatted with Wingdings.
So far everything seems sort of reasonable, but what if you copy text formatted with Wingdings to plain text, such as in Notepad or in plain-text email? If the source is Word or Excel, you see the corresponding characters defined in Unicode, which don’t look anything like the characters in Wingdings. For example, suppose you type in “Собака” using a Russian keyboard. In Word, Excel, and WordPad when Wingdings is the font, you see “Ñîáàêà”. But if you copy this from Word or Excel to Notepad, you see the original “Собака”.
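The mapping between the Cyrillic characters and what a SYMBOL_CHARSET font ends up showing can be reproduced with Python’s standard codecs. This sketch only models the code-page arithmetic: the function names are hypothetical, and Latin-1 characters stand in for the glyphs at code points 0..255.

```python
def symbol_font_view(text, codepage="cp1251"):
    """Convert Unicode text to its 8-bit code page values and show those
    values as the characters with the same codes in the 0..255 range."""
    return text.encode(codepage).decode("latin-1")


def restore_original(text, codepage="cp1251"):
    """Reverse the conversion: treat each character code as a byte in the
    given code page and decode back to Unicode."""
    return text.encode("latin-1").decode(codepage)


print(symbol_font_view("Щ"))             # Ù (0xD9 in code page 1251)
print(symbol_font_view("Собака"))        # Ñîáàêà
print(restore_original("Ñîáàêà"))        # Собака
```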
What may seem even more anomalous occurs with RichEdit, which you can try out using WordPad. You can type with Russian or Greek keyboards and see the same Wingdings characters as displayed by Word and Excel. But if you copy the characters to plain text, you see the “high-ANSI” characters in the range 0x00A0..0x00FF instead of the original Unicode Cyrillic or Greek characters. This is because RichEdit converts the Unicode characters to the 8-bit code page values in the memory backing store instead of converting them for display only. This is exactly what happened in the old, pre-Unicode days. But it ends up creating an incompatibility in the formula bar of the immersive version of Excel, which uses RichEdit, while the traditional Win32 Excel uses another editor. The difference surfaces because, in general, the formula bar doesn’t use the fonts specified by the user and, in particular, it doesn’t use SYMBOL_CHARSET fonts. So on the desktop Excel, you see “Собака” instead of “Ñîáàêà” and on the current immersive version you see “Ñîáàêà”. While this is incompatible and undesirable, it’s a bit bizarre that the formula bar doesn’t display “Ñîáàêà” for both editors.
The RichEdit implementation difference resulted because I didn’t fully appreciate what Excel and Word were doing when years ago I set out to preserve the SYMBOL_CHARSET font input experience for Russian, Greek and other users in the then-new Unicode era. It’s true that converting the characters in the backing store instead of in the display might boost performance slightly since the results are cached, but changing what the user actually types isn’t desirable since other apps don’t. RichEdit does remember how to convert back to the original characters if a non SYMBOL_CHARSET font is applied. And RichEdit’s implementation may change to agree with Word and Excel in the future.
All characters defined in code pages are included in Unicode and now code pages are no longer used internally to define character codes in mainstream software. Meanwhile SYMBOL_CHARSET fonts such as Wingdings don’t have a code page (sometimes 0042 is used informally) and they don’t have a general Unicode mapping. The characters of some SYMBOL_CHARSET fonts (Wingdings, Webdings, Symbol) have been added to Unicode, so in principle you can use a Unicode symbol font like Segoe UI Symbol instead of those fonts. In contrast, Marlett is a particularly strange SYMBOL_CHARSET font. It contains glyphs for a few icons and carets. Many of Marlett’s code points in the range 0020..00FF, let alone all those above this range, are empty. Some of Marlett’s characters are already in Unicode, but it doesn’t seem likely that all will be.
At the outset of this post, I wrote that SYMBOL_CHARSET fonts only have characters for code points in the 0—255 range. That’s not quite true: the code points for 0x0020—0x00FF are mirrored at the Private Use Area range 0xF020—0xF0FF as explained in the post Weird F020-F0FF Characters in Word’s RTF. One good thing about using the latter range is that you know almost for sure in plain text that a SYMBOL_CHARSET font was used; you just don’t know which one!
Here is an interesting coda to this tale featuring the ubiquitous smiley face ☺, which is at the J position (0x004A) in the Wingdings font. If you copy this smiley face to a plain-text context, instead of the smiley face you may see a J or even a missing-glyph box. This happens a lot since by default Microsoft Word autocorrects the emoticon sequence to the smiley face in the Wingdings font. Nevertheless, when you copy this smiley face as plain text to WordPad, you see the Wingdings smiley face. How can this be?! The answer is that Word puts a U+F04A on the clipboard, which WordPad (actually RichEdit) recognizes as a SYMBOL_CHARSET font. Lacking any unambiguous font-binding choice, RichEdit uses Wingdings since that font seems to be the most widely used SYMBOL_CHARSET font. But if you paste U+F04A into desktop Excel, Excel just displays a missing-glyph box, since Excel doesn’t recognize U+F04A as anything special and doesn’t change the currently active font. (This may change in the future…)
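The mirroring between 0x0020..0x00FF and the Private Use Area range 0xF020..0xF0FF is a fixed offset of 0xF000, which is how the J position (0x004A) becomes U+F04A. A minimal Python sketch, with hypothetical helper names:

```python
SYMBOL_PUA_OFFSET = 0xF000


def to_symbol_pua(code):
    """Map a SYMBOL_CHARSET code in 0x20..0xFF to its mirrored PUA code point."""
    if not 0x20 <= code <= 0xFF:
        raise ValueError("symbol code must be in 0x20..0xFF")
    return SYMBOL_PUA_OFFSET + code


def from_symbol_pua(code_point):
    """Map a code point in 0xF020..0xF0FF back to 0x20..0xFF, else return None."""
    if 0xF020 <= code_point <= 0xF0FF:
        return code_point - SYMBOL_PUA_OFFSET
    return None


def looks_like_symbol_font_text(text):
    """Heuristic: plain text containing U+F020..U+F0FF very likely came from
    a SYMBOL_CHARSET font, although it does not say which one."""
    return any(0xF020 <= ord(ch) <= 0xF0FF for ch in text)


print(hex(to_symbol_pua(ord("J"))))  # 0xf04a
```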
The smiley face is also given by the Unicode code point U+263A, so you can enter a smiley face into Word by typing 263A alt+x. In fact, you can edit your autocorrect file to use the Unicode smiley face instead of the Wingdings smiley face. Then if you copy your smiley face to plain text, you see a smiley face with any program and it might even be a colorful emoji-style smiley face! The Segoe UI Symbol font contains all Unicode symbols including the smiley face. One problem is that for a given font height this font displays a somewhat larger glyph for the smiley face than the Calibri font displays for ordinary letters, which ends up with a larger line spacing if you mix the two fonts on a line. So you may want to scale Segoe UI Symbol down about 10% in such scenarios. Interestingly Segoe UI Emoji displays glyphs the same size as Calibri, so you might want to scale it up if used together with Segoe UI. To illustrate these cases, here’s an image of a Calibri ‘a’ followed by smiley faces formatted with Wingdings, Segoe UI Emoji, and Segoe UI Symbol, respectively
Happy New Year! ☺
]]>1) Math styles like math italic, bold, script, Fraktur and double-struck are obtained by character code changes instead of font changes.
2) Inside a math zone, all characters that can be displayed with the math font should be displayed with the math font unless marked as “math ordinary” text.
3) If the default math font is given as a document property, it should be used instead of the regular default math font.
The present post describes the first two of these differences in greater detail. The discussion applies to math zones in general, that is, it’s not restricted to RichEdit.
Math styles are discussed in Section 2.2 of Unicode Technical Report #25 Unicode Support for Mathematics. There under Semantic Distinctions, it is noted that
Mathematical notation requires a number of Latin and Greek alphabets that initially appear to be mere font variations of one another. For example, the letter H can appear as plain or upright (H), bold (𝐇), italic (𝐻), and script (ℋ). However, in any given document, these characters have distinct, and usually unrelated, mathematical semantics. For example, a normal H represents a different variable from a bold 𝐇, etc. If these attributes are dropped in plain text, the distinctions are lost and the meaning of the text is altered. Without the distinctions, the well-known Hamiltonian formula
ℋ=∫dτ(ϵE^{2}+μH^{2})
turns into the integral equation in the variable H:
H=∫dτ(ϵE^{2}+μH^{2})
Mathematicians will object that a properly formatted integral equation requires all the letters in this example (except perhaps for the d) to be in italics. However, because the distinction between ℋ and H has been lost, they would recognize the equation as a fallback representation of an integral equation, and not as a fallback representation of the Hamiltonian. By encoding a separate set of alphabets, it is possible to preserve such distinctions in plain text.
(Actually I wrote that text for UTR #25 along with similar text in the original Unicode proposal for math styles.) The key is that a single math font, such as Cambria Math, has a variety of math styles including bold, italic, script, Fraktur, double-struck and various sans serif styles. For example in a math zone, when a user selects a character and formats it as math bold via, say, the Ctrl+B hot key or math italic via Ctrl+I, the character code is changed, not the font as done in the usual font binding. This process and examples of the math alphanumeric character codes are given in the post Using Math Italic and Bold in Word 2007 and in Chapter 6 of the book Creating Research and Scientific Documents using Microsoft Word (see the section entitled “Use mathematical bold, italic, and sans serif”).
Summarizing the reasons for using the math alphanumerics over character-format markup, the primary reason is to preserve math character semantics in plain text. E.g., ℋ is different from H, a difference that is lost in plain text without a character-code change. In addition the math alphanumerics can have different math spacings and different glyphs than the corresponding ordinary text characters with styling. One example of a different glyph is the math italic 𝑎 of the Cambria Math font which doesn’t look like the italic a in the Cambria Italic font. Another reason is to limit the set of math alphanumerics. The Unicode Technical Committee didn’t want to endorse a mechanism like italic/bold variation selectors that would allow people to encode general italic and bold in plain text. Finally while the math alphanumerics complicate math font binding, they simplify some kinds of processing. In particular, you know what the math alphanumerics are from their code points alone; you don’t need to examine the associated character formatting.
If you paste text into a math zone, the characters need to be bound to a math font if the math font can display them and the bold and italic properties of the insertion point need to result in the corresponding character code changes. More specifically, any characters considered to be math characters in Section 2.4 “Locating Mathematical Characters” of UTR #25 need to be font bound to the math font. This is an example of context-dependent font binding. Inside a math zone, math operators should be bound to a math font like Cambria Math. Outside a math zone, math operators might be better bound to a symbol font like Segoe UI Symbol. In all backing stores (Word, OfficeArt, RichEdit), math alphanumerics are stored using their UTF-16 codes. So math italic a (𝑎) is stored as the surrogate pair for U+1D44E, that is, U+D835 U+DC4E. Math italic h (ℎ) is stored as U+210E, etc. This is a different kind of font binding since ordinarily bold would choose a bold font, but for math zones, bold uses the regular math font.
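The surrogate pair quoted above follows from the standard UTF-16 encoding rule, which is easy to check. The function name here is just for illustration:

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a supplementary-plane
    code point in U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D44E)   # math italic small a
print(f"U+{high:04X} U+{low:04X}")       # U+D835 U+DC4E
```

Math italic h needs no surrogate pair because U+210E lies in the Basic Multilingual Plane.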
A related but trickier process occurs when you toggle the math-zone property on and off. Consider first toggling the math zone property on, say by selecting some text not in a math zone and typing the math-zone hot key Alt+=. Each character needs to be examined for changing the font to a math font if the character can be displayed with the math font and if the bold/italic effects are active, the character codes need to be changed to the corresponding math bold/math italic characters, respectively. For example, the ASCII letter ‘a’ (U+0061) is converted to the math-italic 𝑎 (U+1D44E) if italic is active, which is usually the case in a math zone. In this process, only ASCII alphanumeric characters and Greek alphabetic characters are affected. Operators and other characters remain upright and have normal weight (not bold). If you want bold and/or italic effects for these characters in a math zone, you need to mark the characters as “math ordinary” characters, for which ordinary bold and italic font conventions apply. Word names “math ordinary” text as “normal” text.
If you select text in a math zone and toggle the math-zone property off, the reverse process needs to occur. To aid with the conversions, RichEdit exports the function GetMathAlphanumericCode() to convert a math alphanumeric to the corresponding ASCII/Greek character code and to return a code identifying the math style. Similarly RichEdit exports GetMathAlphanumeric() to get the math alphanumeric character corresponding to an ASCII/Greek character and a specific math style. The implementation is a bit intricate since some math alphabetic symbols were defined in the Unicode Letterlike Symbols block (U+2100..U+214F) before the sets were completed with the addition of the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF). A straightforward translation to the latter can lead to missing glyphs since holes are left where the symbols are defined in the Letterlike Symbols block. Also there are miscellaneous special cases, such as for dotless i and j, and conversions to and from the Arabic Mathematical Alphabetic Symbols.
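To give a feel for why the conversion needs special cases, here is a small Python sketch for the math italic alphabet only. It is not the RichEdit GetMathAlphanumeric() or GetMathAlphanumericCode() implementation, whose signatures aren’t shown here. The function names are hypothetical, and only the single hole for italic h is handled.

```python
MATH_ITALIC_CAPITAL_A = 0x1D434
MATH_ITALIC_SMALL_A = 0x1D44E

# U+1D455 is a hole in the Mathematical Alphanumeric Symbols block because
# math italic h was already encoded as U+210E in Letterlike Symbols.
ITALIC_EXCEPTIONS = {"h": "\u210E"}


def to_math_italic(ch):
    """Convert an ASCII letter to its math italic counterpart; other
    characters (digits, operators, ...) are returned unchanged."""
    if ch in ITALIC_EXCEPTIONS:
        return ITALIC_EXCEPTIONS[ch]
    if "A" <= ch <= "Z":
        return chr(MATH_ITALIC_CAPITAL_A + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(MATH_ITALIC_SMALL_A + ord(ch) - ord("a"))
    return ch


def from_math_italic(ch):
    """Reverse conversion, used when the math-zone property is toggled off."""
    if ch == "\u210E":
        return "h"
    cp = ord(ch)
    if 0x1D434 <= cp <= 0x1D44D:
        return chr(cp - MATH_ITALIC_CAPITAL_A + ord("A"))
    if 0x1D44E <= cp <= 0x1D467:
        return chr(cp - MATH_ITALIC_SMALL_A + ord("a"))
    return ch
```

A naive translation that skipped the exception table would map `h` to the unassigned U+1D455 and display a missing glyph. The script, Fraktur and double-struck alphabets have more such holes, so a full implementation needs a larger table.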
The bottom line is that natural-language font binding techniques aren’t able to handle most math font binding and special handling is required. While this complicates things, the resulting math typography looks excellent and plain-text copies retain the original rich-text math symbol semantics. Math font binding is one kind of context-dependent font binding. Context dependence also plays a role with font binding other kinds of text, such as emoji (see also this post), end user defined characters, Chinese characters (different fonts for Simplified Chinese, Traditional Chinese, Japanese, etc.), neutral characters of various kinds, and variation-selector sequences. Font binding in math zones still needs natural language font binding for non-math characters, so math font binding can become intertwined with natural-language font binding.
]]>These methods offer a variety of options including the tomConvertRTF (0x2000) option to insert and get RTF strings. This capability is exposed in the XAML text object model of the RichEditBox via the Windows.UI.Text.ITextRange::SetText() and GetText() methods with the FormatRtf option. Officially RTF is a byte format, that is, the character codes are 8 bits or multiples thereof and the TOM methods support BYTE strings, even though housed in BSTRs. In addition, the methods support UTF-16 strings that can contain any Unicode characters, so you don’t need to use the hard-to-read RTF Unicode \uN control word. On input (SetText2), UTF-16 RTF is recognized automagically. To retrieve UTF-16 RTF, bitwise OR in the tomGetUtf16 flag (0x00020000), that is, call ITextRange2::GetText2(tomConvertRTF | tomGetUtf16, &bstr). International and math UTF-16 RTF is so much easier to read than the standard 8-bit RTF! Internally, RichEdit reads UTF-16 RTF by converting it to UTF-8 RTF, a format introduced by RichEdit 4.1 in 2002, and reading in the resulting UTF-8 RTF.
Since the tomConvertRTF option made handling RTF easier, we decided to add some more options for the SetText2 and GetText2 methods. Specifically the latest Microsoft Office RichEdit supports the tomConvertMathML (0x00010000), tomConvertOMML (0x00080000), and tomConvertLinearFormat (0x00040000) options for converting MathML, OMML (Office MathML), and the math linear format, respectively. These options let you use RichEdit as a math-zone conversion machine: input math in one format and retrieve it in another.
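The option values quoted so far are single-bit flags, so they can be combined with bitwise OR. The Python sketch below only models that arithmetic using the values given in the text. The class and member names are made up for illustration, and no COM call to ITextRange2 is made.

```python
from enum import IntFlag


class TomConvert(IntFlag):
    """Flag values as quoted in the text; the Python names are illustrative."""
    RTF = 0x00002000            # tomConvertRTF
    MATHML = 0x00010000         # tomConvertMathML
    GET_UTF16 = 0x00020000      # tomGetUtf16
    LINEAR_FORMAT = 0x00040000  # tomConvertLinearFormat
    OMML = 0x00080000           # tomConvertOMML


# Equivalent of tomConvertRTF | tomGetUtf16 for retrieving UTF-16 RTF.
rtf_utf16 = TomConvert.RTF | TomConvert.GET_UTF16
print(hex(rtf_utf16))  # 0x22000
```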
Note that while RTF is a general rich-text format that can include math zones, MathML and OMML only represent math zones. In a fashion similar to RTF, plain text can include math zones in the linear format delimited by square brackets with quills ⁅ (U+2045) and ⁆ (U+2046). In fact, you can copy RichEdit text with math zones to a plain-text editor such as NotePad, copy that plain text back into RichEdit, Select All and use the ctrl-alt-shift-= hot key to build up all the math zones! Although most rich-text formatting is lost in plain-text copies, the math zones come through with little or no loss.
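Because the delimiters are ordinary Unicode characters, a plain-text consumer can locate the math zones without knowing anything about the linear format itself. A minimal Python sketch, assuming the zones are not nested and using a hypothetical function name:

```python
import re

# Math zones are delimited by U+2045 and U+2046 (square brackets with quills).
MATH_ZONE = re.compile("\u2045(.*?)\u2046", re.DOTALL)


def find_math_zones(text):
    """Return the linear-format contents of each delimited math zone."""
    return MATH_ZONE.findall(text)


sample = "Pythagoras: \u2045a^2+b^2=c^2\u2046, and a fraction \u2045x/y\u2046."
print(find_math_zones(sample))  # ['a^2+b^2=c^2', 'x/y']
```

Building up the zones into formatted math is a much bigger job, which is what the RichEdit hot key mentioned above does.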
The new RichEdit can also copy and paste MathML and OMML using the usual copy/paste hot keys and commands. Older RichEdit versions contain the MathML and OMML converters used by PowerPoint and OneNote, but ironically RichEdit itself didn’t expose MathML/OMML copy/paste functionality, so that’s been remedied.
In a nonmath context, there’s a new SetText2 option, tomConvertRuby (0x00100000), to convert strings like “{…|…}” to ruby inline objects, where the first ellipsis represents the ruby text and the second ellipsis the base text. The ASCII curly braces and vertical bar are translated to the internal ruby-object structure characters U+FDD1, U+FDEF, and U+FDEE, respectively. Alternatively the string can contain those structure characters directly. If a digit follows the start delimiter (‘{’ or U+FDD1), the digit defines the ruby options:
rubyAlign val | Meaning
center (0) | Center <ruby> with respect to <base>
distributeLetter (1) | Distribute difference in space between longer and shorter text in the latter, evenly between each character
distributeSpace (2) | Distribute difference in space between longer and shorter text in the latter using a ratio of 1:2:1 which corresponds to lead : inter-character : end
left (3) | Align <ruby> with the left of <base>
right (4) | Align <ruby> with the right of <base>
If you add 5 to these values, the ruby object will display the ruby text below the base text instead of above it. For example, calling ITextRange2::SetText2(tomConvertRuby, bstr) with bstr containing the string “{1にほんご|日本語}” inserts
The string can contain text in addition to ruby objects and the ruby objects can be nested to create compound ruby objects such as
We see that the ITextRange2::SetText2() and GetText2() methods provide helpful conversion facilities. You may have noticed that the very important LaTeX/TeX math format doesn’t have such an option yet: no tomConvertTeX. That’s a serious shortcoming that needs to be addressed.
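For readers who want to experiment with the ruby string syntax described above, here is a small Python sketch that splits a single, non-nested “{…|…}” string into its parts and decodes the option digit. It is not RichEdit’s parser. The names are hypothetical, only the ASCII delimiters are handled, and a missing digit is reported as None rather than guessing a default.

```python
from typing import NamedTuple, Optional, Tuple

ALIGN_NAMES = ["center", "distributeLetter", "distributeSpace", "left", "right"]


class Ruby(NamedTuple):
    option: Optional[int]  # digit after '{', or None if absent
    ruby_text: str
    base_text: str


def parse_ruby(s: str) -> Ruby:
    """Parse one non-nested ruby string of the form {[digit]ruby|base}."""
    if len(s) < 3 or s[0] != "{" or s[-1] != "}" or "|" not in s:
        raise ValueError("expected a string like {ruby|base}")
    body = s[1:-1]
    option = None
    if body[0] in "0123456789":
        option = int(body[0])
        body = body[1:]
    ruby_text, base_text = body.split("|", 1)
    return Ruby(option, ruby_text, base_text)


def decode_option(option: int) -> Tuple[str, bool]:
    """Return (alignment name, ruby displayed below base) for a digit 0..9.
    Values 5..9 are the alignments 0..4 with the ruby text below the base."""
    if not 0 <= option <= 9:
        raise ValueError("ruby option must be a single digit")
    return ALIGN_NAMES[option % 5], option >= 5


parsed = parse_ruby("{1にほんご|日本語}")
print(parsed.ruby_text, parsed.base_text, decode_option(parsed.option))
```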
]]>