Editing Math using MathML for Speech

The posts Microsoft Office Math Speech and Speaking of math… describe two kinds of math-speech granularities: coarse-grained (navigate by words), which speaks math expressions fluently in a natural language, and fine-grained (navigate by characters), which reveals the content at the insertion point (IP) in sufficient detail to enable editing. Several Assistive Technologies (ATs) use MathML to speak math zones fluently. As far as I know, though, ATs can currently produce fine-grained speech only by using the explicit math speech obtained via UI Automation from Microsoft Office applications. That’s how Narrator does it.

This post describes how MathML could be used to generate fine-grained speech. The trick is to reveal where the insertion point (IP) is so that the user knows where the next typed character will go.

Using speech cues to edit a fraction

To see how this works, consider the fraction 1/2π displayed in built-up form as

[Image: the fraction 1 over 2π in built-up form]

The coarse-grained speech for this (in English) is “1 over 2 pi”. The fine-grained speech resulting from moving right one character at a time is

“start fraction”
“1”
“end numerator”
“2”
“pi”
“end denominator”

With character navigation, Narrator speaks these strings for the fraction in Word, PowerPoint, and OneNote documents. Hearing this speech, the user knows where the IP is and hence where the next typed character will be entered. To enable editing, MathML content needs to offer the same functionality.

The MathML for the fraction is

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block">
    <mml:mfrac>
        <mml:mn>1</mml:mn>
        <mml:mrow>
            <mml:mn>2</mml:mn>
            <mml:mi>𝜋</mml:mi>
        </mml:mrow>
    </mml:mfrac>
</mml:math>

This doesn’t name the numerator and denominator explicitly. Instead, the numerator is defined to be the first child of the <mml:mfrac> element and the denominator is defined to be the second. The MathML can be used to generate the speech “1 over 2 pi” in a natural language, that is, the coarse-grained speech. Fine-grained speech needs MathML that identifies what’s at the insertion point, which can be a character, the start of the fraction, the end of the numerator, or the end of the denominator. The MathML above doesn’t offer such information.
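
To make the positional convention concrete, here is a minimal Python sketch (not how any particular AT works) that walks the fraction’s MathML and lists the fine-grained cues in navigation order. The names fraction_cues, leaf_cues, and the SPOKEN table are hypothetical, and the sketch handles only a simple fraction. Note that it only enumerates the possible positions; nothing in the MathML tells it which one holds the IP.

```python
import xml.etree.ElementTree as ET

MML = "http://www.w3.org/1998/Math/MathML"

# Hypothetical spoken names for a few symbols; a real AT would use its own tables.
SPOKEN = {"\U0001D70B": "pi", "\u03C0": "pi"}

def leaf_cues(node):
    """Yield one spoken string per token element (mn, mi, mo) under node."""
    if len(node) == 0:
        text = (node.text or "").strip()
        if text:
            yield SPOKEN.get(text, text)
    else:
        for child in node:
            yield from leaf_cues(child)

def fraction_cues(mfrac):
    """Yield the fine-grained cues for an mfrac, in move-right order."""
    # MathML identifies the arguments by position only:
    # first child = numerator, second child = denominator.
    numerator, denominator = list(mfrac)
    yield "start fraction"
    yield from leaf_cues(numerator)
    yield "end numerator"
    yield from leaf_cues(denominator)
    yield "end denominator"

doc = """<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <mml:mfrac>
    <mml:mn>1</mml:mn>
    <mml:mrow><mml:mn>2</mml:mn><mml:mi>\U0001D70B</mml:mi></mml:mrow>
  </mml:mfrac>
</mml:math>"""

mfrac = ET.fromstring(doc).find(f"{{{MML}}}mfrac")
cues = list(fraction_cues(mfrac))
print(cues)
```

The printed list matches the six strings Narrator speaks for this fraction, but the AT would still need to be told which of them applies at any moment.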

<maction>

There are at least two ways to produce such per-character-position speech with MathML, both of which use an <maction> element. In the first, when the IP moves by character to the front of the fraction, the MathML would be

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
    <mml:maction actiontype="input">start fraction</mml:maction>
</mml:math>

Dropping the <mml:math> element for brevity, the MathML output for subsequent move-by-character navigation actions would be

     <mml:maction actiontype="input">1</mml:maction>
     <mml:maction actiontype="input">end numerator</mml:maction>
     <mml:maction actiontype="input">2</mml:maction>
     <mml:maction actiontype="input">pi</mml:maction>
     <mml:maction actiontype="input">end denominator</mml:maction>
 
The text in the <maction> element can be localized into various languages. If this approach becomes popular, it would be worth standardizing on text strings like “end numerator” to help users as well as localization. The Microsoft Office math-speech engine produces strings with 16-bit speech tokens that index sets of language strings in over 18 languages. But that process occurs internally. For general implementation by ATs, it seems better to use a set of standardized English strings that an AT can associate with string sets in other languages. A set of such English strings can be obtained by running Narrator over a Word document with equations on an English operating system.
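
As a rough illustration of the producing side of this first approach, the following Python sketch wraps each cue string in the kind of <maction> fragment shown above. The function name cue_to_mathml is hypothetical, and the cue list is simply the one from this post; an application would emit one fragment per navigation step rather than all at once.

```python
import xml.etree.ElementTree as ET

MML = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("mml", MML)  # serialize with the mml: prefix used in this post

def cue_to_mathml(cue):
    """Wrap one fine-grained cue in <mml:math><mml:maction actiontype="input">."""
    math = ET.Element(f"{{{MML}}}math")
    action = ET.SubElement(math, f"{{{MML}}}maction", {"actiontype": "input"})
    action.text = cue
    return ET.tostring(math, encoding="unicode")

# The cues for the fraction 1/2π, in move-right order.
cues = ["start fraction", "1", "end numerator", "2", "pi", "end denominator"]
fragments = [cue_to_mathml(cue) for cue in cues]

for fragment in fragments:
    print(fragment)
```

An AT receiving such a fragment only has to read the text of the <maction> element and, if needed, look it up in its own table of localized strings.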

A second way to produce such per-character-position speech using MathML is to generate the MathML for the math object that has the insertion point and include an <maction> revealing where the IP is. For example, if the IP is at the end of the numerator in the fraction above, the MathML would be

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block">
    <mml:mfrac>
        <mml:mrow>
            <mml:mn>1</mml:mn>
            <mml:maction actiontype="insertion point"/>
        </mml:mrow>
        <mml:mrow>
            <mml:mn>2</mml:mn>
            <mml:mi>𝜋</mml:mi>
        </mml:mrow>
    </mml:mfrac>
</mml:math>

Since this MathML has the full context of the insertion point, the AT can create suitable speech. It requires more analysis by the AT than the first <maction> approach, but is more flexible. Such approaches using <maction> are quite general and don’t need specialized methods to decode the math in memory. They could work for all operating systems and applications that support MathML.
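
To give a feel for the analysis the second approach asks of an AT, here is a minimal Python sketch that locates the insertion-point <maction> and turns its position into a cue. The function describe_ip is hypothetical, and the sketch assumes the marker sits in the top-level <mrow> of a fraction’s numerator or denominator, as in the example above; other math objects and nested rows are not handled.

```python
import xml.etree.ElementTree as ET

MML = "http://www.w3.org/1998/Math/MathML"

def q(name):
    """Return the namespace-qualified tag name that ElementTree uses."""
    return f"{{{MML}}}{name}"

def describe_ip(mathml):
    """Return a fine-grained cue for the IP marker, or None if not found."""
    root = ET.fromstring(mathml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for marker in root.iter(q("maction")):
        if marker.get("actiontype") != "insertion point":
            continue
        # Climb until we reach the direct child of an mfrac (an argument).
        arg = marker
        while arg in parent_of and parent_of[arg].tag != q("mfrac"):
            arg = parent_of[arg]
        if arg not in parent_of:
            return None  # IP is not inside a fraction; not handled here
        # First child of mfrac is the numerator, second the denominator.
        index = list(parent_of[arg]).index(arg)
        part = "numerator" if index == 0 else "denominator"
        siblings = list(parent_of[marker])
        pos = siblings.index(marker)
        if pos == len(siblings) - 1:
            return f"end {part}"
        # Otherwise speak the character that follows the IP.
        return "".join(siblings[pos + 1].itertext()).strip()
    return None

example = """<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <mml:mfrac>
    <mml:mrow><mml:mn>1</mml:mn><mml:maction actiontype="insertion point"/></mml:mrow>
    <mml:mrow><mml:mn>2</mml:mn><mml:mi>\U0001D70B</mml:mi></mml:mrow>
  </mml:mfrac>
</mml:math>"""

where = describe_ip(example)
print(where)
```

Because the AT sees the whole fraction, it is free to choose different wording or add context, which is the flexibility this approach buys at the cost of the extra tree walking.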

(Thanks to Sue-Ann Ma, Neil Soiffer, James Teh, Volker Sorge, Peter Frem and Ziad Khalidi for encouraging me to come up with a way to use MathML for editing.)