Bidi Paragraph with Parenthesized Text

The previous post described four tailorings of the Unicode Bidi Algorithm (UBA) in situations where the UBA display is confusing or even misleading. The present post adds another set of scenarios to this list, namely strange renderings of paragraphs that contain parenthesized or quoted text. An algorithm for displaying such text in a reasonable way is given. This algorithm first shipped in Microsoft Office 2007 and you can see it in action by typing bidi text into Excel 2007/2010 cells. The problem is sort of mathematical in nature, since parenthesized text is like a parenthesized expression. It can be nested and the text should display inside the parentheses.

Nevertheless, according to the UBA, there are cases in which both parentheses of a parenthesized expression display with the same glyph. In a rich-text editor like Microsoft Word, if you type

(a)b

using an English keyboard and then type Ctrl+RightShift to switch to an RtL paragraph, you see

(a)b

This is because Word stamps the parentheses with LtR directionality taken from the keyboard language, thereby overruling the UBA. In Notepad, which follows the UBA, the LtR paragraph version looks the same as in Word, but the RtL version looks like

a)b)

Here the UBA classifies the opening parenthesis as RtL, so it appears first (on the far right) and mirrored, instead of to the left of the letter a. Similarly a(b) appears as (a(b in an RtL paragraph.
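To see where a rendering like a)b) comes from, it can help to apply the UBA's reordering rule (L2: reverse runs from the highest level down to the lowest odd level) and its mirroring rule (L4) by hand. The Python below is a minimal sketch of just those two rules, assuming the resolved bidi levels are already known. The function name visual_order and the two-entry mirror table are illustrative choices, and the levels in the usage lines were worked out by hand for an RtL paragraph.

```python
# Only the ASCII parentheses are mirrored in this sketch.
MIRROR = {"(": ")", ")": "("}


def visual_order(text, levels):
    """Return the characters in left-to-right display order.

    text   -- characters in logical (typing) order
    levels -- resolved bidi level of each character, same length as text
    """
    # Rule L4 (simplified): a mirrorable character at an odd (RtL) level
    # is drawn with its mirrored glyph.
    chars = [MIRROR.get(c, c) if lvl % 2 else c for c, lvl in zip(text, levels)]
    lv = list(levels)
    odd_levels = [l for l in lv if l % 2]
    if not odd_levels:
        return "".join(chars)  # purely LtR: nothing to reorder

    # Rule L2: from the highest level down to the lowest odd level,
    # reverse every maximal run of characters at that level or higher.
    for level in range(max(lv), min(odd_levels) - 1, -1):
        i = 0
        while i < len(lv):
            if lv[i] >= level:
                j = i
                while j < len(lv) and lv[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                lv[i:j] = lv[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


# RtL paragraph: the opening parenthesis resolves to level 1, the rest to level 2.
print(visual_order("(a)b", [1, 2, 2, 2]))  # a)b)
# RtL paragraph: the closing parenthesis resolves to level 1, the rest to level 2.
print(visual_order("a(b)", [2, 2, 2, 1]))  # (a(b
```

The examples use only ASCII letters so that the printed string is not reordered a second time by the terminal or editor displaying it.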

As a more complicated example, consider the nested case a(bف(فc)dc)d. According to the UBA, this displays in an LtR paragraph as a(bف(فc)dc)d and in an RtL paragraph as

a(bف(فc)dc)d

Neither of these renderings preserves the visual nesting of the characters.

We can fix such parenthesized text displays using a bidi-parenthesis-matching algorithm due to Ayman Aldahleh. The basic idea is to ensure that both parentheses of a matched pair have the same directionality and that what’s inside the parentheses has bidi level(s) greater than or equal to the parenthesis pair level. The algorithm uses an open-parenthesis stack along with a parenthesis-pair information array. It can easily be generalized to handle brackets, braces, and other character pairs, but for simplicity we stick with ASCII parentheses and assume that Unicode bidi embeddings aren’t present.

1) Run the UBA on a paragraph noting the bidi levels of the characters.

2) Scan the paragraph for parentheses. When an open parenthesis is found (U+0028), record its bidi level and character position in the next information element and push the element index onto the open-parenthesis stack.

3) If a close parenthesis is found (U+0029) and the stack has an entry, a matched pair is found. Use the element index on the top of the stack to find the corresponding pair information element. If both parentheses have the same bidi level, use that for the pair. If they differ, start by setting the pair level equal to the smaller level. If that pair level isn’t odd in an RtL paragraph, or isn’t even in an LtR paragraph, increment the pair level. Set both parenthesis levels equal to the resulting pair level and record the character position ending the pair.

4) If an unpaired or improperly nested parenthesis is found, abandon the matching process.

5) Process the pair information elements from first to last. For each element, if any character inside the pair has a level smaller than the pair level, increment that character’s level by 2. This forces the character to display inside the pair and doesn’t change its directionality.

This last step is recursive, since outer pair elements precede elements for any pairs they contain and increment the latters’ levels. When an inner pair is processed, its level is guaranteed to be greater than or equal to that of its parent pair, and so on for deeper nesting.
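Steps 2 through 5 can be sketched in Python as follows. This is an illustration written from the description above, not the Office implementation. It assumes the UBA levels from step 1 are passed in, reads step 3 so that the odd/even adjustment applies only when the two parenthesis levels differ, and takes "abandon the matching process" to mean returning the levels unchanged. The name match_parens and the list-based pair elements are hypothetical.

```python
def match_parens(text, levels, rtl_paragraph):
    """Return adjusted bidi levels for text (steps 2-5 of the description).

    text          -- characters in logical order
    levels        -- UBA-resolved level of each character (step 1, done elsewhere)
    rtl_paragraph -- True for an RtL paragraph, False for LtR
    """
    out = list(levels)
    pairs = []  # pair information elements: [open_pos, close_pos]
    stack = []  # indices into pairs for parentheses still open

    for pos, ch in enumerate(text):
        if ch == "(":
            # Step 2: record the position and push the element index.
            pairs.append([pos, None])
            stack.append(len(pairs) - 1)
        elif ch == ")":
            if not stack:
                return list(levels)  # step 4: unpaired close, abandon
            pair = pairs[stack.pop()]
            pair[1] = pos
            open_level, close_level = out[pair[0]], out[pos]
            # Step 3: choose the pair level.
            if open_level == close_level:
                pair_level = open_level
            else:
                pair_level = min(open_level, close_level)
                wanted_parity = 1 if rtl_paragraph else 0
                if pair_level % 2 != wanted_parity:
                    pair_level += 1
            out[pair[0]] = out[pos] = pair_level

    if stack:
        return list(levels)  # step 4: unpaired open, abandon

    # Step 5: outer pairs come first because elements are created in
    # open-parenthesis order. The pair level is read from the open
    # parenthesis itself, so an inner pair sees any increment that an
    # enclosing pair already applied to it.
    for open_pos, close_pos in pairs:
        pair_level = out[open_pos]
        for i in range(open_pos + 1, close_pos):
            if out[i] < pair_level:
                out[i] += 2
    return out


# a(b<feh>c)d in an RtL paragraph: the Latin letters and parentheses are at
# level 2 and the Arabic letter at level 1 before matching.
print(match_parens("a(b\u0641c)d", [2, 2, 2, 1, 2, 2, 2], rtl_paragraph=True))
# [2, 2, 2, 3, 2, 2, 2]
```

Storing the pair level on the parentheses themselves, rather than in a separate field, is one simple way to get the recursive behavior: when an outer pair raises the levels of an inner pair's parentheses, the inner pair level rises with them.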

For the simple example above, (a)b in an RtL paragraph appears as (a)b, since the opening parenthesis is promoted to level 2. In a case like a(bفc)d in an RtL paragraph, the a(b and c)d all have level 2 and the ف is bumped up to level 3. To see more cases, type various parenthesized expressions into an Excel cell and click the paragraph direction tool to change the paragraph direction. If the parentheses are properly matched, the results will always look properly nested. This also works with brackets [] and braces {}.