Bidirectional behavior in IE 7 is better than in IE 6


Hello. My name is Wujun Wang and I am a tester on the IE team in Beijing. My area of test focus is BiDi. Wait, what is BiDi? When I searched for “bidi” at http://www.Dictionary.com, it defined bidi as “A thin, often flavored Indian cigarette made of tobacco wrapped in a tendu leaf.” So you know what I do. I am testing cigarettes! Well, not really. BiDi is a short name often used for the Unicode Bidirectional Algorithm (http://www.unicode.org/reports/tr9/).

When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several languages (such as Arabic, Divehi, Hebrew and Syriac) where the natural ordering of horizontal text is right to left. Ambiguities can arise in determining the display order of characters when text that flows in two directions (hence “bidirectional”) is present. For example, Hebrew text containing Latin letters and/or digits flows in two directions. Here is a short example:

<HTML DIR=RTL>
 <BODY>
  <P>1+2+ש</P>
 </BODY>
</HTML>

This HTML example displays a short string. The DIR attribute specifies the default directionality to apply to the characters; <HTML DIR=RTL> means all elements in this file have a default right-to-left directionality. This seems pretty straightforward until right-to-left and left-to-right text sit next to each other. For example, let’s look at the string “1+2+ש”, where “ש” is the HEBREW LETTER SHIN. Since Hebrew’s natural ordering is from right to left, you might guess that this string will be displayed fully reversed, as “ש+2+1” (reading left to right on screen: shin, plus, 2, plus, 1). However, this is not correct! Although Hebrew is written from right to left, digits are written from left to right. Another tricky thing about this string is that “+” is a special character in the BiDi algorithm: between two digits it is treated as part of the number. Therefore, according to the Unicode BiDi algorithm, the correct visual rendering of the sample is “ש+1+2” (reading left to right on screen: shin, plus, 1, plus, 2).
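
One way to see why the string behaves this way is to look at the bidirectional class that Unicode assigns to each character. The following is a minimal sketch in Python, using only the standard unicodedata module; it is an illustration, not how IE is implemented, and the variable names are made up for this example.

```python
import unicodedata

# The sample string in logical (typing) order:
# DIGIT ONE, PLUS SIGN, DIGIT TWO, PLUS SIGN, HEBREW LETTER SHIN
sample = "1+2+\u05e9"

# Pair each character with its Unicode bidirectional class.
classes = [(ch, unicodedata.bidirectional(ch)) for ch in sample]

for ch, bidi_class in classes:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class}")
```

The digits come back as EN (European Number), the plus sign as ES (European Separator), and shin as R (strong right-to-left). The ES class is what makes “+” special: a single separator between two European numbers is treated as part of the number, while elsewhere it takes its direction from the surrounding text.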

You might wonder how we went about coming up with tests to make sure we are going in the right direction in bidi land. First, we spent time going through the Unicode Bidi Algorithm document step by step and figured out what it was saying. (Warning: reading the bidi algorithm can make your head hurt.) Next we identified groups of characters and combinations that would be helpful for testing each of the rules. Then we worked out on paper what we thought should be happening with the bidi levels and reordering. Finally we made HTML test cases and checked that the browser gave us the same result that we had figured out on paper.
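
The paper exercise can be sketched in code for the sample string. In the sketch below, the embedding levels are worked out by hand for a right-to-left paragraph (they are an assumption fed into the code, not something it computes): “1+2” resolves to level 2 because the “+” between two digits joins the number, while the second “+” and the shin stay at the paragraph level, 1. Only the final reordering step of the algorithm (rule L2) is implemented, and reorder_visual is a hypothetical helper name.

```python
def reorder_visual(text, levels):
    """Rule L2 of the Unicode Bidi Algorithm: from the highest level down to
    the lowest odd level, reverse every contiguous run at that level or higher."""
    items = list(zip(text, levels))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return text  # only even levels: logical order is already visual order
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(items):
            if items[i][1] < level:
                i += 1
                continue
            j = i
            while j < len(items) and items[j][1] >= level:
                j += 1
            items[i:j] = items[i:j][::-1]  # reverse this run in place
            i = j
    return "".join(ch for ch, _ in items)


sample = "1+2+\u05e9"        # logical order
levels = [2, 2, 2, 1, 1]     # hand-resolved levels (assumption, see above)
visual = reorder_visual(sample, levels)

# Print code points so a bidi-aware terminal cannot reorder the result again.
print(" ".join(f"U+{ord(ch):04X}" for ch in visual))
```

Read left to right, the printed code points are shin, plus, 1, plus, 2, which is the rendering described above. A real test pass would compare the browser's output against results like this for many character combinations.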

Improving our support for the Unicode BiDi Algorithm fixes a number of problems our customers have had with IE. Because we live on a small planet, having correct support for all languages is important to us.

If you are interested in content for right-to-left languages, these articles may also appeal to you:

Authoring HTML for Middle Eastern Content:
http://www.microsoft.com/globaldev/handson/dev/Mideast.mspx

Justifying Text using Cascading Style Sheets (CSS) in Internet Explorer 5.5:
http://www.microsoft.com/middleeast/msdn/JustifyingText-CSS.aspx

Thanks in advance for your feedback!

 – Wujun Wang

Comments (33)

  1. game kid says:

    "So you know what I do. I am testing cigarettes!"

    *submits resume* –joking, it’s good to see more stuff fixed.  One doesn’t think of BiDi amid the other stylistic stuff.

  2. zcorpan says:

    Have you addressed bidi in title="" attributes (rendered in tooltips)?

  3. Kearns says:

    If Hebrew is like Arabic (and I suspect it is) they don’t read the numbers from left to right, but they don’t reverse the direction of the numbers either. So a number like 2938 would be read "eight and thirty and nine hundred and two thousand", which I suspect is much better for building suspense on the Hebrew or Arabic version of the Price is Right…

  4. Greg Martin says:

    When I saved your snippet as test.html and displayed it in IE 7, I saw shin first followed by the +1+2. All of this from the right margin. When I copied and pasted string into notepad I saw 1+2+{shin} (Of course, from the left margin).  I am using the name of the letter, because whenever I type the string in this comments field the letters always change to 1+2+{shin} and my normal left to right typing moves in unexpected and uncontrollable directions.

    I hope these comments clearly describe what I observed.

  5. alexander says:

    How do you feel, living in China, to have your own company (microsoft) working with your government to support the oppression of the Chinese people, imprisonment of people criticizing your government?

    (maybe we will even see some censoring of this very comment)

    http://www.guardian.co.uk/china/story/0,7369,1506601,00.html

    http://rconversation.blogs.com/rconversation/2006/01/microsoft_takes.html

    http://www.voanews.com/english/2006-02-01-voa86.cfm

  6. PaulNel [MSFT] says:

    In Arabic I would read the number 2938 basically from left to right, e.g. two thousand nine hundred eight and thirty. When I write in Arabic digits I write them from left to right.

    Greg – Use Ctrl+Right Shift in Notepad to switch it to right-to-left reading order. 😎

  7. PaulNel [MSFT] says:

    zcorpan – Tooltips are part of COMCTL32.DLL. The bidi behavior in those is controlled by the OS and not IE. Not sure when that will be updated.

  8. mike says:

    Well this is good. Are there any improvements to the CSS support for the ‘direction’ and ‘unicode-bidi’ or other such properties?

  9. PaulNel [MSFT] says:

    I put support for CSS ‘direction’ and ‘unicode-bidi’ in place in IE5. These should all be functioning correctly after we fixed some bugs. If you come across a case that is not working, please let me know so I can have a look.

  10. > Tooltips are part of the COMCTL32.DLL. The bidi behavior in those are controled by the OS and not IE

    That’s probably fine as long as the OS gets it right.  Luckily you can’t style a title= so there’s no <select>-style problems with using an OS widget to render titles.

  11. Claw says:

    Is there any work being done for vertically written scripts such as traditional Chinese and Japanese?

  12. PatriotB says:

    Of course, COMCTL32.DLL used to be upgraded as part of an IE installation.  Looks like that won’t be happening any more. 🙁

  13. PaulNel [MSFT] says:

    IE currently supports the use of the CSS style="writing-mode: tb-rl;" for East Asian text. We are not making any changes/improvements in IE7. However, I am hoping we can do some vertical text improvements in a future version.

    Any comments you have on the existing behavior would be helpful as I work on editing the CSS Text module and plan for future IE improvements.

  14. Xepol says:

    So, let me get this right.  You are taking languages which are expressed right to left, putting them  in a document left to right and then an algorithm works out how to reverse it all again.

    All this because sometimes it doesn’t go right to left huh?

    <shudder>

  15. PaulNel [MSFT] says:

    We take a logical string of Unicode characters and then process them into homogeneous runs of text that are then ordered using the Unicode Bidi Algorithm so they can be rendered in the correct direction for the writing system.

    The important thing to remember is that logical character order in the HTML document (backing store) doesn’t care about writing direction. One character simply follows the next. This is why lexical tools can work for any language of the world.

  16. send2kb says:

    bidi is good but what good is it when I am not able to type anything but English in IE 7?  My IME does not work at all anymore so I am still using Firefox for IME support.

  17. Arash Salarian says:

    Good news. A question: are you going to support Unicode’s LRE, PDF and RLE characters in IE7?

  18. PaulNel [MSFT] says:

    The named entities &lre; &rle; &pdf; &lro; and &rlo; have been supported since IE5. You can also use the decimal or hex entity format. I find the named entities easier to read when I have to use them.

    If you use HTML markup (dir=ltr, dir=rtl or <bdo>) or CSS properties (direction and unicode-bidi) you can achieve the same thing.

  19. PaulNel [MSFT] says:

    Hi "Send2kb" – Thanks for reporting the IME problems you encountered. I did a little digging and found that IME support is either being fixed or has already been fixed. [:$] (embarrassed smiley)

  20. Ingo Chao says:

    In my copy of IE7b2, the rendering of position:relative in an rtl context is not correct.

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html dir="rtl" xmlns="http://www.w3.org/1999/xhtml">
    <head>
     <meta http-equiv="content-type" content="text/html; charset=utf-8" />
     <title></title>
    </head>
    <body>
     <p><span style="position:relative; border: 1px solid red;">1+2+ש</span></p>
    </body>
    </html>

  21. Alex says:

    Just for once, can you guys PLEASE post some decent sample code!?

    By presenting bad code, as a sample, you do absolutely nothing in terms of helping developers learn to do things right.

    Your sample, should be:

    <html dir="rtl">
    <body>
     <p>1+2+ש</p>
    </body>
    </html>

    1.) Lower case tags. 1995 called, they want their <UPPERCASE TAGS> back!

    2.) All attributes, should be wrapped in double quotes.

    3.) Your actual sample, is horrible, for trying to describe what you are doing.  A "Hello World" would have made much more sense, as would a second <p> tag, with a number sample.

    4.) Post a SCREENSHOT! of what it is supposed to render as!

    5.) Minor, but a <head> and <title> tag would have made the example a bit more complete.

    Thank you.

    A.G.

  22. PaulNel [MSFT] says:

    Ingo – Thanks for reporting this nice bug.

  23. Eric K. says:

    > 2.) All attributes, should be wrapped in double quotes.

    > 3.) Your actual sample, is horrible, for trying to describe what you are doing.  

    Your use of commas within the above sentences is incorrect.  🙂

  24. Alex says:

    @Eric K.

    Yes, my commas. periods: Capitalization be may wrong grammar but I not the one posting the articles!!!

    (yes, all the above was intentionally wacky)

    The point is, MS should be posting better samples, if they expect us to follow their guidance!

  25. send2kb says:

    Great!!!! I see that you guys are working hard over there.  Keep up the good work!  Can’t wait until the final product.

  26. Josh says:

    I’m not sure if anyone else is having problems with the examples above. Firefox and IE6 both display the characters in different orders, and they both differ from the source code.

    Perhaps the examples should be placed in an image?

  27. tsahi says:

    IE6 has pretty good BiDi support, except for a quirk or two (like treating a <br> like a <p>, so this:

    עברית eng1.<br>

    eng2

    puts the dot at the wrong side of eng1). I hope you will follow the BiDi algorithm really closely, or you’ll break lots of web pages.

  28. PatriotB says:

    Alex — There is nothing wrong with uppercase tags and unquoted attributes (depending on the content of the attribute) in HTML 4.01; they are valid.  Of course since the code sample didn’t have a doctype, we can only assume that it’s straight HTML and not meant to be interpreted as XHTML.

  29. alm says:

    Random XSL bug … if you look at an XML file to which an XSL is attached and the XSL is rendered correctly … then you Ctrl-N … there’s no XSL in the new window.

  30. tendu says:

    Alex, chill out and smoke a bidi.

  31. Japanese User says:

    When entering a Japanese IDN in IE7, a button to the right of the URL, when clicked, displays a window (International Website Address) and shows "character sets currently in use."

    When using a single-Kanji dot-com, it shows languages: Hani and Latn.

    When using a Katakana web address, it shows: Kana and Latn.

    However, when using a Katakana web address with the "dash" character (ー), the browser displays a message (under the URL): "The web address contains letters or symbols that cannot be displayed with the current language settings." The URL is also displayed in Punycode and it indicates the languages being used are: Zyyy, Kana and Latn.

    So, does anyone know what "Zyyy" is? And is there a language to add (under "Language Preference Settings") to include the "dash" (ー) character in katakana domains so that IE7 displays the IDN instead of punycode?

    Or is there another solution to this problem?