Displaying tfeL-ot-thgiR Scripts - Can your computer handle it?

If you plan to use complex scripts (such as Arabic, Hebrew, or Thai), make sure that support for complex scripts is installed. That sounds obvious, but we want to make sure people know how and why to do this. Let us examine this in detail using an international URL issue that Dean recently encountered while dogfooding IE7.

What is the issue?

Internet Explorer will sometimes display right-to-left (RTL) languages, such as Arabic or Hebrew, in left-to-right (LTR) order in the address bar. Because of IE’s IDN homograph-spoofing mitigation, this is an issue only for users who have that language in their language settings (otherwise, navigating to an international URL displays the hostname as punycode). Still, this is a genuine customer scenario that should just work. For example, consider a Hebrew URL:
https:// gimel beth aleph dot com

This would show up in the address bar as https://xn--4dbcd.com by default. But a user with Hebrew in their accepted languages might see:

https:// aleph beth gimel dot com
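
The punycode form mentioned above can be reproduced outside the browser. The following is a minimal sketch using Python's built-in "idna" codec; it only illustrates the Unicode-to-punycode mapping for the example hostname and says nothing about how IE itself performs the conversion.

```python
# Hebrew letters aleph, beth, gimel in logical (typed) order,
# written as escapes so the source file itself has no RTL text.
label = "\u05d0\u05d1\u05d2"
host = label + ".com"

# The built-in "idna" codec converts each label of the hostname
# to its ASCII-compatible (punycode) form.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # xn--4dbcd.com

# Decoding with the same codec restores the Unicode hostname.
round_trip = ascii_host.encode("ascii").decode("idna")
print(round_trip == host)  # True
```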

We found it very surprising that such an important issue was not reported widely during Beta 1 and Beta 2, so we wanted to dig a little deeper.

What did our investigation reveal?

This happens only on machines where support for complex scripts is not installed. Complex scripts are those that require contextual processing for display, editing, and other functions, such as:

  1. Bi-directional (BiDi) reordering (e.g. Arabic, Hebrew)
  2. Contextual shaping (e.g. Arabic, Indic family)
  3. Display of combining characters (e.g. Arabic, Thai, Indic family)
  4. Specialized word-break and justification rules (e.g. Thai)
  5. Disallowing illegal character combinations (e.g. Indic family, Thai)
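
Two of the items above, contextual shaping and combining characters, can be made visible with the Unicode character database. This is only an illustrative sketch using Python's standard unicodedata module. The Arabic presentation forms it lists are compatibility characters, used here as an assumption-free way to show that one letter has several shapes; a shaping engine normally selects glyphs in the font rather than swapping code points.

```python
import unicodedata

# Contextual shaping: Arabic letter BEH is stored as a single code point
# (U+0628), but it has a different shape depending on its position in a word.
# Unicode records those shapes as presentation forms U+FE8F..U+FE92.
beh = "\u0628"
forms = {}
for cp in range(0xFE8F, 0xFE93):
    tag, base = unicodedata.decomposition(chr(cp)).split()
    if chr(int(base, 16)) == beh:
        forms[tag] = f"U+{cp:04X}"
print(forms)

# Combining characters: in Thai, the vowel sign SARA I is a nonspacing mark
# (category Mn) that must be drawn above the preceding consonant, not after it.
word = "\u0e01\u0e34"  # KO KAI followed by SARA I
for ch in word:
    print(unicodedata.name(ch), unicodedata.category(ch))
```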

IE’s address bar uses the Windows EDIT control, which uses GDI to display characters. When support for complex scripts is not installed, GDI can only map a given Unicode code point to a glyph and display it; it can’t do context-specific work such as changing the order or shape of characters. All this language-specific complexity is handled by Uniscribe, which is installed when you install support for complex scripts. Inside the browser, of course, we support all these scripts ourselves, regardless of whether the Windows EDIT control support is installed; the problem occurs only in the address bar.
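
To make the difference between stored order and displayed order concrete, here is a deliberately simplified sketch of bidirectional reordering. The function name naive_visual_order is hypothetical, and the logic (reverse each run of strongly right-to-left characters) is an assumption made for illustration. It is not the full Unicode Bidirectional Algorithm and not what Uniscribe actually implements, since it ignores neutrals, digits, and embedding levels.

```python
import unicodedata

def naive_visual_order(text):
    """Reverse each run of strongly RTL characters (bidi class R or AL)."""
    out, run = [], []
    for ch in text:
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            run.append(ch)             # collect the RTL run
        else:
            out.extend(reversed(run))  # flush the run in display order
            run = []
            out.append(ch)
    out.extend(reversed(run))          # flush a run at the end of the text
    return "".join(out)

# The hostname is stored in logical order: aleph, beth, gimel.
logical = "https://\u05d0\u05d1\u05d2.com"
visual = naive_visual_order(logical)

# Print code points so the result is readable on any terminal.
print([f"U+{ord(c):04X}" for c in visual[8:11]])
```

Without a reordering step of this kind, the characters are drawn left to right in stored order, which is the incorrect display described above.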

Who does this issue impact?

You can run into Unicode URLs incorrectly displayed in the address bar if all of the following are true:

  1. You are running a non-complex-language build of Windows XP/Server 2003, such as an English build. Uniscribe support is always installed on complex-language builds of Windows XP/Server 2003. Windows Vista always has this support installed.
  2. You have not added support for input in any complex language using the system control panel. Full support is installed when you install support for any complex script input.
  3. You have that particular complex/bidi language in your accepted languages. As I mentioned above, if the language is not added to your accepted languages, you’ll just see punycode in the address bar.

We have not seen any bug reports mentioning this, and we do not think this is an issue that a typical user of complex scripts would run into.
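
The three conditions listed above can be summarized as a small decision model. This is only a sketch of the logic described in this post; the function names, parameters, and return strings are hypothetical and do not correspond to any IE or Windows API.

```python
def has_uniscribe(is_vista, complex_language_build, complex_input_added):
    """Hypothetical check: is complex script support present on this machine?"""
    # Vista and complex-language builds always have it; otherwise it arrives
    # when support for any complex script input is installed.
    return is_vista or complex_language_build or complex_input_added

def address_bar_display(language_accepted, uniscribe_installed):
    """Hypothetical model of what the address bar shows for an RTL hostname."""
    if not language_accepted:
        # Spoofing mitigation: the hostname is shown as punycode.
        return "punycode"
    if uniscribe_installed:
        return "correct RTL order"
    return "incorrect LTR order"

# English build of Windows XP, no complex input added, Hebrew accepted:
print(address_bar_display(True, has_uniscribe(False, False, False)))
```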

How can you fix the issue?

If the description above matches you, or if you just want to play it safe, you should install support for complex script and RTL languages. This not only corrects the IE behavior, but also gives you a more consistent experience when using complex scripts in other components and software running on Windows XP.

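Go to “Control Panel -> Regional and Language Options”, open the Languages tab, and check “Install files for complex script and right-to-left languages (including Thai)”, the setting shown below. You’ll need the Windows XP CD, and you will need to restart.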

[Screenshot: Right-to-Left language settings]

Feedback?

Comments and feedback are welcome! Make sure you try out IE7 Beta 3!

-Vishu Gupta
Developer