Brain Dump: International Text

07/13/2012

Note: The “brain dump” series is akin to what the support.microsoft.com team calls “Fast Publish” articles—namely, things that are published quickly, without the usual level of polish, triple-checking, etc. I expect that these posts will contain errors, but I also expect them to be mostly correct. I’m writing these up this way now because they’ve been in my “Important things to write about” queue for ~5 years. Alas, these topics are so broad and intricate that a proper treatment would take far more time than I have available at the moment.

Handling of non-ASCII text is a common source of compatibility and interoperability problems. This post covers a variety of tidbits related to this topic, and it will be expanded (and likely corrected) over time.

RFC2616, which defines HTTP/1.1, suggests that non-ISO-8859-1 text in HTTP headers must be encoded according to the rules of RFC2047, an approach that few web clients implemented. Many clients will instead send or accept raw UTF-8, or bytes encoded using the current system’s ANSI codepage. Character-set mismatches often result in interoperability problems.
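To make the difference concrete, here is a rough sketch in Python (not anything IE itself runs) that contrasts an RFC2047 encoded-word with the raw bytes a client might send instead. The sample value "café" and the choice of windows-1252 as the "ANSI" codepage are assumptions for illustration only.

```python
from email.header import Header, decode_header

text = "café"  # illustrative non-ASCII header value

# RFC2047 "encoded-word": pure ASCII, with the charset named inline.
encoded_word = Header(text, charset="utf-8").encode()

# What many clients send instead: raw bytes in some encoding.
raw_utf8 = text.encode("utf-8")    # b'caf\xc3\xa9'
raw_ansi = text.encode("cp1252")   # b'caf\xe9'

# The encoded-word carries its charset, so no guessing is needed.
decoded_bytes, charset = decode_header(encoded_word)[0]
round_tripped = decoded_bytes.decode(charset)

# Raw bytes carry no label; guessing the wrong charset garbles the text.
mojibake = raw_utf8.decode("cp1252")  # 'cafÃ©'
```

The last line is the typical symptom of a character-set mismatch: UTF-8 bytes read as if they were windows-1252.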

Internet Explorer’s handling of non-ASCII text is partially controlled by these checkboxes in the Advanced tab:

Always show encoded addresses is disabled by default. When enabled, it forces IE to show the raw Punycode in the address bar at all times when viewing an IDN site, even if that site’s IDN URL follows the non-spoofability rules.

Send IDN server names is enabled by default and will force IE to encode hostnames in URLs following the rules of RFC3491 and RFC3492. The user will be shown the URL in the address bar in Unicode form if and only if the URL is deemed non-spoofable. Please see this IEBlog post on the rules of IDN Non-spoofability.

Send IDN server names for Intranet addresses is disabled by default for compatibility with legacy Windows networks that were using UTF-8 to support non-ASCII hostnames. Other browsers, to the best of my knowledge, do not have special handling for Intranet sites, and I believe that current versions of Active Directory and the Windows DNS server support punycoded hostname registration and lookup.

Send UTF-8 URLs is checked by default, but doesn’t behave as broadly as its name implies. This option controls whether certain URL components and headers are sent and interpreted using UTF-8 or the system’s ANSI codepage, but it does not apply to the entire URL.

Show Notification bar for encoded addresses is checked by default. It informs the user that they are seeing punycoded text in the address bar only because the current site’s address, while otherwise following the rules for IDN non-spoofability, uses characters outside of the current user’s configured Accept-Languages. The notification bar allows the user to adjust the configured Accept-Languages using the Internet Control Panel.

Use UTF-8 for mailto links is unchecked by default, but is checked when installing current versions of Outlook. You can learn a lot more about this option in this IEBlog post. The option has been removed for Windows 8 / Internet Explorer 10, and mailto links are always passed to the client application using %-encoded UTF-8.
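As a rough illustration of what the Send IDN server names option implies for a hostname, the sketch below uses Python's built-in "idna" codec, which implements the IDNA 2003 rules (Nameprep from RFC3491 plus Punycode from RFC3492). The hostname is hypothetical, and this is only a model of the transformation, not IE's actual code path.

```python
unicode_host = "bücher.example"  # hypothetical IDN hostname

# With "Send IDN server names" on: the hostname goes out as Punycode.
ascii_host = unicode_host.encode("idna").decode("ascii")

# The address bar may show the Unicode form again if deemed non-spoofable.
display_host = ascii_host.encode("ascii").decode("idna")

# What a legacy intranet setup would have used instead: raw UTF-8 bytes.
legacy_utf8_host = unicode_host.encode("utf-8")
```

Here `ascii_host` is `xn--bcher-kva.example`, while `legacy_utf8_host` contains the non-ASCII bytes `0xC3 0xBC` in place of "ü".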

Submission of text in HTML forms in Internet Explorer is a fascinating and complex topic. In IE8 and earlier, forms were submitted using the encoding of the submitting page by default. If the FORM element on the page declared the Accept-Charset attribute equal to UTF-8 (which is the only supported value) and if the form results contained data that could not be encoded in the page’s encoding, then the form results would be sent as UTF-8. In IE9 standards-mode and later, IE will always encode form results as UTF-8 if the accept-charset="UTF-8" attribute is present.

If your web form contains an INPUT TYPE=HIDDEN element with the name _charset_, this field will be automatically filled with the name of the character set used to encode the form when it is submitted. This helps your server decode the form using the proper encoding.
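On the server side, using the _charset_ field might look something like the following Python sketch. The function name `decode_form` and the UTF-8 fallback are my own assumptions, and the sketch assumes the body arrives as an application/x-www-form-urlencoded string.

```python
import codecs
from urllib.parse import parse_qs

def decode_form(body: str, fallback: str = "utf-8") -> dict:
    """Decode a urlencoded form body using its _charset_ field (hypothetical helper)."""
    # First pass: charset names are plain ASCII, so a lenient parse can read the field.
    probe = parse_qs(body, encoding="ascii", errors="replace")
    charset = probe.get("_charset_", [fallback])[0]
    try:
        codecs.lookup(charset)  # reject names Python does not know
    except LookupError:
        charset = fallback
    # Second pass: decode the %-escaped bytes with the declared charset.
    fields = parse_qs(body, keep_blank_values=True,
                      encoding=charset, errors="replace")
    return {name: values[0] for name, values in fields.items()}

western = decode_form("_charset_=windows-1252&name=caf%E9")
unicode_form = decode_form("_charset_=utf-8&name=caf%C3%A9")
```

Both calls recover the same `name` value even though the bytes on the wire differ, which is exactly what the hidden field makes possible.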

In contrast, it’s not always possible to reliably reconstruct querystrings at the server (no, that was not a typo!), because IE does not pass any state information to the server which would indicate what encoding was used.
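A small Python sketch shows why the server cannot always recover the text: the same raw byte means different characters under different ANSI codepages, and nothing in the request says which one applies. The two codepages chosen here are just examples.

```python
from urllib.parse import unquote_to_bytes

# A query value containing the raw byte 0xE9 (written %-escaped for readability).
raw = unquote_to_bytes("caf%E9")

as_western = raw.decode("cp1252")    # Western European codepage
as_cyrillic = raw.decode("cp1251")   # Cyrillic codepage

# The bytes are not valid UTF-8 either, so that guess fails outright.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

The Western guess yields "café" and the Cyrillic guess yields "cafй"; both are valid decodings of the same bytes.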

URLs in IE may use up to three (!!) different encodings at once: punycode in the hostname, %-escaped UTF-8 for the path, and raw codepaged-ANSI for the query and fragment components. This is clearly a mess, but fixing it to match the IRI specification incurs compatibility costs. (Trust me, we’ve tried!)
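To visualize the mix, the following Python sketch assembles the bytes such a URL would contain. It models only the description above, not IE's real URL code; the helper name, the sample URL, and the windows-1252 codepage are assumptions, and the fragment is left out for brevity.

```python
from urllib.parse import quote

def mixed_encoding_url(host: str, path: str, query: str,
                       ansi_codepage: str = "cp1252") -> bytes:
    """Hypothetical helper: build a URL using three encodings at once."""
    host_part = host.encode("idna")                    # Punycode hostname
    path_part = quote(path, safe="/").encode("ascii")  # %-escaped UTF-8 path
    query_part = query.encode(ansi_codepage)           # raw codepage bytes
    return b"http://" + host_part + path_part + b"?" + query_part

wire = mixed_encoding_url("bücher.example", "/straße", "q=café")
```

The result is `http://xn--bcher-kva.example/stra%C3%9Fe?q=caf` followed by the single raw byte `0xE9`.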

Internet Explorer’s XMLHttpRequest object will not automatically encode your URIs for you (e.g. by %-escaping UTF-8 characters). If you want to send such characters to the server following the rules of IRI, you should encode them before passing them to the open() method, using the encodeURIComponent JavaScript API.
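In the browser you would call encodeURIComponent directly. To show what the escaped result looks like (for instance, when writing a server-side test), here is an approximation in Python; `encode_uri_component` is a hypothetical helper, not a standard library function, and the URL is made up.

```python
from urllib.parse import quote

def encode_uri_component(text: str) -> str:
    """Approximate JavaScript's encodeURIComponent (hypothetical helper)."""
    # quote() always leaves A-Z a-z 0-9 _ . - ~ unescaped; adding !*'()
    # matches the set that encodeURIComponent leaves alone.
    return quote(text, safe="!*'()")

url = "/search?q=" + encode_uri_component("café ツ")
```

Every non-ASCII character is converted to UTF-8 and %-escaped, so `url` becomes `/search?q=caf%C3%A9%20%E3%83%84`.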

If you’re downloading files to IE9+ or other modern browsers, you should use RFC5987 encoding for the Content-Disposition header. If you need to support old versions of IE, the story is more complicated. This IEInternals post explores that topic.
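A minimal Python sketch of building such a header value follows. The helper name and sample filename are assumptions; the set of characters left unescaped is the attr-char set from RFC5987.

```python
from urllib.parse import quote

# Non-alphanumeric characters that RFC5987's attr-char allows unescaped.
ATTR_CHARS = "!#$&+-.^_`|~"

def content_disposition(filename: str) -> str:
    """Hypothetical helper: Content-Disposition value with an RFC5987 filename."""
    encoded = quote(filename, safe=ATTR_CHARS)  # UTF-8, then %-escape
    return "attachment; filename*=UTF-8''" + encoded

header_value = content_disposition("résumé 2012.pdf")
```

This produces `attachment; filename*=UTF-8''r%C3%A9sum%C3%A9%202012.pdf`. Servers that also need to support older browsers often send a plain `filename=` parameter as well, which this sketch does not cover.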

In WordPad (and most RichEdit controls in Windows) you can simply type a four-digit hexadecimal number (e.g. 30C4) and then hit ALT+X to convert that sequence to the corresponding Unicode character (i.e. ツ). Similarly, you can paste a Unicode character into WordPad and hit ALT+X to convert it back to its hexadecimal Unicode value.
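If you want to check the same conversion without WordPad, a couple of lines of Python do it. The function names here are made up for this sketch.

```python
def hex_to_char(hex_digits: str) -> str:
    """Hex code point -> character, like typing the digits then ALT+X."""
    return chr(int(hex_digits, 16))

def char_to_hex(ch: str) -> str:
    """Character -> hex code point (at least four digits), the reverse ALT+X."""
    return format(ord(ch), "04X")

tsu = hex_to_char("30C4")
code = char_to_hex(tsu)
```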

In Windows, encoding of non-ASCII characters in File-scheme URIs (e.g. file://server/path/file.txt) differs from that of other schemes. %-encoded octets in a file URI are always interpreted using the system’s ANSI codepage, not UTF-8. Learn more about this and File URIs in general here.
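The practical effect can be sketched in Python by decoding the same %-encoded path segment two ways. The filename is hypothetical, and windows-1252 stands in for whatever the system's ANSI codepage happens to be.

```python
from urllib.parse import unquote

segment = "caf%E9.txt"  # hypothetical segment from a file:// URI

# How Windows reads it: %E9 is a byte in the system ANSI codepage.
as_ansi = unquote(segment, encoding="cp1252")

# How a UTF-8-only consumer reads it: 0xE9 alone is not valid UTF-8,
# so unquote() substitutes U+FFFD (its default errors="replace").
as_utf8 = unquote(segment)
```

Under the ANSI interpretation the segment is "café.txt"; under UTF-8 the "é" is lost.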

-Eric
