URLs in Internet Explorer 7

Internet Explorer 7 includes a new URL handling architecture known internally as CURI.  The new optimized URI functions provide more secure and consistent parsing of URIs to reduce attack surface and mitigate the threat of malicious URIs.

When we designed our security strategy for IE7, malicious URIs were near the top of the list, because secure handling of URIs throughout IE is critical to the security of the system. Hence, a major architectural investment was made in CURI for IE7.

Unlike most of the new features in IE7, CURI works “under the hood” on users’ behalf, and most end users will never notice it.  For the technical readers in the audience, however, the details behind CURI may be of some interest.

Background

Uniform Resource Locators (URLs) are one of the most important and seemingly simple concepts web users encounter.  Almost everyone recognizes a Uniform Resource Locator as the character string which allows the browser to find a website.

Uniform Resource Identifiers (a superset of URLs) were most recently formally specified in RFC3986, the fourth significant revision in the evolving definition of URIs. To quote the RFC,

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource.

Pretty simple stuff, right?  Alas, as usual, the devil is in the details.

Strings vs. Objects

Given the definition of a URI, it seems natural to represent a URI as a character string. And, in fact, simple character strings are how URIs are most often stored and transferred.  For instance, to navigate to a web page, the user types a URI string into the browser’s address bar, or clicks an HTML anchor tag containing an HREF attribute whose value is a string URI.

Unfortunately, there are some downsides to using character strings to store URIs. The biggest problem is that a string is a simple data structure which only holds a sequence of characters, and contains no further information or logic for how that data should be interpreted. Having more information available about a URI is useful for a number of reasons, but security tops that list.

How the browser uses URIs

When you visit a webpage, chances are that there are dozens to hundreds of embedded resources, many of which are addressed via relative URIs. For instance, if you visit https://search.msn.com/default.aspx, the HTML source contains the following image tag:

<img src="/s/hp/bluesky_logo.gif" title="MSN" alt="MSN" height="32" width="81" />

In order for the image to be downloaded, the browser must first combine https://search.msn.com/default.aspx with /s/hp/bluesky_logo.gif to come up with a complete URI which can be downloaded: https://search.msn.com/s/hp/bluesky_logo.gif. In order to combine the base URI with the relative URI, the browser must first crack the base URI and the relative URI to retrieve their respective components.

A URI consists of multiple components, each of which helps the browser and server to retrieve the requested file.

For example, given the URI https://search.msn.com/results.aspx?q=ie7#listings

  • The scheme component is https
  • The hostname component is search.msn.com
  • The path component is /results.aspx
  • The query component is q=ie7
  • The fragment component is listings

Thus, when generating the full URI to the image, IE must combine the scheme and hostname from the base URI (https://search.msn.com) with the path of the relative URI (/s/hp/bluesky_logo.gif). As you might imagine, performing this crack-and-combine process hundreds of times per page is quite inefficient, and introduces the risk of inconsistent parsing or evaluation.
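The crack-and-combine step described above can be sketched with Python's standard urllib.parse module. This is only an illustration of the idea, not how IE implements it; the variable names are made up for the example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Crack the example URI into its components.
cracked = urlsplit("https://search.msn.com/results.aspx?q=ie7#listings")
print(cracked.scheme)    # https
print(cracked.hostname)  # search.msn.com
print(cracked.path)      # /results.aspx
print(cracked.query)     # q=ie7
print(cracked.fragment)  # listings

# Combine by hand: scheme and host come from the base URI,
# everything else comes from the relative URI.
base = urlsplit("https://search.msn.com/default.aspx")
relative = urlsplit("/s/hp/bluesky_logo.gif")
combined = urlunsplit(
    (base.scheme, base.netloc, relative.path, relative.query, relative.fragment)
)

# The library's general-purpose resolver gives the same answer.
resolved = urljoin("https://search.msn.com/default.aspx", "/s/hp/bluesky_logo.gif")
print(combined)  # https://search.msn.com/s/hp/bluesky_logo.gif
```

The hand-rolled combine only covers a relative URI that starts with a slash; other relative forms need the full resolution rules, which is exactly the kind of detail that is easy to get wrong when every caller does its own string handling.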

Security

All web browsers make security decisions based upon URIs.  Many security features, from Security Zones to the JavaScript same-origin policy, depend on the browser being able to consistently evaluate URIs to determine their components, and to compare them to other URIs.

If a bad guy (or gal!) can get a browser to incorrectly or inconsistently crack or combine a URI, the user’s security may be compromised.  Over the years, a significant percentage of browser patches have been issued to address exploits against URI parsing flaws, for instance the CAN-2005-0054 vulnerability in IE, or Opera’s older %2f bug, to name but two of many.

URI-parsing attacks against Internet Explorer typically attempt to trick a security function (like MapURLToZone) into evaluating an exploit URI incorrectly (for instance, by returning the wrong security zone). If the URI is zoned into a more trusted zone than it deserves, the content at that URI might execute with elevated privileges.  Other common attacks attempt to set or steal cookies from other domains, read content from one domain and send it to another, or spoof the user by displaying the URI incorrectly.
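To make the risk concrete, here is a small Python sketch of a trust check done two ways. It is not how MapURLToZone works; the host name intranet.example, the exploit strings, and both function names are hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit

TRUSTED_HOST = "intranet.example"  # hypothetical trusted host


def naive_is_trusted(uri):
    # Treats the URI as a plain string: looks only at its first characters.
    return uri.startswith("https://" + TRUSTED_HOST)


def parsed_is_trusted(uri):
    # Cracks the URI first, then compares the actual host component.
    parts = urlsplit(uri)
    return parts.scheme == "https" and parts.hostname == TRUSTED_HOST


# The trusted name appears as userinfo or as a host prefix, not as the host.
exploits = [
    "https://intranet.example@evil.example/payload",
    "https://intranet.example.evil.example/payload",
]
for uri in exploits:
    print(uri, naive_is_trusted(uri), parsed_is_trusted(uri))

# A legitimate URI that the string check wrongly rejects.
legit = "https://INTRANET.example/page"
print(legit, naive_is_trusted(legit), parsed_is_trusted(legit))
```

The string check accepts both exploit URIs and rejects the legitimate one, while the component-based check gets all three right.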

Difficulties in securely handling string-based URIs are often rooted in the fact that there are an infinite number of possible representations for a single URI, so a simple string comparison isn’t possible. RFC3986 specifies conditions in which a character in a URI is equivalent to a percent-encoded character in the format %HH, where HH is the hexadecimal-formatted integer representation of the character. Equivalence rules vary depending on where a character appears in a URI; for instance, the scheme and hostname are case-insensitive, but paths and queries are not.

The following URIs are all equivalent:

All code paths in the browser must be fully knowledgeable about the rules of URI-parsing in order to correctly evaluate a URI. Any failures could enable an attacker to circumvent security restrictions.

The Solution: CURI

CURI is a lightweight object which holds a single URI in normal form. If the CURI is constructed from a string URI, that string URI is cracked just once when the object is first constructed. After construction, callers may access any of the URI components using members provided by the object. This ensures that URIs are evaluated consistently throughout both security and feature code paths. We’ve re-plumbed Internet Explorer to accept and use CURI objects internally; most of this work has already shipped in Beta-1.
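The parse-once idea can be sketched in Python as a small immutable object. To be clear, this is not the CURI interface: the class ParsedUri, its fields, and its methods are hypothetical and only illustrate the pattern of cracking a string a single time and handing out components afterward.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ParsedUri:
    """Hypothetical stand-in for a URI object; immutable once constructed."""

    scheme: str
    host: str
    path: str
    query: str
    fragment: str

    @classmethod
    def from_string(cls, text):
        # The string is cracked exactly once, here.
        parts = urlsplit(text)
        return cls(
            scheme=parts.scheme,          # urlsplit lowercases the scheme
            host=parts.hostname or "",    # .hostname is lowercased as well
            path=parts.path,
            query=parts.query,
            fragment=parts.fragment,
        )

    def same_scheme_and_host(self, other):
        # Callers compare components, never raw strings.
        return (self.scheme, self.host) == (other.scheme, other.host)


page = ParsedUri.from_string("https://search.msn.com/default.aspx")
image = ParsedUri.from_string("HTTPS://SEARCH.MSN.COM/s/hp/bluesky_logo.gif")
print(page.same_scheme_and_host(image))
print(image.path)
```

Because every caller reads the same pre-parsed components, a security check and a feature code path cannot disagree about what the host of a given URI is.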

The CURI object is available for consumption by external callers like ActiveX controls and Browser Helper Objects; documentation will be provided on MSDN as the CURI class is finalized. It’s worth noting that even external code that does not directly consume CURI objects will benefit from the change, because Unicode strings serialized out of CURI objects will be consistently normalized, decreasing the likelihood of incorrect parsing even outside of IE.

Future Directions: International URIs

One advantage of the centralization provided by the CURI object is that it enables future URI-handling enhancements. In particular, working with international URIs is a key scenario for Internet Explorer 7, and the fully-Unicode CURI object is the keystone for our worldwide support. International URIs are critical to the future of the web as ever more international sites come online and more of the world’s diverse languages appear on the web.

I’m not quite ready to talk about IE7’s support for International Domain Names (IDN) yet, but expect to hear more as Beta-2 approaches. In particular, we’ll be talking about how IDNs work within existing network infrastructure, and how IE7 will mitigate the threat of Unicode homograph attacks.

- EricLaw