URI Comparison Functions


While investigating URI-parsing issues in various products, I’ve run across many instances of code that incorrectly compares two URIs for equality. In some cases the author writes a custom comparison and seems to be unaware of URI semantics; in other cases the author delegates to a Windows-provided function that doesn’t quite work for the author’s scenario. In this blog post I’ll describe some of the unmanaged URI comparison functions available to Win32 developers, and a few common mistakes to avoid.

RFC 3986, the latest URI RFC, does an excellent job of describing a ladder of URI comparisons. Each rung on the ladder trades comparison speed against the number of false negatives. A false negative in this case means that the URI comparison function says two URIs are not equivalent when they are. However, nowhere on the ladder will a comparison generate a false positive. That is, a URI comparison function should never incorrectly report that two URIs are equivalent.

IUri::IsEqual

The IUri::IsEqual method is the comparison method provided by the IUri family of APIs. IUri::IsEqual can be very fast because it’s based on state parsed out of the string URI when the IUri object is created. This comparison is semantically equivalent to taking two URIs, performing the canonicalization described in the CreateUri documentation, and comparing the results character by character. Knowledge of common schemes such as http and ftp is built in, so in the URI RFC’s terminology this is a Scheme-Based Normalization equality comparison. IUri and the associated APIs are available on systems with IE7, which includes all Vista systems. If this method is available and you don’t need a comparison that takes protocol-specific information into account, it is the preferred method of URI equality comparison.
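
To make the "canonicalize, then compare character by character" idea concrete, here is a rough portable C++ sketch. It is not the algorithm CreateUri implements. NormalizeUriSketch and UrisEqualSketch are hypothetical names, and the sketch applies only a small subset of the RFC 3986 normalizations (lowercase scheme and host, uppercase hex digits in percent-encoded octets), assuming absolute URIs as input.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: a small subset of RFC 3986 normalization.
// Assumes an absolute URI; this is NOT what CreateUri does internally.
std::string NormalizeUriSketch(const std::string& uri) {
    std::string out = uri;
    auto lower = [&out](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            out[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(out[i])));
    };

    // The scheme is everything before the first ':' and is case insensitive.
    const std::size_t colon = out.find(':');
    if (colon == std::string::npos) return out;  // no scheme: leave untouched
    lower(0, colon);

    // An authority, if present, follows "//" and ends at '/', '?' or '#'.
    if (out.compare(colon, 3, "://") == 0) {
        const std::size_t authBegin = colon + 3;
        std::size_t authEnd = out.find_first_of("/?#", authBegin);
        if (authEnd == std::string::npos) authEnd = out.size();

        // Userinfo is case sensitive, so only lowercase what follows the last '@'.
        std::size_t hostBegin = authBegin;
        const std::size_t at = out.rfind('@', authEnd);
        if (at != std::string::npos && at >= authBegin && at < authEnd) hostBegin = at + 1;
        lower(hostBegin, authEnd);
    }

    // Hex digits in percent-encoded octets are case insensitive; use uppercase.
    for (std::size_t i = 0; i + 2 < out.size(); ++i) {
        if (out[i] == '%' &&
            std::isxdigit(static_cast<unsigned char>(out[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(out[i + 2]))) {
            out[i + 1] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[i + 1])));
            out[i + 2] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[i + 2])));
            i += 2;
        }
    }
    return out;
}

// Equality is then an exact, case sensitive comparison of the normalized forms.
bool UrisEqualSketch(const std::string& a, const std::string& b) {
    return NormalizeUriSketch(a) == NormalizeUriSketch(b);
}
```

With this sketch, "HTTP://Example.COM/a%2fb" and "http://example.com/a%2Fb" compare equal, while two URIs that differ only in the case of a path segment do not.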

IMoniker::IsEqual

You can use CreateURLMonikerEx to create an IMoniker object that represents your URI and use IMoniker::IsEqual to compare it with another such IMoniker. The comparison used here is a case sensitive string comparison of the display strings of the IMonikers. These display strings are a normalized form of the URIs passed into CreateURLMonikerEx, so the comparison is a Scheme-Based Normalization equality comparison similar to IUri::IsEqual. The difference between the two is that the UrlMoniker implementation of IMoniker::IsEqual may not perform all of the normalizations that IUri::IsEqual does including percent-encoding normalization. CreateURLMonikerEx and IMoniker::IsEqual have been available since Windows 95 so it is an acceptable alternative to IUri::IsEqual if the IUri APIs are not available to you. If you do use CreateURLMonikerEx be sure to pass the correct flags to avoid creating legacy file URIs.

String Comparison

At one extreme of the URI comparison ladder is the simple but trusty string comparison. A function such as wcscmp can tell you whether two URI strings are character-for-character identical. If IUri::IsEqual is unavailable, or you use a URI normalization function that is specific to your own URI scheme, you can build a URI comparison function around any normalization function. Simply apply your favorite URI normalization function to two string URIs and then use a case sensitive string comparison on the results.
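
The "normalize, then compare" pattern can be sketched in a few lines of portable C++. The names UriNormalizer, UrisEqualAfter and LowercaseSchemeSketch are made up for illustration. The normalizer shown only lowercases the scheme; a real one for your own scheme would do whatever that scheme’s definition calls for.

```cpp
#include <cwctype>
#include <functional>
#include <string>

// Any function that maps a URI string to its normalized form.
using UriNormalizer = std::function<std::wstring(const std::wstring&)>;

// Normalize both inputs, then compare exactly (case sensitive).
bool UrisEqualAfter(const UriNormalizer& normalize,
                    const std::wstring& a, const std::wstring& b) {
    return normalize(a) == normalize(b);
}

// Hypothetical example normalizer: lowercases only the scheme,
// which RFC 3986 defines as case insensitive.
std::wstring LowercaseSchemeSketch(const std::wstring& uri) {
    std::wstring out = uri;
    const std::size_t colon = out.find(L':');
    if (colon == std::wstring::npos) return out;  // no scheme: leave untouched
    for (std::size_t i = 0; i < colon; ++i)
        out[i] = static_cast<wchar_t>(std::towlower(out[i]));
    return out;
}
```

Passing a different normalizer changes which rung of the ladder you are on without touching the comparison itself, and an identity normalizer reduces this to a plain string comparison.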

URIs Are Case Sensitive

There’s not much more to say on this topic that the URI RFC hasn’t already, except to warn against using case insensitive string comparisons. Only the scheme, hostname, and percent-encoded octets of URIs are case insensitive so a case insensitive string comparison is not appropriate in general and will generate false positives when comparing URIs. Note that this is a difference from Windows file paths which are case insensitive throughout.
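
A tiny portable C++ sketch shows the false positive. ExactEqual and CaseInsensitiveEqual are illustrative names, and the second function is shown only to demonstrate the mistake.

```cpp
#include <cctype>
#include <string>

// Exact, case sensitive comparison: the safe default for URI strings.
bool ExactEqual(const std::string& a, const std::string& b) {
    return a == b;
}

// Case insensitive comparison of the whole string. It folds case in the
// path and query as well, where case is significant.
bool CaseInsensitiveEqual(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i]))) {
            return false;
        }
    }
    return true;
}
```

For "http://example.com/Docs/Report" and "http://example.com/docs/report", CaseInsensitiveEqual returns true even though the two URIs may name different resources, while ExactEqual returns false.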

UrlCompare Issues

The function UrlCompare is deceptively named in that it sounds like it compares two URIs for equality. Unfortunately it has a couple of significant issues that result in false positives and as a result you should avoid using it when possible, or at least be aware of and compensate for the cases when it can generate an incorrect result.

Percent-Encoding Makes a Difference

The function takes two input URIs, decodes all percent-encoded octets and compares the results character by character. This is inappropriate because you cannot necessarily decode arbitrary percent-encoded octets in a URI and get an equivalent URI as a result. See the URI RFC’s section on when to encode or decode for more information. For example, the following two non-equivalent URIs would be declared equivalent by UrlCompare:

http://example.com%2Fwww.contoso.com/
http://example.com/www.contoso.com/

Even though the host of the first URI is a sub-domain of contoso.com and the host of the second URI is example.com, they are declared to be equivalent. A worse consequence of the same issue is that because the percent-encoded sequence %00 is decoded to a null terminator, anything following a %00 is ignored by the comparison. For example, the following two non-equivalent URIs would be declared equivalent:

http://%00.example.com/
http://%00www.contoso.com/downloads/details.aspx?foo=baz
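
Both failure modes are easy to reproduce with a portable C++ simulation of the "decode everything, then compare as C strings" approach. This is not UrlCompare’s source; DecodeAllSketch and NaiveDecodedEqual are hypothetical names that mimic the behavior described above.

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Converts one hex digit to its numeric value (caller checks isxdigit first).
int HexValue(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    return std::isdigit(u) ? u - '0' : std::tolower(u) - 'a' + 10;
}

// Decodes every %XX triplet, including delimiters such as %2F and the NUL %00.
std::string DecodeAllSketch(const std::string& uri) {
    std::string out;
    for (std::size_t i = 0; i < uri.size(); ++i) {
        if (uri[i] == '%' && i + 2 < uri.size() &&
            std::isxdigit(static_cast<unsigned char>(uri[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(uri[i + 2]))) {
            out.push_back(static_cast<char>(HexValue(uri[i + 1]) * 16 + HexValue(uri[i + 2])));
            i += 2;
        } else {
            out.push_back(uri[i]);
        }
    }
    return out;
}

// Flawed comparison: strcmp stops at the first NUL, so anything after a
// decoded %00 is never examined.
bool NaiveDecodedEqual(const std::string& a, const std::string& b) {
    return std::strcmp(DecodeAllSketch(a).c_str(), DecodeAllSketch(b).c_str()) == 0;
}
```

Calling NaiveDecodedEqual on either pair of example URIs above returns true, whereas an exact string comparison of the same inputs correctly reports them as different.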

Trailing Slashes Are Important Too

The second issue concerns the function’s fIgnoreSlash parameter, which when set to TRUE tells UrlCompare to ignore any trailing ‘/’ characters on either of the input URIs. This is not appropriate for general use because trailing slashes are significant in URI comparison. For Windows file paths that refer to non-root directories you can generally remove trailing backslashes without changing the path’s semantics, because a file and a directory in the same path cannot have the same name. Accordingly there’s no ambiguity between “C:\Users\” and “C:\Users”. This is not the case for URIs: two URIs that are equal except for a trailing slash on the path may resolve to completely different resources. I should point out too that UrlCompare ignores only a slash that literally trails at the end of the URI string, not a slash at the end of the path component when a query or fragment follows it.
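
Here is a small portable C++ sketch of what ignoring trailing slashes amounts to. StripTrailingSlashes and EqualIgnoringTrailingSlash are hypothetical helpers that imitate the described behavior; they are not UrlCompare itself.

```cpp
#include <string>

// Hypothetical helper: drops every '/' at the very end of the string.
std::string StripTrailingSlashes(std::string uri) {
    while (!uri.empty() && uri.back() == '/') uri.pop_back();
    return uri;
}

// Compares two URIs after discarding slashes at the end of each string.
bool EqualIgnoringTrailingSlash(const std::string& a, const std::string& b) {
    return StripTrailingSlashes(a) == StripTrailingSlashes(b);
}
```

With this sketch, "http://example.com/reports/" and "http://example.com/reports" compare equal even though a server may treat them as different resources. "http://example.com/reports/?id=1" and "http://example.com/reports?id=1" still compare as different, because the slash is at the end of the path but not at the end of the string.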

Do not depend on UrlCompare to correctly say whether two URIs are equivalent. Relying on UrlCompare for general URI comparison could result in security issues.

CoInternetCompareUrl Issues

The function CoInternetCompareUrl delegates its comparison to an interface registered for the URI scheme but unfortunately in some cases CoInternetCompareUrl has the same issues as UrlCompare.

CoInternetCompareUrl delegates to the IInternetProtocolInfo::CompareUrl method of the pluggable protocol registered for the URI scheme of CoInternetCompareUrl’s first parameter. This means that the comparison is as good as the pluggable protocol’s implementer made it. A CompareUrl could be tied to the protocol’s caching implementation and report two URIs as being equal if it knows that the same content will be delivered for two different URIs. On the other hand, a CompareUrl could report that two character for character equal URIs are not equal. That’s a hypothetical example but it illustrates that CompareUrl and consequently CoInternetCompareUrl don’t necessarily follow URI comparison rules.

As noted in the documentation for CompareUrl, CompareUrl may return INET_E_DEFAULT_ACTION to let CoInternetCompareUrl take care of the comparison in a generic fashion. Sadly, the method of comparison used in this case is exactly the same as UrlCompare.

Accordingly, for the same reasons why you shouldn’t use UrlCompare, if you must use the CompareUrl methods defined by pluggable protocol handlers you should use them directly rather than relying on CoInternetCompareUrl. But in general, if you don’t care about pluggable protocol handlers avoid CoInternetCompareUrl and IInternetProtocolInfo::CompareUrl.

Conclusion

To summarize, IUri::IsEqual is a good Scheme-Based Normalization URI comparison function, UrlCompare and CoInternetCompareUrl should be avoided for fear of security bugs, and with no better choices a simple case sensitive string comparison will suffice.

If you know of other URI comparison functions or have other related comments or questions please let us know!

Dave Risney
Software Design Engineer

Comments (30)

  1. Anonymous says:

    I think it’s silly for you to request feedback on this blog, since you get _tons_ of feedback about the issues people really care and want to know about every time you post something new, and for the most part you just flat ignore it.

  2. Anonymous says:

    Thanks for the update on IE8!

    Glad to see all those things are being fixed!

    Oh, and those new features are really cool!

    ZZZZZzzzzzzzzzzzzzz……

    Back to dreamland I guess!

  3. Anonymous says:

    http://blogs.msdn.com/jscript/archive/2007/10/17/performance-issues-with-string-concatenation-in-jscript.aspx

    I think I might have an update for people wanting information about IE8. String concatenation improvements for JScript are mentioned as possibly being part of IE 8:

    "JP, I Need this fix on my box, now!!!

    I wish I could help you. You will have to wait till next release of Internet Explorer."

  4. Anonymous says:

    But how do I use these in IE in my JavaScript application?

    Thanks!

  5. Anonymous says:

    @Jorrit,

    The auto-linker for this blog is garbage… all you can do is drop a straight link in:

    Bill Gates says that IE browser releases will be more frequent, and MS won’t go to dev/null for years in between:

    http://www.crn.com/it-channel/183701230

    Unfortunately, this blog (1 year after IE7 was released) has just proven this statement to be false.

    As for Anphanax and Robbert (and anyone else)… the string concatenation will continue to cause issues until IE8 (or later) is released.

    The best thing you can do (although quite an ugly workaround) is something like this:

    Copy the HTML below into your text editor, save, and view in Firefox, IE, Opera, Safari, etc. It isn’t perfect, but it is much better. (Note that the speed difference in other browsers will be negligible since they don’t suffer from this bug.)

    <!-- saved from url=(0014)about:internet -->
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <title>Fixing IE's String Concatenation Performance</title>
    </head>
    <body>
    <h2>Trying to Fix IE's String Concatenation Performance</h2>
    <hr/>
    <pre>
    Generated Content Goes Here:
    </pre>
    <script>
    // Collects the pieces in an array and joins them once,
    // instead of concatenating strings repeatedly.
    function Str(init){
      this.data = [];
      this.toString = function(){
        return this.data.join('');
      }
      this.add = function(s){
        this.data.push(s);
      }
      if(init){
        this.add(init);
      }
      return this;
    }
    //To use…
    var foo = new Str('Hello World');
    for(var i=0;i<15000;i++){
      foo.add('\n\tAdded item ' + i + ' to the string.');
    }
    document.getElementsByTagName('pre')[0].appendChild( document.createTextNode( foo.toString() ) );
    </script>
    <strong>Note:</strong><em>MSIE does not handle rendering of preformatted text when it is generated.  That's why in IE it does not display the data as a column, but rather as a row.</em>
    </body>
    </html>

  6. Anonymous says:

    There is also the open-source Google URL Parsing and Canonicalization library:

     http://code.google.com/p/google-url/

    It sounds like this library is more thorough than the IURI approach. It will handle IDN, escaping, unescaping, etc. and tries to be compatible with IE. It is on the conservative side, for example, it doesn’t normalize the case of percent-escaped characters that should remain escaped, although this particular behavior may be changed.

  7. Anonymous says:

    @Robbert Broersma, these are all unmanaged win32 Windows functions and aren’t available in JavaScript.  I don’t know of any URI APIs available via JavaScript running in IE.  You will have to use a string comparison or look for some sort of JavaScript library that does this.

  8. Anonymous says:

    I have to recant my statement about IE’s lack of support for Flash’s transparent support via wmode; all versions from IE4+ do in fact support it. However, it’s been tricky figuring out exactly how to get it to work, so for those interested here is the XHTML code that works for me in all versions of IE (and don’t ask me how or what other aspects of the object element and its child elements besides the wmode param matter, because I’m still scratching my head on this one)…

    <object data="example.swf" id="mplayer" type="application/x-shockwave-flash">

    <param name="movie" value="example.swf" />

    <param name="play" value="true" />

    <param name="quality" value="high" />

    <param name="wmode" value="transparent" />

    <a href="http://www.adobe.com/shockwave/download/download.cgi?P1_Prod_Version=ShockwaveFlash" target="_blank" style="height: 32px; width: 32px;" tabindex="3" title="Install Flash Plugin"><img alt="Install Flash Plugin" src="images/interface-plugin-flash.gif" style="height: 32px; width: 32px;" title="Install Flash Plugin" /></a>

    </object>

    @ Ted

    You can simply stick the ‘saved from url’ HTML comment within the head element and it will still work fine, not sure why you have it before the doctype.

    @ IE Team

    I’d really like to see support for the CSS2 outline property in IE8. Using the outline property on button elements instead of the border property in the screenshot below (Opera 9.5 Beta (Build 9613)) I have a :hover and a :focus instance at the top right for two buttons (a local build only where as in Preview III of my site will display the undesired behavior in Opera) my outline approach compensates for Opera’s rendering inconsistency as you can see in the screenshot below…

    http://img147.imageshack.us/img147/6972/opera95betaal3.gif

    Is anyone here making use of the :active pseudo-class? I’ve found out that it works great as an alternative to :focus in IE4+.

    Webkit is still inaccessible…

    http://bugs.webkit.org/show_bug.cgi?id=7138

  9. Anonymous says:

    > Is anyone here making use of the :active pseudo-class? I’ve found out that it works great as an alternative to :focus

    The main benefit of :focus is that it’s activated when you tab-select an element, :active is not an alternative/workaround for :focus.

  10. Anonymous says:

    This is regarding the IE developer toolbar. Please change the shortcut combination for the ruler to something other than Shift+R, because many times when we need to type a capital R, the ruler comes up. I downloaded the final version and this issue is still not fixed.

  11. Anonymous says:

    @ IE8

    It would be nice if focus was (correctly) supported by more than just Gecko browsers. Opera has added support during the first 9.5 and Webkit does support it though it’s mostly useless since it still does not properly support tabindex (no native OSX browser does and Firefox requires you to manually change a setting in about:config). So my point was that you could use active to emulate focus until browser vendors start supporting more of CSS2. If you don’t want to taint in a sense your code for just IE then use IECCSS and limit which versions you apply it to…

    http://www.jabcreations.com/web/ieccss.php

  12. Anonymous says:

    @anonymuos: Thanks for the feedback, and sorry for the inconvenience.  We’ve got a bug on this.

  13. Anonymous says:

    Regarding: http://www.jabcreations.net/

    The goggles, they do nothing!

  14. Anonymous says:

    @EricLaw [MSFT]

    Glad to see this is being tracked.  Is it being tracked in the Web Developer Toolbar DB or in the generic IE DB?

    Either way, can you post a URL to it so that we can vote on it, and track it too.

    I’d also like to enter some issue for the various focus stealing bugs in the tool but I don’t know where the bug tracking page is.

    thanks

  15. Anonymous says:

    hi ie dev team!

    make sure to include lots of unusable filters to replace todays W3C standards in IE8 too, so you just prove once again the ego of microsoft!

  16. Anonymous says:

    This article is really interesting. Thanks for the information.

  17. Anonymous says:

    When is the Title of this post going to be the title of the Blog?  We’ve been waiting patiently, and impatiently for news on this for a year.

    I think rc is right.  You guys have stopped development AGAIN, and won’t start up again until you start to feel your browser market share slide under 50% again.

    It’s a shame. IE has such potential to become a decent browser.

    So sad.

  18. Anonymous says:

    Hi Dave,

    Thanks for the summary of functions and details regarding this issue. Much appreciated.

  19. Anonymous says:

    "Only the scheme, hostname, and percent-encoded octets of URIs are case insensitive".

    Is that really true? RFC 3986 states:

    "When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase.  For example, the URI <http://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>.  The other generic syntax components are assumed to be case-sensitive unless specifically defined otherwise by the scheme (see Section 6.2.3)."

    Windows Search, for example, does case-insensitive comparisons of URLs it receives.  This makes sense for Windows Search since it is dealing with arbitrary schemes and doesn’t own or know about the definition of the scheme, and has to choose a default behavior for what might vary between scheme definitions.

    I would recommend sticking to case-insensitive URLs unless you know the scheme definition, in which case you should definitely try to make comparisons case-sensitive.

  20. Anonymous says:

    @Alan,

    Schemes are described as case-insensitive in RFC 3986 section 3.1 <http://tools.ietf.org/html/rfc3986#section-3.1>:

    "Although schemes are case-insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters."

    And percent-encoded octets in section 2.1 <http://tools.ietf.org/html/rfc3986#section-2.1>:

    "The uppercase hexadecimal digits ‘A’ through ‘F’ are equivalent to the lowercase digits ‘a’ through ‘f’, respectively.  If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent.  For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings."

    If you’d like an accurate URI comparison you should use a case sensitive string comparison over insensitive.  In some specific cases it may make sense to use a case insensitive comparison but those would be exceptions to the rules defined by the URI RFC and if you’re breaking those rules you should be aware of how that comparison is used, what URIs will be incorrectly determined to be equivalent, and what bugs could come from this.  That is, if you’re not sure what you should use, in general, a case sensitive string comparison is the safer choice.

  21. Anonymous says:

    @@howard: IE marketshare hasn’t been below 50% in more than 8 years or so, and they’re waaaay above that number now: http://en.wikipedia.org/wiki/Image:Layout_engine_usage_share.svg  

  22. Anonymous says:

    Here’s a good JavaScript URL parser I’ve used: http://blog.stevenlevithan.com/archives/parseuri

    I think Dojo has one too.

  23. Anonymous says:

    @Jon: "There are lies, damn lies, and statistics" (credit to whomever coined that one)

    Browser usage share stats are always biased, depending on the site used, purpose, OS, etc.

    All I know, is that as a developer, and a web user, I will never use IE again.  Now never is a strong word, and I welcome MS to provide a version worthy of switching back, but right now IE is the ant hill, and Firefox/Safari etc. are fighting over which is K2, and which is Everest.

    IE just isn’t in the game anymore (I mean, let’s be serious for a minute). When was the last time this blog (the official IE Blog!) posted something about IE8!?

    (ie fanboys start your googling, dig it up and post it.)

    whatever the date, it was at least "6 months too late" ago.

    Before it was easy for MS.  Force a browser with the OS to compete with Netscape (which had crashing/memory issues)… it was an easy battle.

    Now IE has to battle browsers that are far superior to itself, that run faster, and better, on the OS that IE should be able to take insider advantage of, but can’t.

    The trick is that when enough developers throw in the towel, or enough shareholders get vocal, they will have to restart development again, and once again, they will be back at the base of K2/Everest, trying to climb their way back up into the game.

    Those of us at the top now, aren’t looking down to help, from those that "said" they were building the next Everest, but didn’t.

    Chao

  24. Anonymous says:

    @Jon

    I’m afraid you can’t view your link in IE.  Microsoft browsers don’t support SVG.

    Bruce

  25. Anonymous says:

    @Bruce

    What are you talking about? I can view it fine and I’m in IE6. After bringing it up in Firefox it doesn’t look much different, either.

  26. Anonymous says:

    @huh

    Wikipedia shows a pixelized PNG version of the actual SVG file when it detects IE.  Click on the link (either the text below the image or the image itself) to go to the actual SVG file and IE will ask "What do you want to do with this?…"

    Opera, Firefox, Safari all show the SVG layout.

    If you see the image in IE then you have a plug-in from Adobe installed.

  27. Anonymous says:

    I’m afraid you can’t view your link in IE.

  28. Anonymous says:

    Having a path to boundless authorities concerned with this is incomparable.

  29. Anonymous says:

    This post is directly related to some work I’m going to be doing so I was happy to stumble across…