URI Comparison Functions

While investigating URI-parsing issues in various products, I’ve run across many instances of code that tries, incorrectly, to compare two URIs for equality. In some cases the author writes their own comparison and seems to be unaware of URI semantics. In other cases the author delegates to a Windows-provided function that doesn’t quite work for the author’s scenario. In this blog post I’ll describe some of the unmanaged URI comparison functions available to Win32 developers, and a few common mistakes to avoid.

RFC 3986, the latest URI RFC, does an excellent job of describing a ladder of URI comparisons. Each rung on the ladder trades comparison speed against the number of false negatives. A false negative in this case means that the URI comparison function says two URIs are not equivalent when they are. However, nowhere on the ladder will a comparison generate a false positive. That is, a URI comparison function should never incorrectly report that two URIs are equivalent.

IUri::IsEqual

The IUri::IsEqual method is the comparison method provided by the IUri family of APIs. IUri::IsEqual can be very fast because it’s based on state parsed out of the string URI when the IUri object was created. This comparison method is semantically equivalent to taking two URIs, performing the canonicalization methods described in the CreateUri documentation, and comparing the results character by character. Knowledge of common schemes such as http and ftp is built in, so in the URI RFC’s terminology this is a Scheme-Based Normalization equality comparison. IUri and its associated APIs are available on systems with IE7, which includes all Vista systems. If this method is available and you don’t need a comparison that takes protocol-specific information into account, then this is the preferred method of URI equality comparison.

IMoniker::IsEqual

You can use CreateURLMonikerEx to create an IMoniker object that represents your URI and use IMoniker::IsEqual to compare it with another such IMoniker. The comparison used here is a case sensitive string comparison of the display strings of the IMonikers. These display strings are a normalized form of the URIs passed into CreateURLMonikerEx, so the comparison is a Scheme-Based Normalization equality comparison similar to IUri::IsEqual. The difference between the two is that the UrlMoniker implementation of IMoniker::IsEqual may not perform all of the normalizations that IUri::IsEqual does, percent-encoding normalization among them. CreateURLMonikerEx and IMoniker::IsEqual have been available since Windows 95, so they are an acceptable alternative to IUri::IsEqual if the IUri APIs are not available to you. If you do use CreateURLMonikerEx, be sure to pass the correct flags to avoid creating legacy file URIs.

String Comparison

At one extreme of the URI comparison ladder is the simple but trusty string comparison. A function such as wcscmp can say whether two URIs are equal, at the cost of false negatives for URIs that differ only in their normalization. If IUri::IsEqual is unavailable, or you use a URI normalization function that is specific to your own URI scheme, you can create a URI comparison function around any normalization function. Simply apply your favorite URI normalization function to two string URIs and then use a case sensitive string comparison on the results.
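As a minimal sketch of the normalize-then-compare pattern in portable C++: the names Normalizer, UppercasePercentEncoding, and UrisEqual are hypothetical and are not Windows APIs. The sample normalizer only uppercases the hex digits of percent-encoded octets, which is one of the case normalizations the URI RFC describes; a real normalizer would likely do more.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>

// Hypothetical type: any function that maps a URI string to its normalized form.
typedef std::function<std::string(const std::string&)> Normalizer;

// Hypothetical normalizer: uppercases the two hex digits of every
// percent-encoded octet ("%3a" becomes "%3A") and changes nothing else.
std::string UppercasePercentEncoding(const std::string& uri) {
    std::string out = uri;
    for (std::string::size_type i = 0; i + 2 < out.size(); ++i) {
        const unsigned char hi = static_cast<unsigned char>(out[i + 1]);
        const unsigned char lo = static_cast<unsigned char>(out[i + 2]);
        if (out[i] == '%' && std::isxdigit(hi) && std::isxdigit(lo)) {
            out[i + 1] = static_cast<char>(std::toupper(hi));
            out[i + 2] = static_cast<char>(std::toupper(lo));
            i += 2;  // skip past the two hex digits just handled
        }
    }
    return out;
}

// Normalize both inputs with the same function, then compare case sensitively.
bool UrisEqual(const std::string& a, const std::string& b,
               const Normalizer& normalize) {
    return normalize(a) == normalize(b);
}

// Usage: the comparison stays case sensitive outside the normalized parts.
void NormalizeThenCompareExamples() {
    assert(UrisEqual("http://example.com/a%3ab", "http://example.com/a%3Ab",
                     UppercasePercentEncoding));
    assert(!UrisEqual("http://example.com/A", "http://example.com/a",
                      UppercasePercentEncoding));
}
```

Passing the normalizer as a parameter keeps the comparison itself trivial, so a scheme-specific normalizer can be swapped in without touching it.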

URIs Are Case Sensitive

There’s not much more to say on this topic that the URI RFC hasn’t already, except to warn against using case insensitive string comparisons. Only the scheme, the hostname, and the hex digits of percent-encoded octets are case insensitive, so a case insensitive string comparison is not appropriate in general and will generate false positives when comparing URIs. Note that this is a difference from Windows file paths, which are case insensitive throughout.
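To make the false positive concrete, here is a small portable C++ sketch. EqualsIgnoreCase is a hypothetical helper written for this illustration and handles ASCII only; the two URIs differ in the case of the path, which is case sensitive.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// ASCII case insensitive equality: the kind of comparison to avoid for URIs.
bool EqualsIgnoreCase(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::string::size_type i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i]))) {
            return false;
        }
    }
    return true;
}

void CaseSensitivityExamples() {
    const std::string upper = "http://example.com/Readme";
    const std::string lower = "http://example.com/readme";
    // False positive: the paths differ, yet the comparison says "equal".
    assert(EqualsIgnoreCase(upper, lower));
    // A case sensitive comparison keeps the two URIs distinct.
    assert(upper != lower);
}
```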

UrlCompare Issues

The function UrlCompare is deceptively named in that it sounds like it compares two URIs for equality. Unfortunately it has a couple of significant issues that result in false positives and as a result you should avoid using it when possible, or at least be aware of and compensate for the cases when it can generate an incorrect result.

Percent-Encoding Makes a Difference

The function takes two input URIs, decodes all percent-encoded octets, and compares the results character by character. This is inappropriate because you cannot necessarily decode arbitrary percent-encoded octets in a URI and get an equivalent URI as a result. See section 2.4 of the URI RFC, “When to Encode or Decode”, for more information. For example, the following two non-equivalent URIs would be declared equivalent by UrlCompare:

https://example.com%2Fwww.contoso.com/
https://example.com/www.contoso.com/

Even though the host of the first URI is a sub-domain of contoso.com and the host of the second URI is example.com, they are declared to be equivalent. A worse consequence of the same issue is that, because the percent-encoded sequence %00 is decoded to a NULL terminator, anything following a %00 is ignored by the comparison. For example, the following two non-equivalent URIs would be declared equivalent:

https://%00.example.com/
https://%00www.contoso.com/downloads/details.aspx?foo=baz
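The behavior described in this section can be modeled in portable C++. This is a sketch of the technique (decode everything, then compare as NUL-terminated strings), not UrlCompare’s actual implementation; HexValue, DecodeAll, and DecodedCStringsEqual are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Value of one hex digit; callers check std::isxdigit first.
int HexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    return std::tolower(static_cast<unsigned char>(c)) - 'a' + 10;
}

// Decodes every %XX sequence, including ones that are unsafe to decode
// such as %2F (a delimiter) and %00 (a NUL character).
std::string DecodeAll(const std::string& uri) {
    std::string out;
    for (std::string::size_type i = 0; i < uri.size(); ++i) {
        if (uri[i] == '%' && i + 2 < uri.size() &&
            std::isxdigit(static_cast<unsigned char>(uri[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(uri[i + 2]))) {
            out.push_back(static_cast<char>(
                HexValue(uri[i + 1]) * 16 + HexValue(uri[i + 2])));
            i += 2;  // skip the two hex digits
        } else {
            out.push_back(uri[i]);
        }
    }
    return out;
}

// Model of the flawed comparison: strcmp stops at the first NUL character.
bool DecodedCStringsEqual(const std::string& a, const std::string& b) {
    return std::strcmp(DecodeAll(a).c_str(), DecodeAll(b).c_str()) == 0;
}

void DecodeThenCompareExamples() {
    // Both pairs from the text are reported equal by this model.
    assert(DecodedCStringsEqual("https://example.com%2Fwww.contoso.com/",
                                "https://example.com/www.contoso.com/"));
    assert(DecodedCStringsEqual(
        "https://%00.example.com/",
        "https://%00www.contoso.com/downloads/details.aspx?foo=baz"));
}
```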

Trailing Slashes Are Important Too

The second issue concerns the function’s fIgnoreSlash parameter, which when set to TRUE tells UrlCompare to ignore any trailing ‘/’ characters on either of the input URIs. This is not appropriate for general URI comparison because trailing slashes in URIs are significant. For Windows file paths that refer to non-root directories you can generally remove trailing backslashes without changing the path’s semantics, because a file and a directory in the same parent directory cannot have the same name. Accordingly there’s no ambiguity between “C:\Users” and “C:\Users\”. This is not the case for URIs. Two URIs that are equal except for a trailing slash on the path may resolve to completely different resources. I should point out too that UrlCompare ignores only a slash at the literal end of the URI string, not a slash at the end of the URI’s path component.
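A portable C++ sketch of what ignoring a trailing slash amounts to follows. The helper names are hypothetical, and the sketch assumes that at most one slash is removed from the literal end of each string; it models the idea rather than UrlCompare itself.

```cpp
#include <cassert>
#include <string>

// Removes a single '/' from the literal end of the string, if present.
// Assumption for this sketch: only one trailing slash is dropped.
std::string StripOneTrailingSlash(std::string uri) {
    if (!uri.empty() && uri.back() == '/') uri.pop_back();
    return uri;
}

bool EqualIgnoringTrailingSlash(const std::string& a, const std::string& b) {
    return StripOneTrailingSlash(a) == StripOneTrailingSlash(b);
}

void TrailingSlashExamples() {
    // Reported equal, although a server may treat these as different resources.
    assert(EqualIgnoringTrailingSlash("http://example.com/docs/",
                                      "http://example.com/docs"));
    // The slash ends the path but not the string, so it is not ignored.
    assert(!EqualIgnoringTrailingSlash("http://example.com/docs/?q=1",
                                       "http://example.com/docs?q=1"));
}
```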

Do not depend on UrlCompare to correctly say whether two URIs are equivalent. Relying on UrlCompare for general URI comparison could result in security issues.

CoInternetCompareUrl Issues

The function CoInternetCompareUrl delegates its comparison to an interface registered for the URI scheme but unfortunately in some cases CoInternetCompareUrl has the same issues as UrlCompare.

CoInternetCompareUrl delegates to the IInternetProtocolInfo::CompareUrl method of the pluggable protocol registered for the URI scheme of CoInternetCompareUrl’s first parameter. This means that the comparison is as good as the pluggable protocol’s implementer made it. A CompareUrl could be tied to the protocol’s caching implementation and report two URIs as being equal if it knows that the same content will be delivered for two different URIs. On the other hand, a CompareUrl could report that two character for character equal URIs are not equal. That’s a hypothetical example but it illustrates that CompareUrl and consequently CoInternetCompareUrl don’t necessarily follow URI comparison rules.

As noted in the documentation for CompareUrl, CompareUrl may return INET_E_DEFAULT_ACTION to let CoInternetCompareUrl take care of the comparison in a generic fashion. Sadly, the method of comparison used in this case is exactly the same as UrlCompare.
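The delegation and fallback just described can be sketched as plain control flow in portable C++. Everything here is a hypothetical stand-in: HandlerResult models the three outcomes a protocol handler can report, and the generic comparison is passed in as a parameter rather than being the real UrlCompare.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-ins for the outcomes a scheme's handler can report.
enum class HandlerResult { Equal, NotEqual, DefaultAction };

typedef std::function<HandlerResult(const std::string&, const std::string&)> Handler;
typedef std::function<bool(const std::string&, const std::string&)> GenericCompare;

// Ask the scheme's handler first; use the generic comparison only when
// the handler declines to answer.
bool CompareViaHandler(const std::string& a, const std::string& b,
                       const Handler& handler, const GenericCompare& generic) {
    switch (handler(a, b)) {
        case HandlerResult::Equal:         return true;
        case HandlerResult::NotEqual:      return false;
        case HandlerResult::DefaultAction: break;
    }
    return generic(a, b);
}

void DelegationExamples() {
    Handler declines = [](const std::string&, const std::string&) {
        return HandlerResult::DefaultAction;
    };
    Handler neverEqual = [](const std::string&, const std::string&) {
        return HandlerResult::NotEqual;
    };
    GenericCompare exact = [](const std::string& x, const std::string& y) {
        return x == y;
    };
    // The handler declines, so the generic comparison decides.
    assert(CompareViaHandler("demo:one", "demo:one", declines, exact));
    // A handler is free to call identical strings unequal.
    assert(!CompareViaHandler("demo:one", "demo:one", neverEqual, exact));
}
```

The result is only as trustworthy as whichever of the two comparisons ends up answering, which is the point of the section.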

Accordingly, for the same reasons that you shouldn’t use UrlCompare, if you must use the CompareUrl methods defined by pluggable protocol handlers you should call them directly rather than relying on CoInternetCompareUrl. In general, if you don’t care about pluggable protocol handlers, avoid both CoInternetCompareUrl and IInternetProtocolInfo::CompareUrl.

Conclusion

To summarize, IUri::IsEqual is a good Scheme-Based Normalization URI comparison function, UrlCompare and CoInternetCompareUrl should be avoided for fear of security bugs, and with no better choices a simple case sensitive string comparison will suffice.

If you know of other URI comparison functions or have other related comments or questions please let us know!

Dave Risney
Software Design Engineer