The Bizarre and Unhappy Story of ‘file:’ URLs

For my first blog entry, I will start with something I wrote for my team back in 1997 (reedited for this blog) that has been helpful to generations of subsequently bewildered developers:

Go back to 1995 and a small team dreaming an impossible dream of the Internet on Windows!  URLs were young, so young they still hadn't been seen in a television ad.... 

IE1.0 was just a fledgling trying to get some features that Mosaic (or the looming competitor) didn't have.  We shipped with Win95, and a big feature for Win95 was Long File Names, so IE needed to support spaces in names.  Since there was no spec for the "file" protocol scheme (heck, there was practically no spec for any protocol), we added automagical escaping of spaces.  We happily escaped spaces all over the place, and since nobody used our browser it didn't make a whit of difference.

Now IE began to mature and grew into a precocious 3-year-old, and when the powers that be noticed this, they decided that 3 was such a big number that the browser needed a new image.  That image was componentization.  The browser was taken and split into three major parts.  One part became mshtml (for rendering HTML), another shdocvw (for hosting DocObjects), and the third was wininet and urlmon (two binaries that were basically inseparable, for downloading and caching content).  Unfortunately a key component was missing from the platform, so it was replicated into each of the other components.  That was an URL parsing platform.

No one component could agree with the others about the best way to handle the wild URLs that were so plentiful.  So each component came up with slightly different rules, although very logical, based on its personal fantasies of the one true URL.  The fantasies were fairly similar but the devil is in the details and these URLs went through a lot of metamorphoses on each trip down the stack to the wire.  Now each breed of URL had its own issues with these transformations, but in the end all children must be pushed from the nest to learn to fly on their own. 

Enter IE4, Nashville, Win98, the next generation of everything.  And with it the new patron saint of URL parsing.  Now someone undertook the plight of the underrepresented URLs very seriously and wished to give them the respect they deserved.  Some cried with fear, and others with joy, but none could debate the necessity of a savior.  So the little URLs were taken and given a safe place to live and be loved:  shlwapi.dll.  And the saint went amongst the browser and the browser's friends and said, "Let no one but the chosen ones touch or talk or molest these URLs, and there will be a peace among us."  Some resisted, but they were put down as heretics should be and a peace came upon the land.  Except for the one abused problem URL:  file:.

No one had been quite as abused as the little file: URL.  This URL was special: because we had always used files and DOS paths (and no one at the time knew about path canonicalization attacks), everyone was quite sure what they looked like, acted like, and even tasted like.  It didn't help that the file: protocol remained in RFC limbo as a platform/OS specific protocol.  So the browser and the browser's little friends would take turns dressing a DOS path like an URL in a pink bunny suit and undressing the URL with a pair of rusty scissors, pretending it was the same DOS path they started with.  Only the simplest of URLs was able to withstand this abuse, and it soon became clear that something would have to be done, lest the little file: URLs go off on their own and be lost forever.

After much praying and meditation and consultation, it was finally decided that file: URLs must have two forms, one being the well-formed, easily communicable URL form and the other a legacy mutant form of a DOS path.  These two forms had different purposes, and soon different personalities.  The URL form lived happily inside the browser, but any time the browser needed to talk about the file: URL with its friends, it would clone and mutate the URL into its ugly half-DOS form.

What this MEANS:

There are two kinds of file: URLs.  The first is the well-formed URL style.  This form is allowed query strings, fragment IDs, escape sequences and all the other goodies that URLs are supposed to support.  We call this a healthy file: URL.  The other kind is basically a DOS path with "file://" stuck on the front.  We use this for legacy communication with outsiders only.  We call this the legacy file: URL.  Examples:

DOSPATH:         c:\windows\My Documents 100%20\foo.txt
HEALTHY:         file:///c:/windows/My%20Documents%20100%2520/foo.txt
LEGACY:          file://c:\windows\My Documents 100%20\foo.txt

DOSPATH:         \\server\share\My Documents 100%20\foo.txt
HEALTHY:         file://server/share/My%20Documents%20100%2520/foo.txt
LEGACY:          file://\\server\share\My Documents 100%20\foo.txt
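The DOSPATH-to-HEALTHY mapping in these examples can be sketched in portable C.  This is only an illustration under stated assumptions, not how shlwapi does it: the helper name dos_path_to_healthy_url is made up for this sketch, and the set of characters it escapes (space, %, # and ?) is an assumed minimum rather than the real rule set.  On Windows, the shlwapi call for this direction is UrlCreateFromPath.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a shlwapi API): builds the "healthy" file: URL
 * for a DOS or UNC path. Returns 0 on success, -1 if the buffer is too small. */
static int dos_path_to_healthy_url(const char *path, char *out, size_t cch)
{
    static const char hex[] = "0123456789ABCDEF";
    const char *prefix;
    size_t n;

    assert(path != NULL && out != NULL);

    if (path[0] == '\\' && path[1] == '\\') {
        prefix = "file://";       /* UNC: the server name becomes the host */
        path += 2;
    } else {
        prefix = "file:///";      /* drive path: empty host, then the path */
    }
    n = strlen(prefix);
    if (n >= cch)
        return -1;
    memcpy(out, prefix, n);

    for (; *path != '\0'; ++path) {
        unsigned char c = (unsigned char)*path;
        if (c == '\\') {
            if (n + 1 >= cch) return -1;
            out[n++] = '/';                 /* backslash becomes forward slash */
        } else if (c == ' ' || c == '%' || c == '#' || c == '?') {
            /* Assumed minimal escape set; '%' must be escaped so that a
             * literal "%20" in a file name survives as "%2520". */
            if (n + 3 >= cch) return -1;
            out[n++] = '%';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0F];
        } else {
            if (n + 1 >= cch) return -1;
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
    return 0;
}
```

The detail worth noticing is the '%' case: the folder name in the examples really does contain the characters "%20", so the healthy form has to carry them as "%2520", while the legacy form leaves them alone and so cannot be told apart from an escaped space.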

Now all of the Url* APIs in shlwapi, and thus everyone in IE4, understand the difference between these two types of URLs and act accordingly.  As long as a client of the APIs doesn’t try to access the data directly, there is no issue whatsoever.  In other words, as long as you don’t have code anywhere that looks like this:

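LPTSTR GetFilePath(LPTSTR pszUrl)
{
     // The anti-pattern: strip the prefix by hand and hope the rest is a path.
     if (0 == StrCmpN(TEXT("file://"), pszUrl, ARRAYSIZE(TEXT("file://")) - 1))
     {
          return pszUrl + ARRAYSIZE(TEXT("file://")) - 1;
     }
     return NULL;
}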

but instead always call shlwapi like this:

// cchPath is in/out: the buffer size going in, the characters written coming out
PathCreateFromUrl(pszUrl, pszPath, &cchPath, dwFlags);

You will always be happy and golden.  Straying from the path is straying into unknown lands.  There are of course flags that you can pass into some of the APIs that will change the form of the file: URLs when they come out, but I highly recommend not using them for any reason, as that just increases the chances of creating more bizarre, unsupportable behavior.
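To see concretely why the hand-rolled prefix strip goes wrong on a healthy URL, here is a portable C sketch.  Both function names (naive_strip and healthy_url_to_dos_path) are hypothetical, and the decoding rules are a simplified assumption (drop any query or fragment, unescape %XX, flip slashes); this is not the shlwapi implementation, and on Windows the right call remains PathCreateFromUrl.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The shortcut the post warns against: skip "file://" and return the rest. */
static const char *naive_strip(const char *url)
{
    const size_t cchPrefix = strlen("file://");
    assert(url != NULL);
    return strncmp(url, "file://", cchPrefix) == 0 ? url + cchPrefix : NULL;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: turns a healthy file: URL into a DOS or UNC path.
 * Returns 0 on success, -1 on a non-file URL or a too-small buffer. */
static int healthy_url_to_dos_path(const char *url, char *out, size_t cch)
{
    size_t n = 0;
    const char *p;

    assert(url != NULL && out != NULL);
    p = naive_strip(url);
    if (p == NULL || cch == 0)
        return -1;

    if (*p == '/') {
        ++p;                      /* empty host: "file:///c:/..." -> "c:/..." */
    } else {
        if (cch < 3) return -1;   /* host present: start a UNC path */
        out[n++] = '\\';
        out[n++] = '\\';
    }

    /* Stop at a query string or fragment; neither is part of the path. */
    for (; *p != '\0' && *p != '?' && *p != '#'; ++p) {
        char c = *p;
        if (c == '/') {
            c = '\\';
        } else if (c == '%') {
            int hi = hex_value((unsigned char)p[1]);
            int lo = hi < 0 ? -1 : hex_value((unsigned char)p[2]);
            if (lo >= 0) {        /* valid %XX escape: decode it */
                c = (char)(hi * 16 + lo);
                p += 2;
            }
        }
        if (n + 1 >= cch) return -1;
        out[n++] = c;
    }
    out[n] = '\0';
    return 0;
}
```

Fed the healthy example from above, naive_strip hands back "/c:/windows/My%20Documents%20100%2520/foo.txt", which is not a path anything on DOS will open, while the decoding helper recovers the original name with its literal "%20" intact.  The naive version only appears to work because it is usually tested against the legacy form.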

Comments (6)

  1. PatriotB says:

    Neat posting–exactly the kind of IE/shell stuff I dig. Keep it up!

  2. davewood says:

    Thanks Zeke – I have always found your story very helpful {and tragi-comic}. It might be interesting to talk about all the different Url parsing APIs in Microsoft as it’s pretty confusing to decide whether to use shlwapi Url* functions, or something like urlmon CoInternetParseUrl, or something like wininet InternetCrackUrl {there are others as well}.

  3. IEBlog says:

    While working on IE7 application compatibility, we’ve seen many cases of interesting and strange invalid…

  4. IEBlog says:

    Invalid file URIs are among the most common illegal URIs that we were forced to accommodate in IE7. As

  5. marvind says:

    Can you comment on how InternetCanonifyUrl fits in here?

    I was playing with it and it seems to do exactly the wrong thing with respect to file URLs. If I pass in your example of a "healthy" URL above: "file:///c:/foo/bar" it gives back as canonical the legacy version "file://c:\foo\bar". I assume this is for backwards compatibility, but it’s annoying because I want to do the right thing with file URLs.
