The great thing about URL encodings is that there are so many to choose from

The phrase URL encoding appears to mean different things to different people.

First, Tim Berners-Lee says that URLs are encoded by using %xx to encode "dangerous" characters, or to suppress the special meaning that would normally be assigned to characters such as / or ?. For example, the URL http://server/why%3F/?q=bother is a request to the server server with the path /why?/ and with the query string q=bother. Notice that by escaping the question mark, we prevent it from being interpreted as the start of the query portion of the URL.
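Here is a small sketch of that example, using Python's standard urllib.parse module (the choice of Python is mine, purely for illustration). It tries to show why the order of operations matters: split the URL into its parts first, and decode each part afterward.

```python
from urllib.parse import urlsplit, unquote

url = "http://server/why%3F/?q=bother"

# Split first, while the escaped question mark is still inert.
parts = urlsplit(url)
raw_path = parts.path              # "/why%3F/"
query = parts.query                # "q=bother"

# Only then decode the individual component.
decoded_path = unquote(raw_path)   # "/why?/"

# Decoding before splitting gets it wrong: the formerly escaped "?"
# is now mistaken for the start of the query.
wrong = urlsplit(unquote(url))
wrong_path = wrong.path            # "/why"
wrong_query = wrong.query          # "/?q=bother"
```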

Now, it so happens that when a form is submitted via GET, the contents of the form are encoded (by default) into the query according to a set of rules laid out in the HTML 4.01 specification: The query string takes the basic form of var=value&var=value&.... If a variable name or a value contains a "dangerous" character or a special character like = or &, then it must be %-escaped. For example, co=AT%26T says that the variable co has the value AT&T. Encoding the ampersand prevents it from being interpreted as a separator.
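As a rough illustration of those rules, here is what Python's urllib.parse does with the AT&T example. The second field (q with the value a=b) is something I made up to show the equals sign being escaped as well.

```python
from urllib.parse import urlencode, parse_qs

# Form fields as name/value pairs; "q" is a made-up second field.
form = {"co": "AT&T", "q": "a=b"}

# urlencode escapes the & and = inside the values, then joins the
# pairs with the real separators.
query = urlencode(form)            # "co=AT%26T&q=a%3Db"

# Parsing reverses the process; each name maps to a list of values.
decoded = parse_qs(query)          # {"co": ["AT&T"], "q": ["a=b"]}

# Without the escaping, the ampersand is taken as a separator and
# the value is cut short.
broken = parse_qs("co=AT&T")       # {"co": ["AT"]}
```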

And here is the special additional rule that confuses a lot of people: When submitting a form via GET, the form data is encoded into the query portion of a URL, and under the default encoding, the character U+0020 (space) is encoded as U+002B (plus sign). This special use of the plus sign applies only to the query portion of the URL. Sometimes people get confused and think that it applies to URLs in general.


The base URL and fragment (colored in blue) use the %20 sequence to encode the embedded space, whereas the query (colored in green) uses the plus sign.
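To make the distinction concrete, here is a sketch in Python that builds a URL with a space in each part. The host example.com, the field name color, and the value "light blue" are all invented for this illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlunsplit

text = "light blue"

path = "/" + quote(text)               # "/light%20blue"
query = "color=" + quote_plus(text)    # "color=light+blue"
fragment = quote(text)                 # "light%20blue"

url = urlunsplit(("http", "example.com", path, query, fragment))
# "http://example.com/light%20blue?color=light+blue#light%20blue"

# The plus sign means "space" only under the form-data rules.
as_path = unquote("light+blue")        # "light+blue" (plus is literal)
as_form = unquote_plus("light+blue")   # "light blue"

# Which means a literal plus sign in form data must itself be escaped.
literal_plus = quote_plus("1+1")       # "1%2B1"
```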

You'd think that would be the end of the story, but in fact it's just the beginning, because now we get to throw in all sorts of nonstandard URL encoders.

The PHP function urlencode treats the entire string as if it were a value (or variable name) in a query string, encoding spaces as plus signs and being careful to escape all other punctuation. It is not to be confused with rawurlencode, which encodes spaces as %20 instead of as plus signs. (Both functions encode characters like /.)

JScript comes with a whole bucketload of functions for URL encoding. There's escape(), which encodes almost everything but leaves the slash and, bafflingly, the plus sign unencoded. And then there's the encodeURI() function, which leaves a few more characters unencoded (including the colon (U+003A) and the question mark (U+003F)). But wait, there's also encodeURIComponent(), which goes to the effort of encoding slashes too. It's a total mess, but this site tries to make some sense out of the whole thing.
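One way to see the difference between the last two is to approximate them with Python's quote function and two different "safe" sets. The helper names below are hypothetical, and this is only an approximation for ordinary strings based on the character sets in the ECMAScript specification, not a faithful port. I made no attempt to mimic escape().

```python
from urllib.parse import quote

# Characters left unescaped beyond letters, digits, and "-_.~"
# (which quote() never escapes).
COMPONENT_SAFE = "!*'()"
URI_SAFE = COMPONENT_SAFE + ";/?:@&=+$,#"

def encode_uri_component_like(s: str) -> str:
    """Hypothetical stand-in for encodeURIComponent()."""
    return quote(s, safe=COMPONENT_SAFE)

def encode_uri_like(s: str) -> str:
    """Hypothetical stand-in for encodeURI()."""
    return quote(s, safe=URI_SAFE)

sample = "http://server/a b?x=1+1&y=2"

whole = encode_uri_like(sample)
# "http://server/a%20b?x=1+1&y=2"

component = encode_uri_component_like(sample)
# "http%3A%2F%2Fserver%2Fa%20b%3Fx%3D1%2B1%26y%3D2"
```

The first form is meant for a string that is already a complete URL; the second is meant for a single piece that will be pasted into one.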

The ASP.NET function Server.UrlEncode behaves the same way as the PHP urlencode function.

There are probably a dozen other functions which purport to perform some form of URL encoding. You have to read the documentation on each one carefully to see whether it does the type of encoding you want.

But wait, you're not done yet. There are URL encodings which are built on top of the basic URL encoding.

The Punycode encoding is used to encode Unicode characters in domain names, which have an even more limited character set than URLs.
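For a quick look at what that produces, Python ships with "idna" and "punycode" codecs. Note that the built-in "idna" codec follows the older IDNA 2003 rules, and the host name bücher.example is just a sample I picked.

```python
host = "bücher.example"

# The "idna" codec converts each label that needs it and adds
# the "xn--" prefix; pure-ASCII labels pass through unchanged.
ascii_host = host.encode("idna")          # b"xn--bcher-kva.example"

# Decoding brings back the Unicode form.
round_trip = ascii_host.decode("idna")    # "bücher.example"

# The raw Punycode transformation of a single label, without the prefix.
label = "bücher".encode("punycode")       # b"bcher-kva"
```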

When auto-generating a URL from a string, different Web sites use different algorithms. This isn't really an encoding in the URL encoding sense; it's just a convention for generating names for Web pages. The result of these conversion algorithms still needs to be URL encoded.

For example, Wikipedia's URL auto-generation algorithm changes spaces to underscores. It leaves most punctuation marks unchanged, which means that once you've gone through Wikipedia's auto-generation algorithm, you still have to go back and escape all the characters which require escaping according to RFC 3986.
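The two steps can be sketched like this. To be clear, this is not Wikipedia's actual code: the function name and the /wiki/ prefix are my own placeholders, and passing safe="" escapes more aggressively than RFC 3986 strictly requires.

```python
from urllib.parse import quote

def title_to_path(title: str) -> str:
    """Hypothetical helper: naming convention first, URL encoding second."""
    # Step 1: the site's naming convention (spaces become underscores).
    name = title.replace(" ", "_")
    # Step 2: ordinary percent-encoding of whatever punctuation survived.
    return "/wiki/" + quote(name, safe="")

path = title_to_path("What's Up, Doc?")
# "/wiki/What%27s_Up%2C_Doc%3F"
```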

As another example, it is popular with many blog software packages to change spaces to hyphens when auto-generating a URL from the title of a blog post. The handling of special characters varies. Some packages simply omit them; others try to encode them, resulting in a double-encoded string if the encoding uses characters for which RFC 3986 requires encodings!
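Here is a sketch of both approaches and of how the double encoding comes about. The two slug functions are invented for illustration and do not correspond to any particular blog package.

```python
import re
from urllib.parse import quote, unquote

def slug_omit(title: str) -> str:
    """Hypothetical slug rule: keep letters and digits, drop the rest."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

def slug_encode(title: str) -> str:
    """Hypothetical slug rule: hyphenate, then percent-encode specials."""
    return quote(title.lower().replace(" ", "-"), safe="")

title = "100% Pure C++"

omitted = slug_omit(title)        # "100-pure-c"
encoded = slug_encode(title)      # "100%25-pure-c%2B%2B"

# If a later layer URL-encodes the slug again, the percent signs
# produced by the first pass get escaped themselves.
double = quote(encoded, safe="")  # "100%2525-pure-c%252B%252B"

# A single round of decoding no longer recovers the hyphenated title.
once = unquote(double)            # "100%25-pure-c%2B%2B"
```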

So if somebody asks a question about URL encoding, before you answer, make sure you understand what sense of the phrase "URL encoding" is being used.

Comments (14)
  1. James Schend says:

    If anybody reading this hasn’t had the pleasure of doing web development, let me just make it clear that every aspect of web development is exactly like this example! Every single aspect.

    Baffling standards, wrongly-implemented, woefully incomplete for real-world use.

  2. webdev says:

    @James Schend

    Couldn’t agree more. Web development is 95% compensating for bad frameworks/browsers, and 5% being productive/creative.

  3. Last but not least, we have the "it works in my browser so it must be valid" school of thought :)

    A nice and instructive article. Nuff said.

  4. How do you know all this stuff since you do rather low-level C stuff?

    I imagine you needed to do some quick web think and was hit by the pain…

    [Sorry, didn’t realize I was allowed to know about only one thing. -Raymond]
  5. Tom says:

    Douglas Crockford, inventor of JSON: "The Web is broken and we need to fix it.  … the Web is so poorly specified.  The specifications were incomplete and were largely misinterpreted and many of those misinterpretations have become part of the canon."  From his interview in the book Coders at Work.

  6. Anonymous Coward says:

    I’ve done quite a bit of web scripting in my time, and in my opinion the people who came up with that standard, as well as those who followed it and left us with the mess we’re in today, should be shot.

  7. And woe is unto anybody whose users type non-English letters into form text fields. The browser will, in its infinite wisdom, choose some way to convert the non-English text into bytes, percent-encode those bytes, and then happily neglect to tell the server which charset it used.

    The only decoding that seems to work in practice is that if the bytes look like a valid UTF-8 encoding, then it’s very likely UTF-8; otherwise it is probably ISO-8859-1 or possibly Windows-1252, and if the user types one of those (not very common) glyph sequences whose 8859-1 representations happen to be valid UTF-8 encodings of something else, then he may or may not deserve what he gets.

    Of course there are probably other corner cases that this heuristic does not catch…

  8. Random832 says:

    There’s a standard that says it has to be UTF-8 regardless of the encoding being used for content. So any browser still using the system codepage for it is wrong.

    There was also a proposal floating around for a while to use %uXXXX via UTF-16. JavaScript’s escape() does this; I don’t know that anything else does.

  9. Watches says:

    This is why so many security vulnerabilities exist in which protection methods are bypassed by encoding.

  10. Jonathan says:

    I think ASP.NET does %uXXXX too.

  11. jalf says:

    And then of course, there are gems such as the RFC for IRC, which leaves the encoding **entirely unspecified**.

    It’s up to the individual client which encoding they’d like to use, and there is absolutely no mechanism for even telling other clients (or the server) which encoding to expect.

    But yes, URL encodings are fun times too.

  12. Grumpy says:

    I’ve done quite a bit of web scripting in my time, and in my opinion the people who came up with that standard, as well as those who followed it and left us with the mess we’re in today, should be shot.

    Goes for EVERYONE who has ever done work on a "standard". I’ve recently been slimed by a standard for document management and it’s… well… crap. Too much thought, too little reality. These fellers need to get out into the world and face reality once in a while. I realize this may reduce them to a pile of dust but no pain, no gain.

  13. html says:


    This was what motivated the HTML5 standard. The previous committees had abstracted away the simplicity and reality of the web.

  14. It doesn’t help that there are multiple RFCs covering this, and that the common practices have changed over time.  It also doesn’t help that the most popular RFC to follow varies by protocol (i.e. file: vs HTTP).

    The space-to-plus conversion isn’t even part of the protocol RFCs and is instead part of the HTML spec, offering further inconsistency and potential for confusion since it’s only relevant to form submissions.

    Another major trap people fall into is the desire to call a function like UrlEscape and expect it to magically make your URL "right." In reality such a function would truly have to be magical.   Not only do you need to encode each value individually with the proper mechanism, but you need to truly understand your clients and the encoding function(s) you’re relying on.  Woe to thee with the task of consuming URLs from arbitrary sources, then parsing them for display and submission to arbitrary clients.  I had this pleasure during Win7 development and probably jump-started the greying of my hair in the process.    
