Overlong UTF-8 Escapes Bite


Every once in a while a security bug pops up that really piques my interest, and a new directory traversal bug that affects Apache Tomcat (http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-2938) most certainly made me take notice because I haven’t seen this bug type in a lllooonnnggg time.


It caught my eye because of these six little characters:


%c0%ae


Many people think these characters represent a 16-bit Unicode character. Wrong. They are an invalid sequence of bytes that decodes to the ‘.’ (%2e) character; it’s often called an “overlong UTF-8 escape”. You may be wondering why I know this little piece of trivia about UTF-8: IIS4 and IIS5 were bitten by the same class of bug eight years ago, and it was an attack vector for the Nimda worm. The bulletin that fixed the bug is MS00-078.


Thumbing to page 379 of Writing Secure Code 2nd Edition, I am reminded that the canonical form of a UTF-8 character is the shortest byte sequence that can represent that character. Remember, UTF-8 can encode characters wider than 8 bits by using more than one byte. Without going into all the involved bit-manipulation, the correct form for a ‘.’ character is a one-byte escape: %2e, not a two-byte escape: %c0%ae.
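For anyone who does want to see the bit-manipulation, here is a minimal Python sketch of what a lax decoder does with a two-byte sequence. The function names naive_decode_two_byte and is_overlong_two_byte are hypothetical, made up for this illustration; this is not the code from Tomcat or IIS, just the arithmetic behind the two-byte UTF-8 layout (110xxxxx 10yyyyyy).

```python
def naive_decode_two_byte(b1: int, b2: int) -> int:
    """Decode a 2-byte UTF-8 sequence (110xxxxx 10yyyyyy) with no validity checks.

    Hypothetical helper: it mimics a lax decoder that trusts its input.
    """
    # Keep the low 5 bits of the lead byte and the low 6 bits of the trail byte.
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


def is_overlong_two_byte(b1: int, b2: int) -> bool:
    """A 2-byte sequence is overlong if its code point would fit in one byte."""
    return naive_decode_two_byte(b1, b2) < 0x80


code_point = naive_decode_two_byte(0xC0, 0xAE)
print(hex(code_point), repr(chr(code_point)))  # 0x2e '.'
print(is_overlong_two_byte(0xC0, 0xAE))        # True: '.' only needs %2e
print(is_overlong_two_byte(0xC3, 0xA9))        # False: U+00E9 really needs two bytes
```

The lead byte 0xC0 contributes nothing but zero bits, so all the payload comes from the low six bits of 0xAE, which is 0x2E. That is why a filter looking for the literal ‘.’ or %2e never sees it, while a lax decoder further down the stack happily produces one.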


RFC 3629 states that “Implementations of the decoding algorithm MUST protect against decoding invalid sequences.”
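As a rough illustration of that requirement, the sketch below percent-decodes a URL path to raw bytes and then decodes those bytes as strict UTF-8 using only the Python standard library. The helper names decode_path_strictly and is_valid_utf8_path are hypothetical, and the sample paths are made up; a real server would apply the same idea at whatever layer does its URL decoding.

```python
from urllib.parse import unquote_to_bytes


def decode_path_strictly(raw_path: str) -> str:
    """Percent-decode a URL path, then decode the bytes as strict UTF-8.

    Hypothetical helper. Python's UTF-8 codec rejects overlong forms,
    so an invalid sequence raises UnicodeDecodeError instead of
    silently turning into another character.
    """
    data = unquote_to_bytes(raw_path)
    return data.decode("utf-8", errors="strict")


def is_valid_utf8_path(raw_path: str) -> bool:
    """Return True only if the percent-decoded path is well-formed UTF-8."""
    try:
        decode_path_strictly(raw_path)
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8_path("/docs/%c0%ae%c0%ae/secret"))  # False: overlong '.'
print(is_valid_utf8_path("/docs/%2e%2e/secret"))        # True: canonical '..'
print(decode_path_strictly("/docs/%2e%2e/secret"))      # /docs/../secret
```

Note that passing this check only means the bytes are well-formed UTF-8. The second path is valid and still decodes to a traversal attempt, so the usual ‘..’ checks have to run on the decoded, canonical form.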


UrlScan for IIS6 and Request Filtering in IIS7 both detect and reject non-canonical UTF-8 URLs by default.

A patch for Apache Tomcat is available at http://tomcat.apache.org/security.html.

Comments (6)

  1. Ted says:

    Regarding “long time”: canonicalization bugs still exist in Windows, so this is not solely a Tomcat problem.

  2. Nathan_works says:

    Is the problem in Apache, or is it in the 3rd party i18n library they used for translation ? (rolling your own is never a good idea, but take in a 3rd party lib and you assume all their bugs/errors.. Something Larry Osterman talked a good deal about in his threat-modeling set of posts)

  3. michael_HOWARD says:

    Ted, I mean this particular bug type, not canonicalization generically.

  4. michael_HOWARD says:

    Nathan, I don’t know where the issue is – I doubt it’s httpd though.
