Downloads and International Filenames

A few times a year, I get asked how Internet Explorer handles downloads whose filenames contain non-ASCII characters, because browsers differ in how they treat such files.

The server can suggest the name for a file download in one of two ways:

  1. Explicitly, by including a filename token in the Content-Disposition response header
  2. Implicitly, by not including the filename token and instead simply making the path component of the download's URL contain the desired filename.

The challenge with approach #1 is that the HTTP specification doesn't permit non-ASCII characters to appear within HTTP headers. Early versions of Internet Explorer worked around this limitation by assuming that any non-ASCII characters within HTTP headers were encoded using the local system's Windows codepage, a not unreasonable assumption at the time. However, as the Internet has grown increasingly multilingual, it has become more and more likely that a user will encounter file downloads whose names cannot be represented using characters from their own local codepage.

For instance, consider the case of a file delivered with a filename specified using characters in the server's codepage (Windows-1251 Cyrillic):

Content-Disposition: attachment; filename="Текстовый документ док.doc"

When a user on a client configured to use that codepage attempts to download the file, the filename is displayed correctly:

[Screenshot: download UI showing the correct Cyrillic filename]

When a user on a client configured for a different codepage (Windows-1252 Western European) attempts to download the file, the filename is corrupted:

[Screenshot: download UI showing a corrupted Cyrillic filename]

If the server is reconfigured to send the filename using raw UTF-8 bytes, the filename remains corrupted, because the client interprets those bytes using the system codepage:

[Screenshot: download UI showing a corrupted Cyrillic filename]
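The corruption in the two failing cases can be reproduced outside the browser. What follows is a minimal Python sketch, not a description of IE's internals: it assumes the client simply decodes the header bytes with its system codepage, and it uses Python's cp1251 and cp1252 codecs as stand-ins for the Windows codepages. The variable names are made up for illustration.

```python
name = "Текстовый документ док.doc"

# The server sends Windows-1251 bytes; a Windows-1251 client reads them back intact.
cp1251_bytes = name.encode("cp1251")
same_codepage = cp1251_bytes.decode("cp1251")

# A Windows-1252 client maps the very same bytes to Western European letters.
other_codepage = cp1251_bytes.decode("cp1252")

# Raw UTF-8 bytes fare no better: each Cyrillic letter becomes two bytes, and
# every byte is read as its own character. A few byte values have no meaning
# in Python's cp1252 codec, so they are replaced rather than raising an error.
utf8_as_cp1252 = name.encode("utf-8").decode("cp1252", errors="replace")

print(same_codepage)
print(other_codepage)
print(utf8_as_cp1252)
```

Only the first line printed matches the original name; the other two show the kind of garbling a user on a mismatched codepage would see.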

Internet Explorer permits use of UTF-8 in the filename token only if it is represented in %-escaped-hexadecimal:

Content-Disposition: attachment; filename="%d0%a2%d0%b5%d0%ba%d1%81%d1%82%d0%be%d0%b2%d1%8b%d0%b9 %d0%b4%d0%be%d0%ba%d1%83%d0%bc%d0%b5%d0%bd%d1%82 %d0%b4%d0%be%d0%ba.doc"

When sent this way, systems running in any codepage will display the filename correctly.
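As a rough illustration, a server could produce that header value with Python's standard urllib.parse.quote. This is a sketch under the assumption that spaces are left unescaped, as in the header above; note that quote emits uppercase hex digits, which are equivalent to the lowercase ones shown.

```python
from urllib.parse import quote, unquote

name = "Текстовый документ док.doc"

# Percent-escape the UTF-8 bytes of every non-ASCII character.
# safe=" " keeps spaces literal, mirroring the header shown above.
escaped = quote(name, safe=" ")
header = f'Content-Disposition: attachment; filename="{escaped}"'

# A client that unescapes the value as UTF-8 recovers the original name.
decoded = unquote(escaped)

print(header)
print(decoded)
```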

Unfortunately, while this syntax works in IE and Chrome, it doesn't work in Firefox, Opera, or Safari, which do not unescape the %-escaped characters:

Firefox UI showing escaped UTF-8

RFC2231 proposed a mechanism whereby a server could specify the character set before the token value:

Content-Disposition: attachment; filename="LegacyFileName.doc"; filename*=utf-8''%d0%a2%d0%b5%d0%ba%d1%81%d1%82%d0%be%d0%b2%d1%8b%d0%b9%20%d0%b4%d0%be%d0%ba%d1%83%d0%bc%d0%b5%d0%bd%d1%82%20%d0%b4%d0%be%d0%ba.doc

Unfortunately, Internet Explorer, Safari and Chrome do not support this syntax, and Firefox and Opera will only use the RFC2231 filename* token's value if it appears before the legacy filename token.

Update: IE9 now supports RFC5987/RFC2231 formatted tokens using the UTF-8 character encoding. IE9 prefers the filename* token over the filename token, although, for legacy compatibility, you should send the filename token before the filename* token.
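To make the ordering advice concrete, here is a hedged Python sketch that builds a header with the legacy filename token first and the filename* token second, then picks a name the way a filename*-preferring client might. Both function names are hypothetical, the underscore fallback is just one possible choice, and the regular expressions are a simplification rather than a full header parser.

```python
import re
from typing import Optional
from urllib.parse import quote, unquote


def build_content_disposition(filename: str) -> str:
    """Return an attachment value with an ASCII fallback, then a UTF-8 filename*."""
    # Fallback for clients that ignore filename*: anything that is not
    # printable ASCII, plus the quote and backslash characters, becomes "_".
    fallback = "".join(
        ch if ch.isascii() and ch.isprintable() and ch not in '"\\' else "_"
        for ch in filename
    )
    # Extended form: charset, an empty language tag, then %-escaped UTF-8.
    extended = "UTF-8''" + quote(filename, safe="")
    return f'attachment; filename="{fallback}"; filename*={extended}'


def pick_filename(header: str) -> Optional[str]:
    """Prefer filename* when present; otherwise fall back to the quoted filename."""
    star = re.search(r"filename\*=([\w-]+)'[^']*'([^;\s]+)", header, re.IGNORECASE)
    if star:
        charset, escaped = star.groups()
        return unquote(escaped, encoding=charset)
    plain = re.search(r'filename="([^"]*)"', header, re.IGNORECASE)
    return plain.group(1) if plain else None


name = "Текстовый документ док.doc"
header = build_content_disposition(name)
print(header)
print(pick_filename(header))
```

The header stays pure ASCII on the wire, and a client that understands only the legacy token still gets a usable, if less descriptive, name.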

Notably, if the Content-Disposition specifies that the file is an attachment without specifying a filename:

Content-Disposition: attachment;

… all browsers will attempt to derive the filename from the path component of the URL. So, if the file is downloaded from:

https://www.example.com/%d0%a2%d0%b5%d0%ba%d1%81%d1%82%d0%be%d0%b2%d1%8b%d0%b9%20%d0%b4%d0%be%d0%ba%d1%83%d0%bc%d0%b5%d0%bd%d1%82%20%d0%b4%d0%be%d0%ba.doc

…without a filename token in the Content-Disposition header, the file will be named properly by all browsers.
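The derivation amounts to taking the last path segment and unescaping it as UTF-8. A minimal Python sketch of that idea follows; filename_from_url is a hypothetical helper, and real browsers apply additional sanitization that is not modeled here.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str) -> str:
    """Derive a download name from the last segment of the URL path."""
    path = urlsplit(url).path
    # Take the last segment while it is still escaped, so an escaped slash
    # cannot be mistaken for a path separator, then decode it as UTF-8.
    return unquote(basename(path))


url = (
    "https://www.example.com/"
    "%d0%a2%d0%b5%d0%ba%d1%81%d1%82%d0%be%d0%b2%d1%8b%d0%b9%20"
    "%d0%b4%d0%be%d0%ba%d1%83%d0%bc%d0%b5%d0%bd%d1%82%20"
    "%d0%b4%d0%be%d0%ba.doc"
)
print(filename_from_url(url))
```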

I've posted a Meddler Script which demonstrates the various mechanisms for naming the file; download it here.

Ũńťīŀ Ņĕxţ Ŧĩmе,

Eric