…or how the HTML editor handles file encoding.
First, Visual Studio is a Unicode application and even supports Unicode surrogate pairs. Most Web pages, however, are not stored in Unicode. Therefore, when opening a Web page, VS has to figure out how to convert the document to Unicode and how to convert it back on save. Here is how Visual Studio does it:
- First, VS looks for a file signature and/or byte order mark (BOM). If a Unicode or UTF-8 BOM is present, the file is converted according to the BOM type (or not converted at all if it is already in Unicode).
- If the file does not contain a BOM, VS looks at the file type. In an XML file it tries to locate the XML processing instruction (the <?xml ?> declaration at the top of the file) and its encoding attribute. If one is missing, VS assumes UTF-8, since that is the default for XML files.
- If the file is HTML, VS tries to locate the XML processing instruction first (it may be an XHTML file, after all). If the XML PI is not present, VS runs a simple HTML parse to locate a META element such as <meta http-equiv="content-type" content="text/html; charset=windows-1252" /> and extracts the character set information from it.
- If none of the above is found, VS assumes the default OS code page.
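The lookup order above can be sketched in a few lines of Python. This is only an illustration of the sequence, not the actual Visual Studio code: the function name detect_encoding, the regular expressions, the 1024-byte scan window and the cp1252 fallback are all assumptions made for the example.

```python
import codecs
import re

# BOM signatures mapped to Python codec names that consume the BOM on decode.
# Only the UTF-8 and UTF-16 marks are handled here (a simplification).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Hypothetical patterns; a real editor would use a proper parser.
XML_DECL = re.compile(rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._-]+)", re.IGNORECASE
)


def detect_encoding(data, is_xml, default_codepage="cp1252"):
    """Return (encoding_name, source) using the order described in the list."""
    # 1. A BOM wins over everything else.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, "bom"

    head = data[:1024]  # assumed scan window for declarations

    # 2. XML declaration, checked for XML and for HTML (it may be XHTML).
    match = XML_DECL.search(head)
    if match:
        return match.group(1).decode("ascii").lower(), "xml-declaration"
    if is_xml:
        return "utf-8", "xml-default"  # XML without a declaration is UTF-8

    # 3. META element with a charset in an HTML file.
    match = META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower(), "meta"

    # 4. Nothing found: fall back to the (assumed) OS code page.
    return default_codepage, "os-default"
```

For example, detect_encoding(b'<meta http-equiv="content-type" content="text/html; charset=windows-1252" />', False) returns ("windows-1252", "meta"), while a bare b"<html></html>" falls through to ("cp1252", "os-default").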
Pretty much the same happens on file save, when the file has to be converted from Unicode back to its original encoding (or a new one, if you changed the charset or encoding attributes). Now, here is a trick. Although ASPX files are HTML, you won’t find the Charset property in the DOCUMENT properties anymore. You will find it in VS 2003, though. Why is that?
The explanation lies in how the ASP.NET runtime handles file encoding. In fact, the ASP.NET runtime does not pay any attention to the character set specified in the META element. Instead, it looks at the fileEncoding attribute in the <globalization> section of the web.config file. This means that META/CHARSET in ASPX files may confuse the client browser if the charset specified in META is different from what is specified in the responseEncoding attribute of the same section.
Quite a few VS 2003 users were confused when they created a new ASP.NET Web form, entered a new character set in the META element, saved the file, ran it in the browser and saw garbage characters. This was because the ASP.NET runtime ignored the charset specified in the META element and instead used fileEncoding, which is UTF-8 by default. It is difficult to properly adjust the fileEncoding attribute in the web.config file when saving the Web form file, since fileEncoding applies to all files in a folder and settings may be inherited from a parent folder.
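To see why the characters turn into garbage, here is a small Python sketch of the mismatch. It assumes the editor wrote the file in the charset named in META while the runtime read it back as UTF-8; the function name simulate_runtime_read and the sample text are made up for the illustration, and the real runtime may treat invalid bytes differently from Python's "replace" mode.

```python
def simulate_runtime_read(text, saved_as, file_encoding="utf-8"):
    """Encode text the way the editor saved it, then decode it the way a
    reader configured with file_encoding would (hypothetical model)."""
    raw = text.encode(saved_as)
    # errors="replace" substitutes U+FFFD for bytes that are not valid
    # in file_encoding instead of raising an exception.
    return raw.decode(file_encoding, errors="replace")


original = "Привет"

# Saved as windows-1251 because META said so, but read back as UTF-8.
garbled = simulate_runtime_read(original, saved_as="windows-1251")
print(garbled == original)  # False: the Cyrillic text is lost

# When both sides agree, the text survives the round trip.
print(simulate_runtime_read(original, saved_as="utf-8") == original)  # True
```

Plain English text passes through unchanged in either case, which is why the problem often goes unnoticed until someone types the first extended character.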
VS 2005 is different. If the HTML editor is unable to save a file in its original encoding, it automatically switches to UTF-8 for all Web Form type files such as ASPX, ASCX, ASMX, etc. We do not use UTF-8 with a byte order mark, since the BOM is visible in certain text editors and some users may get confused and even delete those strange characters at the beginning of the file. UTF-8 without a BOM is no different from ANSI unless you start using extended characters. It is no bigger than ANSI for English-only text and is able to handle any language. In plain HTML pages and other files, Visual Studio shows a warning when it is unable to save the file using the specified encoding.
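The claims about UTF-8 without a BOM are easy to check byte by byte. The Python sketch below uses cp1252 as a stand-in for "ANSI" (an assumption; the actual ANSI code page depends on the OS locale), and the helper name encode_variants is invented for the example.

```python
import codecs


def encode_variants(text, ansi_codepage="cp1252"):
    """Return the bytes of text in ANSI, UTF-8 without BOM, and UTF-8 with BOM."""
    return {
        "ansi": text.encode(ansi_codepage),
        "utf8": text.encode("utf-8"),
        "utf8_bom": text.encode("utf-8-sig"),  # prepends EF BB BF
    }


english = encode_variants("Hello")
print(english["ansi"] == english["utf8"])  # True: identical for ASCII-only text

# The BOM is three extra bytes at the start of the file...
print(english["utf8_bom"][:3] == codecs.BOM_UTF8)  # True

# ...which an editor that assumes cp1252 displays as three odd characters.
print(codecs.BOM_UTF8.decode("cp1252"))  # ï»¿

# The encodings diverge as soon as an extended character appears.
accented = encode_variants("café")
print(accented["ansi"])  # b'caf\xe9'
print(accented["utf8"])  # b'caf\xc3\xa9'
```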
What if you really want to use a specific encoding? There are a few options. First, you should properly specify the character encoding in the fileEncoding and responseEncoding attributes in the web.config file. Then you can specify the character set either in the XML PI encoding attribute (provided the file is XHTML) or in META/CHARSET. Visual Studio will respect them when opening the file and will convert the file accordingly. On save, VS will detect those attributes again and will correctly convert the file back to its disk format.
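If you go this route, it helps to verify that the settings actually agree. The following Python sketch reads fileEncoding and responseEncoding from a web.config and compares them with a charset taken from the page. It is a rough check under stated assumptions: the helper names are hypothetical, the config is assumed to have no XML namespace, and inheritance from parent folders is not taken into account.

```python
import codecs
import xml.etree.ElementTree as ET


def globalization_settings(web_config_xml):
    """Extract fileEncoding/responseEncoding from a web.config string."""
    root = ET.fromstring(web_config_xml)
    # Take the first <globalization> element anywhere in the document.
    node = next(root.iter("globalization"), None)
    if node is None:
        return {}
    names = ("fileEncoding", "responseEncoding")
    return {name: node.get(name) for name in names if node.get(name)}


def same_encoding(a, b):
    """Compare two charset labels by their canonical Python codec name."""
    return codecs.lookup(a).name == codecs.lookup(b).name


config = """
<configuration>
  <system.web>
    <globalization fileEncoding="windows-1252" responseEncoding="windows-1252" />
  </system.web>
</configuration>
"""

settings = globalization_settings(config)
meta_charset = "windows-1252"  # value found in the page's META element

for name, value in settings.items():
    status = "ok" if same_encoding(value, meta_charset) else "MISMATCH"
    print(name, value, status)
```

Comparing through codecs.lookup means aliases such as "UTF8" and "utf-8" are treated as the same encoding rather than flagged as a mismatch.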
You can also use the Open With… and Save With… operations. In the File Open and File Save As dialogs, the Open and Save buttons have little arrows next to them. You can click the arrow, choose Open With… and select, for example, HTML Editor with encoding. This will bring up a dialog that allows you to pick the encoding manually.
Be very careful not to shoot yourself in the foot: if you have one encoding specified in META/CHARSET, pick a different one in the Open With… dialog, and have yet another listed in the fileEncoding attribute in the web.config <globalization> section, you can easily corrupt the file permanently.