So you want to save some data and don’t know which Encoding to use. My biggest suggestion is please do NOT use Encoding.Default.
Huh? That can’t be right.
You heard me right, please don’t use Encoding.Default. Encoding.Default sounds like the right thing to do (after all, it does say “Default” right there in its name), it’s pretty easy, and it even seems to work OK, but there are some pretty big gotchas with Encoding.Default.
- Encoding.Default returns the current system code page. If someone changes the code page, or if the saved data is shared with a different machine, it might be decoded as gibberish. If you use Latin-based languages you might not notice this very quickly, but once you start thinking globally you might find all sorts of strange encoding/code page related bugs.
- Since Encoding.Default changes depending on what machine you’re using, you might find users sending data files to other users, who then complain that those files are corrupt. The files probably aren’t corrupt at all; the recipients are probably just using the wrong Encoding to decode them.
- Encoding.Default provides an “ANSI” code page, which can only support a small fraction of the characters in Unicode, particularly for single-byte locales such as those used in the US. That means users can easily enter characters that would be translated to ? or trigger some other fallback behavior when the data is saved.
- Encoding.Default doesn’t provide any information about what Encoding it is. So if you do use it, it would be wise to use some sort of higher level protocol to explicitly declare which Encoding the file is encoded in. Some encodings, like UTF-8 or UTF-16, allow for a byte order mark that can be used as a signature to be fairly certain that the file is decoded correctly.
- Encoding.Default uses best fit behavior, which is bad. See “Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided”.
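To make a couple of these gotchas concrete, here is a minimal sketch. It is only an illustration under some assumptions: ISO-8859-1 (code page 28591) stands in for a machine's single-byte “ANSI” code page because it is available without extra setup, the replacement fallback is requested explicitly rather than relying on whatever Encoding.Default would do, and the class name and sample strings are made up for the example.

```csharp
using System;
using System.Text;

class DefaultEncodingPitfalls
{
    static void Main()
    {
        // Assumption: ISO-8859-1 (code page 28591) stands in for a machine's
        // single-byte "ANSI" code page. The "?" replacement fallbacks are
        // requested explicitly so the lossy behavior is easy to see.
        Encoding singleByte = Encoding.GetEncoding(
            28591,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));

        // Gotcha: characters outside the code page are lost when saving.
        // The sample is "naive" with a diaeresis, followed by three kanji.
        string original = "na\u00EFve \u65E5\u672C\u8A9E";
        string roundTripped = singleByte.GetString(singleByte.GetBytes(original));
        Console.WriteLine(roundTripped == original);          // False
        Console.WriteLine(roundTripped == "na\u00EFve ???");  // True

        // Gotcha: bytes written with one encoding but decoded with another
        // come back as gibberish, even though the file is not corrupt.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("caf\u00E9");
        Console.WriteLine(BitConverter.ToString(utf8Bytes));  // 63-61-66-C3-A9
        string misread = singleByte.GetString(utf8Bytes);
        Console.WriteLine(misread == "caf\u00C3\u00A9");      // True
    }
}
```

The comparisons are printed as True/False rather than echoing the strings, since the console's own encoding could otherwise muddy the result.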
So if you can’t use Encoding.Default, what should you use? I’d recommend UTF-8 (Encoding.UTF8) or UTF-16 (Encoding.Unicode). Either of these supports all of the characters that the framework can handle, so no more unexpected ?s. For English, UTF-8 is effectively as efficient as 1252, but UTF-8 supports unexpected characters that 1252 would drop. UTF-16 is a better choice for most scripts that require double-byte encodings. For most scripts, UTF-8 or UTF-16 has only a slightly larger data size than the more restrictive “default” encoding. The extra confidence that the data is correctly encoded is almost always worth this small cost in data size. Even for a web site on a dial-up modem, the size difference would be a negligible fraction of the total text and graphic sizes.
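As a rough sketch of what that looks like in practice, the example below passes the encoding explicitly on both the write and the read, then compares byte counts for a bit of English text and a bit of Japanese text. The class name, the temporary file, and the sample strings are all hypothetical and only there for illustration.

```csharp
using System;
using System.IO;
using System.Text;

class ExplicitEncodingExample
{
    static void Main()
    {
        // "Hello, " followed by three kanji.
        string text = "Hello, \u65E5\u672C\u8A9E";

        // Hypothetical scratch file, removed again at the end.
        string path = Path.GetTempFileName();
        try
        {
            // Name the encoding on both sides instead of relying on a default.
            // With Encoding.UTF8, File.WriteAllText also emits a byte order mark.
            File.WriteAllText(path, text, Encoding.UTF8);
            string readBack = File.ReadAllText(path, Encoding.UTF8);
            Console.WriteLine(readBack == text); // True
        }
        finally
        {
            File.Delete(path);
        }

        // Size comparison: UTF-8 is compact for English, UTF-16 for kanji.
        string english = "Hello";
        string japanese = "\u65E5\u672C\u8A9E";
        Console.WriteLine(Encoding.UTF8.GetByteCount(english));     // 5
        Console.WriteLine(Encoding.Unicode.GetByteCount(english));  // 10
        Console.WriteLine(Encoding.UTF8.GetByteCount(japanese));    // 9
        Console.WriteLine(Encoding.Unicode.GetByteCount(japanese)); // 6
    }
}
```

Whichever of the two you pick, the important part is that the writer and the reader name the same Encoding, so the result no longer depends on the machine's code page.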