[Updated Mar 30 2007: Mike pointed out errors which I’ve corrected]
Many applications are “still” ANSI and can’t handle Unicode. We (Microsoft) have even released non-Unicode applications reasonably recently, even though we should know better. There are a number of good reasons to move your app to Unicode; I’m rushed, so I’m only listing a few here.
- We have been adding many new locales and keyboards, and many of them don’t have code pages. Users in these “Unicode Only” locales have to pick a system code page that only marginally supports their language, if it supports it at all. In these locales your ANSI application will be completely unusable.
- Data passed between ANSI systems is easy to misinterpret if the systems have different code pages. This leads to random data corruption, some of which is unrecoverable.
- ANSI apps don’t support the full range of characters, so users with unique requirements may not be able to enter data completely or correctly.
- Mixed-language environments fail with ANSI-only applications.
- ANSI-only bugs, like “Some keyboards fail with ANSI applications on Windows Vista RTM,” won’t impact your application.
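The code-page misinterpretation problem is easy to reproduce. Here’s a minimal sketch in Python, using Python’s codec names for the Windows code pages (cp932 for Japanese, cp1252 for US/Western European):

```python
# The same bytes mean different things under different ANSI code pages.
# Encode Japanese text with code page 932 (Shift-JIS)...
data = "日本語".encode("cp932")

# ...then decode it as if it came from a system using code page 1252.
garbled = data.decode("cp1252")

print(garbled)               # prints the classic mojibake: “ú–{Œê
print(data.decode("cp932"))  # prints 日本語 — correct only with the right code page
```

Nothing in the byte stream itself says which interpretation is right; that’s exactly why untagged ANSI data corrupts silently.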
We have encountered numerous customer issues that could have been avoided fairly trivially if the applications involved had used Unicode.
- Some popular messaging applications are not Unicode, so users cannot always send or receive messages properly in multilingual environments. This is a common scenario for people working in other countries.
- Most people have seen “gibberish” on web sites due to mistagged data. UTF-8 or UTF-16 would solve most of this confusion.
- Many media tagging systems didn’t originally specify an encoding for metadata, causing metadata to appear corrupted when viewed on other machines. (This is also an example of data that can be very difficult to recover.)
- Data synced to phones and other devices hasn’t always survived intact when part of the chain is ANSI.
- Wireless SSIDs (WLAN network names) don’t specify code pages, so in a foreign airport or other multicultural environment you might see gibberish for the network names when trying to connect.
- Customer names have accents dropped or turned into ? when unexpected code points are encountered. (For example, a user enters their name correctly in a web form, but when a printer merges the data for a magazine subscription or whatever, the non-ANSI/non-ASCII characters get lost.) Some users are very irritated when their name gets misprinted in this manner.
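That last failure mode, accents turning into ?, is just what happens when correctly entered Unicode hits a conversion step whose target encoding can’t represent it. A minimal illustration (here plain ASCII as the narrow target, with the common '?' fallback):

```python
# A name entered correctly as Unicode...
name = "Renée Müller"

# ...loses data when a downstream ANSI step converts it to an encoding
# that can't represent the accented characters.
mangled = name.encode("ascii", errors="replace").decode("ascii")

print(mangled)  # prints: Ren?e M?ller
```

The original data is gone at that point; no amount of downstream cleverness gets the accents back.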
Most of these issues are easily solved by considering the encoding requirements and character repertoires, but they are often overlooked. US developers seem to be particularly susceptible to this design problem, since ANSI is easy to use and their applications still have a large US market even if they only handle ANSI characters.
There are some occasions, primarily backward compatibility or existing protocols that didn’t plan for Unicode, where applications can’t avoid ANSI. In those cases I suggest A) making a plan to remove the back-compat issue or fix the protocol so this problem doesn’t continue for decades, B) using Unicode in the meantime and only converting to the ANSI code page when necessary, and C) tagging the data with the appropriate code page when possible so the receiver has a hope of decoding it properly. It is best to avoid these situations, though, because they invariably have edge cases that are difficult to handle.
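The “Unicode inside, convert and tag at the edge” approach in (B) and (C) can be sketched like this; the helper names and the tagging scheme are hypothetical, just to show the shape of it:

```python
# Keep text as Unicode (str) everywhere inside the app; only convert to a
# legacy code page at the wire/file boundary, and tag the bytes so the
# receiver knows how to decode them. Helper names are hypothetical.

def to_legacy(text: str, codepage: str = "cp1252") -> tuple[bytes, str]:
    """Convert Unicode text to a legacy code page only at the boundary,
    returning the bytes together with a code-page tag."""
    return text.encode(codepage), codepage

def from_legacy(data: bytes, codepage: str) -> str:
    """Decode tagged legacy bytes back to Unicode using the declared tag."""
    return data.decode(codepage)

payload, tag = to_legacy("café")            # bytes and "cp1252" tag travel together
assert from_legacy(payload, tag) == "café"  # round-trips because the tag is honored
```

The point is that the conversion happens in exactly one place, and the code page never has to be guessed on the receiving end.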
Use Unicode 🙂