Converting text file code pages

I’ve said “use Unicode” a lot, but sometimes there are programs that aren’t doing what you’d expect, and outputting stuff in a different code page.  Additionally, you might sometimes encounter a text file that was created using the system code page of a different machine.  (Like if someone emailed me a txt file from a…
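As a rough illustration of the conversion described above, here is a minimal Python sketch. It assumes you already know which code page the file was written in; the function name `convert_code_page` and the sample bytes are hypothetical and are not taken from the original post.

```python
def convert_code_page(data: bytes, source_encoding: str,
                      target_encoding: str = "utf-8") -> bytes:
    """Decode bytes from a known source code page and re-encode them."""
    # Decoding is the step that needs the *original* code page; guessing
    # wrong silently yields different characters rather than an error.
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


# Hypothetical sample: text saved on a machine whose code page was 1252.
sample = "Grüße".encode("cp1252")
print(convert_code_page(sample, "cp1252").decode("utf-8"))

# The same byte means something else under another code page,
# which is why the source encoding has to be known up front.
print(b"\xe9".decode("cp1252"), b"\xe9".decode("cp1251"))
```

The design point is that the bytes are always routed through Unicode: decode with the source code page, then encode to the target.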


What is Title Case?

Disclaimer: I’m not an English teacher (that’s my mom), so I’m sure my description of title casing in English probably has exceptions/variations. Title casing has an interesting history in computer programming.  Programmers like to use CamelCase to make variable names more readable, and, particularly amongst developers native to some languages, there’s an idea that title…


Writing "fields" of data to an encoded file.

The moral here is “Use Unicode,” so you can skip the details below if you want.  A common problem when storing string data in various fields is how to encode it.  Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file.  However, sometimes data gets…


Why can’t we strip the diacritics?

We have some “best-fit” behavior which we generally consider to be “bad”.  Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don’t lose anything).  Assuming you can’t use Unicode, why is it so bad to just make everything ASCII-like?  Maybe you have a published house or…
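To make the data-loss point concrete, here is a small Python sketch of one common way people strip diacritics (decompose, then drop combining marks). This is only an illustration of the general idea, not the best-fit mapping the post refers to, and the helper name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (lossy!)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two different words collapse to the same string, and the original
# spelling cannot be recovered afterwards.
words = ["résumé", "resume"]
print([strip_diacritics(w) for w in words])

# Letters with no canonical decomposition are left untouched, so the
# result is not even guaranteed to be ASCII.
print(strip_diacritics("ø"))
```

Once the two strings have been folded together, nothing in the output says which one you started with, which is the kind of loss that storing Unicode avoids.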


How do I get HKSCS 2004 characters from Big-5 in .Net?

Well, that’s pretty tricky.  We provide the Microsoft Character Code Conversion Routines For HKSCS-2004 functions, but those are intended for use with unmanaged code. The fundamental problem is that these “HKSCS” characters were in use prior to the assignment of a code point for them in Unicode.  In order to support them, we mapped Big 5…


Please avoid UTF-7

UTF-7 inherently has some of the security issues that concern people about encodings.  For example, by shifting in & out of the base64 mode one can create multiple representations of the same string, enabling spoofing and other problems. UTF-7 is primarily interesting for legacy mail and NNTP applications that don’t properly handle native or MIME encoded…
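The "multiple representations" problem can be seen with a short Python sketch using the standard library's `utf-7` codec. The byte strings below are illustrative examples chosen for this sketch, not taken from the post.

```python
# Two different byte sequences: one writes the angle brackets directly,
# the other shifts into base64 mode ("+...-") to spell the same characters.
direct = b"<script>"
shifted = b"+ADw-script+AD4-"

decoded_direct = direct.decode("utf-7")
decoded_shifted = shifted.decode("utf-7")

# Different bytes on the wire, identical text after decoding.
print(direct != shifted, decoded_direct == decoded_shifted)
```

A filter that inspects only the raw bytes for `<` would miss the shifted form, which is one way this ambiguity can turn into a spoofing or injection problem.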


Some Reasons to Make Your Application Unicode

[Updated Mar 30 2007: Mike pointed out errors which I’ve corrected]  Many applications are “still” ANSI and can’t handle Unicode.  We (Microsoft) have even released non-Unicode applications reasonably recently, even though we should know better.  In particular there are a bunch of good reasons to move your app to Unicode.  I’m rushed so I’m only…


Expected names of Microsoft Windows "ANSI" Code Pages (Encodings)

I was asked about our use of the Windows “ANSI” code page names, as used in things like MIME types, HTTP content-type tags, etc.  Each “code page” has a name that most accurately round trips back to the same code page, which I’ve listed as the “preferred name” below.  Additionally, when you ask for a code page…