Encoding.GetEncodings() has a couple "duplicate" names


The Microsoft.Net v2.0 Encoding.GetEncodings() method returns a complete list of supported encodings, uniquely distinguished by code page.  Note that in general I consider the code page number to be a poor way to exchange code page information, since it's not a standard; for now, however, it does provide a unique ID for .Net Encodings.  By Name there are some duplicates, although they have different DisplayNames.

Encodings 20932 and 51932 both return the Name “euc-jp”, and indeed are identical code pages.  If you ask for “euc-jp”, the framework will return 51932, so if you want to remove one, I’d remove 20932 from any list you make.
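Spotting these collisions is just a group-by-Name over the enumeration. A quick sketch (Python for illustration, since GetEncodings() is a .NET API; the code page/Name pairs are hard-coded from the ones discussed in this post):

```python
from collections import defaultdict

# Code page -> Name pairs taken from this post; a real .NET program
# would enumerate Encoding.GetEncodings() instead.
encodings = {
    20932: "euc-jp",
    51932: "euc-jp",
    50220: "iso-2022-jp",
    50222: "iso-2022-jp",
    65001: "utf-8",
}

# Group code pages by Name; any Name with more than one code page
# is one of the "duplicates" described above.
by_name = defaultdict(list)
for code_page, name in encodings.items():
    by_name[name].append(code_page)

duplicates = {name: pages for name, pages in by_name.items() if len(pages) > 1}
print(duplicates)
# {'euc-jp': [20932, 51932], 'iso-2022-jp': [50220, 50222]}
```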

UPDATE – 11/29/2006 (snow day in Redmond).  Actually it was pointed out that 51932 doesn’t work in native windows APIs, so you’d have to pick 20932 for native applications and 51932 for .Net applications (so that it would round trip in .Net).

50220 and 50222 also return “iso-2022-jp” for their Name.  If you ask for “iso-2022-jp”, you’ll end up with 50220, so I’d remove 50222 from any list of encodings.  Which one you should prefer depends on how you want half-width katakana treated.

Unlike the euc-jp encodings, the 50220 and 50222 encodings are slightly different.  When encoding, 50220 will convert half-width katakana to full width, while 50222 will use a shift-in/shift-out sequence to encode half-width katakana.  The DisplayName for 50222 is “Japanese (JIS-Allow 1 byte Kana – SO/SI)” to distinguish it from 50220’s “Japanese (JIS)”, even though they have the same iso-2022-jp Name.
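The fold-to-full-width behavior of 50220 can be illustrated outside of .NET: it corresponds roughly to Unicode compatibility normalization (NFKC), which maps half-width katakana to their full-width forms.  A Python sketch, with Python’s iso2022_jp codec standing in for the .NET encoding (NFKC is an analogy here, not 50220’s exact mapping table):

```python
import unicodedata

halfwidth = "ｶﾀｶﾅ"  # half-width katakana

# Fold half-width katakana to full width, roughly what code page 50220
# does before encoding (NFKC is a stand-in; 50220's exact mapping may differ).
fullwidth = unicodedata.normalize("NFKC", halfwidth)
print(fullwidth)  # カタカナ

# The full-width forms are representable in plain iso-2022-jp; the stream
# switches to JIS X 0208 with the ESC $ B escape sequence.
encoded = fullwidth.encode("iso2022_jp")
print(encoded)
```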


Comments (4)

  1. I was asked about our use of the windows "ansi" code page names, as used in things like MIME types, http

  2. MSDNArchive says:

    You said

    "I consider the code page number to be a poor way to exchange code page information…"

    but you don’t go on to explain what IS a good way to exchange encoding (code page) information.

    Perhaps there is value for .NET to define a coherent naming system of its own, unless there is already a comprehensive international standard with the properties we need.

    What we want is a canonical way to exchange encoding information that is unambiguous and unique. We want something that isn’t full of aliases. Code page numbers do seem to provide many of the properties we want from an encoding’s identifier, aside from mnemonic or descriptive value. ‘mnemonic’ and ‘descriptive’ are human interface, so they’re going to be dependent on the language of the reader/speaker.

  3. MSDNArchive says:

    Ok, I see the point in not attempting to make sense of the encoding mess. Through pain, force people to move to Unicode. There is tremendous value in helping people move into Unicode by providing a good way to tag data for conversion.

  4. shawnste says:

    The IANA registry of names is a good starting place for tagging code pages, which is a reasonable standard.

    Unfortunately its adoption isn’t particularly consistent.  On different platforms, code pages with the same "name" may behave differently due to varying interpretations of the standards, which leads to additional complications.

    See "What’s My Encoding Called" http://blogs.msdn.com/shawnste/archive/2005/12/28/507816.aspx for problems with naming within .Net.

    Tagging of untagged data is also problematic unless there is some additional information about the source (like you know it all came from a windows-1252 system).

    Additionally, some systems don’t "trust" the tagged code page.  For example, a user creates a web page on a Cyrillic system and then posts it to a shared server that assumes the data is western European.  It works for the user because they test it on the same Cyrillic system, but the web server may tag it as 1252 instead of the true code page because the encoding wasn’t declared in the data.
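    A sketch of that failure mode (Python for illustration; windows-1251 stands in for the Cyrillic system’s code page):

```python
# Text authored on a Cyrillic (windows-1251) system...
data = "Привет".encode("cp1251")

# ...but decoded by a host that assumes western European windows-1252.
# Every byte happens to be valid cp1252, so nothing errors -- the reader
# just sees mojibake instead of the original Russian text.
mangled = data.decode("cp1252")
print(mangled)  # Ïðèâåò
```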