Converting text file code pages

I’ve said “use Unicode” a lot, but sometimes there are programs that aren’t doing what you’d expect, and outputting stuff in a different code page.  Additionally, you might sometimes encounter a text file that was created using the system code page of a different machine.  (Like if someone emailed me a txt file from a…
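As a rough illustration of the kind of conversion described above, here is a minimal Python sketch. It assumes you already know which code page the file was written in; the helper name `convert_code_page` and the choice of Windows-1252 are made up for the example and are not from the original post.

```python
def convert_code_page(data: bytes, source_code_page: str) -> bytes:
    """Decode bytes from a known code page and re-encode them as UTF-8."""
    # decode() raises UnicodeDecodeError if a byte is not valid in that code page.
    text = data.decode(source_code_page)
    return text.encode("utf-8")


# "café" as it would be stored on a machine whose system code page is Windows-1252.
legacy_bytes = "café".encode("cp1252")
utf8_bytes = convert_code_page(legacy_bytes, "cp1252")

# Decoding the same bytes with the wrong code page (here Windows-1251)
# does not fail; it silently produces different text.
wrong_text = legacy_bytes.decode("cp1251")
```

The point of the last line is that guessing the code page wrong usually gives no error at all, only the wrong characters, which is why the source code page has to be known rather than assumed.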

Unicode 6.0 has a new Indian Rupee Symbol, how do I get it?

Well, you can’t, not yet anyway.  Unicode 6.0 adds the new Indian Rupee Symbol at U+20B9 (see http://www.unicode.org/charts/PDF/U20A0.pdf ), so how do you get it to work?  Unfortunately, you can’t get it to work immediately ;-(.  The problem’s actually really complicated as there are lots of moving parts.  You need a font to display it,…

Thoughts About Email Addresses with EAI (Email Address Internationalization)

The EAI Working Group (http://datatracker.ietf.org/wg/eai/charter/) is making rapid progress toward standardizing Unicode email addresses.  Unicode email addresses are a terrific feature for people in many countries that don’t use Latin/ASCII as a native script.  Ironically, in the US it’s easy to miss the importance of non-ASCII email addresses.  Many other Latin script users may also think…

The Square Boxes in My Blog’s Title

Someone pointed out the boxes in my blog’s title.  That’s a script some fans use for Klingon, but since it’s not in Unicode, you need a pIqaD font to see it correctly.  If you really want to see the square boxes, then grab the pIqaD.ttf font from the .zip in my earlier post: http://blogs.msdn.com/shawnste/archive/2006/02/17/klingon-in-piqad-windows-vista-custom-locale.aspx  You…

UTF-8 usage on web approaching 50%

Google posted an interesting chart:  http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html  I’m sure Bing has similar data, but since Mark already built the chart it was easier for me to link there 🙂 Hopefully this will mean less code page confusion in the browser/server space, which has historically been a problem.  It’s also a sign that new apps should probably…

Most combining characters in a Unicode glyph/character/whatever

Recently on the Unicode list someone asked, basically, what the largest number of combining characters in a single sequence could be.  It’s as many as someone wants to use, though the normalization spec (UAX #15) adds a limit, and the font rendering gets weird.  An interesting example appeared on the list: In Tibetan and Ranjana scripts there…
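To make the "as many as someone wants" point concrete, here is a small Python sketch using the standard `unicodedata` module. The helper `count_marks` and the choice of five acute accents are illustrative assumptions, not something from the list discussion.

```python
import unicodedata


def count_marks(s: str) -> int:
    """Count code points whose general category is a Mark (Mn, Mc or Me)."""
    return sum(1 for ch in s if unicodedata.category(ch).startswith("M"))


# One base letter followed by five COMBINING ACUTE ACCENT (U+0301) marks.
# Nothing in the encoding itself stops this from being fifty or five hundred.
sequence = "a" + "\u0301" * 5
mark_count = count_marks(sequence)

# NFC can fold only the first accent into a precomposed letter (U+00E1);
# the remaining accents stay behind as separate combining characters.
nfc = unicodedata.normalize("NFC", sequence)
```

Counting by general category rather than by combining class is a deliberate choice here, since some marks have a combining class of zero.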

Alternate encoding names recognized by .Net / IE

If you run the sample from http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx then you can get a list of what Microsoft .Net thinks each Encoding/Code Page’s name is.  (WebName is more consistent with what’s used in charset.)  E.g.:

using System;
using System.Text;

public class SamplesEncoding
{
   public static void Main()
   {
      // For every encoding, get the property values.
      foreach( EncodingInfo ei in Encoding.GetEncodings()…

Unicode, IDN (IDNA), EAI (IMA) and Homograph Security

I wrote about IDN & Security before (http://blogs.msdn.com/shawnste/archive/2005/03/03/384692.aspx) but thought I’d share some of my more updated views about security of URLs/IDN/Unicode/Email addresses. People haven’t really bothered much with DNS or character based security when it was limited to ASCII.  I’m not sure if this is because people just didn’t think about it, or if they thought there wasn’t a problem…
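For readers who haven’t seen a homograph up close, here is a small Python sketch. The `script_hints` helper is a crude heuristic invented for this example: it takes the first word of each character’s Unicode name as a stand-in for its script, which is an assumption and not a real Script property lookup. The labels are made up too.

```python
import unicodedata


def script_hints(label: str) -> set:
    """Very rough script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


genuine = "example"
# Same shape on screen in many fonts, but the "a" is CYRILLIC SMALL LETTER A (U+0430).
spoofed = "ex\u0430mple"

genuine_scripts = script_hints(genuine)
spoofed_scripts = script_hints(spoofed)
looks_mixed = len(spoofed_scripts) > 1
```

The two strings compare as different even though they may render identically, and the mixed set of script hints is the sort of signal a checker could flag. A real check would need proper script data and would have to allow legitimate mixed-script labels.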

Writing "fields" of data to an encoded file.

The moral here is “Use Unicode,” so you can skip the details below if you want 🙂 A common problem when storing string data in various fields is how to encode it.  Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file.  However, sometimes data…

Don’t use MB_COMPOSITE, MB_PRECOMPOSED or WC_COMPOSITECHECK

This pretty much demonstrates another reason to Use Unicode, but if you do need to use some non-Unicode encoding until you can convert to Unicode, please don’t use these flags.  MultiByteToWideChar() and WideCharToMultiByte() provide some interesting-sounding flags that are actually useless, slow, badly broken, or far worse.  All of these flags would be expected to behave like…
