Encodings in Strings are Evil Things (Part 8)

   As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations.  However, most of my “real situations” involve legacy encodings.  So, I need to start cracking on transcoding.

   The first concern is allowing adapters for arbitrary transcodings.  A tricky problem that’s related to transcoding is collation (aka sorting) — most people aren’t aware that sorting strings is often a locale-dependent issue.  This is a localization problem.  Just to make sure that terminology is clear, internationalization (often abbreviated to i18n) is the act of coding a program such that it is entirely independent of location and language; the most classic example of i18n is moving all string literals into a binary resource within an EXE, so that the strings may be changed without modifing the program’s logic.  This is almost always paired with localization (sometimes abbreviated to l10n), which is the act of tailoring an already-internationalized program for a specific language/locale.  Internationalization may be done by any programmer; localization requires translators.

   In the case of sorting, a binary sort is often not enough.  Context is everything!

  • Where do accented characters sort — the same as their base characters, or after?  (For French speakers, accented As come after Z.)
  • What are you sorting for?  (German has a special sorting order for names in phone books!)
  • What about ligatures such as ch or fi?  (Spanish speakers, for example, will sort character sequences starting in “ch” between “c” and “d”, even though they recognize “ch” as two separate characters.)

   For this reason, developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32’s CompareString or LCMapString APIs.  For example:

rmstring<ucs4, bytevector> getfirst( std::list<rmstring<utf8, bytevector> > & lines ) {
    std::sort( lines.begin(), lines.end(), win32_collator( LOCALE_USER_DEFAULT ) );
    return (*lines.begin()).transcode<ucs4, bytevector>();

   This example is a bit contrived — a real example would template the container and output encoding, and make the LCID a parameter with a default argument — but you get the point.  win32_collator, in this case, is a custom predicate for std::sort (see <algorithm>) that converts both strings to UTF-16 and then invokes CompareStringW on them, throwing a missing_symbol exception if there’s a codepoint above 0x10FFFF that UTF-16 can’t contain.  Of course, this will hardly be my primary solution!  More on that later.

   Anyways, similar issues arise for transcoding.  (Not to mention the fact that win32_collator is, in fact, dependent on the ability to transcode, since the Win32 Unicode APIs expect UTF-16 strings.)  So, we must include pluggable transcoders.  So, we change our prototypes from Part 7 to include one more template argument, the transcoding tool:

template <class Engine, class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt, Engine e = Engine() );

template <class Engine, class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<SrcEnc, SrcStore>::transcode( Engine e = Engine(), TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );

   These functions now put off transcoding to the Engine object, whatever that may be.  In the Win32 vein, we could use MultiByteToWideChar and WideCharToMultiByte — but that’s too easy, not to mention very difficult to wrap.  I’d really like to do something that’s solely C++ and entirely based in the Unicode Character Database‘s mappings directory.  There’s a few dilemmas to be solved for that.

   Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of compatibility characters.  Compatibility characters are canonically equivalent to a sequence of one or more other Unicode characters; they are usually placed so that you have a single codepoint that’s equivalent to a character in some older standard.  For example, ISO8859-2 defines 0x5A to be equivalent to a capital letter L with a caron accent (&Lcaron).  The “simple” equivalent of this in Unicode is a capital letter L (U+004C) followed by a combining caron (U+030C); however, Unicode also defines a single pre-combined character, U+013D, that is directly equivalent to those two.  Therefore, almost all legacy encodings thus can have a simple 1:1 function to go up to Unicode, in the form of a lookup table.  (Unfortunately, not all legacy encodings have a complete set of compatibility characters, so a LUT will not work for everything.)  Going back from Unicode to legacy is more problematic, however: we now have two equivalents to a given legacy character.  The most direct solution, it seems, is to generate a finite automata.

   I’ve been working on the DFA for the last few days.  My main concern has been memory efficiency, and I can now get a complete set of typical round-trip encoding data to fit in at under 8K per encoding, which fits nicely in cache.  Obviously, certain ones will be smaller, and certain ones will be larger (in particular KOI8 and other encodings with very large symbol sets).  The DFA solution is very clean though; the legacy-to-Unicode DFA takes in bytes and outputs 32-bit unsigned ints containing codepoints which are then re-encoded, and the Unicode-to-legacy DFA takes in codepoints and outputs bytes.  Legacy-to-legacy transcodes use UCS-4 as an intermediary.

   At this point, I’m now working on a program that reads in a file from MAPPINGS and UnicodeData.txt from the Unicode Character Database and outputs the DFA in C++ format.  I’ll post more when that’s finished.  (I’m writing this entry pre-emptively, as this work-week looks like an absolute killer.)

Comments (8)

  1. Note that UTF-16 can handle supplementary characters as surrogate pairs, and both Windows XP and Server 2003 include weights for these characters where they are defined….

    Also, note that the MultiByteToWideChar API defines MB_PRECOMPOSED and MB_COMPOSITE flags that will let you choose which form you want of those letters that can be either one code point or two….

    You may not want to write them all yourself so quickly (especially since many of the mappings on the Unicode site are out of date!)

  2. Also, you may want to note that some of the most awful bugs that Microsoft has had to deal (like Nimda?) were caused by attempts to do UTF-8 conversions without using the APIs that had the supported fixes to problems in them.

    You may not want to go to far on your own in this direction…. 🙂

  3. Last post, I promise!

    Collation (sorting) is not really localization, we define it here with staff linguists and the help of language experts and dictionaries from around the world.

  4. Ryan Myers says:

    Yep — I’m aware of surrogate pairs, and the fact that not even our own CharNext/CharPrev APIs handle them properly 😛 I actually originally thought that our internal APIs were UCS2, until I read your blog.

    (That, and I thought Nimda’s entry hole was that someone had made an incorrect UTF-8 parser that accepted malformed sequences with incorrect leading bytes, resulting in an incorrect estimation of string length leading to a buffer overflow. All of my encodings that are variable-width support a malformed_data exception that is thrown whenever content doesn’t match up. There’s some good resources out there for this, including .txt files containing invalid UTF-8 sequences to test out a parser, so I’ll be running through those. One of the major differences here compared to std::string or char arrays is that the store objects, rather than the encodings, handle all memory management, for a single point of failure. An encoding DFA simply calls push_back for every byte or codepoint it needs to emit.)

    I’m specifically providing adapters for the Win32 APIs for those of us who are working on Win32 platforms — and as far as internal use goes, this won’t be used on shipping products right now. I intend to have this thoroughly peer reviewed before I even think of /proposing/ that it be used, and I’m sure LCA would want to chip in on it as well since I’ve been writing this in my spare time at home. (If you’re willing to help out with the peer review once the code’s in a more complete state, I’d appreciate your help there :))

    I figured the mappings would be out of date, since most of them have not been updated in ages. My main goal ifor now s to get the DFA logic finished; once that’s done, updating for more recent mappings is just a matter of running a new mapping through the converter.

    Don’t worry, I’m not necessarily trying to reinvent the wheel — at least, not on Win32, and certainly not on shipping code 🙂 This is more of a "climbing the mountain because it’s there" thing, because language is a fundamentally interesting problem to me. The minute I even remotely consider letting this code go into use outside my own projects, I’ll be asking everyone within earshot to review it for me.

  5. Well, they did not make a new UTF-8 parser — they picked up ours, only they didn’t tell us and they didn’t pick up any of the bug fixes or updates. 🙁

    I’ll stay nervous about all the stuff that you are doing, just in case. And I’d love to be on that review list. 🙂

  6. Matt Grosso says:

    good luck with this. I’ve finally reached the point where I couldnt avoid dealing with character set conversions anymore, and now that I’m starting to know what I dont know about characters, I have to say it seems like a bit of a rabbit hole.

    Theres no way to really learn something than to

    implement it, which is why I’ve been tempted for a while write my own linear algebra library, despite the existence of a great many very well peer reviewed alternatives. Dont hold your breathe for that one though.

    if you’re at liberty to release the code under some open source license I’d love to see how you handled some of the issues you raised and compare

    vs the basic_string in gcc.

  7. Ryan Myers says:

    To be blunt, I want to get started using this, so the only conversion I’ve written so far has been the win32<src,tgt> engine, which looks for a win32_tr_info member variable containing a code page and calls MultiByteToWideChar/WideCharToMultiByte to get it up to UTF-16 and back down to the target.

    My current implementation I’m working on is a DFA-based engine; I wrote a dedicated program that takes in a set of known relations and generates a DFA to go from that encoding to UCS4 and visa versa. My main concern there is memory usage, and at this time the DFA table for a typical single-byte char set mapping to accented characters (such as ISO8859-1) fits in under 8K, just enough for data L1. I just have to get a modern set of mappings, then.

    I do plan on releasing the source, but I’m debating what license. My gut instinct is to do BSD, or a modified Boost license; I’ve never been much of a GPL fan. (Most of my previous OSS work, before I went to MSFT, was BSD or Artistic.) Part of it too, of course, is that I would like my library to go through the internal review process so I can use it internally, and there’s no way that’d happen under a viral license.

    BTW: Woot, another ACM member!