Encodings in Strings are Evil Things (Part 8)

01/17/2005

As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations. However, most of my "real situations" involve legacy encodings. So, I need to start cracking on transcoding.

The first concern is allowing adapters for arbitrary transcodings. A tricky problem that's related to transcoding is collation (aka sorting) -- most people aren't aware that sorting strings is often a locale-dependent issue. This is a localization problem. Just to make sure that terminology is clear, internationalization (often abbreviated to i18n) is the act of coding a program such that it is entirely independent of location and language; the classic example of i18n is moving all string literals into a binary resource within an EXE, so that the strings may be changed without modifying the program's logic. This is almost always paired with localization (sometimes abbreviated to l10n), which is the act of tailoring an already-internationalized program for a specific language/locale. Internationalization may be done by any programmer; localization requires translators.

In the case of sorting, a binary sort is often not enough. Context is everything!

Where do accented characters sort -- the same as their base characters, or after? (In Swedish, Å, Ä, and Ö are separate letters that come after Z; in French, accents only break ties between otherwise identical words.)
What are you sorting for? (German has a special sorting order for names in phone books!)
What about digraphs and ligatures such as ch or fi? (In traditional Spanish ordering, for example, character sequences starting in "ch" sort between "c" and "d", even though "ch" is written as two separate characters.)
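
To make the "ch" case concrete, here's a minimal sketch of a sorting predicate that treats "ch" as a single collation element between "c" and "d". It assumes lowercase ASCII input only, and the names spanish_keys, traditional_spanish_less and sort_spanish are made up for illustration; a real collator would lean on the platform's locale data instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: map lowercase ASCII text to collation keys in which
// the digraph "ch" is one element sorting after 'c' and before 'd'.
std::vector<int> spanish_keys(const std::string& s) {
    std::vector<int> keys;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == 'c' && i + 1 < s.size() && s[i + 1] == 'h') {
            keys.push_back('c' * 2 + 1);  // falls between 'c'*2 and 'd'*2
            ++i;                          // the 'h' is part of this element
        } else {
            keys.push_back(static_cast<unsigned char>(s[i]) * 2);
        }
    }
    return keys;
}

// Predicate for std::sort: compares key sequences instead of raw bytes.
struct traditional_spanish_less {
    bool operator()(const std::string& a, const std::string& b) const {
        return spanish_keys(a) < spanish_keys(b);
    }
};

// Returns a sorted copy so the caller's container is left alone.
std::vector<std::string> sort_spanish(std::vector<std::string> words) {
    std::sort(words.begin(), words.end(), traditional_spanish_less());
    return words;
}
```

A plain binary sort would put "chico" before "cosa"; with this predicate it lands after "cosa" and before "dedo".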

For this reason, developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32's CompareString or LCMapString APIs. For example:

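rmstring<ucs4, bytevector> getfirst( std::vector<rmstring<utf8, bytevector> > & lines ) {
    // std::sort needs random-access iterators, so the container is a std::vector
    std::sort( lines.begin(), lines.end(), win32_collator( LOCALE_USER_DEFAULT ) );
    // after sorting, the first element is the one that collates lowest
    return lines.front().transcode<ucs4, bytevector>();
}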

This example is a bit contrived -- a real example would template the container and output encoding, and make the LCID a parameter with a default argument -- but you get the point. win32_collator, in this case, is a custom predicate for std::sort (see <algorithm>) that converts both strings to UTF-16 and then invokes CompareStringW on them, throwing a missing_symbol exception if there's a codepoint above 0x10FFFF that UTF-16 can't contain. Of course, this will hardly be my primary solution! More on that later.
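
The UTF-16 conversion step that such a predicate depends on can be sketched on its own. This is only an illustration, not the rmstring code: ucs4_to_utf16 is a hypothetical helper, the missing_symbol struct here is a stand-in for the exception named above, and rejecting lone surrogate values is an extra assumption on my part. The call to CompareStringW itself is left out so the sketch stays portable.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the missing_symbol exception mentioned above.
struct missing_symbol : std::runtime_error {
    explicit missing_symbol(std::uint32_t cp)
        : std::runtime_error("codepoint not representable in UTF-16"),
          codepoint(cp) {}
    std::uint32_t codepoint;
};

// Convert UCS-4 codepoints to UTF-16 code units.
std::vector<std::uint16_t> ucs4_to_utf16(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint16_t> out;
    for (std::uint32_t cp : in) {
        // Above U+10FFFF there is no UTF-16 form; surrogate values are
        // rejected too (an assumption of this sketch).
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw missing_symbol(cp);
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            // Split the 20 bits above 0x10000 into a surrogate pair.
            std::uint32_t v = cp - 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

On Win32, a collator along these lines would then hand both converted buffers to CompareStringW and compare the result against CSTR_LESS_THAN.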

Anyways, similar issues arise for transcoding. (Not to mention the fact that win32_collator is, in fact, dependent on the ability to transcode, since the Win32 Unicode APIs expect UTF-16 strings.) We therefore need pluggable transcoders, so the prototypes from Part 7 gain one more template argument, the transcoding tool:

// Free function: the Engine performs the actual conversion from src into tgt.
template <class Engine, class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt, Engine e = Engine() );

// Member template, declared inside rmstring<SrcEnc, SrcStore>:
template <class Engine, class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> transcode( Engine e = Engine(), TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );

These functions now put off transcoding to the Engine object, whatever that may be. In the Win32 vein, we could use MultiByteToWideChar and WideCharToMultiByte -- but that's too easy, not to mention very difficult to wrap. I'd really like to do something that's solely C++ and entirely based in the Unicode Character Database's mappings directory. There are a few dilemmas to be solved for that.
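
The "put it off to the Engine" idea can be shown in miniature without rmstring at all. In this sketch the string types are replaced by std::string and std::vector, and latin1_engine, ascii_engine and decode_to_ucs4 are all hypothetical names; the only point is that the caller picks the engine as a template argument and a default-constructed one does the per-byte work.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical engine: ISO 8859-1 bytes map directly onto U+0000..U+00FF.
struct latin1_engine {
    std::uint32_t decode(unsigned char byte) const { return byte; }
};

// Hypothetical engine: strict ASCII; any other byte becomes U+FFFD.
struct ascii_engine {
    std::uint32_t decode(unsigned char byte) const {
        return byte < 0x80 ? byte : 0xFFFDu;
    }
};

// The function knows nothing about encodings; it only drives the Engine.
template <class Engine>
std::vector<std::uint32_t> decode_to_ucs4(const std::string& src,
                                          Engine e = Engine()) {
    std::vector<std::uint32_t> out;
    out.reserve(src.size());
    for (char c : src) {
        out.push_back(e.decode(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Calling decode_to_ucs4<latin1_engine>("caf\xE9") keeps the final byte as U+00E9, while decode_to_ucs4<ascii_engine>("caf\xE9") replaces it with U+FFFD; swapping behaviour means swapping one template argument.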

Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of precomposed and compatibility characters. These are equivalent to a sequence of one or more other Unicode characters; they usually exist so that a single codepoint corresponds to a character in some older standard. For example, ISO8859-2 defines 0xA5 as a capital letter L with a caron accent (Ľ). The "simple" equivalent of this in Unicode is a capital letter L (U+004C) followed by a combining caron (U+030C); however, Unicode also defines a single pre-combined character, U+013D, that is canonically equivalent to those two. Therefore, almost all legacy encodings can have a simple 1:1 function to go up to Unicode, in the form of a lookup table. (Unfortunately, not all legacy encodings have a complete set of single-codepoint equivalents, so a LUT will not work for everything.) Going back from Unicode to legacy is more problematic, however: we now have two equivalents to a given legacy character. The most direct solution, it seems, is to generate a finite automaton.
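
Here is a small sketch of that asymmetry, using only the Ľ example (byte 0xA5 in ISO 8859-2) plus the ASCII range. Everything else in the encoding is deliberately left out, and latin2_to_unicode and unicode_to_latin2 are hypothetical names, not part of rmstring. Going up is a single lookup per byte; going down has to accept both spellings.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Legacy -> Unicode: one byte in, one codepoint out.
std::uint32_t latin2_to_unicode(unsigned char b) {
    if (b < 0x80) return b;        // ASCII range maps to itself
    if (b == 0xA5) return 0x013D;  // LATIN CAPITAL LETTER L WITH CARON
    return 0xFFFD;                 // rest of the table omitted in this sketch
}

// Unicode -> legacy: U+013D and U+004C U+030C must both become 0xA5.
std::vector<unsigned char> unicode_to_latin2(
        const std::vector<std::uint32_t>& in) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp == 0x004C && i + 1 < in.size() && in[i + 1] == 0x030C) {
            out.push_back(0xA5);   // decomposed spelling: L + combining caron
            ++i;                   // consume the combining mark as well
        } else if (cp == 0x013D) {
            out.push_back(0xA5);   // precomposed spelling
        } else if (cp < 0x80) {
            out.push_back(static_cast<unsigned char>(cp));
        } else {
            out.push_back('?');    // unmapped in this sketch
        }
    }
    return out;
}
```

The one-codepoint lookahead is manageable for a single character, but it gets unwieldy across a whole encoding, which is where a generated automaton starts to look attractive.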

I've been working on the DFA for the last few days. My main concern has been memory efficiency, and I can now get a complete set of typical round-trip encoding data to fit in under 8K per encoding, which fits nicely in cache. Obviously, certain ones will be smaller, and certain ones will be larger (in particular KOI8 and other encodings with very large symbol sets). The DFA solution is very clean, though: the legacy-to-Unicode DFA takes in bytes and outputs 32-bit unsigned ints containing codepoints, which are then re-encoded, and the Unicode-to-legacy DFA takes in codepoints and outputs bytes. Legacy-to-legacy transcodes use UCS-4 as an intermediary.
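
For a feel of what the Unicode-to-legacy direction looks like as a state machine, here is a hand-written two-state sketch. It is not the generated DFA described above and makes no attempt at the compact table layout; it covers only ASCII and the L-with-caron byte (0xA5 in ISO 8859-2), and latin2_encoder_dfa is a hypothetical name. The machine takes one codepoint at a time and holds back a capital L until it knows whether a combining caron follows.

```cpp
#include <cstdint>
#include <vector>

class latin2_encoder_dfa {
    enum state { START, SEEN_L };
    state state_;

public:
    latin2_encoder_dfa() : state_(START) {}

    // Consume one codepoint, appending zero or more bytes to out.
    void feed(std::uint32_t cp, std::vector<unsigned char>& out) {
        if (state_ == SEEN_L) {
            state_ = START;
            if (cp == 0x030C) {       // L + combining caron -> 0xA5
                out.push_back(0xA5);
                return;
            }
            out.push_back(0x4C);      // the pending L was just an L
        }
        if (cp == 0x004C) {           // hold the L back for one step
            state_ = SEEN_L;
        } else if (cp == 0x013D) {    // precomposed form -> 0xA5
            out.push_back(0xA5);
        } else if (cp < 0x80) {
            out.push_back(static_cast<unsigned char>(cp));
        } else {
            out.push_back('?');       // unmapped in this sketch
        }
    }

    // Flush a pending L at end of input.
    void finish(std::vector<unsigned char>& out) {
        if (state_ == SEEN_L) {
            out.push_back(0x4C);
            state_ = START;
        }
    }
};
```

Because all state lives in the object, input can arrive in arbitrary chunks; the caller just has to remember to call finish once at the end.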

At this point, I'm now working on a program that reads in a file from MAPPINGS and UnicodeData.txt from the Unicode Character Database and outputs the DFA in C++ format. I'll post more when that's finished. (I'm writing this entry pre-emptively, as this work-week looks like an absolute killer.)
