Encodings in Strings are Evil Things (Part 7)

01/10/2005

Eugh. Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on rmstring -- which means, of course, that this blog hasn't been updated. I haven't given up on it though! (I'm not dead! I don't want to go on the cart...) If anything, my desire to finish it has increased, since I've been working on a set of internal utilities which parse text files to take instructions, and I keep on thinking, "This would be so much easier if I just finished rmstring..."

So, on to business. First off, the all-important fixed_width_encoding class is done. This critical class is the foundation of all encodings with a fixed number of bits per code point; it's templated on an intrinsic type that the implementor knows is 1/2/4 bytes. The hardest part of an encoding, I've found, is writing the iterators; there are a huge number of methods that one must implement in order to make a 14882-compliant iterator. The internals are mostly simple pointer arithmetic; just a lot to be tested. (Yes, I have to write a test harness for this, if I want it to be approved for on-campus use :P)
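To give a feel for the "simple pointer arithmetic" involved, here is a minimal sketch of what such an iterator might look like. The name fixed_width_iterator and its layout are hypothetical, not the actual rmstring code, and the sketch covers only a subset of the operations a conforming random-access iterator requires.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical sketch, not the real fixed_width_encoding iterator: each code
// point occupies exactly one Stride, so every operation is pointer arithmetic.
// A fully conforming random-access iterator also needs postfix ++/--, -=,
// + and - with an offset, ->, and the remaining relational operators.
template <class Stride>
class fixed_width_iterator {
public:
    typedef std::random_access_iterator_tag iterator_category;
    typedef Stride                          value_type;
    typedef std::ptrdiff_t                  difference_type;
    typedef const Stride*                   pointer;
    typedef const Stride&                   reference;

    explicit fixed_width_iterator(const Stride* p = 0) : p_(p) {}

    reference operator*() const { return *p_; }
    reference operator[](difference_type n) const { return p_[n]; }

    fixed_width_iterator& operator++() { ++p_; return *this; }
    fixed_width_iterator& operator--() { --p_; return *this; }
    fixed_width_iterator& operator+=(difference_type n) { p_ += n; return *this; }

    // Distance in code points equals distance in Stride units.
    difference_type operator-(const fixed_width_iterator& o) const { return p_ - o.p_; }

    bool operator==(const fixed_width_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const fixed_width_iterator& o) const { return p_ != o.p_; }
    bool operator<(const fixed_width_iterator& o) const { return p_ < o.p_; }

private:
    const Stride* p_;  // assumed to be suitably aligned for Stride
};
```

Even this abbreviated version shows why the test surface is large: every operator is trivial on its own, but each one needs its own coverage.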

One annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for recv()ing something in from a TCP socket. If we know that said content is UCS-4, the natural urge is to cast it to an unsigned long * to iterate over... except that you can't. Or, at least, you shouldn't. If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or crash (on IA-64, unless SetErrorMode() is called to force OS alignment fixups, in which case it will run extremely slowly). Of course, people do this all the time; you just can't guarantee that doing so is safe within the confines of strictly conformant code. There is also no way for strictly conformant code to check if a given pointer is aligned, since there is no operator to retrieve a type's alignment requirements. The best you can do is assume that no type will have an alignment requirement greater than its size, and assert(0 == reinterpret_cast<size_t>(ptr) % sizeof(type)), which is thoroughly disgusting AND assumes certain things about the host's virtual memory system that may not be true.
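One portable way to sidestep the cast entirely is to copy the bytes into properly typed storage. The sketch below is only an illustration under a few assumptions: read_ucs4_units is a hypothetical helper (not part of rmstring), the bytes are assumed to already be in the host's byte order, and it uses std::uint32_t and vector::data(), which come from C++11 and so postdate the compilers discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of rmstring): copies 4-byte code units out of
// a byte buffer without ever forming a misaligned std::uint32_t pointer.
// Assumes the bytes are already in the host's byte order; any trailing bytes
// that do not fill a whole unit are ignored.
inline std::vector<std::uint32_t> read_ucs4_units(const unsigned char* bytes,
                                                  std::size_t size)
{
    std::vector<std::uint32_t> units(size / sizeof(std::uint32_t));
    if (!units.empty()) {
        // memcpy places no alignment requirement on its source pointer.
        std::memcpy(units.data(), bytes, units.size() * sizeof(std::uint32_t));
    }
    return units;
}
```

The cost is an extra copy, which is exactly the trade-off that makes the "just cast it" approach so tempting in the first place.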

Thus, I've opted for the simplest solution: a huge comment in the code that says "These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size. Violating either of these assumptions will result in your program's untimely death." Sometime later, I might come up with a helper function alignment_assert<T>(ptr) that takes advantage of compiler-specific extensions such as MSVC's __alignof if available. Note that this also could potentially result in a Unicode stream that does not make much sense, such as combining characters that don't properly match base characters. The Unicode standard notes that such a stream is not ill-formed, although it is not quite renderer-friendly; so, I'll support it.
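As a rough idea of what an alignment_assert<T>(ptr) helper could look like, here is a sketch that uses the alignof operator standardized later in C++11 in place of a compiler-specific extension such as __alignof. The is_aligned_for helper is a hypothetical name, and the pointer-to-integer conversion remains implementation-defined, so the caveat about assuming a flat address space still applies.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: true when ptr is suitably aligned for T.
// Converting a pointer to std::uintptr_t is implementation-defined, so this
// still assumes a flat address model in which the integer value reflects the
// address.
template <class T>
bool is_aligned_for(const void* ptr)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}

// Sketch of the alignment_assert<T>(ptr) idea: fires in debug builds when the
// backing store is not aligned for T-sized accesses.
template <class T>
void alignment_assert(const void* ptr)
{
    assert(is_aligned_for<T>(ptr));
}
```

A call such as alignment_assert<unsigned long>(store.data()) at the top of each iterator-producing function would turn the big comment into a debug-time check, where store is whatever backing store the string uses.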

I've also had occasion to rethink my plans for encoding_cast. Initially, I planned to use encoding_cast in a way similar to the Boost lexical_cast pseudo-operator. However, it disturbed me that doing so would mean that every call to encoding_cast would create a temporary in which to store the result, which would then make its way to final storage either by operator= or copy constructor. I ended up realizing that a good 70% of the calls to encoding_cast would be writing the transcoded result into a string that already existed. So, instead, we now have the transcode function, which comes in both non-member and member flavors:

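// Non-member: transcodes src into a target string that already exists.
template <class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt );

// Member of rmstring<SrcEnc, SrcStore>: returns a newly created transcoded string.
template <class SrcEnc, class SrcStore>
template <class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<SrcEnc, SrcStore>::transcode( TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );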

With the above, the originally envisioned encoding_cast is now just syntactic sugar for a call to the source string's member transcode() function. It also means that the code to do transcodes is now centralized within rmstring. Handy!
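To make the relationship between the pieces concrete, here is a toy sketch. Everything in it is a stand-in: toy_string, latin1_tag and ucs4_tag are hypothetical names, the "transcoding" merely widens each code unit (so it is only meaningful for ASCII input), and the encoding_cast signature is a guess at how the sugar might be spelled rather than the actual rmstring interface.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins, NOT the real rmstring: encodings are empty tag types and the
// "transcode" simply copies code units, so only the call shapes matter here.
struct latin1_tag {};
struct ucs4_tag {};

template <class Enc, class Store>
class toy_string {
public:
    toy_string() {}
    explicit toy_string(const Store& s) : store_(s) {}
    const Store& store() const { return store_; }
    Store& store() { return store_; }

    // Member flavor: builds and returns a new string in the target encoding.
    template <class TgtEnc, class TgtStore>
    toy_string<TgtEnc, TgtStore> transcode(TgtEnc newenc = TgtEnc(),
                                           TgtStore newstore = TgtStore()) const;
private:
    Store store_;
};

// Non-member flavor: writes into a string that already exists (no temporary).
template <class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode(const toy_string<SrcEnc, SrcStore>& src,
               toy_string<TgtEnc, TgtStore>& tgt)
{
    // Placeholder conversion: widen each unit. Real code would consult tables.
    tgt.store().assign(src.store().begin(), src.store().end());
}

// The member flavor delegates to the non-member one, keeping the logic central.
template <class Enc, class Store>
template <class TgtEnc, class TgtStore>
toy_string<TgtEnc, TgtStore>
toy_string<Enc, Store>::transcode(TgtEnc, TgtStore newstore) const
{
    toy_string<TgtEnc, TgtStore> result(newstore);
    ::transcode(*this, result);
    return result;
}

// encoding_cast reduced to sugar over the source string's member function.
template <class TgtEnc, class TgtStore, class SrcEnc, class SrcStore>
toy_string<TgtEnc, TgtStore> encoding_cast(const toy_string<SrcEnc, SrcStore>& src)
{
    return src.template transcode<TgtEnc, TgtStore>();
}
```

With these definitions, transcode(src, existing) fills a string you already have, while encoding_cast<ucs4_tag, std::u32string>(src) returns a fresh one; both routes end up in the same conversion code.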

Oh, and since someone asked: I'm currently developing and testing this on Visual C++ .NET 2003 and Stephan Lavavej's distribution of MinGW; I'll likely run it against Comeau as well to make sure it's kosher before I release the source to the public.

My goals for the next article are to have a few non-Unicode encodings done, so I can start testing out transcoding and flesh out the different encoding mechanisms. My main dilemma is designing the symbol tables; I noted in Part 4 that I wanted to have the ability to pass different resolving engines to the transcoder such as a perfect lossless transcription, visual parity, error codes, etc. Visual parity will be the hardest to do; in fact, I will likely not do it right away. (Namely, because the Unicode tables do not contain such parity information.) Another concern has been memory consumption of tables for encodings; I'll be tackling that shortly.

(Since this was mostly a "what happened while I was gone" article, no point summary.)

(Update 2pm: Michael Kaplan nudged me a bit that I had broken my previous insistence on "code point" versus "character" terminology -- that's what I get for stepping away from the blog for two weeks! Terminology corrected; anyone who doesn't know the difference between code points and characters needs to go back and read this blog from the beginning, or at least Part 5.)
