Encodings in Strings are Evil Things (Part 5)

10/25/2004

In our last episode, we briefly discussed possible behaviors for encoding_cast, and we discussed how the STL's basic_string class was structured -- namely, we noted that it had several core functions that were overloaded many times for various types of input. We also noted that we could avoid many of the implementation headaches that result, because of our decision to generalize our backing store.

One of my coworkers pointed out that Herb Sutter had already done an excellent dissection of basic_string in Exceptional C++ Style -- and, indeed, the last four chapters of the book are spent analyzing its structure, breaking it down to the core functions, and then implementing many of the functions and overloads as non-member template functions. However, he's not looking to improve basic_string's foundation -- he's merely explaining how reducing the number of methods in basic_string makes the code much easier to maintain. (For example, rather than writing an empty() member function, he writes a templated empty function that takes an STL string or container and returns true if its begin and end iterators are equal.)
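To make the non-member idea concrete, here is a minimal sketch of such a function. This is not Sutter's actual code; the name is_empty is my own, picked so that it doesn't collide with std::empty on newer compilers.

```cpp
#include <string>
#include <vector>

// Hypothetical non-member replacement for an empty() member function.
// It works for any type that exposes begin() and end(), so a single
// template covers strings and containers alike.
template <typename Container>
bool is_empty(const Container& c) {
    return c.begin() == c.end();
}
```

Because the function only touches the public iterator interface, it needs no access to the class internals, which is the whole point of the decomposition.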

Furthermore, he specifically chooses some less-than-ideal but good-enough implementations as a result of making simplicity the primary goal. For example, in his implementation of resize(), he implements the shrinking case by using a basic_string constructor to make a copy of the first N characters of the string and then calling swap(), which incurs an unnecessary memory allocation and deallocation. While Sutter's treatment is good, we have a slightly more ambitious goal in mind (making a better class to replace std::string, rather than merely improving upon the existing implementation through decomposition), so we're not duplicating effort.
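The trade-off can be sketched as two ways of shrinking a std::string. Both function names are hypothetical, and the first only illustrates the copy-and-swap technique as described above rather than reproducing the book's code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Shrink by constructing a truncated copy and swapping it in.
// Simple, but the copy may allocate a new buffer, and the old
// buffer is released when the temporary is destroyed.
void shrink_by_copy_and_swap(std::string& s, std::size_t n) {
    assert(n <= s.size());
    std::string(s, 0, n).swap(s);
}

// Shrink in place by erasing the tail; no second buffer is needed.
void shrink_in_place(std::string& s, std::size_t n) {
    assert(n <= s.size());
    s.erase(n);
}
```

Both leave the string holding its first n characters; the difference is only in how much work the allocator may be asked to do along the way.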

That said, I agree with his approach of decomposing functions with many overloads such as insert and replace, especially considering that our choice to generalize backing stores eliminates most of my objections to his techniques. So, I've decided to make a basic_rmstring class after all, in a sense. The basic_rmstring class will have a single member function for each major piece of functionality, such as insertion or replacement or concatenation. We'll then make an rmstring wrapper class that provides overloads in a way to make it roughly equivalent to std::string.
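As a rough sketch of that split (not the real header, which doesn't exist yet), the core class below exposes one insert primitive and the wrapper forwards every overload to it. The names core_string and wrapped_string are placeholders for basic_rmstring and rmstring, and a std::string stands in for the generalized backing store.

```cpp
#include <cstddef>
#include <string>

// Hypothetical core: exactly one insert primitive (pointer + length).
class core_string {
public:
    void insert(std::size_t pos, const char* data, std::size_t len) {
        buf_.insert(pos, data, len);
    }
    const std::string& str() const { return buf_; }
private:
    std::string buf_;  // stand-in for the generalized backing store
};

// Hypothetical wrapper: std::string-style overloads, all of which
// forward to the single primitive in the core class.
class wrapped_string {
public:
    wrapped_string& insert(std::size_t pos, const std::string& s) {
        core_.insert(pos, s.data(), s.size());
        return *this;
    }
    wrapped_string& insert(std::size_t pos, const char* s) {
        core_.insert(pos, s, std::char_traits<char>::length(s));
        return *this;
    }
    wrapped_string& insert(std::size_t pos, std::size_t count, char c) {
        const std::string fill(count, c);
        core_.insert(pos, fill.data(), fill.size());
        return *this;
    }
    const std::string& str() const { return core_.str(); }
private:
    core_string core_;
};
```

Only the primitive has to know anything about storage; the overloads are thin adapters that can be added or removed without touching it.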

Now, on to a concern I alluded to in the last entry: distinguishing code points and characters.

Up until now, I've specifically used the term "code point" to refer to a single symbol in the Unicode/UCS tables, even though Unicode refers to them as characters. I chose to do this because of the existence of "combining characters", which are symbols associated with the previous "base character" such as accents, enclosing boxes/circles, formatting marks for subscript/superscript, and so on. Unicode contains unaccented base characters, combining characters, and "precomposed characters" that use a single code point to represent a pre-accented base character. A precomposed character is always considered canonically equivalent to some combination of a base character and one or more combining characters. (See Part 1 for an example of this.)
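For a quick illustration, here are the two spellings of À as UTF-32 strings (this assumes a C++11 compiler for the U"" literals; the variable names are just for the example).

```cpp
#include <string>

// U+00C0 LATIN CAPITAL LETTER A WITH GRAVE: precomposed, one code point.
const std::u32string precomposed = U"\u00C0";

// U+0041 LATIN CAPITAL LETTER A followed by U+0300 COMBINING GRAVE ACCENT:
// a base character plus a combining character, two code points.
const std::u32string decomposed = U"A\u0300";
```

The two strings are canonically equivalent, yet they differ in length and compare unequal code point by code point, which is exactly why the distinction matters for a string class.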

Unicode defines a set of normalization forms that are used to standardize whether to favor combining characters or precomposed characters. However, regardless of whether pre-composed characters are favored or not, there are some character sequences which do not have pre-composed equivalents and must be represented using combining characters. To make things even nastier, there are some combining characters, most notably double diacritics, that can span multiple base characters. (And I haven't even gotten into Arabic and Hebrew scripts that can change the direction of rendering in mid-string!)

Of course, our problem here is that most programmers don't think about accents as being distinct elements to iterate through! When you hit the right arrow in Microsoft Word to skip over an À, you don't go first to an A and then to the A's accent -- you move past the whole "character." (Unicode refers to this rough definition of character as a "grapheme cluster," FYI.) If it weren't for double diacritics, we could shrug and say "Well, a character is a base codepoint plus zero or more combining codepoints." But it may not be that easy.

After taking a walk to think it over, I ended up deciding to err on the side of the Unicode standard -- we'll treat double diacritics as a glyph problem. Namely, a double diacritic is attached to the preceding base code point only, and the fact that it extends over the following base code point as well is a glyphing concern. (This is also due to the fact that most of the double diacritics can also be represented as a pair of "combining half marks," where half of the glyph is applied to each character as two separate combining characters, and the glyphing engine is expected to recognize this and render it as a single glyph.) So, we can say that a grapheme cluster is a base character, plus zero or more combining code points, plus any uses of the Combining Grapheme Joiner code point.
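Under that working definition, counting clusters reduces to counting the code points that start one. The sketch below assumes a deliberately tiny combining-mark test that covers only three Unicode blocks; a real implementation would consult the Unicode Character Database, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// ASSUMPTION: only three blocks are checked here. This is an
// illustration, not a complete combining-character test.
bool is_combining(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F)   // Combining Diacritical Marks
                                            // (includes U+034F, the joiner,
                                            // and the double diacritics)
        || (cp >= 0x20D0 && cp <= 0x20FF)   // Combining Marks for Symbols
        || (cp >= 0xFE20 && cp <= 0xFE2F);  // Combining Half Marks
}

// Count clusters as "base code point plus any trailing combining
// code points". A combining code point at the very start of the
// string has no base, so it begins a cluster of its own.
std::size_t count_clusters(const std::u32string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == 0 || !is_combining(s[i])) {
            ++n;
        }
    }
    return n;
}
```

Note how a double diacritic such as U+0361 between "t" and "s" is simply absorbed into the cluster of the "t" before it, matching the decision to leave the visual span to the glyphing engine.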

So, do we want basic_rmstring to take integer index arguments, iterators, etc. as referring to code points, or to grapheme clusters? For the sake of programmer familiarity, we're going to default to clusters, but we'll allow code points. We will have a single iterator class that takes a bool in its constructor describing whether advance() and related methods should advance by code point or by cluster. Our begin, end, and other such iterator methods will be templated with a default template argument of clusters; thus, you can ask for a code point iterator by calling str.begin<codepoints>(). This is a bit messy, but workable.
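Here is one way that interface might look, as a sketch only. The tag types, the text and text_iterator classes, the length helper, and the tiny is_mark test are all assumptions made for illustration; the real basic_rmstring header may well differ.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tag types used as the template argument to begin()/end().
struct clusters   { static constexpr bool by_cluster = true;  };
struct codepoints { static constexpr bool by_cluster = false; };

// ASSUMPTION: a deliberately tiny combining-mark test (one block only).
inline bool is_mark(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// One iterator class; a bool chosen at construction selects the step size.
class text_iterator {
public:
    text_iterator(const std::u32string* s, std::size_t pos, bool by_cluster)
        : s_(s), pos_(pos), by_cluster_(by_cluster) {}

    // Yields the first code point of the current unit.
    char32_t operator*() const { return (*s_)[pos_]; }

    text_iterator& operator++() {
        ++pos_;
        if (by_cluster_) {
            // Skip the combining code points that belong to this cluster.
            while (pos_ < s_->size() && is_mark((*s_)[pos_])) ++pos_;
        }
        return *this;
    }
    bool operator==(const text_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const text_iterator& o) const { return pos_ != o.pos_; }

private:
    const std::u32string* s_;
    std::size_t pos_;
    bool by_cluster_;
};

class text {
public:
    explicit text(std::u32string s) : data_(std::move(s)) {}

    // Clusters are the default; begin<codepoints>() opts out.
    template <typename Unit = clusters>
    text_iterator begin() const {
        return text_iterator(&data_, 0, Unit::by_cluster);
    }
    template <typename Unit = clusters>
    text_iterator end() const {
        return text_iterator(&data_, data_.size(), Unit::by_cluster);
    }

private:
    std::u32string data_;
};

// Number of units (clusters by default) between begin and end.
template <typename Unit = clusters>
std::size_t length(const text& t) {
    std::size_t n = 0;
    for (text_iterator it = t.begin<Unit>(), e = t.end<Unit>(); it != e; ++it) {
        ++n;
    }
    return n;
}
```

With the string "A" + U+0300 + "b", length(t) walks two clusters while length<codepoints>(t) walks three code points, and plain t.begin() keeps the familiar spelling for the common case.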

Before, we listed the methods that seemed worthwhile to carry over. However, many of them can be implemented as versions of the others. Tomorrow, we'll actually write a complete header for basic_rmstring and start implementing it.

That, and I think it's about time I go buy a hardcover copy of the Unicode standard, as I have way too many PDFs on my desktop right now.
