“Precomposed” and “Composite” Characters in Windows APIs

The terms “Composite” and “Precomposed” in the Windows docs give me trouble, so I suspect they give other people trouble too.

Basically, Unicode provides two ways of encoding some characters.  Sometimes a character can be encoded as a single code point (like Ä, or U+00C4), and other times as a combination of code points (A +  ̈ , or U+0041 + U+0308).
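To make the two encodings concrete, here is a small sketch in Python (my choice for illustration, it has nothing to do with the Windows APIs themselves). The helper name `describe` is made up for this example; it just lists the code points and their Unicode names for each form.

```python
import unicodedata

precomposed = "\u00C4"      # Ä as a single code point
combining = "\u0041\u0308"  # A followed by a combining diaeresis

def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

print(describe(precomposed))
print(describe(combining))

# The two strings look the same on screen, but a plain comparison
# works on code points, so they are not equal.
print(precomposed == combining)
```

Both strings should render as the same Ä, yet they have different lengths and do not compare equal code point by code point.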

Windows uses the terms “Precomposed” and “Composite” to distinguish these two ideas.  Unfortunately, Unicode defines both terms as “Decomposable Character”, which is “A character that is equivalent to a sequence of one or more other characters…”, i.e. the Ä form of the character.

I’d even argue that Microsoft messed up its English when it chose the word composite however long ago (probably my boss’s boss’s boss, so shhhhh :-).  The dictionary I looked at said composite is “A structure or an entity made up of distinct components.”, which sounds to me like Ä.  Sadly, we chose to use composite to describe a “Combining Character Sequence” such as A +  ̈ .  This is somewhat mitigated by the fact that we did this long ago when these technologies were still pretty new, but it doesn’t help the fact that I get confused every time I have to see these words.  Since I’m the guy that maintains these APIs, I figure if I get confused by it, others must too :-)

So to summarize, when you see these words in the docs for Windows APIs:

Precomposed characters are characters like Ä (U+00C4) that use one code point to represent a single character.

Composite characters (in Windows documentation and constants) are sequences of code points like A +  ̈  (U+0041 + U+0308) that use multiple code points to represent a single character shape.
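If it helps to keep the two terms straight, the summary can be turned into a rough classifier. This is only a sketch in Python: `windows_term` is a hypothetical helper, not a Windows API, and it relies on a heuristic (a nonzero canonical combining class means "combining mark") that will miss combining marks whose class is 0.

```python
import unicodedata

def windows_term(text):
    """Label text with the Windows doc terminology (rough heuristic)."""
    # Any combining mark means the text holds a combining character
    # sequence, which the Windows docs call "composite".
    if any(unicodedata.combining(ch) != 0 for ch in text):
        return "composite"
    # A canonical decomposition (no "<compat>"-style tag) means the
    # character is a single code point standing in for a sequence.
    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition and not decomposition.startswith("<"):
            return "precomposed"
    return "neither"

print(windows_term("\u00C4"))   # single code point Ä
print(windows_term("A\u0308"))  # A + combining diaeresis
print(windows_term("A"))        # plain A
```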

For what it’s worth, Windows tends to generate characters in a Precomposed form when possible.  Even then, however, cut & paste and other operations can cause combining character sequences to occur.
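Since either form can show up in practice, code that compares strings usually wants to normalize them first. Here is a minimal sketch in Python, assuming Unicode normalization form C (NFC) as the closest match to what the Windows docs call precomposed and form D (NFD) as the closest match to composite. The helper name `same_text` is made up for this example.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00C4"      # Ä
combining = "\u0041\u0308"  # A + combining diaeresis

print(same_text(precomposed, combining))

# NFC composes the sequence into the single code point;
# NFD takes the single code point apart again.
composed = unicodedata.normalize("NFC", combining)
decomposed = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(ch):04X}" for ch in composed])
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Normalizing both sides to the same form, whichever one you pick, is what makes the comparison come out equal.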

