Combining marks and interesting effects

I'm planning to blog about some of the challenges involved in creating a world-class text editor, but first I'd like to talk a bit about an interesting set of Unicode characters - combining marks.

 

These characters, as their name suggests, combine with others and modify their appearance. In this image, you can see my last name (or rather, the first of my two last names, but I digress).

 

 

The first line has no marks on top of the 'o', and is thus spelled incorrectly. After that, you see an acute accent, which is the mark that should go over that vowel.

 

The third and fourth lines are visually identical, but I'm actually playing a trick on Word. The first of the two spellings was created by pressing the [ key after switching the keyboard layout to Spanish - Argentina, then pressing the o key. The [ key in this case acts as a dead key, which the system processes by holding on to a bit of information that it applies to the next key I type.
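To make the "holding a bit of information" idea concrete, here is a minimal Python sketch of a dead-key state machine. It is only an illustration, not how Windows or Word actually implement it: the `DeadKeyComposer` class, the `DEAD_ACUTE` key name, and the fallback of emitting a standalone accent when no precomposed character exists are all assumptions made for this example.

```python
import unicodedata

# Hypothetical key name mapped to (combining mark, standalone spacing accent).
DEAD_KEYS = {"DEAD_ACUTE": ("\u0301", "\u00b4")}


class DeadKeyComposer:
    """Illustrative dead-key handling: remember the accent, apply it to the next key."""

    def __init__(self):
        self.pending = None  # the remembered dead key, if any

    def press(self, key):
        if key in DEAD_KEYS:
            # Nothing is emitted yet; just remember the accent.
            self.pending = DEAD_KEYS[key]
            return ""
        if self.pending is None:
            return key
        (mark, spacing), self.pending = self.pending, None
        # NFC normalization yields a precomposed character when one exists.
        composed = unicodedata.normalize("NFC", key + mark)
        if len(composed) == 1:
            return composed
        # Assumed fallback: emit the standalone accent, then the key itself.
        return spacing + key


composer = DeadKeyComposer()
typed = composer.press("DEAD_ACUTE") + composer.press("o")
print(typed == "\u00f3")  # True: a single precomposed character
```

Note that nothing is returned when the dead key itself is pressed, which matches the lack of feedback mentioned in the digression that follows.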

 

Brief digression: an interesting thing to note is that there is no feedback to the user to indicate that the next key will come out modified - why not press the 'o' first, then add the modifying marks instead? Well, typewriters worked by inking the mark and not advancing the carriage, so the logical order was to first press the key for the accent, then the vowel. I guess it stuck.

 

When I type a character in this manner, the system generates a precomposed character. This is a single Unicode character that is already an o with an accent - U+00F3, Latin Small Letter O With Acute. The fourth line, in contrast, was created by inserting an o character, then pasting a Combining Acute Accent character (U+0301) after it.
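The two spellings can be inspected with a short Python sketch that uses the standard `unicodedata` module. The `describe` helper is just a name made up for this example. The sketch shows that the strings differ code point by code point, and that Unicode normalization converts one form into the other.

```python
import unicodedata

precomposed = "\u00f3"   # one code point: o with acute
decomposed = "o\u0301"   # two code points: o, then combining acute accent


def describe(text):
    """List each code point in text with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(precomposed))  # ['U+00F3 LATIN SMALL LETTER O WITH ACUTE']
print(describe(decomposed))   # ['U+006F LATIN SMALL LETTER O', 'U+0301 COMBINING ACUTE ACCENT']

# The strings render alike but are not equal as sequences of code points.
print(precomposed == decomposed)  # False

# Normalization maps between the two forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```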

 

These two glyphs are visually identical. However, there is one subtle difference in behavior, because the fourth line holds two separate characters in the document. If I put the caret to the right of the o on the third line and press Backspace, the whole character is deleted. If I put the caret to the right of the o on the fourth line and press Backspace, only the combining mark is deleted - the 'o' is still there. However, if I press Delete from the left of the character, in both cases the o and its accent are wiped away together. The two spellings also behave identically for the purposes of moving the caret with the left and right arrows.
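The rules just described can be modeled in a few lines of Python. This is a simplified sketch and not Word's actual logic. It assumes a "cluster" is one base character followed by any characters with a non-zero combining class, and the function names (`cluster_end`, `backspace`, `delete`, `move_right`) are invented for the example.

```python
import unicodedata


def cluster_end(text, start):
    """Index just past the base character at start and any combining marks after it."""
    end = start + 1
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return end


def backspace(text, caret):
    """Remove a single code point to the left of the caret."""
    if caret == 0:
        return text, caret
    return text[:caret - 1] + text[caret:], caret - 1


def delete(text, caret):
    """Remove the whole cluster to the right of the caret."""
    if caret >= len(text):
        return text, caret
    return text[:caret] + text[cluster_end(text, caret):], caret


def move_right(text, caret):
    """Advance the caret past one whole cluster."""
    return caret if caret >= len(text) else cluster_end(text, caret)


precomposed = "\u00f3"
decomposed = "o\u0301"

print(backspace(precomposed, 1))  # ('', 0): the whole character goes
print(backspace(decomposed, 2))   # ('o', 1): only the mark goes
print(delete(precomposed, 0))     # ('', 0)
print(delete(decomposed, 0))      # ('', 0): base and mark go together
```

In this model Backspace works on code points while Delete and caret movement work on clusters, which is one way to reproduce the asymmetry described above.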

 

If you think about them for a bit, these rules make sense when you know what the internal representation is and what the user is trying to do, but they can be tricky to understand sometimes. And they may change depending on other factors - which is when the fun begins. But this post is long enough as it is, and I'll bring up more interesting cases in the future.

 

 

This posting is provided "AS IS" with no warranties, and confers no rights.