For RichEdit 4.0 (Windows XP SP1), I developed a UTF-8 version of the Rich Text Format (RTF). The reason was to have a faster, more reliable way of handling copy/paste for RichEdit than regular RTF. RichEdit 5.0 added the binary format for this purpose (and for OneNote), and RichEdit 6.0 added a still faster internal method to speed the build-up of the math linear format. Accordingly, starting with RichEdit 5.0 (Office 2003), the UTF-8 RTF format isn’t used for copy/paste unless the client specifically asks for it, and I hadn’t paid much attention to it.

But a UTF-8 RTF bug for the N’Ko script (the only right-to-left script that displays its digits RTL!) showed up the other day needing some attention. In my standard RTF debugging mode, I opened the file in Notepad to see what was going on. Much to my delight, it looks sooooooo much better than usual! Here’s the text part of the “new scripts for Windows 8” file written by RichEdit for standard RTF (N’Ko characters highlighted):









You can’t tell what the characters are since they’re all represented by the RTF \uN notation. Btw, this is still a lot simpler than what Word writes. You can see the latter by saving a Word file in RTF and looking at the file in Notepad. The RichEdit file containing the RTF above is 1437 bytes and the corresponding Word file is 38190 bytes. You can see why people pass Word RTF files through WordPad to get something lighter.

Now here’s what the same text looks like when written in the UTF-8 RTF format:

\f0\fs22\lang1178 ꓐꓗꓷꓨꓮꓯꓺ\f1\lang1033\par

\f2 ꔊꔕꔣꖵꗲꗛ꘏\f3\par

\f4\rtlch\lang1176 ߂߰ߩߝ߹ߜߐ\f5\ltrch\lang1033\par

\f6 𐒓𐒝𐒗𐒘𐒎𐒔𐒄\f3\par

\f7 ꡔ꡵ꡋꡏꡚꡟꡮ\f3\par

\f8 𐌱𐌵𐍊𐍃𐌼𐌾𐍆\f1\par

\f9 𐐕𐐢𐐉𐑍𐐝𐐵𐐧\f1\par

\f10 ⴲⴺⵂⵙⵥⵞⴼ\f1\par

You can read all the new-script characters instead of looking at \uN control words! Well, maybe you don't understand the text, but at least you can see the characters. The file containing this RTF is 1003 bytes, about 70% the size of the RichEdit standard RTF file and about a fortieth the size of the Word RTF file.

The \uN notation is certainly very valuable, but it’s particularly awkward because it uses signed 16-bit decimal values. To find out what the characters are, you have to add 65536 to negative values and convert the results to hexadecimal. Furthermore, a surrogate pair is represented by two \uN control words instead of one with an unsigned integer. So you have to convert two negative 16-bit decimal numbers to hex and then convert the resulting surrogate pair to the UTF-32 form to get what’s in the Unicode Standard. Since Word writes many RTF control words with unsigned 32-bit values, there really wasn’t any reason to stick with the original signed 16-bit convention. In addition, standard RTF writers convert characters that can be represented using a standard Windows code page to that code page, making those characters virtually unreadable unless the code page is the Western 1252 code page. Meanwhile, UTF-8 RTF simply displays all characters outright. If you paste a UTF-8 RTF file into Word, you can see the characters and use the Alt+X hot key to examine their values in Unicode.
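The manual conversion described above can be sketched in a few lines of Python. This is only an illustration, not RichEdit code: the function name `decode_u_words` and the regular expression are made up for this example, and the sketch ignores the fallback characters that follow each \uN in real RTF.

```python
import re

# Matches \uN with an optional minus sign. Control words such as \uc1 or \ul
# are not matched because a digit or '-' must follow the 'u'.
# Assumption: the fallback text after each \uN is simply skipped over.
_U_WORD = re.compile(r"\\u(-?\d+)")

def decode_u_words(rtf_fragment):
    """Return the Unicode code points encoded by the uN control words."""
    # Step 1: signed 16-bit decimal -> unsigned UTF-16 code unit.
    # Masking with 0xFFFF is the same as adding 65536 to negative values.
    units = [int(m.group(1)) & 0xFFFF for m in _U_WORD.finditer(rtf_fragment)]

    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Step 2: combine a surrogate pair into one UTF-32 value.
            trail = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points

# An Osmanya letter (U+10493) written as two signed surrogate values.
print([hex(cp) for cp in decode_u_words(r"\u-10239?\u-9069?")])  # ['0x10493']
```

The same function handles the simpler cases: the N’Ko digit U+07C2 is written as \u1986, and the Lisu letter U+A4D0 as \u-23344.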

Makes one think the UTF-8 RTF format is really a much better format than the original RTF format. Except that only RichEdit understands it.

Comments (5)

  1. Robert says:

    > Except that only RichEdit understands it.

    Well, you just proved Notepad does so, too! But, more seriously, I also implemented UTF-8 reading capability in my RTF formula editor because it's not difficult and allows recognition of plain text strings in many places where RTF is required.

  2. MurrayS3 says:

    Very intriguing. Did you just use a \cpg65001 in the \fN entry in the font table? In RichEdit, UTF-8 uses \urtf1 instead of \rtf1 to signal UTF-8, but it might actually be more general to use the \cpg65001. Have to check to see if Word understands that…

  3. Robert says:

    The approach I used is quite simplistic: The program interprets all bytes above 0x7f as UTF-8 (except in \bin fields). This may be incompatible with otherwise encoded 8-bit RTF, but I have never encountered this since RTF writers seem to follow the RTF specification and use the escaped form \'xx.

    I added this feature mostly for convenience, to simplify passing a string with Unicode characters on the command line. In the output, the program always uses the \u… form, it never writes UTF-8. Then again, it does write math RTF (along with its own format), but does not understand it.

  4. Bruce Rosenblum says:

    I just read this post. It's intriguing, but also a bit disconcerting because you can't assume that all bytes above 0x7f are UTF-8. In the Japanese version of Word 2007 and later, RTF pushed onto the Windows clipboard is Shift-JIS format, not UTF-8. This caused us a few headaches when we first learned of it, especially because you can't reproduce it on the US version of Word.

    Is there a way to clearly identify in RTF whether a high byte is UTF-8 vs. Shift-JIS (or anything else)?

    1. MurrayS3 says:

      Shift-JIS is marked by \fcharset128. UTF-8 RTF starts with {\urtf1 instead of {\rtf1, but \ansicpg65001 would have been a good choice too. And UTF-8 can be autorecognized. Sometimes it starts with the UTF-8 byte order mark (0xEF, 0xBB, 0xBF), and even without the BOM, the regularity of UTF-8 allows a program to recognize it reliably if it has a few non-ASCII bytes.
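
The checks mentioned in this reply (the {\urtf1 header, the BOM, and the regularity of UTF-8) can be combined into a rough detector. The following Python is a minimal sketch under stated assumptions, not how RichEdit or any other reader actually does it: the function name `looks_like_utf8_rtf` is hypothetical, and the heuristic validates the whole buffer without skipping \bin data.

```python
import codecs

def looks_like_utf8_rtf(data):
    """Guess whether raw RTF bytes carry UTF-8 text (heuristic sketch)."""
    # Check 1: a UTF-8 byte order mark (0xEF, 0xBB, 0xBF) at the start.
    if data.startswith(codecs.BOM_UTF8):
        return True
    # Check 2: the {\urtf1 header that RichEdit writes for UTF-8 RTF.
    if data.startswith(b"{\\urtf1"):
        return True
    # Check 3: autorecognition. Pure ASCII tells us nothing, so report False.
    if all(byte < 0x80 for byte in data):
        return False
    # Bytes above 0x7f must form well-formed UTF-8 sequences. Text in a
    # legacy encoding such as Shift-JIS almost always fails this test.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

shift_jis_rtf = b"{\\rtf1 " + "\u65e5\u672c\u8a9e".encode("shift_jis") + b"}"
utf8_rtf = b"{\\rtf1 " + "\u07c2\u07f0".encode("utf-8") + b"}"
print(looks_like_utf8_rtf(shift_jis_rtf), looks_like_utf8_rtf(utf8_rtf))  # False True
```

A more careful reader would also look at \fcharsetN and \ansicpgN before falling back to the byte-level test, since a short run of legacy-encoded bytes can occasionally happen to be valid UTF-8.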
