Writing an RTF to HTML converter, posting code in blogs.

Visual Studio's IDE copies code as RTF (Rich Text Format). Web browsers like HTML. So posting code from Visual Studio into a blog requires a decent RTF to HTML conversion. And having a technical blog means posting code. So I needed to solve this conversion problem.

The sad tale:
At first I tried Word, but Word had a heart attack trying to convert RTF to HTML and generated heavily mangled HTML (even its allegedly "filtered" HTML is still garbled), which in turn gave Community Server (which runs my blog) a heart attack. That almost killed my blogging days, until I switched to FrontPage. But FrontPage 2003 can't properly convert RTF to HTML either (flabbergasting!), so I eventually wrote my own converter.

"How can I post Visual Studio code on my blog" was actually a very popular question on our internal blog alias. It took a while to get good answers. Several folks wrote their own tools. (Here's an example of Shawn's. His puts a pretty box around the code). I think CS's support improved here too over time.

The right way:
There are some great tools out there that solve this properly, like a VS plug-in that copies code as HTML (https://blogs.msdn.com/powertoys/archive/2004/10/21/245850.aspx ). Anybody who actually wants a reasonable, working solution should use that.

There are also sample RTF 2 HTML converters all around, including some nice web-based ones. Just a search away.

What I did:
It was easier to just write an RTF to HTML converter than to deal with these other apps.  And more fun.

I've had a few people ask about it, so I wanted to throw it up on my blog.

This takes RTF in from the clipboard and dumps it out as an HTML file called "out.html" in the current directory.

  1. RTF is just a text file with embedded control sequences. Check out the RTF spec on MSDN. Or create an RTF file with WordPad and then open it with Notepad.
  2. I wanted the input to be via the clipboard, as opposed to a file, since I was copying the text from Visual Studio. That was part of my motivation for writing https://blogs.msdn.com/jmstall/archive/2005/08/22/Clipboard_tools.aspx
  3. The output HTML is very straightforward. No CSS. The fanciest thing it has is <span> tags.
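
To make the third point concrete, here is a minimal sketch of the shape of HTML this kind of converter emits: one <pre> block with one <span> per run of colored text. The ColorRun type and the ToHtml helper are hypothetical names used only for illustration; they are not part of the converter listed further down.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SpanSketch
{
    // Hypothetical pair: a hex color and the text drawn in that color.
    public struct ColorRun
    {
        public string Color;
        public string Text;
        public ColorRun(string color, string text) { Color = color; Text = text; }
    }

    // Escape the characters that are special in HTML. '&' must be replaced first,
    // otherwise the '&' in "&lt;" and "&gt;" would be escaped a second time.
    public static string Escape(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Wrap each run in its own <span>, all inside a single <pre>.
    public static string ToHtml(IEnumerable<ColorRun> runs)
    {
        StringBuilder sb = new StringBuilder("<pre>");
        foreach (ColorRun run in runs)
        {
            sb.Append("<span style=\"color: " + run.Color + "\">");
            sb.Append(Escape(run.Text));
            sb.Append("</span>");
        }
        sb.Append("</pre>");
        return sb.ToString();
    }

    static void Main()
    {
        List<ColorRun> runs = new List<ColorRun>();
        runs.Add(new ColorRun("#0000FF", "using"));
        runs.Add(new ColorRun("#000000", " System;"));
        Console.WriteLine(ToHtml(runs));
    }
}
```

The <pre> tag is what preserves the indentation and line breaks; the spans only carry color.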

It's only about 150 lines of C#, so I took a few shortcuts.

  1. It's by no means a complete RTF converter. It just handles the subset of RTF that VS2005's IDE produces. That's all I needed.
  2. It's hard-coded to use the colortable matching VS's default C# color scheme.
  3. It doesn't handle tabs. You should be using spaces anyway.
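
For anyone who wants to lift the second shortcut, the color table could be read from the input instead of being hard-coded. The following is only a sketch under assumptions: it assumes the table looks like {\colortbl;\red0\green0\blue255;...}, where each entry ends in ';' and an empty entry means "default color". ParseColorTable is a hypothetical helper, not something the converter below contains.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ColorTableSketch
{
    // Returns one "#RRGGBB" string per color table entry, in table order.
    // Entries without \red/\green/\blue values (the "default" entry) map to black.
    public static List<string> ParseColorTable(string rtf)
    {
        List<string> colors = new List<string>();

        // Grab everything between "{\colortbl" and the closing '}'.
        Match table = Regex.Match(rtf, @"\{\\colortbl\s*([^}]*)\}");
        if (!table.Success) return colors;

        // Every entry is terminated by ';', so the piece after the last ';' is empty.
        string[] entries = table.Groups[1].Value.Split(';');
        for (int i = 0; i < entries.Length - 1; i++)
        {
            Match m = Regex.Match(entries[i], @"\\red(\d+)\s*\\green(\d+)\s*\\blue(\d+)");
            if (!m.Success)
            {
                colors.Add("#000000"); // assumption: treat the default entry as black
                continue;
            }
            int r = Int32.Parse(m.Groups[1].Value);
            int g = Int32.Parse(m.Groups[2].Value);
            int b = Int32.Parse(m.Groups[3].Value);
            colors.Add(String.Format("#{0:X2}{1:X2}{2:X2}", r, g, b));
        }
        return colors;
    }

    static void Main()
    {
        string rtf = @"{\rtf1{\colortbl;\red0\green0\blue255;\red0\green128\blue0;}\cf1 using}";
        foreach (string c in ParseColorTable(rtf))
        {
            Console.WriteLine(c);
        }
    }
}
```

With a list like this, the \cfN parameter could index straight into the parsed table, so a non-default color scheme would still come out right.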

 

The comparison:

Here's a comparison of Word and FrontPage trying to convert the RTF to HTML on a simple snippet.

What it should be:

 
                // check for RTF escape characters. According to the spec, these are the only escaped chars.
                char chNext = rtf[idx];
                if (chNext == '{' || chNext == '}' || chNext == '\\')
                {
                    // Escaped char
                    tw.Write(chNext);
                    idx++;
                    continue;
                }

------------------------------------------------------

Word 2003: It's got all these Mso class tags and extra <p> tags. And in my browser, it's got extra newlines.

                // check for RTF escape characters. According to the spec, these are the only escaped chars.

                char chNext = rtf[idx];

                if (chNext == '{' || chNext == '}' || chNext == '\\')

                {

                    // Escaped char

                    tw.Write(chNext);

                    idx++;

                    continue;

                }

------------------------------------------------------
FrontPage 2003: It loses the indenting and the font.

// check for RTF escape characters. According to the spec, these are the only escaped chars.

char chNext = rtf[idx];

if (chNext == '{' || chNext == '}' || chNext == '\\')

{

// Escaped char

tw.Write(chNext);

idx++;

continue;

}

------------------------------------------------------

 

The code:

Here's the code. In good dogfooding fashion, I got the HTML for it via running it on itself. If you find it useful or entertaining, great.

 [update: missing & check] 
[update: added ';' in Escape]
 
// Very primitive RTF 2 HTML reader 
// Converts tiny subset of RTF (from VS IDE) into html.
// Author: Mike Stall (https://blogs.msdn.com/jmstall)
// Gets input RTF from clipboard.
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.IO;

namespace ClipBoard1
{
    class Program
    {
        [STAThread()]
        static void Main(string[] args)
        {
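            Console.WriteLine("Get RTF from the clipboard.");
            IDataObject iData = Clipboard.GetDataObject();
            if (iData == null) throw new ArgumentException("Clipboard is empty.");
            string rtf = (string)iData.GetData(DataFormats.Rtf);
            if (rtf == null) throw new ArgumentException("No RTF data on the clipboard.");

            Console.WriteLine(iData.GetData(DataFormats.Text));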

            // We assume the color table and font table are a standard preset used by VS.
            // Avoids hassle of parsing them.
            // Skip past {\colortbl.*;} and to the start of the real data
            // @todo - regular expression would be good here.
            // Note: IndexOf returns -1 when the text is not found.
            int i1 = rtf.IndexOf(@"{\colortbl");
            if (i1 < 0) throw new ArgumentException("Bad input RTF.");
            int i2 = rtf.IndexOf(";}", i1);
            if (i2 < 0) throw new ArgumentException("Bad input RTF.");
            // Take everything after the ";}", minus the last character
            // (presumably the '}' that closes the RTF document).
            string data = rtf.Substring(i2 + 2, rtf.Length - (i2 + 2) - 1);

            // 'using' closes the file even if Format() throws.
            using (TextWriter tw = new StreamWriter("out.html"))
            {
                Format(tw, data);
            }
        }

        // Default color table used by VS's IDE.
        static string[] m_colorTable = new string[] 
            {
               // rrGGbb
                "#000000", // default, starts at index 0
                "#000000", // real color table starts at index 1
                "#0000FF",
                "#00ffFF",
                "#00FF00",
                "#FF00FF",
                "#FF0000",
                "#FFFF00",
                "#FFffFF",
                "#000080",
                "#008080",
                "#008000",
                "#800080",
                "#800000",
                "#808000",
                "#808080",
                "#c0c0c0"
            };


        // Escape HTML chars
        static string Escape(string st)
        {
            st = st.Replace("&", "&amp;");
            st = st.Replace("<", "&lt;");
            st = st.Replace(">", "&gt;");            
            return st;
        }
        // Convert the RTF data into an HTML stream.
        // This rtf snippet is past the font + color tables, so we're just translating control words now.
        // Write out HTML to the text writer.        
        static void Format(TextWriter tw, string rtf)
        {
            tw.Write("<html><pre>");
            tw.Write("<span style=\"color: #000000\">"); // <span> has no 'color' attribute; use an inline style.
            // Example: \fs20 \cf2 using\cf0  System;
            // root --> ('text' '\' ('control word' | 'escaped char'))+
            // 'control word'  --> (alpha)+ (numeric*) space?
            // 'escaped char' = 'x'. Some characters \, {, } are escaped: '\x' --> 'x'
            // @todo - handle embedded groups (begin with '{')

            int idx = 0;
            while (idx < rtf.Length)
            {
                // Get any text up to a '\'. 
                Regex r1 = new Regex(@"(.*?)\\", RegexOptions.Singleline | RegexOptions.IgnoreCase);
                Match m = r1.Match(rtf, idx);
                if (!m.Success) break; // no more '\' in the input

                // text will be empty if we have adjacent control words
                string stText = m.Groups[1].ToString();
                tw.Write(Escape(stText));
                idx += m.Length;

                // check for RTF escape characters. According to the spec, these are the only escaped chars.
                char chNext = rtf[idx];
                if (chNext == '{' || chNext == '}' || chNext == '\\')
                {
                    // Escaped char
                    tw.Write(chNext);
                    idx++;
                    continue;
                }

                // Must be a control word. @todo- delimiter includes more than just space, right?
                // \G anchors the match at idx, so we can't silently skip ahead to a later control word.
                Regex r2 = new Regex(@"\G([\{a-z]+)([0-9]*) ", RegexOptions.Singleline | RegexOptions.IgnoreCase);
                m = r2.Match(rtf, idx);
                string stCtrlWord = m.Groups[1].ToString();
                string stCtrlParam = m.Groups[2].ToString();

                if (stCtrlWord == "cf")
                {
                    // Set font color.
                    int iColor = Int32.Parse(stCtrlParam);
                    tw.Write("</span>"); // close previous span, and start a new one for the given color.                    
                    tw.Write("<span style=\"color: " + m_colorTable[iColor] + "\">");
                }
                else if (stCtrlWord == "fs")
                {
                    // Sets font size. ignore
                }
                else if (stCtrlWord == "par")
                {
                    // This is a newline. ignore
                    // @todo- I think the only reason we can ignore this is because the \par in our input are always followed by
                    // a '\r\n' and we're accidentally writing that.
                }
                else
                {
                    throw new ArgumentException("Unrecognized control word '" + stCtrlWord + stCtrlParam + "' after: " + stText);
                }
                idx += m.Length;
            }
            tw.Write(Escape(rtf.Substring(idx))); // rest of string

            tw.Write("</span></pre></html>"); // exactly one <span> is still open at this point
        } // end Format()
    }
}
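
About the delimiter @todo in the code above: as I read the RTF spec, a control word is a backslash, a run of letters, an optional numeric parameter (which may be negative), and then a delimiter. A single space is swallowed as part of the control word; any other non-alphanumeric character ends the word but stays in the stream. Here is a minimal sketch of a reader that follows that rule. TryReadControlWord is a hypothetical helper, and I have not run it against real VS output.

```csharp
using System;
using System.Text.RegularExpressions;

static class ControlWordSketch
{
    // \G anchors the match at the start index passed to Match().
    // Groups: 1 = letters, 2 = optional (possibly negative) number.
    // The trailing " ?" consumes at most one delimiter space.
    static readonly Regex s_ctrl = new Regex(@"\G\\([a-zA-Z]+)(-?[0-9]+)? ?");

    // Tries to read one control word starting at rtf[idx], which should be a '\'.
    // On success, idx is moved past the word, its parameter and one optional space.
    // Returns false (and leaves idx alone) for escapes such as \{ \} \\.
    public static bool TryReadControlWord(string rtf, ref int idx, out string word, out string param)
    {
        word = "";
        param = "";
        Match m = s_ctrl.Match(rtf, idx);
        if (!m.Success) return false;

        word = m.Groups[1].Value;
        param = m.Groups[2].Value; // empty string when there is no parameter
        idx += m.Length;
        return true;
    }

    static void Main()
    {
        string rtf = @"\cf2 using\par\cf0  System;";
        int idx = 0;
        while (idx < rtf.Length)
        {
            string word, param;
            if (rtf[idx] == '\\' && TryReadControlWord(rtf, ref idx, out word, out param))
            {
                Console.WriteLine("control word '" + word + "' param '" + param + "'");
            }
            else
            {
                Console.WriteLine("text char '" + rtf[idx] + "'");
                idx++;
            }
        }
    }
}
```

With a reader like this, \par followed directly by another control word would no longer need the trailing space that the regex in Format() insists on.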