Writing an RTF to HTML converter, posting code in blogs.


Visual Studio’s IDE copies code as RTF (Rich Text Format). Web browsers want HTML. So posting code from Visual Studio into a blog requires a decent RTF to HTML conversion. And having a technical blog means posting code. So I needed to solve this conversion problem.


The sad tale:
At first I tried Word, but Word had a heart attack trying to convert RTF to HTML and generated heavily mangled HTML (even its allegedly “filtered” HTML is still garbled), which in turn gave Community Server (which runs my blog) a heart attack. That almost killed my blogging days, until I switched to FrontPage. But FrontPage 2003 can’t properly convert RTF to HTML either (flabbergasting!), so I eventually wrote my own converter.


“How can I post Visual Studio code on my blog” was actually a very popular question on our internal blog alias. It took a while to get good answers. Several folks wrote their own tools. (Here’s an example of Shawn’s. His puts a pretty box around the code). I think CS’s support improved here too over time.


The right way:
There are some great tools out there that solve this properly, like a VS plug-in that copies code as HTML (http://blogs.msdn.com/powertoys/archive/2004/10/21/245850.aspx). Anybody who actually wants a reasonable working solution should use that.


There are also sample RTF 2 HTML converters all around, including some nice web-based ones. Just a search away.


What I did:
It was easier to just write an RTF to HTML converter than to deal with these other apps.  And more fun.


I’ve had a few people ask about it, so I wanted to throw it up on my blog.


This takes RTF in from the clipboard, and then dumps it out as an HTML file called “out.html” in the current directory.



  1. RTF is just a text file with embedded control sequences. Check out the RTF spec on MSDN. Or create an RTF file with WordPad and then open it with Notepad.

  2. I wanted the input to be via the clipboard, as opposed to a file, since I was copying the text from Visual Studio. That was part of my motivation for writing http://blogs.msdn.com/jmstall/archive/2005/08/22/Clipboard_tools.aspx

  3. The output HTML is very straightforward. No stylesheets. The fanciest things it has are <span> tags with an inline color style.

It’s only about 150 lines of C#, so I took a few shortcuts.



  1. It’s by no means a complete RTF converter. It just handles the subset of RTF that VS2005’s IDE produces. That’s all I needed.

  2. It’s hard-coded to use the colortable matching VS’s default C# color scheme.

  3. It doesn’t handle tabs. You should be using spaces anyways.
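For anyone who wants to drop shortcut 2, here is a minimal sketch of reading the color table from the RTF instead of hard-coding it. It assumes the standard color table syntax, something like `{\colortbl;\red0\green0\blue0;\red0\green0\blue255;}` (a made-up sample, not captured VS output), where every entry ends with `;` and an empty first entry means "default color". `ColorTableSketch` and `ParseColorTable` are hypothetical names and are not part of my tool.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ColorTableSketch
{
    const string Header = @"{\colortbl";

    // Hypothetical helper, not part of the original tool.
    // Returns one "#RRGGBB" string per color table entry, index 0 first.
    public static List<string> ParseColorTable(string rtf)
    {
        List<string> colors = new List<string>();
        int start = rtf.IndexOf(Header, StringComparison.Ordinal);
        if (start < 0) return colors;          // no color table present
        int bodyStart = start + Header.Length;
        int end = rtf.IndexOf('}', bodyStart);
        if (end < 0) return colors;            // malformed group

        // Every entry ends with ';', so the last piece after Split is not an entry.
        string[] pieces = rtf.Substring(bodyStart, end - bodyStart).Split(';');
        for (int i = 0; i < pieces.Length - 1; i++)
        {
            Match m = Regex.Match(pieces[i],
                @"\\red([0-9]+)\s*\\green([0-9]+)\s*\\blue([0-9]+)");
            if (!m.Success)
            {
                // Empty entry means "default color"; assume black here.
                colors.Add("#000000");
                continue;
            }
            int r = int.Parse(m.Groups[1].Value);
            int g = int.Parse(m.Groups[2].Value);
            int b = int.Parse(m.Groups[3].Value);
            colors.Add(string.Format("#{0:X2}{1:X2}{2:X2}", r, g, b));
        }
        return colors;
    }
}
```

The result could stand in for the hard-coded table, for example `m_colorTable = ColorTableSketch.ParseColorTable(rtf).ToArray();`, so that a `\cfN` control word indexes straight into it.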

 


The comparison:


Here’s a comparison of Word and FrontPage trying to convert the RTF to HTML on a simple snippet.


What it should be:


// check for RTF escape characters. According to the spec, these are the only escaped chars.
char chNext = rtf[idx];
if (chNext == '{' || chNext == '}' || chNext == '\\')
{
    // Escaped char
    tw.Write(chNext);
    idx++;
    continue;
}

——————————————————


Word 2003: It’s got all these Mso class tags and extra <p> tags. And in my browser, it’s got extra newlines.


                // check for RTF escape characters. According to the spec, these are the only escaped chars.


                char chNext = rtf[idx];


                if (chNext == '{' || chNext == '}' || chNext == '\\')


                {


                    // Escaped char


                    tw.Write(chNext);


                    idx++;


                    continue;


                }


——————————————————
FrontPage 2003: It loses the indenting and the font.


// check for RTF escape characters. According to the spec, these are the only escaped chars.


char chNext = rtf[idx];


if (chNext == '{' || chNext == '}' || chNext == '\\')


{


// Escaped char


tw.Write(chNext);


idx++;


continue;


}


——————————————————


 


The code:


Here’s the code. In good dogfooding fashion, I got the HTML for it by running it on itself. If you find it useful or entertaining, great.

[update: missing & check] 
[update: added ‘;’ in Escape]

// Very primitive RTF 2 HTML reader
// Converts tiny subset of RTF (from VS IDE) into html.
// Author: Mike Stall (http://blogs.msdn.com/jmstall)
// Gets input RTF from clipboard.
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.IO;

namespace ClipBoard1
{
    class Program
    {
        [STAThread()]
        static void Main(string[] args)
        {
            Console.WriteLine("Get RTF from the clipboard.");
            IDataObject iData = Clipboard.GetDataObject();
            string[] f = iData.GetFormats();
            string rtf = (string)iData.GetData(DataFormats.Rtf);
            if (rtf == null) throw new ArgumentException("No RTF on the clipboard.");

            Console.WriteLine(iData.GetData(DataFormats.Text));

            // We assume the color table and font table are a standard preset used by VS.
            // Avoids hassle of parsing them.
            // Skip past {\colortbl.*;} and to the start of the real data
            // @todo - regular expression would be good here.
            // IndexOf returns -1 when the text is not found.
            int i1 = rtf.IndexOf(@"{\colortbl");
            if (i1 < 0) throw new ArgumentException("Bad input RTF.");
            int i2 = rtf.IndexOf(";}", i1);
            if (i2 < 0) throw new ArgumentException("Bad input RTF.");
            string data = rtf.Substring(i2 + 2, rtf.Length - (i2 + 2) - 1);

            TextWriter tw = new StreamWriter("out.html");
            Format(tw, data);
            tw.Close();
        }

        // Default color table used by VS's IDE.
        static string[] m_colorTable = new string[]
        {
            // rrGGbb
            "#000000", // default, starts at index 0
            "#000000", // real color table starts at index 1
            "#0000FF",
            "#00ffFF",
            "#00FF00",
            "#FF00FF",
            "#FF0000",
            "#FFFF00",
            "#FFffFF",
            "#000080",
            "#008080",
            "#008000",
            "#800080",
            "#800000",
            "#808000",
            "#808080",
            "#c0c0c0"
        };

        // Escape HTML chars. The '&' replacement must run first so that the
        // '&' in "&lt;" and "&gt;" is not escaped a second time.
        static string Escape(string st)
        {
            st = st.Replace("&", "&amp;");
            st = st.Replace("<", "&lt;");
            st = st.Replace(">", "&gt;");
            return st;
        }

        // Convert the RTF data into an HTML stream.
        // This rtf snippet is past the font + color tables, so we're just transferring control words now.
        // Write out HTML to the text writer.
        static void Format(TextWriter tw, string rtf)
        {
            tw.Write("<html><pre>");
            // <span> has no 'color' attribute, so use an inline style.
            tw.Write("<span style=\"color: black\">");
            // Example: \fs20 \cf2 using\cf0 System;
            // root --> ('text' '\' ('control word' | 'escaped char'))+
            // 'control word' --> (alpha)+ (numeric*) space?
            // 'escaped char' = 'x'. Some characters \, {, } are escaped: '\x' --> 'x'
            // @todo - handle embedded groups (begin with '{')

            int idx = 0;
            while (idx < rtf.Length)
            {
                // Get any text up to a '\'.
                Regex r1 = new Regex(@"(.*?)\\", RegexOptions.Singleline | RegexOptions.IgnoreCase);
                Match m = r1.Match(rtf, idx);
                if (m.Length == 0) break;

                // text will be empty if we have adjacent control words
                string stText = m.Groups[1].ToString();
                tw.Write(Escape(stText));
                idx += m.Length;

                // A trailing '\' with nothing after it: nothing left to parse.
                if (idx >= rtf.Length) break;

                // check for RTF escape characters. According to the spec, these are the only escaped chars.
                char chNext = rtf[idx];
                if (chNext == '{' || chNext == '}' || chNext == '\\')
                {
                    // Escaped char
                    tw.Write(chNext);
                    idx++;
                    continue;
                }

                // Must be a control char. @todo- delimiter includes more than just space, right?
                Regex r2 = new Regex(@"([\{a-z]+)([0-9]*) ", RegexOptions.Singleline | RegexOptions.IgnoreCase);
                m = r2.Match(rtf, idx);
                string stCtrlWord = m.Groups[1].ToString();
                string stCtrlParam = m.Groups[2].ToString();

                if (stCtrlWord == "cf")
                {
                    // Set font color.
                    int iColor = Int32.Parse(stCtrlParam);
                    tw.Write("</span>"); // close previous span, and start a new one for the given color.
                    tw.Write("<span style=\"color: " + m_colorTable[iColor] + "\">");
                }
                else if (stCtrlWord == "fs")
                {
                    // Sets font size. ignore
                }
                else if (stCtrlWord == "par")
                {
                    // This is a newline. ignore
                    // @todo- I think the only reason we can ignore this is because the \par in our input are always followed by
                    // a '\r\n' and we're accidentally writing that.
                }
                else
                {
                    throw new ArgumentException("Unrecognized control word '" + stCtrlWord + stCtrlParam + "' after:" + stText);
                }
                idx += m.Length;
            }
            tw.Write(Escape(rtf.Substring(idx))); // rest of string

            // Close the last color span before closing the document.
            tw.Write("</span></pre></html>");
        } // end Format()
    }
}
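
About the delimiter @todo in Format(): as I read the RTF spec, a control word is a backslash, a run of letters, and an optional (possibly negative) number. A single space right after it is part of the control word and gets swallowed, while any other non-alphanumeric character ends the control word without being consumed. Here is a minimal sketch of a matcher built on that reading. `ControlWordSketch` and `TryRead` are hypothetical names, and unlike Format(), `idx` here points at the backslash itself.

```csharp
using System;
using System.Text.RegularExpressions;

static class ControlWordSketch
{
    // \G pins the match to the start index handed to Match(), so a control
    // word further along in the string is never picked up by accident.
    static readonly Regex s_ctrl = new Regex(@"\G\\([a-zA-Z]+)(-?[0-9]+)? ?");

    // Hypothetical helper. 'idx' must point at the backslash.
    // On success, 'length' is the number of characters to skip, counting the
    // backslash and the one optional space delimiter.
    public static bool TryRead(string rtf, int idx,
        out string word, out string param, out int length)
    {
        Match m = s_ctrl.Match(rtf, idx);
        word = m.Success ? m.Groups[1].Value : "";
        param = m.Success ? m.Groups[2].Value : "";   // "" when there is no number
        length = m.Success ? m.Length : 0;
        return m.Success;
    }
}
```

Escaped characters (`\{`, `\}`, `\\`) do not match this pattern, so the escape check still has to run first, the same way Format() does it.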


Comments (12)

  1. Matt says:

    Agreed that your output looks really good in IE, but the text is too small in Firefox. And don’t even bother trying to read it in an Outlook 2007 RSS Feed – the newlines are missing or something.

    > It doesn’t handle tabs. You should be using spaces anyways.

    Cool, ok, but why?

  2. Ilan Assayag says:

    Great tool, thanks!

    I have made a few changes/fixes as follows:

    1. You can see that the "&lt" and "&gt" were not correctly formatted in the Escape method (look at the resulting HTML – they came out with the actual sign). To fix that I added the following line at the beginning of the Escape method:

    st = st.Replace("&", "&amp;");

    2. I personally prefer getting the result in my clipboard (i.e. replacing the RTF to HTML inside the clipboard). Then I can define a simple shortcut on my machine, and have it ready to use in my clipboard, without having to go through a concrete file. Therefore, instead of writing to a file I write to a StringWriter, and then set the text to my clipboard.

    With these changes I can now easily put my code on my Blogger blog, via Windows LiveWriter.

    Great!!!

    Thanks a lot!

    Ilan

  3. Tim Dawson says:

    Why shouldn’t we be using tabs?

  4. Ilan – after I started outputting it to a file "out.html", I quickly realized how inconvenient it was. I kept wanting to do clipboard -> clipboard, but was too lazy.

    My mistake about the &. I may be using an old version here. I wrote the tool a while ago, and I’ve copied the binaries around to about 5 different machines of mine. It actually took me about an hour to go find the source. I almost ended up using Reflector…

  5. Tim, Matt-

    I was joking about the tabs. I forgot the <joking></joking> tags.

    FWIW, the CLR (and I believe Windows) coding conventions are both 4 spaces instead of tabs, and that’s what I personally use.

    Since I wrote the tool for my personal use, and was cutting every corner, I didn’t bother with tabs.

  6. .NET Junkie says:

    Converting code to HTML was a real pain! I solved my conversion problems using javascript. On my blog all code formatting and code highlighting is now done in the browser, which works pretty well for me.

  7. Isaac says:

    Could you post the regular expressions you used, as the output looks good and valid, but the regular expressions are confusing me!

    Please could you also explain what each line is doing… I know that it would be really boring, but for a person who doesn’t know C# (just PHP), it would be wonderful.

  8. Setting plain text on the clipboard is easy. Call Clipboad.SetText("Hello!"), and it works great. But

  9. I was asked yesterday where "South African Web Developer Daily News – 2007-03-12" was and if I had forgotten.

  10. Today is Microsoft TechDays, the official launch of Vista and Office 2007, so I’ll make it short a sweet.

  11. It is painful to post code to your blog without any special considerations. In the worst case you have

  12. I’m trying out Windows Live Writer. Currently, I do all of my blogging via Frontpage , so this will be
