WordBlogX v2



Peter has been giving me a hard time about getting the WordML transform integrated into WordBlogX for a while now. Well, right now I’m writing this blog entry with Microsoft Office Word 2003 under the debugger with a debug build of the WordBlogX assembly. So far so good.

 

There were two bugs that were driving me crazy. First, if you were watching Mike Howard’s blog closely when he first started blogging, you would have noticed some funky-looking “A” characters spread about a couple of his entries (to be truthful, my blog entries had the same problem sometimes). Well, I tracked those annoying “A” characters down to Word changing the first character of two spaces after a period into some magical character with a Unicode value of 160 decimal (in C#: \u00A0). I don’t think this would be a problem if http://blogs.gotdotnet.com had the UTF-8 markers at the top of our web pages (we don’t control that part of the blog entries, BlogX does). Unfortunately, it doesn’t. So I added the following code to WordBlogX to convert all the text here to UTF-7:

 




   
   // convert the entry into UTF7 (since BlogX seems to prefer that)
   byte[] entryBytes = Encoding.Unicode.GetBytes(entry); // get the entry as a bunch of unicode bytes
   byte[] entryBytesUtf7 = Encoding.Convert(Encoding.Unicode, Encoding.UTF7, entryBytes);
   string entryUtf7 = Encoding.UTF7.GetString(entryBytesUtf7); // get the bytes as a utf7 string
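
To make the encoding behaviour easier to poke at, here is a minimal sketch. The class and method names (EncodingDemo, ReadUtf8AsLatin1, RoundTripThroughUtf7, Utf7WireText) are made up for illustration and are not part of WordBlogX. It assumes the stray characters come from UTF-8 bytes being read as ISO 8859-1 (code page 28591), and it wraps the same three calls as the snippet above so their result can be compared with the input.

```csharp
using System.Text;

public static class EncodingDemo
{
    // Encode text as UTF-8, then decode those bytes as ISO 8859-1 (code page 28591).
    // This mimics a reader that guesses the wrong charset for a UTF-8 page.
    public static string ReadUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(28591).GetString(utf8Bytes);
    }

#pragma warning disable SYSLIB0001 // Encoding.UTF7 is marked obsolete on newer .NET versions
    // The same three calls as the snippet above, wrapped in a method.
    public static string RoundTripThroughUtf7(string entry)
    {
        byte[] entryBytes = Encoding.Unicode.GetBytes(entry);
        byte[] entryBytesUtf7 = Encoding.Convert(Encoding.Unicode, Encoding.UTF7, entryBytes);
        return Encoding.UTF7.GetString(entryBytesUtf7);
    }

    // The UTF-7 bytes viewed as plain ASCII text, without decoding them again.
    public static string Utf7WireText(string entry)
    {
        byte[] utf7Bytes = Encoding.UTF7.GetBytes(entry);
        return Encoding.ASCII.GetString(utf7Bytes);
    }
#pragma warning restore SYSLIB0001
}
```

Calling EncodingDemo.ReadUtf8AsLatin1("\u00A0") gives an "Â" followed by a non-breaking space, which matches the stray characters described above. RoundTripThroughUtf7 hands back the same string it was given, because Encoding.UTF7.GetString decodes the UTF-7 bytes again, so it is worth checking whether that last call is really doing the conversion. Utf7WireText shows the "+AKA-" style escapes that the UTF-7 bytes contain for a non-breaking space.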

 

Now, if I did my job correctly, that text came out coloured (ahh, the dreaded “u” still exists in WordBlogX v2!) and indented and everything.  Honestly, getting that text indented turned out to be the most painful part of this whole feature.  This is the slightly hacked code that I’m currently using to get white-space preserved:

 




   
   // convert all double spaces into a space + non-breaking space to preserve "indentedness" (like for code)
   StringBuilder entryWithSpaces = new StringBuilder();
   int iEntryStart = 0;
   int iEntryEnd = 0;
   regex = new Regex(@"<span.*?>(.*)</span>"); // look for those spans with two or more spaces in their body
   mc = regex.Matches(entry);
   if (0 < mc.Count)
   {
       foreach (Match m in mc)
       {
           iEntryEnd = m.Groups[0].Index;
           if (iEntryStart < iEntryEnd)
           {
               entryWithSpaces.Append(entry.Substring(iEntryStart, iEntryEnd - iEntryStart));
           }

           entryWithSpaces.Append(m.Groups[0].Value, 0, m.Groups[1].Index - m.Groups[0].Index);
           entryWithSpaces.Append(m.Groups[1].Value.Replace("  ", " &nbsp;"));

           iEntryStart = m.Groups[1].Index + m.Groups[1].Length; // move beyond all the matched stuff
       }

       // if anything is left over, tack it on to the end
       if (iEntryEnd < iEntryStart)
       {
           entryWithSpaces.Append(entry.Substring(iEntryStart));
       }

       // get the entry back as a string
       entry = entryWithSpaces.ToString();
   }
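
For comparison, the same idea can be sketched more compactly with Regex.Replace and a match evaluator. This is not the code WordBlogX uses; SpacePreserver and PreserveSpaces are hypothetical names, and the pattern differs from the one above: it uses a lazy body match and `[^>]*` for the attributes so that each span is handled separately. It assumes spans are not nested.

```csharp
using System.Text.RegularExpressions;

public static class SpacePreserver
{
    // Group 1: opening <span ...> tag, group 2: span body, group 3: closing tag.
    // Singleline lets the body run across line breaks. Nested spans are not handled.
    private static readonly Regex SpanBody =
        new Regex(@"(<span[^>]*>)(.*?)(</span>)", RegexOptions.Singleline);

    // Replace each pair of spaces inside a span body with a space plus &nbsp;
    // and leave everything outside the spans untouched.
    public static string PreserveSpaces(string html)
    {
        return SpanBody.Replace(html, m =>
            m.Groups[1].Value
            + m.Groups[2].Value.Replace("  ", " &nbsp;")
            + m.Groups[3].Value);
    }
}
```

With this sketch, SpacePreserver.PreserveSpaces applied to `<p>a  b</p><span class="c">x  y</span>` leaves the paragraph alone and turns the span body into `x &nbsp;y`. Only the tag text captured in groups 1 and 3 is copied through unchanged, so double spaces inside an attribute list are not rewritten.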

 

I’m not proud of that code, but it seems to be getting the job done now. Assuming everything posts correctly, finally getting it working tonight will be totally worth it.

 

I still haven’t written the deferred CustomAction to take care of the path manipulations necessary for WordBlogX, but that will be cake compared to the hoops I’ve been jumping through to get WordBlogX to look good.

 

Well, that’s enough for tonight.  I have a lot I’d like to blog about since I’ve kinda’ been distracted lately.  More later…

Comments (2)

  1. I just skimmed right over all the encoding details in this blog entry, but while reading Chris Sells’ blog (http://www.sellsbrothers.com/) I came across this link: http://www.joelonsoftware.com/articles/Unicode.html. If you want an in-depth look at encoding, check out that article. Thought it would be a good way to end the night. Good night.

  2. Mike Dimmick says:

    Ah, that’s where it’s coming from!

    Character code 160 – decimal – is Unicode value u00A0; Unicode values are specified in Hex. According to the chart at http://www.unicode.org/charts/PDF/U0080.pdf, U+00A0 is a non-breaking space. Word has to do this because, unless you use the xml:space attribute and have an XML parser that understands it, contiguous whitespace in XML is reported to the application as a single space character, U+0020.

    blogs.gotdotnet.com is currently reporting Content-Type: text/html; charset=utf-8 for /robmen/default.aspx, and there doesn’t appear to be an encoding specified in the HTML source anywhere. It’s not Windows-1252, because that also has a non-breaking space at code point 160.

    The correct UTF-8 representation for U+00A0 is 0xC2 0xA0. I think the page is being output as UTF-8, but interpreted by the browser as Windows-1252 (or possibly ISO 8859-1), where 0xC2 is Latin-letter-A-with-circumflex.

    You also have this problem on MSDN, I’ve noticed.

    I think UTF-7 is going to produce seriously mangled output if anyone ever tries to use non-ASCII range characters and the server or page isn’t configured to report UTF-7 as the encoding. You’re better off ensuring that the server is correctly configured to report UTF-8 in the Content-Type header, or that the page includes an appropriate META HTTP-EQUIV tag. I realise that this is something that WordBlogX can’t easily do – perhaps you could get it to retrieve the HEAD of the page and output whatever encoding is reported by the server?