What is this rogue version 1.0 of the HTML clipboard format?

At least as of the time this article was originally written, the HTML clipboard format is officially at version 0.9. A customer observed that sometimes they received HTML clipboard data that marked itself as version 1.0 and wanted to know where they could find documentation on that version.

As far as I can tell, there is no official version 1.0 of the HTML clipboard format.

I hunted around, and the source of the rogue version 1.0 format appears to be the WPF Toolkit. Version 1.0 has been the version used by ClipboardHelper.cs since its initial commit.

If you read the code, it appears that they are not generating HTML clipboard data that uses any features beyond version 0.9, so the initial impression is that it's just somebody who jumped the gun and set their version number higher than they should have. The preliminary analysis says that you can treat version 1.0 the same as version 0.9.

But that's merely the preliminary analysis.

A closer look at the GetClipboardContentForHtml function shows that it generates the HTML content incorrectly. The code treats the fragment start and end offsets as character offsets, not byte offsets. But the offsets are explicitly documented as being in bytes.

    StartFragment    Byte count from the beginning of the clipboard to the start of the fragment.
    EndFragment      Byte count from the beginning of the clipboard to the end of the fragment.

My guess is that the author of that helper function made two mistakes that partially offset each other.

  1. The author failed to take into account that C# operates in Unicode, whereas the HTML clipboard format operates in UTF-8. The byte offset specified in the HTML format header is the byte count in the UTF-8 encoding, not the byte count in the UTF-16 encoding used by C#.
  2. The author did all their testing with ASCII strings. UTF-8 has the property that ASCII encodes to itself, so one byte equals one character. If they had tested with a non-ASCII character, they would have seen the importance of the byte count. (Or maybe they simply would have gotten more confused.)

Now, WPF knows that the DataFormats.HTML clipboard format is encoded in UTF-8, so when you pass a C# string to be placed on the clipboard as HTML, it knows to convert the string to UTF-8 before putting it on the clipboard. But it doesn't know to convert the offsets you provided in the HTML fragment itself. As a result, the values encoded in the offsets end up too small if the text contains non-ASCII characters. (You can see this by copying text containing non-ASCII characters from the DataGrid control, then pasting into Word. Result: Truncated text, possibly truncated to nothing depending on the nature of the text.)
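To make the discrepancy concrete, here is a small Python sketch (not taken from the WPF Toolkit) that compares the number of UTF-16 code units in a string, which is what a C# string's Length reports, with the number of bytes in its UTF-8 encoding. The helper names are made up for illustration.

```python
def utf16_code_units(s: str) -> int:
    """Number of UTF-16 code units in s (what a C# string's Length reports)."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s: str) -> int:
    """Number of bytes s occupies once encoded as UTF-8."""
    return len(s.encode("utf-8"))


def offset_shortfall(s: str) -> int:
    """How far a character-based offset falls short of the UTF-8 byte offset."""
    return utf8_bytes(s) - utf16_code_units(s)


if __name__ == "__main__":
    for sample in ["hello", "caf\u00e9", "\u65e5\u672c\u8a9e"]:
        print(repr(sample), utf16_code_units(sample), utf8_bytes(sample))
```

For pure ASCII the two counts agree, which is why ASCII-only testing would never reveal the bug. Every non-ASCII character makes the character-based offset fall further short of the byte offset that the format calls for.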

There are two other errors in the GetClipboardContentForHtml function: Although the code attempts to follow the recommendation of the specification by placing a <!--EndFragment--> marker after the fragment, it erroneously inserts a \r\n in between. Furthermore, the EndHTML value is off by two. (It should be DATAGRIDVIEW_htmlEndFragment.Length, which is 38, not 36.)

Okay, now that we see the full situation, it becomes clear that at least five things need to happen.

The immediate concern is what an application should do when it sees a rogue version 1.0. One approach is to exactly undo the errors in the WPF Toolkit: Treat the offsets as character offsets (after converting from UTF-8 to UTF-16) rather than byte offsets. This would address the direct problem of the WPF Toolkit, but it is also far too aggressive, because there may be another application which accidentally marked its HTML clipboard data as version 1.0 but which does not contain the exact same bug as the WPF Toolkit.

Instead, applications which see a version number of 1.0 should treat the EndHTML, EndFragment, and EndSelection offsets as untrustworthy. The application should verify that the EndFragment lines up with the <!--EndFragment--> marker. If it does not, then ignore the specified value for EndFragment and infer the correct offset to the fragment end by searching for the last occurrence of the <!--EndFragment--> marker in the clipboard data, but trim off the spurious \r\n that the WPF Toolkit erroneously inserted, if present. Similarly, EndHTML should line up with the end of the </HTML> tag; if not, the specified offset should be ignored and the correct value inferred. Fortunately, the WPF Toolkit does not use EndSelection, so there is no need to attempt to repair that value, and it does not use multiple fragments, so only one fragment repair is necessary.
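The repair can be sketched roughly as follows. This is a Python illustration under stated assumptions, not production code: it assumes the clipboard bytes have already been retrieved, that the header is a run of Keyword:value lines ahead of the HTML, and that there is a single fragment. The names parse_header and repaired_end_offsets are invented for this example.

```python
import re

END_MARKER = b"<!--EndFragment-->"


def parse_header(data: bytes) -> dict:
    """Collect the Keyword:value lines that precede the HTML itself."""
    fields = {}
    for line in data.splitlines():
        match = re.match(rb"(\w+):(.*)$", line)
        if not match:
            break  # first line that is not Keyword:value ends the header
        key = match.group(1).decode("ascii")
        fields[key] = match.group(2).decode("ascii", "replace").strip()
    return fields


def repaired_end_offsets(data: bytes) -> tuple:
    """Return (EndFragment, EndHTML) as byte offsets, repaired if needed."""
    fields = parse_header(data)
    end_fragment = int(fields["EndFragment"])
    end_html = int(fields["EndHTML"])
    if fields.get("Version") != "1.0":
        return end_fragment, end_html  # not the rogue version: trust the header

    # EndFragment should point exactly at the <!--EndFragment--> marker.
    if not data.startswith(END_MARKER, end_fragment):
        pos = data.rfind(END_MARKER)
        if pos >= 0:
            if data[:pos].endswith(b"\r\n"):
                pos -= 2  # drop the spurious \r\n before the marker
            end_fragment = pos

    # EndHTML should point just past the closing </HTML> tag.
    last_close = None
    for last_close in re.finditer(rb"</html\s*>", data, re.IGNORECASE):
        pass
    if last_close is not None and last_close.end() != end_html:
        end_html = last_close.end()

    return end_fragment, end_html
```

Note that the sketch repairs only what fails verification. Data marked as version 1.0 whose offsets already line up is left alone, so a producer that got the offsets right is not double-corrected.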

Welcome to the world of application compatibility, where you have to accommodate the mistakes of others.

Some readers of this Web site would suggest that the correct course of action for your application is to detect version 1.0 and put up an error message saying, "The HTML on the clipboard was placed there by a buggy application. Contact the vendor of that application and tell them to fix their bug. Until then, I will refuse to paste the data you copied. Don't blame me! I did nothing wrong!" Good luck with that.

Second, the authors of the WPF Toolkit should fix their bug so that they encode the offsets correctly in their HTML clipboard format.

Third, at the same time they fix their bug, they should switch their reported version number back to 0.9, so as to say, "Okay, everybody, this is the not-buggy version. No workaround needed any more." If they leave it as 1.0, then applications which took the more aggressive workaround will end up double-correcting.
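For comparison, here is a minimal sketch of a producer that computes its offsets in UTF-8 bytes and reports version 0.9. It is written in Python for brevity and is not the WPF Toolkit's code. The function name build_cf_html, the surrounding <HTML><BODY> wrapper, and the ten-digit zero padding are choices made for this illustration; the padding is there only so that the header length is known before the offsets are filled in.

```python
START_MARKER = "<!--StartFragment-->"
END_MARKER = "<!--EndFragment-->"

# Fixed-width fields keep the header the same length whatever the offsets are.
HEADER = (
    "Version:0.9\r\n"
    "StartHTML:{:010d}\r\n"
    "EndHTML:{:010d}\r\n"
    "StartFragment:{:010d}\r\n"
    "EndFragment:{:010d}\r\n"
)


def build_cf_html(fragment: str) -> bytes:
    """Wrap an HTML fragment in clipboard-format headers with byte offsets."""
    # Encode first, then measure: every offset below is a UTF-8 byte count.
    prefix = ("<HTML><BODY>" + START_MARKER).encode("utf-8")
    body = fragment.encode("utf-8")
    # The end marker follows the fragment directly, with no \r\n in between.
    suffix = (END_MARKER + "</BODY></HTML>").encode("utf-8")

    start_html = len(HEADER.format(0, 0, 0, 0))
    start_fragment = start_html + len(prefix)
    end_fragment = start_fragment + len(body)
    end_html = end_fragment + len(suffix)

    header = HEADER.format(start_html, end_html, start_fragment, end_fragment)
    return header.encode("ascii") + prefix + body + suffix
```

The key design point is the order of operations: the text is converted to UTF-8 before anything is measured, so the offsets stay correct when the fragment contains non-ASCII characters.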

Fourth, the maintainers of the HTML clipboard format may want to document the rogue version 1.0 clipboard format and provide recommendations to applications (like I just did) as to what they should do when they encounter it.

Fifth, the maintainers of the HTML clipboard format must not use version 1.0 as the version number for any future revision of the HTML clipboard format. If they make another version, they need to call it 0.99 or 1.01 or something different from 1.0. Version 1.0 is now tainted. It's the version number that proclaims, "I am buggy!"

At first, we thought that all we found was a typo in an open-source helper library, but digging deeper and deeper revealed that it was actually a symptom of a much deeper problem that has now turned into an industry-wide five-pronged plan for remediation.

Comments (34)
  1. John says:

    This is kind of a fortunate situation, right?  I mean, what if they hadn't messed up the version number as well?

  2. Eitan says:

    What do you mean by "C# operates in Unicode, whereas the HTML clipboard format operates in UTF-8." vs UTF-8?  UTF-8 is an encoding of Unicode code points.

  3. Eitan says:

    What do you mean by "C# operates in Unicode, whereas the HTML clipboard format operates in UTF-8."?  UTF-8 is an encoding of Unicode code points.

  4. @Eitan says:

    "C# operates in UTF-16LE whereas the HTML clipboard format operates in UTF-8."

  5. Mordachai says:

    He means Microsoft's OS native UNICODE, which is UTF-16, which he also mentions a few times.

  6. ErikF says:

    @Eitan: Generally when Microsoft people are speaking informally, Unicode means UTF-16 (or, historically, UCS-2). Since Raymond works with Win32 almost exclusively, that's where he's coming from.

  7. Barbie says:

    @Eitan: welcome to Raymond's blog. As old timers will tell you, Unicode, within the world of Windows as an encoding, means the encoding that Windows uses internally for Unicode strings. That's UTF16.

  8. derpisch says:

    @Eitan: MSDN considers 'Unicode' (the character set) to be synonymous with UTF-16 (the character encoding).

    Unfortunately UTF-8 came about after the initial Windows Unicode support.

  9. Dan Bugglin says:

    @John then you would simply check for the misaligned ending offsets for ALL HTML 0.9 clipboard data, not just 1.0.

    I'm surprised Raymond didn't point out that simply throwing up an error dialog on HTML 1.0 content would be a bad idea, especially since the user may notice a competitor's program seems to have no problems with WPF Toolkit HTML clipboard data (and if enough users run into the problem your competitor may intentionally get WPF Toolkit clipboard data working in their app and tout it as a feature that your app doesn't have!).  Now your app looks buggy and you look incompetent in the eyes of the user.

    Then again he did sort of build a strawman there; knock it down and you have a complete strawman argument.

  10. Joshua says:


    You wouldn't believe how many data formats exist that the writing software is supposed to write its name & version when writing the format data.

    This does tend to make fixing somebody else's bugs somewhat easier.

  11. Ken White says:

    @TheMAZZTer: Raymond did point out that you shouldn't just throw up a dialog. Read the paragraph starting with "Some readers of this Web site".

    [I didn't say that you should or shouldn't. I'm just saying that some people would argue that a dialog is the correct behavior, and I wished those people good luck. -Raymond]
  12. Anonymous Coward says:

    That's nice advice and all that, but I'll just keep doing what I've always done when I encountered an unsupported version: just paste plain text.

    The alternative seems a lot of cruft to code, vet and maintain, for something that's going to be fixed anyway. Maybe if I had to deal with conmen competitors like described above I could make a case for it, but at present, I can't.

    [Given that the problem has existed for over four years, I wouldn't be so sure about the "going to be fixed anyway." -Raymond]
  13. Suggestion for the maintainers of the HTML clipboard format spec.: in addition to a CF_HTML version, allow an implementation to specify its own name and version (e.g., Source:Windows Presentation Foundation/1.0) which allows clients to make implementation-specific hacks instead of standard-version-specific hacks.

  14. John says:

    @The MAZZTer:  Since 0.9 is (apparently) the initial version, that would require ALWAYS validating the data which is why using a non-existent version is sort of lucky.  On the other hand, if it's so easy to mess up that an official toolkit released by Microsoft is broken for years then I imagine there are many other buggy implementations out there so you should probably always validate it anyway.

  15. Andreas Rejbrand says:

    Hm… Like so many things I know and love (in particular, the GDI), this topic is referred to as a 'Legacy API' by the MSDN.

  16. Joshua says:

    [Given that the problem has existed for over four years, I wouldn't be so sure about the "going to be fixed anyway." -Raymond]

    By which we also know that Raymond has no intention of submitting a patch (not that he has any obligation to). Meh. If I used it I'd submit a patch based on this, but I don't so I won't either.

  17. cheong00 says:

    I think any application that supports HTML clipboard format should support TEXT as well, so the course I'd have chosen would be to request the clipboard data in TEXT instead.

  18. Smitty says:

    "byte count in the UTF-8 encoding, not the byte count in the UTF-16 encoding"

    Surely the byte count is independent of which encoding is used. If I have a string in UTF-8 that is 8 bytes long, that could be anything from 2 to 8 characters, but it's still 8 bytes.  Similarly, an 8-byte UTF-16 string is between 2 and 4 characters long (if I've done my math right), but it's still 8 bytes.

    Still, it's always refreshing to read more idiocy associated with clipboard formats.  

  19. Neil says:

    "byte count in the UTF-16 encoding"

    Don't you mean the code point count in the UTF-16 encoding, which would be the same for ASCII as the byte count in UTF-8 encoding?

    Would an alternative be to validate the counts as UTF-16 code points before trying to interpret them as UTF-8 byte counts?

  20. Neil says:

    Sorry, I mean UTF-16 code units of course.

  21. @Smitty says:

    "Surely the byte count is independent of which encoding is used." What a strange statement. The byte length of the binary representation (encoding) of a given string certainly depends on the encoding. If the encoding is UTF-32 then "A" will have a byte count of 4, but in UTF-8, "A" will have a byte count of 1.

  22. 640k says:

    I only copy/paste latin letters. Unicode support in WPF is generally lacking.

  23. @@Smitty

    What Smitty means is that 8 bytes is 8 bytes, and a byte offset into any number of bytes is still a byte offset into those bytes.  What is IN any given 8 bytes doesn't alter the fact that 8 bytes is what it is.  Where the encoding matters is when what you have encoded in those 8 bytes DEcodes to a sequence of data points where the number of data points stored may vary according to the encoding.

    8 bytes of UTF8 might be 2 Unicode "characters", or it might be 3, 4 or as many as 8.

    8 bytes of UTF16 might be 4 Unicode "characters", or it might be 2 or 3.

    But 8 bytes of UTF8 is 8 bytes.

    And 8 bytes of UTF16 is 8 bytes.

    It is akin to the "Which is heavier:  A ton of feathers or a ton of bricks ?" type question.

    What got you confused was that you started thinking in terms of "8 characters", for which the length of the encoding in UTF8 vs UTF16 will of course vary, and if an offset is specified in characters then the actual BYTE offset into any given encoding will vary according to that encoding.

    But that isn't what Raymond said – he wrote (in the sentence that Smitty references) entirely in terms of bytes.  That is: grams/kilograms, not numbers of feathers or bricks.

  24. Daniel15 says:

    Did you report this to the WPF team?

  25. GregM says:

    Joylon, I'm trying to understand what you wrote, but I can't.  Raymond's sentence is correct as written, but I can't tell if you agree or not.

    Here is an example of some text that I want to put on the clipboard using offsets:

    "I want to copy just *this* text".

    If you are giving the offsets so that you only are copying "*this*", then the character offset to the start is 20 and the end is 25, which are 40 and 50 bytes in C# Unicode.  When that text is converted to UTF-8, the byte offsets of the desired text are 20 and 25.  However, since the code specified 40 to 50 bytes, it is now the wrong text (and in this case, off the end of the string).

    Therefore: The byte offset specified in the HTML format header is the byte count **to the desired text** in the UTF-8 encoding, not the byte count **to the desired text** in the UTF-16 encoding used by C#.

  26. Henderson101 says:

    Silly question – when using this in C# with WPF, what actually happens? I mean – does it work as intended, or does it fail to yield correct results? I only ask (having never used the functionality) because I have no idea what the WPF CLR side code is doing. Is it possible it uses some kind of shim or other correction to make things "right"? Is it possible that the issue is "everyone else", and the WPF code just works? (We obviously assume everything Raymond says is true, and the WPF code is wrong, even if it does work for WPF based apps.)

  27. @Andreas Rejbrand:  "Hm… Like so many things I know and love (in particular, the GDI), this topic is referred to as a 'Legacy API' by the MSDN."

    Indeed… and many other APIs too – core painting desktop APIs and messages for which there is no replacement that I'm aware of.  For example: WM_PAINT, UpdateWindow, RedrawWindow, InvalidateRect.  These are "Legacy Graphics – Technologies that are obsolete and should not be used in new applications."  What are the Microsoft-recommended API replacements?  Direct2D looks like an option, except that the new Direct2D still has dependencies on these legacy APIs that "should not be used in new applications."  Direct2D example on MSDN uses: msdn.microsoft.com/…/dd370994(v=vs.85).aspx – this modern example still uses the following legacy APIs: UpdateWindow, WM_DISPLAYCHANGE, WM_PAINT.  How should we rewrite our apps to avoid these legacy APIs?

    What about OpenGL, which is completely marked as legacy?  It's the only 3D graphics API used for cross-platform development that I know of, and isn't obsolete outside of Microsoft.

    What about multiple monitors?  The entirety of the multiple monitor API has been designated as legacy, with no recommended replacement API that I could find.  Should I just conclude that multi-tasking with overlapping windows in general is legacy, since WM_PAINT is legacy and multiple monitors is legacy, with no replacements?  This worries me.

  28. Wei says:

    I think it is the explanation of the two mistakes that throws most people off.

    Most likely the author of GetClipboardContentForHtml treats the character count as the byte count.

    So when you test with ASCII, byte = character, but when coming from UTF-16, the byte count might be anywhere from 2 to 4 bytes per character.

    I have no idea what error GetClipboardContentForHtml is giving out. My wild guess is that the function can only handle ASCII.

    So the loop that tries to get a string using StartFragment and EndFragment might look like

    for(char count = StartFragment ; count < EndFragment ;count++)
        *ptr_result++ = *(ptr_OriginalString + count);

    That is simple but can only handle ASCII characters.

    However, if each character has a variable length, the count increment must be handled on each iteration, and some more code which looks at *(ptr_OriginalString + count) must be added to the body of the for loop:

    for(char count = StartFragment ; count < EndFragment ;)
      if( (*(ptr_OriginalString + count) & 0x80) != true )
         "Some extra work to determine UTF format and get number of remaining bytes for this character"
         for(count2 = 0; count2 < "number of remaining bytes"; count2++ , ptr_result++ , count++)
             *ptr_result = *(ptr_OriginalString + count);

    Something similar to this.

    Not sure if this is the case, but this is my best guess without knowing what kind of error the function outputs.

  29. ErikF says:

    @Henderson101: I don't think this code could ever work with multi-byte character sequences because it computes the character length *before* encoding (and makes a seriously bad assumption to boot!)

    Here are the two lines of code that are responsible for the bad assumption (lines 127-128):

    // Marshal.SystemDefaultCharSize is 2 on WinXP Pro – so the offsets seem to be in character counts instead of bytes.

    int bytecountEndOfFragment = 135 + sbContent.Length;

    /* … put the content together … */

    My gut suspicion is that this code was "reverse-engineered" from trial and error; quite possibly the clipboard documentation was never looked at!

    [That comment suggests that the original developer thought that the HTML clipboard was encoded in UTF-16. -Raymond]
  30. Kevin Eshbach says:


    In my opinion I would not worry about the APIs being marked as legacy.  If Microsoft boots them from Windows and does not provide some sort of emulator to run applications that use these APIs then business users will not upgrade.

  31. @GregM – exactly.  A byte count is a byte count.  Not a character count.  That was entirely my point which was perfectly plain imho.  8 bytes of UTF8 is the same amount of bytes as 8 bytes of UTF16.  But it might not be the same amount of characters.

    Raymond's wording was a bit woolly, leading to the potential to read it as "the byte count is not the same as the byte count" if you didn't already know what Raymond was driving at, and to Smitty's (perhaps deliberate) confusion as to how two things that are the same quantity (a number of bytes) can be made different just by the particular composition of the thing involved.

    i.e. "8 kilograms of feathers is not the same as 8 kilograms of bricks"

    Smitty:  Um.  Yes it is.  8 kg is 8 kg, no matter what you are weighing.

    What Raymond meant was that the number of feathers you get in 8kg of feathers is not the same as the number of bricks you get in 8 kg of bricks.  If you want a certain numeric quantity of two different things, measuring by WEIGHT is not going to work.

    If you already knew this then Smitty's comment was nonsense, even though it made perfect sense.  My mistake was pointing out to others what they already knew by way of explaining how Smitty had become confused.  Of course, people not similarly confused wouldn't understand why or how Smitty was confused himself so my explanation would have appeared equally redundant and – to their mind – confusing (because they had already made the translation from kg to bricks/feathers, in their own mind).

    A persistent problem in this field of clever/knowledgeable people who assume that other people are just as clever/knowledgeable, is an inability to "put yourself in someone else's shoes and see things from their p.o.v", and express oneself accordingly.

    Short version:  I agree with both Raymond and Smitty.  ;)

  32. Matt says:


    Has reading Raymond's blog taught you NOTHING? If the API works now, and there is no security defect requiring its removal, it will work FOREVER because Microsoft will move mountains to avoid your app breaking on any future version of Windows.

  33. GregM says:

    Joylon, thanks for the clarification.

  34. @Matt: Actually it's taught me a lot.  Now, to your point – just because an API is there for legacy/compatibility reasons doesn't mean you should use it for new development.

    Some Win16 examples, all of which still exist but MSDN recommends not to use them:

    WinExec replaced with CreateProcess

    Get/WritePrivateProfileString replaced with the registry (INI files have Unicode problems, as Raymond noted)

    RegSetValue and related replaced with newer registry functions that support multiple values in a key

    WNetAddConnection replaced with WNetAddConnection2/WNetAddConnection3

    EnumFonts replaced with EnumFontFamiliesEx

    Other APIs aren't explicitly marked as legacy (yet?) – but exist for legacy reasons.  For example, GetParent, which Raymond blogged about a year ago.

    A newer example: SHBrowseForFolder replaced with new Vista-style IFileDialog API; the old API gives your app a "legacy" look

    All these APIs are still there, but not recommended to use in new development because the original APIs had problems, didn't fit well with other new features, or were just replaced with a newer/better design.  Using them means you might miss out on new stuff, or your app might appear "legacy".  Or it might trap you into having bugs because you didn't read the esoteric documentation (e.g. GetParent).  They might not delete the APIs for compatibility reasons but it's not ideal to use in new development.  Newer APIs might require you to stop using legacy APIs.  Legacy APIs could require cumbersome workarounds if they must be used.  And someday, the API might really go away.

    As far as Microsoft never trying to change the API to break you?  There's only some truth in that; here's a partial list of APIs that are either gone or going to go away (i.e. MSDN says might not be there in the future):

    A whole boatload of shell APIs, either unsupported or on their way out: msdn.microsoft.com/…/jj635743(VS.85).aspx

    Indexing Service API: msdn.microsoft.com/…/ee805985(VS.85).aspx

    Whatever online help system is not currently in style (WinHelp, HTML Help, and now they seem to have something new every couple years)

    Windows Gadgets & Sidebar

    Microsoft Agent

    Encrypting File System (EFS) APIs

    Transactional NTFS (talk about a short life there)

    Entire versions of .NET Framework

    The APIs seem to be removed because Windows decided to do things differently, or there were features that mostly failed and weren't widely adopted.

    That's why I'm asking what should replace the seemingly-essential desktop development APIs I mentioned previously.  Or are desktop apps now legacy since some core, essential APIs are legacy, and therefore new Windows development should not be desktop development?
