How can I get the URL to the Web page the clipboard was copied from?


When you copy content from a Web page to the clipboard and then paste it into OneNote, OneNote pastes the content but also annotates it "Pasted from ...". How does OneNote know where the content was copied from?

As noted in the documentation for the HTML clipboard format, Web browsers can provide an optional SourceURL property to specify the Web page the HTML was copied from.

Let's write a Little Program that mimics what OneNote does, but just in plain text, because I don't want to try to parse HTML. This is much easier to do in C#, because the BCL provides most of the helper functions.

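using System;
using System.IO;
using System.Windows; // Clipboard lives in PresentationCore

class Program {
 [STAThread] // clipboard access requires a single-threaded apartment
 public static void Main() {
  // Print the plain-text version of the clipboard contents
  System.Console.WriteLine(Clipboard.GetText());

  // Scan the HTML clipboard format line by line for the SourceURL
  using (var sr = new StringReader(
               Clipboard.GetText(TextDataFormat.Html))) {
   string s;
   while ((s = sr.ReadLine()) != null) {
    if (s.StartsWith("SourceURL:", StringComparison.Ordinal)) {
     System.Console.WriteLine("Copied from {0}", s.Substring(10));
     break;
    }
   }
  }
 }
}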

First, we get the text from the clipboard and print it. That's the easy part.

Next, we get the HTML text from the clipboard. This is a bunch of text in a particular format. We look for an entry that specifies the SourceURL; if we find it, then we print the URL.

This code is rather sloppy. For example, if the header has no SourceURL but the HTML itself contains a line that begins with SourceURL:haha-fakeout, we risk misdetecting it as the source. To do this properly, we would have to verify that the string appears in the header area, before the HTML itself begins (the header's StartHTML value gives that offset).

But this is a Little Program, so I can skip all that stuff.
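For the record, here is roughly what the careful version of the check might look like. Treat it as a sketch under assumptions, not a reference implementation: GetSourceUrl, FindHeaderValue, and ParseOffset are names I made up for illustration, the code assumes the header's StartHTML value (or StartFragment, if StartHTML is missing or -1) marks where the header ends, and it works on a plain string, so it leaves out the clipboard calls entirely.

```cpp
#include <algorithm>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: return the value of the first "name:value" line
// found within the first `limit` bytes of the payload.
static std::optional<std::string_view> FindHeaderValue(
    std::string_view payload, std::string_view name, std::size_t limit)
{
 std::string_view header = payload.substr(0, limit);
 while (!header.empty()) {
  std::size_t eol = header.find_first_of("\r\n");
  std::string_view line = header.substr(0, eol);
  if (line.size() > name.size() &&
      line.substr(0, name.size()) == name &&
      line[name.size()] == ':') {
   return line.substr(name.size() + 1);
  }
  if (eol == std::string_view::npos) break;
  // Skip past this line and its line terminator
  header.remove_prefix(eol);
  header.remove_prefix(
      std::min(header.find_first_not_of("\r\n"), header.size()));
 }
 return std::nullopt;
}

// Hypothetical helper: parse a decimal offset such as "0000000105" or "-1".
static std::optional<long long> ParseOffset(std::string_view text)
{
 long long value = 0;
 auto result = std::from_chars(text.data(), text.data() + text.size(), value);
 if (result.ec != std::errc()) return std::nullopt;
 return value;
}

// Hypothetical helper: extract SourceURL, but only from the header area.
std::optional<std::string> GetSourceUrl(std::string_view payload)
{
 // Assumption: the header ends at the StartHTML offset; if that is
 // missing or negative, fall back to the StartFragment offset.
 std::optional<std::size_t> headerEnd;
 for (std::string_view key : { "StartHTML", "StartFragment" }) {
  std::optional<long long> offset;
  if (auto value = FindHeaderValue(payload, key, payload.size())) {
   offset = ParseOffset(*value);
  }
  if (offset && *offset >= 0 &&
      static_cast<std::size_t>(*offset) <= payload.size()) {
   headerEnd = static_cast<std::size_t>(*offset);
   break;
  }
 }
 if (!headerEnd) return std::nullopt; // no usable offset: give up

 // Only lines inside the header area are considered.
 auto url = FindHeaderValue(payload, "SourceURL", *headerEnd);
 if (!url) return std::nullopt;
 return std::string(*url);
}
```

Given a payload whose header has no SourceURL but whose HTML contains a line SourceURL:haha-fakeout, GetSourceUrl returns an empty optional rather than falling for the fakeout.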

Here's a sketch of the equivalent C/C++ version:

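#include <windows.h>
#include <stdio.h>
#include <string.h>

int __cdecl main(int, char **)
{
 if (OpenClipboard(NULL)) {

  // Obtain the Unicode text and print it
  HANDLE h = GetClipboardData(CF_UNICODETEXT);
  if (h) {
   PCWSTR pszPlainText = (PCWSTR)GlobalLock(h);
   if (pszPlainText) {
    wprintf(L"%ls\n", pszPlainText);
    GlobalUnlock(h);
   }
  }

  // Obtain the HTML text and extract the SourceURL
  h = GetClipboardData(RegisterClipboardFormat(TEXT("HTML Format")));
  if (h) {
   PCSTR pszHtmlFormat = (PCSTR)GlobalLock(h);
   if (pszHtmlFormat) {
    // Walk the text one line at a time (assumes it is null-terminated)
    for (PCSTR psz = pszHtmlFormat; *psz; psz += strspn(psz, "\r\n")) {
     size_t cch = strcspn(psz, "\r\n"); // length of the current line
     if (strncmp(psz, "SourceURL:", 10) == 0) {
      printf("Copied from %.*s\n", (int)(cch - 10), psz + 10);
      break;
     }
     psz += cch;
    }
    GlobalUnlock(h);
   }
  }
  CloseClipboard();
 }
 return 0;
}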
Comments (18)
  1. Gabe says:

    If there is no SourceURL, there is the possibility that the copying program put a <BASE> element in the HTML that you can use for a source.

  2. kantos says:

    I never cease to be amazed at the lack of just basic RAII helpers that are not provided with the windows SDK… even a basic clipboard that always ensure that GlobalUnlock is called would be useful

    [Windows is a cross-language platform. It focuses on making things possible from any language. If you have a language that supports RAII, then you can add RAII wrappers around the core functionality. -Raymond]
  3. Nick says:

    Ah, so that's what OneNote is doing when it helpfully says what site you pasted in an HTML fragment from.

  4. Wow! A C# example. What a sight for the sore eyes.

  5. Brian_EE says:

    @FleetCommand: Your eyes are still sore from reading Raymond's Friday post and subsequent response?

  6. Joshua says:

    [then you can add RAII wrappers around the core functionality. -Raymond]

    Following up:

    template <class T> class GlobalMemoryRegion {
      private:
         HGLOBAL h;
         T ptr;
      public:
         GlobalMemoryRegion(HGLOBAL hmem) : h(hmem) { ptr = (T)GlobalLock(h); }
         ~GlobalMemoryRegion() { GlobalUnlock(h); ptr = NULL; }
         operator T () { return ptr; }
         operator const T () const { return ptr; }
    }

  7. @Brian_EE: Not really, dear. I am not alien to personal attacks and assumption of bad faith. Once I gave a feedback to a collage professor that in an essay he has written for students, it is better for him to use a less formal language. Well, he downgraded the language to the level of street talk and literally added four F-words just to prove I am an idiot. Cost him his tenure review four years later. I just am mildly surprised what in my post hurt Raymond's feeling so much. I tried to be as courteous as possible. I myself welcome such input in my own blog. Damn, it helped me a lot too.

    But I love C# and certainly welcome more use of it in this blog.

  8. bill s says:

    <i>Once I gave a feedback to a collage professor that in an essay he has written for students, it is better for him to use a less formal language. Well, he downgraded the language to the level of street talk and literally added four F-words just to prove I am an idiot. Cost him his tenure review four years later.</I>

    And you're, like, proud of this or something? Get a life, man.

  9. John Doe says:

    @bill s, it's about showing the point: if you take criticism to the ridicule, expose yourself to the ridicule, risk making yourself a figure of ridicule.

    But comments to that post should be on that post, not here.

    BTW, that John Doe, "The longer version is much better.", it's not me, it's another John Doe.

    You can too make GlobalLock/GlobalUnlock automatic in .NET with IDispose and C#'s using statement.

  10. waleri says:

    @Joshua

    This code would cause more problems than it would solve.

  11. 640k says:

    This is the worst feature ever added and many M$ apps do it already. Past plain text instead of bloated formatted text junk which the user has to remove because it's wrong anyway. These "features" does NOT help anyone, it only makes peoples job harder to perform.

  12. Katie says:

    @640k

    I appreciate the little bits that OneNote adds – it makes it easier to keep track of all the sources I've used as I build up a binder worth of notes on a project. ~~~~

  13. skSdnW says:

    Why does MSDN refer to it as CF_HTML? It is not a native clipboard format, if anything it should be called CFSTR_HTML…

  14. Entegy says:

    @640k Normally I'd ask if you're 12… But I know you've been reading this blog long enough that you aren't. Which means you're an adult human being who uses M$. Grow up.

    I find the pasted from feature extremely useful when taking notes or snippets of passages that I want to source or remember where that information comes from. And I'm extremely happy that Raymond gave a C# example for mimicking!

  15. morlamweb says:

    @skSdnW: probably for consistency with the system-provided clipboard formats.  My guess is that it's a registered clipboard format.  Once it's registered, there's little difference between it and a system-provided format, so why not use the established naming convention?  What benefit(s) does the CFSTR_HTML name confer that CF_HTML doesn't?  Imagine if they took your advice (way back when it was first conceived) and they called it CFSTR_HTML.  Imagine the cries of geeks of a different stripe: "Stupid Microsoft!  I just want to use the HTML clipboard format.  Why did they have to give it a different name??  Now I have to remember that the HTML format has a different name from all the others.  Stupid Microsoft!!!1!"

  16. morlamweb says:

    @skSdnW: From what I can see, the CFSTR_* naming convention is for the Shell Clipboard formats: msdn.microsoft.com/…/bb776902%28v=vs.85%29.aspx

    The shell formats are used when copying Windows shell objects through the clipboard.  CF_HTML, on the other hand, is for copying fragments of HTML to the clipboard.  Sure, the shell could use CF_HTML, but so could any application that a) understands HTML and b) wants to share HTML with the clipboard.  In other words, CF_HTML is a more general-purpose clipboard format.  Why should it adhere to the naming convention of the shell clipboard formats when it has more in common with CF_TEXT?

  17. skSdnW says:

    @morlamweb: For consistency it _should_ be named CFSTR_HTML! CF_* are number constants and are used by the older clipboard formats and you don't have to register them. CFSTR_* are strings and have to be registered before you can use them. Several are already defined by the shell headers (CFSTR_INETURL etc).

  18. skSdnW says:

    @morlamweb: So you are saying that CFSTR_INETURL is not a general-purpose clipboard format just because it is defined by a shell header? Most browsers will put both CFSTR_INETURL and CF_TEXT on the clipboard if you drag&drop a link.

    My point is, RegisterClipboardFormat(CF_*) is never valid and neither is GetClipboardData(CFSTR_*) so why should CF_HTML be so special?
