Sample code for Plagiarism Searcher tool

Here's my sample code for a tool to catch blog plagiarism that I described earlier. In retrospect, it was pretty easy to write (under 400 lines!). And edit-and-continue in C# and interceptable exceptions made my development time a lot faster!

The tool currently only works on a single entry, but it could be expanded to iterate over all entries in an entire blog. It's a simple C# console app which:
1. Takes in a source blog entry (via the clipboard) and a set of "author keywords" (via the command line). The keywords are things like the author's name or homepage that would indicate somebody is crediting the author. Ideally, the tool would read in all the entries from some blog RSS feed instead of doing just 1 entry at a time.
2. Breaks the entry up into excerpts (since it's unlikely somebody would plagiarize the entire entry). This also finds matches if they've changed a few characters. For example, perhaps their editing tool automatically changed something, such as transforming Unicode characters to ANSI or transforming emoticons.
3. Programmatically uses MSN Search to search for other URLs that include those excerpts.
4. For each resulting URL, scans that page for any indication of crediting the original author. One simple way is to search the page for a set of "author keywords" provided in step 1. If no such keywords are found, then assume the page is not crediting the author. 
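
Step 1 mentions reading entries from an RSS feed instead of the clipboard. The tool doesn't do that yet, but here's a minimal sketch of what it might look like, assuming an RSS 2.0 feed where each <item> carries the entry body in a <description> element. The FeedReader class and GetEntries method are hypothetical names, not part of the tool below, and the sketch takes the feed XML as a string so it stays independent of how the feed is downloaded.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class FeedReader
{
    // Hypothetical helper (not in the original tool): returns the text of every
    // <description> element under /rss/channel/item in an RSS 2.0 document.
    public static IList<string> GetEntries(string rssXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(rssXml);

        List<string> entries = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("/rss/channel/item/description"))
        {
            // InnerText returns the element text with XML entities already decoded,
            // so an escaped HTML body comes back as HTML markup.
            entries.Add(node.InnerText);
        }
        return entries;
    }
}
```

Each returned string could then be passed to PlagiarismSearcher.Search in place of the clipboard text.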

Offloading even more work to the Search Engine.
After I wrote the tool, I realized that step #4 (filtering out URLs that credit the author) can be folded into the search in step #3 (finding URLs that copy content from the author) by using a sufficiently intelligent query with the NOT and AND keywords. Consider a query like: (NOT link: <author homepage> ) AND (NOT <author name> ) AND " <some excerpt> ". This also has a great advantage: steps #3 and #4 then use the same copy of the page. That avoids step #3 using a cached version of the page while step #4 looks at a totally different version of the page.

As an example, this MSN search query:  "So a debugger would skip over the region between the" looks for all pages that contain that excerpt from this blog post of mine. I tried this MSN search query:
(NOT Link:blogs.msdn.com/jmstall) AND (NOT "Mike Stall") AND "So a debugger would skip over the region between the" which filters out anybody that mentions my name or links back to me.
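
Here's a rough sketch of how such a query string could be assembled in code. The CreditFilterQuery class and its Build method are hypothetical and not part of the tool below; the keyword syntax simply mirrors the query above, and I'm assuming a single homepage and a single author name.

```csharp
using System;
using System.Text;

static class CreditFilterQuery
{
    // Hypothetical helper (not in the original tool): builds one query that asks for
    // pages containing the excerpt but neither linking to the author's homepage
    // nor mentioning the author's name.
    public static string Build(string excerpt, string authorHomepage, string authorName)
    {
        StringBuilder sb = new StringBuilder();
        sb.AppendFormat("(NOT link:{0}) AND ", authorHomepage);
        sb.AppendFormat("(NOT \"{0}\") AND ", authorName);
        // Strip embedded quotes and trim so the excerpt stays a single quoted phrase.
        sb.AppendFormat("\"{0}\"", excerpt.Replace("\"", "").Trim());
        return sb.ToString();
    }
}
```

Note that the MsnSearch.SearchString helper in the code below is documented as not accounting for search keywords, so whether a query like this survives its URL encoding is something I'd want to verify before relying on it.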

Clearly, query engines with useful search keywords can become extremely powerful. This tradeoff reminds me of similar tradeoffs with SQL queries where it's possible to offload client side work onto the server with a more intelligent query.

Results:
I ran the tool with the full contents from my blog on 0xFeeFee Sequence points. The keywords were my name ('Mike Stall') and part of my blog URL ("jmstall"). Here's the output. It's pretty verbose because it prints all the excerpts that match, but the interesting part is the summary at the end.

[update] The tool originally found 1 candidate (with a 75% match) - but the candidate then went and added a reference back to me. I reran the tool and now that candidate doesn't show up. That's exactly how it should work! The original output is here. The tool now prints:

Test for plagiarism.
Getting entry data from clipboard contents.
Entry:#line hidden and 0xFeeFee sequence points
So
Keywords (if a target URL doesn't have any of these words, it may be plagiarizing): 'mike stall' 'jmstall'.
Doing Search. This could take a few minutes.
Search broke entry into 37 excerpt(s).
Found 1 URL(s) that contain excerpts from the entry w/o ref back to the author.
Url https://www.tagcloud.com/tag/Readify/nice/ has 1/37 matches:
------------------------
(0/1) use the `#line hidden' directive to mark a region of code as not exposed to the debugger.Eg:
------------------------
Summary: (sorted least to most matches)
(1/37) 2%: https://www.tagcloud.com/tag/Readify/nice/
 

[update] It found 1 candidate. It turns out that this is a false positive: the search engine cache found the copied content, but then the page had completely changed since then (it's an "Under Construction page" now).
[update] Just to be clear, Tagcloud is not plagiarising. Rather, this is a false positive and shows a shortcoming of the tool. The candidates were found using MSN Search, which uses a cached copy of the web pages. However, the author-crediting check uses the live copy of the webpage. So if the page has changed (perhaps it's a blog and the original entry is no longer on the homepage; or perhaps the site is down), then you can get false positives. This could be fixed by having both the candidate search and the crediting check use the same copy of the webpage. The easiest way to do this is to follow the suggestion above and have the search query use the AND, NOT, and LINK keywords to do both the candidate search and the crediting check.

Here's the code. Note that you need to compile it with references to Windows Forms and System.Web, like so:
csc t.cs /debug+ /r:System.Web.dll /r:System.Windows.Forms.dll

 
//-----------------------------------------------------------------------------
// Test harness to use MSN search to search for plagiarism.
// Author: Mike Stall.  https://blogs.msdn.com/jmstall
//-----------------------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
using System.Diagnostics;
// System.Windows.Forms is referenced too (via fully qualified names below) because we use the clipboard.

namespace Web2
{
    // Class to get Search results using https://search.msn.com.
    // We want to get a list of Urls from a given search string.
    // It currently does a HTTP query with an embedded query string to retrieve the result as XML.
    // It then extracts the URLS from the XML (which are conveniently in <url> tags).
    // If MSN Search ever comes out with a real API for internet searches, we should use that instead of this.
    // See https://blogs.msdn.com/jmstall/archive/2005/08/21/msn_search_api.aspx for details on this class.
    class MsnSearch
    {
        // Helper to get the Search result for an exact string.
        // This will escape the string results. This is very fragile and reverse engineered based off my  
        // observations about how MSN Search encodes searches into the query string.
        // This also does not account for search keywords (like "AND").
        // If there's a spec for the query string, we should find and use it.
        static Uri GetMSNSearchURL(string input)
        {
            // The 'FORMAT=XML' part requests the results back as XML, which will be easier to parse than HTML.
            StringBuilder sb = new StringBuilder(@"https://search.msn.com/results.aspx?FORMAT=XML&q=");
            sb.Append(System.Web.HttpUtility.UrlEncode(input)); // requires ref to System.Web.dll
            return new System.Uri(sb.ToString());
        }

        // Returns a list of URLs for the search results against a string input.
        // This currently does not recognize any search keywords.
        // For an exact search, place the input string in quotes.
        // Note that these searches are not exact. For example, the search engine may have used a cached 
        // webpage and the URL may have changed since then. Or the search engine may take some
        // liberties about what constitutes an "exact" match.
        public static IList<Uri> SearchString(string input)
        {
            Uri url = GetMSNSearchURL(input);
            string x;
            WebRequest request = WebRequest.Create(url);
            // Dispose the response and reader so the connection is released promptly.
            using (WebResponse response = request.GetResponse())
            using (StreamReader s = new StreamReader(response.GetResponseStream()))
            {
                x = s.ReadToEnd();
            }

            List<Uri> list = new List<Uri>();

            // In the XML format, the URLs are conveniently in URL tags. We could use a full XmlReader / XPathQuery
            // to find them, or we can just grab them with a regular expression.
            Regex r = new Regex("<url>(.+?)</url>", RegexOptions.Singleline);

            for (Match m = r.Match(x); m.Success; m = m.NextMatch())
            {
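                // The URL text comes out of XML, so decode entities such as '&amp;' before parsing it.
                list.Add(new Uri(System.Web.HttpUtility.HtmlDecode(m.Groups[1].Value)));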
            }


            return list;
        }
    }

    class Program
    {   

        // Provide easy way to get data from clipboard.
        // On the down side, this pulls in winforms. :(
        // And requires that we're an STA thread.
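        static string GetDataFromClipboard()
        {
            System.Windows.Forms.IDataObject iData = System.Windows.Forms.Clipboard.GetDataObject();
            // The clipboard may be empty; return null in that case.
            if (iData == null) return null;
            // GetData also returns null if the clipboard holds no text.
            return (string)iData.GetData(System.Windows.Forms.DataFormats.Text);
        }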

        [STAThread]
        static void Main(string[] args)
        {            
            Console.WriteLine("Test for plagiarism.");

            // 1.) Get the data to check for plagiarism. This includes the author's content
            // and "author keywords" that check if a target URL is crediting the author.
            // It would be nice to pull this from an RSS feed or something. For now, we pull the entry
            // from the clipboard since that's an easy way to suck in a lot of data off a webpage.
            Console.WriteLine("Getting entry data from clipboard contents.");
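            string entry = GetDataFromClipboard();
            if (String.IsNullOrEmpty(entry))
            {
                Console.WriteLine("Clipboard does not contain any text. Copy the entry to the clipboard and rerun.");
                return;
            }

            // Print just the start of the entry as a sanity check.
            Console.WriteLine("Entry:{0}", (entry.Length < 50) ? (entry) : (entry.Substring(0, 45))); 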
            
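            // Set keywords to search for to determine if a page credits the author. Grab the keywords from the command line.
            // If no keywords are specified, default to my blog. :)
            // Note: args is an empty array (not null) when no arguments are passed, so check the length too.
            string [] keywords = (args != null && args.Length > 0) ? args : (new string[] { "mike stall", "jmstall" });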

            Console.Write("Keywords (if a target URL doesn't have any of these words, it may be plagiarizing):");
            foreach (string keyword in keywords)
            {
                Console.Write(" '{0}'", keyword);
            }
            Console.WriteLine(".");
            

            // Do the search. This is an intensive operation.
            Console.WriteLine("Doing Search. This could take a few minutes.");
            PlagiarismSearcher p = new PlagiarismSearcher();
            p.Search(entry, keywords);

            // Now print the results (perhaps HTML spew would be prettier ;) )
            int total = p.Excerpts.Count;
            Console.WriteLine("Search broke entry into {0} excerpt(s).", total);
                        
            ICollection<Uri> uris = p.Matches.Keys;
            if (uris.Count > 0)
            {
                int [] summaryCounts = new int [uris.Count];
                Uri[] summaryUrls = new Uri [uris.Count];
                int summaryIdx = 0;

                Console.WriteLine("Found {0} URL(s) that contain excerpts from the entry w/o ref back to the author.", uris.Count);

                // Print all the excerpts for the match (this may be too verbose).
                foreach (Uri url in uris)
                {
                    int cMatches = p.Matches[url].Count;
                    Console.WriteLine("Url {0} has {1}/{2} matches:", url, cMatches, total);

                    summaryCounts[summaryIdx] = cMatches;
                    summaryUrls[summaryIdx] = url;
                    summaryIdx++;
                    
                    // Print the matches.
                    Console.WriteLine("------------------------");
                    int j = 0;
                    foreach (string excerpt in p.Matches[url])
                    {
                        Console.Write("({0}/{1})", j, cMatches);
                        Console.WriteLine(excerpt.Replace("\r", "").Replace("\n", ""));
                        j++;
                    }
                    Console.WriteLine("------------------------");
                }

                // Print summary sorted by matches.
                Console.WriteLine("Summary: (sorted least to most matches)");
                Array.Sort(summaryCounts, summaryUrls); // ascending
                for (int j = 0; j < summaryCounts.Length; j++)
                {
                    Console.WriteLine("({0}/{1}) {2}%: {3}", summaryCounts[j], total, (int)(summaryCounts[j] * 100 / total), summaryUrls[j]);
                }

            } else{
                Console.WriteLine("No plagiarism matches found.");
            }
        }
    } // program



    // Helper to search for plagiarism of an author's article (online content that copies the article without referring back to it).
    // Use by first calling the Search() method, and then using the Matches property to get the search result.
    class PlagiarismSearcher
    {
        // Get total excerpts the search was broken down into.
        // This is not valid until after the Search() method returns.
        public IList<string> Excerpts
        {
            get
            {
                return m_excerpts;
            }
        }

        // Get a list of matches.
        // Each Key is a plagiarism-candidate URL.
        // Each Value is the list of excerpts that the URL contains from the original doc.
        // If this is empty, there are no plagiarism candidates. 
        // This is not valid until after the Search() method returns.
        public IDictionary<Uri, IList<string>> Matches
        {
            get
            {
                return m_matches;
            }
        }

        // Excerpts that the original author's entry is broken into.
        // We break into excerpts to catch if somebody plagiarises just a subsection of the author's entry.
        IList<string> m_excerpts;

        // Keep a map of each URL to the plagiarised chunks it contains. 
        // Then we can sort by # of chunks. Since the excerpts may be small, there's a chance
        // of false positives. 
        // key = URL that contains plagiarised chunk.
        // value = string list of chunks that the key URL contains.
        IDictionary<Uri, IList<string>> m_matches;

        // Keywords to search for in a target URL to determine if the URL refers to the author.
        // This will search both the raw HTML and the text with the markup removed.
        string[] m_keywords;

        // Given the source HTML and a set of keywords that refer back to the author, search
        // the internet for plagiarism. That means searching for other copies of the source where
        // the target does not ref back to the source (eg, have the author keywords).
        // The author keywords may include the author's name and URL. 
        // These results are not necessarily 100% accurate. This just produces a list of likely plagiarism candidates.
        // @todo - pass in search engine too?
        //
        // This will break the incoming html into excerpts (in case only a paragraph was plagiarised) and 
        // then search for matches against those excerpts.
        public void Search(string htmlFullEntry, string[] authorKeywords)
        {
            m_keywords = authorKeywords;

        
            // Given a text entry (such as a blog entry), searches key strings to see if anything else on the web refers to this content.
            // Once it finds referring URLs, scan those URLs for references back to the original article.
            // If no references are found, the URL may be plagiarising the original article.
        
            // Break entire entry up into sub strings to search.            
            m_excerpts = GetExcerptsFromEntry(htmlFullEntry);                        

            m_matches = new Dictionary<Uri, IList<string>>();

            foreach (string e in m_excerpts)
            {
                IList<Uri> results = MsnSearch.SearchString('\"' + e + '\"'); //put in quotes for exact search.

                Debug.WriteLine("Searching excerpt:" + e);
                Debug.WriteLine(String.Format("Found {0} references.", results.Count));
                foreach (Uri url in results)
                {
                    Debug.WriteLine("Checking URL: " + url);
                    bool fGood = DoesURLReferBackToAuthor(url);
                    if (!fGood)
                    {
                        // Record that this URL has a matching chunk.
                        if (!m_matches.ContainsKey(url))
                        {
                            m_matches[url] = new List<string>();
                        }
                        m_matches[url].Add(e);
                        Debug.WriteLine("!!! Plagiarism!!!! " + url);
                        Debug.WriteLine("copies this excerpt:");
                        Debug.WriteLine(e);
                        Debug.WriteLine("----------------");
                    }
                }
            }
        }

        // Given a full entry (which may contain HTML markup), generate a set of non-HTML string excerpts.
        // We can then query for the excerpts.
        // Our search algorithm owns producing the excerpts because the excerpt size may depend on the 
        // underlying search engines abilities.
        private IList<string> GetExcerptsFromEntry(string htmlFullEntry)
        {
            List<string> list = new List<string>();

            // simple heuristic: just return chunks of 'size' characters.
            // Needs to be split on word boundaries or MSN search gets confused.
            string s = RemoveAllMarkup(htmlFullEntry);

            // I also notice that if the size chunks are too big, then the search fails!?!?! That 
            // seems very counter intuitive. I did some manual tuning to arrive at the current size.
            int size = 100; // 40

            int idx = 0;
            while (idx + size < s.Length)
            {
                int pad = s.IndexOf(' ', idx + size) - idx;
                if (pad < 0) pad = size;
                list.Add(s.Substring(idx, pad));
                idx += pad;
            }
            // Only add the fragment if the list is empty. Else if the fragment is too small, it will match too much.
            if (list.Count == 0)
            {
                list.Add(s.Substring(idx));
            }

            return list;
        }

        // Do a check if the URL refers back to the author. This includes checking:
        // - does URL mention author's name?
        // - does URL include hyperlink back to author's website.
        // May need to also compensate for the html markup in the URL (so a raw string search may be naive).        
        // We can just do 2 searches on the content (1 on the raw content w/ markup, and 1 on the content w/o markup)
        bool DoesURLReferBackToAuthor(Uri url)
        {
            string htmlContent = GetWebPage(url);
            if (htmlContent == null) return true; // if link is broken, it's not plagiarising.
            htmlContent = htmlContent.ToLower();

            string textContent = RemoveAllMarkup(htmlContent);

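            // Case-insensitive search for keywords: the page content was lowercased above,
            // so lowercase each keyword too before comparing.
            foreach (string s in m_keywords)
            {
                string keyword = s.ToLower();
                if (htmlContent.Contains(keyword) || textContent.Contains(keyword))
                {
                    return true;
                }
            }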
            return false;
        }


        // Utility function to remove all HTML markup.
        // Input: string that may contain html.
        // returns: string stripped of all HTML markup.
        static string RemoveAllMarkup(string html)
        {
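            // Singleline lets '.' match newlines, so tags that span multiple lines are removed too.
            Regex r = new Regex("<.+?>", RegexOptions.Singleline);
            string s = r.Replace(html, "");
            // Now decode escape characters. eg: '&lt;' --> '<'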
            return System.Web.HttpUtility.HtmlDecode(s);
        }

        // Helper to get a WebPage as a string.
        // Returns null if web-page is not available.
        static string GetWebPage(Uri url)
        {
            try
            {
                WebRequest request = WebRequest.Create(url);
                // Dispose the response and reader so the connection is released promptly.
                using (WebResponse response = request.GetResponse())
                using (StreamReader s = new StreamReader(response.GetResponseStream()))
                {
                    return s.ReadToEnd();
                }
            }
            catch
            {
                return null;
            }

        }
    } // Plagiarism checker class
       

} // end namespace Web2