Sample code for Plagiarism Searcher tool

Here's my sample code for the tool to catch blog plagiarism that I described earlier. In retrospect, it was pretty easy to write (under 400 lines!), and edit-and-continue in C# plus interceptable exceptions made development a lot faster.

The tool currently only works on a single entry, but it could be expanded to iterate over all entries in an entire blog. It's a simple C# console app which:
1. Takes in a source blog entry (via the clipboard) and a set of "author keywords" (via the command line). The keywords are things like the author's name or homepage that would indicate somebody is crediting the author. Ideally, the tool would read in all the entries from some blog RSS feed instead of doing just 1 entry at a time.
2. Breaks the entry up into excerpts (since it's unlikely somebody would plagiarize the entire entry). This also finds matches if they've changed a few characters; for example, perhaps their editing tool automatically changed something, such as transforming Unicode characters to ANSI or transforming emoticons.
3. Programmatically uses MSN Search to search for other URLs that include those excerpts.
4. For each resulting URL, scans that page for any indication of crediting the original author. One simple way is to search the page for a set of "author keywords" provided in step 1. If no such keywords are found, then assume the page is not crediting the author. 
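Step #2 above can be sketched as a simple word-boundary chunker. This is a minimal sketch of the idea, not the tool's exact code (the class name, method name, and chunk size here are my own illustrative choices):

```csharp
using System;
using System.Collections.Generic;

class ExcerptSketch
{
    // Break an entry into roughly fixed-size excerpts, extending each one
    // out to the next word boundary so the search engine sees whole words.
    public static List<string> SplitIntoExcerpts(string text, int size)
    {
        var list = new List<string>();
        int idx = 0;
        while (idx + size < text.Length)
        {
            int end = text.IndexOf(' ', idx + size); // extend to a word boundary
            if (end < 0) end = text.Length;          // no more spaces: take the rest
            list.Add(text.Substring(idx, end - idx));
            idx = end + 1; // skip the separating space
        }
        if (list.Count == 0)
            list.Add(text); // entry shorter than one excerpt: search it whole
        return list;
    }

    static void Main()
    {
        foreach (string e in SplitIntoExcerpts("the quick brown fox jumps over the lazy dog", 10))
            Console.WriteLine("'" + e + "'");
    }
}
```

Because each excerpt ends on a word boundary, a quoted search for it won't be tripped up by half-words; a short trailing fragment is simply dropped, since tiny excerpts match far too much.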

Offloading even more work to the Search Engine.
After I wrote the tool, I realized that step #4 (filtering out urls that credit the author) can be folded into the search step in #3 (finding urls that copy content from the author) by using a sufficiently intelligent query with the NOT and AND keywords. Consider a query like: (NOT link:<author homepage>) AND (NOT <author name>) AND "<some excerpt>". This has the added advantage that steps #3 and #4 use the same copy of the page, so step #3 can't match against a cached version of the page while step #4 looks at a totally different version.

As an example, this MSN search query: "So a debugger would skip over the region between the" looks for all pages that contain that excerpt from this blog post of mine. I tried this MSN search query:
(NOT link:<my blog URL>) AND (NOT "Mike Stall") AND "So a debugger would skip over the region between the", which filters out anybody that mentions my name or links back to me.

Clearly, search engines with useful query keywords can become extremely powerful. This reminds me of SQL queries, where a more intelligent query can offload client-side work onto the server.
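As a sketch, the combined query from the template above could be assembled like this (the class, method, and example values are hypothetical, not part of the tool below):

```csharp
using System;

class QuerySketch
{
    // Build the combined "find copies, minus pages that credit the author" query:
    // (NOT link:<author homepage>) AND (NOT <author name>) AND "<some excerpt>"
    public static string BuildQuery(string authorHomepage, string authorName, string excerpt)
    {
        return String.Format("(NOT link:{0}) AND (NOT \"{1}\") AND \"{2}\"",
            authorHomepage, authorName, excerpt);
    }

    static void Main()
    {
        // Hypothetical example values.
        Console.WriteLine(BuildQuery("example.com/blog", "Mike Stall",
            "So a debugger would skip over the region between the"));
    }
}
```

One such query per excerpt then replaces both the candidate search and the crediting check.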

I ran the tool with the full contents of my blog entry on 0xFeeFee Sequence points. The keywords were my name ('Mike Stall') and part of my blog URL ("jmstall"). Here's the output. It's pretty verbose because it prints all the excerpts that match, but the interesting part is the summary at the end.

[update] The tool originally found 1 candidate (with a 75% match) - but the candidate then went and added a reference back to me. I reran the tool and now that candidate doesn't show up. That's exactly how it should work! The original output is here. The tool now prints:

Test for plagiarism.
Getting entry data from clipboard contents.
Entry:#line hidden and 0xFeeFee sequence points
Keywords (if a target URL doesn't have any of these words, it may be plagiarizing): 'mike stall' 'jmstall'.
Doing Search. This could take a few minutes.
Search broke entry into 37 excerpt(s).
Found 1 URL(s) that contain excerpts from the entry w/o ref back to the author.
Url has 1/37 matches:
(0/1) use the `#line hidden' directive to mark a region of code as not exposed to the debugger.Eg:
Summary: (sorted least to most matches)
(1/37) 2%:

[update] It found 1 candidate. It turns out that this is a false positive: the search engine cache found the copied content, but then the page had completely changed since then (it's an "Under Construction page" now).
[update] Just to be clear, Tagcloud is not plagiarising. Rather, this is a false positive and shows a shortcoming with the tool. The candidates were found using MSN search, which uses a cached copy of the web pages. However, the author-crediting check uses the live copy of the webpage. So if the page has changed (perhaps it's a blog and the original entry is no longer on the homepage; or perhaps the site is down), then you can get false positives. This could be fixed by having both the candidate search and the crediting check use the same copy of the webpage. The easiest way to do this is to follow the suggestion above and have the search query use the AND, NOT, LINK keywords to do both candidate search and crediting.

Here's the code. Note you need to compile it with references to WinForms and System.Web, like so:
csc t.cs /debug+ /r:System.Web.dll /r:System.Windows.Forms.dll

// Test harness to use MSN search to search for plagiarism.
// Author: Mike Stall.
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
using System.Diagnostics;
// Include Windows.Forms because we use the clipboard.

namespace Web2
{
// Class to get Search results from MSN Search.
// We want to get a list of Urls from a given search string.
// It currently does a HTTP query with an embedded query string to retrieve the result as XML.
// It then extracts the URLS from the XML (which are conveniently in <url> tags).
// If MSN Search ever comes out with a real API for internet searches, we should use that instead of this.
// See for details on this class.
class MsnSearch
{
// Helper to get the Search result for an exact string.
// This will escape the string results. This is very fragile and reverse engineered based off my
// observations about how MSN Search encodes searches into the query string.
// This also does not account for search keywords (like "AND").
// If there's a spec for the query string, we should find and use it.
static Uri GetMSNSearchURL(string input)
{
    // The 'FORMAT=XML' part requests the results back as XML, which is easier to parse than HTML.
    StringBuilder sb = new StringBuilder(@"");
    sb.Append(HttpUtility.UrlEncode(input)); // requires ref to System.Web.dll
    return new System.Uri(sb.ToString());
}

// Return an list of URLs for the search results against an string input.
// This currently does not recognize any search keywords.
// For an exact search, place the input string in quotes.
// Note that these searches are not exact. For example, the search engine may have used a cached
// webpage and the URL may have changed since then. Or the search engine may take some
// liberties about what constitutes an "exact" match.
public static IList<Uri> SearchString(string input)
{
    Uri url = GetMSNSearchURL(input);
    WebRequest request = HttpWebRequest.Create(url);
    WebResponse response = request.GetResponse();

    Stream raw = response.GetResponseStream();
    StreamReader s = new StreamReader(raw);
    string x = s.ReadToEnd();

    List<Uri> list = new List<Uri>();

    // In the XML format, the URLs are conveniently in <url> tags. We could use a full XmlReader / XPathQuery
    // to find them, or we can just grab them with a regular expression.
    Regex r = new Regex("<url>(.+?)</url>", RegexOptions.Singleline);

    for (Match m = r.Match(x); m.Success; m = m.NextMatch())
        list.Add(new Uri(m.Groups[1].Value));

    return list;
}
} // end class MsnSearch

class Program
{

// Provide easy way to get data from clipboard.
// On the down side, this pulls in winforms. 🙁
// And requires that we're an STA thread.
static string GetDataFromClipboard()
{
    IDataObject iData = System.Windows.Forms.Clipboard.GetDataObject();
    string[] f = iData.GetFormats();
    return (string)iData.GetData(System.Windows.Forms.DataFormats.Text);
}

[STAThread] // clipboard access requires an STA thread.
static void Main(string[] args)
{
    Console.WriteLine("Test for plagiarism.");

// 1.) Get the data to check for plagiarism. This includes the author's content
// and "author keywords" that check if a target URL is crediting the author.
// It would be nice to pull this from an RSS feed or something. For now, we pull the entry
// from the clipboard since that's an easy way to suck in a lot of data off a webpage.
Console.WriteLine("Getting entry data from clipboard contents.");
string entry = GetDataFromClipboard();

Console.WriteLine("Entry:{0}", (entry.Length < 50) ? (entry) : (entry.Substring(0, 45)));

// Set keywords to search to determine if we describe the author. Grab the keywords from the command line.
// If no keywords specified, default to my blog. 🙂
string [] keywords = (args.Length > 0) ? args : (new string[] { "mike stall", "jmstall" });

Console.Write("Keywords (if a target URL doesn't have any of these words, it may be plagiarizing):");
foreach (string keyword in keywords)
    Console.Write(" '{0}'", keyword);
Console.WriteLine(".");

// Do the search. This is an intensive operation.
Console.WriteLine("Doing Search. This could take a few minutes.");
PlagiarismSearcher p = new PlagiarismSearcher();
p.Search(entry, keywords);

// Now print the results (perhaps HTML spew would be prettier 😉 )
int total = p.Excerpts.Count;
Console.WriteLine("Search broke entry into {0} excerpt(s).", total);

ICollection<Uri> uris = p.Matches.Keys;
if (uris.Count > 0)
{
    int [] summaryCounts = new int [uris.Count];
    Uri[] summaryUrls = new Uri [uris.Count];
    int summaryIdx = 0;

    Console.WriteLine("Found {0} URL(s) that contain excerpts from the entry w/o ref back to the author.", uris.Count);

// Print all the excerpts for the match (this may be too verbose).
    foreach (Uri url in uris)
    {
        int cMatches = p.Matches[url].Count;
        Console.WriteLine("Url {0} has {1}/{2} matches:", url, cMatches, total);

        summaryCounts[summaryIdx] = cMatches;
        summaryUrls[summaryIdx] = url;
        summaryIdx++;

        // Print the matches.
        int j = 0;
        foreach (string excerpt in p.Matches[url])
        {
            Console.Write("({0}/{1})", j, cMatches);
            Console.WriteLine(excerpt.Replace("\r", "").Replace("\n", ""));
            j++;
        }
    }

    // Print summary sorted by matches.
    Console.WriteLine("Summary: (sorted least to most matches)");
    Array.Sort(summaryCounts, summaryUrls); // ascending
    for (int j = 0; j < summaryCounts.Length; j++)
        Console.WriteLine("({0}/{1}) {2}%: {3}", summaryCounts[j], total, (int)(summaryCounts[j] * 100 / total), summaryUrls[j]);
}
else
{
    Console.WriteLine("No plagiarism matches found.");
}
} // end Main
} // end class Program

// Helper to search for plagiarism of an author's article (online content that copies the article without refering back to it).
// Use by first calling the Search() method; and then using Matches property to get the search result.
class PlagiarismSearcher
{
// Get total excerpts the search was broken down into.
// This is not valid until after the Search() method returns.
public IList<string> Excerpts
{
    get { return m_excerpts; }
}

// Get a list of matches.
// Each Key is a plagiarism-candidate URL.
// Each Value is a the list of excerpts that the URL contains from the original doc.
// If this is empty, there are no plagiarism candidates.
// This is not valid until after the Search() method returns.
public IDictionary<Uri, IList<string>> Matches
{
    get { return m_matches; }
}

// Excerpts that the original author's entry is broken into.
// We break into excerpts to catch if somebody plagiarises just a subsection of the author.
IList<string> m_excerpts;

// Keep map of each URL + number of times it has a plagiaring chunk.
// Then we can sort by # of chunks. Since the excerpts may be small, there's a chance
// of false positive.
// key = URL that contains plagiarised chunk.
// value = string list of chunks that the key URL contains.
IDictionary<Uri, IList<string>> m_matches;

// Keywords to search for in a target URL to determine if the URL refers to the author.
// This will search both the raw HTML and the text with the markup removed.
string[] m_keywords;

// Given the source HTML and a set of keywords that refer back to the author, search
// the internet for plagiarism. That means searching for other copies of the source where
// the target does not ref back to the source (eg, have the author keywords).
// The author keywords may include the author's name and URL.
// These results are not necessarily 100% accurate. This just produces a list of likely plagiarism candidates.
// @todo - pass in search engine too?
// This will break the incoming html into excerpts (in case only a paragraph was plagiarised) and
// then search for matches against those excerpts.
public void Search(string htmlFullEntry, string[] authorKeywords)
{
    m_keywords = authorKeywords;

// Given an text entry (such as a blog entry), searches key strings to see if anything else on the web refers to this content.
// Once it finds refering URLS, scan those URLs for references back to original article.
// If no refences are found, the URL may be plagiarising the original article.

// Break entire entry up into sub strings to search.
m_excerpts = GetExcerptsFromEntry(htmlFullEntry);

m_matches = new Dictionary<Uri, IList<string>>();

foreach (string e in m_excerpts)
{
    IList<Uri> results = MsnSearch.SearchString('\"' + e + '\"'); // put in quotes for exact search.

    Debug.WriteLine("Searching excerpt:" + e);
    Debug.WriteLine(String.Format("Found {0} references.", results.Count));
    foreach (Uri url in results)
    {
        Debug.WriteLine("Checking URL: " + url);
        bool fGood = DoesURLReferBackToAuthor(url);
        if (!fGood)
        {
            // Record that this URL has a matching chunk.
            if (!m_matches.ContainsKey(url))
                m_matches[url] = new List<string>();
            m_matches[url].Add(e);
            Debug.WriteLine("!!! Plagiarism!!!! " + url);
            Debug.WriteLine("copies this excerpt:");
            Debug.WriteLine(e);
        }
    }
}
} // end Search

// Given a full entry (which may contain HTML markup), generate a set of non-HTML string excerpts.
// We can then query for the excerpts.
// Our search algorithm owns producing the excerpts because the excerpt size may depend on the
// underlying search engines abilities.
private IList<string> GetExcerptsFromEntry(string htmlFullEntry)
{
    List<string> list = new List<string>();

    // Simple heuristic: just return chunks of 'size' characters.
    // Needs to be split on word boundaries or MSN search gets confused.
    string s = RemoveAllMarkup(htmlFullEntry);

    // I also noticed that if the chunks are too big, then the search fails!?!? That
    // seems very counterintuitive. I did some manual tuning to arrive at the current size.
    int size = 100; // 40

    int idx = 0;
    while (idx + size < s.Length)
    {
        int pad = s.IndexOf(' ', idx + size) - idx;
        if (pad < 0) pad = size;
        list.Add(s.Substring(idx, pad));
        idx += pad;
    }
    // Only add the whole fragment if the list is empty. Else if the fragment is too small, it will match too much.
    if (list.Count == 0)
        list.Add(s);

    return list;
}

// Do a check if the URL refers back to the author. This includes checking:
// - does URL mention author's name?
// - does URL include hyperlink back to author's website.
// May need to also compensate for the html markup in the URL (so a raw string search may be naive).
// We can just do 2 searches on the content (1 on the raw content w/ markup, and 1 on the content w/o markup)
bool DoesURLReferBackToAuthor(Uri url)
{
    string htmlContent = GetWebPage(url);
    if (htmlContent == null) return true; // if link is broken, it's not plagiarising.
    htmlContent = htmlContent.ToLower();

    string textContent = RemoveAllMarkup(htmlContent);

    // Case-insensitive search for keywords (the content is lowercased; keywords are assumed lowercase).
    foreach (string s in m_keywords)
    {
        if (htmlContent.Contains(s))
            return true;
        if (textContent.Contains(s))
            return true;
    }
    return false;
}

// Utility function to remove all HTML markup.
// Input: string that may contain html.
// returns: string stripped of all HTML markup.
static string RemoveAllMarkup(string html)
{
    Regex r = new Regex("<.+?>");
    string s = r.Replace(html, "");
    // Now decode escape characters. Eg: '&lt;' --> '<'
    return System.Web.HttpUtility.HtmlDecode(s);
}

// Helper to get a WebPage as a string.
// Returns null if web-page is not available.
static string GetWebPage(Uri url)
{
    try
    {
        WebRequest request = HttpWebRequest.Create(url);
        WebResponse response = request.GetResponse();

        Stream raw = response.GetResponseStream();
        StreamReader s = new StreamReader(raw);
        string x = s.ReadToEnd();
        return x;
    }
    catch (WebException)
    {
        return null;
    }
}

} // Plagiarism checker class

} // end namespace Web2


Comments (7)

  1. Hi Mike,

    Regarding your tool's findings for the url above:

    If you take a look at the original post, you will see a white image (a box, representing the end of a quote). For some reason that I have only just realised, blogger doesn’t actually provide a link to the original, but instead, it creates an image with an embedded reference to

    Unfortunately I didn’t realise this until now when I was forced to open the HTML. I generally post using the blogger web client, and then leave it as that without checking the resultant view in a browser.

    So perhaps either a) blogger needs to fix their stuff, or b) you should check for a reference to that url as well in your app…(not that I know anything about what aggbug.aspx does…)

    Having said this, I will no longer rely on the blogger web client, and will never post with it again. As a matter of fact, I will probably start hosting my own very soon. I have fixed all references so that they are now direct links to your original.

    I apologise for not checking my posts more thoroughly… if you look at the post you will also see that I ended my part of the post with the colon:

    I did not try and take credit for this, it was made clear that it wasn't my post.

    I just wish blogger had put a proper link in… gah.


  2. Matthew – No problem! I’m glad you found the content useful, and I had a lot of fun writing the searcher tool. 🙂

  3. Michael,

    Once you have extracted the "raw" data from MSN, you can apply this ( to get a better idea of the degree of plagiarism involved.

  4. I reran the tool and Matthew’s original page no longer shows up (since he added the credits). That’s actually a great full-circle demo. I’ve updated the entry to reflect the new search.

  5. Grant says:

    Hi Mike,

    I was interested to see why your tool picked up the URL for the tag "Readify"

    Tagcloud shows extracts of blogs for particular tags, not unlike

    Seeing as the site is down at the moment, this is an example of what your scanner would have seen (although it’s been updated since)

    So your scanner would’ve seen the tagcloud of Readify, and of course Matthew’s post.

    It looks as though the summary strips the HTML so that’s why there’s no link back to your post.

    Just to be safe, I’ve removed the Readify tagcloud so that it doesn’t get accused of plagiarising again.


  6. I apologize; I didn’t intend to accuse tagcloud of plagiarising – I’ll update my original post to be very clear about that.

    I explicitly called out tagcloud as an example of a false-positive:

    "It turns out the 2% one is a false positive: the search engine cache found the copied content, but then the page had completely changed since then (it’s an "Under Construction page" now)."

    The problem is that the tool uses MSN Search (which uses a cache) to find candidates; but then live page access to look for credits to the author. It will search both the raw HTML and the stripped HTML. So having credits inside of tags (such as a <A>) will show up.

    The problem is if the cached page has a credit, but the site is down (or replaced with a "I’m under construction" notice), the tool won’t see it and report a false positive.

    This would also be a problem for blog homepages where the entry that credits the author is no longer on the homepage.

    This is a shortcoming of the tool. The tool needs to find candidates and check for crediting using the same copy of the HTML.
