Removing duplicate screenshot images

10/22/2012

Readers of some of my other posts will have gathered by now that my day job involves work on a large test automation harness. Some may also know that my current project involves saving screenshots and their respective HTML during test case execution, and using those screenshots and data in various processes (like this one). So at the end of a test case run I now have saved a) a screenshot for every click action during the test case and b) an HTML file for every frame in that screen. We decided early on to save on every click so that we do not miss any screen changes. That was a sound decision, but it means we initially get lots of duplicate data, and most of the jobs that consume this data need the duplicates weeded out first.

The first stab at this is to remove the "exact" match duplicates, i.e. screens which are exactly the same, pixel for pixel. The only compromise is that I crop each screen before comparing, because we don't care if the screen "chrome" at the top and bottom has changed. For example, if the only change between two screenshots is the time at the bottom of the screen, we want to count them as matching screens. This process can be used or adapted for any situation where you want to select a subset of a set of images by matching a section of each image. For simplicity I wrote this code as a standalone assembly which takes two required parameters:

ImageDupes.exe "JpgFolder" "OpFolder"

Where:

JpgFolder: is the folder containing the screen data

OpFolder: is the output folder where the unique screens and their data will be saved.

Of course, as I'm using pixel lengths, the size of the cropped rectangle depends on the size of the screenshot, so I made the rectangle properties (x, y, width, height) an application setting:

    <setting name="CropRectangle" serializeAs="String">
        <value>0, 100, 1600, 1025</value>
    </setting>
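To make the meaning of the four numbers concrete, here is a minimal sketch of how such a setting string maps onto a System.Drawing.Rectangle and how one might check that it fits a given screenshot. The class name, the helper names and the example screenshot sizes (1600x1200 and 1600x1080) are assumptions made for illustration; they are not part of the original program.

```csharp
using System;
using System.Drawing;
using System.Linq;

public static class CropSettingDemo
{
    // Parses "x, y, width, height" into a Rectangle
    // (the same argument order as the Rectangle constructor).
    public static Rectangle ParseCropRectangle(string setting)
    {
        int[] p = setting.Split(',').Select(s => int.Parse(s.Trim())).ToArray();
        if (p.Length != 4)
            throw new FormatException("Expected four integers: x, y, width, height");
        return new Rectangle(p[0], p[1], p[2], p[3]);
    }

    // True when the crop area lies entirely inside a screenshot of the given size.
    public static bool FitsInside(Rectangle crop, Size screenshot)
    {
        return new Rectangle(Point.Empty, screenshot).Contains(crop);
    }

    public static void Main()
    {
        Rectangle crop = ParseCropRectangle("0, 100, 1600, 1025");
        Console.WriteLine(crop.Bottom);                             // 1125 (y + height)
        Console.WriteLine(FitsInside(crop, new Size(1600, 1200)));  // True
        Console.WriteLine(FitsInside(crop, new Size(1600, 1080)));  // False
    }
}
```

With the value shown above, the crop starts 100 pixels down and extends to y = 1125, so a screenshot has to be at least 1600 pixels wide and 1125 pixels tall for the rectangle to fit.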

The program logic is then:

1. For every picture file in the given folder

- Save the file location and a hash of the pixels for the relevant rectangle (used a custom class for this)

2. For the list of images and hash values

- Create a dictionary using the hash value as the key for each screen file location value

(Duplicate hashes overwrite each other as they are added, leaving me with a dictionary of unique screenshots.)

3. Then finally, for every entry in the dictionary copy the file to the output location.
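The de-duplication step can be sketched on its own, without any image handling. This is only an illustration of the "later entries overwrite earlier ones" behaviour: the class name, the tuple shape, the file names and the two-letter hash strings are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class DedupeDemo
{
    // Returns one file name per distinct hash; a later file with the
    // same hash replaces the earlier one.
    public static Dictionary<string, string> KeepUnique(
        IEnumerable<(string File, string Hash)> screens)
    {
        var unique = new Dictionary<string, string>();
        foreach (var (file, hash) in screens)
        {
            unique[hash] = file; // indexer assignment overwrites an existing key
        }
        return unique;
    }

    public static void Main()
    {
        var screens = new[]
        {
            (File: "step1.jpg", Hash: "AA"),
            (File: "step2.jpg", Hash: "BB"),
            (File: "step3.jpg", Hash: "AA"), // same cropped pixels as step1.jpg
        };

        // Two entries remain; the "AA" key now points at step3.jpg.
        foreach (var pair in KeepUnique(screens))
            Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
    }
}
```

Because the indexer overwrites, the last screenshot in each group of duplicates is the one that survives. If the first one is preferred, checking ContainsKey before assigning would keep the earliest instead.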

I crop each image using Bitmap.Clone(...):

    private static Image CropImage(Image img, Rectangle cropArea)
    {
        try
        {
            // Copy the source into a Bitmap so a sub-rectangle of it can be cloned
            Bitmap bmpImage = new Bitmap(img);
            Bitmap bmpCrop = bmpImage.Clone(cropArea, bmpImage.PixelFormat);
            return (Image)(bmpCrop);
        }
        catch (System.OutOfMemoryException)
        {
            // Bitmap.Clone throws OutOfMemoryException when cropArea falls outside the bitmap bounds
            Console.WriteLine("Memory Exception - check rectangle cropping size is correct for screenshot pixel size");
            return null;
        }
    }

The MD5 hash value is generated from the bytes of the cropped image using System.Security.Cryptography:

    private static string GenerateHashString(Image img)
    {
        // Serialize the cropped image to a byte array
        ImageConverter converter = new ImageConverter();
        byte[] rawImageData = converter.ConvertTo(img, typeof(byte[])) as byte[];

        // Hash the bytes and format the digest as a hyphen-separated hex string
        using (MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider())
        {
            byte[] hashbytes = md5.ComputeHash(rawImageData);
            return BitConverter.ToString(hashbytes);
        }
    }
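To show what the hash strings look like and why they work as dictionary keys, here is a small sketch that hashes plain byte arrays rather than images. The class and method names are hypothetical, and it uses MD5.Create() instead of constructing MD5CryptoServiceProvider directly; the hashing and formatting calls are otherwise the same ones used in the post.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashDemo
{
    // Hashes a byte array and formats the digest as "XX-XX-...".
    public static string HashBytes(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(data));
        }
    }

    public static void Main()
    {
        byte[] a = { 10, 20, 30, 40 };
        byte[] b = { 10, 20, 30, 40 }; // same content as a
        byte[] c = { 10, 20, 30, 41 }; // differs in the last byte

        Console.WriteLine(HashBytes(a) == HashBytes(b)); // True
        Console.WriteLine(HashBytes(a) == HashBytes(c)); // False

        // MD5 of the ASCII bytes of "abc":
        // 90-01-50-98-3C-D2-4F-B0-D6-96-3F-7D-28-E1-7F-72
        Console.WriteLine(HashBytes(Encoding.ASCII.GetBytes("abc")));
    }
}
```

Identical input bytes always give the same string, and a single differing byte gives a different one, which is what lets the hash stand in for an exact pixel-for-pixel comparison.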

The program's two main methods are "CutStatusBars()", which reads the crop rectangle and builds the list of screenshot files with their associated hash values (for the cropped section), and "RemoveDuplicates()", which simply builds the dictionary and copies the unique screens:

    private static void CutStatusBars()
    {
        hashScreens = new List<Screen>();
        try
        {
            DirectoryInfo di = new DirectoryInfo(JpgFolder);

            // the crop rectangle setting is "x, y, width, height"
            int[] cropParams = Properties.Dupes.Default.CropRectangle.Split(',').Select(p => int.Parse(p.Trim())).ToArray();
            Rectangle betweenBars = new Rectangle(cropParams[0], cropParams[1], cropParams[2], cropParams[3]);
            if (betweenBars.Size.Height == 0 || betweenBars.Size.Width == 0)
            {
                Console.WriteLine("Invalid cropping rectangle parameters specified in settings file");
                return;
            }

            foreach (FileInfo jpgFile in di.GetFiles("*.jpg", SearchOption.TopDirectoryOnly))
            {
                Console.WriteLine("Cropping {0}", jpgFile.Name);
                // dispose both images so the source file is not left locked
                using (Image src = Image.FromFile(jpgFile.FullName))
                using (Image trg = CropImage(src, betweenBars))
                {
                    // CropImage returns null when the rectangle does not fit this screenshot
                    if (trg == null) continue;

                    // make record using source image
                    Screen hashScreen = new Screen(jpgFile.Name, jpgFile.DirectoryName);
                    // but generate hash using cropped image - so we lose the chrome at top and bottom of screen
                    hashScreen.hashString = GenerateHashString(trg);
                    hashScreens.Add(hashScreen);
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    private static void RemoveDuplicates()
    {
        try
        {
            // use a dictionary to filter out duplicates - multiple duplicate hash string keys will overwrite each other
            Dictionary<string, Screen> uniqueScreens = new Dictionary<string, Screen>();
            foreach (Screen scr in hashScreens)
            {
                uniqueScreens[scr.hashString] = scr;
            }

            // now copy the unique files and their related *.html
            Console.WriteLine("Saving out {0} unique screens", uniqueScreens.Count);
            foreach (KeyValuePair<string, Screen> hashAndScreen in uniqueScreens)
            {
                Screen scr = hashAndScreen.Value;
                // "name*.*" picks up the screenshot plus every file that shares its name prefix
                string[] files = System.IO.Directory.GetFiles(scr.sourceFolder, scr.name.Replace(".jpg", "*.*"));
                Console.WriteLine("Copying {0} related files for {1}", files.Length, scr.name);
                foreach (string s in files)
                {
                    Console.WriteLine("{0}...", s);
                    string fname = System.IO.Path.GetFileName(s);
                    string destFullName = System.IO.Path.Combine(OpFolder, fname);
                    System.IO.File.Copy(s, destFullName, true);
                }
            }
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }
    }
