SYSK 382: File Comparison


Last weekend I wrote a utility that does file synchronization (yes, I know there are many such tools), but one where I can define in advance what should happen if there are discrepancies. To do that, I needed a class that would take an array of file names from one folder and another array of file names from another folder, and return the files that are found in both folders and are identical, e.g.:

 

var commonFiles = files1.Intersect(files2, new FileComparer(folder1, folder2)).ToList();
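
Intersect with a custom IEqualityComparer<string> first groups items by GetHashCode and only calls Equals on items whose hash codes match, so a comparer used this way has to give the "same" file in both folders the same hash code. Here is a minimal sketch of that mechanism. RelativePathComparer and IntersectDemo are hypothetical names made up for illustration; the comparer only looks at relative paths and does no file I/O, so it is not the class from this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer, for illustration only: two paths are "equal"
// when they are the same once the root folder prefix is removed.
public class RelativePathComparer : IEqualityComparer<string>
{
    private readonly string _root1;
    private readonly string _root2;

    public RelativePathComparer(string root1, string root2)
    {
        _root1 = root1;
        _root2 = root2;
    }

    // Strips whichever root the path starts with; unknown paths are left as-is.
    private string Relative(string path)
    {
        if (path.StartsWith(_root1, StringComparison.OrdinalIgnoreCase))
            return path.Substring(_root1.Length);
        if (path.StartsWith(_root2, StringComparison.OrdinalIgnoreCase))
            return path.Substring(_root2.Length);
        return path;
    }

    public bool Equals(string x, string y)
    {
        return string.Equals(Relative(x), Relative(y), StringComparison.OrdinalIgnoreCase);
    }

    // Must agree with Equals: equal items have to produce equal hash codes,
    // otherwise Intersect never gets as far as calling Equals.
    public int GetHashCode(string obj)
    {
        return StringComparer.OrdinalIgnoreCase.GetHashCode(Relative(obj));
    }
}

public static class IntersectDemo
{
    public static List<string> Run()
    {
        var files1 = new[] { @"C:\a\one.txt", @"C:\a\two.txt" };
        var files2 = new[] { @"C:\b\two.txt", @"C:\b\three.txt" };

        // Intersect yields the matching elements from the first sequence.
        return files1.Intersect(files2, new RelativePathComparer(@"C:\a", @"C:\b")).ToList();
    }
}
```

Calling IntersectDemo.Run() returns a list with the single entry C:\a\two.txt, because Intersect yields matching elements from the first sequence.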

 

Here is the code of the FileComparer class:

 

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

 

namespace FileSync

{

    public class FileComparer : IEqualityComparer<string>

    {

        public enum ComparisonOptions

        {

            FileSizeAndDate = 0, // default

            Content = 1,

        }

 

        private string _folder1;

        private string _folder2;

        private ComparisonOptions _compOption;

 

        public FileComparer(string folder1, string folder2)

            : this(folder1, folder2, ComparisonOptions.FileSizeAndDate)

        {

 

        }

 

        public FileComparer(string folder1, string folder2, ComparisonOptions compOption)

        {

            _folder1 = NormalizedFolder(folder1);

            _folder2 = NormalizedFolder(folder2);

            _compOption = compOption;

        }

 

        public bool Equals(string x, string y)

        {

            bool result = false;

 

            System.IO.FileInfo fi1 = new System.IO.FileInfo(x);

            if (fi1.Exists)

            {

                System.IO.FileInfo fi2 = new System.IO.FileInfo(y);

                if (fi2.Name == fi1.Name && fi2.Exists && fi1.Length == fi2.Length) // note: contents can’t be the same if the name or length is different

                {

                    if (_compOption == ComparisonOptions.FileSizeAndDate)

                    {

                        result = (fi1.LastWriteTimeUtc == fi2.LastWriteTimeUtc);

                    }

                    else

                    {

                        // TODO: for large files, read and compare in chunks

                        byte[] f1 = System.IO.File.ReadAllBytes(x);

                        byte[] f2 = System.IO.File.ReadAllBytes(y);

 

                        // SequenceEqual (System.Linq) compares the bytes; byte[].Equals only compares references
                        result = f1.SequenceEqual(f2);

                    }

                }

            }

 

           

            return result;

        }

 

        public int GetHashCode(string obj)

        {

            if (obj.StartsWith(_folder1))

                return obj.Substring(_folder1.Length).GetHashCode();

            else if (obj.StartsWith(_folder2))

                return obj.Substring(_folder2.Length).GetHashCode();

            else

                return 0;

        }

 

        public static string NormalizedFolder(string folder)

        {

            string result = String.Empty;

            if (!string.IsNullOrWhiteSpace(folder))

            {

                result = folder.Trim();

                if (result.EndsWith(@"\"))

                    result = result.Substring(0, result.Length - 1);

            }

            return result;

        }

    }

}
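
The TODO in the Content branch mentions comparing large files in chunks instead of loading both files into memory with ReadAllBytes. One way this might look is sketched below. FileContentComparer, ContentEquals and the 64 KB default buffer are assumptions made for illustration, not part of the original utility.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the FileComparer class above.
public static class FileContentComparer
{
    // Returns true when both files have the same length and the same bytes.
    // Only two buffers of bufferSize bytes are held in memory at a time.
    public static bool ContentEquals(string path1, string path2, int bufferSize = 64 * 1024)
    {
        using (var s1 = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var s2 = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            if (s1.Length != s2.Length)
                return false;

            var b1 = new byte[bufferSize];
            var b2 = new byte[bufferSize];

            while (true)
            {
                int n1 = FillBuffer(s1, b1);
                int n2 = FillBuffer(s2, b2);

                if (n1 != n2) return false;   // one file ended early
                if (n1 == 0) return true;     // both ended, no difference found

                for (int i = 0; i < n1; i++)
                {
                    if (b1[i] != b2[i]) return false;
                }
            }
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the end of the stream is reached.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```

With a helper like this, the Content branch could call FileContentComparer.ContentEquals(x, y) and stop at the first differing chunk, which matters most when large files differ early on.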

 

 

Comments (4)

  1. Gareth says:

    Does this work?

    result = f1.Equals(f2);

    Is this not done via a reference compare? Hence it will always be false?

    Sorry didn't get around to testing it myself…

  2. irenak says:

    [reply to Gareth]

    No, that wouldn't work because, as you correctly point out, it will use the default comparison algorithm, and will not use the FileComparer class above.  However, this should do what you want:

    System.IO.FileInfo fi1 = new System.IO.FileInfo(@"file1");

    System.IO.FileInfo fi2 = new System.IO.FileInfo(@"file2");

    bool r = new FileComparer(fi1.Directory.FullName, fi2.Directory.FullName).Equals(fi1.FullName, fi2.FullName);

  3. Julian Kuiters - http://www.julian-kuiters.id.au says:

    Use a hashing program over the file contents to check if they are the same.

    You can do this in a smart way for large files (100 MB+) to increase speed by comparing a small number of chunks, e.g. the first 1 MB, the last 1 MB, and some chunks in the middle (5 chunks over a large file will usually do).

    This won't tell you if bits in the chunks you don't compare are different, but it will quickly tell you if the chunks you check ARE different.

    Then, for files whose chunks are the same, you can follow up by running the full file through a hashing program to find out if anything else is different.

    This can give you a massive speed increase over checking the files byte by byte.

    SHA-1 is pretty quick for file hashing. Check out the other versions of the SHA algorithms; they are faster than the MD algorithms as far as I've seen.

  4. Julian Kuiters - http://www.julian-kuiters.id.au says:

    .. ah that last comment relates to these lines:

                           // TODO: for large files, read and compare in chunks

                           byte[] f1 = System.IO.File.ReadAllBytes(x);

                           byte[] f2 = System.IO.File.ReadAllBytes(y);

                           result = f1.Equals(f2);

    which just won't work for comparing large and very large files (as your comment notes)
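
The two-stage approach suggested in the comments (sample a few chunks first, then hash the whole file only if the samples agree) might be sketched roughly as follows. FileHashComparer, its method names, the 1 MB chunk size and the 5 samples are illustrative assumptions, and SHA-256 is used here in place of the SHA-1 mentioned in the comment. Note that hashing both files still reads every byte of both, so the hash step mainly pays off when the hashes can be stored and reused across runs.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper; names and defaults are assumptions for illustration.
public static class FileHashComparer
{
    // Cheap pre-check: compares a few evenly spaced chunks.
    // false means the files definitely differ; true means "not decided yet".
    public static bool SampledChunksMatch(string path1, string path2,
                                          int chunkSize = 1024 * 1024, int samples = 5)
    {
        using (var s1 = File.OpenRead(path1))
        using (var s2 = File.OpenRead(path2))
        {
            if (s1.Length != s2.Length) return false;

            long maxOffset = Math.Max(0L, s1.Length - chunkSize);
            var b1 = new byte[chunkSize];
            var b2 = new byte[chunkSize];

            for (int i = 0; i < samples; i++)
            {
                // Offsets run from the start of the file to the last full chunk.
                long offset = samples > 1 ? maxOffset * i / (samples - 1) : 0;
                s1.Position = offset;
                s2.Position = offset;

                int n1 = ReadFully(s1, b1);
                int n2 = ReadFully(s2, b2);
                if (n1 != n2) return false;

                for (int j = 0; j < n1; j++)
                {
                    if (b1[j] != b2[j]) return false;
                }
            }
            return true;
        }
    }

    // Full check: hashes each file as a stream, without loading it into memory.
    public static bool HashEquals(string path1, string path2)
    {
        return ComputeHash(path1).SequenceEqual(ComputeHash(path2));
    }

    private static byte[] ComputeHash(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return sha.ComputeHash(stream);
        }
    }

    // Reads until the buffer is full or the stream ends.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```

A caller would typically use the two together, e.g. SampledChunksMatch(a, b) && HashEquals(a, b), so that files that differ in a sampled chunk are rejected without a full read.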