Duplicate Files


Need a way to check whether two files are the same?  Calculate a hash of each file and compare the results.  Here is one way to do it:


 


## Calculates the hash of a file and returns it as a string.
function Get-MD5([System.IO.FileInfo] $file = $(throw 'Usage: Get-MD5 [System.IO.FileInfo]'))
{
  $stream = $null;
  $cryptoServiceProvider = [System.Security.Cryptography.MD5CryptoServiceProvider];
  $hashAlgorithm = new-object $cryptoServiceProvider
  $stream = $file.OpenRead();
  $hashByteArray = $hashAlgorithm.ComputeHash($stream);
  $stream.Close();

  ## We have to be sure that we close the file stream if any exceptions are thrown.
  trap
  {
    if ($stream -ne $null)
    {
      $stream.Close();
    }
    break;
  }

  return [string]$hashByteArray;
}


 


I think about the only new thing here is the trap statement.  It'll get called if any exception is thrown; otherwise it's just ignored.  Hopefully nothing will go wrong with the function, but if anything does, I want to be sure to close any open streams.  Anyway, keep this function around; we'll use it along with AddNotes and group-object to write a simple script that can search directories and tell us which files are duplicates.  Now… an example of this function in use:


 


MSH>"foo" > foo.txt
MSH>"bar" > bar.txt
MSH>"foo" > AlternateFoo.txt
MSH>dir *.txt | foreach { get-md5 $_ }
33 69 151 28 248 32 88 177 8 34 154 58 46 59 255 53
54 122 136 147 125 209 249 229 12 105 236 19 140 5 107 169
33 69 151 28 248 32 88 177 8 34 154 58 46 59 255 53
MSH>


 


 


Note that two of the files have the same hash, as expected, since they have the same content.  Of course, it is possible for two files with different content to have the same hash, so if you are really paranoid you might want to check something else in addition to the MD5 hash.
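The duplicate-finding script promised above can be sketched with group-object.  This is only a rough sketch, not the script from a later post: the recursive listing, the FileInfo filter, and the output formatting are my assumptions.  It hashes every file under the current directory, groups the files by hash, and reports any group containing more than one file:

## Assumes the Get-MD5 function defined above.
dir -recurse |
  where { $_ -is [System.IO.FileInfo] } |
  group { Get-MD5 $_ } |
  where { $_.Count -gt 1 } |
  foreach {
    ## $_.Name holds the hash string the group was formed on.
    "Duplicates with hash $($_.Name):"
    $_.Group | foreach { "  $($_.FullName)" }
  }

As a refinement, you could group by $_.Length first and only hash the groups with more than one member, which avoids hashing files that cannot possibly match.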


 


- Marcel


Comments (8)

  1. TC says:

    Um, you realize that various cryptography researchers have now managed to generate different files with the same MD5 hash?

  2. TC says:

    Oops, sorry, I just saw you recognized that at the end of your post.

  3. anild says:

    From a performance perspective, it might be wise to check file attributes like size on disk before launching into a full-blown MD5 hash. This optimizes for the common case, since two files cannot be the same if their sizes are different.

  4. Dan says:

    That’s cool…so now how would I use that in a script to list all the files in a given folder (recursively) that are (very likely) the same?

  5. Mario Goebbels says:

    I think byte arrays should be displayed in hexadecimal, when turned into strings.

  6. Alan says:

    NoClone uncovers duplicate files by comparing byte by byte, not by CRC or MD5, so duplicates found are totally identical. Try it.

    Alan Wo

    The author

  7. AMB says:

    Directory Report finds duplicate files based on name and/or size and/or CRC. It checks the size before calculating the file’s CRC.

    Download the program at

    http://www.file-utilities.com/downloads/wdir.zip

    Allan Braun

    The author
