An Easy Way to Detect Duplicates in a Given Data Set


Many a times we end up in lots of duplicate data in our systems like CRM, and when it becomes difficult to detect the duplicates and clean them up. This blog describes a very simple way to detect duplicates in data without any rocket science knowledge. This process of finding duplicates is not only for CRM data, but any data which can be copied/exported in excel sheet. For CRM, you can retrieve all records of any entity using SDK or other methods and follow the steps below to know all the duplicates.

Step 1: Have your data in excel sheet. I have taken a data set here for easy explanation. My data looks like below –

I have written First Name so that it describes which ones are duplicate and which ones are not. This is to show that we have filtered out all the duplicates at the end of this simple process. Now we need to come up with the criteria on which we want to find duplicates. Lets say our criteria is “If Last name and Email of a record is equal to same of other records then record is duplicate”. You can choose your own criteria.

Step 2: Now Select all the data in Last Name column excluding header. To do this you can select the cell D2, and then press “shift + ctrl + down arrow key”.

Step 3: Do conditional formatting to highlight duplicate values as shown in the screen shot below. (This is optional step and is being done only for easier identification). Do Step2 and Step3 for Email column too. If your criteria contains other columns too, then you will need to do these steps for them also.

After doing step2 and Step 3 for “Last Name” and “Email”, your data will look like below.

Step 4: Now select all the data rows leaving behind headers. To do this, select A2 cell, and press “shift + ctrl + right Arrow+Down Arrow”. Now click on sort and sort this as shown below –

If you had ignored step 3 earlier, then you can ignore first two condition of sorting. If you have chosen more columns in your criteria, then sorting criteria will also contain those. After clicking on ok on ‘sort’ window, our data looks like below –

Now you have all the duplicate records on top. You can just detect it by looking at rows where ‘Last Name’ and ‘Email’ both cells are colored. To further make it more detectable you can perform the following steps –

Step 5: Insert a column at starting, name it as “Is duplicate”. Select A2, and insert formula ‘=IF(AND(E2=E3,F2=F3),"Dup of row below","")’ into this cell, just like shown below. If your criteria contains more columns then this formula will look like ‘‘=IF(AND(E2=E3,F2=F3,X2=X4,...),"Dup of row below","")’.

Step 6: Now copy the similar formula in all the cells of first column. To do this easily, copy the cell A2. Now select all the cells of “Is Duplicate” column A3 onwards, and press “ctrl+v”. Your final result is as below –

Now you know all the duplicates. You just need to locate empty cell followed by “Dup of row Below” in “Is Duplicate” column and you can group duplicates. In this example, first 2nd, 3rd and 4th  rows are duplicate to each other and 5th and 6th Rows are duplicates to each other. You are now free to give appropriate treatments to these duplicates.

Note: If you are using SDK of CRM and need to delete duplicate records, it will be easier if you have GUIDs in one column, so that you can delete these by using SDK API.

Manbhawan Prasad

Comments (4)

  1. Anonymous says:

    Nice article, but kind of complex if you just want to do a quick import and do the verification later.

    On the following site you can find a more easier solution for quick duplicate removal.

    It’s in French.

    http://blogs.exakis.com/exacrm/Lists/Posts/Post.aspx?ID=8

    Below you find a translation and a little improvement to it:

    In the list to import, select the column that you want to remove the double from.

    Go than to Data->Filter->Advanced Filter->Filter the list in place->select unique records only

    Now the duplicate rows will become hidden and you can copy the results to a new excel file and save it to upload.

  2. westray says:

    I usually find duplicates using AutoFilter:

    Step 1: Sort the records.

    Step 2: Add new column with formulas like ‘=IF(OR(A2=A1,A2=A3),”dupl”,””)’ – add flag to those records which are equal with those above/bellow.

    Step 3: Using AutoFilter display simply the flagged records, i.e. the records which are duplicated.

    This procedure can be as well used when comparing multiple columns (like in the blog post), you just need to add multiple flag columns. And it also works in older versions of Excel.

  3. Neil Benson says:

    It’s interesting that you’re using Excel 2007 to manipulate data for CRM data imports. Have you found a way to save an Excel spreadsheet as a comma separated file with double quotes for field delimeters?

    I can’t, so I have to use Excel 2003.

    Neil

  4. ManBhawan prasad says:

    Hi Neil,

    To save it as comma delimited file, you just need to save it by choosing "CSV (Comma Delimited" in "Save As Type" drop down. It will give you a comma delimited file, with double quotes as data delimiter. Data delimiter will only come if there is comma in your data.

    Thanks.

    Manbhawan

Skip to main content