Identifying duplicates between multiple lists

Howdy folks. Jeff here, back from my summer holiday in the Coromandel Peninsula in the North Island of New Zealand, where I’ve been staring at this for the last 21 days:
DDOE_Identifying duplicates between lists_Opoutere

For the next 344 I’ll be staring at this:
DDOE_Identifying duplicates between lists_Excel
God, it’s good to be home.

A while back I answered this thread for someone wanting to identify any duplicate values found between 4 separate lists.

The way I understood the question, if something appears in each of the four lists, the Op wanted to know about it. If an item just appeared in 3 lists but not all 4, then they didn’t want it to be picked up. And the lists themselves might have duplicates within each list.

Say we’ve got these 4 lists:
DDOE_Identifying duplicates between lists_Names

We can’t simply use Conditional Formatting, because that will include duplicate names that don’t appear in each and every column, such as ‘Mike’:
DDOE_Identifying duplicates between lists_Wrong

Rather, we only want names that appear in every column:
DDOE_Identifying duplicates between lists_Right

I wrote a routine that handled any number of lists, using two dictionaries and a bit of shuffling between them. And the routine allows users to select either a contiguous range if their lists are all in one block, or multiple non-contiguous ranges if they aren’t.

  1. The user gets prompted for the range where they want the identified duplicates to appear:
    DDOE_Identifying duplicates between lists_Select Output Range
  2.  

  3. Then they get prompted to select the first list. The items within that list get added to Dic_A. (If they select more than one columns, the following steps get executed automatically).
    DDOE_Identifying duplicates between lists_Select First Range
  4.  

  5. Next they get prompted to select the 2nd list, at which point the code attempts to add each new item to Dic_A. If an item already exists in Dic_A then we know it’s a duplicate between lists, and so we add it to Dic_B. At the end of this, we clear Dic_A. Notice that any reference to selecting a contiguous range has been dropped from the InputBox:
    DDOE_Identifying duplicates between lists_Select 2nd range
  6.  

  7. When they select the 3rd list, then it attempts to add each new item to Dic_B, and if an error occurs, then we know it’s a duplicate between lists, and so we add it to Dic_A. At the end of this, we clear Dic_B. We carry on in this manner until the user pushes Cancel (and notice now that the InputBox message tells them to push cancel when they’re done):
    DDOE_Identifying duplicates between lists_Select 3rd range
  8.  

Pretty simple: just one input box, an intentional infinite loop, and two dictionaries that take turns holding the current list of dictionaries. Hours of fun.

Only problem is, I had forgotten to account for the fact that there might be duplicates within a list. The old code would have misinterpreted these duplicates as between-list duplicates, rather than within-list duplicates. The Op is probably completely unaware, and probably regularly bets the entire future of his country’s economy based on my bad code. Oops.

I’ve subsequently added another step where a 3rd dictionary is used to dedup the items in the list currently being processed. Here’s the revised code. My favorite line is the Do Until “Hell” = “Freezes Over” one.

12 thoughts on “Identifying duplicates between multiple lists

  1. Wouldn’t this suffice ?


    Sub M_snb()
    sn = Cells(1).CurrentRegion

    For Each cl In sn
    If (InStr(" " & Join(Application.Transpose(Application.Index(sn, 0, 1))) & " ", " " & cl & " ") > 0) * (InStr(" " & Join(Application.Transpose(Application.Index(sn, 0, 2))) & " ", " " & cl & " ") > 0) * (InStr(" " & Join(Application.Transpose(Application.Index(sn, 0, 3))) & " ", " " & cl & " ") > 0) * (InStr(" " & Join(Application.Transpose(Application.Index(sn, 0, 4))) & " ", " " & cl & " ") > 0) And InStr("_" & c00 & "_", "_" & cl & "_") = 0 Then c00 = c00 & "_" & cl
    Next

    MsgBox Mid(c00, 2)
    End Sub,

  2. On this small dataset, yes. On the Op’s larger data set, not anytime soon. He had 4 columns with up to 2500 rows in each of them. Your approach takes 2 minutes. Using my approach above takes seconds, including the time taken to select the 4 ranges.

  3. Based on your new information:

    Sub M_snb()
    sn = Cells(1).CurrentRegion

    ReDim sp(UBound(sn, 2) - 1)
    For j = 1 To UBound(sn, 2)
    sp(j - 1) = Application.Index(sn, 0, j)
    Next

    With CreateObject("scripting.dictionary")
    For Each cl In sn
    If Not .exists(cl) Then
    For j = 0 To UBound(sp)
    If IsError(Application.Match(cl, sp(j), 0)) Then Exit For
    Next
    If j > UBound(sp) Then x0 = .Item(cl)
    End If
    Next
    MsgBox Join(.keys, vbLf)
    End With
    End Sub

  4. On second thought we can reduce it to:

    Sub M_snb()
    sn = Cells(1).CurrentRegion

    ReDim sp(UBound(sn, 2) - 1)
    For j = 1 To UBound(sn, 2)
    sp(j - 1) = Application.Index(sn, 0, j)
    Next

    With CreateObject("scripting.dictionary")
    For Each cl In sp(0)
    If Not .exists(cl) Then
    For j = 0 To UBound(sp)
    If IsError(Application.Match(cl, sp(j), 0)) Then Exit For
    Next
    If j > UBound(sp) Then x0 = .Item(cl)
    End If
    Next
    MsgBox Join(.keys, vbLf)
    End With
    End Sub

  5. On third thought to:
    testing only the values in the first range to the other ranges.

    Sub M_snb()
    sn = Cells(1).CurrentRegion

    ReDim sp(UBound(sn, 2) - 1)
    For j = 1 To UBound(sn, 2)
    sp(j - 1) = Application.Index(sn, 0, j)
    Next

    With CreateObject("scripting.dictionary")
    For Each cl In sp(0)
    If Not .exists(cl) Then
    For j = 1 To UBound(sp)
    If IsError(Application.Match(cl, sp(j), 0)) Then Exit For
    Next
    If j > UBound(sp) Then x0 = .Item(cl)
    End If
    Next
    MsgBox Join(.keys, vbLf)
    End With
    End Sub

  6. It might even be done with a oneliner:

    Sub m_snb()
    MsgBox Join(Filter(Application.Transpose([if((countif(B1:B2000,A1:A2000)=0),"~",if(countif(C1:C2000,A1:A2000)=0,"~",if(countif(D1:D2000,A1:A2000)=0,"~",A1:A2000)))]), "~", False), vbLf)
    End Sub

  7. You might find this one-liner code for determining the ordinal suffix to be of interest…

    Function Ordinal(Number As Long) As String
    Ordinal = Mid$(“thstndrdthththththth”, 1 – 2 * ((Number) Mod 10) * (Abs((Number) Mod 100 – 12) > 1), 2)
    End Function

  8. snb: That last routine doesn’t seem to work for me. I entered =RANDBETWEEN(1,2000) in the range A1:D2500, then ran both your code and mine. Mine returned 500 items, and yours returned 461. Looking at your data, some duplicates are definately missing, and other numbers get repeated.

    I haven’t looked at your other listings yet.

  9. @jeff

    this method can’t contain duplicates, since it uses the dictionary.

    Sub M_snb()
    sn = Cells(1).CurrentRegion

    ReDim sp(UBound(sn, 2) - 1)
    For j = 1 To UBound(sn, 2)
    sp(j - 1) = Application.Index(sn, 0, j)
    Next

    With CreateObject("scripting.dictionary")
    For Each cl In sp(0)
    If Not .exists(cl) Then
    For j = 1 To UBound(sp)
    If IsError(Application.Match(cl, sp(j), 0)) Then Exit For
    Next
    If j > UBound(sp) Then x0 = .Item(cl)
    End If
    Next
    MsgBox Join(.keys, vbLf)
    End With
    End Sub

    Post your testfile somewhere so I can have a look.

  10. I realised later you used the oneliner.
    If the range is 2500 rows it might be practical to adapt the code to that range too.
    I introduced a short condition to avoid duplicates:

    Sub m_snb()
    MsgBox Join(Filter(Application.Transpose([if(countif(offset($A$1,,,row(A1:A2500)),A1:A2500)=1,if(countif(B1:B2500,A1:A2500)=0,"~",if(countif(C1:C2500,A1:A2500)=0,"~",if(countif(D1:D2500,A1:A2500)=0,"~",A1:A2500))),"~")]), "~", False), vbLf)
    End Sub

  11. Hi
    If it is not necessary to highlight the value in every column then you can use the following formula in the conditional format
    =SUM(COUNTIF(OFFSET($A1,0,COLUMN(INDIRECT(“$A$1:$C$1”)),6,1),$A1)*1)=3
    If the value must exists in all four columns for this to be true, then you only need to test the values in the first column exist in the other three columns.
    This will highlight the values in the first column where they exist in each of the other three columns, in your example this only returns Jeff & Helen.
    NOTE: I only used indirect so that if columns are inserted to the left of the table the conditional formatting still works.
    Robert

  12. Hi snb. That’s cool…I never knew about the Filter method till now. Only thing is, I wouldn’t want to generalize your approach…as the range to be processed grows, this becomes increasingly slower in comparison to a dictionary. For instance, 5000 items takes 12 seconds, and 7500 items takes 27 seconds, whereas a dictionary does it pretty much instantly.

    I’ve amended my original sub so that if the user selects a 2d range during the initial selection, the code knows that all the lists to be processed are within this range. So they no longer need to make multiple selections. But if they select a 1d range, then the code knows that the user will want to make further selections, so the input box keeps coming up till they push cancel. Have amended the original blog post accordingly.


Posting code? Use <pre> tags for VBA and <code> tags for inline.

Leave a Reply

Your email address will not be published.