Dictionaries can be rude!

Posted on October 23, 2013March 17, 2018 by Jeff Weir

Howdy folks. Jeff Weir here, hacking into Dick’s blog. Quickest bit of hacking I ever did, too: while Dick had prudently used an alpha-numeric password, it didn’t take me very long to hit upon his super-obfuscated ILoveExcel2013. My quickest dictionary attack yet.

Speaking of Dictionaries, I thought I’d make a couple of quick comments in response to Dick’s last post The Encyclopedia of Dictionaries. Did I say quick? That’s a calumniation, if not downright obloquy.

Ah, Dictionaries. While you’re rushing for yours in order to see if I just swore at you, let me reminisce for a moment. Oh magical dictionary of my youth: how I loved to fritter the hours away thumbing through your creased pages armed with a pencil and a penchant for rude words.

Annnnyway…in Dick’s previous post, he linked to this extensive page written by snb on Dictionaries. That’s a good resource. So is this excellent article by mattheswpatrick at Experts-Exchange: Using the Dictionary Class in VBA.

One of the code snippets on snb’s page concerns de-duping records, and goes something like this:

Option Explicit

Sub Original()

    Dim dic As Object
    Dim sn As Variant
    Dim j As Long
    Dim timetaken As Date
    
    timetaken = Now()
    
    Range("D1:E1").EntireColumn.ClearContents
    sn = Range("A1").CurrentRegion
      Set dic = CreateObject("scripting.dictionary")
    
    With dic
        For j = 1 To UBound(sn)
            .Item(sn(j, 1)) = Application.Index(sn, j, 0)
        Next
        Range("D1").Resize(.count, 2) = Application.Index(.items, 0, 0)
    End With
    
    timetaken = Now() - timetaken
    Debug.Print "Original took " & Format(timetaken, "HH:MM:SS")
End Sub

Option Explicit

Sub Original()

Dim dic As Object

Dim sn As Variant

Dim j As Long

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

sn = Range("A1").CurrentRegion

Set dic = CreateObject("scripting.dictionary")

With dic

For j = 1 To UBound(sn)

.Item(sn(j, 1)) = Application.Index(sn, j, 0)

Range("D1").Resize(.count, 2) = Application.Index(.items, 0, 0)

End With

timetaken = Now() - timetaken

Debug.Print "Original took " & Format(timetaken, "HH:MM:SS")

End Sub

(I added the Option Explicit bit in remembrance of my first well-thumbed dictionary. It certainly wasn’t. Explicit, I mean. Although at the time I thought it was. Why, even in the low B’s I came across words like “Boob.” Pretty slim pickings from W onwards, though. In hindsight I believe that putting rude words into dictionaries was a cunning ploy by teachers to get kids to learn English. Kids these days have it a lot better than I did, because today’s dictionaries have lots more rude words in them. Which probably explains why every 2^nd word uttered by modern kids rhymes with the common name for a bird from the Anatidae family.)

Where was I? Ahh, that’s right. There’s a problem with the above code, in that you cannot pass an array bigger than 65536 rows to a Worksheet Function from VBA without getting a Mismatch error. That’s covered in this recent post. So point this code at a range containing 65537 rows, and you’ll get an error on the Application.Index(.items, 0, 0) bit.

In fact, point this code at 65536 rows exactly, and while it will work, it won’t work fast. That’s because it’s really expensive to call a worksheet function like Application.Index from VBA lots and lots. To the point that you don’t want to hold your insufflation while you wait: this took 16 minutes and 15 seconds on my pretty fast laptop.

We can of course rewrite this code so that it doesn’t use Application.Match, which radically speeds things up. But this still leaves you to a potentially mortiferous flaw that might bite you in a place that renders you most undecidedly callipygian. More on this postliminary. First, the code:

Sub Revision1()

    Dim dic As Object
    Dim varInput As Variant
    Dim i As Long
    Dim timetaken As Date
    
    timetaken = Now()
    
    Range("D1:E1").EntireColumn.ClearContents
    varInput = Range("A1").CurrentRegion
    Set dic = CreateObject("scripting.dictionary")
    
    On Error Resume Next 'Needed, because the .Add method throws an error if you try to add a key that already exists.
    With dic
        For i = 1 To UBound(varInput)
            .Add varInput(i, 1), varInput(i, 2)
        Next
        Range("D1").Resize(.count) = Application.Transpose(dic.keys)
        Range("E1").Resize(.count) = Application.Transpose(dic.items)
    End With
    On Error GoTo 0
    
    timetaken = Now() - timetaken
    Debug.Print "Revision1 took " & Format(timetaken, "HH:MM:SS")
End Sub

Sub Revision1()

Dim dic As Object

Dim varInput As Variant

Dim i As Long

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

varInput = Range("A1").CurrentRegion

Set dic = CreateObject("scripting.dictionary")

On Error Resume Next 'Needed, because the .Add method throws an error if you try to add a key that already exists.

With dic

For i = 1 To UBound(varInput)

.Add varInput(i, 1), varInput(i, 2)

Range("D1").Resize(.count) = Application.Transpose(dic.keys)

Range("E1").Resize(.count) = Application.Transpose(dic.items)

End With

On Error GoTo 0

timetaken = Now() - timetaken

Debug.Print "Revision1 took " & Format(timetaken, "HH:MM:SS")

End Sub

Well, that’s slightly faster to be sure. 1 second. Even I can hold my eupnea that long.

One thing to note from the above is that the .add method I’m using throws an error if you try to add a key that already exists. Whereas the .Item method in the original code doesn’t. That error can come in handy, as we’ll see soon.

But while that approach avoids that pesky Application.Index bit, we’ve still got a potentially pernicious problem: in order to transfer the Dictionary from VBA to Excel, we’ve got to first Transpose the array to a Column Vector. And we use Application.Transpose to do that. Which begs the question: What if we end up with more than 65536 non-duplicate values in the dictionary? How will Application.Transpose handle that?

Badly.

I was thinking that the following approach would prove the fastest way around this issue:

Keep adding things to the dictionary until it contains exactly 65536 items (or until we’ve checked everything…whichever occurs first).
At that point, transfer those 65536 items in one go directly from the Dictionary to the worksheet.
After that point, keep trying to add items to the dictionary. If the dictionary accepts them without complaining (which means they are not duplicates) then add them to a variant array as well.
Transfer that variant array containing our ‘remainder’ items to the worksheet all in one hit.

Something like this:

Sub Revision2()

    Dim dic As Object
    Dim varInput As Variant
    Dim varOutput(1 To 983040, 1 To 2) As Variant
    Dim i As Long
    Dim lCount As Long
    Dim timetaken As Date
    
    timetaken = Now()
    
    Range("D1:E1").EntireColumn.ClearContents
    lCount = 1
    varInput = Range("A1").CurrentRegion
    
    Set dic = CreateObject("scripting.dictionary")

    With dic
        For i = 1 To UBound(varInput)
            If Not dic.exists(varInput(i, 1)) Then
                .Add varInput(i, 1), varInput(i, 2)
                
                If i = 65536 Then 'This is the most we can transfer from the Dictionary in one go
                    Range("D1").Resize(65536) = Application.Transpose(dic.keys)
                    Range("E1").Resize(65536) = Application.Transpose(dic.items)
                    lCount = 1
                End If
                
                If i > 65536 Then
                'Transfer remaining duplicates to VarOutut,
                'so we can dump them to the worksheet in one go
                    varOutput(lCount, 1) = varInput(i, 1)
                    varOutput(lCount, 2) = varInput(i, 2)
                    lCount = lCount + 1
                End If
            End If
        Next
        
         If .count < 65537 Then 'we haven't yet transferred anything from the Dictionary to the worksheet
            Range("D1").Resize(.count) = Application.Transpose(dic.keys)
            Range("E1").Resize(.count) = Application.Transpose(dic.items)
        Else
            Range("D1").Offset(65536).Resize(lCount-1, 2) = varOutput
            If lCount > 1 And lCount < 65536 Then Range("D1").Offset(.count + 1).Resize(65536 - lCount).ClearContents
        End If
    End With

    timetaken = Now() - timetaken
    Debug.Print "Revision2 took " & Format(timetaken, "HH:MM:SS")
End Sub

Sub Revision2()

Dim dic As Object

Dim varInput As Variant

Dim varOutput(1 To 983040, 1 To 2) As Variant

Dim i As Long

Dim lCount As Long

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

lCount = 1

varInput = Range("A1").CurrentRegion

Set dic = CreateObject("scripting.dictionary")

With dic

For i = 1 To UBound(varInput)

If Not dic.exists(varInput(i, 1)) Then

.Add varInput(i, 1), varInput(i, 2)

If i = 65536 Then 'This is the most we can transfer from the Dictionary in one go

Range("D1").Resize(65536) = Application.Transpose(dic.keys)

Range("E1").Resize(65536) = Application.Transpose(dic.items)

lCount = 1

End If

If i > 65536 Then

'Transfer remaining duplicates to VarOutut,

'so we can dump them to the worksheet in one go

varOutput(lCount, 1) = varInput(i, 1)

varOutput(lCount, 2) = varInput(i, 2)

lCount = lCount + 1

End If

If .count < 65537 Then 'we haven't yet transferred anything from the Dictionary to the worksheet

Range("D1").Resize(.count) = Application.Transpose(dic.keys)

Range("E1").Resize(.count) = Application.Transpose(dic.items)

Else

Range("D1").Offset(65536).Resize(lCount-1, 2) = varOutput

If lCount > 1 And lCount < 65536 Then Range("D1").Offset(.count + 1).Resize(65536 - lCount).ClearContents

End If

End With

timetaken = Now() - timetaken

Debug.Print "Revision2 took " & Format(timetaken, "HH:MM:SS")

End Sub

So why did I Dim varOutput as (1 To 983040, 1 To 2)? Because 983040 is the difference between the number of rows in ‘new’ Excel, less the block we already transferred across before our Dictionary got too big for Application.Transpose to handle. (Edit: It makes more sense to instead ReDim this varOutput(1 To UBound(varInput, 1), 1 To 2) as Jan Karel suggests in the comments, so that is is only as big as our initial list)

I know…it seems a waste to transfer such a large block, when there might very well only be a few things in it. But that’s the way it has to be, I think. We can’t redim the array back to the actual size of the block we need to transfer, because you can only redim the last dimension of an array. And we can’t swap the order of the dimensions around in order to circumvent this, because then we’d need to use Application.Transpose after I redimmed it, in order to turn it back into a nice, downward-pointing table. So we’d be rubbing against that 65536 limit again. Which – to coin a few words from my Dictionary – is vapid and soporific. Not to mention dreich.

Despite the double handling past 65536 items, this seems pretty fast: I put the numbers 1 through 65536 in column A, and the text A1 through to A65536 in column B, and it deduped them in around 1 second flat. (Of course, this being an extreme example, it didn’t actually find any duplicates to drop. But not for want of trying). And to process 1048576 unique records took 42 seconds.

Now here’s something interesting I also discovered: when I randomly sorted that list, it took 1 minute ten seconds to process those 1048576 records. So spend a couple of seconds sorting large data ranges before you dedupe.

Hmmm…maybe that approach of transferring the first 65536 items from the dictionary – and the rest via a variant array – is overkill. What if we don’t transfer anything from the dictionary to the worksheet, but simply use the dictionary as a means to identify non-duplicates? We can then write ALL those non-duplicates to a variant array one by one as we identify them. Then we can just bring the one block of data across. Like this:

Sub Revision3()

    Dim dic As Object
    Dim varInput As Variant
    Dim varOutput(1 To 1048576, 1 To 2) As Variant
    Dim i As Long
    Dim count As Long
    Dim timetaken As Date
    
    timetaken = Now()
    
    Range("D1:E1").EntireColumn.ClearContents
    count = 1
    varInput = Range("A1").CurrentRegion
    
    Set dic = CreateObject("scripting.dictionary")
    With dic
        For i = 1 To UBound(varInput)
            If Not dic.exists(varInput(i, 1)) Then
                .Add varInput(i, 1), varInput(i, 2)
                    varOutput(.count, 1) = varInput(i, 1)
                    varOutput(.count, 2) = varInput(i, 2)
            End If
        Next
        Range("D1").Resize(.count, 2) = varOutput
    End With

    timetaken = Now() - timetaken
    Debug.Print "Revision3 took " & Format(timetaken, "HH:MM:SS")

End Sub

Sub Revision3()

Dim dic As Object

Dim varInput As Variant

Dim varOutput(1 To 1048576, 1 To 2) As Variant

Dim i As Long

Dim count As Long

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

count = 1

varInput = Range("A1").CurrentRegion

Set dic = CreateObject("scripting.dictionary")

With dic

For i = 1 To UBound(varInput)

If Not dic.exists(varInput(i, 1)) Then

.Add varInput(i, 1), varInput(i, 2)

varOutput(.count, 1) = varInput(i, 1)

varOutput(.count, 2) = varInput(i, 2)

End If

Range("D1").Resize(.count, 2) = varOutput

End With

timetaken = Now() - timetaken

Debug.Print "Revision3 took " & Format(timetaken, "HH:MM:SS")

End Sub

Well, it’s only slightly slower for a sorted dataset of 1048576 items and a heck of a lot simpler: 43 seconds as opposed to 42 under the much more complicated previous approach. It takes 1 minute 10 seconds if the data isn’t sorted first. If time is money, we ain’t gonna get rich off of the back of the extra complexity of the Revision2 approach vs the Revision3 approach.

Of course, another option is a bit of SQL:

Sub sql()

Dim con As Object
Dim rstData As Object
Dim sDatabaseRangeAddress As String
Dim timetaken As Date

    timetaken = Now()
    
    Range("D1:E1").EntireColumn.ClearContents
    
    Set con = CreateObject("ADODB.Connection")
    Set rstData = CreateObject("ADODB.Recordset")
    
    con.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" & ActiveWorkbook.FullName & ";Extended Properties = ""Excel 12.0 Macro;HDR=no"";"
    rstData.Open "SELECT DISTINCT F1, F2 FROM [Sheet1$]   ", con
    
    Range("D1").CopyFromRecordset rstData

    rstData.Close
    Set rstData = Nothing
    Set con = Nothing
    
    timetaken = Now() - timetaken
    Debug.Print "SQL took " & Format(timetaken, "HH:MM:SS")
End Sub

Sub sql()

Dim con As Object

Dim rstData As Object

Dim sDatabaseRangeAddress As String

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

Set con = CreateObject("ADODB.Connection")

Set rstData = CreateObject("ADODB.Recordset")

con.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" & ActiveWorkbook.FullName & ";Extended Properties = ""Excel 12.0 Macro;HDR=no"";"

rstData.Open "SELECT DISTINCT F1, F2 FROM [Sheet1$] ", con

Range("D1").CopyFromRecordset rstData

rstData.Close

Set rstData = Nothing

Set con = Nothing

timetaken = Now() - timetaken

Debug.Print "SQL took " & Format(timetaken, "HH:MM:SS")

End Sub

This took 2 min 58 seconds to process 1048576 unique numbers. Man, that’s slow. I thought it might have been because of the use of DISTINCT, so I took the DISTINCT bit out and re-ran it. It still took a long time – I think 2 min 10 seconds or something like that, but I forgot to write it down.

What about Excel’s native Remove Duplicates functionality?

Sub Excel_RemoveDuplicates()

Dim rngSource As Range
Dim rngDest As Range
Dim timetaken As Date

    timetaken = Now()
    Range("D1:E1").EntireColumn.ClearContents
    
    Set rngSource = Range("A1").CurrentRegion
    Set rngDest = Range("D1").Resize(rngSource.Rows.count, 2)
    rngDest.Value = rngSource.Value
    rngDest.RemoveDuplicates Columns:=2, Header:=xlNo
    
    timetaken = Now() - timetaken
     Debug.Print "Remove Duplicates took " & Format(timetaken, "HH:MM:SS")
End Sub

Sub Excel_RemoveDuplicates()

Dim rngSource As Range

Dim rngDest As Range

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

Set rngSource = Range("A1").CurrentRegion

Set rngDest = Range("D1").Resize(rngSource.Rows.count, 2)

rngDest.Value = rngSource.Value

rngDest.RemoveDuplicates Columns:=2, Header:=xlNo

timetaken = Now() - timetaken

Debug.Print "Remove Duplicates took " & Format(timetaken, "HH:MM:SS")

End Sub

Faster? Well, it depends. Pretty much instantaneous if your items are sorted. But it took 2min 48 seconds to run for completely randomized data. Which is pretty concerning, given it only takes a few seconds to sort data.

So there you go. Dictionaries are great, but perhaps check first how much data you have going in to them, and how much going out. Otherwise you’ll possibly get an error.

Right, that’s me. I’m off to try a pocket dictionary attack on another couple of Excel blogs. Surely at least one of them will have either ILoveBacon, ILoveBanjos, ILoveBach, ILoveCharts, or ILoveCanada as their WordPress password. Maybe even IloveDictionaries.

—Update 25/10/2013—

Prompted by some of the great comments below, I did some testing on a simple one-column dataset under a range of conditions:

Key points:

Remove Duplicates is very fast on datasets with lots of unique values if they are sorted
Remove Duplicates is very slow on datasets with lots of unique values if they are not sorted
Dictionaries are fastest on datasets sorted A to Z, slower on datasets sorted Z to A, and slowest on unsorted datasets. That’s my surname with a ‘d’ on the end. Weird.
The Advanced Filter is not shown, because it simply can’t handle big arrays with numerous non-duplicates quickly.

27 thoughts on “Dictionaries can be rude!”

Excellent article.
In my quest to program “everything VBA” I got curious: what if we do use a collection in this particular case?

So I cooked up:

Sub RevisionJKP()
    Dim cCol As Collection
    Dim varInput As Variant
    Dim varOutput() As Variant
    Dim i As Long
    Dim count As Long
    Dim timetaken As Date

    timetaken = Now()

    Range("D1:E1").EntireColumn.ClearContents
    count = 1
    varInput = Range("A1").CurrentRegion
    ReDim varOutput(1 To UBound(varInput, 1), 1 To 2) As Variant

    Set cCol = New Collection
    On Error Resume Next
    For i = 1 To UBound(varInput)
        Err.Clear
        cCol.Add varInput(i, 1), varInput(i, 1) & "|" & varInput(i, 2)
        If Err.Number = 0 Then
            count = count + 1
            varOutput(count, 1) = varInput(i, 1)
            varOutput(count, 2) = varInput(i, 2)
        End If
    Next
    Range("D1").Resize(cCol.count, 2) = varOutput

    timetaken = Now() - timetaken
    Debug.Print "RevisionJKP took " & Format(timetaken, "HH:MM:SS")

End Sub

Sub RevisionJKP()

Dim cCol As Collection

Dim varInput As Variant

Dim varOutput() As Variant

Dim i As Long

Dim count As Long

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

count = 1

varInput = Range("A1").CurrentRegion

ReDim varOutput(1 To UBound(varInput, 1), 1 To 2) As Variant

Set cCol = New Collection

On Error Resume Next

For i = 1 To UBound(varInput)

Err.Clear

cCol.Add varInput(i, 1), varInput(i, 1) & "|" & varInput(i, 2)

If Err.Number = 0 Then

count = count + 1

varOutput(count, 1) = varInput(i, 1)

varOutput(count, 2) = varInput(i, 2)

End If

Range("D1").Resize(cCol.count, 2) = varOutput

timetaken = Now() - timetaken

Debug.Print "RevisionJKP took " & Format(timetaken, "HH:MM:SS")

End Sub

And executed both your revision3 and mine:

Revision3 took 00:01:05
RevisionJKP took 00:00:28

I guess doing it all in native VBA can have its advantage.
Note I used redim to at least make that array no larger than the original range.

Hi Jan Karel. That speed difference is staggering.

I made some slight modifications to your code:

Sub RevisionJKP()
    Dim cCol As Collection
    Dim varInput As Variant
    Dim varOutput() As Variant
    Dim i As Long
    Dim timetaken As Date

    timetaken = Now()

    Range("D1:E1").EntireColumn.ClearContents
    varInput = Range("A1").CurrentRegion
    ReDim varOutput(1 To UBound(varInput, 1), 1 To 2)

    Set cCol = New Collection
    On Error Resume Next
    With cCol
        For i = 1 To UBound(varInput)
            Err.Clear
            .Add varInput(i, 1), varInput(i, 1) & "|" & varInput(i, 2)
            If Err.Number = 0 Then
                varOutput(.count, 1) = varInput(i, 1)
                varOutput(.count, 2) = varInput(i, 2)
            End If
         Next
        Range("D1").Resize(.count, 2) = varOutput
    End With

    timetaken = Now() - timetaken
    Debug.Print "RevisionJKP took " & Format(timetaken, "HH:MM:SS")

End Sub

Sub RevisionJKP()

Dim cCol As Collection

Dim varInput As Variant

Dim varOutput() As Variant

Dim i As Long

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

varInput = Range("A1").CurrentRegion

ReDim varOutput(1 To UBound(varInput, 1), 1 To 2)

Set cCol = New Collection

On Error Resume Next

With cCol

For i = 1 To UBound(varInput)

Err.Clear

.Add varInput(i, 1), varInput(i, 1) & "|" & varInput(i, 2)

If Err.Number = 0 Then

varOutput(.count, 1) = varInput(i, 1)

varOutput(.count, 2) = varInput(i, 2)

End If

Range("D1").Resize(.count, 2) = varOutput

End With

timetaken = Now() - timetaken

Debug.Print "RevisionJKP took " & Format(timetaken, "HH:MM:SS")

End Sub

On my system, this takes 0:00:16.

Then I changed Revision3 to early binding, and used the same structure:

Sub Revision4()

    Dim dic As Scripting.Dictionary
    Dim varInput As Variant
    Dim varOutput() As Variant
    Dim i As Long
    Dim timetaken As Date
   
    timetaken = Now()
   
    Range("D1:E1").EntireColumn.ClearContents
    varInput = Range("A1").CurrentRegion
    ReDim varOutput(1 To UBound(varInput, 1), 1 To 2)
   
    Set dic = New Scripting.Dictionary
    On Error Resume Next
    With dic
        For i = 1 To UBound(varInput)
            Err.Clear
            .Add varInput(i, 1), varInput(i, 2)
            If Err.Number = 0 Then
                varOutput(i, 1) = varInput(i, 1)
                varOutput(i, 2) = varInput(i, 2)
            End If
        Next
        Range("D1").Resize(.count, 2) = varOutput
    End With

    timetaken = Now() - timetaken
    Debug.Print "Revision3 took " & Format(timetaken, "HH:MM:SS")

End Sub

Sub Revision4()

Dim dic As Scripting.Dictionary

Dim varInput As Variant

Dim varOutput() As Variant

Dim i As Long

Dim timetaken As Date

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

varInput = Range("A1").CurrentRegion

ReDim varOutput(1 To UBound(varInput, 1), 1 To 2)

Set dic = New Scripting.Dictionary

On Error Resume Next

With dic

For i = 1 To UBound(varInput)

Err.Clear

.Add varInput(i, 1), varInput(i, 2)

If Err.Number = 0 Then

varOutput(i, 1) = varInput(i, 1)

varOutput(i, 2) = varInput(i, 2)

End If

Range("D1").Resize(.count, 2) = varOutput

End With

timetaken = Now() - timetaken

Debug.Print "Revision3 took " & Format(timetaken, "HH:MM:SS")

End Sub

This executed in 0:00:51

I’m going to have to go back and reread matthewspatrick’s article, because I was under the illusion that the Dictionary would be faster.

snb says:

October 24, 2013 at 2:45 am

I think you might have overlooked the essence of the cited code:
It enters an [u]array[/u] into the dictionary item, not a value.
Suppose the currentregion consists of 100 rows and 8 columns, your revision won’t handle that.
I don’t think you can compare handling arrays in dictionary itmes to handling values in dictionary items.
jkpieterse says:

October 24, 2013 at 2:56 am

@snb: I know Dictionaries can do a lot more. I just want to make one point:

Don’t try to use Dictionaries (replace Dictionaries with whatever technology you like here) just because you *think* they provide the best method to solve the problem. Carefully decide (and test) whether they are *really* the best solution for the problem at hand.

So if efficiency counts, test your code for efficiency (speed, mem usage, programming ease, …) and also test alternatives. Then decide which route to take.

@Jeff: Well, as you can see the old saying applies: the proof of the pudding is in the eating.

Hi snb. I’ll have to digest that in the morning…late here and just about to go to bed. I absolutely might have missed the point, and this is complex stuff for me to get my head around :-)

Part of the reason for writing this post (which I did in a day) was to gain a better understanding of these things. And also to highlight that while it’s pretty easy to get data into a dictionary from the grid, getting it back out again into the grid requires thinking about how you are going to work around the 65536 hard-limit of the VBA/Worksheet interface. Because as I’ve pointed out, you can’t directly transfer the Dictionary keys or items back to the worksheet in one go if they have more than that amount of items in them.

And along the way, it was also very interesting to test some other approaches.

Jan Karel: I think the speed difference between your revision and mine is due to something wrong in my coding. I did some further testing with these simplified versions that just look at one column:

Sub Coln() Dim cCol As Collection Dim varInput As Variant Dim varOutput() As Variant Dim i As Long Dim timetaken As Date


        timetaken = Now()
        Range(“D1”).EntireColumn.ClearContents

        varInput = Range(“A1”).CurrentRegion

        ReDim varOutput(1 To UBound(varInput), 1 To 1)
        Set cCol = New Collection

        On Error Resume Next

        With cCol

            For i = 1 To UBound(varInput)

                Err.Clear

                .Add varInput(i, 1), varInput(i, 1) & “”

                If Err.Number = 0 Then

                    varOutput(.count, 1) = varInput(i, 1)

                End If

             Next

            Range(“D1”).Resize(.count) = varOutput

        End With
        timetaken = Now() – timetaken

        Debug.Print “Coln ” & Time() & ” took ” & Format(timetaken, “HH:MM:SS”)
    End Sub


Sub Dic()
    Dim Dic As Scripting.Dictionary

    Dim varInput As Variant

    Dim varOutput() As Variant

    Dim i As Long

    Dim timetaken As Date
    timetaken = Now()
    Range(“E1”).EntireColumn.ClearContents

    varInput = Range(“A1”).CurrentRegion

    ReDim varOutput(1 To UBound(varInput, 1), 1 To 1)
    Set Dic = New Scripting.Dictionary

    On Error Resume Next

    With Dic

        For i = 1 To UBound(varInput)

            Err.Clear

            .Add varInput(i, 1), varInput(i, 1)

            If Err.Number = 0 Then

                varOutput(.count, 1) = varInput(i, 1)
            End If

        Next

        Range(“E1”).Resize(.count, 1) = varOutput

    End With
    timetaken = Now() – timetaken

    Debug.Print “Dic ” & Time() & ” took ” & Format(timetaken, “HH:MM:SS”)
End Sub
Here’s my results:


Column Content
Dictionary
Collection
Sort
Unique Values


1 to 1048575
0:00:25
0:00:13
A-Z
1


1 to 1048575
0:00:40
0:00:14
Z-A
1


1 to 1048575
0:00:00
0:00:11
Random
1


Randbetween(1,1000)
0:00:02
0:00:03
Random
1000


Randbetween(1,100000)
0:00:04
0:00:06
Random
99997


Randbetween(1,1000000)
0:00:25
0:00:09
Random
649652


Basically the Dictionary is faster if you have fewer unique values/more duplicates, and the Collection is faster if you have more unique values/fewer duplicates.

I’ll have to do some testing of the other methods too across a variety of sample data. All in good time.

Doug Glancy says:

October 24, 2013 at 8:28 am

Jeff, once again a literary tour-de-farce. I beg, however, to differ with your conclusions regarding RemoveDuplicates, which I maintain warrants longer shrift.

In a recent post, at http://yoursumbuddy.com/get-unique-per-row-values-removeduplicates/, I came up with code that can come up with unique values for non-contiguous columns using RemoveDuplicates. I just ran it on three columns and 1,048,576 rows of unsorted data, getting 5 unique values from the combination of columns A and C in just over two seconds.
Craig Hatmaker says:

October 24, 2013 at 11:11 am

As mentioned, it all depends on what we’re trying to do.

About the 65536 row limit – If we are using XL for over even a few thousand rows odds are we are using XL as a database. Even free databases can handle this problem in fractions of a second and can go well beyond XL’s current 1,048,576 limit. So for me, I choose not to spend time figuring out how to work around that limitation.

While ignoring limits (even though ALL of these methods work with XL’s current 1,048,576 limit), let’s discuss which XL solution makes sense for which situation. For me, they are:

[1] Removing Duplicates from a range – Use RemoveDuplicates. It’s crazy fast, is a 1 liner, handles ranges with multiple columns, and has some nice options to pick and choose key columns:

Range("D1").CurrentRegion.RemoveDuplicates Columns:=Array(1, 2), Header:=xlNo
[2] Creating a list of unique values in a new range [2a] We could Use Advanced Filter:Range("A1").CurrentRegion.AdvancedFilter Action:=xlFilterCopy, CopyToRange:=Range("D1"), Unique:=True
[2b] Oddly enough, copying the range with duplicates and then use RemoveDuplicates works even faster:Range("A1").CurrentRegion.Copy Sheet2.Range("D1") Range("D1").CurrentRegion.RemoveDuplicates Columns:=Array(1, 2), Header:=xlNo
[3] Creating a list of Unique values from a range with the range’s contents into an Array – If our target is NOT a range, then Collections and Dictionaries make sense. This is my go to method:
Sub DicRow()
Dim Dic As Object Dim Row As Range Dim timetaken As Date timetaken = Now() Set Dic = CreateObject("Scripting.Dictionary") For Each Row In [A1].CurrentRegion.Rows Dic(Row.Cells(1)) = Row Next timetaken = Now() - timetaken Debug.Print "DicRow " & Time() & " took " & Format(timetaken, "HH:MM:SS") End Sub Remember that Dic.Items() and Dic.Keys() are ARRAYS! We don’t need another variable. To see a unique value, use the Dic.Keys() array like this:
Debug.Print Dic.Keys()(0)
The above gets the first key. And in this example, DicItems() has the actual row range tucked into it. So to see any of the associated cells use something like this:
Debug.Print Dic.Items()(0)(1,2)
The above gets the first unique row’s second cell.
Dick Kusleika says:

October 24, 2013 at 12:00 pm

Craig: While I was fixing the code tags, I fixed the zero-bound thing too. I also changed

Dic.Keys()(0)
to
Dic.Keys(0)
They both work the same, it seems, so I'm wondering if there's a difference. And if not, why would you choose the former over the later?
Michael says:

October 24, 2013 at 12:14 pm

Two comments about collections. Different than dictionaries, collections can be sorted. Dick showed us how way back here:

http://dailydoseofexcel.com/archives/2004/06/02/sort-a-collection/

And different than dictionaries, collections do not have a .exists property. You have to roll your own, trapping the error. Here is the one I use.

Public Function IsIn(Col As Collection, Key As String) As Boolean Dim errNum As Long, TEMP As Variant errNum = 0 Err.Clear On Error Resume Next TEMP = Col.Item(Key) errNum = CLng(Err.Number) On Error GoTo 0 If errNum = 5 Then 'IsIn = False Exit Function End If IsIn = True 'errNums 0 , 438 End Function
… mrt
Jeff Weir says:

October 24, 2013 at 1:43 pm
Hiya Craig. I was hoping you’d show up. I love to argue with IT :-)

Even free databases can handle this problem in fractions of a second and can go well beyond XL’s current 1,048,576 limit. Free database aren’t. They don’t come with free IT support, or freedom from IT to implement them, unless you are IT.

You’re the IT director where you work, Craig. So you have no constraints to implementing the above. Whereas I’m currently doing some work for a large (by New Zealand standards) government department. They won’t even let us load up the free PowerPivot Addin that’s built by Microsoft for Microsoft Excel. So adding say another MYSQL implementation that is going to require their resource and be out of their control is out of the question.

Use RemoveDuplicates. It’s crazy fast. Try this:
- Put the numbers 1 through 1,048,576 into column A in that order. Use remove duplicates.
- Now sort that list in random order (chuck a temporary row of =RAND() down the side, and sort on that. Then delete that temp column). Now try Remove Duplicates.
- Now sort that column by Z to A, and use Remove Duplicates.
I’ll see you back here in 5 minutes or so. This is one lousy algorithm if your data is not sorted.

Plus, it doesn’t help out in this case: http://forum.chandoo.org/threads/find-duplicate-values-in-1-8m-data.12473/ where a user has 1.8M records that they want to dedup. Yes, those records probably live in a database, so the op could probably get IT to rewrite the query that extracted those records so that they come through as distinct in the first place. But IT can sometimes be damn unresponsive. I used to email such basic requests to an IT guy in a small firm, and I’d never hear back, because his time management skills sucked. I had to email the request, then go see him on another floor and insist he did it while I was there. Give me ‘inefficient’ Excel any day, because I’m the only one in the driver’s seat.

Remember that Dic.Items() and Dic.Keys() are ARRAYS! We don’t need another variable. Your sample code puts some data into a dictionary. It doesn’t spit it back out again. What if that dictionary contains over 65536 items? How do you get it back into the worksheet? That’s why I’m using another variable.

[2a] We could Use Advanced Filter. Yeah, I forgot to mention this in my post. Put =randbetween(1,100,000) in column A, and convert to values. Now run the advanced filter on that, and select the ‘Unique Records Only’ option. I’ll see you back here after you’ve restarted Excel.

Hey, thanks for the comments. Always fun reminding IT that not all users are IT :-)
Jeff Weir says:

October 24, 2013 at 1:54 pm

@Doug. Ha! Literary tour-de-farce…that’s a good one. Hey, your post is brilliant. I missed it somehow. (Sometimes I tend to skim over long posts, although I have no qualms about writing an even longer one!).

Check out my comments above re Excel’s Remove Duplicates. Sort your data first, is my advice. And bad luck if you want to dedup more than 1,048,576 records.
snb says:

October 24, 2013 at 2:54 pm

@Jeff

The code you cited was meant for this situation:

Sub M_delete_duplicates() sn = Sheets("Sheet1").Cells(1).CurrentRegion.Resize(, 5)
With CreateObject("scripting.dictionary") For j = 1 To UBound(sn) .Item(sn(j, 1)) = Application.Index(sn, j, 0) Next sp = Application.Index(.items, 0, 0) MsgBox UBound(sn) & vbTab & UBound(sp, 2) Sheets("Sheet1").Cells(1, 10).Resize(.Count, UBound(sn, 2)) = Application.Index(.items, 0, 0) End With End Sub It also illustrates that you can construct a 2-dimensional array based on a 1-dimensional array .items, using:
sp = Application.Index(.items, 0, 0)
Doug Glancy says:

October 24, 2013 at 4:20 pm

@Jeff, thanks! Again, though, my point is that the code in that post is really fast without sorting. I wonder if the difference is because my code does the operation in a brand-new temporary workbook.
Jeff Weir says:

October 24, 2013 at 5:03 pm

Doug: “Really Fast” depends on how many records, and with some approaches speed decreases non-lineally in relation to an increase in records. So we can never assume that an approach that is fastest with a dataset of rows x will also be fastest with a dataset of 2x or (heavens forbid) x^2. Check out the table I’ve added at the bottom of my original post above to see what I mean.

@snb. I see your intent now. If you feed an entire table row by row to the dictionary using the .item property, then the dictionary automatically uses the first item in that array as the key, and stores the entire line as the item.

So as written, your code looks for the last occurring duplicate items in row A of the table, and returns the whole row. So if we feed it this:

A	1	2	3	4
A	5	6	7	8
B	9	10	11	12
B	13	14	15	16

…it will return these rows:

A	5	6	7	8
B	13	14	15	16

That’s cool. I knew you could assign other objects (e.g. arrays, dictionaries) to a dictionary, but I didn’t know that if you did this via the .item property it would automatically assign the first element in the array as a key. Sweet.

That said, the poor speed performance of Application.Index(sn, j, 0) within the loop limits the practicality of this approach severely. To the point that you would want to seriously reconsider using this approach if there was any chance that you would have more than a few thousand rows.
For instance, here’s what I get if I fill 5 columns with =RANDBETWEEN(1,100)

1000 rows: 1 second
5000 rows: 13 seconds
10000 rows: 54 seconds
15000 rows: 2 minutes
65536 rows: 16 minutes
Anything greater: Code fails

It would actually be much faster to concatenate all the columns together in the worksheet, attempt to add the key to the dictionary along with the concatenated records, and then spit the deduped records back to the worksheet, and then split them back into separate columns.

Or if you don’t want to do the concatenation in the worksheet, you can do it in VBA like this:

Sub Revision_A2()
Dim dic As Object
Dim varInput As Variant
Dim j As Long
Dim timetaken As Date

timetaken = Now()
    
    varInput = Range("A1").CurrentRegion.Resize(, 5)
   
    Set dic = CreateObject("scripting.dictionary")
    With dic
        For j = 1 To UBound(varInput)
            .Item(varInput(j, 1)) = varInput(j, 1) & "|" & varInput(j, 2) & "|" & varInput(j, 3) & "|" & varInput(j, 4) & "|" & varInput(j, 5)
        Next
    
        Range("G1").CurrentRegion.ClearContents
        Range("G1").Resize(.count, 1) = Application.Transpose(.items)
        Range("G1").CurrentRegion.TextToColumns _
                Destination:=Range("G1"), _
                DataType:=xlDelimited, _
                Other:=True, OtherChar:="|", _
                FieldInfo:=Array(Array(1, 1), Array(2, 1), Array(3, 1), Array(4, 1), Array(5, 1))

    End With
    
    timetaken = Now() - timetaken
    Debug.Print "Revision2 " & Time() & " took " & Format(timetaken, "HH:MM:SS")
    
End Sub

Sub Revision_A2()

Dim dic As Object

Dim varInput As Variant

Dim j As Long

Dim timetaken As Date

timetaken = Now()

varInput = Range("A1").CurrentRegion.Resize(, 5)

Set dic = CreateObject("scripting.dictionary")

With dic

For j = 1 To UBound(varInput)

.Item(varInput(j, 1)) = varInput(j, 1) & "|" & varInput(j, 2) & "|" & varInput(j, 3) & "|" & varInput(j, 4) & "|" & varInput(j, 5)

Range("G1").CurrentRegion.ClearContents

Range("G1").Resize(.count, 1) = Application.Transpose(.items)

Range("G1").CurrentRegion.TextToColumns _

Destination:=Range("G1"), _

DataType:=xlDelimited, _

Other:=True, OtherChar:="|", _

FieldInfo:=Array(Array(1, 1), Array(2, 1), Array(3, 1), Array(4, 1), Array(5, 1))

End With

timetaken = Now() - timetaken

Debug.Print "Revision2 " & Time() & " took " & Format(timetaken, "HH:MM:SS")

End Sub

That takes 1 second to do 65536 records.

Craig Hatmaker says:

October 24, 2013 at 8:42 pm

@Jeff

Re. RemoveDuplicates
Dost thou thinketh I dare break the first IT commandment – “THOU SHALT TEST!”

Code:
Sheet2.Range(“A1”).CurrentRegion.RemoveDuplicates Columns:=Array(1, 2), Header:=xlNo

Data:
Unsorted. 3 Columns in data. Used 2 Columns for unique items. Used random A## format (Random Alpha & Random 2 digit number). 1,048,567 rows.

Results:
3 seconds

And yes – it takes less than a second for a mere 65,536 rows.

Re. Free Databases – Damn IT.
We don’t need MS Access to put data in MS Access tables. SQL Server Express can be loaded onto a PC and run by mere mortals. No DBA daemons required. And yse – it’s good to by King. As ruler I shall not lead my people unto darkness but rather into the light of best practices. So don’t expect me to support using XL as a database. It ain’t. It’s a good thing that poor slob working with 1.8 million records in XL has people like you – cause my ONLY advice to him/her is “get a database”.

Re. Dic.Items() and Dic.Keys() are ARRAYS!
You missed the point. If you need to output to a range – don’t use dictionaries. Use RemoveDuplicates. If you want to work with the data as an array – Dic.Items() IS an array – just use as is.

@Craig. Weird. Because it takes a full 2 minutes ten seconds when I fill columns A to C with this:
=CHAR(RANDBETWEEN(65,90))& RANDBETWEEN(1,99)

…and then run this:

Sub RemoveDuplicates_Craig()
Dim timetaken As Date

timetaken = Now()
Sheet2.Range("A1").CurrentRegion.RemoveDuplicates Columns:=Array(1, 2), Header:=xlNo
timetaken = Now() - timetaken
    Debug.Print "RemoveDuplicates_Craig " & Time() & " took " & Format(timetaken, "HH:MM:SS")
End Sub

Sub RemoveDuplicates_Craig()

Dim timetaken As Date

timetaken = Now()

Sheet2.Range("A1").CurrentRegion.RemoveDuplicates Columns:=Array(1, 2), Header:=xlNo

timetaken = Now() - timetaken

Debug.Print "RemoveDuplicates_Craig " & Time() & " took " & Format(timetaken, "HH:MM:SS")

End Sub

Maybe it’s a version or computer issue? I’ll email you my file.

Craig Hatmaker says:

October 24, 2013 at 9:10 pm

Hi Dick

Thanks for the help. Please set the dic.Keys() and dic.Items() back. Here’s why. To be precise – these aren’t arrays. They are METHODS that RETURN arrays, similar to Split(). So if we intend to use the dic.Keys() or dic.Items() methods repeatedly, we should assign them to arrays so as to not incur the function overhead. Like so (assumes v is declared as a variant or variant array and dic has already been built):

V = dic.Keys()
debug.print v(0), dic.Keys()(0)
Craig Hatmaker says:

October 24, 2013 at 9:12 pm

@Jeff – Try setting Application.Calculation = xlCalculationManual.
Craig Hatmaker says:

October 24, 2013 at 9:18 pm

BTW – Here is a neat trick. Let’s assume we have a source table with 3 columns. After inputing the table into our dictionary using the coding in DicRow from my first post:

Dic(Row.Cells(1)) = Row

We can put the entire row into a range with one line like so:

[A1].Resize(1,3) = dic.Items()(0)
Jeff Weir says:

October 24, 2013 at 9:19 pm

Makes no difference on my PC. I get lots and lots of screen flickering, and it takes ages. Maybe it’s an addin issue or something?

@Jan Karel: snb’s intent with his code has become a bit more clear to me: As written, his code looks for the last occurring duplicate items in column A of the table, and returns the whole row. So if we feed it this:

A	1	2	3	4
A	5	6	7	8
B	9	10	11	12
B	13	14	15	16

…it will return these rows:

A	5	6	7	8
B	13	14	15	16

Meaning we’d need to change your code slightly to do what he wants. And also meaning we don’t need to concatenate the varInput(i, 1) & “|” & varInput(i, 2) bit on the line cCol.Add varInput(i, 1) , varInput(i, 1) & “|” & varInput(i, 2)
Also note that you don’t need the Count = 1 line, and in fact at the beginning we want Count = 0 as you have coded it up, otherwise you get a blank line right at the beginning of your output.

Which leaves us with this even faster version:

Sub RevisionJKP_2()
    Dim cCol As Collection
    Dim varInput As Variant
    Dim varOutput() As Variant
    Dim i As Long
    Dim count As Long
    Dim timetaken As Date
    Dim strItem As String

    timetaken = Now()

    Range("D1:E1").EntireColumn.ClearContents
    varInput = Range("A1").CurrentRegion
    ReDim varOutput(1 To UBound(varInput, 1), 1 To 2) As Variant

    Set cCol = New Collection
    On Error Resume Next
    For i = 1 To UBound(varInput)
        strItem = varInput(i, 1)
        Err.Clear
        cCol.Add strItem, strItem
        If Err.Number = 0 Then
            count = count + 1
            varOutput(count, 1) = strItem
            varOutput(count, 2) = varInput(i, 2)
            
        End If
    Next
    Range("D1").Resize(cCol.count, 2) = varOutput

    timetaken = Now() - timetaken
    Debug.Print "RevisionJKP2 " & Time() & " took " & Format(timetaken, "HH:MM:SS")

End Sub

Sub RevisionJKP_2()

Dim cCol As Collection

Dim varInput As Variant

Dim varOutput() As Variant

Dim i As Long

Dim count As Long

Dim timetaken As Date

Dim strItem As String

timetaken = Now()

Range("D1:E1").EntireColumn.ClearContents

varInput = Range("A1").CurrentRegion

ReDim varOutput(1 To UBound(varInput, 1), 1 To 2) As Variant

Set cCol = New Collection

On Error Resume Next

For i = 1 To UBound(varInput)

strItem = varInput(i, 1)

Err.Clear

cCol.Add strItem, strItem

If Err.Number = 0 Then

count = count + 1

varOutput(count, 1) = strItem

varOutput(count, 2) = varInput(i, 2)

End If

Range("D1").Resize(cCol.count, 2) = varOutput

timetaken = Now() - timetaken

Debug.Print "RevisionJKP2 " & Time() & " took " & Format(timetaken, "HH:MM:SS")

End Sub

Craig Hatmaker says:

October 24, 2013 at 10:00 pm

@Jeff

Got your spreadsheet. It’s the randomness. Not the data size. Your data has more unique values. It is also why Revision_A2() is so fast. For each result, check not only the time but the number of unique values returned.
Jeff Weir says:

October 24, 2013 at 10:30 pm

It’s actually a combination of the number of unique values, and the crap algorithm that Excel’s Remove Duplicates uses. To prove it, put =CHAR(RANDBETWEEN(65,90))& RANDBETWEEN(1,99999) in Column A, then paste special values. Run the Remove Duplicates algorythm. It takes a while.

Now do that again, but sort it Z to A. Takes much longer.

Now,do it again, but sort it A to Z before you run Excel’s Remove Duplicates. This time it takes no time at all.

Refer to the table I added at the bottom of my original article.
snb says:

October 25, 2013 at 3:28 am

@Jeff Weir

There’s a file attached to the webpage to illustrate the examples.
http://www.snb-vba.eu/bestanden/__VBA_dictionaries_voorbeelden_en.xls
You will see that your interpretation doesn’t match what the example is all about.

Your quote:
‘If you feed an entire table row by row to the dictionary using the .item property, then the dictionary automatically uses the first item in that array as the key, and stores the entire line as the item.’

Can you show the code to do this ?
Since the ‘key’ is a required argument in the property item, I don’t see how it can be attributed automatically to a new item.
Jeff Weir says:

October 25, 2013 at 2:20 pm

You will see that your interpretation doesn’t match what the example is all about. Maybe I haven’t made my interpretation clear. Put the string “First Record for aa1” in Unique!B2 and “Last record for aa1” in Unique!B18 and run your code. You will see that out of all the records with the key aa1, the “Last Record” one gets returned. I.e. if there are duplicate records, the last one gets returned.

Can you show the code to do this ? No I can’t. Because it doesn’t. I misinterpreted something :-)

If you didn’t want the last matching item, then much simpler and faster to use this:

Sub M_remove_duplicate_records_2()
Dim rngData As Range
Dim rngDest As Range
Dim timetaken As Date


timetaken = Now()


Set rngData = Sheets("Unique").Cells(1).CurrentRegion
Set rngDest = Sheets("Unique").Cells(4).Resize(rngData.Rows.Count, 2)
rngDest.Value = rngData.Value


rngDest.RemoveDuplicates Columns:=1, Header:=xlNo
End Sub

Posting code? Use <pre> tags for VBA and <code> tags for inline.

Daily Dose of Excel

Haphazardly Posted Excel Information and Other Stuff

Dictionaries can be rude!

27 thoughts on “Dictionaries can be rude!”

Leave a Reply

Column Content	Dictionary	Collection	Sort	Unique Values
1 to 1048575	0:00:25	0:00:13	A-Z	1
1 to 1048575	0:00:40	0:00:14	Z-A	1
1 to 1048575	0:00:00	0:00:11	Random	1
Randbetween(1,1000)	0:00:02	0:00:03	Random	1000
Randbetween(1,100000)	0:00:04	0:00:06	Random	99997
Randbetween(1,1000000)	0:00:25	0:00:09	Random	649652