Find Matching Data in Array Speed Test

JP has a good post about finding exact matches in arrays. I use a similar method. I Join the array with delimiters around all the values, then use Instr to see if it’s there. Here’s my code:

Function IsInArrayDK(vArr As Variant, sValueToCheck As String, _
Optional bMatch As Boolean = True) As Boolean

Dim bReturn As Boolean
Dim sWordList As String

Const sDELIM As String = "|"

'See if it's a match even if only a substring
bReturn = UBound(Filter(vArr, sValueToCheck)) > -1

'If a match and need exact
'If exact match not needed, the line above provides the return value
If bReturn And bMatch Then
'put pipes around all the values
sWordList = sDELIM & Join(vArr, sDELIM) & sDELIM
'See if the values with pipes is there
bReturn = InStr(1, sWordList, sDELIM & sValueToCheck & sDELIM) > 0
End If

IsInArrayDK = bReturn

End Function

To test, I filled an array with 100,000 random strings, picked one of the strings to find, then timed JP’s funciton, my function, and the non-optimized method. The non-optimized method simply loops through the array and checks for values.

Function IsInArrayLoop(vArr As Variant, sValueToCheck As String, _
Optional bMatch As Boolean = True) As Boolean

Dim bReturn As Boolean
Dim i As Long

For i = LBound(vArr) To UBound(vArr)
If bMatch Then
If vArr(i) = sValueToCheck Then
bReturn = True
Exit For
End If
Else
If InStr(1, vArr(i), sValueToCheck) > 0 Then
bReturn = True
Exit For
End If
End If
Next i

IsInArrayLoop = bReturn

End Function

The code to fill the array converts Rnd to a string and puts it in the array. Then I pick one of the values (first, middle, and last) as the value I want to check.

Sub FillArray(ByRef vArr As Variant, ByVal lPlace As Long, ByRef sValue As String)

Dim i As Long

For i = 1 To 100000
vArr(i) = CStr(Int(Rnd * 10000000))
If i = lPlace Then
sValue = vArr(i)
End If
Next i

End Sub

I used the same API timer that JP uses when he does speed tests.

Public Declare Function timeGetTime Lib "winmm.dll" () As Long

And finally, the sub to test loops through the early, middle, and late values-to-check and times them.

Sub TestArray()

Dim aNames(1 To 100000) As Variant
Dim i As Long
Dim bResult As Boolean
Dim lStart As Long, lEnd As Long
Dim sValueToCheck As String
Dim aPlace(1 To 3, 1 To 2) As Variant
Dim sTable As String, sRow As String

'name the tests and determine where the value to check is in the array
aPlace(1, 1) = "Value Early:": aPlace(1, 2) = 1
aPlace(2, 1) = "Value Middle:": aPlace(2, 2) = 50000
aPlace(3, 1) = "Value Late:": aPlace(3, 2) = 99999

'The results go in an html table
sRow = Tag(Tag("Milliseconds", "td") & Tag("JP", "td") & Tag("DK", "td") & Tag("Loop", "td"), "tr") & vbNewLine
sTable = sRow

For i = 1 To 3
sRow = Tag(aPlace(i, 1), "td")
FillArray aNames, aPlace(i, 2), sValueToCheck

lStart = timeGetTime
bResult = IsInArrayJP(aNames, sValueToCheck, True)
lEnd = timeGetTime
sRow = sRow & Tag(lEnd - lStart, "td")

lStart = timeGetTime
bResult = IsInArrayDK(aNames, sValueToCheck, True)
lEnd = timeGetTime
sRow = sRow & Tag(lEnd - lStart, "td")

lStart = timeGetTime
bResult = IsInArrayLoop(aNames, sValueToCheck, True)
lEnd = timeGetTime
sRow = sRow & Tag(lEnd - lStart, "td")

sTable = sTable & Tag(sRow, "tr") & vbNewLine
Next i

Debug.Print Tag(sTable, "table", , True)

End Sub

The results:

Milliseconds JP DK Loop
Value Early: 53 53 0
Value Middle: 48 53 11
Value Late: 49 54 22

 
JP’s and mine are a wash and the loop is fastest. I guess I should just use that.

36 thoughts on “Find Matching Data in Array Speed Test

  1. A cursory look at your results seems to indicate that the average (middle) seek time for the Loop method is about 20% of the others, and is a linear progression. Since the others are static seek times (early, middle, late are the same), increasing the records from 100,000 to about 500,000 should be (about) where the times equalize. I would expect that the other two methods would begin to outperform the loop significantly when over a million records are searched.

  2. Dick – how would you handle checking say an array of 1,000 items off against an array of 100,000 items?

  3. @Jeff: might be fast to sort the array and use binary search – there must be a breakeven point where the additional overhead of doing the sort beats doing many linear searches, but I don’t know where the breakeven point is.

  4. Dick – I think you missed a trick with your code because the ‘Filter’ command returns an array. Therefore if you use this instead of the ‘vArr’ variant in the ‘Join’ command, you only use this smaller subset of values to check against. It’s still slower than the simple loop though. This is an example below adapted from yours (hope I got it right).

    Function IsInArrayJM(vArr As Variant, sValueToCheck As String, _
    Optional bMatch As Boolean = True) As Boolean

    Dim bFiltArr As Variant
    Dim bReturn As Boolean
    Dim wordList As String
    Dim i As Long

    ‘See if it’s a match even if only a substring
    bFiltArr = Filter(vArr, sValueToCheck, True)
    bReturn = Not (IsEmpty(bFiltArr))

    ‘If a match and need exact
    ‘If exact match not needed, the line above provides the return value
    If bReturn And bMatch Then
    wordList = Join(bFiltArr, “,”)
    bReturn= InStr(1, wordList, “,” & sValueToCheck & “,”) > 0
    End If

    IsInArrayJM = bReturn

    End Function

    I don’t think there is any code that would be faster than the simple loop because all the loop does is check against the 100,000 numbers. The DK and JP versions both do something that requires looping through the 100,000 records first of all (either to ‘Join’ or ‘Filter’ and then looping through them again (or a subset) to identify a match.

  5. Hi Charles. So dump it into a worksheet range, sort it, then either pull it back to VBA or do the search in the worksheet, I take it? Obviously we’d need to transfer the array to excel in blocks of 65,000 so we don’t hit the VBA/Worksheet transfer bug, and also possibly old Excel’s row limit if applicable.

    I’m thinking that it could well be faster to:
    1. dump master list and lookup list to the worksheet
    2. remove duplicates from the master list
    3. remove duplicates from the lookup list
    4. turn the lookup list into an excel table
    5. apply say red fill via conditional formatting to both lists, looking for non-duplicates
    6. filter the lookup table on “Is Not Red” if we want to know which items are NOT in the master array.

  6. Charles – There’s two small problems with a binary search.

    First, to do a binary search you would need to sort the 100,000 numbers into numerical order which would take a lot longer than to just check 100,000 unsorted records. If the 100,000 numbers did not change, then subsequent exact match searches would be a lot quicker (you would only need to check a maximum of 18 records (I think) before finding an exact match rather than up to 100,000). However if any of the 100,000 numbers changed then you would have to re-sort the list to do the search, which would take longer than simply checking all 100,000 records.

    Second, a binary search would only be quicker for an exact match and not a partial match (you’d still have to check every number to find a partial match)

  7. To compare lists, for more on linear and binary searches, and using MATCH to simulate a linear exact search with a binary approximate search, see
    Using the MATCH Function for a Linear Search and a Binary Search
    http://www.dailydoseofexcel.com/archives/2010/03/10/using-the-match-function-for-a-linear-search-and-a-binary-search/
    and the corresponding
    Using the MATCH Function for a Linear Search and a Binary Search
    http://www.tushar-mehta.com/publish_train/xl_vba_cases/match-exact-vs-binary.shtml

  8. @John,

    Of course sorting 100,000 numbers takes longer than a single sequential scan of 100,000 numbers. But doing a sequential scan of 100,000 numbers 100,000 times is slower than sorting the 100,000 numbers and doing 100,000 binary searches.
    So the question is: where is the breakeven point?

    I did not see the OP asking for a partial match.
    If you mean that binary search always finds an approximate match (which is the way that MSoft have implemented their Match and Lookup alogorithms) this is not inherent in the algorithms and is easily bypassed even with the MSoft algorithms without resorting to linear search.

  9. Charles,

    You are correct the op did not ask for a partial match but the coding included partial matching so I took it that was needed in the matching solution. If no partial match is needed then a binary match would be fine although a scripting.dictionary solution would be easier to implement (and potentially almost as fast if not faster in some cases).

    As for the ‘breakeven point’, that’s a good question as it took me on a little tour of the internet. If we made some really loose (and inaccurate) assumptions we could come up with a number. So given that a sequential search will take (n+1)*(n/2) searches to match every item in the list, that a binary search will take on average nLog2(n) searches and that sorting a list will take on average nLog(n) swaps (thanks Wikipedia) we can then see when the crossover point is. If we also assume that the binary search code is more complex than the simple search by a factor of 5 and that the sort code is more complex by a factor of 10 (yes I made those up) then I get a breakeven point just after 100 records. Some internet pages suggest the number can be a lot lower depending on how the binary search code is written and gets compiled. It probably would have been quicker to write the code to test it than look it all up, but not as interesting.

  10. I must be being extra stupid today, I’ve read this post a few times now, its of some practical interested to me right now… so..

    I can see why 100, 1000 long array might get looped, but at 100,000 would you not load to a collection or dictionary?

    I have a ongoing project, which loads 1.3 million lines of data (postal codes) to a collection and might make typically between 10 to 10,000 “searches”, each search is in milliseconds…. still 10,0000 takes a while, but you know what can you do!

    I guess I don’t understand why you would do this…? what am I missing?

  11. John –

    “Second, a binary search would only be quicker for an exact match and not a partial match (you’d still have to check every number to find a partial match)”

    That’s not *quite* true. I would consider sorting a static data set and using a binary search to identify a starting-point for my sequential search if most (or all) of my search strings had the first wildcard after the second character.

    However, there are alternative approaches which use binary searches to give a much smaller subset for second sequential search. These approaches run against the unstated assumption that that ‘sorting’ has to mean a straight alphabetical ranking, AAAB>AAAA and so on: there are other ways of creating a lookup structure that a binary search can navigate.

    Here’s two examples:

    What if the sort algorithm is an ‘Edit distance’ algorithm, generating a ranked list of ‘closest match to “MMMM” numbers? That will give a small subset of the data having closest-match scores similar to that of the search term, and one binary search will get the upper boundary of the subset, ready us to run a (hopefully) short sequential traverse downwards until we satisfy the partial-match criteria.

    If you’re lucky, or have the right algorithm, the filtered set will be a single row (or will match at the frst row) even in a medium-sized table of several thousand rows.

    Alternatively, what if the indexing algorith generates a number of BTREE structures based on trigrams (three-letter groups) in the unsorted data, and any subsequent ‘Seek’ ignores trigrams in the search term with wildcards and extracts a subset of the data where the search string has matching trigrams?

    Again, this is a binary-search of the known parts of the search string, followed by a short traverse of the filtered subset.

    There’s an overhead with these strategies, so we’ll very rarely use them *directly* in Excel – and never on dynamic data. But you’ll definitely be using them if you use ‘seek’ methods on a recordset.

    …Which is, of course, the other point: you don’t roll-your-own binary search or partial-match algorithm in VBA when there are off-the-shelf data structures with these features built-in, other than afor curiosity or as a teaching exercise.

  12. Dick –

    Your string search may be better than you realise for short data sets, and worse for long ones: did you extend your test to small, medium and large data sets?

    The ‘Join’ method does a *very* fast piece of pointer arithmetic to concatenate the string, and this is far, far faster than a VBA loop through the array. The subsequent ‘string’ search is, of course, a traverse through an array of chars; but this will be significantly (but not orders-of-magnitude) faster than our own loop through a VBA array – especially if the VBA loop is allocating and deallocating strings.

    However, I would expect the string-concatenation method to slow down exponentially when the length exceeds a sixteen million characters; and, more importantly for everyday use, to slow down exponentially in a data set of any size when the search term length exceeds 32 characters.

    The VBA array traverse will, at least, be linear in its expected running time versus data set size and length-of-search-term.

  13. @John,
    I coded up an array VBA UDF that did a sort and binary search or a linear search. If the LookFor list is equal in size to the LookinList then the breakeven point is about 100 (which I think is what you predicted!!), but interestingly linear search never really wins significantly even on very small lists.

    If the LookFor list is much smaller than than the LookinList then the breakeven point where Sort+Binary search is better is when the LookFor list is greater than 4% of the LookIn list plus 25.

    Of course this is all dependent on the efficiency of the implementation …

    @Ross, I am not sure how efficient Collection/Dictionary is for bulk loading of data: if they re-index every time an item gets added it could be slower to create the dictionary/collection than both linear search or sort+binary search.

  14. @Charles Williams

    I am not sure how efficient Collection/Dictionary is for bulk loading of data: if they re-index every time an item gets added it could be slower to create the dictionary/collection than both linear search or sort+binary search

    If you can design a reliable performance test, the results would be useful to all of us. It would improve everyone’s programming if our choices were informed by a set of of ‘rule-of-thumb’ benchmarks for using a collection, or an array, or a string-searched Join. Or, indeed, sending the whole problem to a database engine.

  15. @Charles @Ross –

    Looks like you’re correct about the dictionary: it does seem to re-index every time an item gets added.

    Full test details, with results below: I’m not sure how to interpret these results in terms of comparing linear and binary searching, and I would welcome your comments.

    Some general comments on the performance of the Scripting.Dictionary object in a VBA loop:

    I can demonstrate that the loop itself (including a simple string concatenation) is linear in its runtime vs loop count, and you can run a million iterations in about 0.75 secs.

    I am shocked, *shocked*, that each trip ’round the barebones loop with a string concatenation takes 2,000 clock cycles. I find it difficult to believe that compiled VBA is *that* inefficient.

    Adding items to a dictionary is nonlinear: runtime increases by a factor of four or five, every time you double the number of items.

    So yes, it’s probably rebuilding the index every time you add an item.

    Checking an item exists adds about 20-25% to the time required to add an item; the .Exists() check is subject to the same exponential performance drop-off with increasing member count.

    I haven’t tested it, but I feel safe in assuming that lookups from a dictionary will perform (on average) twice or four times as fast as an Exists() check; and they, too will suffer the same exponential performance drop-off with increasing member count.

    It all starts going pear-shaped after 1.5 Million items and you really shouldn’t be using Excel and VBA for this sort of data volume.

    In practical terms…

    Common programming tasks that run ’round a loop to add a thousand items to a dictionary are too fast for me to measure – less than a millisecond – and looping ten thousand times takes about 30 milliseconds.

    You *can* go ’round the innermost loop of a nested loop routine a million times, often without realising that you’ve created such a critical performance bottleneck – so you absolutely must flush and restart any incremental process using dictionaries within the ‘inside’ loop if it’s going over 2-5 thousand items.

    …And, at realistic data volumes for spreadsheet work (50,000 rows) it all happens in less than a quarter of a second.

    This is the read-write time to a cell and ‘instantaneous’ to a user, so I can’t really complain about the performance of the scripting dictionary in this context.

    Testing code:

    This loop was run for a loop count of 1,000 (as shown below) then 10,000, 50,000, 100k, 250k, 0.5 Million, 1M, 1.5M and 2 Million.

    Test Results:

    0 concatenate key x 1k
    0.015625 concatenate key x 10k
    0.03125 concatenate key x 50k
    0.078125 concatenate key x 100k
    0.1875 concatenate key x 250k
    0.375 concatenate key x 500k
    0.765625 concatenate key x 1m
    1.140625 concatenate key x 1.5m
    1.515625 concatenate key x 2m

    0 add key to dictionary 1k
    0.015625 add key to dictionary 10k
    0.15625 add key to dictionary 50k
    0.5625 add key to dictionary 100k
    3.953125 add key to dictionary 250k
    18.5625 add key to dictionary 500k
    82.234375 add key to dictionary 1m
    192.00390625 add key to dictionary 1.5m
    348.0703125 add key to dictionary 2m

    0 add with check .Exists() 1k
    0.03125 add with check .Exists() 10k
    0.203125 add with check .Exists() 50k
    0.71875 add with check .Exists() 100k
    5.078125 add with check .Exists() 250k
    23.61328125 add with check .Exists() 500k
    109.984375 add with check .Exists() 1m
    266.67578125 add with check .Exists() 1.5m
    601.18922475 add with check .Exists() 2m

  16. @Nigel,
    I just posted ( http://fastexcel.wordpress.com/2012/07/10/comparing-two-lists-vba-udf-shootout-between-linear-search-binary-search-collection-and-dictionary/ )the code for a VBA array UDF that uses the various methods, together with some timings. Looks to me like Dictionary is the winner up to about 500K items, after which Collection starts winning. If I was using a more efficient Sort routine then I think Sort + Binary Search would eventually win for large volumes.

  17. Testing the posted code, I get similar results for DK and Loop but I find JP’s method twice as fast (Excel 2010 64-bit Core i3 Laptop 2.13GHz, 2 Core(s)). Application.Match also gave similar results for up to 65536 elements using:

  18. @Charles @Nigel – and everyone else!

    I’ll have to re read all of this over the weekend, my head is looping at a million cycles per second!!
    I think it will be worth investing the mental time though as it’s a good performance pay back if you pick the right one….

    Although in fairness, my current code works “fast than man” on an I5, so maybe it the use case that needs to be defined as well – or at least to included in the method you select justification… if you see what I mean…

    thanks for taking the time to test this stuff guys.

    your in loops, your in loops, yours in loops….

    Ross

  19. Not clear Excel is an ideal tool for this. Here’s my results using R for a similar comparison.

    > doit <- function() {
    + x <- sort(trunc(2^12 * runif(2^20)))
    + y <- trunc(2^12 * runif(2^20))
    + match(y,x)
    + }
    > system.time(doit())
    user system elapsed
    0.69 0.03 0.72
    >

    IOW, compares one 1,048,576 value array to another in 0.72 seconds total run time.

  20. InterestinG: I have never played with R …
    So how does R perform with the equivalent string arrays?
    (not to mention reading 1 million strings from Excel, doing the comparison and then writing 1 million string results back to Excel: is there something called RExcel that links R and Excel?)

  21. Really interesting post, there is also another method that hasn’t been mentioned. Recently I’ve been playing with the System.Collection classes from .Net that are usable from Excel – in particular the ArrayList – http://msdn.microsoft.com/en-us/library/system.collections.arraylist.aspx.

    For large datasets it is much faster than dictionaries to fill and comes complete with a sort, binary search and standard search functions. The sort is relatively slow, but the binary searches are fast once sorted.

  22. The results of this function were half the time IsInArray_JP of IsInArray_DK produced.

  23. I left a wrong email address.
    So I corrected it.

    The results of this function were half the time IsInArray_JP of IsInArray_DK produced.

    Function IsInArray_snb(sn As Variant, c01 As String, Optional c02 As Boolean = True) As Boolean
    IsInArray_snb = InStr(“_” & Join(Filter(sn, c01), “_”) & “_”, “_” & c01 & “_”)
    End Function,

  24. I tested another method, using an ActiveX control.
    Instead of searching in an array, you can look into the array that has heen assigned to a combobox or listbox.

    So in Dick’s macro ‘testarray’, after filling the array aNames() I assigned that array to
    Sheet1.Combobox1.List
    or
    Sheet1.Listbox1.List

    Now the IsInArrray function can have the following structure

    Function IsInArray_snb(c01 As String) As Boolean
    With Sheet1.ComboBox1
    .Value = c01
    IsInArray_snb = .ListIndex > -1
    End With
    End Function

    I'm not very disappointed by it's speed.

  25. @snb

    Clever, but I prefer StrComp to get an exact match. Your function will return True for a substring match where one of the array elements contains an underscore. ex:

    [cc lang=’vb’]IsInArray_snb(Array(“ID_12345”, “ID_23432”, “orange”), “ID”)

  26. Just a guess, but this is probably not as fast as snb’s one-liner, but I thought I would put it out there in case someone wanted to test it…

    Function IsInArray_Rick(vArr As Variant, S As String) As Boolean
    IsInArray_Rick = UBound(Split(Chr$(1) & Join(vArr, Chr$(1)) & Chr$(1), Chr$(1) & S & Chr$(1), 2))
    End Function

  27. @snb,

    Thanks for running the test! I figured it would be slower than yours, but I thought it might have been a little closer than that though. Just out of curiosity, did you try running your code without using the Filter function? InStr is fast, so I wonder if Join’ing the entire array rather than Filter’ing the array and Join’ing what results might be faster.

    Function IsInArray_snb(sn As Variant, c01 As String) As Boolean
    IsInArray_snb = InStr(“_” & Join(sn, “_”) & “_”, “_” & c01 & “_”)
    End Function

    Note: I left out the Optional c02 argument since your original code, as well as this one, never make use of it.

Leave a Reply

Your email address will not be published. Required fields are marked *