Fuzzy Text Match

I need some VBA that does a fuzzy match of text. I have a list of fax numbers that can be appended by various people in my office. The list is getting huge and I just know there are duplicates in there. It’s easy enough to sort on the fax number and see if there are duplicates, but that doesn’t get them all.

I want to go through the other two columns, Contact Name and Company, and compare them. I want to identify where someone typed in ABC, Inc. and someone else typed in just ABC.

I think it will have to be some sort of scoring system, where the characters are compared and a score is assigned. Scores above a certain number will be considered duplicates.

If you’ve heard of this already being done or you are inclined to tackle this yourself, post a comment. Thanks.

Posted in Uncategorized

17 thoughts on “Fuzzy Text Match

  1. Dick,

    I created this function to do something similar to what you’ve described in the first paragraph – identify duplicates (exact duplicates). It won’t work for your fuzzy text match, but then again, the numbers should be identical, or else they’re probably not duplicates.

    I drop this into a blank column next to the list I’m examining, and sometimes use conditional formatting to identify the rows, or I use this formula to get a count: =if(duplicates(x,y),1,””). Then I use a sum() function to get a quick count of how many there are, and look for the 1s. Don’t know if this is helpful, but I hope so — you’ve certainly been helpful enough to me in the past!

    Function duplicates(look_val, look_range As Range) As Boolean
    Dim myRange As Range, cell As Range
    Dim rangeRows As Integer, i As Integer
    Dim firstAddress, secondAddress

    Set myRange = look_range
    rangeRows = myRange.Rows.Count
    duplicates = False

    For Each cell In myRange
    If cell.Value = look_val Then
    firstAddress = cell.Address
    Exit For
    End If
    Next
    ‘ Returns false if the value wasn’t found in the range at all
    If firstAddress = Empty Then Exit Function

    For Each cell In myRange
    If cell.Value = look_val And cell.Address <> firstAddress Then
    duplicates = True
    Exit Function
    End If
    Next

    End Function

  2. JWalk: That’s interesting. I didn’t quite work for me, though. Debbie Korytowski and Davis Erection Company have the same Soundex code. I got 1,947 matches out of 1,500 contacts. I guess the longer the word, the worse it does.

  3. kevboy: Two points. With my users, you can’t even assume that different numbers aren’t duplicates. They would rather enter everything again in a new record then fix their mistakes. Secondly, believe it or not, I was just scanning the list (with my eyes) to find dupes in the fax number. Now why didn’t I write a macro to do it?

  4. I would suggest this: note not exact formulas..just talking thru with formula

    1. join the desired columns–in a new column–copy formula down..

    =a1&b1&c1

    2. sort by your new column

    3. create new column #2 and use the if function: =IF(new column cell #1 “D1? = new column cell #2 “D2?,”duplicate”,” “)

    4. copy paste value the new column #2 (where the duplicates would show up)

    5. sort by duplicate column…delete duplicates..or just copy and paste anything that is not a duplicate into a new worksheet..that way you would have your original list as a backup.

  5. Hi

    This is primitive stuff, far from what you had in mind, but I use a thing like this for a second check when a new person is registered into one of my databases. It sees through case and a middle name or two, but won’t catch misspellings. (And it shouldn’t. Most norwegian names look like misspellings, so I’d end up with a single record :-)

    Function IsThere(ByVal S1 As String, ByVal S2 As String) As Boolean
    S1 = “*” & UCase$(S1) & “*”
    S2 = “*” & UCase$(S2) & “*”
    S1 = Replace$(S1, “.”, “*”)
    S2 = Replace$(S2, “.”, “*”)
    S1 = Replace$(S1, ” “, “*”)
    S2 = Replace$(S2, ” “, “*”)
    IsThere = S1 Like S2
    If IsThere = False Then IsThere = S2 Like S1
    End Function

    Sub test()
    MsgBox IsThere(“ABC”, “ABC inc.”)
    MsgBox IsThere(“ABC inc”, “abc”)
    MsgBox IsThere(“Dick Kusleika”, “D Kusleika”)
    MsgBox IsThere(“Davis Erection Company”, “D Erect. Comp.”)
    End Sub

  6. I hit this page when searching for other fuzzy match implementations for Excel. I have recently developed a fuzzy search capability for Excel. One enters a search string and then a list of ranked best matches is brought up, with match percentages. It works quite well but at this point I’m not willing to share as I’m not sure how much this is sought for. Several people have suggested I could make money with this. Email me for more info.

    Klaas Bil (Netherlands)

  7. I could make good use of a text matching facility in my daily work, but Dick’s problem of finding duplicates in a list of fax numbers is challenging enough.

    Ideally, Excel should have a function that calculates the correlation between two strings. CORREL and PEARSON do exactly that for two equally-sized ranges of numbers, but my new function would have to have the ability to evaluate strings of different lengths.

    Unless you’re a genius, this is a complex problem to solve in one go. How could we break up this this problem into parts? I think this one requires basically three steps:

    1) “cleaning” the data
    2) defining the algorithm that will decide whether any string matches another one
    3) finding a setup in which to carry out the actual matching between the given strings

    More in detail:

    1) “Garbage” in a telephone number string is in fact every non-numeric character. People use all kinds of characters to separate the number components, so if we could extract these from our strings, we would improve comparability.

    2) After the cleaning proces we would actually end up with numeric strings: numbers of varying length, often depending on whether a prefix is included or not. Phone numbers usually don’t have suffixes, so probably the best way to search for duplicates is by starting from the back.

    3) Maybe the most difficult exercise is to apply the matching procedure itself. Since every string has to be compared to every other string, a lot of comparisons need to be made: if n is the number of strings, this would be n times n(n-1): any number with every other number in the list, n times, i.e. n²(n-1).

    Ideas for a solution?

  8. Frank, as I am a genius, and I like unanswered questions (this one is small fry, but still), I put a few phone numbers on a worksheet, like this:

    +1 275 561/32/58
    (0275) 555.32.58
    001-275 555-32-58
    275/555 3258
    555 32 58

    As you can see they are all really the same number, and none of them is longer than 17 characters (didn’t you notice that? – all right, you’re a genius or you’re not).

    The first thing you have to remember when you work with Excel is: it’s only a tool, so it’s not going to be perfect.

    At this stage I have to tell you I will not even bother trying to solve this with VBA – that would be way too elaborate and as you rightly say, complex.

    The next step is to parse these numbers into one-character components.

    Although worksheets are Excel’s greatest programming environment (don’t worry, Dick knows I can explain this), text formulas have a severe limitation: they consider text strings first and foremost as values, rather than a set of characters. To look at text cells with array- (not X-ray) glasses, you need elaborate formulas – so we keep it simple and for each telephone number we allow an extra 17 cells to contain every single one of its characters.

    Talking of limitations in Excel, do you know why CONCATENATE is completely redundant? Because Excel doesn’t allow =CONCATENATE(A1:F1), which is really a shame. Can anyone think of a reason why one would not use the ampersand in concatenating two strings? As I said, keep it simple.

    Then we take out all the numeric characters using ISNUMBER (again, the simplest is to allow another 17 extra cells – there are millions left!), and we re-assemble the phone numbers (using the ampersand), but in reverse order, without their non-numeric characters:

    85231655721
    85235555720
    8523555572100
    8523555572
    8523555

    The last step is the one where you don’t need to be a genius, so I’m leaving all the “fuzzyness” of the text matching to you.

  9. Dick et Al:

    Anyone in need of a fuzzy match VBA/add-in solution likely is already aware of the commercial Fuzzy Finder add-in; if you’d rather a free solution, there’s a few options over on Mr.Excel’s old site under http://www.mrexcel.com/pc07.shtml. Cheers!

    @ Herman: It appears as though you were more interested in screaming “I’m a genius everyone!” from the rooftops then you were in providing an actual solution. It’s an intriguing insight to note that you “like unanswered questions” (which isn’t actually even accurate, you were trying to state that you like the challenge/entertainment of them), yet didn’t provide a single substantial formula. Any idiot could have trolled any one of the many Excel boards and posted that rubbish you prattled on with; a true fan of connundrums would have spoken concisely and technically without treating their response as a soapbox for empty-handed boasts.

  10. Shall I tell you what today’s best text matching system is? Everybody uses it and yet nobody knows how it works? It’s the way Google matches our strings with relevant web pages.
    If only Microsoft had a spark of this G-power, help facilities would become twice as user friendly, and Excel would probably have a few stunning search-and-match functions.

    What is Google’s secret?

    JMW: As Donny Miller said, “You have to fool yourself before you can fool anybody else”. One can do serious things only on condition one doesn’t take them too seriously – that’s why people keep coming to this blog.

  11. I’ve built an advanced fuzzy deduping process for a database of customers using SQL Server Integration Services (SSIS). If you must do this in excel, you should be able to use VBA to call the service. Problem with this is that you will need the enterprise edition of SQL 2005. Anyway, the solution was a huge success for me that was very easy to implement. The fuzzy lookups use a token based system for scoring similarity between values. Hope this helps!

Leave a Reply

Your email address will not be published. Required fields are marked *