Calculating the Number of Unique Items in a Delimited String

This is a guest post by David Hager. You can learn more about David on LinkedInd

There has recently been interest in handling row data that contain delimited strings. An elegant solution to split a delimited row into multiple rows using DAX queries can be found here.

http://www.sqljason.com/2013/06/split-delimited-row-into-multiple-rows.html

Another question was asked on LinkedIn about how to count the number unique items of a delimited string. I could not figure out how to do this in PowerPivot, and I am not sure it can be done with a single formula. However, I undertook the challenge to do this in Excel and I came up with a solution, albeit a rather lengthy one.

So, if the string a,c,d,a,b,c,d,e,f,g,h,I,j is in cell A1, the following formula will return the value 10, which is the number of unique items in the delimited string.

=SUM(N(MATCH(TRIM(MID(SUBSTITUTE(A1,",",REPT(" ",999)) ,ROW(INDIRECT("1:"&LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1))*999-998,999)),TRIM(MID(SUBSTITUTE(A1,",",REPT(" ",999)) ,ROW(INDIRECT("1:"&LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1))*999-998,999)),0)=ROW(INDIRECT("1:"&LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1))))

In order to explain how this formula works, a simplified version is shown below:

=SUM(N(MATCH(DelimitedStringArray,DelimitedStringArray,0)=RowArray))

The MATCH function returns an array of n elements. Each value in the array is the MATCH function result for the nth element. For the example string, the array will look like this:

{1;2;3;1;5;2;3;8;9;10;11;12;13}

If every element in the delimited string is unique, this array would be filled by consecutive numbers from 1 to 13. It can be easily seen which elements do not fit that pattern. In order to calculate this, the array from the MATCH function must be set equal to the unique array, or:

{1;2;3;1;5;2;3;8;9;10;11;12;13} = {1;2;3;4;5;6;7;8;9;10;11;12;13}

which affords an array of Boolean values:

{TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;FALSE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE}

Use of the N function converts this array into ones and zeros, and the SUM function returns the desired result.
Well, that was the easy part. :)

The hard part of this formula is to convert a delimited string into an array of each element in the delimited string. This was accomplished by using the following formula:

=TRIM(MID(SUBSTITUTE(A1,",",REPT(" ",999)),ROW(INDIRECT("1:"&LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1))*999-998,999))

To give credit where credit is due, the core of this formula was created by Rick Rothstein, see:

The truly amazing function of this formula is that it converts a delimited string into an array! I’m not going to go into an explanation here of how this formula works, since it is explained at the provided link. The original formula

=TRIM(MID(SUBSTITUTE(A1,",",REPT(" ",999)),n*999-998,999))

was designed to return the nth element from a delimited string, but in this case all of the elements are returned in an array by replacing n with ROW(INDIRECT("1:"&LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1)), which in this example returns {1;2;3;4;5;6;7;8;9;10;11;12;13}. This is also the array used for RowArray in the simplified example.

So, this formula works great in Excel, but how could it be used in PowerPivot? For those using Excel 2013, and assuming that your column of delimited strings resides within a table in your DataModel, you can use a DAX query to bring the column into Excel, add the formula demonstrated here to the Excel table, and link the table back into the DataModel. A comprehensive example of this can be found at:

http://www.sqlbi.com/articles/linkback-tables-in-powerpivot-for-excel-2013/

So, for situations where solutions may not be possible in PowerPivot (or just incredibly complex), don’t forget about the power of Excel formulas.

BTW, an offshoot of the creation of this formula is another array formula that sums (or averages, whatever) the values in a delimited string:

=SUM(N(VALUE(TRIM(MID(SUBSTITUTE(A1,",",REPT(" ",999)) ,ROW(INDIRECT("1:"&LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1))*999-998,999)))))

A solution to this problem has been pursued in the Excel community for many years, so I am happy to present this formula here.

45 thoughts on “Calculating the Number of Unique Items in a Delimited String

  1. Very slick formulas. However, if you build this out as a UDF using VBA, it’s much simpler.
    Just call Chip Pearson’s Quicksort routine, compare the buckets and ignore the dupes via iterator, done.
    For the Sum or Average, just do a Split, and call the Worksheet functions SUM or AVERAGE.

    I did something very similar recently to create a RankUnique function since Excel’s Rank produces duplicates.

  2. Another way is to split the string into a range of cells, then apply a UDF from Erlandsen Consulting which treats the range as a collection, viz, “UniqueItem()”

  3. M Simms, that’s the same story I heard back in 1994 when I first started creating complex Excel formulas.
    I went on to make extremely complex functions with the xlm language and later with VBA, but I continued to
    appreciate the challenge of solving a problem with a native Excel formula.

  4. It would be interesting to see which approach performs better.
    MSFT has done a lot to optimize formula calculations in the past 2 releases.

  5. @M Simms,

    Here is another method that seems to work which does not involve sorting…

    Function UniqueCount(ByVal S As String) As Long
    Dim X As Long, Parts() As String
    S = “@” & Replace(S, “,”, “@,@”) & “@”
    Parts = Split(S, “,”)
    For X = 0 To UBound(Parts)
    If UBound(Split(S, Parts(X))) > 0 Then
    UniqueCount = UniqueCount + 1
    S = Replace(S, Parts(X), “,”)
    End If
      Next
    End Function

  6. Just repeating my code, but trying something to see if I can format it better (might work or it might not work)…

    Function UniqueCount(ByVal S As String) As Long
      Dim X As Long, Parts() As String
      S = “@” & Replace(S, “,”, “@,@”) & “@”
      Parts = Split(S, “,”)
      For X = 0 To UBound(Parts)
        If UBound(Split(S, Parts(X))) > 0 Then
          UniqueCount = UniqueCount + 1
          S = Replace(S, Parts(X), “,”)
        End If
    Next
    End Function

  7. Cool! Here’s another formula approach:

    =SUM(N(FREQUENCY(IFERROR(MATCH(MID(A1,FIND("|",SUBSTITUTE(","&A1,",","|",ROW(OFFSET(A1,,,LEN(A1))))),FIND("|",SUBSTITUTE(A1&"|",",","|",ROW(OFFSET(A1,,,LEN(A1)))))-FIND("|",SUBSTITUTE(","&A1,",","|",ROW(OFFSET(A1,,,LEN(A1)))))),MID(A1,FIND("|",SUBSTITUTE(","&A1,",","|",ROW(OFFSET(A1,,,LEN(A1))))),FIND("|",SUBSTITUTE(A1&"|",",","|",ROW(OFFSET(A1,,,LEN(A1)))))-FIND("|",SUBSTITUTE(","&A1,",","|",ROW(OFFSET(A1,,,LEN(A1)))))),0),""),ROW(OFFSET(A1,,,LEN(A1))))>0))

    Much longer than yours: 462 characters, as opposed to your 299.

  8. This is such a great challenge, that I’ve posted it in the ‘Formula Challenges’ section at Chandoo’s forum. Will be interesting to see what other approaches are tried.

    This particular challenge is posted at http://chandoo.org/forums/topic/formula-challenge-016-unique-items-in-a-delimited-string and there’s a whole bunch of other very tricky challenges listed under http://chandoo.org/forums/forum/excel-challenges if anyone is interested in testing their formula mettle.

  9. This array-entered formula is slightly shorter than yours (274 characters versus your 299) and appears to work correctly in Excel 2007 and above (the nesting levels are too large for earlier versions), which is a limitation of both your formula and Jeff’s as well…

    =SUM(1*ISERROR(-SEARCH(TRIM(MID(SUBSTITUTE(A1,”,”,REPT(” “,999)),ROW(INDIRECT(“1:”&LEN(A1)-LEN(SUBSTITUTE(A1,”,”,””))+1))*999-998,999)),TRIM(LEFT(SUBSTITUTE(TRIM(SUBSTITUTE(A1,” “,”@”)),”,”,REPT(” “,999)),(ROW(INDIRECT(“1:”&LEN(A1)-LEN(SUBSTITUTE(A1,”,”,””))+1))-1)*999)))))

  10. Follow up to my last message
    ==================================
    I note that all the quote marks in my formula will need to be replaced with proper quote marks if you copy the formula directly from my message and paste it (anywhere).

    Jeff… what code tags did you use to get that scrolling display field which retained the proper quote marks?

  11. ***BUG ALERT**

    Do not use my previously posted formula as it will improperly count item like “aaaa,a,aaaa,a”. The following modification to my formula seems to correct the problem, but it also balloons the character count for the formula to 329. One positive my formula has over David’s formula, though, is that it returns 0 if A1 is empty whereas David’s returns 1. Anyway, for what its worth, here is the formula…

    =SUM(1*ISERROR(-SEARCH(“,”&TRIM(MID(SUBSTITUTE(SUBSTITUTE(A1,” “,”|”),”,”,REPT(” “,999)),ROW(INDIRECT(“1:”&LEN(A1)-LEN(SUBSTITUTE(A1,”,”,””))+1))*999-998,999))&”,”,”,”&SUBSTITUTE(TRIM(LEFT(SUBSTITUTE(TRIM(SUBSTITUTE(A1,” “,”|”)),”,”,REPT(” “,999)),(ROW(INDIRECT(“1:”&LEN(A1)-LEN(SUBSTITUTE(A1,”,”,””))+1))-1)*999)),” “,”,”)&”,”)))

  12. Here’s my revised method for converting a delimited string into an array:

    =MID(A1,FIND("|",SUBSTITUTE(","&A1,",","|",ROW(OFFSET(A1,,,LEN(A1))))),MMULT(IFERROR(FIND("|",SUBSTITUTE(CHOOSE({1,2},","&A1,A1&","),",","|",ROW(OFFSET(A1,,,LEN(A1))))),0), {-1;1}))

    At 181 characters, it’s still longer than David’s 113 character method.

    This gets my formula down to 413 characters. Still much longer than David’s 299 one.

  13. Whoops, forgot to post my entire formula. And now there is a wall of my ugly mug pulling a funny face staring down at y’all.

    =SUM(IFERROR(N(MATCH(MID(A1,FIND("|",SUBSTITUTE(","&A1,",","|",ROW(OFFSET(A1,,,LEN(A1))))),MMULT(IFERROR(FIND("|",SUBSTITUTE(CHOOSE({1,2},","&A1,A1&","),",","|",ROW(OFFSET(A1,,,LEN(A1))))),), {-1;1})),MID(A1,FIND("|",SUBSTITUTE(","&A1,",","|",ROW(OFFSET(A1,,,LEN(A1))))),MMULT(IFERROR(FIND("|",SUBSTITUTE(CHOOSE({1,2},","&A1,A1&","),",","|",ROW(OFFSET(A1,,,LEN(A1))))),), {-1;1})),)=ROW(OFFSET(A1,,,LEN(A1)))),0))
  14. @Rick

    Function UniqueCount(ByVal S As String) As Long
    sn = Split(Replace("@" & S & "@", ",", "@,@"), ",")

    Do
    sn = Filter(sn, sn(0), False)
    UniqueCount = UniqueCount + 1
    Loop Until UBound(sn) = -1
    End Function

  15. I used to have a little blurb about how to post code, but it appears to be gone. I wonder what happened to that. Well, it looks like this

    <code>formula goes here</code>

    <code lang=”vb”>vba goes here</code>

    I can put <code lang=”vb” inline=”true”>code inline</code> too.

  16. In comparison to the complexity of the Excel formula I prefer VBA:
    Function F_unique_snb(c00)
    sn = Split(c00, ",")

    With CreateObject("scripting.dictionary")
    For j = 0 To UBound(sn)
    c01 = .Item(sn(j))
    Next
    F_unique_snb = .Count
    End With
    End Function

  17. 195 characters long!

    ......
    ......
    =SUM(N(MATCH(
    TRIM(MID(SUBSTITUTE(A1&REPT("",8^5),",",REPT(" ",999)),ROW($1:$999)*999-998,999)),
    TRIM(MID(SUBSTITUTE(A1&REPT("",8^5),",",REPT(" ",999)),ROW($1:$999)*999-998,999)),
    0)=ROW($1:$999)))-1
    ......
    ......
    

    Regards

  18. Indeed, you are avoiding any errors if you stick to work with keys only.
    It performs pretty well to generate a list of unique keys.

  19. Rick, your formula is pure genius.

    Unless I’m missing something, you can ditch SUBSTITUTE(A1,” “,”|”) for just A1, which will make your array-generation portion identical to Davids. And you can also ditch the minus sign in front of the SEARCH.

    Also, if this were a challenge to come up wit the shortest formula length, you could replace SEARCH with FIND and also replace this:
    ROW(INDIRECT(“1:”&LEN(A1)-LEN(SUBSTITUTE(A1,”,”,””))+1))
    …with this:
    ROW(OFFSET(A1,,,LEN(A1)-LEN(SUBSTITUTE(A1,”,”,””))+1))

    Not that those changes make the resulting formula any better than your existing masterpiece. Just shorter for the sake of it. Down to 283 characters in fact.

     =SUM(1*ISERROR(FIND(","&TRIM(MID(SUBSTITUTE(A1,",",REPT(" ",999)),ROW(OFFSET(A1,,,LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1))*999-998,999))&",",","&SUBSTITUTE(TRIM(LEFT(SUBSTITUTE(TRIM(A1),",",REPT(" ",999)),(ROW(OFFSET(A1,,,LEN(A1)-LEN(SUBSTITUTE(A1,",",""))+1))-1)*999))," ",",")&","))) 

    For readers following along at home, then given data in A1 like this:
    aa,c,d,aa,bbb,c,d,e,f,g,h,I,jjjjjjjj,aa

    …you first split the string into two arrays:
    Array 1:

    {",aa,";",c,";",d,";",aa,";",bbb,";",c,";",d,";",e,";",f,";",g,";",h,";",I,";",jjjjjjjj,";",aa,"}

    Array 2:

    {",,";",aa,";",aa,c,";",aa,c,d,";",aa,c,d,aa,";",aa,c,d,aa,bbb,";",aa,c,d,aa,bbb,c,";",aa,c,d,aa,bbb,c,d,";",aa,c,d,aa,bbb,c,d,e,";",aa,c,d,aa,bbb,c,d,e,f,";",aa,c,d,aa,bbb,c,d,e,f,g,";",aa,c,d,aa,bbb,c,d,e,f,g,h,";",aa,c,d,aa,bbb,c,d,e,f,g,h,I,";",aa,c,d,aa,bbb,c,d,e,f,g,h,I,jjjjjjjj,"}

    …where array2 is just a list of incrementally concatenated elements of the string with a placeholder at the front (which offsets arary2 by one position compared to array1).

    That offsetting of array2 this means that for any given element n in array1, array2(n) contains everything in the original string up to that point. Or put another way, array2(n) = CONCATENATE(“,,” , array1(1) , array 1(2) , … , array1(n -1) )

    So when we search for array1(n) within array2(n), then if there’s a match, the thing we are looking for has obviously already occurred earlier in the string.

    Pure genius.

    Here’s how that looks graphically (assuming WordPress doesn’t mangle things):
    Result array1 array2
    #VALUE! ,aa, ,,
    #VALUE! ,c, ,aa,
    #VALUE! ,d, ,aa,c,
    1 ,aa, ,aa,c,d,
    #VALUE! ,bbb, ,aa,c,d,aa,
    4 ,c, ,aa,c,d,aa,bbb,
    6 ,d, ,aa,c,d,aa,bbb,c,
    #VALUE! ,e, ,aa,c,d,aa,bbb,c,d,
    #VALUE! ,f, ,aa,c,d,aa,bbb,c,d,e,
    #VALUE! ,g, ,aa,c,d,aa,bbb,c,d,e,f,
    #VALUE! ,h, ,aa,c,d,aa,bbb,c,d,e,f,g,
    #VALUE! ,I, ,aa,c,d,aa,bbb,c,d,e,f,g,h,
    #VALUE! ,jjjjjjjj, ,aa,c,d,aa,bbb,c,d,e,f,g,h,I,
    1 ,aa, ,aa,c,d,aa,bbb,c,d,e,f,g,h,I,jjjjjjjj,

  20. Damn, I forgot the code tags on my 2nd array. Dick, would you kindly add code tags to Array2:

    {",,";",aa,";",aa,c,";",aa,c,d,";",aa,c,d,aa,";",aa,c,d,aa,bbb,";",aa,c,d,aa,bbb,c,";",aa,c,d,aa,bbb,c,d,";",aa,c,d,aa,bbb,c,d,e,";",aa,c,d,aa,bbb,c,d,e,f,";",aa,c,d,aa,bbb,c,d,e,f,g,";",aa,c,d,aa,bbb,c,d,e,f,g,h,";",aa,c,d,aa,bbb,c,d,e,f,g,h,I,";",aa,c,d,aa,bbb,c,d,e,f,g,h,I,jjjjjjjj,"}

    And also, is there a way to paste tablular data here, so that it shows in columns?

  21. I was about to suggest a udf using the scripting dictionary, but I see snb has beaten me to it, and probably a good bit shorter than mine would have been.

    Nice work.

  22. You can use html in the comments box, so table tags for tabular data. I recommend joinrange with the htmltable argument.

    Result array1 array2
    #VALUE! ,aa, ,,
    #VALUE! ,c, ,aa,
    #VALUE! ,d, ,aa,c,
    1 ,aa, ,aa,c,d,
    #VALUE! ,bbb, ,aa,c,d,aa,
    4 ,c, ,aa,c,d,aa,bbb,
    6 ,d, ,aa,c,d,aa,bbb,c,
    #VALUE! ,e, ,aa,c,d,aa,bbb,c,d,
    #VALUE! ,f, ,aa,c,d,aa,bbb,c,d,e,
    #VALUE! ,g, ,aa,c,d,aa,bbb,c,d,e,f,
    #VALUE! ,h, ,aa,c,d,aa,bbb,c,d,e,f,g,
    #VALUE! ,I, ,aa,c,d,aa,bbb,c,d,e,f,g,h,
    #VALUE! ,jjjjjjjj, ,aa,c,d,aa,bbb,c,d,e,f,g,h,I,
    1 ,aa, ,aa,c,d,aa,bbb,c,d,e,f,g,h,I,jjjjjjjj,
  23. You could reduce the code to:
    Function F_unique_snb(c00)
    With CreateObject("scripting.dictionary")
    For each it in Split(c00, ",")
    c01 = .Item(it)
    Next

    F_unique_snb = .Count
    End With
    End Function

  24. Nice formula! There is a slight limitation because the maximum string length is harcoded to 999, which restricts the number of delimited values to around 66 in my tests. To work around this I would suggest replacing 999 by LEN(A1) and 998 by (LEN(A1)-1). An alternative that also seems to work and is shorter (175 chars) – but probably no clearer :)

    =SUMPRODUCT(--(FIND(","&MID(A1&",",ROW(INDIRECT("1:"&LEN(A1))),MMULT(FIND(",",{"",","}&A1&",",ROW(INDIRECT("1:"&LEN(A1)))),{1;-1})+1),","&A1&",")=ROW(INDIRECT("1:"&LEN(A1)))))

    Agree with others that udfs are probably a better option here although maybe a bit slower than the suggested formulas for the sample string given in the post. In some basic optimisation tests, i found that the scripting dictionary udf could be made around twice as fast if references were included and would be quicker still using a collection in place of dictionary with “on error resume next” making it faster than the worksheet formulas as well as non-volatile. Would be nice to do a more detailed speed comparison given the time…

  25. Trying above formula again:

    =SUMPRODUCT(--(FIND(","&MID(A1&",",ROW(INDIRECT("1:"&LEN(A1))),MMULT(FIND(",",{"",","}&A1&",",ROW(INDIRECT("1:"&LEN(A1)))),{1;-1})+1),","&A1&",")=ROW(INDIRECT("1:"&LEN(A1)))))
  26. to digress somewhat more:

    to return the list of unique items (not only the number of unique items):

    Function F_uniquelist_snb(c00)
    With CreateObject("scripting.dictionary")
    For Each it In Split(c00, ",")
    c01 = .Item(it)
    Next

    F_uniquelist_snb = Join(.keys, ",")
    End With
    End Function

    To return a sorted list of unique items:

    Function F_uniquesortedlist_snb(c00)
    With CreateObject("System.Collections.arraylist")
    For Each it In Split(c00, ",")
    If Not .contains(it) Then .Add it
    Next
    .Sort

    F_uniquesortedlist_snb = Join(.toarray, ",")
    End With
    End Function

  27. Another formula approach. For readability, I have used S for the separator e.g. “,”

    {=COUNT(1/SIGN(FIND(S&MID(A1,FIND(S,S&A1,ROW(1:999)),FIND(S,A1&S,ROW(1:999))-FIND(S,S&A1,ROW(1:999)))&S,S&A1&S)=ROW(1:999)))}

    For example, for the string: a,b,a,c

    Step 1: S&MID(A1,FIND(S,S&A1,ROW(1:999)),FIND(S,A1&S,ROW(1:999))-FIND(S,S&A1,ROW(1:999)))&S splits out the four comma-wrapped components “,a,” | “,b,” | “,a,” and “,c,”

    Step 2: Find each of these components in the comma-wrapped string “,a,b,a,c,” and count 1 if they’re found in their actual position. The second “,a,” component is located at position 5, but found at position 1, therefore must be a duplicate.

    It appears to work for all well-defined strings, including spaces, e.g. it returns 4 for “a,, ,b” which I’d argue is correct. But if A1 is blank, it returns 1.

  28. Stephen, that’s nice too. It’s pretty much the same as mine which uses “SUMPRODUCT(- -(” instead of “COUNT(1/SIGN(” and so gives an error for blanks. The only other differences are using a variable length array (which does make it many times faster to recalc the test string at the expense of being volatile) and MMULT to reduce function calls.

  29. That’s awesome, Lori. I’m yet to fully digest it, but noted that you can make it shorter still by replacing any instances of this:
    ROW(INDIRECT(“1:”&LEN(A1)))
    …with this:
    ROW(OFFSET(A1,,,LEN(A1)))

    …leaving you this 169 character puppy:

    =SUMPRODUCT(--(FIND(","&MID(A1&",",ROW(OFFSET(A1,,,LEN(A1))),MMULT(FIND(",",{"",","}&A1&",",ROW(OFFSET(A1,,,LEN(A1)))),{1;-1})+1),","&A1&",")=ROW(OFFSET(A1,,,LEN(A1)))))
  30. Stephen, that’s a fantastic approach. Note that you can make it slightly shorter by removing the S from A1&S and instead adding 1 to the final result. Plus you don’t need the SIGN:

    =1+COUNT(1/(FIND(","&MID(A1,FIND(",",","&A1,ROW(1:999)),FIND(",",A1,ROW(1:999))-FIND(",",","&A1,ROW(1:999)))&",",","&A1&",")=ROW(1:999)))

    What’s cool about it is that it can be completely dynamic in terms of string length:

    =1+COUNT(1/(FIND(","&MID(A1,FIND(",",","&A1,ROW(OFFSET(A1,,,LEN(A1)))),FIND(",",A1,ROW(OFFSET(A1,,,LEN(A1))))-FIND(",",","&A1,ROW(OFFSET(A1,,,LEN(A1)))))&",",","&A1&",")=ROW(OFFSET(A1,,,LEN(A1)))))
  31. If we combine Stephen’s approach with Loris, then we get this masterpiece that handles any length text string up to the cell limit, in 179 characters:

    =COUNT(1/(FIND(","&MID(A1,FIND(",",","&A1,ROW(OFFSET(A1,,,LEN(A1)))),MMULT(FIND(",",{"",","}&A1&",",ROW(OFFSET(A1,,,LEN(A1)))),{1;-1}))&",",","&A1&",")=ROW(OFFSET(A1,,,LEN(A1))))),

    Bam!


Posting code? Use <pre> tags for VBA and <code> tags for inline.

Leave a Reply

Your email address will not be published.