Building a self-sorting list
I haven’t posted in a good while. I’ve done over half the Euler problems, but haven’t had the time to research the high-numbered ones considering I might even understand them. Now that I’ve retired from my second career, and my third is only part-time, maybe I’ll get back to it. Today is something perhaps more useful…to have a list of names, specifications, part numbers, etc and have them self-sort, de-duplicate, and collapse without ever having to go to the Sort Menu. Add a member to the bottom, and it flows through. Replace or paste over the data, and Excel turns the crank. Unique values appear, and no VBA is involved. This was a monthly chore in Job #2 for sometimes as many as 8000 email addresses.
Assume in Column A are 1000 (fantasy) names, in random order. These names were generated by using this website four times. Rice University and Dr. Chris Brown have provided name generation tools for many spoken languages. Your Column A might look something like this:
A | B | |
---|---|---|
1 | Alfred Sundagger | =IF(LEN(A1)>0,COUNTIF(A:A,”<“&A1)+COUNTIF($A$1:$A1,”=”&A1),””) |
2 | Aki Stonesaber | =IF(LEN(A2)>0,COUNTIF(A:A,”<“&A2)+COUNTIF($A$1:$A2,”=”&A2),””) |
3 | Yrre the Carver | |
4 | Alfred Birchleaf | |
5 | Olaf the Dagger | |
6 | Isen Grimboldson | |
7 | Eagle Arianson | |
8 | Besyrwan Odonson | |
9 | Onund Boarherder | |
10 | Berdoi of the Radiant Face | |
11 | Azhar Fahim | |
12 | Osric the Hostler | |
13 | Yasha the Hare | |
14 | Warian Pikethrower | |
15 | Konrad Firelash | |
16 | Faran Stillstoke |
The formula in B1 tests if there in anything of length in A1. I’ve found that the LEN() function always returns a value that agrees with my ol’ Mark 1 Mod 0 eyeballs, and empty text doesn’t mislead me. If there is something of length in A1, then COUNTIF() everything in A:A that is less than A1. This uses Excel’s lexicographic algorithms, and that opinion may differ from what you want, but I’ve never had a problem with it. Add to that count using mixed references the COUNTIF() of everything between $A$1 and $A1 inclusive that equals $A1. When we fill down this will become COUNTIF($A$1:$A2,”=”&A2) in A2. Otherwise, if there is nothing of length, place empty text (“”-double double quotes). Fill down well below the extent of values in A:A.
Column B now looks something like this:
A | B | C | |
---|---|---|---|
1 | Alfred Sundagger | 43 | =IF(ROW()<=COUNT(B:B),SMALL(B:B,ROW()),””) |
2 | Aki Stonesaber | 25 | |
3 | Yrre the Carver | 971 | |
4 | Alfred Birchleaf | 39 | |
5 | Olaf the Dagger | 660 | |
6 | Isen Grimboldson | 470 | |
7 | Eagle Arianson | 240 | |
8 | Besyrwan Odonson | 144 | |
9 | Onund Boarherder | 683 | |
10 | Berdoi of the Radiant Face | 136 | |
11 | Azhar Fahim | 110 | |
12 | Osric the Hostler | 693 | |
13 | Yasha the Hare | 960 | |
14 | Warian Pikethrower | 944 | |
15 | Konrad Firelash | 541 | |
16 | Faran Stillstoke | 303 |
The formula in C1 tests if the row number is less than or equal to the count of the numbers in Column B. If it is, put the the smallest number there from Column B:B in ROW() order, otherwise put empty text. After filling down as far as in Column B, Column C is just 1 through the extent of your data in Column A. The use of SMALL() is what accommodates blank cells in A:A.
A | B | C | D | |
---|---|---|---|---|
1 | Alfred Sundagger | 43 | 1 | =IF(ROW()<=COUNT(B:B),INDEX(A:A,MATCH(C1,B:B,0),1),””) |
2 | Aki Stonesaber | 25 | 2 | |
3 | Yrre the Carver | 971 | 3 | |
4 | Alfred Birchleaf | 39 | 4 | |
5 | Olaf the Dagger | 660 | 5 | |
6 | Isen Grimboldson | 470 | 6 | |
7 | Eagle Arianson | 240 | 7 | |
8 | Besyrwan Odonson | 144 | 8 | |
9 | Onund Boarherder | 683 | 9 | |
10 | Berdoi of the Radiant Face | 136 | 10 | |
11 | Azhar Fahim | 110 | 11 | |
12 | Osric the Hostler | 693 | 12 | |
13 | Yasha the Hare | 960 | 13 | |
14 | Warian Pikethrower | 944 | 14 | |
15 | Konrad Firelash | 541 | 15 | |
16 | Faran Stillstoke | 303 | 16 |
The formula in D1 tests the ROW() against the COUNT() again, and if ROW() is less than or equal to the COUNT(), specify zero as the third argument and find the exact MATCH() of the number in Column C:C within Column B:B, and then INDEX down Column A:A that far and return that result. Otherwise, put empty text. Fill down as far as in Columns B and C. After filling down, the list is sorted, and might look something like this:
A | B | C | D | |
---|---|---|---|---|
1 | Alfred Sundagger | 43 | 1 | Aart Moonhammer |
2 | Aki Stonesaber | 25 | 2 | Aart Moonhammer |
3 | Yrre the Carver | 971 | 3 | Aart the Millwright |
4 | Alfred Birchleaf | 39 | 4 | Aart Whiteson |
5 | Olaf the Dagger | 660 | 5 | Aberri of the White Heart |
6 | Isen Grimboldson | 470 | 6 | Aberri the Thieving Wizard |
7 | Eagle Arianson | 240 | 7 | Acennan Badgerrunner |
8 | Besyrwan Odonson | 144 | 8 | Acennan Kenricsson |
9 | Onund Boarherder | 683 | 9 | Adei of the Red Ruins |
10 | Berdoi of the Radiant Face | 136 | 10 | Adolphus Blackbird |
11 | Azhar Fahim | 110 | 11 | Adolphus Blackbird |
12 | Osric the Hostler | 693 | 12 | Adolphus Stillearth |
13 | Yasha the Hare | 960 | 13 | Adrik Yakovovich |
14 | Warian Pikethrower | 944 | 14 | Aethelred Awierganson |
15 | Konrad Firelash | 541 | 15 | Aethelred Darkseed |
16 | Faran Stillstoke | 303 | 16 | Ageio the Fastidious Sorceror |
Next post we’ll de-dupe the list and then collapse it to unique entries.
…mrt
There are a lot of ways to do this, and A LOT of those ways could be found in the newsgroup archives. Given the redundant calculations involved in this, udfs could be faster than formulas.
A more efficient pure formula approach would be to use an array formula in column B, and definitely not use entire column references. If the names in col A spanned rows 1 to 2000, select B1:B2000 and enter the array formula
=IF(A1:A2000?”,COUNTIF(A1:A2000,”<“&A1:A2000))
Then you only need one more column.
C1:
=INDEX(A$1:A$2000,MATCH(0,$B$1:$B$100,0))
C2:
=IF(ISNUMBER(MATCH(ROWS(C$1:C1),B$1:B$2000,0)),INDEX(A$1:A$2000,MATCH(ROWS(C$1:C1),B$1:B$2000,0)),
IF(ROWS(C$1:C1)”&C1),INDEX(A$1:A$2000,MATCH(COUNTIF(A$1:A$2000,”<=”&C1),$B$1:$B$2000,0)),””)
Hi fzz -
There are pluses and minuses with array formulas, and one of the minuses is that they are very hard, relatively, to edit. Easier to fill down.
Your indexing number in Column B will only point to the first of a kind, and then the match will repeatedly return that first found. In the example, this meets the intent, but if the adjacent cells are unique and need to be carried with the sort, you need a unique id. That’s the purpose of the COUNTIF($A$1:$A1,”=”&A1) part. Trivial example: Column A is an ordered sort of last names. Column B is first names. One name is Doe|John and another is Smith|John and the desired output is a sort on first names because that’s how they show up in the address book (from real life!). Without the unique id, you’ll get John|Doe twice.
Excel gives me lots of columns. I like to use them to reveal intermediate steps. Tends to keep the hair in my head. I can noodle out your C2 but it’ll cost me some gray hair ;-)
…mrt
So why use formulas to sort? Wouldn’t Change and/or Calculate event handlers calling the Sort method of specified range objects be far more efficient? Don’t do so in order to avoid macro security?
If you must have a unique key including duplicates, then with just formulas,
B1:B2000:
=IF(A1:A2000?”,COUNTIF(A1:A2000,”B#,B$1:B$2000)) to find the next smallest after the value in cell B#, and that requires single cell array formulas.
So why use formulas to sort? Wouldn’t Change and/or Calculate event handlers calling the Sort method of specified range objects be far more efficient? Don’t do so in order to avoid macro security?
If you must have a unique key including duplicates, then with just formulas,
=IF(A1:A2000<>“”,COUNTIF(A1:A2000,“<“&A1:A2000)+(ROW(B1:B2000)-ROW(B$1))/ROWS(B1:B2000))
This is an array formula. To me no more difficult to edit than nonarray formulas. It just takes pressing 3 keys rather than one to enter is. The point here is to avoid calling COUNTIF more than once for each cell in B1:B2000.
This requires changing the col C formulas as well, to use MIN(IF(B$1:B$2000>B#,B$1:B$2000)) to find the next smallest after the value in cell B#, and that requires single cell array formulas.
fzz – on the strong probability that I don’t know something here: to expand the array that an array formula applies to, I select the array, delete the formula, select the new array and re-enter the new formula. Anything else I’ve ever tried gives me a “Can’t edit parts of an array” complaint.
How do you do it? I’m talking about more that a CSE press.
…mrt
Michael,
To overcome the problem of CSE you can use Names
Define Rand_Name =Sheet1!$A$1:INDEX(Sheet1!$A:$A,COUNTA(Sheet1!$A:$A))
Define Rank_Lst = COUNTIF(Rand_Lst,”<=”&Rand_Lst)
In B1 type = Index(Rand_Lst,Match(Row(),Rank_Lst,0)
No need of CSE
Opps forgot a close bracket
In B1 type = Index(Rand_Lst,Match(Row(),Rank_Lst,0))
Hi Sam -
Pressing CSE isn’t the problem…but maybe my approach is ;-) Thanks.
Both the definition of Rank_Lst and B1 have a Rand_Lst inside. Believe you meant Rand_Name, or you meant Rand_Name to be Rand_Lst?
…mrt
Michael….yes more typos
Define Rand_Lst =Sheet1!$A$1:INDEX(Sheet1!$A:$A,COUNTA(Sheet1!$A:$A))
Define Rank_Lst = COUNTIF(Rand_Lst,”<=”&Rand_Lst)
In B1 type = Index(Rand_Lst,Match(Row(),Rank_Lst,0))
To change the size of the array, just select all cells you want the array formula to span and hit F2, CSE.
Jan
In my experience you can only _increase_ the size of an array that way. I’ve always had to do what Michael says to shrink an array. What is the cool way to shrink an array?
Nice post by the way – this is something I have to do a lot of. I have my own hacky way of sorting but I’ve never had to deal with missing spaces or duplicates before. I’ll mark this page!
To increase the size of an array formula’s result, select the larger result range, make one of the cells containing the array active and edit it, then enter as an array formula.
To shrink the size of an array formula’s result, I find it easiest to select a cell in the array, insert an apostrophe before the initial =, enter as an array formula. That converts all cells in the array to the same string constant, and the result range ceases to be an array. Select the smaller result range, re-edit the formula, delete the initial apostrophe, enter as an array formula. Finally, clear the contents of the cells which were part of the original array result but not the new, smaller result. If this needs to be done often, it’s a good candidate for a macro.
Dim na As Range, oa As Range, af As String
If Not TypeOf Selection Is Range Then Exit Sub
Set na = Selection
If na.Areas.Count > 1 Then Exit Sub
On Error Resume Next
Set oa = ActiveCell.CurrentArray
If oa Is Nothing Then Exit Sub
On Error GoTo 0
af = ActiveCell.FormulaR1C1
oa.ClearContents
na.FormulaArray = af
End Sub
This macro can shrink or expand the array result range. To use it, you need to select the new result range and ensure that the active cell contains the array formula. Error checking left as an exercise.
Moral: don’t try to do everything manually/interactively. There are many situations in which macros will be vastly more efficient than any human.
Very nice fzz, thank you – that will be definitely be added to my toolbox.
Instead of changing the formula to a string, you can also edit the array, hit Control+enter instead of CSE to make them all “normal” formulas.
You can download the sample workbook in Part 2
http://www.dailydoseofexcel.com/archives/2010/09/06/building-a-self-sorting-list-part-2/
The [Ctrl]+[Enter] approach has one potential drawback: the resulting formulas differ from cell to cell if there were any partly or fully relative cell references. For example, if B1:B10 contained =2^(A1:A10-1), B1:B10 were selected with B1 the active cell, [F2] [Ctrl]+[Enter] produces =2^(A1:A10-1) in A1, =2^(A2:A11-1) in A2, =2^(A3:A12-1) in A3, etc. So as long as the 1st (leftmost and topmost) cell of the new array result range is the active cell when using [F2] [Ctrl]+[Enter], all’s well, but any other cell active can screw things up. Converting all cells to the same string has the advantage that it doesn’t matter which cell was active. It’s a question of which approach is more robust.
But why use array formulas.
If you let use “array” names the we can use them in formulas and dont have to array enter the formula
@sam – if you’re going to push . . . the original article’s IF(LEN(A#)>0, test implies there could be many blank cells interspersed within the data in col A. If so, a fair amount of special processing is needed to exclude cells evaluating to “”, and some of that special processing defeats using defined names as replacements for array expressions.