# Probabilistic Lookup

Here’s a challenge for someone. Well, maybe it’s easy, but it’s challenging for me. For lack of a better name, I call it probabilistic lookup. Say you have a list of names, ordered by popularity (according to the U.S. Census Bureau):

Mary
Nancy
Pamela
Beatrice
Deborah
Enid
Lolita
Delta
Mayola
Concha

How can you randomly select a name such that names at the top of the list are more likely to be selected than names at the bottom of the list? In other words, if you randomly select 1,000 names from this list and then do a sorted frequency distribution of the selected names, they should (roughly) appear in the same order as the original list.

It needs to be generalizable to handle a much larger list (~10,000) . I’m not seeking statistical perfection, just a reasonably accurate way to give names higher on the list a greater chance of being selected, and names lower on the list a lower chance of being selected.

I’d prefer a formula solution, if possible.

Posted in Uncategorized

## 20 thoughts on “Probabilistic Lookup”

1. I can think of a number of ways.
Here is a pretty simple one.
Reverse the list order.
Autonumber 1 through n+1 (assuming n names)
In the next column use a formula
The first cell is 1, the next one down = cell above + the autonumbered cell
Drag all the way down to one beyond the last name.
This is now your lookup table.

Next use random() and multiply by the last value
Use that as the input to your lookup table and return the name associated with that number.

If you had 4 names the first name would have 1 chance in 10, the second 2 chances in ten etc…

1 Bob
3 Carol
6 Ted
10 Alice

2. Shawn Stackhouse says:

Hi John,

Assign your name list a named range of Names.

This formula will return the names at the top of the list more often:

=INDEX(Names,INT(COUNTA(Names)-(SQRT(2*(RAND()*COUNTA(Names)*(COUNTA(Names)+1)/2)+0.25)-0.5))+1)

The weights work like this:

e.g. If there are 3 names in the list
Name 1 is returned around 3/6 of the time
Name 2 is returned around 2/6 of the time
Name 3 is returned around 1/6 of the time

e.g. If there are 4 names in the list
Name 1 is returned around 4/10 of the time
Name 2 is returned around 3/10 of the time
Name 3 is returned around 2/10 of the time
Name 4 is returned around 1/10 of the time

etc.

If you’re interested in the math, then I could explain it!

Thanks,
Shawn

3. John – I’ve taken a stab at it. My solution requires some sort of frequency to be specified. Lacking one, I assumed the top item in a list of n items happens n times, the second item happens (n-1) times, and so forth. I suppose it’s possible some sort of megaformula could be made to handle it all in one calc, but I’d prefer not to try and wrap my brain around that…

I’ve posted the file at my site so anybody can download and play with it: http://bill.simonifamily.net/2006/07/12/probabilistic-lookup-in-excel/

4. Jason Morin says:

Here’s a start. Assuming the names are in A1:A10, I assigned probabilities to each name in B1:B10 and then sorted both columns ascending based on column B. So I have:

Delta 0.005
Mayola 0.005
Concha 0.005
Lolita 0.01
Enid 0.025
Beatrice 0.05
Deborah 0.05
Pamela 0.1
Nancy 0.25
Mary 0.5

I used a helper column (column C) by multiplying each probability by 10000 (cell G1) and then adding the values cumulatively down col. C. So C1 has: =B1*\$G\$1 and C2: =B2*\$G\$1+C1. Copy the formula in C2 down to C10. Now I have 3 columns that look like:

Delta 0.00550
Mayola 0.005100
Concha 0.005150
Lolita 0.01250
Enid 0.025500
Beatrice 0.051000
Deborah 0.051500
Pamela 0.12500
Nancy 0.255000
Mary 0.510000

From here I used an INDEX/MATCH combo with RANDBETWEEN selecting a whole number between 0 and 9999:

=INDEX(A1:A10,MATCH(RANDBETWEEN(0,G1-1),C1:C10,1)+1)

Of course the biggest issue is that the numbers 0-49 will product #N/A, and the formula will never return “Delta.” That should be an easy workaround but I can’t find one at the moment. I’ll address later.

I also picked “pretty” probabilities in this example.

5. Sam says:

Hi

Formula in Colum C =IF(RANDBETWEEN(1,26)>ROW()=TRUE,A2,0)
Formula in Colum D =IF(C2=0,””,IF(AND(C20),”T”,”B”))
This gives more random “T”‘s than “B”‘s

Regards
Sam

A B C D
SlNameTOP/BOTTOM
1a1TTop1077%
2b0Bottom323%
3c3T13
4d4T
5e5T
6f6T
7g0
8h8T
9i9T
10j0
11k11T
12l12T
13m13T
14n14B
15o0
16p0
17q17B
18r0
19s0
20t0
21u21B
22v0
23w0
24×0
25y0
26z0

6. XLurker says:

Hi John,

If the list is in, say, cells A1 to A10, then you could make your selections with a formula like:

=OFFSET(\$A\$1,TRUNC(RAND()*RAND()*COUNTA(\$A\$1:\$A\$10),0),0)

Cheers,

XLurker

7. Six Steps:

1. Copy list to A1:A10
2. Sort list by smallest frequency of occurance to largest
3. Enter 0 in Cell B1
4. Enter =INT(AVERAGE(B1+2,(B1+2)*2)) in cell B2
5. Copy formula down to Cell B11 (Yes…one row below the end the of list)
6. Enter this formula anywhere =INDEX(\$A\$1:\$A\$10,(MATCH(INT((RAND()*MAX(\$B:\$B)+1)),\$B\$1:\$B\$11,1)))

Copy formula down to 1000 rows, and you will get a distribution that should support the premise of your original list. The more rows you include, the closer you get to the correct distribution.
The formula in step 4 could be tweaked for better results.

8. Jasser says:

Hi,

here’s my try:

=INDEX(INDIRECT(“A1:A”&COUNTA(\$A:\$A)),COUNTA(\$A:\$A)-INT(LOG10(10*(9/10*RAND()+0.1))*COUNTA(\$A:\$A)))

Assuming that your list of names starts at A1.

-Asser

9. Mpemba says:

First is to generate an index table:
“Occurance” is the relative indvidual frequency of the names in ascending order. It does not matter how these are scaled.
“Prob” is the cumulative probability of that name being returned. The top name will always be 1.000

The formula in the Prob column is:
= SUM(B2:\$B\$11)/SUM(\$B\$2:\$B\$11)
 A B C 1 Names Occurance Prob 2 Mary 15.10% 1.000 3 Nancy 14.90% 0.765 4 Pamela 9.10% 0.533 5 Beatrice 6.10% 0.392 6 Deborah 5.60% 0.297 7 Enid 4.80% 0.210 8 Lolita 3.10% 0.135 9 Delta 2.40% 0.087 10 Mayola 1.90% 0.050 11 Concha 1.30% 0.020

To actually return names the formula is:
=INDEX(Names,MATCH(RAND(),Prob,-1))

Where:
“Names” represents A\$2:\$A\$11
“Prob” represents C\$2:\$C\$11

Example Result for a block of 10000 selections
FWIW formula to complie this was:
=COUNTIF(\$J\$1:\$S\$1000,A2) – to Give answer for “Mary”

Mary      2337
Nancy     2316
Pamela    1325
Beatrice   960
Deborah    922
Enid       783
Lolita     520
Delta      348
Mayola     308
Concha     181

Mpemba

Editor: Mpemba, I’ve been using your comment to play around with the formatting.

10. Mpemba says:

Darn that formatting: it looked fine in the preview window (Courier) !

M

11. XLurker says:

Hi,

If the list is in, say, cells A1 to A10, then how about making your selections with a formula like this:

=OFFSET(\$A\$1,TRUNC(RAND()^2*COUNTA(\$A\$1:\$A\$10),0),0)

The COUNTA is for scalability of the list. Set the power of 2 higher to skew the selection more to the top of the list, or between 0 and 1 to skew the selection to the bottom of the list.

Cheers,

XLurker

12. Mpemba says:

Me again:
In hindsight I should add that in post 4 my “Occurance” column does not have to be sorted in to ascending order. The order does not matter a jot using this method.

I must also admit I found it surprising Mary was at the top of the example list. The 2005 data for the USA listed at:
http://www.socialsecurity.gov/OACT/babynames/

Top ten for the 2000s
Emily 149,420 1.2525%
Hannah 110,081 0.9228%
Emma 106,428 0.8921%
Ashley 91,644 0.7682%
Abigail 89,848 0.7532%
Alexis 89,512 0.7503%
Olivia 88,971 0.7458%
Samantha 88,669 0.7433%
Sarah 85,747 0.7188%

13. Gareth says:

RNDBTWN? thght mst b mssng smthng, hng n tls > dd-ns > nlyss tlpk.

It’s making a lot more sense now.

14. J Walker says:

Mpemba:
You shouldn’t be surprised Mary was at the top of the list. If you go to the site you linked and check the “Popularity of a Name” on Female for the past 100 years using the name Mary, you’ll find it was a very popular name for a very long time.

15. Andrew Reynolds says:

{=OFFSET(\$A\$1,MATCH(RAND(),1-EXPONDIST(ROW(1:[n])-1,[tweakable constant],TRUE),-1),0)} … so entered as an array formula over the required output cells.

A1 is one cell above the input list
[n] is the upper bound for the number of input values + 1
[tweakable constant] controls the decay in probabilities for the exponential function. 0.1 ‘works’ over 10,000 cells, 0.9 can be seen to ‘work’ for a shorter list. 0.1 is very ‘slow’ decay in probability.

16. Cyrus Sujar... says:

Statistical Perfection yields the best results. Sorry so late but I was helping someone with homework and came across this unsatisfactory answer. What you need is an “adjusted cumulative frequency” column to the left of the names. Luckily, the data includes the cumulative frequency:

0 MARY 2.629 2.629 1
.02629 PATRICIA 1.073 3.702 2
.03702 LINDA 1.035 4.736 3
.04736 BARBARA 0.980 5.716 4
.05716 ELIZABETH 0.937 6.653 5

=VLOOKUP(RAND(), \$A\$2:\$B\$6, 2)
or if you use the actual data
=VLOOKUP(RAND(), \$A\$2:\$B\$4275, 2)