Here’s a challenge for someone. Well, maybe it’s easy, but it’s challenging for me. For lack of a better name, I call it *probabilistic lookup*. Say you have a list of names, ordered by popularity (according to the U.S. Census Bureau):

Mary

Nancy

Pamela

Beatrice

Deborah

Enid

Lolita

Delta

Mayola

Concha

How can you randomly select a name such that names at the top of the list are more likely to be selected than names at the bottom of the list? In other words, if you randomly select 1,000 names from this list and then do a sorted frequency distribution of the selected names, they should (roughly) appear in the same order as the original list.

It needs to be generalizable to handle a much larger list (~10,000) . I’m not seeking statistical perfection, just a reasonably accurate way to give names higher on the list a greater chance of being selected, and names lower on the list a lower chance of being selected.

I’d prefer a formula solution, if possible.

I can think of a number of ways.

Here is a pretty simple one.

Reverse the list order.

Add a column

Autonumber 1 through n+1 (assuming n names)

In the next column use a formula

The first cell is 1, the next one down = cell above + the autonumbered cell

Drag all the way down to one beyond the last name.

This is now your lookup table.

Next use random() and multiply by the last value

Use that as the input to your lookup table and return the name associated with that number.

If you had 4 names the first name would have 1 chance in 10, the second 2 chances in ten etc…

1 Bob

3 Carol

6 Ted

10 Alice

Hi, not sure if this is the best way but:

http://jameslow.com/ProbabilisticLookup.xls

Hi John,

How about this solution:

Assign your name list a named range of Names.

This formula will return the names at the top of the list more often:

=INDEX(Names,INT(COUNTA(Names)-(SQRT(2*(RAND()*COUNTA(Names)*(COUNTA(Names)+1)/2)+0.25)-0.5))+1)

The weights work like this:

e.g. If there are 3 names in the list

Name 1 is returned around 3/6 of the time

Name 2 is returned around 2/6 of the time

Name 3 is returned around 1/6 of the time

e.g. If there are 4 names in the list

Name 1 is returned around 4/10 of the time

Name 2 is returned around 3/10 of the time

Name 3 is returned around 2/10 of the time

Name 4 is returned around 1/10 of the time

etc.

If you’re interested in the math, then I could explain it!

Thanks,

Shawn

John – I’ve taken a stab at it. My solution requires some sort of frequency to be specified. Lacking one, I assumed the top item in a list of n items happens n times, the second item happens (n-1) times, and so forth. I suppose it’s possible some sort of megaformula could be made to handle it all in one calc, but I’d prefer not to try and wrap my brain around that…

I’ve posted the file at my site so anybody can download and play with it: http://bill.simonifamily.net/2006/07/12/probabilistic-lookup-in-excel/

Here’s a start. Assuming the names are in A1:A10, I assigned probabilities to each name in B1:B10 and then sorted both columns ascending based on column B. So I have:

Delta 0.005

Mayola 0.005

Concha 0.005

Lolita 0.01

Enid 0.025

Beatrice 0.05

Deborah 0.05

Pamela 0.1

Nancy 0.25

Mary 0.5

I used a helper column (column C) by multiplying each probability by 10000 (cell G1) and then adding the values cumulatively down col. C. So C1 has: =B1*$G$1 and C2: =B2*$G$1+C1. Copy the formula in C2 down to C10. Now I have 3 columns that look like:

Delta 0.00550

Mayola 0.005100

Concha 0.005150

Lolita 0.01250

Enid 0.025500

Beatrice 0.051000

Deborah 0.051500

Pamela 0.12500

Nancy 0.255000

Mary 0.510000

From here I used an INDEX/MATCH combo with RANDBETWEEN selecting a whole number between 0 and 9999:

=INDEX(A1:A10,MATCH(RANDBETWEEN(0,G1-1),C1:C10,1)+1)

Of course the biggest issue is that the numbers 0-49 will product #N/A, and the formula will never return “Delta.” That should be an easy workaround but I can’t find one at the moment. I’ll address later.

I also picked “pretty” probabilities in this example.

Hi

Formula in Colum C =IF(RANDBETWEEN(1,26)>ROW()=TRUE,A2,0)

Formula in Colum D =IF(C2=0,””,IF(AND(C20),”T”,”B”))

This gives more random “T”‘s than “B”‘s

Regards

Sam

A B C D

SlNameTOP/BOTTOM

1a1TTop1077%

2b0Bottom323%

3c3T13

4d4T

5e5T

6f6T

7g0

8h8T

9i9T

10j0

11k11T

12l12T

13m13T

14n14B

15o0

16p0

17q17B

18r0

19s0

20t0

21u21B

22v0

23w0

24×0

25y0

26z0

Hi John,

If the list is in, say, cells A1 to A10, then you could make your selections with a formula like:

=OFFSET($A$1,TRUNC(RAND()*RAND()*COUNTA($A$1:$A$10),0),0)

Cheers,

XLurker

Six Steps:

1. Copy list to A1:A10

2. Sort list by smallest frequency of occurance to largest

3. Enter 0 in Cell B1

4. Enter =INT(AVERAGE(B1+2,(B1+2)*2)) in cell B2

5. Copy formula down to Cell B11 (Yes…one row below the end the of list)

6. Enter this formula anywhere =INDEX($A$1:$A$10,(MATCH(INT((RAND()*MAX($B:$B)+1)),$B$1:$B$11,1)))

Copy formula down to 1000 rows, and you will get a distribution that should support the premise of your original list. The more rows you include, the closer you get to the correct distribution.

The formula in step 4 could be tweaked for better results.

Hi,

here’s my try:

=INDEX(INDIRECT(“A1:A”&COUNTA($A:$A)),COUNTA($A:$A)-INT(LOG10(10*(9/10*RAND()+0.1))*COUNTA($A:$A)))

Assuming that your list of names starts at A1.

-Asser

First is to generate an index table:

“Occurance” is the relative indvidual frequency of the names in ascending order. It does not matter how these are scaled.

“Prob” is the cumulative probability of that name being returned. The top name will always be 1.000

= SUM(B2:$B$11)/SUM($B$2:$B$11)

To actually return names the formula is:

=INDEX(Names,MATCH(RAND(),Prob,-1))

Where:

“Names” represents A$2:$A$11

“Prob” represents C$2:$C$11

Example Result for a block of 10000 selections

FWIW formula to complie this was:

=COUNTIF($J$1:$S$1000,A2) – to Give answer for “Mary”

Nancy 2316

Pamela 1325

Beatrice 960

Deborah 922

Enid 783

Lolita 520

Delta 348

Mayola 308

Concha 181

Mpemba

Editor:Mpemba, I’ve been using your comment to play around with the formatting.Darn that formatting: it looked fine in the preview window (Courier) !

I should add, if you want a working copy of the spreadsheet I’d gladly mail it.

M

Hi,

If the list is in, say, cells A1 to A10, then how about making your selections with a formula like this:

=OFFSET($A$1,TRUNC(RAND()^2*COUNTA($A$1:$A$10),0),0)

The COUNTA is for scalability of the list. Set the power of 2 higher to skew the selection more to the top of the list, or between 0 and 1 to skew the selection to the bottom of the list.

Cheers,

XLurker

Me again:

In hindsight I should add that in post 4 my “Occurance” column does not have to be sorted in to ascending order. The order does not matter a jot using this method.

I must also admit I found it surprising Mary was at the top of the example list. The 2005 data for the USA listed at:

http://www.socialsecurity.gov/OACT/babynames/

Top ten for the 2000s

http://www.socialsecurity.gov/OACT/babynames/decades/names2000s.html

Emily 149,420 1.2525%

Madison 123,729 1.0372%

Hannah 110,081 0.9228%

Emma 106,428 0.8921%

Ashley 91,644 0.7682%

Abigail 89,848 0.7532%

Alexis 89,512 0.7503%

Olivia 88,971 0.7458%

Samantha 88,669 0.7433%

Sarah 85,747 0.7188%

RNDBTWN? thght mst b mssng smthng, hng n tls > dd-ns > nlyss tlpk.

It’s making a lot more sense now.

Mpemba:

You shouldn’t be surprised Mary was at the top of the list. If you go to the site you linked and check the “Popularity of a Name” on Female for the past 100 years using the name Mary, you’ll find it was a very popular name for a very long time.

Thanks to everyone. I haven’t gone through them all yet, but this is a great help.

BTW, here’s where I got the names from:

http://www.census.gov/genealogy/names/names_files.html

And each name has a frequency.

For a graphical representation of name popularity over the decades, check out the totally fun Baby Name Wizard at http://www.babynamewizard.com/namevoyager/

Poor Mary ain’t what she used to be!

From these charts, you can easily pinpoint when I was born. :)

{=OFFSET($A$1,MATCH(RAND(),1-EXPONDIST(ROW(1:[n])-1,[tweakable constant],TRUE),-1),0)} … so entered as an array formula over the required output cells.

A1 is one cell above the input list

[n] is the upper bound for the number of input values + 1

[tweakable constant] controls the decay in probabilities for the exponential function. 0.1 ‘works’ over 10,000 cells, 0.9 can be seen to ‘work’ for a shorter list. 0.1 is very ‘slow’ decay in probability.

Statistical Perfection yields the best results. Sorry so late but I was helping someone with homework and came across this unsatisfactory answer. What you need is an “adjusted cumulative frequency” column to the left of the names. Luckily, the data includes the cumulative frequency:

adj.cf name freq cum.freq(%) rank

0 MARY 2.629 2.629 1

.02629 PATRICIA 1.073 3.702 2

.03702 LINDA 1.035 4.736 3

.04736 BARBARA 0.980 5.716 4

.05716 ELIZABETH 0.937 6.653 5

=VLOOKUP(RAND(), $A$2:$B$6, 2)

or if you use the actual data

=VLOOKUP(RAND(), $A$2:$B$4275, 2)

http://www.excelforum.com/showthread.php?t=628930

shows similar question with Letters of alphabet

Thanks to Dick and all the contributors of this site. I have picked up hundreds of tips here. Many thanks.

I am trying to develop a function that will count the number of runs within a range with one run being defined as when the value encounters a sign change (i.e. from positive to negative or negative to positive). Consider the following range: starting in cell A1 and continuing down to cell A10 I have these numbers: 125, 200, 175, -50, -75, 135, -65, -225, -90, and 115. The first three values (125, 200, 175) are positive and are considered as one run (a winning streak). The next two values are negative (-50, -75) thus are considered another run (a losing streak). The next value (135) is positive and is considered another run. The next three values (-65, -225, -90) are negative and thus is considered another run. Finally, the last value in the range (115) is positive, thus is considered another run. Therefore, for this example range the total number of runs 5. I have developed functions to calculate the longest winning and losing streaks, but I am really hitting a wall developing a function to calculate the total number of runs within a range. Any suggestions on how to approach this problem would be greatly appreciated. Thanks.