
I have written this function, "choose_by_probability()"
It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.
The idea is to have two lists, one with values, and another with
probabilities. A value then gets selected at random, weighted by its probability.
values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]
print values[choose_by_probability(probabilities)]
'item 1'
etc...
I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?
regards Max M
###################
from random import random

def choose_by_probability(probabilities):
    "Randomly selects an index from a list of probabilities"
    rnd = random() * sum(probabilities)
    range_start = 0.0
    for i in range(len(probabilities)):
        probability = probabilities[i]
        range_end = range_start + probability
        if range_start <= rnd < range_end:
            return i
        range_start = range_end
probabilities = []
for f in range(0, 16):
    probabilities.append(1.0 / (f + 1.0))
print probabilities

# writing it out sorted to do a visual test...
result = []
for i in range(100):
    result.append(choose_by_probability(probabilities))
result.sort()
print result

In article <40*********************@dread12.news.tele.dk>,
Max M <ma**@mxm.dk> wrote:
I have written this function, "choose_by_probability()"
It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index.
The idea is to have two lists, one with a value, and another with a probability. The value then gets randomly selected from its probability.
values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]
print values[choose_by_probability(probabilities)]
'item 1'
etc...
I just wondered if anybody has a better way of doing it, as this seems nastily squared to me?
My usual way is:

import random

def choose_by_probability(values, probabilities):
    cumul = 0.0
    for prob, item in zip(probabilities, values):
        cumul += prob
        if prob > random.random() * cumul:
            selected = item
    return selected
def run_tests(n):
    D = {}
    for i in xrange(n):
        j = choose_by_probability(['item 1', 'item 2', 'item 3'],
                                  [0.66, 0.33, 0.17])
        D[j] = D.get(j, 0) + 1
    s = D.items()
    s.sort()
    for k, v in s:
        print k, float(v)/n
    print "Total:", sum([v for k, v in D.items()])

run_tests(10000)
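Mel's running-total trick replaces the current pick with probability prob/cumul at each step, which works out to selecting each item in proportion to its weight in a single pass. A sketch in modern Python (the name `weighted_pick` is mine, not Mel's, and the frequency check is only approximate):

```python
import random

def weighted_pick(values, weights):
    """Single-pass weighted selection: item i replaces the current
    pick with probability weights[i] / (running total so far)."""
    cumul = 0.0
    selected = None
    for weight, item in zip(weights, values):
        cumul += weight
        if weight > random.random() * cumul:
            selected = item
    return selected

random.seed(42)
n = 30000
counts = {'a': 0, 'b': 0, 'c': 0}
for _ in range(n):
    counts[weighted_pick(['a', 'b', 'c'], [0.5, 0.3, 0.2])] += 1
# observed frequencies should approximate the normalised weights
```

An inductive argument shows why it works: if the pick is fair over the first k items, keeping it with probability W_k/W_{k+1} at step k+1 keeps it fair over k+1 items.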
Regards. Mel.  

Max M <ma**@mxm.dk> wrote in message news:<40*********************@dread12.news.tele.dk>...
It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index.
You may wish to look at the GRADES example in the bisect documentation.
jJ
Take only memories. Leave not even footprints.

Max M <ma**@mxm.dk> wrote:
I have written this function, "choose_by_probability()"
It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index.
The idea is to have two lists, one with a value, and another with a probability. The value then gets randomly selected from its probability.
values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]
Here is a simplified version of the approach you took:
import random

def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add up to 1.0')
If you need to call this many times and for large table, there are
ways to speed this up.
* rearrange the tables to make sure the largest probabilities come
first. Intentionally or not, your sample data is already arranged this
way (though the probabilities add up to more than 1.0 and should be
scaled back to 1.0).
* presummarize the summations with a cumulative probability table:
[.50, .25, .15, .10] -> [.50, .75, .90, 1.00]
def cselect(ctab, random=random.random):
    r = random()
    for i, cp in enumerate(ctab):
        if cp > r:
            return i
    raise Exception
* if the table is cumulative and large, bisection offers a faster
search:
import bisect

def cselect(ctab, random=random.random, bisect=bisect.bisect):
    return bisect(ctab, random())
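In today's Python the cumulative table can be built with `itertools.accumulate`; a sketch along the lines of Raymond's `cselect` (not his exact code):

```python
import bisect
import itertools
import random

weights = [.50, .25, .15, .10]
# running sums: [0.50, 0.75, 0.90, 1.00]
ctab = list(itertools.accumulate(weights))

def cselect(ctab):
    # bisect returns the index of the first cumulative
    # value strictly greater than the uniform draw
    return bisect.bisect(ctab, random.random())
```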
* if the probabilities come in integer ratios, you could reduce the
problem to an O(1) lookup:
[.5, .3, .1, .1] -> [0,0,0,0,0,1,1,1,2,3]

def lselect(ltab, choice=random.choice):
    return choice(ltab)
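Building that lookup table from the integer ratios is mechanical; a possible helper (`make_lookup` is my name, assuming small integer weights):

```python
import random

def make_lookup(ratios):
    # index i appears ratios[i] times, so random.choice
    # picks it with probability ratios[i] / sum(ratios)
    ltab = []
    for i, n in enumerate(ratios):
        ltab.extend([i] * n)
    return ltab

ltab = make_lookup([5, 3, 1, 1])
# ltab == [0, 0, 0, 0, 0, 1, 1, 1, 2, 3]
pick = random.choice(ltab)
```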
IOW, if speed is what is important, then be sure to exploit any known
structure in the data (ordering, cumulativeness, integer ratios, etc.)
and don't compute anything twice (precompute the summation).
Raymond Hettinger  

Jim Jewett wrote:
Max M <ma**@mxm.dk> wrote in message news:<40*********************@dread12.news.tele.dk>...
You may wish to look at the GRADES example in the bisect documentation.

First off, thanks for the replies!
I finally chose a combination of the suggested approaches, and ended up
with something I find is both more functional and nicer to look at than
my first try.
regards Max M
################################
from random import random
from bisect import bisect

class Selector:
    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        self.decorated = zip(probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0)) - 1
        return self.decorated[index][1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]

if __name__ == '__main__':
    probabilities = []
    for f in range(0, 10):
        probabilities.append(1.0 / (f + 1.0))
    elements = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    s = Selector(probabilities, elements)
    result = s.get_range(1000)
    result.sort()
    print result

Raymond Hettinger wrote:
Here is a simplified version of the approach you took:

def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add up to 1.0')

* rearrange the tables to make sure the largest probabilities come first. Intentionally or not, your sample data is already arranged this way (though the probabilities add up to more than 1.0 and should be scaled back to 1.0).
It doesn't really have to be scaled to 1.0 as long as they get selected
with a frequency that corresponds to their probability.
IOW, if speed is what is important, then be sure to exploit any known structure in the data (ordering, cumulativeness, integer ratios, etc.) and don't compute anything twice (precompute the summation).

Lots of good stuff. Thanks.
Max M  

Max M <ma**@mxm.dk> wrote in message news:<40*********************@dread12.news.tele.dk>...
I have written this function, "choose_by_probability()"
It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index.
The idea is to have two lists, one with a value, and another with a probability. The value then gets randomly selected from its probability.
values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]
print values[choose_by_probability(probabilities)]
'item 1'
etc...
I just wondered if anybody has a better way of doing it, as this seems nastily squared to me?
In one line:
v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]
[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]
R.  

Roberto A. F. De Almeida wrote:
In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]
[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]
uh, that's hard to read :)
But doesn't random.random()*sum(p) get executed for each probability in p?
Then each probability in p will be compared to a different random value.
That isn't fair...
regards Max m  

Max M wrote:
Roberto A. F. De Almeida wrote:
In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]
[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]

uh, that's hard to read :)
Yes, I know. I just wanted to see if I could do it in one line. :)
But it just calculates the cumulative probability, and gets the first
one which passes a random value between 0 and the sum of the probabilities.
But doesn't random.random()*sum(p) get executed for each probability in p?
I actually don't know. You could call random.random() out of the list
comprehension... :)
R.  
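Hoisting the draw out, as Roberto suggests, and shifting the slice by one so each index is compared against its own cumulative sum, might look like this (my adjustment, not code from the thread):

```python
import random

v = ['item 1', 'item 2', 'item 3']
p = [0.66, 0.33, 0.16]

r = random.random() * sum(p)          # one draw, shared by every test
# first index whose cumulative sum through itself exceeds the draw
choice = [v[i] for i, _ in enumerate(p) if sum(p[:i + 1]) > r][0]
```

This still sums the prefix repeatedly, so it keeps the "nastily squared" behaviour of the original; it only repairs the fairness problem.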

Max M wrote:
I finally chose a combination of the suggested approaches, and ended up with something I find is both more functional and nicer to look at than my first try.
Well, that version had a problem. Whenever a series of values are the
same, bisect() always selects the first one, so I never got a
statistically valid sample of elements.
I solved it by using accumulated probabilities instead. The final
version is here, if anybody cares.
I used it in a Markov chain class, that is now pretty fast.
regards Max M
#########################
from random import random, randint
from bisect import bisect

class Selector:
    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        accumulated_probabilities = []
        accumulated_probability = 0
        for probability in probabilities:
            accumulated_probability += probability
            accumulated_probabilities.append(accumulated_probability)
        self.decorated = zip(accumulated_probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0))
        return self.decorated[index][1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]

if __name__ == '__main__':
    probabilities = [1.0, 0.5, 0.25, 0.125, 0.0625]
    # probabilities = [0.1, 0.1, 1.0, 0.1, 0.1, ]
    elements = ['a', 'b', 'c', 'd', 'e']
    s = Selector(probabilities, elements)
    r = s.get_range(10000)
    r.sort()
    for element in elements:
        print element, r.count(element)

##########
a 5148
b 2594
c 1301
d 654
e 303
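For what it's worth, the thread assumes Python 2; since Python 3.6 the standard library does this directly, and the Selector's behaviour can be spot-checked against it (a sketch, not part of the original thread):

```python
import random

random.seed(7)
elements = ['a', 'b', 'c', 'd', 'e']
weights = [1.0, 0.5, 0.25, 0.125, 0.0625]

# random.choices accepts relative weights (or cum_weights) directly
sample = random.choices(elements, weights=weights, k=10000)
counts = {e: sample.count(e) for e in elements}
# each letter should show up roughly half as often as the one before
```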

def get_range(self, n):
    "Randomly returns a range of elements by probability"
    return [self.get() for itm in range(n)]

Use xrange instead of range here. xrange doesn't instantiate a list, so
it is faster and uses less memory for large n.
 Josiah  

Max M <ma**@mxm.dk> wrote:
I solved it by using accumulated probabilities instead. The final version is here, if anybody cares.
I used it in a Markov chain class, that is now pretty fast.
Yes it's fast, but I don't think it's smart :) Unfortunately I
haven't got a working solution as an alternative, but only some code
that "explains" what I mean:
from random import random

def choice_generator(population, probabilities):
    PC = zip(probabilities, population)
    while 1:
        p, c = max([(p*random(), c) for p, c in PC])
        yield c
This is short and fast. However, the distribution of the outcomes is
wrong, because the list of probabilities should be "adjusted" so that
in the end the *outcomes* are distributed according to the
"probabilities". Or should that be proportions?
My statistics skills are a bit rusty, but I think it's an interesting
problem. I'll get back when I have some more ideas. In the meantime
more mathematically educated readers are given a chance to beat me to
it :), or prove it can't be done.
Anton  

Anton Vredegoor wrote:
Max M <ma**@mxm.dk> wrote:
I solved it by using accumulated probabilities instead. The final version is here, if anybody cares.

Yes it's fast, but I don't think it's smart :) Unfortunately I haven't got a working solution as an alternative, but only some code that "explains" what I mean:

def choice_generator(population, probabilities):
    PC = zip(probabilities, population)
    while 1:
        p, c = max([(p*random(), c) for p, c in PC])
        yield c

This is short and fast. However, the distribution of the outcomes is wrong, because the list of probabilities should be "adjusted" so that in the end the *outcomes* are distributed according to the "probabilities". Or should that be proportions?
I don't understand what you mean. If I calculate the deviations from
what is expected, and use a large result set, I get very small deviations.
I am interested in getting a result as a proportion of the probability,
or more correctly in this case, the frequency. If I want it as
probabilities I just have to normalise the sum to 1.0.
This has the advantage that the frequencies can be expressed as integers
too. This is nice in my Markov chain class that counts words in text, etc.
In my example below each letter should occur 50% as often as the
previous letter.
Perhaps you mean that it should behave differently?
regards Max M
###################
probabilities = [16, 8, 4, 2, 1]
elements = ['a', 'b', 'c', 'd', 'e']
sample_size = 1000000
s = Selector(probabilities, elements)
r = s.get_range(sample_size)
r.sort()
previous = float(sample_size)
for element in elements:
    count = r.count(element)
    deviation = (previous/2.0 - count) / count * 100
    previous = count
    print element, count, deviation

a 517046 -3.29680531326
b 257439 0.421070622555
c 129159 -0.340278261677
d 64148 0.672663216312
e 32208 -0.416045702931

Max M <ma**@mxm.dk> wrote:
I don't understand what you mean. If I calculate the deviations from what is expected, and use a large result set, I get very small deviations.

That's what I was afraid of, and I can't blame you for it :)

In my example below each letter should occur 50% as often as the previous letter.
Perhaps you mean that it should behave differently?
No, your code does what it should do and my code was just some
obviously wrong example to show what *kind* of solution I was after.
For example if my code is run with a list of probabilities like this:
[.1,.1,.1,.1,.1] it produces exactly the same distribution as your
code. However, with input like this: [2,1] (*different* probabilities)
it starts to diverge from your code. What I would like to do is to
replace this list [2,1] with the list of probabilities that would
cause my code to produce the same output as yours ...
So [2,1] ==> "some list of probabilities" and this *new* list is used
as multiplication factors in the line with :
"max([(p*random(),c) for p,c in PC])"
I hope I succeeded in just getting the idea across instead of giving
the impression that my code works better, which obviously it doesn't.
It *needs* some kind of transformation of the probabilities into an
other list ...
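One known resolution of the puzzle Anton poses is to change the key rather than the weights: taking the maximum of random() ** (1.0 / w) over the weights w selects each item with probability w / sum(weights) (the Efraimidis-Spirakis weighted-sampling key). A hedged sketch in modern Python, not code from the thread:

```python
import random

def choice_generator(population, weights):
    # max over keys u ** (1/w) picks item i with probability
    # weights[i] / sum(weights)  (Efraimidis-Spirakis key)
    pw = list(zip(weights, population))
    while True:
        key, c = max((random.random() ** (1.0 / w), c) for w, c in pw)
        yield c

random.seed(1)
gen = choice_generator(['x', 'y'], [2, 1])
n = 30000
hits = sum(1 for _ in range(n) if next(gen) == 'x')
# 'x' should appear about 2/3 of the time
```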
Anton
 date asked: Jul 18 '05
