Connecting Tech Pros Worldwide Forums | Help | Site Map

Smarter way of doing this?

Max M
Guest
 
Posts: n/a
#1: Jul 18 '05
I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return it's index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

print values[choose_by_probability(probabilities)][color=blue][color=green]
>>'item 1'[/color][/color]

etc...

I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?


regards Max M


###################

from random import random

def choose_by_probability(probabilities):
"Randomly selects an index from a list of probabilities"
rnd = random() * sum(probabilities)
range_start = 0.0
for i in range(len(probabilities)):
probability = probabilities[i]
range_end = range_start + probability
if range_start < rnd < range_end:
return i
range_start = range_end


probabilities = []
for f in range(0,16):
probabilities.append(1.0 / (f+1.0))

print probabilities

# writing it out sorted to do a visual test...
result = []
for i in range(100):
result.append(choose_by_probability(probabilities) )
result.sort()
print result
Mel Wilson
Guest
 
Posts: n/a
#2: Jul 18 '05

re: Smarter way of doing this?


In article <401e4d57$0$295$edfadb0f@dread12.news.tele.dk>,
Max M <maxm@mxm.dk> wrote:[color=blue]
>I have written this function, "choose_by_probability()"
>
>It takes a list of probabilities in the range 0.00-1.00, and it must
>then randomly select one of the probabilities and return it's index.
>
>The idea is to have two lists, one with a value, and another with a
>probability. The value then gets randomly selected from its probability.
>
>values = ['item 1','item 2','item 3',]
>probabilities = [0.66, 0.33, 0.16]
>
>print values[choose_by_probability(probabilities)][color=green][color=darkred]
> >>'item 1'[/color][/color]
>
>etc...
>
>I just wondered if anybody has a better way of doing it, as this seems
>nastily squared to me?[/color]

My usual way is:


import random

def choose_by_probability (values, probabilities):
cumul = 0.0
for prob, item in zip (probabilities, values):
cumul += prob
if prob > random.random()*cumul:
selected = item
return selected


def run_tests (n):
D = {}
for i in xrange (n):
j = choose_by_probability (['item 1', 'item 2', 'item 3']
, [0.66, 0.33, 0.17])
D[j] = D.get (j, 0) + 1
s = D.items()
s.sort()
for k, v in s:
print k, float(v)/n
print "Total:", sum([v for k, v in D.items()])

run_tests (10000)



Regards. Mel.
Jim Jewett
Guest
 
Posts: n/a
#3: Jul 18 '05

re: Smarter way of doing this?


Max M <maxm@mxm.dk> wrote in message news:<401e4d57$0$295$edfadb0f@dread12.news.tele.dk >...
[color=blue]
> It takes a list of probabilities in the range 0.00-1.00, and it must
> then randomly select one of the probabilities and return it's index.[/color]

You may wish to look at the GRADES example in the bisect documentation.

-jJ Take only memories. Leave not even footprints.
Raymond Hettinger
Guest
 
Posts: n/a
#4: Jul 18 '05

re: Smarter way of doing this?


Max M <maxm@mxm.dk>[color=blue]
> I have written this function, "choose_by_probability()"
>
> It takes a list of probabilities in the range 0.00-1.00, and it must
> then randomly select one of the probabilities and return it's index.
>
> The idea is to have two lists, one with a value, and another with a
> probability. The value then gets randomly selected from its probability.
>
> values = ['item 1','item 2','item 3',]
> probabilities = [0.66, 0.33, 0.16][/color]

Here is a simplified version of the approach you took:

def select(tab, random=random.random):
cp = 0
r = random()
for i, p in enumerate(tab):
cp += p
if cp > r:
return i
raise Exception('probabilities do not add upto 1.0')

If you need to call this many times and for large table, there are
ways to speed this up.

* re-arrange the tables to make sure the largest probabilities come
first. Intentionally or not, your sample data is already arranged this
way (though the probabilities add up to more than 1.0 and should be
scaled back to 1.0).


* pre-summarize the summations with a cumulative probability table:
[.50, .25, .15, .10] --> [.50, .75, .90, 1.00]

def cselect(ctab, random=random.random):
r = random()
for i, cp in enumerate(ctab):
if cp > r:
return i
raise Exception


* if the table is cumulative and large, bisection offers a faster
search:

def cselect(ctab, random=random.random, bisect=bisect.bisect):
return bisect(ctab, random())


* if the probabilities come in integer ratios, you could reduce the
problem to an O(1) lookup:

[.5, .3, .1, .1] --> [0,0,0,0,0,1,1,1,2,3]

def lselect(ltab, choice=random.choice):
return choice(ltab)


IOW, if speed is what is important, then be sure to exploit any known
structure in the data (ordering, cumulativeness, integer rations, etc)
and don't compute anything twice (pre-compute the summation).


Raymond Hettinger
Max M
Guest
 
Posts: n/a
#5: Jul 18 '05

re: Smarter way of doing this?


Jim Jewett wrote:
[color=blue]
> Max M <maxm@mxm.dk> wrote in message news:<401e4d57$0$295$edfadb0f@dread12.news.tele.dk >...
>
> You may wish to look at the GRADES example in the bisect documentation.[/color]

First of, thanks for the replies!

I finally chose a combination of the suggested approaches, and ended up
with something I find is both more functional and nicer to look at than
my first try.

regards Max M


################################

from random import random
from bisect import bisect

class Selector:

"Returns random elements by probability"

def __init__(self, probabilities, elements):
self.sum = sum(probabilities)
self.decorated = zip(probabilities, elements)
self.decorated.sort()

def get(self):
"Randomly returns an element by probability"
rnd = random() * self.sum
index = bisect(self.decorated, (rnd, 0))-1
return self.decorated[index][-1]

def get_range(self, n):
"Randomly returns a range of elements by probability"
return [self.get() for itm in range(n)]


if __name__ == '__main__':

probabilities = []
for f in range(0,10):
probabilities.append(1.0 / (f+1.0))

elements = ['a','b','c','d','e','f','g','h','i','j',]

s = Selector(probabilities, elements)
result = s.get_range(1000)
result.sort()

print result
Max M
Guest
 
Posts: n/a
#6: Jul 18 '05

re: Smarter way of doing this?


Raymond Hettinger wrote:
[color=blue]
> Here is a simplified version of the approach you took:
>
> def select(tab, random=random.random):
> cp = 0
> r = random()
> for i, p in enumerate(tab):
> cp += p
> if cp > r:
> return i
> raise Exception('probabilities do not add upto 1.0')[/color]
[color=blue]
> * re-arrange the tables to make sure the largest probabilities come
> first. Intentionally or not, your sample data is already arranged this
> way (though the probabilities add up to more than 1.0 and should be
> scaled back to 1.0).[/color]


It doesn't really have to be scaled to 1.0 as long as they get selected
with a frequency that corresponds to their probability.

[color=blue]
> IOW, if speed is what is important, then be sure to exploit any known
> structure in the data (ordering, cumulativeness, integer rations, etc)
> and don't compute anything twice (pre-compute the summation).[/color]


Lot's of good stuff. Thanks.

Max M
Roberto A. F. De Almeida
Guest
 
Posts: n/a
#7: Jul 18 '05

re: Smarter way of doing this?


Max M <maxm@mxm.dk> wrote in message news:<401e4d57$0$295$edfadb0f@dread12.news.tele.dk >...[color=blue]
> I have written this function, "choose_by_probability()"
>
> It takes a list of probabilities in the range 0.00-1.00, and it must
> then randomly select one of the probabilities and return it's index.
>
> The idea is to have two lists, one with a value, and another with a
> probability. The value then gets randomly selected from its probability.
>
> values = ['item 1','item 2','item 3',]
> probabilities = [0.66, 0.33, 0.16]
>
> print values[choose_by_probability(probabilities)][color=green][color=darkred]
> >>'item 1'[/color][/color]
>
> etc...
>
> I just wondered if anybody has a better way of doing it, as this seems
> nastily squared to me?[/color]

In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]

[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]

R.
Max M
Guest
 
Posts: n/a
#8: Jul 18 '05

re: Smarter way of doing this?


Roberto A. F. De Almeida wrote:
[color=blue]
> In one line:
>
> v = ['item 1','item 2','item 3',]
> p = [0.66, 0.33, 0.16]
>
> [v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0][/color]


uh, that's hard to read :-)

But doesn't random.random()*sum(p) get executed for each probability in p?

Then each probability in p will be compared to a different random value.
That isn't fair...


regards Max m
Roberto Antonio Ferreira De Almeida
Guest
 
Posts: n/a
#9: Jul 18 '05

re: Smarter way of doing this?


Max M wrote:[color=blue]
> Roberto A. F. De Almeida wrote:
>[color=green]
>> In one line:
>>
>> v = ['item 1','item 2','item 3',]
>> p = [0.66, 0.33, 0.16]
>>
>> [v[i] for i,j in enumerate(p) if sum(p[:i]) >
>> (random.random()*sum(p))][0][/color]
>
> uh, that's hard to read :-)[/color]

Yes, I know. I just wanted to see if I could do it in one line. :)

But it just calculates the cumulative probability, and gets the first
one which passes a random value between 0 and the sum of the probabilities.
[color=blue]
> But doesn't random.random()*sum(p) get executed for each probability in p?[/color]

I actually don't know. You could call random.random() out of the list
comprehension... :)

R.
Max M
Guest
 
Posts: n/a
#10: Jul 18 '05

re: Smarter way of doing this?


Max M wrote:
[color=blue]
> Jim Jewett wrote:[/color]
[color=blue]
> I finally chose a combination of the suggested approaches, and ended up
> with something I find is both more functional and nicer to look at than
> my first try.[/color]


Well, that version had a problem. Whenever a series of values are the
same bisect() allways selects the first one. So I never got a
statistical valid sample of elements.

I solved it by using acumulated probabilities instead. So the final
version is here, if anybody cares.

I used it in a Markov chain class, that is now pretty fast.

regards Max M



#########################

from random import random, randint
from bisect import bisect


class Selector:

"Returns random elements by probability"

def __init__(self, probabilities, elements):
self.sum = sum(probabilities)
acumulated_probabilities = []
acumulated_probability = 0
for probability in probabilities:
acumulated_probability += probability
acumulated_probabilities.append(acumulated_probabi lity)
self.decorated = zip(acumulated_probabilities, elements)
self.decorated.sort()

def get(self):
"Randomly returns an element by probability"
rnd = random() * self.sum
index = bisect(self.decorated, (rnd, 0))
self.decorated[index][-1]
return self.decorated[index][-1]

def get_range(self, n):
"Randomly returns a range of elements by probability"
return [self.get() for itm in range(n)]





if __name__ == '__main__':


probabilities = [1.0, 0.5, 0.25, 0.125, 0.0625]
# probabilities = [0.1, 0.1, 1.0, 0.1, 0.1, ]
elements = ['a', 'b', 'c', 'd', 'e']
s = Selector(probabilities, elements)
r = s.get_range(10000)
r.sort()

for element in elements:
print element, r.count(element)


##########[color=blue][color=green]
>>a 5148
>>b 2594
>>c 1301
>>d 654
>>e 303[/color][/color]
Josiah Carlson
Guest
 
Posts: n/a
#11: Jul 18 '05

re: Smarter way of doing this?


> def get_range(self, n):[color=blue]
> "Randomly returns a range of elements by probability"
> return [self.get() for itm in range(n)][/color]

Use xrange here ^^^^^
xrange doesn't instantiate a list, so is faster and uses less memory for
large n.

- Josiah
Anton Vredegoor
Guest
 
Posts: n/a
#12: Jul 18 '05

re: Smarter way of doing this?


Max M <maxm@mxm.dk> wrote:
[color=blue]
>I solved it by using acumulated probabilities instead. So the final
>version is here, if anybody cares.
>
>I used it in a Markov chain class, that is now pretty fast.[/color]

Yes it's fast, but I don't think it's smart :-) Unfortunately I
haven't got a working solution as an alternative, but only some code
that "explains" what I mean:


def choice_generator(population,probabilities):
PC = zip(probabilities,population)
while 1:
p,c = max([(p*random(),c) for p,c in PC])
yield c

This is short and fast. However, the distribution of the outcomes is
wrong, because the list of probabilities should be "adjusted" so that
in the end the *outcomes* are distributed according to the
"probabilities". Or should that be proportions?

My statistics skills are a bit rusty, but I think it's an interesting
problem. I'll get back when I have some more ideas. In the meantime
more mathematically educated readers are given a chance to beat me to
it :-), or prove it can't be done.

Anton
Max M
Guest
 
Posts: n/a
#13: Jul 18 '05

re: Smarter way of doing this?


Anton Vredegoor wrote:
[color=blue]
> Max M <maxm@mxm.dk> wrote:
>
>[color=green]
>>I solved it by using acumulated probabilities instead. So the final
>>version is here, if anybody cares.[/color]
>
> Yes it's fast, but I don't think it's smart :-) Unfortunately I
> haven't got a working solution as an alternative, but only some code
> that "explains" what I mean:
>
>
> def choice_generator(population,probabilities):
> PC = zip(probabilities,population)
> while 1:
> p,c = max([(p*random(),c) for p,c in PC])
> yield c
>
> This is short and fast. However, the distribution of the outcomes is
> wrong, because the list of probabilities should be "adjusted" so that
> in the end the *outcomes* are distributed according to the
> "probabilities". Or should that be proportions?[/color]


I don't understand what you mean. If I calculate the deviations from
what is expected, and use a large result set, I get very small deviations.

I am interrested in getting a result as a propertion of the probablilty,
or more correctly in this case, the frequency. If I want it as
probabilities I just have to normalise the sim to 1.0.

This has the advantage that the frequencies can be expressed as integers
too. This is nice in my Markov chain class that count words in text, etc.

In my example below each letter should be occur 50% of the times of the
previous letter.

Perhaps you mean that it should behave differently?

regards Max M



###################

probabilities = [16, 8, 4, 2, 1]
elements = ['a', 'b', 'c', 'd', 'e']
sample_size = 1000000

s = Selector(probabilities, elements)
r = s.get_range(sample_size)
r.sort()

previous = float(sample_size)
for element in elements:
count = r.count(element)
deviation = (previous/2.0-count) / count * 100
previous = count
print element, count, deviation

[color=blue][color=green]
>> a 517046 -3.29680531326
>> b 257439 0.421070622555
>> c 129159 -0.340278261677
>> d 64148 0.672663216312
>> e 32208 -0.416045702931[/color][/color]
Anton Vredegoor
Guest
 
Posts: n/a
#14: Jul 18 '05

re: Smarter way of doing this?


Max M <maxm@mxm.dk> wrote:
[color=blue]
>I don't understand what you mean. If I calculate the deviations from
>what is expected, and use a large result set, I get very small deviations.[/color]

That's what I was afraid of, and I can't blame you for it :-)
[color=blue]
>In my example below each letter should be occur 50% of the times of the
>previous letter.
>
>Perhaps you mean that it should behave differently?[/color]

No, your code does what it should do and my code was just some
obviously wrong example to show what *kind* of solution I was after.

For example if my code is run with a list of probabilities like this:

[.1,.1,.1,.1,.1] it produces exactly the same distribution as your
code. However, with input like this: [2,1] (*different* probabilities)
it starts to diverge from your code. What I would like to do is to
replace this list [2,1] with the list of probabilities that would
cause my code to produce the same output as yours ...

So [2,1] ==> "some list of probabilities" and this *new* list is used
as multiplication factors in the line with :

"max([(p*random(),c) for p,c in PC])"

I hope I succeeded in just getting the idea across instead of giving
the impression that my code works better, which obviously it doesn't.
It *needs* some kind of transformation of the probabilities into an
other list ...

Anton
Closed Thread