
# Smarter way of doing this?

I have written this function, "choose_by_probability()". It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index. The idea is to have two lists, one with a value, and another with a probability. The value then gets randomly selected from its probability.

```
values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]
print values[choose_by_probability(probabilities)]
'item 1'
```

etc... I just wondered if anybody has a better way of doing it, as this seems nastily squared to me?

regards Max M

```
from random import random

def choose_by_probability(probabilities):
    "Randomly selects an index from a list of probabilities"
    rnd = random() * sum(probabilities)
    range_start = 0.0
    for i in range(len(probabilities)):
        probability = probabilities[i]
        range_end = range_start + probability
        if range_start < rnd < range_end:
            return i
        range_start = range_end

probabilities = []
for f in range(0, 16):
    probabilities.append(1.0 / (f + 1.0))
print probabilities

# writing it out sorted to do a visual test...
result = []
for i in range(100):
    result.append(choose_by_probability(probabilities))
result.sort()
print result
```

Jul 18 '05 #1
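(A note for readers on modern Python: since Python 3.6 the standard library does exactly this, so no hand-rolled loop is needed. A minimal sketch, assuming Python 3.6+:)

```python
import random

values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]   # relative weights; they need not sum to 1.0

# random.choices() builds the cumulative weight table and bisects it
# internally, which is the same technique discussed in this thread.
picked = random.choices(values, weights=probabilities, k=1)[0]
print(picked)
```

`random.choices` also accepts `cum_weights=` directly if the cumulative table is already precomputed.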
13 Replies

In article <40*********************@dread12.news.tele.dk>, Max M wrote:

> I have written this function, "choose_by_probability()". It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index. The idea is to have two lists, one with a value, and another with a probability. The value then gets randomly selected from its probability.
>
>     values = ['item 1', 'item 2', 'item 3']
>     probabilities = [0.66, 0.33, 0.16]
>     print values[choose_by_probability(probabilities)]
>     'item 1'
>
> etc... I just wondered if anybody has a better way of doing it, as this seems nastily squared to me?

My usual way is:

```
import random

def choose_by_probability(values, probabilities):
    cumul = 0.0
    for prob, item in zip(probabilities, values):
        cumul += prob
        if prob > random.random() * cumul:
            selected = item
    return selected

def run_tests(n):
    D = {}
    for i in xrange(n):
        j = choose_by_probability(['item 1', 'item 2', 'item 3'],
                                  [0.66, 0.33, 0.17])
        D[j] = D.get(j, 0) + 1
    s = D.items()
    s.sort()
    for k, v in s:
        print k, float(v)/n
    print "Total:", sum([v for k, v in D.items()])

run_tests(10000)
```

Regards. Mel.

Jul 18 '05 #2
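Mel's loop is the classic single-pass weighted choice: the running item survives each step with probability proportional to its weight, so no precomputed total is needed. Restated as a self-contained Python 3 sketch (the function name is mine):

```python
import random

def choose_streaming(values, weights):
    """Single-pass weighted choice: item i replaces the current
    selection with probability w_i / (w_1 + ... + w_i), which makes
    the final selection probability exactly w_i / sum(w)."""
    cumul = 0.0
    selected = None
    for w, item in zip(weights, values):
        cumul += w
        # keep this item with probability w / cumul
        if w > random.random() * cumul:
            selected = item
    return selected

print(choose_streaming(['item 1', 'item 2', 'item 3'], [0.66, 0.33, 0.17]))
```

This is handy when the weights arrive as a stream and you don't want two passes; for repeated draws from a fixed table, the cumulative-table approaches later in the thread are faster.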

Max M wrote in message news:<40*********************@dread12.news.tele.dk>...

> It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index.

You may wish to look at the GRADES example in the bisect documentation.

-jJ

Take only memories. Leave not even footprints.

Jul 18 '05 #3

Max M wrote:

> I have written this function, "choose_by_probability()". It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index. The idea is to have two lists, one with a value, and another with a probability. The value then gets randomly selected from its probability.
>
>     values = ['item 1', 'item 2', 'item 3']
>     probabilities = [0.66, 0.33, 0.16]

Here is a simplified version of the approach you took:

```
def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add up to 1.0')
```

If you need to call this many times and for a large table, there are ways to speed this up.

* Re-arrange the tables to make sure the largest probabilities come first. Intentionally or not, your sample data is already arranged this way (though the probabilities add up to more than 1.0 and should be scaled back to 1.0).

* Pre-summarize the summations with a cumulative probability table: [.50, .25, .15, .10] --> [.50, .75, .90, 1.00]

```
def cselect(ctab, random=random.random):
    r = random()
    for i, cp in enumerate(ctab):
        if cp > r:
            return i
    raise Exception
```

* If the table is cumulative and large, bisection offers a faster search:

```
def cselect(ctab, random=random.random, bisect=bisect.bisect):
    return bisect(ctab, random())
```

* If the probabilities come in integer ratios, you could reduce the problem to an O(1) lookup: [.5, .3, .1, .1] --> [0,0,0,0,0,1,1,1,2,3]

```
def lselect(ltab, choice=random.choice):
    return choice(ltab)
```

IOW, if speed is what is important, then be sure to exploit any known structure in the data (ordering, cumulativeness, integer ratios, etc.) and don't compute anything twice (pre-compute the summation).

Raymond Hettinger

Jul 18 '05 #4
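The cumulative-table-plus-bisection idea above can be combined into one self-contained sketch; here is a Python 3 version (the `make_cselect` factory is my framing, using `itertools.accumulate` to build the cumulative table):

```python
import random
from bisect import bisect
from itertools import accumulate

def make_cselect(weights):
    """Build the cumulative table once; every draw is then a single
    O(log n) bisection instead of an O(n) scan."""
    ctab = list(accumulate(weights))   # [.5, .25, .15, .10] -> [.5, .75, .90, 1.0]
    total = ctab[-1]                   # weights need not sum to 1.0
    def cselect():
        # bisect returns the first index whose cumulative weight
        # exceeds the random draw
        return bisect(ctab, random.random() * total)
    return cselect

select = make_cselect([0.5, 0.25, 0.15, 0.10])
print(select())   # an index in 0..3; index 0 is the most likely
```

Scaling by `total` means the "probabilities must add up to 1.0" restriction disappears: any positive relative weights work.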

Jim Jewett wrote:

> You may wish to look at the GRADES example in the bisect documentation.

First off, thanks for the replies! I finally chose a combination of the suggested approaches, and ended up with something I find is both more functional and nicer to look at than my first try.

regards Max M

```
from random import random
from bisect import bisect

class Selector:
    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        self.decorated = zip(probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0)) - 1
        return self.decorated[index][-1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]

if __name__ == '__main__':
    probabilities = []
    for f in range(0, 10):
        probabilities.append(1.0 / (f + 1.0))
    elements = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    s = Selector(probabilities, elements)
    result = s.get_range(1000)
    result.sort()
    print result
```

Jul 18 '05 #5

Raymond Hettinger wrote:

> Here is a simplified version of the approach you took:
>
>     def select(tab, random=random.random):
>         cp = 0
>         r = random()
>         for i, p in enumerate(tab):
>             cp += p
>             if cp > r:
>                 return i
>         raise Exception('probabilities do not add up to 1.0')
>
> * Re-arrange the tables to make sure the largest probabilities come first. Intentionally or not, your sample data is already arranged this way (though the probabilities add up to more than 1.0 and should be scaled back to 1.0).

It doesn't really have to be scaled to 1.0, as long as they get selected with a frequency that corresponds to their probability.

> IOW, if speed is what is important, then be sure to exploit any known structure in the data (ordering, cumulativeness, integer ratios, etc.) and don't compute anything twice (pre-compute the summation).

Lots of good stuff. Thanks.

Max M

Jul 18 '05 #6

Max M wrote in message news:<40*********************@dread12.news.tele.dk>...

> I have written this function, "choose_by_probability()". It takes a list of probabilities in the range 0.00-1.00, and it must then randomly select one of the probabilities and return its index. [...] I just wondered if anybody has a better way of doing it, as this seems nastily squared to me?

In one line:

```
v = ['item 1', 'item 2', 'item 3']
p = [0.66, 0.33, 0.16]
[v[i] for i, j in enumerate(p) if sum(p[:i]) > (random.random() * sum(p))]
```

R.

Jul 18 '05 #7

Roberto A. F. De Almeida wrote:

> In one line:
>
>     v = ['item 1', 'item 2', 'item 3']
>     p = [0.66, 0.33, 0.16]
>     [v[i] for i, j in enumerate(p) if sum(p[:i]) > (random.random() * sum(p))]

uh, that's hard to read :-)

But doesn't random.random()*sum(p) get executed for each probability in p? Then each probability in p will be compared to a different random value. That isn't fair...

regards Max M

Jul 18 '05 #8

Max M wrote:

> uh, that's hard to read :-)

Yes, I know. I just wanted to see if I could do it in one line. :) It just calculates the cumulative probability, and gets the first one which passes a random value between 0 and the sum of the probabilities.

> But doesn't random.random()*sum(p) get executed for each probability in p?

I actually don't know. You could call random.random() outside the list comprehension... :)

R.

Jul 18 '05 #9
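Since the thread never shows the fixed one-liner: one way is to draw the random value once, outside the scan, and take the first index whose cumulative sum exceeds it. A sketch (my illustration, not Roberto's code, using Python 3's `itertools.accumulate`):

```python
import random
from itertools import accumulate

v = ['item 1', 'item 2', 'item 3']
p = [0.66, 0.33, 0.16]

r = random.random() * sum(p)   # drawn exactly once
pick = v[next(i for i, c in enumerate(accumulate(p)) if c > r)]
print(pick)
```

Using `accumulate` also removes the quadratic `sum(p[:i])` that made the original "nastily squared".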

Max M wrote:

> I finally chose a combination of the suggested approaches, and ended up with something I find is both more functional and nicer to look at than my first try.

Well, that version had a problem: whenever a series of values are the same, bisect() always selects the first one, so I never got a statistically valid sample of elements. I solved it by using accumulated probabilities instead. So the final version is here, if anybody cares. I used it in a Markov chain class that is now pretty fast.

regards Max M

```
from random import random, randint
from bisect import bisect

class Selector:
    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        acumulated_probabilities = []
        acumulated_probability = 0
        for probability in probabilities:
            acumulated_probability += probability
            acumulated_probabilities.append(acumulated_probability)
        self.decorated = zip(acumulated_probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0))
        return self.decorated[index][-1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]

if __name__ == '__main__':
    probabilities = [1.0, 0.5, 0.25, 0.125, 0.0625]
    # probabilities = [0.1, 0.1, 1.0, 0.1, 0.1]
    elements = ['a', 'b', 'c', 'd', 'e']
    s = Selector(probabilities, elements)
    r = s.get_range(10000)
    r.sort()
    for element in elements:
        print element, r.count(element)
```

Output:

```
a 5148
b 2594
c 1301
d 654
e 303
```

Jul 18 '05 #10
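For later readers, here is a Python 3 sketch of the same accumulated-probability idea (names and structure are my adaptation, not Max M's code). Bisecting the cumulative list directly avoids the tuple-comparison trick entirely, so equal probabilities can never collide:

```python
import random
from bisect import bisect
from itertools import accumulate

class Selector:
    "Returns random elements by (relative) probability."

    def __init__(self, probabilities, elements):
        # cumulative weights, e.g. [1.0, 0.5, 0.25] -> [1.0, 1.5, 1.75]
        self.cumulative = list(accumulate(probabilities))
        self.total = self.cumulative[-1]
        self.elements = list(elements)

    def get(self):
        "Randomly returns one element, weighted by probability."
        rnd = random.random() * self.total
        return self.elements[bisect(self.cumulative, rnd)]

    def get_range(self, n):
        "Randomly returns n independently drawn elements."
        return [self.get() for _ in range(n)]
```

Keeping elements in a separate list (instead of zipped `(probability, element)` pairs) also means the elements themselves never need to be comparable or sortable.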

> def get_range(self, n):
>     "Randomly returns a range of elements by probability"
>     return [self.get() for itm in range(n)]

Use xrange here instead of range: xrange doesn't instantiate a list, so it is faster and uses less memory for large n.

- Josiah

Jul 18 '05 #11

Max M wrote:

> I solved it by using acumulated probabilities instead. So the final version is here, if anybody cares. I used it in a Markov chain class, that is now pretty fast.

Yes it's fast, but I don't think it's smart :-) Unfortunately I haven't got a working solution as an alternative, but only some code that "explains" what I mean:

```
def choice_generator(population, probabilities):
    PC = zip(probabilities, population)
    while 1:
        p, c = max([(p*random(), c) for p, c in PC])
        yield c
```

This is short and fast. However, the distribution of the outcomes is wrong, because the list of probabilities should be "adjusted" so that in the end the *outcomes* are distributed according to the "probabilities". Or should that be proportions? My statistics skills are a bit rusty, but I think it's an interesting problem. I'll get back when I have some more ideas. In the meantime more mathematically educated readers are given a chance to beat me to it :-), or prove it can't be done.

Anton

Jul 18 '05 #12

Anton Vredegoor wrote:

> Yes it's fast, but I don't think it's smart :-) Unfortunately I haven't got a working solution as an alternative, but only some code that "explains" what I mean:
>
>     def choice_generator(population, probabilities):
>         PC = zip(probabilities, population)
>         while 1:
>             p, c = max([(p*random(), c) for p, c in PC])
>             yield c
>
> This is short and fast. However, the distribution of the outcomes is wrong, because the list of probabilities should be "adjusted" so that in the end the *outcomes* are distributed according to the "probabilities". Or should that be proportions?

I don't understand what you mean. If I calculate the deviations from what is expected, and use a large result set, I get very small deviations.

I am interested in getting a result as a proportion of the probability, or more correctly in this case, the frequency. If I want it as probabilities, I just have to normalise the sum to 1.0. This has the advantage that the frequencies can be expressed as integers too. This is nice in my Markov chain class that counts words in text, etc.

In my example below each letter should occur 50% of the times of the previous letter. Perhaps you mean that it should behave differently?

regards Max M

```
probabilities = [16, 8, 4, 2, 1]
elements = ['a', 'b', 'c', 'd', 'e']
sample_size = 1000000

s = Selector(probabilities, elements)
r = s.get_range(sample_size)
r.sort()

previous = float(sample_size)
for element in elements:
    count = r.count(element)
    deviation = (previous / 2.0 - count) / count * 100
    previous = count
    print element, count, deviation
```

Output:

```
a 517046 -3.29680531326
b 257439 0.421070622555
c 129159 -0.340278261677
d 64148 0.672663216312
e 32208 -0.416045702931
```

Jul 18 '05 #13

Max M wrote:

> I don't understand what you mean. If I calculate the deviations from what is expected, and use a large result set, I get very small deviations.

That's what I was afraid of, and I can't blame you for it :-)

> In my example below each letter should occur 50% of the times of the previous letter. Perhaps you mean that it should behave differently?

No, your code does what it should do, and my code was just some obviously wrong example to show what *kind* of solution I was after. For example, if my code is run with a list of probabilities like this: [.1, .1, .1, .1, .1], it produces exactly the same distribution as your code. However, with input like [2, 1] (*different* probabilities) it starts to diverge from your code.

What I would like to do is to replace this list [2, 1] with the list of probabilities that would cause my code to produce the same output as yours. So [2, 1] ==> "some list of probabilities", and this *new* list is used as multiplication factors in the line with `max([(p*random(), c) for p, c in PC])`.

I hope I succeeded in just getting the idea across, instead of giving the impression that my code works better, which obviously it doesn't. It *needs* some kind of transformation of the probabilities into another list ...

Anton

Jul 18 '05 #14
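A postscript for the archive: a transformation in the spirit Anton describes does exist, though not as a fixed list of linear multipliers. If each item's key is `random() ** (1.0 / p)` instead of `p * random()`, then taking the max selects item i with probability exactly p_i / sum(p) (the exponent trick used in weighted random sampling, per Efraimidis & Spirakis). A sketch, with my own function name:

```python
import random

def keyed_choice_generator(population, probabilities):
    """Yield weighted random choices: key each item with
    random() ** (1/p) and take the max.  P(key_i < x) = x**p_i,
    so P(item i has the largest key) = p_i / sum(p)."""
    pc = list(zip(probabilities, population))
    while True:
        _, choice = max((random.random() ** (1.0 / p), c) for p, c in pc)
        yield choice
```

This keeps the elegance of Anton's max-over-keys formulation while producing the same distribution as the cumulative-table versions, at the cost of drawing one random number per item per choice.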

### This discussion thread is closed

Replies have been disabled for this discussion. 