Sampling a population

Brian Quinlan

This is less a Python question and more a optimization/probability
question. Imaging that you have a list of objects and there frequency in
a population e.g.

lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]

and you want to drawn n items from that list (duplicates allowed), with
that probability distribution.

The fastest algorithm that I have been able to devise for doing so is:
O(n * log(len(lst))). Can anyone think or a solution with a better time
complexity? If not, is there an obviously better way to do this
(assuming n is big and the list size is small).

Here is the code:

from random import random
from bisect import bisect

def draw(n, lst):
ps = []
last = 0
for p in lst:
ps.append(last + p)
last += p

# ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]

chosen = [0] * len(lst) # track frequency
for i in range(n):
r = random()

chosen[bisect(ps, r)] += 1 # binary search and increment

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result

lst = [0.01, 0.05, 0.50, 0.30, 0.04, 0.10]
print draw(10000, lst)

Jun 2 '06 #1

Subscribe Post Reply

1214

Duncan Smith

Brian Quinlan wrote:

This is less a Python question and more a optimization/probability
question. Imaging that you have a list of objects and there frequency in
a population e.g.

lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]

and you want to drawn n items from that list (duplicates allowed), with
that probability distribution.

The fastest algorithm that I have been able to devise for doing so is:
O(n * log(len(lst))). Can anyone think or a solution with a better time
complexity? If not, is there an obviously better way to do this
(assuming n is big and the list size is small).

Here is the code:

from random import random
from bisect import bisect

def draw(n, lst):
ps = []
last = 0
for p in lst:
ps.append(last + p)
last += p

# ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]

chosen = [0] * len(lst) # track frequency
for i in range(n):
r = random()

chosen[bisect(ps, r)] += 1 # binary search and increment

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result

lst = [0.01, 0.05, 0.50, 0.30, 0.04, 0.10]
print draw(10000, lst)

I would do something like the following (maybe with an additional check
that the probabilities do not sum to less than 1),

from random import random
import operator
def draw(n, lst): lst.sort(key=operator.itemgetter(1), reverse=True)
cumprobs = []
this_cp = 0
for p in lst:
this_cp += p[1]
cumprobs.append(this_cp)
for _ in xrange(n):
rnd = random()
for i, cumprob in enumerate(cumprobs):
if rnd < cumprob:
yield lst[i][0]
break

lst = [('a', 0.01), ('b', 0.05), ('c', 0.50), ('d', 0.30), ('e', 0.04), ('f', 0.10)] list(draw(8, lst)) ['d', 'd', 'c', 'e', 'c', 'd', 'c', 'f']

The sorting of the list means that iterating over the cumulative
probabilities is minimised (for your density function the inner loop
will be broken out of after the first iteration 50% of the time). (I've
assumed that it doesn't matter to you that the list is sorted.) I'm not
sure that this is in any way optimal.

Duncan

Jun 2 '06 #2

Mitja Trampus

Brian Quinlan wrote:

The fastest algorithm that I have been able to devise for doing so is:
O(n * log(len(lst))). Can anyone think or a solution with a better time
complexity? If not, is there an obviously better way to do this
(assuming n is big and the list size is small).

If list is indeed short, I'd say common sense speaks against complicating and optimizing
further - you can only get log(len(lst))-fold speedup, which is in your case more or less
a small constant. _IF_ this part of code later turns out to be a bottleneck, you might
profit more by porting it to C than searching for an O(n) solution, if it even exists.

Jun 2 '06 #3

Paddy

Brian Quinlan wrote:

This is less a Python question and more a optimization/probability
question. Imaging that you have a list of objects and there frequency in
a population e.g.

lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]

and you want to drawn n items from that list (duplicates allowed), with
that probability distribution.

The fastest algorithm that I have been able to devise for doing so is:
O(n * log(len(lst))). Can anyone think or a solution with a better time
complexity? If not, is there an obviously better way to do this
(assuming n is big and the list size is small).

Any way I tried to slice and dice it, I could not get any faster.
draw2 and draw 3 generate code on the fly. draw4 sneakily tries to
trade memory and accuracy for speed but is even slower!

First the times, then the code:

$ ./timeit.py 'from probDistribution import draw as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
100 loops, best of 3: 13.4 msec per loop

$ ./timeit.py 'from probDistribution import draw2 as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
100 loops, best of 3: 15.2 msec per loop

$ ./timeit.py 'from probDistribution import draw3 as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
100 loops, best of 3: 16.2 msec per loop

$ ./timeit.py 'from probDistribution import draw4 as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
10 loops, best of 3: 30.5 msec per loop
=== CODE probDistribution.py ===
from random import random, randrange
from bisect import bisect

def draw(n, lst):
ps = []
last = 0
for p in lst:
ps.append(last + p)
last += p

# ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]

chosen = [0] * len(lst) # track frequency
for i in range(n):
r = random()

chosen[bisect(ps, r)] += 1 # binary search and increment

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result
def draw2(n, lst):
""" uses dynamicc code generation of this form:
chosen = [0] * 6
for i in xrange(10000):
r = random()
if r < 0.01: chosen[0]+=1
elif r < 0.06: chosen[1] +=1
...
elif r < 0.90: chosen[4] +=1
else chosen[5]+=1
"""
assert len(lst)>1, "Corner case NOT covered"

codestr = 'chosen = [0] * %i\n' % (len(lst),)
codestr += 'for i in xrange(%i):\n r = random()\n' % (n,)

last = 0.0
lstmax = len(lst)-1
for i,p in enumerate(lst):
last += p
if i==0: codestr += ' if r < %g: chosen[%i] +=1\n' %
(last, i)
elif i==lstmax: codestr += ' else: chosen[%i] +=1\n' % (i,)
else: codestr += ' elif r < %g: chosen[%i] +=1\n' %
(last, i)

exec codestr

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result
def draw3(n, lst):
""" uses dynamicc code generation of this form:
chosen = [0] * 6
for i in xrange(10000):
r = random()
chosen[-1+ (
((r<0.01) and 1)
or ((r<0.06) and 2)
...
or ((r<0.90) and 5)
or 6
)] +=1

"""
assert len(lst)>1, "Corner case NOT covered"

codestr = 'chosen = [0] * %i\n' % (len(lst),)
codestr += 'for i in xrange(%i):\n r = random()\n' % (n,)
codestr += ' chosen[-1+ (\n'

last = 0.0
lstmax = len(lst)-1
for i,p in enumerate(lst):
last += p
if i==0: codestr += ' ((r<%g) and 1)\n' % (last)
elif i==lstmax: codestr += ' or %i\n )] +=1\n' % (i+1,)
else: codestr += ' or ((r<%g) and %i)\n' % (last,
i+1)
#return codestr
exec codestr

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result
def draw4(n, lst, precision = 0.01, maxbins=10000):
""" Memory/speed tradeoff by coarse quantizing of frequency values.

"""
assert len(lst)>1, "Corner case NOT covered"
assert 0.0 < precision < 1.0 and (1.0/precision) < maxbins

binmax = int(1.0/precision)
chosenbin = [0] * binmax
for i in xrange(n):
chosenbin[randrange(binmax)] +=1

left, right = 0, 0 # extract bin range for summation
chosen = [0] * len(lst)
last = 0.0
for i,p in enumerate(lst):
last += p
right = int(last/precision)
chosen[i] = sum( chosenbin[left:right] )
left = right

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result

=== END CODE ===

Jun 2 '06 #4

Terry Reedy

"Paddy" <pa*******@netscape.net> wrote in message
news:11*********************@i39g2000cwa.googlegro ups.com...

Brian Quinlan wrote:
This is less a Python question and more a optimization/probability
question. Imaging that you have a list of objects and there frequency in
a population e.g.

lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]

and you want to drawn n items from that list (duplicates allowed), with
that probability distribution.

For a set of probabilities that uneven, you *might* do better to sort high
to low and do a linear search. Or check the two common cases and then do a
binary search on the rest.

tjr

Jun 2 '06 #5

Roger Miller

For your example, since the probabilities are all multiples of 0.01 you
could make a list of 100 elements. Set one of them to a, 5 of them to
b, 50 of them to c, etc. Then just randomly sample from the table
(which is O(1)). Of course if the probabilities can be arbitrary
floating point values then this won't work. But if you can live with
rounding to 3 or 4 digits this is probably the fastest and easiest
approach.

Jun 2 '06 #6

by: steflhermitte | last post by:

Dear cpp-ians, I want to apply a stratified sampling on an image. Following simplified example will explain my problem. The original image I with nrows and ncols is now a vector V of length...

C / C++

runstats sampling oddities

by: kenfar | last post by:

I've got a large table on db2 8.2.1 that I rarely perform runstats on. It has about 600 million rows organized in a single MDC time dimension on a non-dpf warehouse. Anyhow, we recently ran...

DB2 Database

Population program

by: neonman14 | last post by:

Hello I am in Intro to Java Programming and I am having problems with assignment. The Homework assignment is called Population. Population Write a program that will predict the size of a...

Java

As a percentage of world inhabitants, the white population willplummet to a single digit (9.76%) by 2060 from a high-water mark of 27.98% in1950

by: tg | last post by:

http://img522.imageshack.us/img522/8647/scan10005ci7.jpg As a percentage of world inhabitants, the white population will plummet to a single digit (9.76%) by 2060 from a high-water mark of...

C / C++

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

Sampling a population

Similar topics