473,626 Members | 3,304 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Hypergeometric distribution

Hi to all, I need to calculate the hpergeometric distribution:
choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution? The statistical package R can handle
such calculations but I don't want to use python R binding since I want
a standalone app.
Thanks a lot
Ale

Dec 26 '05 #1
24 6267
Raven wrote:
Hi to all, I need to calculate the hpergeometric distribution:
choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution?


Use logarithms.

Specifically,

from scipy import special

def logchoose(n, k):
lgn1 = special.gammaln (n+1)
lgk1 = special.gammaln (k+1)
lgnk1 = special.gammaln (n-k+1)
return lgn1 - (lgnk1 + lgk1)

def gauss_hypergeom (x, r, b, n):
return exp(logchoose(r , x) +
logchoose(b, n-x) -
logchoose(r+b, n))

Or you could use gmpy if you need exact rational arithmetic rather than floating
point.

--
Robert Kern
ro*********@gma il.com

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Dec 26 '05 #2
Raven wrote:
Hi to all, I need to calculate the hpergeometric distribution:
choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution? The statistical package R can handle
such calculations but I don't want to use python R binding since I want
a standalone app.
Thanks a lot
Ale


Ale

I had this code lying about if it helps. I don't know if it's even
correct but it's non-recursive!

def Binomial( n, k ):
ret = 0
if k == 0:
ret = 1
elif k > 0:
a = range( n+1 )
a[0] = 1
for i in a[1:]:
a[i] = 1
for j in range(i-1,0,-1):
a[j] = a[j] + a[j-1]
ret = a[k]
return ret

Gerard

Dec 26 '05 #3
On Mon, 26 Dec 2005 12:18:55 -0800, Raven wrote:
Hi to all, I need to calculate the hpergeometric distribution:
choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int.
Are you sure about that? Python long ints can be as big as you have enough
memory for. My Python can print 10L**10000 to the console with a barely
detectable pause, and 10L**100000 with about a ten second delay. (Most of
that delay is printing it, not calculating it.)

25206 is the first integer whose factorial exceeds 10L**100000, so even if
you are calculating the binomial coefficient using the most naive
algorithm, calculating the factorials and dividing, you should easily be
able to generate it for a,b up to 20,000 unless you have a severe
shortage of memory.
I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist.


What exactly is your problem? What values of hypergeometric( x; r,b,n) fail
for you?

--
Steven.

Dec 27 '05 #4
Raven <ba********@gma il.com> wrote:
Hi to all, I need to calculate the hpergeometric distribution:
choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution? The statistical package R can handle
such calculations but I don't want to use python R binding since I want
a standalone app.


Have you tried with gmpy?
Alex
Dec 27 '05 #5
Thanks to all of you guys, I could resolve my problem using the
logarithms as proposed by Robert. I needed to calculate the factorial
for genomic data, more specifically for the number of genes in the
human genome i.e. about 30.000 and that is a big number :-)
I didn't know gmpy
Thanks a lot, really

Ale

Jan 1 '06 #6
On Sat, 31 Dec 2005 16:24:02 -0800, Raven wrote:
Thanks to all of you guys, I could resolve my problem using the
logarithms as proposed by Robert. I needed to calculate the factorial
for genomic data, more specifically for the number of genes in the
human genome i.e. about 30.000 and that is a big number :-)
I didn't know gmpy
Thanks a lot, really


Are you *sure* the existing functions didn't work? Did you try them?
def log2(x): .... return math.log(x)/math.log(2)
.... n = 0.0
for i in range(1, 300000): # ten times bigger than you need .... n += log2(i)
.... n 5025564.6087276 665 t = time.time(); x = 2L**(int(n) + 1); time.time() - t 0.2664909362792 9688

That's about one quarter of a second to calculate 300,000 factorial
(approximately) , and it shows that the calculations are well within
Python's capabilities.

Of course, converting this 1.5 million-plus digit number to a string takes
a bit longer:
t = time.time(); len(str(x)); time.time() - t 1512846
6939.3762848377 228

A quarter of a second to calculate, and almost two hours to convert to a
string. Lesson one: calculations on longints are fast. Converting them to
strings is not.

As far as your original question goes, try something like this:

(formula from memory, may be wrong)
def bincoeff(n,r): .... x = 1
.... for i in range(r+1, n+1):
.... x *= i
.... for i in range(1, n-r+1):
.... x /= i
.... return x
.... bincoeff(10, 0) 1 bincoeff(10, 1) 10 bincoeff(10, 2) 45 bincoeff(10, 3) 120 import time
t = time.time(); L = bincoeff(30000, 7000); time.time() - t 28.317800045013 428

Less than thirty seconds to calculate a rather large binomial coefficient
exactly. How many digits?
len(str(L)) 7076

If you are calculating hundreds of hypergeometric probabilities, 30
seconds each could be quite painful, but it certainly shows that Python is
capable of doing it without resorting to logarithms which may lose some
significant digits. Although, in fairness, the log function doesn't seem
to lose much accuracy for arguments in the range you are dealing with.
How long does it take to calculate factorials?
def timefact(n): .... # calculate n! and return the time taken in seconds
.... t = time.time()
.... L = 1
.... for i in range(1, n+1):
.... L *= i
.... return time.time() - t
.... timefact(3000) 0.0549139976501 46484 timefact(30000) # equivalent to human genome 5.0699510574340 82 timefact(300000 ) # ten times bigger

4255.2370519638 062

Keep in mind, if you are calculating the hypergeometric probabilities
using raw factorials, you are doing way too much work.
--
Steven.

Jan 1 '06 #7
Thanks Steven for your very interesting post.

This was a critical instance from my problem:
from scipy import comb
comb(14354,174)
inf

The scipy.stats.dis tributions.hype rgeom function uses the scipy.comb
function, so it returned nan since it tries to divide an infinite. I
did not tried to write a self-made function using standard python as I
supposed that the scipy functions reached python's limits but I was
wrong, what a fool :-)
If you are calculating hundreds of hypergeometric probabilities, 30
seconds each could be quite painful, but it certainly shows that Python is
capable of doing it without resorting to logarithms which may lose some
significant digits. Although, in fairness, the log function doesn't seem
to lose much accuracy for arguments in the range you are dealing with.


Yes I am calculating hundreds of hypergeometric probabilities so I need
fast calculations

Ale

Jan 1 '06 #8
"Raven" <ba********@gma il.com> writes:
Yes I am calculating hundreds of hypergeometric probabilities so I need
fast calculations


Can you use Stirling's approximation to get the logs of the factorials?
Jan 1 '06 #9
On Sun, 01 Jan 2006 14:24:39 -0800, Raven wrote:
Thanks Steven for your very interesting post.

This was a critical instance from my problem:
from scipy import comb
comb(14354,174) inf
Curious. It wouldn't surprise me if scipy was using floats, because 'inf'
is usually a floating point value, not an integer.

Using my test code from yesterday, I got:
bincoeff(14354, 174)

111727771935623 249173533679580 244374733360180 53487854593870
070906374894056 044891924883461 446844023623444 09632515556732
335635231613081 458252082763952 387644418578294 54464446478336
901737770950418 910676375517833 240712336253706 19908633625448
310766773824486 162461253466677 378968915481668 98009878730510
574761395158405 427699564142041 306927336297233 05869285300247
645972456505830 620188961902165 086857407612722 931651840L

Took about three seconds on my system.
Yes I am calculating hundreds of hypergeometric probabilities so I
need fast calculations


Another possibility, if you want exact integer maths rather than floating
point with logarithms, is to memoise the binomial coefficients. Something
like this:

# untested
def bincoeff(n,r, \
cache={}):
try:
return cache((n,r))
except KeyError:
x = 1
for i in range(r+1, n+1):
x *= i
for i in range(1, n-r+1):
x /= i
cache((n,r)) = x
return x
--
Steven.

Jan 2 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
7064
by: Michael Peuser | last post by:
Hi, I should like to make a distribution (using Tkinter), with standard DLLs removed. pythonXX.dll is no problem. tcl und tk, which make the mass of mega bytes, cannot be removed because _tkinter.pyd seems to expect them in the same directory (Question 1) Is there a configration or path setting in py2exe/distutils I
3
5208
by: Tan Thuan Seah | last post by:
Hi all, I am looking for a way to generate a random number given the variance of a gaussian distribution(or normal distribution). The mean is 0 but the variance will be a user input. Does C++ have any of this sort of generator available? Or must I use some transformation to get a random number from a standardised normal distribution and map it to my distribution? Any link and references are welcome. Thuan Seah
2
11076
by: joshsackett | last post by:
Hi all, Is it possible to move the distribution database to a new folder/drive without removing replication? I am attempting to do it the same way you would move tempdb: ALTER DATABASE distribution MODIFY FILE (name = distmodel, filename = 'C:\DISTMOVED\distribution.MDF') ALTER DATABASE distribution MODIFY FILE (name = distmodel_log, filename = 'C:\DISTMOVED\distribution.LDF')
2
1917
by: Edward Hua | last post by:
Hi, I'm wondering if there is a function written in C that computes the inverse hypergeometric distribution (i.e., not the density function, but the cumulative distribution) that's been written by anyone? If it's in some online resource, please give me its URL; if it's in print, plese give me the title of the book and/or where it is in the book. Thanks in advance.
12
3105
by: kabradley | last post by:
Hello, Thanks for looking at my post and hopefully having an answer or at least a suggestion to my problem. I currently work at a financial planning office that deals with many clients and accounts. Each client may have multiple accounts such as an individual, IRA, and possibly a joint tenant account. Each one of these accounts for the particular client is contained in a 'Portfolio'. For example, Joe Smith may have 3 accounts: Joe Smith...
1
2636
by: Peter Graf | last post by:
Hi, i tried to evaluate the 2F1(a,b,c,z) hypergeometric function for a=1, b=2-0.1i, c=2.5-0.01i and z=-5e8. I've used the numerical recipes routine but this routine couldn't act with this large value of z. Does someone know a C or C++ routine, which can act with complex b, c and large values of z.
11
11546
by: Alex | last post by:
Hi everybody, I wonder if it is possible in python to produce random numbers according to a user defined distribution? Unfortunately the random module does not contain the distribution I need :-( Many thanks axel
0
1391
by: eGenix Team: M.-A. Lemburg | last post by:
________________________________________________________________________ ANNOUNCING eGenix.com mx Base Distribution Version 3.1.1 for Python 2.6 Open Source Python extensions providing important and useful services for Python programmers.
0
1300
by: M.-A. Lemburg | last post by:
Just to let you know: we also provide binaries and support for Mac OS X Intel and PPC. Thanks to Joe Strout for pinging us about this. On 2008-10-15 17:41, eGenix Team: M.-A. Lemburg wrote: -- Marc-Andre Lemburg eGenix.com
0
8203
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
1
8368
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8512
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
6125
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5576
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4094
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4206
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
1815
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
2
1515
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.