469,888 Members | 1,267 Online

Prime number module

Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8 range
so it would have to fairly efficient

Dag
Jul 18 '05 #1
36 7970
Dag wrote:
Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8
range so it would have to fairly efficient
gmpy is pretty good for this sort of thing, but the primitives it gives
you are quite different -- is_prime to check if a number is prime, and
next_prime to get the smallest prime number larger than a given number.

You'd have to build your own caching on top of this to avoid repeating
computation if you need, e.g., "how many primes are < N" for several
different values of N; and I'm not even sure that gmpy's primitives are
optimal for this (compared with, for example, some kind of sieving).
Anyway, with all of these caveats, here's an example function:

import gmpy

def primes_upto(N):
count = 0
prime = gmpy.mpz(1)
while prime < N:
prime = prime.next_prime()
count += 1
return count

and having saved it in pri.py, here's what I measure in terms of time:

[alex@lancelot python2.3]$python timeit.py -s 'import pri' -s 'M=1000*1000' \ 'pri.primes_upto(M)' 10 loops, best of 3: 2.76e+06 usec per loop [alex@lancelot python2.3]$ python timeit.py -s 'import pri' -s
'M=2*1000*1000' 'pri.primes_upto(M)'
10 loops, best of 3: 4.03e+06 usec per loop

i.e., 2.76 seconds for primes up to 1,000,000 -- about 4 seconds
for primes up to 2,000,000. (This is on my good old trusty Athlon
Linux machine, about 30 months old now and not top-speed even when
new -- I'm sure one of today's typical PC's might take 2 or 3
times less than this, of course). If I was to use this kind of
computation "in production", I'd pre-compute the counts for some
typical N of interest and only generate and count primes in a
narrow band, I think; but it's hard to suggest anything without a
better idea of your application.
Alex

Jul 18 '05 #2
On Mon, 29 Sep 2003 14:53:34 GMT, Dag <da*@velvet.net> wrote:
Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8 range
so it would have to fairly efficient

Dag

I just wrote a fairly simple sieve, which gave primes up to 1,000,000
in a few seconds (1.8GHz P4, Python code). There were 78498 primes in
that range (unless I screwed the code up, but it looked OK for smaller
ranges).

Going for 10^7 took too long for my patience, let alone 10^8, but at
least in theory there should be less than 100 times as many primes in
the range up to 10^8.

So here's the thought - a binary file containing a complete list of
primes up to 10^8 would require roughly 30MB (using 32 bit integers,
which should comfortably handle your requirement). Open in random
access and do a binary search to find a particular prime and the
position in the file should tell you how many primes are smaller than
that one.

30MB shouldn't be too prohibitive on todays machines, though if this
is to be distributed to other people there would of course be issues.
--
Steve Horne

steve at ninereeds dot fsnet dot co dot uk
Jul 18 '05 #3
Stephen Horne <$@.co.uk> wrote previously: |I just wrote a fairly simple sieve, which gave primes up to 1,000,000 |in a few seconds (1.8GHz P4, Python code). There were 78498 primes in |that range...at least in theory there should be less than 100 times as |many primes in the range up to 10^8. Quite a few less, actually. Under Gauss' Prime Number Theorem, an approximation for the number of primes less than N is N/ln(N). I know there have been some slight refinements in this estimate since 1792, but in the ranges we're talking about, it's plenty good enough. So I only expect around 5,428,681 primes less than 10^8 to occur. Well, that's not SO much less than 7.8M. |So here's the thought - a binary file containing a complete list of |primes up to 10^8 would require roughly 30MB (using 32 bit integers, |which should comfortably handle your requirement). I wonder if such a data structure is really necessary. Certainly it produces a class of answers quite quickly. I.e. search for a prime, and its offset immediately gives you the number of primes less than it. Heck, searching for any number, even a composite occurring between primes, works pretty much the same way. Of course, the above approximation gives you a close answer even quicker. But if you are worried about disk storage, one easy shortcut is to store a collection of 16-bit differences between successive primes. That's half the size, and still lets you answer the desired question *pretty* quickly (extra addition is required)... or at least generate a local copy of Horne's data structure in one pass. Moving farther, even this gap structure is quite compressible. Most gaps are quite a bit smaller than 65536, so the highbits are zeros. In fact, I am pretty sure that almost all the gaps are less than 256. So an immediate compression strategy (saving disk space, costing time to recreate the transparent structure) is to store gaps as 8-bit values, with a x00 byte escaping into a larger value (I guess in the next two bytes). Maybe I'll try it, and see how small I can make the data... unless I do my real work :-). Yours, Lulu... -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. Jul 18 '05 #4 "Lulu of the Lotus-Eaters" <me***@gnosis.cx> wrote in message Moving farther, even this gap structure is quite compressible. Most gaps are quite a bit smaller than 65536, so the highbits are zeros. In fact, I am pretty sure that almost all the gaps are less than 256. So an immediate compression strategy (saving disk space, costing time to recreate the transparent structure) is to store gaps as 8-bit values, with a x00 byte escaping into a larger value (I guess in the next two bytes). You can take this to 512 knowing that the gaps will always be an even interval. Emile Jul 18 '05 #5 Lulu of the Lotus-Eaters wrote: Stephen Horne <$@.co.uk> wrote previously:
|I just wrote a fairly simple sieve, which gave primes up to 1,000,000
|in a few seconds (1.8GHz P4, Python code). There were 78498 primes in
|that range...at least in theory there should be less than 100 times as
|many primes in the range up to 10^8.

Quite a few less, actually. Under Gauss' Prime Number Theorem, an
approximation for the number of primes less than N is N/ln(N). I know
there have been some slight refinements in this estimate since 1792, but
in the ranges we're talking about, it's plenty good enough.

So I only expect around 5,428,681 primes less than 10^8 to occur. Well,
that's not SO much less than 7.8M.

|So here's the thought - a binary file containing a complete list of
|primes up to 10^8 would require roughly 30MB (using 32 bit integers,
|which should comfortably handle your requirement).

I wonder if such a data structure is really necessary. Certainly it
produces a class of answers quite quickly. I.e. search for a prime, and
its offset immediately gives you the number of primes less than it.
Heck, searching for any number, even a composite occurring between
primes, works pretty much the same way. Of course, the above
approximation gives you a close answer even quicker.

But if you are worried about disk storage, one easy shortcut is to store
a collection of 16-bit differences between successive primes. That's
half the size, and still lets you answer the desired question *pretty*
quickly (extra addition is required)... or at least generate a local
copy of Horne's data structure in one pass.

Moving farther, even this gap structure is quite compressible. Most
gaps are quite a bit smaller than 65536, so the highbits are zeros. In
fact, I am pretty sure that almost all the gaps are less than 256. So
an immediate compression strategy (saving disk space, costing time to
recreate the transparent structure) is to store gaps as 8-bit values,
with a x00 byte escaping into a larger value (I guess in the next two
bytes).

Maybe I'll try it, and see how small I can make the data... unless I do
my real work :-).
I believe you could implement a hybrid scheme that would be quite fast
and still maintain nearly the same level of compression that you
describe above. In addition to the above compressed data, also store,
uncompressed, every Nth prime. A binary search will get you within N
primes of your answer, to find the exact value, recreate those N-primes.
For a N of, for instance, 64 the level of compression would be minimally
affected but should make finding the number of primes less than a given
number number much faster than the basic compressed scheme.

In fact I wouldn't be suprised if this was faster than the uncompressed
scheme since you're less likely to thrash your memory.

-tim
Yours, Lulu...

--
Keeping medicines from the bloodstreams of the sick; food from the bellies
of the hungry; books from the hands of the uneducated; technology from the
underdeveloped; and putting advocates of freedom in prisons. Intellectual
property is to the 21st century what the slave trade was to the 16th.

Jul 18 '05 #6
Lulu of the Lotus-Eaters wrote:
So I only expect around 5,428,681 primes less than 10^8 to occur.
Well, that's not SO much less than 7.8M.
I found 5,761,455 primes < 1E8.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #7
Lulu of the Lotus Eaters:
So I only expect around 5,428,681 primes less than 10^8 to occur.
Well, that's not SO much less than 7.8M.

Klaus Alexander Seistrup I found 5,761,455 primes < 1E8.

http://www.utm.edu/research/primes/howmany.shtml has the same number.
(found by googling for "5,761,455" - 3rd hit. :)

That's using Proof by Consensus,

Andrew
da***@dalkescientific.com

Jul 18 '05 #8
Hello Dag,
Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8 range
so it would have to fairly efficient

Try gmpy (http://gmpy.sourceforge.net/)

HTH.
Miki
Jul 18 '05 #9
Andrew Dalke wrote:
I found 5,761,455 primes < 1E8.
http://www.utm.edu/research/primes/howmany.shtml has the same
number. (found by googling for "5,761,455" - 3rd hit. :)

Cool!
That's using Proof by Consensus,
:-)

// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #10
On Mon, 29 Sep 2003 17:20:21 -0400, Lulu of the Lotus-Eaters <me***@gnosis.cx> wrote:
Stephen Horne <$@.co.uk> wrote previously: |I just wrote a fairly simple sieve, which gave primes up to 1,000,000 |in a few seconds (1.8GHz P4, Python code). There were 78498 primes in |that range...at least in theory there should be less than 100 times as |many primes in the range up to 10^8. Quite a few less, actually. Under Gauss' Prime Number Theorem, an approximation for the number of primes less than N is N/ln(N). I know there have been some slight refinements in this estimate since 1792, but in the ranges we're talking about, it's plenty good enough. So I only expect around 5,428,681 primes less than 10^8 to occur. Well, that's not SO much less than 7.8M. |So here's the thought - a binary file containing a complete list of |primes up to 10^8 would require roughly 30MB (using 32 bit integers, |which should comfortably handle your requirement). If the object is to be able to test primeness of an arbitrary candidate number in range(10**8), I think I might just make a bit map of the 10**8 on disk, which would only be 10**8/8000000. 12.5 megabytes. In fact, that's not that bad to hold in memory, but you could calculate an exact sector to seek to and read and then you can check the bit pretty fast. It might be interesting to make a timed access map for sectors, to see what kinds of seek/read delays there are on the average. But it would be hard/timeconsuming to reset/flush the OS and controller cache states to get mechanical delay info. Perhaps it has to be done at the BIOS level out of non-protected DOS or a disk mounted raw... Anyway, getting to a known (file-relative) sector and only reading that sector ought to be about as fast as you could get the info off the disk for an arbitrary number, IWT. Nth-prime and how-many-below-N would need some other representations. I wonder if such a data structure is really necessary. Certainly it produces a class of answers quite quickly. I.e. search for a prime, and its offset immediately gives you the number of primes less than it. Heck, searching for any number, even a composite occurring between primes, works pretty much the same way. Of course, the above approximation gives you a close answer even quicker. But if you are worried about disk storage, one easy shortcut is to store a collection of 16-bit differences between successive primes. That's half the size, and still lets you answer the desired question *pretty* quickly (extra addition is required)... or at least generate a local copy of Horne's data structure in one pass. Moving farther, even this gap structure is quite compressible. Most gaps are quite a bit smaller than 65536, so the highbits are zeros. In fact, I am pretty sure that almost all the gaps are less than 256. So an immediate compression strategy (saving disk space, costing time to recreate the transparent structure) is to store gaps as 8-bit values, with a x00 byte escaping into a larger value (I guess in the next two bytes). I wouldn't be surprised if most of the gaps are less than 16... I wonder what the largest gap between primes in range(10**8) is, and what the distribution of gaps might be. Maybe I'll try it, and see how small I can make the data... unless I do my real work :-). Hm, the bit map stores the same information in another form, so if you generate that, a zip compression of that ought to give an idea of how small you are competing with. You could translate the bit map to a sequence of prime counts for 8 or 16-bit chunks with a simple 2*8 or 2**16 byte table mapping bytes or 16-bit chunks to the counts of bits within them, then sum and only fiddle with one bit chunk to make an nth-prime table... The bit map ought to be pretty compressible, since all the bits at indices of a multiple of 2, 3, 5, and 7 beyond the first byte should be zero, so even as bytes, the alphabet ought to contain a lot less than 256 states. Or you could pre-compress the bit map using a mask of redundant zeroes (beyond the first chunk) generated from some smallish set of small primes like 2,3,5,7. Probably hard to beat zip though. Regards, Bengt Richter Jul 18 '05 #11 Tim Hochberg wrote: ... I believe you could implement a hybrid scheme that would be quite fast and still maintain nearly the same level of compression that you describe above. In addition to the above compressed data, also store, uncompressed, every Nth prime. A binary search will get you within N primes of your answer, to find the exact value, recreate those N-primes. For a N of, for instance, 64 the level of compression would be minimally affected but should make finding the number of primes less than a given number number much faster than the basic compressed scheme. Still thinking of gmpy's primitives, I came up with a variant of this...: import gmpy import array import bisect highest_number_of_interest = gmpy.mpz(100*1000*1000) def make_primes(put_them_here, every=1000): current_prime = gmpy.mpz(put_them_here[-1]) count = 0 while current_prime < highest_number_of_interest: current_prime = current_prime.next_prime() count += 1 if count == every: put_them_here.append(int(current_prime)) count = 0 try: savefile = open('primes.dat', 'rb') except: every_1000th_prime = array.array('l', [1]) make_primes(every_1000th_prime) savefile = open('primes.dat', 'wb') every_1000th_prime.tofile(savefile) savefile.close() else: every_1000th_prime = array.array('l', []) try: every_1000th_prime.fromfile(savefile, 6000) except EOFError: pass else: warnings.warn("Only %d primes (?)" % len(every_1000th_prime)) warnings.warn("Highest is %d" % every_1000th_prime[-1]) # could fill out and re-save the array here, if needed savefile.close() print "%d primes stored (1 every 1000), highest is %d" % ( len(every_1000th_prime), every_1000th_prime[-1]) def nth_prime(N): N_div_1000, N_mod_1000 = divmod(N, 1000) try: prime = every_1000th_prime[N_div_1000] except IndexError: raise ValueError, "must be N<%d" % (1000*len(every_1000th_prime)+999) prime = gmpy.mpz(prime) for x in xrange(N_mod_1000): prime = prime.next_prime() return int(prime) print "55555th prime is", nth_prime(55555) import bisect def primes_upto(N): if N>highest_number_of_interest: raise ValueError, "must be N<%d" % highest_number_of_interest x = bisect.bisect_left(every_1000th_prime, N) if N == every_1000th_prime[x]: return x*1000+1 prime = gmpy.mpz(every_1000th_prime[x-1]) x = (x-1)*1000 while prime < N: x += 1 prime = prime.next_prime() if prime == N: x += 1 return x print "Up to 555555, there are %d primes" % primes_upto(555555) The first run, building the file primes.dat (just 23048 bytes), took almost 7 minutes elapsed time on my machine (250 CPU seconds). But any successive run is extremely fast: [alex@lancelot perex]$ time python -O prio.py
5762 primes stored (1 every 1000), highest is 99991609
55555th prime is 686671
Up to 555555, there are 45741 primes
0.06user 0.00system 0:00.14elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (494major+297minor)pagefaults 0swaps
[alex@lancelot perex]$I've coded this rapidly and sloppily, so I wouldn't be surprised if there were off-by-one errors here and there (needs to be checked against some other, independent way to generate/check primes!), but I think the fundamental ideas are sound. Alex Jul 18 '05 #12 |> > So I only expect around 5,428,681 primes less than 10^8 to occur. |> > Well, that's not SO much less than 7.8M. |Klaus Alexander Seistrup |> I found 5,761,455 primes < 1E8. "Andrew Dalke" <ad****@mindspring.com> wrote previously: |http://www.utm.edu/research/primes/howmany.shtml has the same number. |(found by googling for "5,761,455" - 3rd hit. :) Looking at Andrew's resource, I am reminded that N/(log(N)-1) is a better approximation for "primes up to N" than was Gauss' original. Of course, as I wrote, both are accurate at the asymptote. But infinity is a long way to go. N = 10**8 N/(log(N)-1) 5740303.8072846411 That looks assuringly closer to the specific total than my first pass. Obviously, enumeration of all the primes in the range is still more accurate than any such approximation formula (some further such approximations can be found at the above URL, and elsewhere). Yours, Lulu... -- mertz@ _/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: \_\_\_\_ n o gnosis _/_/ Postmodern Enterprises \_\_ ..cx _/_/ \_\_ d o _/_/_/ IN A WORLD W/O WALLS, THERE WOULD BE NO GATES \_\_\_ z e Jul 18 '05 #13 bo**@oz.net (Bengt Richter) wrote previously: |just make a bit map of the 10**8 on disk, which would only be | >>> 10**8/8000000. | 12.5 |megabytes. I had been thinking about this also, but Bengt beat me to posting something. I think that we could pick up Emile van Sebille's suggestion about gaps only needing to store even numbers here; the equivalent being that we only store a bit array of ODD numbers, getting us down to 6.25 mB of storage space (special casing of the prime '2' needs to be handled, seems manageable :-)). Doing this does not necessarily produce a smaller storage format than a zip'd bitmap of all the numbers. But the in-memory image doesn't require a decompression step (and the stored version can still be compressed). I guess the test is something like: prime_bits = open('bitarray.primes','rb').read() # only odd nums def isPrime(N): if N < 2: return False elif N==2: return True else: byte, bit = divmod(N, 16) bit = bit//2 return ord(prime_bits[byte]) & 2**bit You could also use a memory-mapped file or the like, of course. |Nth-prime and how-many-below-N would need some other representations. |You could translate the bit map to a sequence of prime counts for 8 or 16-bit |chunks with a simple 2*8 or 2**16 byte table mapping bytes or 16-bit chunks to |the counts of bits within them, then sum and only fiddle with one bit chunk to |make an nth-prime table... The ancillary data structure would indeed speed the "how-many-below" calculation enormously, while not taking too much space. Along those lines, what's the quickest way to count bits in Python? Maybe someone has a clever idea... something that would run a whole bunch faster than my bit mask style above for counting all the "1"s in a multibyte string. Yours, Lulu... -- mertz@ _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i gnosis _/_/ Postmodern Enterprises _/_/ s r ..cx _/_/ MAKERS OF CHAOS.... _/_/ i u _/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/ g s Jul 18 '05 #14 Alex Martelli wrote: Tim Hochberg wrote: ... I believe you could implement a hybrid scheme that would be quite fast and still maintain nearly the same level of compression that you describe above. In addition to the above compressed data, also store, uncompressed, every Nth prime. A binary search will get you within N primes of your answer, to find the exact value, recreate those N-primes. For a N of, for instance, 64 the level of compression would be minimally affected but should make finding the number of primes less than a given number number much faster than the basic compressed scheme. Still thinking of gmpy's primitives, I came up with a variant of this...: [Nice compact version using gmpy snipped] Just to keep my hand in, I'm attaching a version based on the idea above which is in turn based on Emile's and Lulu's ideas for compression. This produces data files that total just over 6 GB for all primes up to 1E8. The initial run is quite slow, but after that it can determine the nth prime, or the number of primes below some number in about 0.5 ms on my box. The first run, building the file primes.dat (just 23048 bytes), took almost 7 minutes elapsed time on my machine (250 CPU seconds). But any successive run is extremely fast: [alex@lancelot perex]$ time python -O prio.py
5762 primes stored (1 every 1000), highest is 99991609
55555th prime is 686671
Up to 555555, there are 45741 primes
0.06user 0.00system 0:00.14elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (494major+297minor)pagefaults 0swaps
[alex@lancelot perex]$I've coded this rapidly and sloppily, so I wouldn't be surprised if there were off-by-one errors here and there (needs to be checked against some other, independent way to generate/check primes!), but I think the fundamental ideas are sound. Well one of us is off by one as I get 45740 primes below 55555. However, I get 686671 as the 55555th prime, so we agree there. I would not be at all suprised if it is mine that is off by one. -tim Alex import array, bisect, psyco, struct def eratosthenes(max=None): '''Yields the sequence of prime numbers via the Sieve of Eratosthenes.''' # Mostly based on Alex Marteli's version from the Python Cookbook. The max # attribute, which I addded, trims the memory usage substantially when # computing lots of primes. Of course you need to know your max prime in # advance. D = {} # map composite integers to primes witnessing their compositeness q = 2 # first integer to test for primality while (max is None) or (q <= max): p = D.pop(q, None) if p: x = p + q while x in D: x += p if (x is None) or (x <= max): D[x] = p else: q2 = q*q if (max is None) or (q2 <= max): D[q2] = q yield q q += 1 psyco.bind(eratosthenes) def encode(delta): # To encode N-bytes, preface with (N-1) leading zero-bytes enc = [] delta /= 2 while delta: delta, byte = divmod(delta, 128) enc.append(byte) enc.reverse() result = ['\0'] * (len(enc) - 1) for byte in enc[:-1]: result.append(chr(byte+1)) result.append(chr(enc[-1])) return ''.join(result) def decode(chars): chars = chars.lstrip('\0') bytes = [] for c in chars[:-1]: bytes.append(ord(c)-1) bytes.append(ord(chars[-1])) value = bytes[0] for b in bytes[1:]: value *= 128 value += b return 2 * value def pop_delta(data): # return delta, data_without_delta index = 0 while data[index] == '\0': index += 1 index = 2*index + 1 return decode(data[:index]), data[index:] def encode_primes(n, chunk, basename): primes = eratosthenes(n) index_file = open(basename+'.index', 'wb') everyn_file = open(basename+'.everyn', 'wb') encoded_file = open(basename+'.encoded', 'wb') index_file.write(struct.pack('l', chunk)) everyn_file.write(struct.pack('l', chunk)) # We reserve some space now and come back later and fill in the length encoded_file.write(struct.pack('l', 0)) # We write to the files a chunk at a time. encoded_chunk = '' primes.next() # skip 2 p0 = primes.next() index = 0 for i, p in enumerate(primes): delta = p - p0 if not (i % chunk): index += len(encoded_chunk) index_file.write(struct.pack('l', index)) everyn_file.write(struct.pack('l', p0)) encoded_file.write(encoded_chunk) encoded_chunk = '' encoded_chunk += encode(delta)# This is supposed to be fast under psyco p0 = p # write leftover data index += len(encoded_chunk) index_file.write(struct.pack('l', index)) encoded_file.write(encoded_chunk) encoded_file.seek(0) encoded_file.write(struct.pack('l', i)) psyco.bind(encode_primes) class Primes: def __init__(self, basename): sizel = struct.calcsize('l') index_file = open(basename+'.index', 'rb') [self.chunk] = struct.unpack('l', index_file.read(sizel)) index_data = index_file.read() self.indices = struct.unpack("%sl" % (len(index_data) // sizel), index_data) index_file.close() everyn_file = open(basename+'.everyn', 'rb') [chunk] = struct.unpack('l', everyn_file.read(sizel)) assert self.chunk == chunk, "inconsistent chunk sizes (%s vs %s)" % (self.chunk, chunk) everyn_data = everyn_file.read() self.everyn = struct.unpack("%sl" % (len(everyn_data) // sizel), everyn_data) everyn_file.close() encoded_file = open(basename+'.encoded', 'rb') [self.nprimes] = struct.unpack('l', encoded_file.read(sizel)) self.encoded = encoded_file.read() def __len__(self): return self.nprimes def __getitem__(self, n): if n < 0: n += self.nprimes if not isinstance(n, int) and (0 <= n < self._len): raise IndexError() if n == 0: return 2 cindex, pindex = divmod((n-1), self.chunk) return self.decode_chunk(cindex)[pindex] def decode_chunk(self, n): cprimes = [self.everyn[n]] data = self.encoded[self.indices[n]:self.indices[n+1]] while data: delta, data = pop_delta(data) cprimes.append(cprimes[-1] + delta) return cprimes def upto(self, n): if n < 2: return 0 elif n < 3: return 1 else: n0 = max(bisect.bisect_right(self.everyn, n) - 1, 0) chunk = self.decode_chunk(n0) return bisect.bisect_right(chunk, n) + n0*self.chunk + 1 basename = "packedprimes" try: primes = Primes(basename) except: # be more specific... encode_primes(int(1e8), 64, basename) primes = Primes(basename) if __name__ == "__main__": for n in 1, 2, 2.5, 3, 3.5, 5, 13, 13.5, 555555: print "Up to", n, ", there are %d primes" % primes.upto(n) print primes[55555-1] import timeit t = timeit.Timer("packedprimes.primes.upto(555555)", "import packedprimes") print "seconds per call to upto =", t.timeit(number=1000) / 1000 t = timeit.Timer("packedprimes.primes[99834]", "import packedprimes") print "seconds per call to [] =", t.timeit(number=1000) / 1000 #~ eprimes = eratosthenes(int(1e5)) #~ for i, p in enumerate(eprimes): #~ assert p == primes[i], "mismatch at prime[%s] (%s vs %s)" % (i, p, primes[i]) #~ print "OK" Jul 18 '05 #15 "Lulu of the Lotus-Eaters" <me***@gnosis.cx> wrote: Along those lines, what's the quickest way to count bits in Python? Maybe someone has a clever idea... something that would run a whole bunch faster than my bit mask style above for counting all the "1"s in a multibyte string. Probably far from the fastest, but in the tradition that you'll get a better answer by posting, here's a stab at a python method for counting bits: import string,time nibbles = [(w+x+y+z, w*8+x*4+y*2+z) for w in (0,1) for x in (0,1) for y in (0,1) for z in (0,1)] bitcounts = [ (chr(hinibblevalue*16 + lonibblevalue), hibitcount+lobitcount) for hibitcount, hinibblevalue in nibbles for lobitcount, lonibblevalue in nibbles ] bytes = dict(bitcounts) xlater = "".join([chr(bitcount) for byte, bitcount in bitcounts]) def bitcount1(bytestr): return sum([bytes[ii] for ii in bytestr]) def bitcount2(bytestr): return sum(map(ord, string.translate(bytestr,xlater))) t=time.time() bitcount1('a'*100000) print time.time()-t t=time.time() bitcount2('a'*100000) print time.time()-t -- Emile van Sebille em***@fenx.com Jul 18 '05 #16 Lulu of the Lotus-Eaters wrote: ... Along those lines, what's the quickest way to count bits in Python? Probably gmpy's popcount function. Maybe someone has a clever idea... something that would run a whole bunch faster than my bit mask style above for counting all the "1"s in a multibyte string. I would try gmpy.mpz(multibytestring, 256).popcount(). E.g.: gmpy.mpz('gmpy IS pretty fast for many kinds of things',256).popcount() 163 and, on my trusty 30-months-old Athlon box...: [alex@lancelot python2.3]$ python timeit.py -s'import gmpy' \ -s'x="gmpy IS pretty fast for many kinds of things"' \
'gmpy.mpz(x,256).popcount()'

100000 loops, best of 3: 8.13 usec per loop
[alex@lancelot python2.3]$Alex Jul 18 '05 #17 |Lulu of the Lotus-Eaters wrote: |> Along those lines, what's the quickest way to count bits in Python? Alex Martelli <al***@aleax.it> wrote previously: |Probably gmpy's popcount function. What about fast AND portable? (I'm sure gmpy can be installed lots of places, but not necessarily by receivers of Python code). Yours, Lulu... -- _/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/ _/_/ ~~~~~~~~~~~~~~~~~~~~[me***@gnosis.cx]~~~~~~~~~~~~~~~~~~~~~ _/_/ _/_/ The opinions expressed here must be those of my employer... _/_/ _/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them! _/_/ Jul 18 '05 #18 Taking up Bengt's suggestion, and some various others, I created a bit array version of the primality data structure. That is, each bit of the file 'primes.bitarray' encodes the primality of the number corresponding to its bit position. However, only ODD numbers are mentioned in this file, since it's rather easy to rule out evens (and special case 2). This produces a 6.25 mB file for the first 10**8 numbers. I was disappointed that gzip only reduces the size by 41%. I would have guessed better, given the predominance of 0 bits. Still, no reason not to use the [gzip] module to read the data. You can download the data file, if you wish, at: http://gnosis.cx/download/primes.bitarray.gz I decided to check how quickly I can check primality using this data structure. Pretty well, to my mind (for reference, I have a PIII 733Mhz; nothing very fast for nowadays). Here's the code: #!/usr/bin/python import gzip bit_source = gzip.open('primes.bitarray.gz') prime_bits = bit_source.read(100000) def isPrime(N): "Check primality on pre-generated bit field (read more as needed)" if N < 2: return False elif N==2: return True elif not N % 2: return False else: # Check if we need to extend prime_bits global prime_bits if 8*len(prime_bits) < N: needed = 1+(N//8) - len(prime_bits) prime_bits += bit_source.read(needed) byte, bit = divmod(N, 16) bit = 7 - bit//2 return ord(prime_bits[byte]) & 2**bit if __name__=='__main__': from time import clock from random import randint # generate a million numbers to check tests = [] for _ in xrange(10**6): tests.append(randint(1,10**8)) start = clock() primes = 0 for n in tests: if isPrime(n): primes += 1 print "%d primes in 1e6 numbers (%.2f secs)" % (primes,clock()-start) Here's the timing: % test_bits.py 57546 primes in 1e6 numbers (29.57 secs) I figured there was no need to read the whole data file in unless a check required it--I just read in a smallish 100k at initialization. Of course, in the course of a million tests, I am pretty sure I wind up reading in essentially the whole file at some point. One nice thing to do would be to make 'bit_source' a bit smarter. I.e. it could be a class that generated more primes when needed (maybe using gmpy as Alex suggests). In principle, 'isPrime()' could then check unbounded primes (subject to time and resource constraints). The next thing would probably be to store sparce accumulations as Bengt suggested (and Alex implemented in a different fashion). That is, another structure could store "every 1000th prime" in order to answer questions like "how-many-less-than" and "which-is-Nth-prime". Maybe I'll Emile's bitcounting techniques to do this. Yours, Lulu... -- mertz@ | The specter of free information is haunting the Net! All the gnosis | powers of IP- and crypto-tyranny have entered into an unholy ..cx | alliance...ideas have nothing to lose but their chains. Unite | against "intellectual property" and anti-privacy regimes! ------------------------------------------------------------------------- Jul 18 '05 #19 On 30 Sep 2003 13:00:39 GMT, bo**@oz.net (Bengt Richter) wrote: If the object is to be able to test primeness of an arbitrary candidate number in range(10**8), I think I might just make a bit map of the 10**8 on disk, which would only be 10**8/8000000. 12.5 megabytes. If the object was simply to test primeness I would agree, but the requirement included determining the number of primes below some given value. It should be easy enough to fix, though - keep a cached count for every n primes and only redo the count for the last few. I think bitsets are the way to go for compression. The idea of only storing every nth prime until you consider how you implement the seive for the missing primes. You can't just check the interval - you need to know the primes in earlier intervals to do the seive checks. You may end up recalculating a lot of the smaller primes again. It is not too difficult to count the bits in a single integer. One of my favorite tricks exploits the fact that a single operation on an integer can be seen as a parallel operation processing 32 or 64 bits in one go. Take an integer. Each bit can be considered a count of the number of bits in that bit position. With a little bit of masking and shifting, you can get a value where every pair of bits contains the number of bits that were in that pair of bits in the original. In the 32 bit case... x = (x & 0x55555555) + ((x & 0xAAAAAAAA) >> 1) Next, get the totals for each group of four bits... x = (x & 0x33333333) + ((x & 0xCCCCCCCC) >> 2) And so on, until the value contains a single total of the number of bits in the whole original value. x = (x & 0x0F0F0F0F) + ((x & 0xF0F0F0F0) >> 2) x = (x & 0x00FF00FF) + ((x & 0xFF00FF00) >> 4) x = (x & 0x0000FFFF) + ((x & 0xFFFF0000) >> 8) A simple looping method would need 32 shifts, 32 bitwise ands and an average of 16 increments. This takes 5 shifts, 10 bitwise ands and 5 additions. Probably a worthwhile gain. LuLu points out that you can compress the bitset even further by storing bits only for odd numbers. This compression can be extended by looking at larger repeating patterns. For example, if we look at a each value modulo 2*3 = 6, we can exclude multiples of both two and three from the bitset. 0 1 2 3 4 5 - value % 6 * . * . * . - multiples of two * . . * . . - multiples of three * . * * * . - multiples of two or three Basically, you only need to store bits where (value % 6) is one of the values that gets a dot in the bottom line. Then to determine which bit you need, you set up something like... lookup = [None, 0, None, None, None, 1] if lookup [value % 6] is None : if (value == 2) or (value == 3) : value is prime else : value is not prime else lookup_bit ((value / 3) + lookup [value % 6]) So you only need to hold one bit in three (those with value % 6 equal to either 1 or 5) with both 2 and 3 being special-cased. Use modulo 2*3*5 == 30 and I think the bitset reduces to 8/30 == just under 27%, which tells me that along with a rapidly increasing pattern-checking lookup table size you get rapidly decreasing returns. No way am I going to work out the figures for modulo 2*3*5*7 = 210! The modulo 6 case is certainly worthwhile, and maybe the module 30 if you really need those extra few % points (which are still worth a couple of MB at a guess), but I doubt it's worth going further than that. -- Steve Horne steve at ninereeds dot fsnet dot co dot uk Jul 18 '05 #20 Adding another thought to the discussion. Instead of pregenerating all primes using a seive, store the cumulative counts for, say, every 10,000 terms (so 10,000 numbers are needed for the 1E8 range requested by the poster) then use a fast prime tester, like the Rabin-Miller strong pseudoprime test (see http://mathworld.wolfram.com/Rabin-M...primeTest.html ) to fill in any missing range. Could cache any hits in a bitstring along with the counts, like counts = [(None, 9999), # however many are between 0 and 10000 (None, 8888), # between 10,000 and 20,000 (or 0 to 20,000) ... ] where the None field stores the bitstring May want to do every 1,000 if the tests are too slow. pycrypto has Rabin-Miller and here's a copy I found in pure Python: http://senderek.de/pcp/release/tools/number.py It uses the first 7 primes for tests, which is good for all numbers up to 341550071728321 ~ 3.4E14. Andrew da***@dalkescientific.com Jul 18 '05 #21 Lulu of the Lotus-Eaters <me***@gnosis.cx> writes: I was disappointed that gzip only reduces the size by 41%. I would have guessed better, given the predominance of 0 bits. I think it is not a surprise. (But maybe that's because I tried it before.) You need redundancy to compress data. If bitmap of prime numbers contained lot of redundancy, you could use it to generate prime numbers. Of course bitmap is not the most efficient way to store list of prime numbers, but neither is gzip the most efficient way to find redundancy in it. IIRC gzip basically encodes repeating bit patterns and it's hardly a suprise that bitmap of prime numbers doesn't have lots of repeating bit patterns. -- Juha Autero http://www.iki.fi/jautero/ Eschew obscurity! Jul 18 '05 #22 "Andrew Dalke" <ad****@mindspring.com> wrote previously: |use a fast prime tester, like the Rabin-Miller strong pseudoprime test Well sure, you can cheat. But aren't you worried about undiscovered small pseudoprimes? *wink* (I guess for Rabin-Miller they are not called Carmichael numbers, just for Fermat pseudoprimes[*]) Yours, Lulu... [*] Does anyone know whether there are Carmichael-like numbers for Rabin-Miller tests? Last I knew, it was unproven either way, but that was a while. That is, can any composite pass R-M for every base? -- ---[ to our friends at TLAs (spread the word) ]-------------------------- Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad ---[ Postmodern Enterprises <me***@gnosis.cx> ]-------------------------- Jul 18 '05 #23 |> I was disappointed that gzip only reduces the size by 41%. I would have |> guessed better, given the predominance of 0 bits. Juha Autero <Ju*********@iki.fi> wrote previously: |patterns and it's hardly a suprise that bitmap of prime numbers |doesn't have lots of repeating bit patterns. Sure it does. The bit pattern "0 0 0 0 0 0 0" occurs a WHOLE LOT of times :-). Certainly much more than any other pattern. After all, my 10^8/2 bits only contain 5,761,453 ones among them[*]--there's got to be a lot of runs of zeros in there :-). The problem, I think, is that these bit patterns don't fall on byte borders as neatly as I would like. I am pretty sure that I could write a custom LZ-style compressor that did better by being bit-oriented rather than byte-oriented. But I'm not going to do that work, 3.8 megs is small enough for me. FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at this point if another representation of the primes under 10^8 actually did better[**]. I proposed listing the gaps as a strategy; but given that there are 5.7 million gaps to list, it's hard to beat 3.8 megs. Maybe encoding gaps in nibbles-with-escaping would work. But now I'm fond of the speed of the bit array. Yours, Lulu... [*] I'm missing two primes versus some other claims. My odd numbers don't include 2; and I called 1 a non-prime. I don't feel strongly about the second matter though. [**] Well, a MUCH shorter representation is a Python-coded algorithm for generating the primes in the first place. But somehow that seems like cheating :-). -- ---[ to our friends at TLAs (spread the word) ]-------------------------- Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad ---[ Postmodern Enterprises <me***@gnosis.cx> ]-------------------------- Jul 18 '05 #24 Lulu of the Lotus-Eaters: FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at this point if another representation of the primes under 10^8 actually did better[**]. As I recall, the Burrows-Wheeler transform used in bzip2 uses a clever way to rearrange the data in a block into a form which is more compressible, yes? Well, what about rearranging the bits so that every multiple of 3 is before every non-multiple of 3, then every multiple of 5, etc. There's a simple algorithm to unravel the ordering, and you'll end up with lot of 0s at the start of the block. It's very much like the ordering+compression you did when you moved all the terms divisible by 2 to the front of the group then realized they could all be represented as part of the decompression algorithm. Extend my idea far enough and you'll have a prime sieve. [**] Well, a MUCH shorter representation is a Python-coded algorithm for generating the primes in the first place. But somehow that seems like cheating :-). Why doesn't bzip2 feel like cheating? It's good at compressing text because it's designed for the sort of patterns found in text. Why doesn't mp3 feel like cheating? It's designed to handle the sounds we normally hear, as compared to the output of a random noise generator. In essense, if you know the full corpus beforehand, it's really easy to compress - just store the identifier for each text. There's only one text here so the result is trivial compressed to nothing. The problem is, your algorithm tries to compress an ordered list of odd numbers, and so fails to compress as well as a compression algorithm tuned for just compressing the first N primes. ---[ to our friends at TLAs (spread the word) ]-------------------------- Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad Hmm... Seems dated. Terms like whitewater and Clinton (Bill, not Hillary) don't hit many hot buttons these days, the number of people in various milita has gone way down[*] and Libya is working towards renormalization. What about these words? Iran nuclear neocon expose POTUS patriot Pakistan armed weaponized enriched uranium UN smallpox Gitmo invasion Castro Tikrit revolution sarin Andrew da***@dalkescientific.com [*] Actually, US law states http://www4.law.cornell.edu/uscode/10/311.html (a) The militia of the United States consists of all able-bodied males at least 17 years of age and, except as provided in section 313 of title 32, under 45 years of age who are, or who have made a declaration of intention to become, citizens of the United States and of female citizens of the United States who are members of the National Guard. (b) The classes of the militia are - (1) the organized militia, which consists of the National Guard and the Naval Militia; and (2) the unorganized militia, which consists of the members of the militia who are not members of the National Guard or the Naval Militia Therefore, the numbers of people in the militia hasn't changed that drastically -- and I'm a member of the unorganized militia of the US. Maybe I should get a patch saying that, and wear it next time I visit Europe? :) Jul 18 '05 #25 Lulu of the Lotus-Eaters: Well sure, you can cheat. But aren't you worried about undiscovered small pseudoprimes? I am. I'm doing an extensive study of the small pseudoprimes between 0 and 1. I'm working on the distributed computing framework now, and hope to start the actual mathematical analysis in about 3 months. Andrew da***@dalkescientific.com Jul 18 '05 #26 Lulu of the Lotus-Eaters wrote: |Lulu of the Lotus-Eaters wrote: |> Along those lines, what's the quickest way to count bits in Python? Alex Martelli <al***@aleax.it> wrote previously: |Probably gmpy's popcount function. What about fast AND portable? (I'm sure gmpy can be installed lots of places, but not necessarily by receivers of Python code). Why not? gmpy's downloads include a self-contained .pyd for Windows users (the ones who typically lack easy access to a C compiler) -- don't most users of Linux, MacOs and other BSD's, etc, typically have gcc available to install extensions if they want...? Or are you worrying about the operations' availability to users of Jython...? I see you guys are having a great time concocting pure-Python ways to count bits, etc -- but GMP, the library underying gmpy, has carefully tuned machine-code accelerations for many important operations and platforms... if one really needs to perform advanced multi-precision arithmetic, as opposed to just playing with it, GMP is just too precious a resource to ignore it. Given how widespread Java is, I would not be surprised if libraries of similar quality were available there (with native-methods and all) -- I just don't know of any because I never needed them, so didn't research the issue. I do acknowlegde the usefulness of pure Python (AND readable, NOT distorted for fun & optimization-profit:-) code in *teaching* even the lowest-level concepts of multi-precision arithmetic & bit-fiddling. And Python is ideal for playing with higher-level architectures, such as you guys have been doing about various potential ways to store compressed pre-computed representations of primes. Finally, one day, if and when pypy triumphs, one will no doubt need to start with pure Python code (most likely SIMPLE, NON-TRICKY pure Python code) in order to obtain maximum speed. But from a pragmatic point of view, I'll keep using gmpy for the foreseeable future, thanks;-). Alex Jul 18 '05 #27 Alex Martelli wrote: Lulu of the Lotus-Eaters wrote: ... Along those lines, what's the quickest way to count bits in Python? Probably gmpy's popcount function. Maybe someone has a clever idea... something that would run a whole bunch faster than my bit mask style above for counting all the "1"s in a multibyte string. I would try gmpy.mpz(multibytestring, 256).popcount(). E.g.: gmpy.mpz('gmpy IS pretty fast for many kinds of things',256).popcount() 163 and, on my trusty 30-months-old Athlon box...: [alex@lancelot python2.3]$ python timeit.py -s'import gmpy' \
-s'x="gmpy IS pretty fast for many kinds of things"' \
'gmpy.mpz(x,256).popcount()'

100000 loops, best of 3: 8.13 usec per loop
[alex@lancelot python2.3]$Silly me, for not including a "pure Python" alternative run on the same machine for comparison purposes -- sorry. Given the following bitcount.py module (only the initialization is coded with gmpy, and just because I'm lazy -- that doesn't affect timings!): import gmpy byte_bit_counts = dict([(chr(b), gmpy.popcount(b)) for b in range(256)]) def popcount(s, bbc=byte_bit_counts): return sum([bbc[c] for c in s]) import bitcount x="gmpy IS pretty fast for many kinds of things" bitcount.popcount(x) 163 and, on the same box: [alex@lancelot python2.3]$ python timeit.py -s'import bitcount' \ -s'x="gmpy IS pretty fast for many kinds of things"' \
'bitcount.popcount(x)'

10000 loops, best of 3: 82.2 usec per loop

So, the speed advantage of gmpy is just about one order of magnitude
for this trivial operation.
Alex

Jul 18 '05 #28
Alex Martelli <al***@aleax.it> writes:
Lulu of the Lotus-Eaters wrote:
|Lulu of the Lotus-Eaters wrote:
|> Along those lines, what's the quickest way to count bits in Python?

Alex Martelli <al***@aleax.it> wrote previously:
|Probably gmpy's popcount function.

What about fast AND portable?

(I'm sure gmpy can be installed lots of places, but not necessarily by
receivers of Python code).
Why not? gmpy's downloads include a self-contained .pyd for Windows
users (the ones who typically lack easy access to a C compiler) -- don't
most users of Linux, MacOs and other BSD's, etc, typically have gcc
available to install extensions if they want...? Or are you worrying
about the operations' availability to users of Jython...?

I'm not sure "most" OS X installations have the dev tools installed.

I'm not sure GMP has been ported to, e.g. VMS.
But from a pragmatic point of view, I'll keep using gmpy for the
foreseeable future, thanks;-).

Well, sure.

Cheers,
mwh

--
Hmmm... its Sunday afternoon: I could do my work, or I could do a
Fourier analysis of my computer's fan noise.
-- Amit Muthu, ucam.chat (from Owen Dunn's summary of the year)
Jul 18 '05 #29
Lulu of the Lotus-Eaters wrote:
[SNIP]
FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**]. I proposed listing the gaps as a strategy; but given
that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
fond of the speed of the bit array.

[SNIP]

FWIW, using escaped nibbles plus compression put me down around 3.4
megs. This includes storing enough metadata to do fast how many primes
below n testing. Much slower for straight primality testing than the
bitmap scheme though.

-tim

Jul 18 '05 #30
"Andrew Dalke" <ad****@mindspring.com> wrote previously:
|In essense, if you know the full corpus beforehand, it's really
|easy to compress - just store the identifier for each text.

I've invented an excellent compression algorithm. It represents the
12.5 meg bit sequence for prime/no-prime (<10^8) as the bit "1". To
encode other sequences, you use the escape bit "0" as the first bit...

Pick the best algorithm for the job at hand...

Yours, Lulu...

|Iran nuclear neocon expose POTUS patriot Pakistan armed weaponized
|enriched uranium UN smallpox Gitmo invasion Castro Tikrit revolution sarin

Yeah... I like those. I'll add a new .signature to my rotation list.
I can troll both the spooks of the mid-1990s and those of the mid-2000s.

--
mertz@ | The specter of free information is haunting the Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
..cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------

Jul 18 '05 #31
Stephen Horne:
I think bitsets are the way to go for compression. The idea of only
storing every nth prime until you consider how you implement the seive
for the missing primes.
Well, my idea was to store the counts for every, say, 10,000th prime
then do a fast prime test to fill in missing ranges, and cache that list
of primes for later use.
Basically, you only need to store bits where (value % 6) is one of the
values that gets a dot in the bottom line. Then to determine which bit
you need, you set up something like...

You could also get better compression testing for %5, %7, %11, etc.
Note that by doing so you convert your lookup system into a simple
prime tester, and if you extend it along 2, 3, 5, ... then it is the seive
of Eratosthenes and you don't need to store any bit patterns at all.

Andrew
da***@dalkescientific.com
Jul 18 '05 #32
On Wed, 01 Oct 2003 19:53:04 GMT, "Andrew Dalke"
You could also get better compression testing for %5, %7, %11, etc.
That was the point of the modulo 2*3*5 == 30 and modulo 2*3*5*7 ==
210. As I said, working with the pattern needs a rapidly increasing
modulo size and lookup table, and the advantages rapidly diminish as
you add more and more primes.
Note that by doing so you convert your lookup system into a simple
prime tester, and if you extend it along 2, 3, 5, ... then it is the seive
of Eratosthenes and you don't need to store any bit patterns at all.

No, you don't need the bitset - but the whole thing becomes a waste of
time and space. If you extend it sufficiently to not need any bit
patterns at all, what you do need is a huge modulo-based lookup table
that serves no real purpose at all (it only has benefits for the
second and so-on repetitions of the repeating pattern) and a complete
list of the 'special case' primes (ie all primes up to 10^8).

Storing the list of special primes up to 10^8 would take much more
space than the original bitset, and the lookup table would take even
more space. The whole exercise would be far worse than pointless as
you could get more efficient results by discarding the method and the
lookup table, and just searching that list of 'special case' primes.

Optimisation problems often require a balance. In this case, I was
trying to balance prior knowledge of the repeating pattern of definite
non-primes in the program with externally stored data for the bulk of
the primes (the bitset on disk).
--
Steve Horne

steve at ninereeds dot fsnet dot co dot uk
Jul 18 '05 #33
> "Lulu of the Lotus-Eaters" <me***@gnosis.cx> wrote:
Along those lines, what's the quickest way to count bits in Python?
Maybe someone has a clever idea... something that would run a whole
bunch faster than my bit mask style above for counting all the "1"s in a
multibyte string.

FWIW, using array pops the speed up significantly for long strings:

import string, time, array

nibbles = [(w+x+y+z, w*8+x*4+y*2+z)
for w in (0,1) for x in (0,1)
for y in (0,1) for z in (0,1)]

bitcounts = [
(chr(hinibblevalue*16 + lonibblevalue),
hibitcount+lobitcount)
for hibitcount, hinibblevalue in nibbles
for lobitcount, lonibblevalue in nibbles ]

bytes = dict(bitcounts)

xlater = "".join([chr(bitcount) for byte, bitcount in bitcounts])

def bitcount1(bytestr):
return sum([bytes[ii] for ii in bytestr])

def bitcount2(bytestr):
return sum(map(ord, string.translate(bytestr,xlater)))

def bitcount4(bytestr):
return sum(array.array("B",string.translate(bytestr,xlate r)))

refstr = 'a'*1000000

t=time.time()
bitcount1(refstr)
print time.time()-t

t=time.time()
bitcount2(refstr)
print time.time()-t

t=time.time()
bitcount4(refstr)
print time.time()-t
--

Emile van Sebille
em***@fenx.com
Jul 18 '05 #34
On Wed, 01 Oct 2003 02:35:31 -0400, Lulu of the Lotus-Eaters <me***@gnosis.cx> wrote:
|> I was disappointed that gzip only reduces the size by 41%. I would have
|> guessed better, given the predominance of 0 bits.

Juha Autero <Ju*********@iki.fi> wrote previously:
|patterns and it's hardly a suprise that bitmap of prime numbers
|doesn't have lots of repeating bit patterns.

Sure it does. The bit pattern "0 0 0 0 0 0 0" occurs a WHOLE LOT of
times :-). Certainly much more than any other pattern. After all, my
10^8/2 bits only contain 5,761,453 ones among them[*]--there's got to be
a lot of runs of zeros in there :-).

The problem, I think, is that these bit patterns don't fall on byte
borders as neatly as I would like. I am pretty sure that I could write
a custom LZ-style compressor that did better by being bit-oriented
rather than byte-oriented. But I'm not going to do that work, 3.8 megs
is small enough for me.
Hm, a bit sequence 001010010010000100100001001000010000001001 ... could be
encoded in a series of symbols '001', '01', '001', '001', '00001', '001', ...
which are essentially just different string representations for prime first differences,
e.g., (faking in the starting 0)
prime = [0,2,3,5,7,11,13,17,19,23,29,31]
['0'*(prime[i+1]-prime[i])+'1' for i in range(len(prime)-1)] ['001', '01', '001', '001', '00001', '001', '00001', '001', '00001', '0000001', '001']

Or, using different concrete symbols, [chr(prime[i+1]-prime[i]+ord('a')) for i in range(len(prime)-1)]

['c', 'b', 'c', 'c', 'e', 'c', 'e', 'c', 'e', 'g', 'c']

Maybe it would be best to analyze the frequency and assign huffman codes and
pack those, for maximal compression? I guess that is about what zip would do to
the latter (byte-oriented, as you mentioned) symbols.

But actually, I was looking at a modulo 2*3*5*7... mask as a pre-compression step, which
might pack things even tighter, because it eliminates whole structured patterns
from the whole. One could do a bigger modulo too. If I get ambitious, I might
try it. I have an inkling of a multi-step modulo, multi-bitmap check that could go
fast in smaller space, but I have to meet someone now...

[Back next day, didn't post that]

The idea alluded to above is to make a mask using a selected part of the front
of the big bitmap, and noting that its zeroes are guaranteed to repeat in forward
repeats of that mask, so those zeroes don't need actual representation going forward.

I.e., if pfact(p) is the "factorial" product of all primes from p on down, then
if mlen=pfact(p), [bitmap[i:i+mlen] for i in range(1,len(bitmap):mlen)] will be
(untested) a list of bitmap chunks with guaranteed zeroes everywhere bitmap[1:mlen+1]

(Ok, but I reconstituted the theoretical bitmap by including factors of 2 and the
primes 1 & 2 in the following)

===< primask.py >================================================= ===
chunksize = (nb/2+7)//8 # nb/2 since even bits are not in bitfile
f = file(fpath,'rb')
bits = []
def bufbits():
if len(bits)>=nb: return
bitchunk = f.read(chunksize) # bigendian bytes w/ bits for odd no's 1,3,5...
for byte in bitchunk:
ibyte = ord(byte)
for b in xrange(8):
bits.append('.1'[(0x80>>b)&ibyte>0])
bits.append('.') # add in evens

bufbits()
bits[0:2] = ['1','1'] # account for primes 1 & 2 not in bit file at front
while 1:
bufbits() # make sure
if not bits: break
yield ''.join(bits[:nb])
bits = bits[nb:]

pf = {2:2, 3:6, 5:30, 7:210, 11:2310, 13:30030, 17:510510}[p] # should suffice
bgen = getbitmasks('primes.bitarray', pf)
print 'Prime for mask = %s, mask length = %s' % (p, pf)
for n in xrange(nlines):
s = bgen.next()
nz = s.count('.')
if not n: print 'Precompression possible: (%s-%s zeroes)/%s => %.1f %% of orig' %(
pf,nz,pf,100.*(pf-nz)/pf)
print '%4s: %s' %(nz, len(s)<80 and s or s[:80]+' etc.')

if __name__ == '__main__':
import sys
try:
args = map(int, sys.argv[1:])
p = args.pop(0)
nl = args and args.pop(0) or 5
except: raise; SystemExit("""
Usage: primask prime [number_of_lines]
where prime determines size of mask as product of primes at and below this
and number of lines is number to print.""")
================================================== ===================

This will print segments of the bit map (reconstituted to include all numbers starting with 1,2,3,...).
The segments are the length of a possible mask taken from the front whose non-prime slots are bound
to repeat through the rest, since corresponding slots will be multiples of the primes in the first set.
Since those zeroes repeat, one could remove them from the rest of the bit file, as you did with
even numbers. I print the possible compression based on the number of zeroes in the first segment.
So unless I goofed, ...

Prime for mask = 3, mask length = 6
Precompression possible: (6-2 zeroes)/6 => 66.7 % of orig
2: 111.1.
123 5 <<-- (primes represented, etc)
4: 1...1.
7 +--11
4: 1...1.
| +--17
+------13
4: 1...1.
5: ....1.

Prime for mask = 5, mask length = 30
Precompression possible: (30-19 zeroes)/30 => 36.7 % of orig
19: 111.1.1...1.1...1.1...1.....1.
23: 1.....1...1.1...1.....1.....1.
23: 1.....1...1.1.....1...1.....1.
24: ......1...1.1...1.1...1.......
25: ......1...1.....1.1.........1.

Prime for mask = 7, mask length = 210
Precompression possible: (210-163 zeroes)/210 => 22.4 % of orig
163: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
175: 1...........1...1.1...1.....1.1.........1.....1... ..1.....1.1.....1...1.1....... etc.
177: 1.........1.1.....1...1.....1.......1...1.1...1... ........1.......1...1.......1. etc.
178: 1.........1.1...1.....1.....1.1...........1...1... ..1.......1.........1.......1. etc.
180: ............1...1.1...1.............1...1.1...1... ................1...1.......1. etc.

Prime for mask = 11, mask length = 2310
Precompression possible: (2310-1966 zeroes)/2310 => 14.9 % of orig
1966: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
2030: 1.....................1.....1.1.....1...1.....1... ..........1.....1...1.1.....1. etc.
2043: 1...............1.1...1.....1.1.....1.....1....... ..1.....1...........1......... etc.
2055: ................1.1.........1.1.....1...1.....1... ..1.......1.....1...1......... etc.
2064: 1...............1...................1...1.1....... ..1.................1.......1. etc.

Prime for mask = 13, mask length = 30030
Precompression possible: (30030-26781 zeroes)/30030 => 10.8 % of orig
26781: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
27216: ................1...........1...........1......... ........1.1.....1.....1.....1. etc.
27366: ................1.....1.....1.1.........1.1...1... ................1.....1.....1. etc.
27444: ................1.............1.....1............. ........1.............1....... etc.
27481: 1...................................1.....1...1... ..........1...........1.....1. etc.

Prime for mask = 17, mask length = 510510
Precompression possible: (510510-468178 zeroes)/510510 => 8.3 % of orig
468178: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
472790: ..................1.....................1.1....... ........1...........1.1.....1. etc.
474186: ......................1.......................1... ..1.......1.....1...1.1....... etc.
475041: ..................1...........1.....1............. ..........1................... etc.
475727: ..................1.................1.....1....... ................1...1......... etc.

So it appears we could compress that file (imagining it as containing evens also) quite a bit,
by using a (510510+7)//8 => 63814 byte mask as the pattern for squishing zeroes out of the rest.

BTW, I thought primes ran about 10% of all numbers, so 8.3 % seems a little suspicious.
Did I goof, or what is the right percentage? I guess I'll make a dict of all the gap sizes
with their respective counts in the bit map and see, but have to leave this for now ...

FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**]. I proposed listing the gaps as a strategy; but given
that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
fond of the speed of the bit array.

Yours, Lulu...

[*] I'm missing two primes versus some other claims. My odd numbers
don't include 2; and I called 1 a non-prime. I don't feel strongly
about the second matter though.

[**] Well, a MUCH shorter representation is a Python-coded algorithm for
generating the primes in the first place. But somehow that seems like
cheating :-).

Oops, better post this and go...

Regards,
Bengt Richter
Jul 18 '05 #35
bo**@oz.net (Bengt Richter) wrote previously:
|I thought primes ran about 10% of all numbers, so 8.3 % seems a little
|suspicious.

The primes become sparser over larger sizes. Thinking about the Seive
of Erotosthenes makes that apparent, I think: the larger you get, the
more values get filtered out.

However, your ratios, I believe, *must* be wrong:

|Prime for mask = 3, mask length = 6
|Precompression possible: (6-2 zeroes)/6 => 66.7 % of orig
|Prime for mask = 5, mask length = 30
|Precompression possible: (30-19 zeroes)/30 => 36.7 % of orig
|Prime for mask = 7, mask length = 210
|Precompression possible: (210-163 zeroes)/210 => 22.4 % of orig
[...]

Bracket the Python implementation, just think of the numbers. If you
have a sequence of N bits, the 2 mask takes out half of them. What's
left is odd numbers: exactly 1/3 of which are divisible by 3. Of the
filtered list, exactly 1/5 of what's left is divisible by 5. And so
on.[*]
[*] Actually, exact divisibilty can be off-by-one, since the original
sequence might not divide evenly by all those primes.

So, in other words, the 2 mask gets you to 50%; the 2+3 mask gets
1/2*2/3==33%; the 2+3+5 mask gets 1/2*2/3*4/5==27%; the 2+3+5+7 mask
gets 1/2*2/3*4/5*6/7==22.8%; and so on.

Verifying this "experimentally":
len(filter(lambda n: n%2, range(100))) 50 len(filter(lambda n: n%2 and n%3, range(100))) 33 len(filter(lambda n: n%2 and n%3 and n%5, range(100))) 26 len(filter(lambda n: n%2 and n%3 and n%5 and n%7, range(100))) 22 len(filter(lambda n: n%2 and n%3 and n%5 and n%7 and n%11, range(100))) 21 len(filter(lambda n: n%2 and n%3 and n%5 and n%7 and n%11 and n%13, range(100))) 20 len(filter(lambda n: n%2 and n%3 and n%5 and n%7 and n%11 and n%13 and n%17, range(100)))

19

Anything that is left over after this filtering is--by definition--of
unknown primality. You need to encode that by putting a 1 or 0 in the
bit position corresponding to the number remaining. IOW:

toEncode = filter(lambda n: n%2 and n%3 and n%5 and n%7, range(10**8)))
for n in toEncode:
if isPrime(n):
bitfield.append(1)
else:
bitfield.append(0)

You might want to run that first line on a fast machine with plenty of
memory. :-)

Yours, Lulu...

--
mertz@ _/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: \_\_\_\_ n o
gnosis _/_/ Postmodern Enterprises \_\_
..cx _/_/ \_\_ d o
_/_/_/ IN A WORLD W/O WALLS, THERE WOULD BE NO GATES \_\_\_ z e
Jul 18 '05 #36
Lulu of the Lotus-Eaters <me***@gnosis.cx> wrote:
Anything that is left over after this filtering is--by definition--of
unknown primality. You need to encode that by putting a 1 or 0 in the
bit position corresponding to the number remaining. IOW:

toEncode = filter(lambda n: n%2 and n%3 and n%5 and n%7, range(10**8)))
for n in toEncode:
if isPrime(n):
bitfield.append(1)
else:
bitfield.append(0)

You might want to run that first line on a fast machine with plenty of
memory. :-)

I've been keeping some code to myself, because there seemed to be no
direct connection to this thread. However, now that you mention
needing a lot of resources there's a reason to post because my script
creates an infinite generator function that can produce the numbers
for your toEncode list more efficiently.

Anton

def period(L):
"""suppose we enumerate all integers >= 2 which are
not a multiple of any of the numbers in L (L is a list
of prime numbers) and take the first order differences.
This list repeats itself after some time. The period length
is the product of [x-1 for x in L]."""
return reduce(lambda i,j: i*(j-1),L,1)

def primedeltas(L):
"""return one full period of differences"""
diffs,p = [],period(L)
prev = i = 1
while len(diffs) < p:
i += 2
for j in L:
if i%j == 0: break
else:
diffs.append(i-prev)
prev= i
return diffs

def excludemultiples(L):
"""return each number in L and then return numbers
that are not multiples of any number in L. L is a list
of primes"""
i,pp = 1,primedeltas(L)
for p in L: yield p
while 1:
for p in pp:
i += p
yield i

def test():
L= [2,3,5,7]
em = excludemultiples(L)
for i,m in enumerate(em):
print m,
if i==14:
print
break
#print [x/2 for x in primedeltas(L)]

if __name__=='__main__':
test()
Jul 18 '05 #37

This discussion thread is closed

Replies have been disabled for this discussion.