467,133 Members | 1,114 Online
Bytes | Developer Community
Ask Question

Home New Posts Topics Members FAQ

Post your question to a community of 467,133 developers. It's quick & easy.

Prime number module

Dag
Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8 range
so it would have to fairly efficient

Dag
Jul 18 '05 #1
  • viewed: 7602
Share:
36 Replies
Dag wrote:
Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8
range so it would have to fairly efficient
gmpy is pretty good for this sort of thing, but the primitives it gives
you are quite different -- is_prime to check if a number is prime, and
next_prime to get the smallest prime number larger than a given number.

You'd have to build your own caching on top of this to avoid repeating
computation if you need, e.g., "how many primes are < N" for several
different values of N; and I'm not even sure that gmpy's primitives are
optimal for this (compared with, for example, some kind of sieving).
Anyway, with all of these caveats, here's an example function:

import gmpy

def primes_upto(N):
count = 0
prime = gmpy.mpz(1)
while prime < N:
prime = prime.next_prime()
count += 1
return count

and having saved it in pri.py, here's what I measure in terms of time:

[alex@lancelot python2.3]$ python timeit.py -s 'import pri' -s 'M=1000*1000'
\ 'pri.primes_upto(M)'

10 loops, best of 3: 2.76e+06 usec per loop
[alex@lancelot python2.3]$ python timeit.py -s 'import pri' -s
'M=2*1000*1000' 'pri.primes_upto(M)'
10 loops, best of 3: 4.03e+06 usec per loop

i.e., 2.76 seconds for primes up to 1,000,000 -- about 4 seconds
for primes up to 2,000,000. (This is on my good old trusty Athlon
Linux machine, about 30 months old now and not top-speed even when
new -- I'm sure one of today's typical PC's might take 2 or 3
times less than this, of course). If I was to use this kind of
computation "in production", I'd pre-compute the counts for some
typical N of interest and only generate and count primes in a
narrow band, I think; but it's hard to suggest anything without a
better idea of your application.
Alex

Jul 18 '05 #2
On Mon, 29 Sep 2003 14:53:34 GMT, Dag <da*@velvet.net> wrote:
Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8 range
so it would have to fairly efficient

Dag


I just wrote a fairly simple sieve, which gave primes up to 1,000,000
in a few seconds (1.8GHz P4, Python code). There were 78498 primes in
that range (unless I screwed the code up, but it looked OK for smaller
ranges).

Going for 10^7 took too long for my patience, let alone 10^8, but at
least in theory there should be less than 100 times as many primes in
the range up to 10^8.

So here's the thought - a binary file containing a complete list of
primes up to 10^8 would require roughly 30MB (using 32 bit integers,
which should comfortably handle your requirement). Open in random
access and do a binary search to find a particular prime and the
position in the file should tell you how many primes are smaller than
that one.

30MB shouldn't be too prohibitive on todays machines, though if this
is to be distributed to other people there would of course be issues.
--
Steve Horne

steve at ninereeds dot fsnet dot co dot uk
Jul 18 '05 #3
Stephen Horne <$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$.co.uk> wrote previously:
|I just wrote a fairly simple sieve, which gave primes up to 1,000,000
|in a few seconds (1.8GHz P4, Python code). There were 78498 primes in
|that range...at least in theory there should be less than 100 times as
|many primes in the range up to 10^8.

Quite a few less, actually. Under Gauss' Prime Number Theorem, an
approximation for the number of primes less than N is N/ln(N). I know
there have been some slight refinements in this estimate since 1792, but
in the ranges we're talking about, it's plenty good enough.

So I only expect around 5,428,681 primes less than 10^8 to occur. Well,
that's not SO much less than 7.8M.

|So here's the thought - a binary file containing a complete list of
|primes up to 10^8 would require roughly 30MB (using 32 bit integers,
|which should comfortably handle your requirement).

I wonder if such a data structure is really necessary. Certainly it
produces a class of answers quite quickly. I.e. search for a prime, and
its offset immediately gives you the number of primes less than it.
Heck, searching for any number, even a composite occurring between
primes, works pretty much the same way. Of course, the above
approximation gives you a close answer even quicker.

But if you are worried about disk storage, one easy shortcut is to store
a collection of 16-bit differences between successive primes. That's
half the size, and still lets you answer the desired question *pretty*
quickly (extra addition is required)... or at least generate a local
copy of Horne's data structure in one pass.

Moving farther, even this gap structure is quite compressible. Most
gaps are quite a bit smaller than 65536, so the highbits are zeros. In
fact, I am pretty sure that almost all the gaps are less than 256. So
an immediate compression strategy (saving disk space, costing time to
recreate the transparent structure) is to store gaps as 8-bit values,
with a x00 byte escaping into a larger value (I guess in the next two
bytes).

Maybe I'll try it, and see how small I can make the data... unless I do
my real work :-).

Yours, Lulu...

--
Keeping medicines from the bloodstreams of the sick; food from the bellies
of the hungry; books from the hands of the uneducated; technology from the
underdeveloped; and putting advocates of freedom in prisons. Intellectual
property is to the 21st century what the slave trade was to the 16th.
Jul 18 '05 #4

"Lulu of the Lotus-Eaters" <me***@gnosis.cx> wrote in message

Moving farther, even this gap structure is quite compressible. Most
gaps are quite a bit smaller than 65536, so the highbits are zeros. In
fact, I am pretty sure that almost all the gaps are less than 256. So
an immediate compression strategy (saving disk space, costing time to
recreate the transparent structure) is to store gaps as 8-bit values,
with a x00 byte escaping into a larger value (I guess in the next two
bytes).


You can take this to 512 knowing that the gaps will always be an even
interval.

Emile
Jul 18 '05 #5
Lulu of the Lotus-Eaters wrote:
Stephen Horne <$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$.co.uk> wrote previously:
|I just wrote a fairly simple sieve, which gave primes up to 1,000,000
|in a few seconds (1.8GHz P4, Python code). There were 78498 primes in
|that range...at least in theory there should be less than 100 times as
|many primes in the range up to 10^8.

Quite a few less, actually. Under Gauss' Prime Number Theorem, an
approximation for the number of primes less than N is N/ln(N). I know
there have been some slight refinements in this estimate since 1792, but
in the ranges we're talking about, it's plenty good enough.

So I only expect around 5,428,681 primes less than 10^8 to occur. Well,
that's not SO much less than 7.8M.

|So here's the thought - a binary file containing a complete list of
|primes up to 10^8 would require roughly 30MB (using 32 bit integers,
|which should comfortably handle your requirement).

I wonder if such a data structure is really necessary. Certainly it
produces a class of answers quite quickly. I.e. search for a prime, and
its offset immediately gives you the number of primes less than it.
Heck, searching for any number, even a composite occurring between
primes, works pretty much the same way. Of course, the above
approximation gives you a close answer even quicker.

But if you are worried about disk storage, one easy shortcut is to store
a collection of 16-bit differences between successive primes. That's
half the size, and still lets you answer the desired question *pretty*
quickly (extra addition is required)... or at least generate a local
copy of Horne's data structure in one pass.

Moving farther, even this gap structure is quite compressible. Most
gaps are quite a bit smaller than 65536, so the highbits are zeros. In
fact, I am pretty sure that almost all the gaps are less than 256. So
an immediate compression strategy (saving disk space, costing time to
recreate the transparent structure) is to store gaps as 8-bit values,
with a x00 byte escaping into a larger value (I guess in the next two
bytes).

Maybe I'll try it, and see how small I can make the data... unless I do
my real work :-).
I believe you could implement a hybrid scheme that would be quite fast
and still maintain nearly the same level of compression that you
describe above. In addition to the above compressed data, also store,
uncompressed, every Nth prime. A binary search will get you within N
primes of your answer, to find the exact value, recreate those N-primes.
For a N of, for instance, 64 the level of compression would be minimally
affected but should make finding the number of primes less than a given
number number much faster than the basic compressed scheme.

In fact I wouldn't be suprised if this was faster than the uncompressed
scheme since you're less likely to thrash your memory.

-tim
Yours, Lulu...

--
Keeping medicines from the bloodstreams of the sick; food from the bellies
of the hungry; books from the hands of the uneducated; technology from the
underdeveloped; and putting advocates of freedom in prisons. Intellectual
property is to the 21st century what the slave trade was to the 16th.


Jul 18 '05 #6
Lulu of the Lotus-Eaters wrote:
So I only expect around 5,428,681 primes less than 10^8 to occur.
Well, that's not SO much less than 7.8M.
I found 5,761,455 primes < 1E8.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #7
Lulu of the Lotus Eaters:
So I only expect around 5,428,681 primes less than 10^8 to occur.
Well, that's not SO much less than 7.8M.

Klaus Alexander Seistrup I found 5,761,455 primes < 1E8.


http://www.utm.edu/research/primes/howmany.shtml has the same number.
(found by googling for "5,761,455" - 3rd hit. :)

That's using Proof by Consensus,
http://www.isip.msstate.edu/about_us/misc/humor/proofs/

Andrew
da***@dalkescientific.com


Jul 18 '05 #8
Hello Dag,
Is there a python module that includes functions for working with prime
numbers? I mainly need A function that returns the Nth prime number and
that returns how many prime numbers are less than N, but a prime number
tester would also be nice. I'm dealing with numbers in the 10^6-10^8 range
so it would have to fairly efficient

Try gmpy (http://gmpy.sourceforge.net/)

HTH.
Miki
Jul 18 '05 #9
Andrew Dalke wrote:
I found 5,761,455 primes < 1E8.
http://www.utm.edu/research/primes/howmany.shtml has the same
number. (found by googling for "5,761,455" - 3rd hit. :)


Cool!
That's using Proof by Consensus,
http://www.isip.msstate.edu/about_us/misc/humor/proofs/
:-)

// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #10
On Mon, 29 Sep 2003 17:20:21 -0400, Lulu of the Lotus-Eaters <me***@gnosis.cx> wrote:
Stephen Horne <$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$.co.uk> wrote previously:
|I just wrote a fairly simple sieve, which gave primes up to 1,000,000
|in a few seconds (1.8GHz P4, Python code). There were 78498 primes in
|that range...at least in theory there should be less than 100 times as
|many primes in the range up to 10^8.

Quite a few less, actually. Under Gauss' Prime Number Theorem, an
approximation for the number of primes less than N is N/ln(N). I know
there have been some slight refinements in this estimate since 1792, but
in the ranges we're talking about, it's plenty good enough.

So I only expect around 5,428,681 primes less than 10^8 to occur. Well,
that's not SO much less than 7.8M.

|So here's the thought - a binary file containing a complete list of
|primes up to 10^8 would require roughly 30MB (using 32 bit integers,
|which should comfortably handle your requirement).
If the object is to be able to test primeness of an arbitrary candidate number
in range(10**8), I think I might just make a bit map of the 10**8 on disk,
which would only be
10**8/8000000.

12.5
megabytes.

In fact, that's not that bad to hold in memory, but
you could calculate an exact sector to seek to and read and then you
can check the bit pretty fast. It might be interesting to make a timed
access map for sectors, to see what kinds of seek/read delays there are
on the average. But it would be hard/timeconsuming to reset/flush the OS
and controller cache states to get mechanical delay info. Perhaps it has
to be done at the BIOS level out of non-protected DOS or a disk mounted raw...

Anyway, getting to a known (file-relative) sector and only reading that sector
ought to be about as fast as you could get the info off the disk for an arbitrary
number, IWT.

Nth-prime and how-many-below-N would need some other representations.
I wonder if such a data structure is really necessary. Certainly it
produces a class of answers quite quickly. I.e. search for a prime, and
its offset immediately gives you the number of primes less than it.
Heck, searching for any number, even a composite occurring between
primes, works pretty much the same way. Of course, the above
approximation gives you a close answer even quicker.

But if you are worried about disk storage, one easy shortcut is to store
a collection of 16-bit differences between successive primes. That's
half the size, and still lets you answer the desired question *pretty*
quickly (extra addition is required)... or at least generate a local
copy of Horne's data structure in one pass.

Moving farther, even this gap structure is quite compressible. Most
gaps are quite a bit smaller than 65536, so the highbits are zeros. In
fact, I am pretty sure that almost all the gaps are less than 256. So
an immediate compression strategy (saving disk space, costing time to
recreate the transparent structure) is to store gaps as 8-bit values,
with a x00 byte escaping into a larger value (I guess in the next two
bytes). I wouldn't be surprised if most of the gaps are less than 16...
I wonder what the largest gap between primes in range(10**8) is, and what
the distribution of gaps might be.

Maybe I'll try it, and see how small I can make the data... unless I do
my real work :-).

Hm, the bit map stores the same information in another form, so if you
generate that, a zip compression of that ought to give an idea of how
small you are competing with.

You could translate the bit map to a sequence of prime counts for 8 or 16-bit
chunks with a simple 2*8 or 2**16 byte table mapping bytes or 16-bit chunks to
the counts of bits within them, then sum and only fiddle with one bit chunk to
make an nth-prime table...

The bit map ought to be pretty compressible, since all the bits at indices of a
multiple of 2, 3, 5, and 7 beyond the first byte should be zero, so even as bytes,
the alphabet ought to contain a lot less than 256 states.

Or you could pre-compress the bit map using a mask of redundant zeroes (beyond the
first chunk) generated from some smallish set of small primes like 2,3,5,7. Probably
hard to beat zip though.

Regards,
Bengt Richter
Jul 18 '05 #11
Tim Hochberg wrote:
...
I believe you could implement a hybrid scheme that would be quite fast
and still maintain nearly the same level of compression that you
describe above. In addition to the above compressed data, also store,
uncompressed, every Nth prime. A binary search will get you within N
primes of your answer, to find the exact value, recreate those N-primes.
For a N of, for instance, 64 the level of compression would be minimally
affected but should make finding the number of primes less than a given
number number much faster than the basic compressed scheme.


Still thinking of gmpy's primitives, I came up with a variant of this...:

import gmpy
import array
import bisect

highest_number_of_interest = gmpy.mpz(100*1000*1000)

def make_primes(put_them_here, every=1000):
current_prime = gmpy.mpz(put_them_here[-1])
count = 0
while current_prime < highest_number_of_interest:
current_prime = current_prime.next_prime()
count += 1
if count == every:
put_them_here.append(int(current_prime))
count = 0

try:
savefile = open('primes.dat', 'rb')
except:
every_1000th_prime = array.array('l', [1])
make_primes(every_1000th_prime)
savefile = open('primes.dat', 'wb')
every_1000th_prime.tofile(savefile)
savefile.close()
else:
every_1000th_prime = array.array('l', [])
try: every_1000th_prime.fromfile(savefile, 6000)
except EOFError: pass
else:
warnings.warn("Only %d primes (?)" % len(every_1000th_prime))
warnings.warn("Highest is %d" % every_1000th_prime[-1])
# could fill out and re-save the array here, if needed
savefile.close()

print "%d primes stored (1 every 1000), highest is %d" % (
len(every_1000th_prime), every_1000th_prime[-1])

def nth_prime(N):
N_div_1000, N_mod_1000 = divmod(N, 1000)
try: prime = every_1000th_prime[N_div_1000]
except IndexError:
raise ValueError, "must be N<%d" %
(1000*len(every_1000th_prime)+999)
prime = gmpy.mpz(prime)
for x in xrange(N_mod_1000):
prime = prime.next_prime()
return int(prime)

print "55555th prime is", nth_prime(55555)

import bisect
def primes_upto(N):
if N>highest_number_of_interest:
raise ValueError, "must be N<%d" % highest_number_of_interest
x = bisect.bisect_left(every_1000th_prime, N)
if N == every_1000th_prime[x]: return x*1000+1
prime = gmpy.mpz(every_1000th_prime[x-1])
x = (x-1)*1000
while prime < N:
x += 1
prime = prime.next_prime()
if prime == N: x += 1
return x

print "Up to 555555, there are %d primes" % primes_upto(555555)
The first run, building the file primes.dat (just 23048 bytes), took
almost 7 minutes elapsed time on my machine (250 CPU seconds). But
any successive run is extremely fast:

[alex@lancelot perex]$ time python -O prio.py
5762 primes stored (1 every 1000), highest is 99991609
55555th prime is 686671
Up to 555555, there are 45741 primes
0.06user 0.00system 0:00.14elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (494major+297minor)pagefaults 0swaps
[alex@lancelot perex]$

I've coded this rapidly and sloppily, so I wouldn't be surprised if
there were off-by-one errors here and there (needs to be checked against
some other, independent way to generate/check primes!), but I think the
fundamental ideas are sound.
Alex

Jul 18 '05 #12
|> > So I only expect around 5,428,681 primes less than 10^8 to occur.
|> > Well, that's not SO much less than 7.8M.

|Klaus Alexander Seistrup
|> I found 5,761,455 primes < 1E8.

"Andrew Dalke" <ad****@mindspring.com> wrote previously:
|http://www.utm.edu/research/primes/howmany.shtml has the same number.
|(found by googling for "5,761,455" - 3rd hit. :)

Looking at Andrew's resource, I am reminded that N/(log(N)-1) is a
better approximation for "primes up to N" than was Gauss' original. Of
course, as I wrote, both are accurate at the asymptote. But infinity is
a long way to go.
N = 10**8
N/(log(N)-1)

5740303.8072846411

That looks assuringly closer to the specific total than my first pass.
Obviously, enumeration of all the primes in the range is still more
accurate than any such approximation formula (some further such
approximations can be found at the above URL, and elsewhere).

Yours, Lulu...

--
mertz@ _/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: \_\_\_\_ n o
gnosis _/_/ Postmodern Enterprises \_\_
..cx _/_/ \_\_ d o
_/_/_/ IN A WORLD W/O WALLS, THERE WOULD BE NO GATES \_\_\_ z e
Jul 18 '05 #13
bo**@oz.net (Bengt Richter) wrote previously:
|just make a bit map of the 10**8 on disk, which would only be
| >>> 10**8/8000000.
| 12.5
|megabytes.

I had been thinking about this also, but Bengt beat me to posting
something. I think that we could pick up Emile van Sebille's suggestion
about gaps only needing to store even numbers here; the equivalent being
that we only store a bit array of ODD numbers, getting us down to 6.25
mB of storage space (special casing of the prime '2' needs to be
handled, seems manageable :-)).

Doing this does not necessarily produce a smaller storage format than a
zip'd bitmap of all the numbers. But the in-memory image doesn't
require a decompression step (and the stored version can still be
compressed). I guess the test is something like:

prime_bits = open('bitarray.primes','rb').read() # only odd nums
def isPrime(N):
if N < 2: return False
elif N==2: return True
else:
byte, bit = divmod(N, 16)
bit = bit//2
return ord(prime_bits[byte]) & 2**bit

You could also use a memory-mapped file or the like, of course.

|Nth-prime and how-many-below-N would need some other representations.
|You could translate the bit map to a sequence of prime counts for 8 or 16-bit
|chunks with a simple 2*8 or 2**16 byte table mapping bytes or 16-bit chunks to
|the counts of bits within them, then sum and only fiddle with one bit chunk to
|make an nth-prime table...

The ancillary data structure would indeed speed the "how-many-below"
calculation enormously, while not taking too much space.

Along those lines, what's the quickest way to count bits in Python?
Maybe someone has a clever idea... something that would run a whole
bunch faster than my bit mask style above for counting all the "1"s in a
multibyte string.

Yours, Lulu...

--
mertz@ _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis _/_/ Postmodern Enterprises _/_/ s r
..cx _/_/ MAKERS OF CHAOS.... _/_/ i u
_/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/ g s
Jul 18 '05 #14
Alex Martelli wrote:
Tim Hochberg wrote:
...
I believe you could implement a hybrid scheme that would be quite fast
and still maintain nearly the same level of compression that you
describe above. In addition to the above compressed data, also store,
uncompressed, every Nth prime. A binary search will get you within N
primes of your answer, to find the exact value, recreate those N-primes.
For a N of, for instance, 64 the level of compression would be minimally
affected but should make finding the number of primes less than a given
number number much faster than the basic compressed scheme.

Still thinking of gmpy's primitives, I came up with a variant of this...:


[Nice compact version using gmpy snipped]

Just to keep my hand in, I'm attaching a version based on the idea above
which is in turn based on Emile's and Lulu's ideas for compression.

This produces data files that total just over 6 GB for all primes up to
1E8. The initial run is quite slow, but after that it can determine the
nth prime, or the number of primes below some number in about 0.5 ms on
my box.
The first run, building the file primes.dat (just 23048 bytes), took
almost 7 minutes elapsed time on my machine (250 CPU seconds). But
any successive run is extremely fast:

[alex@lancelot perex]$ time python -O prio.py
5762 primes stored (1 every 1000), highest is 99991609
55555th prime is 686671
Up to 555555, there are 45741 primes
0.06user 0.00system 0:00.14elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (494major+297minor)pagefaults 0swaps
[alex@lancelot perex]$

I've coded this rapidly and sloppily, so I wouldn't be surprised if
there were off-by-one errors here and there (needs to be checked against
some other, independent way to generate/check primes!), but I think the
fundamental ideas are sound.
Well one of us is off by one as I get 45740 primes below 55555. However,
I get 686671 as the 55555th prime, so we agree there. I would not be at
all suprised if it is mine that is off by one.

-tim

Alex

import array, bisect, psyco, struct

def eratosthenes(max=None):
'''Yields the sequence of prime numbers via the Sieve of Eratosthenes.'''
# Mostly based on Alex Marteli's version from the Python Cookbook. The max
# attribute, which I addded, trims the memory usage substantially when
# computing lots of primes. Of course you need to know your max prime in
# advance.
D = {} # map composite integers to primes witnessing their compositeness
q = 2 # first integer to test for primality
while (max is None) or (q <= max):
p = D.pop(q, None)
if p:
x = p + q
while x in D:
x += p
if (x is None) or (x <= max):
D[x] = p
else:
q2 = q*q
if (max is None) or (q2 <= max):
D[q2] = q
yield q
q += 1
psyco.bind(eratosthenes)
def encode(delta):
# To encode N-bytes, preface with (N-1) leading zero-bytes
enc = []
delta /= 2
while delta:
delta, byte = divmod(delta, 128)
enc.append(byte)
enc.reverse()
result = ['\0'] * (len(enc) - 1)
for byte in enc[:-1]:
result.append(chr(byte+1))
result.append(chr(enc[-1]))
return ''.join(result)
def decode(chars):
chars = chars.lstrip('\0')
bytes = []
for c in chars[:-1]:
bytes.append(ord(c)-1)
bytes.append(ord(chars[-1]))
value = bytes[0]
for b in bytes[1:]:
value *= 128
value += b
return 2 * value

def pop_delta(data):
# return delta, data_without_delta
index = 0
while data[index] == '\0':
index += 1
index = 2*index + 1
return decode(data[:index]), data[index:]

def encode_primes(n, chunk, basename):
primes = eratosthenes(n)
index_file = open(basename+'.index', 'wb')
everyn_file = open(basename+'.everyn', 'wb')
encoded_file = open(basename+'.encoded', 'wb')
index_file.write(struct.pack('l', chunk))
everyn_file.write(struct.pack('l', chunk))
# We reserve some space now and come back later and fill in the length
encoded_file.write(struct.pack('l', 0))
# We write to the files a chunk at a time.
encoded_chunk = ''
primes.next() # skip 2
p0 = primes.next()
index = 0
for i, p in enumerate(primes):
delta = p - p0
if not (i % chunk):
index += len(encoded_chunk)
index_file.write(struct.pack('l', index))
everyn_file.write(struct.pack('l', p0))
encoded_file.write(encoded_chunk)
encoded_chunk = ''
encoded_chunk += encode(delta)# This is supposed to be fast under psyco
p0 = p
# write leftover data
index += len(encoded_chunk)
index_file.write(struct.pack('l', index))
encoded_file.write(encoded_chunk)
encoded_file.seek(0)
encoded_file.write(struct.pack('l', i))
psyco.bind(encode_primes)
class Primes:
def __init__(self, basename):
sizel = struct.calcsize('l')
index_file = open(basename+'.index', 'rb')
[self.chunk] = struct.unpack('l', index_file.read(sizel))
index_data = index_file.read()
self.indices = struct.unpack("%sl" % (len(index_data) // sizel), index_data)
index_file.close()
everyn_file = open(basename+'.everyn', 'rb')
[chunk] = struct.unpack('l', everyn_file.read(sizel))
assert self.chunk == chunk, "inconsistent chunk sizes (%s vs %s)" % (self.chunk, chunk)
everyn_data = everyn_file.read()
self.everyn = struct.unpack("%sl" % (len(everyn_data) // sizel), everyn_data)
everyn_file.close()
encoded_file = open(basename+'.encoded', 'rb')
[self.nprimes] = struct.unpack('l', encoded_file.read(sizel))
self.encoded = encoded_file.read()

def __len__(self):
return self.nprimes

def __getitem__(self, n):
if n < 0:
n += self.nprimes
if not isinstance(n, int) and (0 <= n < self._len):
raise IndexError()
if n == 0:
return 2
cindex, pindex = divmod((n-1), self.chunk)
return self.decode_chunk(cindex)[pindex]

def decode_chunk(self, n):
cprimes = [self.everyn[n]]
data = self.encoded[self.indices[n]:self.indices[n+1]]
while data:
delta, data = pop_delta(data)
cprimes.append(cprimes[-1] + delta)
return cprimes

def upto(self, n):
if n < 2:
return 0
elif n < 3:
return 1
else:
n0 = max(bisect.bisect_right(self.everyn, n) - 1, 0)
chunk = self.decode_chunk(n0)
return bisect.bisect_right(chunk, n) + n0*self.chunk + 1
basename = "packedprimes"
try:
primes = Primes(basename)
except: # be more specific...
encode_primes(int(1e8), 64, basename)
primes = Primes(basename)

if __name__ == "__main__":

for n in 1, 2, 2.5, 3, 3.5, 5, 13, 13.5, 555555:
print "Up to", n, ", there are %d primes" % primes.upto(n)
print primes[55555-1]

import timeit
t = timeit.Timer("packedprimes.primes.upto(555555)", "import packedprimes")
print "seconds per call to upto =", t.timeit(number=1000) / 1000
t = timeit.Timer("packedprimes.primes[99834]", "import packedprimes")
print "seconds per call to [] =", t.timeit(number=1000) / 1000

#~ eprimes = eratosthenes(int(1e5))
#~ for i, p in enumerate(eprimes):
#~ assert p == primes[i], "mismatch at prime[%s] (%s vs %s)" % (i, p, primes[i])
#~ print "OK"

Jul 18 '05 #15

"Lulu of the Lotus-Eaters" <me***@gnosis.cx> wrote:
Along those lines, what's the quickest way to count bits in Python?
Maybe someone has a clever idea... something that would run a whole
bunch faster than my bit mask style above for counting all the "1"s in a
multibyte string.


Probably far from the fastest, but in the tradition that you'll get a better
answer by posting, here's a stab at a python method for counting bits:

import string,time

nibbles = [(w+x+y+z, w*8+x*4+y*2+z)
for w in (0,1) for x in (0,1)
for y in (0,1) for z in (0,1)]
bitcounts = [
(chr(hinibblevalue*16 + lonibblevalue),
hibitcount+lobitcount)
for hibitcount, hinibblevalue in nibbles
for lobitcount, lonibblevalue in nibbles ]

bytes = dict(bitcounts)

xlater = "".join([chr(bitcount) for byte, bitcount in bitcounts])
def bitcount1(bytestr):
return sum([bytes[ii] for ii in bytestr])

def bitcount2(bytestr):
return sum(map(ord, string.translate(bytestr,xlater)))

t=time.time()
bitcount1('a'*100000)
print time.time()-t

t=time.time()
bitcount2('a'*100000)
print time.time()-t

--
Emile van Sebille
em***@fenx.com
Jul 18 '05 #16
Lulu of the Lotus-Eaters wrote:
...
Along those lines, what's the quickest way to count bits in Python?
Probably gmpy's popcount function.
Maybe someone has a clever idea... something that would run a whole
bunch faster than my bit mask style above for counting all the "1"s in a
multibyte string.
I would try gmpy.mpz(multibytestring, 256).popcount(). E.g.:
gmpy.mpz('gmpy IS pretty fast for many kinds of things',256).popcount()

163

and, on my trusty 30-months-old Athlon box...:

[alex@lancelot python2.3]$ python timeit.py -s'import gmpy' \ -s'x="gmpy IS pretty fast for many kinds of things"' \
'gmpy.mpz(x,256).popcount()'

100000 loops, best of 3: 8.13 usec per loop
[alex@lancelot python2.3]$
Alex

Jul 18 '05 #17
|Lulu of the Lotus-Eaters wrote:
|> Along those lines, what's the quickest way to count bits in Python?

Alex Martelli <al***@aleax.it> wrote previously:
|Probably gmpy's popcount function.

What about fast AND portable?

(I'm sure gmpy can be installed lots of places, but not necessarily by
receivers of Python code).

Yours, Lulu...

--
_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
_/_/ ~~~~~~~~~~~~~~~~~~~~[me***@gnosis.cx]~~~~~~~~~~~~~~~~~~~~~ _/_/
_/_/ The opinions expressed here must be those of my employer... _/_/
_/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them! _/_/
Jul 18 '05 #18
Taking up Bengt's suggestion, and some various others, I created a bit
array version of the primality data structure. That is, each bit of the
file 'primes.bitarray' encodes the primality of the number corresponding
to its bit position. However, only ODD numbers are mentioned in this
file, since it's rather easy to rule out evens (and special case 2).
This produces a 6.25 mB file for the first 10**8 numbers.

I was disappointed that gzip only reduces the size by 41%. I would have
guessed better, given the predominance of 0 bits. Still, no reason not
to use the [gzip] module to read the data. You can download the data
file, if you wish, at:

http://gnosis.cx/download/primes.bitarray.gz

I decided to check how quickly I can check primality using this data
structure. Pretty well, to my mind (for reference, I have a PIII
733Mhz; nothing very fast for nowadays). Here's the code:

#!/usr/bin/python
import gzip

bit_source = gzip.open('primes.bitarray.gz')
prime_bits = bit_source.read(100000)

def isPrime(N):
"Check primality on pre-generated bit field (read more as needed)"
if N < 2: return False
elif N==2: return True
elif not N % 2: return False
else:
# Check if we need to extend prime_bits
global prime_bits
if 8*len(prime_bits) < N:
needed = 1+(N//8) - len(prime_bits)
prime_bits += bit_source.read(needed)
byte, bit = divmod(N, 16)
bit = 7 - bit//2
return ord(prime_bits[byte]) & 2**bit

if __name__=='__main__':
from time import clock
from random import randint
# generate a million numbers to check
tests = []
for _ in xrange(10**6):
tests.append(randint(1,10**8))
start = clock()
primes = 0
for n in tests:
if isPrime(n): primes += 1
print "%d primes in 1e6 numbers (%.2f secs)" % (primes,clock()-start)

Here's the timing:

% test_bits.py
57546 primes in 1e6 numbers (29.57 secs)

I figured there was no need to read the whole data file in unless a
check required it--I just read in a smallish 100k at initialization. Of
course, in the course of a million tests, I am pretty sure I wind up
reading in essentially the whole file at some point.

One nice thing to do would be to make 'bit_source' a bit smarter. I.e.
it could be a class that generated more primes when needed (maybe using
gmpy as Alex suggests). In principle, 'isPrime()' could then check
unbounded primes (subject to time and resource constraints).

The next thing would probably be to store sparce accumulations as Bengt
suggested (and Alex implemented in a different fashion). That is,
another structure could store "every 1000th prime" in order to answer
questions like "how-many-less-than" and "which-is-Nth-prime".

Maybe I'll Emile's bitcounting techniques to do this.

Yours, Lulu...

--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
..cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------
Jul 18 '05 #19
On 30 Sep 2003 13:00:39 GMT, bo**@oz.net (Bengt Richter) wrote:
If the object is to be able to test primeness of an arbitrary candidate number
in range(10**8), I think I might just make a bit map of the 10**8 on disk,
which would only be
10**8/8000000.

12.5
megabytes.


If the object was simply to test primeness I would agree, but the
requirement included determining the number of primes below some given
value. It should be easy enough to fix, though - keep a cached count
for every n primes and only redo the count for the last few.
I think bitsets are the way to go for compression. The idea of only
storing every nth prime until you consider how you implement the seive
for the missing primes. You can't just check the interval - you need
to know the primes in earlier intervals to do the seive checks. You
may end up recalculating a lot of the smaller primes again.
It is not too difficult to count the bits in a single integer. One of
my favorite tricks exploits the fact that a single operation on an
integer can be seen as a parallel operation processing 32 or 64 bits
in one go.

Take an integer. Each bit can be considered a count of the number of
bits in that bit position. With a little bit of masking and shifting,
you can get a value where every pair of bits contains the number of
bits that were in that pair of bits in the original. In the 32 bit
case...

x = (x & 0x55555555) + ((x & 0xAAAAAAAA) >> 1)

Next, get the totals for each group of four bits...

x = (x & 0x33333333) + ((x & 0xCCCCCCCC) >> 2)

And so on, until the value contains a single total of the number of
bits in the whole original value.

x = (x & 0x0F0F0F0F) + ((x & 0xF0F0F0F0) >> 2)
x = (x & 0x00FF00FF) + ((x & 0xFF00FF00) >> 4)
x = (x & 0x0000FFFF) + ((x & 0xFFFF0000) >> 8)

A simple looping method would need 32 shifts, 32 bitwise ands and an
average of 16 increments. This takes 5 shifts, 10 bitwise ands and 5
additions. Probably a worthwhile gain.

LuLu points out that you can compress the bitset even further by
storing bits only for odd numbers. This compression can be extended by
looking at larger repeating patterns. For example, if we look at a
each value modulo 2*3 = 6, we can exclude multiples of both two and
three from the bitset.

0 1 2 3 4 5 - value % 6
* . * . * . - multiples of two
* . . * . . - multiples of three
* . * * * . - multiples of two or three

Basically, you only need to store bits where (value % 6) is one of the
values that gets a dot in the bottom line. Then to determine which bit
you need, you set up something like...

lookup = [None, 0, None, None, None, 1]

if lookup [value % 6] is None :
if (value == 2) or (value == 3) :
value is prime
else :
value is not prime
else
lookup_bit ((value / 3) + lookup [value % 6])

So you only need to hold one bit in three (those with value % 6 equal
to either 1 or 5) with both 2 and 3 being special-cased.

Use modulo 2*3*5 == 30 and I think the bitset reduces to 8/30 == just
under 27%, which tells me that along with a rapidly increasing
pattern-checking lookup table size you get rapidly decreasing returns.
No way am I going to work out the figures for modulo 2*3*5*7 = 210!

The modulo 6 case is certainly worthwhile, and maybe the module 30 if
you really need those extra few % points (which are still worth a
couple of MB at a guess), but I doubt it's worth going further than
that.
--
Steve Horne

steve at ninereeds dot fsnet dot co dot uk
Jul 18 '05 #20
Adding another thought to the discussion. Instead of pregenerating
all primes using a seive, store the cumulative counts for, say,
every 10,000 terms (so 10,000 numbers are needed for the 1E8
range requested by the poster) then use a fast prime tester, like the
Rabin-Miller strong pseudoprime test (see
http://mathworld.wolfram.com/Rabin-M...primeTest.html )
to fill in any missing range. Could cache any hits in a bitstring
along with the counts, like

counts = [(None, 9999), # however many are between 0 and 10000
(None, 8888), # between 10,000 and 20,000 (or 0 to 20,000)
...
]

where the None field stores the bitstring May want to do every
1,000 if the tests are too slow.

pycrypto has Rabin-Miller and here's a copy I found in pure Python:
http://senderek.de/pcp/release/tools/number.py

It uses the first 7 primes for tests, which is good for all numbers
up to 341550071728321 ~ 3.4E14.

Andrew
da***@dalkescientific.com
Jul 18 '05 #21
Lulu of the Lotus-Eaters <me***@gnosis.cx> writes:
I was disappointed that gzip only reduces the size by 41%. I would have
guessed better, given the predominance of 0 bits.


I think it is not a surprise. (But maybe that's because I tried it
before.) You need redundancy to compress data. If bitmap of prime
numbers contained lot of redundancy, you could use it to generate
prime numbers. Of course bitmap is not the most efficient way to store
list of prime numbers, but neither is gzip the most efficient way to
find redundancy in it. IIRC gzip basically encodes repeating bit
patterns and it's hardly a suprise that bitmap of prime numbers
doesn't have lots of repeating bit patterns.

--
Juha Autero
http://www.iki.fi/jautero/
Eschew obscurity!
Jul 18 '05 #22
"Andrew Dalke" <ad****@mindspring.com> wrote previously:
|use a fast prime tester, like the Rabin-Miller strong pseudoprime test

Well sure, you can cheat. But aren't you worried about undiscovered
small pseudoprimes? *wink* (I guess for Rabin-Miller they are not
called Carmichael numbers, just for Fermat pseudoprimes[*])

Yours, Lulu...
[*] Does anyone know whether there are Carmichael-like numbers for
Rabin-Miller tests? Last I knew, it was unproven either way, but that
was a while. That is, can any composite pass R-M for every base?

--
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad
---[ Postmodern Enterprises <me***@gnosis.cx> ]--------------------------
Jul 18 '05 #23
|> I was disappointed that gzip only reduces the size by 41%. I would have
|> guessed better, given the predominance of 0 bits.

Juha Autero <Ju*********@iki.fi> wrote previously:
|patterns and it's hardly a suprise that bitmap of prime numbers
|doesn't have lots of repeating bit patterns.

Sure it does. The bit pattern "0 0 0 0 0 0 0" occurs a WHOLE LOT of
times :-). Certainly much more than any other pattern. After all, my
10^8/2 bits only contain 5,761,453 ones among them[*]--there's got to be
a lot of runs of zeros in there :-).

The problem, I think, is that these bit patterns don't fall on byte
borders as neatly as I would like. I am pretty sure that I could write
a custom LZ-style compressor that did better by being bit-oriented
rather than byte-oriented. But I'm not going to do that work, 3.8 megs
is small enough for me.

FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**]. I proposed listing the gaps as a strategy; but given
that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
fond of the speed of the bit array.

Yours, Lulu...
[*] I'm missing two primes versus some other claims. My odd numbers
don't include 2; and I called 1 a non-prime. I don't feel strongly
about the second matter though.

[**] Well, a MUCH shorter representation is a Python-coded algorithm for
generating the primes in the first place. But somehow that seems like
cheating :-).

--
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad
---[ Postmodern Enterprises <me***@gnosis.cx> ]--------------------------
Jul 18 '05 #24
Lulu of the Lotus-Eaters:
FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**].
As I recall, the Burrows-Wheeler transform used in bzip2 uses
a clever way to rearrange the data in a block into a form which is
more compressible, yes?

Well, what about rearranging the bits so that every multiple of
3 is before every non-multiple of 3, then every multiple of 5, etc.
There's a simple algorithm to unravel the ordering, and you'll
end up with lot of 0s at the start of the block. It's very much like
the ordering+compression you did when you moved all the terms
divisible by 2 to the front of the group then realized they could
all be represented as part of the decompression algorithm.

Extend my idea far enough and you'll have a prime sieve.
[**] Well, a MUCH shorter representation is a Python-coded algorithm for
generating the primes in the first place. But somehow that seems like
cheating :-).
Why doesn't bzip2 feel like cheating? It's good at compressing
text because it's designed for the sort of patterns found in text.
Why doesn't mp3 feel like cheating? It's designed to handle the
sounds we normally hear, as compared to the output of a
random noise generator.

In essense, if you know the full corpus beforehand, it's really
easy to compress - just store the identifier for each text.
There's only one text here so the result is trivial compressed
to nothing.

The problem is, your algorithm tries to compress an ordered
list of odd numbers, and so fails to compress as well as a
compression algorithm tuned for just compressing the first
N primes.
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad


Hmm... Seems dated. Terms like whitewater and Clinton
(Bill, not Hillary) don't hit many hot buttons these days, the
number of people in various milita has gone way down[*]
and Libya is working towards renormalization.

What about these words?

Iran nuclear neocon expose POTUS patriot Pakistan armed weaponized
enriched uranium UN smallpox Gitmo invasion Castro Tikrit revolution sarin

Andrew
da***@dalkescientific.com
[*] Actually, US law states
http://www4.law.cornell.edu/uscode/10/311.html

(a) The militia of the United States consists of all able-bodied males at
least 17
years of age and, except as provided in section 313 of title 32, under 45
years of age who are, or who have made a declaration of intention to
become, citizens of the United States and of female citizens of the United
States who are members of the National Guard.

(b) The classes of the militia are -
(1) the organized militia, which consists of the National Guard and
the Naval Militia; and

(2) the unorganized militia, which consists of the members of the
militia who are not members of the National Guard or the Naval Militia

Therefore, the numbers of people in the militia hasn't changed that
drastically -- and I'm a member of the unorganized militia of the US.
Maybe I should get a patch saying that, and wear it next time
I visit Europe? :)
Jul 18 '05 #25
Lulu of the Lotus-Eaters:
Well sure, you can cheat. But aren't you worried about undiscovered
small pseudoprimes?


I am. I'm doing an extensive study of the small pseudoprimes between
0 and 1. I'm working on the distributed computing framework now,
and hope to start the actual mathematical analysis in about 3 months.

Andrew
da***@dalkescientific.com
Jul 18 '05 #26
Lulu of the Lotus-Eaters wrote:
|Lulu of the Lotus-Eaters wrote:
|> Along those lines, what's the quickest way to count bits in Python?

Alex Martelli <al***@aleax.it> wrote previously:
|Probably gmpy's popcount function.

What about fast AND portable?

(I'm sure gmpy can be installed lots of places, but not necessarily by
receivers of Python code).


Why not? gmpy's downloads include a self-contained .pyd for Windows
users (the ones who typically lack easy access to a C compiler) -- don't
most users of Linux, MacOs and other BSD's, etc, typically have gcc
available to install extensions if they want...? Or are you worrying
about the operations' availability to users of Jython...?

I see you guys are having a great time concocting pure-Python ways to
count bits, etc -- but GMP, the library underying gmpy, has carefully
tuned machine-code accelerations for many important operations and
platforms... if one really needs to perform advanced multi-precision
arithmetic, as opposed to just playing with it, GMP is just too
precious a resource to ignore it. Given how widespread Java is, I would
not be surprised if libraries of similar quality were available there
(with native-methods and all) -- I just don't know of any because I
never needed them, so didn't research the issue.

I do acknowlegde the usefulness of pure Python (AND readable, NOT
distorted for fun & optimization-profit:-) code in *teaching* even
the lowest-level concepts of multi-precision arithmetic & bit-fiddling.
And Python is ideal for playing with higher-level architectures, such
as you guys have been doing about various potential ways to store
compressed pre-computed representations of primes. Finally, one day,
if and when pypy triumphs, one will no doubt need to start with pure
Python code (most likely SIMPLE, NON-TRICKY pure Python code) in
order to obtain maximum speed.

But from a pragmatic point of view, I'll keep using gmpy for the
foreseeable future, thanks;-).
Alex
Jul 18 '05 #27
Alex Martelli wrote:
Lulu of the Lotus-Eaters wrote:
...
Along those lines, what's the quickest way to count bits in Python?
Probably gmpy's popcount function.
Maybe someone has a clever idea... something that would run a whole
bunch faster than my bit mask style above for counting all the "1"s in a
multibyte string.


I would try gmpy.mpz(multibytestring, 256).popcount(). E.g.:
gmpy.mpz('gmpy IS pretty fast for many kinds of things',256).popcount() 163

and, on my trusty 30-months-old Athlon box...:

[alex@lancelot python2.3]$ python timeit.py -s'import gmpy' \
-s'x="gmpy IS pretty fast for many kinds of things"' \
'gmpy.mpz(x,256).popcount()'

100000 loops, best of 3: 8.13 usec per loop
[alex@lancelot python2.3]$


Silly me, for not including a "pure Python" alternative run on the
same machine for comparison purposes -- sorry. Given the following
bitcount.py module (only the initialization is coded with gmpy,
and just because I'm lazy -- that doesn't affect timings!):

import gmpy

byte_bit_counts = dict([(chr(b), gmpy.popcount(b)) for b in range(256)])

def popcount(s, bbc=byte_bit_counts):
return sum([bbc[c] for c in s])
import bitcount
x="gmpy IS pretty fast for many kinds of things"
bitcount.popcount(x)

163

and, on the same box:

[alex@lancelot python2.3]$ python timeit.py -s'import bitcount' \ -s'x="gmpy IS pretty fast for many kinds of things"' \
'bitcount.popcount(x)'

10000 loops, best of 3: 82.2 usec per loop

So, the speed advantage of gmpy is just about one order of magnitude
for this trivial operation.
Alex

Jul 18 '05 #28
Alex Martelli <al***@aleax.it> writes:
Lulu of the Lotus-Eaters wrote:
|Lulu of the Lotus-Eaters wrote:
|> Along those lines, what's the quickest way to count bits in Python?

Alex Martelli <al***@aleax.it> wrote previously:
|Probably gmpy's popcount function.

What about fast AND portable?

(I'm sure gmpy can be installed lots of places, but not necessarily by
receivers of Python code).
Why not? gmpy's downloads include a self-contained .pyd for Windows
users (the ones who typically lack easy access to a C compiler) -- don't
most users of Linux, MacOs and other BSD's, etc, typically have gcc
available to install extensions if they want...? Or are you worrying
about the operations' availability to users of Jython...?


I'm not sure "most" OS X installations have the dev tools installed.

I'm not sure GMP has been ported to, e.g. VMS.
But from a pragmatic point of view, I'll keep using gmpy for the
foreseeable future, thanks;-).


Well, sure.

Cheers,
mwh

--
Hmmm... its Sunday afternoon: I could do my work, or I could do a
Fourier analysis of my computer's fan noise.
-- Amit Muthu, ucam.chat (from Owen Dunn's summary of the year)
Jul 18 '05 #29
Lulu of the Lotus-Eaters wrote:
[SNIP]
FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**]. I proposed listing the gaps as a strategy; but given
that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
fond of the speed of the bit array.

[SNIP]

FWIW, using escaped nibbles plus compression put me down around 3.4
megs. This includes storing enough metadata to do fast how many primes
below n testing. Much slower for straight primality testing than the
bitmap scheme though.

-tim

Jul 18 '05 #30
"Andrew Dalke" <ad****@mindspring.com> wrote previously:
|In essense, if you know the full corpus beforehand, it's really
|easy to compress - just store the identifier for each text.

I've invented an excellent compression algorithm. It represents the
12.5 meg bit sequence for prime/no-prime (<10^8) as the bit "1". To
encode other sequences, you use the escape bit "0" as the first bit...

Pick the best algorithm for the job at hand...

Yours, Lulu...

|Iran nuclear neocon expose POTUS patriot Pakistan armed weaponized
|enriched uranium UN smallpox Gitmo invasion Castro Tikrit revolution sarin

Yeah... I like those. I'll add a new .signature to my rotation list.
I can troll both the spooks of the mid-1990s and those of the mid-2000s.

--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
..cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------

Jul 18 '05 #31
Stephen Horne:
I think bitsets are the way to go for compression. The idea of only
storing every nth prime until you consider how you implement the seive
for the missing primes.
Well, my idea was to store the counts for every, say, 10,000th prime
then do a fast prime test to fill in missing ranges, and cache that list
of primes for later use.
Basically, you only need to store bits where (value % 6) is one of the
values that gets a dot in the bottom line. Then to determine which bit
you need, you set up something like...


You could also get better compression testing for %5, %7, %11, etc.
Note that by doing so you convert your lookup system into a simple
prime tester, and if you extend it along 2, 3, 5, ... then it is the seive
of Eratosthenes and you don't need to store any bit patterns at all.

Andrew
da***@dalkescientific.com
Jul 18 '05 #32
On Wed, 01 Oct 2003 19:53:04 GMT, "Andrew Dalke"
<ad****@mindspring.com> wrote:
You could also get better compression testing for %5, %7, %11, etc.
That was the point of the modulo 2*3*5 == 30 and modulo 2*3*5*7 ==
210. As I said, working with the pattern needs a rapidly increasing
modulo size and lookup table, and the advantages rapidly diminish as
you add more and more primes.
Note that by doing so you convert your lookup system into a simple
prime tester, and if you extend it along 2, 3, 5, ... then it is the seive
of Eratosthenes and you don't need to store any bit patterns at all.


No, you don't need the bitset - but the whole thing becomes a waste of
time and space. If you extend it sufficiently to not need any bit
patterns at all, what you do need is a huge modulo-based lookup table
that serves no real purpose at all (it only has benefits for the
second and so-on repetitions of the repeating pattern) and a complete
list of the 'special case' primes (ie all primes up to 10^8).

Storing the list of special primes up to 10^8 would take much more
space than the original bitset, and the lookup table would take even
more space. The whole exercise would be far worse than pointless as
you could get more efficient results by discarding the method and the
lookup table, and just searching that list of 'special case' primes.

Optimisation problems often require a balance. In this case, I was
trying to balance prior knowledge of the repeating pattern of definite
non-primes in the program with externally stored data for the bulk of
the primes (the bitset on disk).
--
Steve Horne

steve at ninereeds dot fsnet dot co dot uk
Jul 18 '05 #33
> "Lulu of the Lotus-Eaters" <me***@gnosis.cx> wrote:
Along those lines, what's the quickest way to count bits in Python?
Maybe someone has a clever idea... something that would run a whole
bunch faster than my bit mask style above for counting all the "1"s in a
multibyte string.


FWIW, using array pops the speed up significantly for long strings:

import string, time, array

nibbles = [(w+x+y+z, w*8+x*4+y*2+z)
for w in (0,1) for x in (0,1)
for y in (0,1) for z in (0,1)]

bitcounts = [
(chr(hinibblevalue*16 + lonibblevalue),
hibitcount+lobitcount)
for hibitcount, hinibblevalue in nibbles
for lobitcount, lonibblevalue in nibbles ]

bytes = dict(bitcounts)

xlater = "".join([chr(bitcount) for byte, bitcount in bitcounts])

def bitcount1(bytestr):
return sum([bytes[ii] for ii in bytestr])

def bitcount2(bytestr):
return sum(map(ord, string.translate(bytestr,xlater)))

def bitcount4(bytestr):
return sum(array.array("B",string.translate(bytestr,xlate r)))

refstr = 'a'*1000000

t=time.time()
bitcount1(refstr)
print time.time()-t

t=time.time()
bitcount2(refstr)
print time.time()-t

t=time.time()
bitcount4(refstr)
print time.time()-t
--

Emile van Sebille
em***@fenx.com
Jul 18 '05 #34
On Wed, 01 Oct 2003 02:35:31 -0400, Lulu of the Lotus-Eaters <me***@gnosis.cx> wrote:
|> I was disappointed that gzip only reduces the size by 41%. I would have
|> guessed better, given the predominance of 0 bits.

Juha Autero <Ju*********@iki.fi> wrote previously:
|patterns and it's hardly a suprise that bitmap of prime numbers
|doesn't have lots of repeating bit patterns.

Sure it does. The bit pattern "0 0 0 0 0 0 0" occurs a WHOLE LOT of
times :-). Certainly much more than any other pattern. After all, my
10^8/2 bits only contain 5,761,453 ones among them[*]--there's got to be
a lot of runs of zeros in there :-).

The problem, I think, is that these bit patterns don't fall on byte
borders as neatly as I would like. I am pretty sure that I could write
a custom LZ-style compressor that did better by being bit-oriented
rather than byte-oriented. But I'm not going to do that work, 3.8 megs
is small enough for me.
Hm, a bit sequence 001010010010000100100001001000010000001001 ... could be
encoded in a series of symbols '001', '01', '001', '001', '00001', '001', ...
which are essentially just different string representations for prime first differences,
e.g., (faking in the starting 0)
prime = [0,2,3,5,7,11,13,17,19,23,29,31]
['0'*(prime[i+1]-prime[i])+'1' for i in range(len(prime)-1)] ['001', '01', '001', '001', '00001', '001', '00001', '001', '00001', '0000001', '001']

Or, using different concrete symbols, [chr(prime[i+1]-prime[i]+ord('a')) for i in range(len(prime)-1)]

['c', 'b', 'c', 'c', 'e', 'c', 'e', 'c', 'e', 'g', 'c']

Maybe it would be best to analyze the frequency and assign huffman codes and
pack those, for maximal compression? I guess that is about what zip would do to
the latter (byte-oriented, as you mentioned) symbols.

But actually, I was looking at a modulo 2*3*5*7... mask as a pre-compression step, which
might pack things even tighter, because it eliminates whole structured patterns
from the whole. One could do a bigger modulo too. If I get ambitious, I might
try it. I have an inkling of a multi-step modulo, multi-bitmap check that could go
fast in smaller space, but I have to meet someone now...

[Back next day, didn't post that]

The idea alluded to above is to make a mask using a selected part of the front
of the big bitmap, and noting that its zeroes are guaranteed to repeat in forward
repeats of that mask, so those zeroes don't need actual representation going forward.

I.e., if pfact(p) is the "factorial" product of all primes from p on down, then
if mlen=pfact(p), [bitmap[i:i+mlen] for i in range(1,len(bitmap):mlen)] will be
(untested) a list of bitmap chunks with guaranteed zeroes everywhere bitmap[1:mlen+1]
had them. I downloaded your bitmap, so it should be easy to confirm...

(Ok, but I reconstituted the theoretical bitmap by including factors of 2 and the
primes 1 & 2 in the following)

===< primask.py >================================================= ===
def getbitmasks(fpath, nb):
chunksize = (nb/2+7)//8 # nb/2 since even bits are not in bitfile
f = file(fpath,'rb')
bits = []
def bufbits():
if len(bits)>=nb: return
bitchunk = f.read(chunksize) # bigendian bytes w/ bits for odd no's 1,3,5...
for byte in bitchunk:
ibyte = ord(byte)
for b in xrange(8):
bits.append('.1'[(0x80>>b)&ibyte>0])
bits.append('.') # add in evens

bufbits()
bits[0:2] = ['1','1'] # account for primes 1 & 2 not in bit file at front
while 1:
bufbits() # make sure
if not bits: break
yield ''.join(bits[:nb])
bits = bits[nb:]

def primask(p, nlines=5):
pf = {2:2, 3:6, 5:30, 7:210, 11:2310, 13:30030, 17:510510}[p] # should suffice
bgen = getbitmasks('primes.bitarray', pf)
print 'Prime for mask = %s, mask length = %s' % (p, pf)
for n in xrange(nlines):
s = bgen.next()
nz = s.count('.')
if not n: print 'Precompression possible: (%s-%s zeroes)/%s => %.1f %% of orig' %(
pf,nz,pf,100.*(pf-nz)/pf)
print '%4s: %s' %(nz, len(s)<80 and s or s[:80]+' etc.')

if __name__ == '__main__':
import sys
try:
args = map(int, sys.argv[1:])
p = args.pop(0)
nl = args and args.pop(0) or 5
primask(p, nl)
except: raise; SystemExit("""
Usage: primask prime [number_of_lines]
where prime determines size of mask as product of primes at and below this
and number of lines is number to print.""")
================================================== ===================

This will print segments of the bit map (reconstituted to include all numbers starting with 1,2,3,...).
The segments are the length of a possible mask taken from the front whose non-prime slots are bound
to repeat through the rest, since corresponding slots will be multiples of the primes in the first set.
Since those zeroes repeat, one could remove them from the rest of the bit file, as you did with
even numbers. I print the possible compression based on the number of zeroes in the first segment.
So unless I goofed, ...

(To read primes from masks start with 1,2,3,(.),5,(.) etc)

[11:00] C:\pywk\clp\prime>primask.py 3
Prime for mask = 3, mask length = 6
Precompression possible: (6-2 zeroes)/6 => 66.7 % of orig
2: 111.1.
123 5 <<-- (primes represented, etc)
4: 1...1.
7 +--11
4: 1...1.
| +--17
+------13
4: 1...1.
5: ....1.

[11:00] C:\pywk\clp\prime>primask.py 5
Prime for mask = 5, mask length = 30
Precompression possible: (30-19 zeroes)/30 => 36.7 % of orig
19: 111.1.1...1.1...1.1...1.....1.
23: 1.....1...1.1...1.....1.....1.
23: 1.....1...1.1.....1...1.....1.
24: ......1...1.1...1.1...1.......
25: ......1...1.....1.1.........1.

[11:00] C:\pywk\clp\prime>primask.py 7
Prime for mask = 7, mask length = 210
Precompression possible: (210-163 zeroes)/210 => 22.4 % of orig
163: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
175: 1...........1...1.1...1.....1.1.........1.....1... ..1.....1.1.....1...1.1....... etc.
177: 1.........1.1.....1...1.....1.......1...1.1...1... ........1.......1...1.......1. etc.
178: 1.........1.1...1.....1.....1.1...........1...1... ..1.......1.........1.......1. etc.
180: ............1...1.1...1.............1...1.1...1... ................1...1.......1. etc.

[11:00] C:\pywk\clp\prime>primask.py 11
Prime for mask = 11, mask length = 2310
Precompression possible: (2310-1966 zeroes)/2310 => 14.9 % of orig
1966: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
2030: 1.....................1.....1.1.....1...1.....1... ..........1.....1...1.1.....1. etc.
2043: 1...............1.1...1.....1.1.....1.....1....... ..1.....1...........1......... etc.
2055: ................1.1.........1.1.....1...1.....1... ..1.......1.....1...1......... etc.
2064: 1...............1...................1...1.1....... ..1.................1.......1. etc.

[11:00] C:\pywk\clp\prime>primask.py 13
Prime for mask = 13, mask length = 30030
Precompression possible: (30030-26781 zeroes)/30030 => 10.8 % of orig
26781: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
27216: ................1...........1...........1......... ........1.1.....1.....1.....1. etc.
27366: ................1.....1.....1.1.........1.1...1... ................1.....1.....1. etc.
27444: ................1.............1.....1............. ........1.............1....... etc.
27481: 1...................................1.....1...1... ..........1...........1.....1. etc.

[11:00] C:\pywk\clp\prime>primask.py 17
Prime for mask = 17, mask length = 510510
Precompression possible: (510510-468178 zeroes)/510510 => 8.3 % of orig
468178: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1... ..1.....1.1.....1...1.1.....1. etc.
472790: ..................1.....................1.1....... ........1...........1.1.....1. etc.
474186: ......................1.......................1... ..1.......1.....1...1.1....... etc.
475041: ..................1...........1.....1............. ..........1................... etc.
475727: ..................1.................1.....1....... ................1...1......... etc.

So it appears we could compress that file (imagining it as containing evens also) quite a bit,
by using a (510510+7)//8 => 63814 byte mask as the pattern for squishing zeroes out of the rest.

BTW, I thought primes ran about 10% of all numbers, so 8.3 % seems a little suspicious.
Did I goof, or what is the right percentage? I guess I'll make a dict of all the gap sizes
with their respective counts in the bit map and see, but have to leave this for now ...

FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**]. I proposed listing the gaps as a strategy; but given
that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
fond of the speed of the bit array.

Yours, Lulu...

[*] I'm missing two primes versus some other claims. My odd numbers
don't include 2; and I called 1 a non-prime. I don't feel strongly
about the second matter though.

[**] Well, a MUCH shorter representation is a Python-coded algorithm for
generating the primes in the first place. But somehow that seems like
cheating :-).


Oops, better post this and go...

Regards,
Bengt Richter
Jul 18 '05 #35
bo**@oz.net (Bengt Richter) wrote previously:
|I thought primes ran about 10% of all numbers, so 8.3 % seems a little
|suspicious.

The primes become sparser over larger sizes. Thinking about the Seive
of Erotosthenes makes that apparent, I think: the larger you get, the
more values get filtered out.

However, your ratios, I believe, *must* be wrong:

|Prime for mask = 3, mask length = 6
|Precompression possible: (6-2 zeroes)/6 => 66.7 % of orig
|Prime for mask = 5, mask length = 30
|Precompression possible: (30-19 zeroes)/30 => 36.7 % of orig
|Prime for mask = 7, mask length = 210
|Precompression possible: (210-163 zeroes)/210 => 22.4 % of orig
[...]

Bracket the Python implementation, just think of the numbers. If you
have a sequence of N bits, the 2 mask takes out half of them. What's
left is odd numbers: exactly 1/3 of which are divisible by 3. Of the
filtered list, exactly 1/5 of what's left is divisible by 5. And so
on.[*]
[*] Actually, exact divisibilty can be off-by-one, since the original
sequence might not divide evenly by all those primes.

So, in other words, the 2 mask gets you to 50%; the 2+3 mask gets
1/2*2/3==33%; the 2+3+5 mask gets 1/2*2/3*4/5==27%; the 2+3+5+7 mask
gets 1/2*2/3*4/5*6/7==22.8%; and so on.

Verifying this "experimentally":
len(filter(lambda n: n%2, range(100))) 50 len(filter(lambda n: n%2 and n%3, range(100))) 33 len(filter(lambda n: n%2 and n%3 and n%5, range(100))) 26 len(filter(lambda n: n%2 and n%3 and n%5 and n%7, range(100))) 22 len(filter(lambda n: n%2 and n%3 and n%5 and n%7 and n%11, range(100))) 21 len(filter(lambda n: n%2 and n%3 and n%5 and n%7 and n%11 and n%13, range(100))) 20 len(filter(lambda n: n%2 and n%3 and n%5 and n%7 and n%11 and n%13 and n%17, range(100)))

19

Anything that is left over after this filtering is--by definition--of
unknown primality. You need to encode that by putting a 1 or 0 in the
bit position corresponding to the number remaining. IOW:

toEncode = filter(lambda n: n%2 and n%3 and n%5 and n%7, range(10**8)))
for n in toEncode:
if isPrime(n):
bitfield.append(1)
else:
bitfield.append(0)

You might want to run that first line on a fast machine with plenty of
memory. :-)

Yours, Lulu...

--
mertz@ _/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: \_\_\_\_ n o
gnosis _/_/ Postmodern Enterprises \_\_
..cx _/_/ \_\_ d o
_/_/_/ IN A WORLD W/O WALLS, THERE WOULD BE NO GATES \_\_\_ z e
Jul 18 '05 #36
Lulu of the Lotus-Eaters <me***@gnosis.cx> wrote:
Anything that is left over after this filtering is--by definition--of
unknown primality. You need to encode that by putting a 1 or 0 in the
bit position corresponding to the number remaining. IOW:

toEncode = filter(lambda n: n%2 and n%3 and n%5 and n%7, range(10**8)))
for n in toEncode:
if isPrime(n):
bitfield.append(1)
else:
bitfield.append(0)

You might want to run that first line on a fast machine with plenty of
memory. :-)


I've been keeping some code to myself, because there seemed to be no
direct connection to this thread. However, now that you mention
needing a lot of resources there's a reason to post because my script
creates an infinite generator function that can produce the numbers
for your toEncode list more efficiently.

Anton

def period(L):
"""suppose we enumerate all integers >= 2 which are
not a multiple of any of the numbers in L (L is a list
of prime numbers) and take the first order differences.
This list repeats itself after some time. The period length
is the product of [x-1 for x in L]."""
return reduce(lambda i,j: i*(j-1),L,1)

def primedeltas(L):
"""return one full period of differences"""
diffs,p = [],period(L)
prev = i = 1
while len(diffs) < p:
i += 2
for j in L:
if i%j == 0: break
else:
diffs.append(i-prev)
prev= i
return diffs

def excludemultiples(L):
"""return each number in L and then return numbers
that are not multiples of any number in L. L is a list
of primes"""
i,pp = 1,primedeltas(L)
for p in L: yield p
while 1:
for p in pp:
i += p
yield i

def test():
L= [2,3,5,7]
em = excludemultiples(L)
for i,m in enumerate(em):
print m,
if i==14:
print
break
#print [x/2 for x in primedeltas(L)]

if __name__=='__main__':
test()
Jul 18 '05 #37

This discussion thread is closed

Replies have been disabled for this discussion.

By using this site, you agree to our Privacy Policy and Terms of Use.