Bytes IT Community

Large Dictionaries

Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris
May 15 '06 #1
57 Replies



[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?


PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."

--
Richie Hindle
ri****@entrian.com
May 15 '06 #2

In article <1147699064.107490@teuthos>, Chris Foote <ch***@foote.com.au>
wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s


Are those clock times or CPU times?

How much memory is your process using and how much is available on your
machine?

I'm guessing an integer takes 4 bytes and a long integer takes roughly one
byte per two decimal digits. Plus a few more bytes to bundle them up into
a tuple. You've probably got something like 20 bytes per key, so 5M of
them is 100 meg just for the keys.

To get reasonable hashing performance, the hash table needs to be maybe
half full, and figure a hash key is 4 bytes, so that's another 40 meg for
the hash table itself.

Plus whatever the values stored in the dictionary take up. Even if you're
not storing any values, there's probably 4 bytes per entry for a null pointer
(or a reference to None), so that's another 40 meg.

These are all vague guesses, based on what I think are probably
conservative estimates of what various objects must take up in memory, but
you see where this is going. We're already up to 180 meg. I wonder if the
whole problem is that you're just paging yourself to death. A few minutes
watching your system memory performance with ps or top while your program
is running might give you some idea if this is the case.
May 15 '06 #3

Chris Foote wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris

Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.

Claudio
May 15 '06 #4

In article <ro***********************@reader1.panix.com>,
Roy Smith <ro*@panix.com> wrote:
In article <1147699064.107490@teuthos>, Chris Foote <ch***@foote.com.au>
wrote:

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s


Are those clock times or CPU times?


And what are these times measuring? Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.
Once the dict is constructed, lookup times should be quite good.
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

"I saw `cout' being shifted "Hello world" times to the left and stopped
right there." --Steve Gonedes
May 15 '06 #5

Aahz <aa**@pythoncraft.com> wrote:
Don't forget that adding keys requires resizing the dict, which is a
moderately expensive operation.


My guess would be that each resize grows the table by a constant
factor, which, IIRC, works out to amortized O(n) overall. It's not as
good as creating the dict the right size in the first place, but it's really
not that bad. Somewhat more troubling is that it can lead to memory
fragmentation problems.

I don't understand why Python doesn't have a way to give a size hint
when creating a dict. It seems so obvious to be able to say "d = dict
(size=10*1000*1000)" if you know beforehand that's how many keys
you're going to add.

Looking at the docs, I'm astounded to discover that you can indeed do
exactly that, but you get {size:10000000}. I am befuddled as to why
people thought creating a dict by keyword would be a useful thing to
do, but left out (and, indeed, eliminated the chance to add the syntax
later) the obviously useful ability to hint at the size. I suppose we
could still add:

d = {}
d.reserve (10*1000*1000)
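As an aside, the gap such a hint would close is easy to measure. A rough sketch (new timing code, not from the thread; `N` is an arbitrary size and the numbers will vary by machine and Python version):

```python
import time

N = 1_000_000  # arbitrary size for the comparison

# Incremental insertion: the dict must repeatedly resize as it grows.
start = time.perf_counter()
d1 = {}
for i in range(N):
    d1[i] = None
t_loop = time.perf_counter() - start

# dict.fromkeys builds the same mapping in a single C-level pass.
start = time.perf_counter()
d2 = dict.fromkeys(range(N))
t_fromkeys = time.perf_counter() - start

print("loop: %.2fs  fromkeys: %.2fs" % (t_loop, t_fromkeys))
```

Both ways end up with the identical dictionary; only the construction path differs.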
May 15 '06 #6

On Monday 15 May 2006 at 19:24, Roy Smith wrote:
d = {}
d.reserve (10*1000*1000)


d={}.fromkeys(xrange(5*10**6)) ?

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
May 15 '06 #7

"Claudio Grondi" <cl************@freenet.de> wrote in message
news:e4**********@newsreader3.netcologne.de...
Chris Foote wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris

Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.

Claudio


sqlite also supports an in-memory database - use pysqlite
(http://initd.org/tracker/pysqlite/wiki) to access this from Python.

-- Paul
May 15 '06 #8

Maric Michaud schrieb:
On Monday 15 May 2006 at 19:24, Roy Smith wrote:
d = {}
d.reserve (10*1000*1000)


d={}.fromkeys(xrange(5*10**6)) ?


That is a totally different beast. You don't want to insert arbitrary
keys, you want the internal hash-array to be of the right size.

Diez
May 15 '06 #9

Roy Smith:
I am befuddled as to why
people thought creating a dict by keyword would be a useful thing to
do, but left out (and, indeed, eliminated the chance to add the syntax
later) the obviously useful ability to hint at the size.


Adding the keyword syntax doesn't cost a Python programmer much to
remember, because it's the same syntax used elsewhere, while adding yet
another method to dicts enlarges their API. Python is
usually designed to have smaller APIs, to help programmers remember
most things.

I agree that a reserve method for Python dicts/sets could be a little
useful, but Python programmers are usually more concerned with doing
things in a short programmer time and short code space than with telling
the computer how to do such things in the fastest way (C++ is more
speed-oriented, so the reserve of the STL hashes can be more useful there).

Bye,
bearophile

May 15 '06 #10

1. Why is two minutes to insert 5M keys "bad" for you? What would be
"good"? What would be good/bad look-up times? Have you measured the
typical look-up time? How often is the dict creation required to be
done? How often does the data change? Is multi-user access required for
(a) look-up (b) updating? Have you considered loading the dict from a
pickle?
2. Assuming your code that is creating the dict looks in essence like
this:
adict = {}
for k, v in some_iterable:
    adict[k] = v
then any non-linear behaviour can only be in the actual CPython
insertion code. Psyco can't help you there. Psyco *may* help with the
linear part, *if* you have enough memory. What are the corresponding
times without Psyco? In any case, if your code isn't (conceptually)
that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco -> Intel x86, but what
chip exactly? How much free memory?
4. Consider printing time-so-far results, say every 100K keys. Multiple
step-ups might indicate dict resizings. A dog-leg probably means
running out of memory. Why "roughly" 5M keys???
5. How large are your long_integers?
6. What is the nature of the value associated with each key?
7. Have you experimented with key = a * 2 ** 32 + b instead of key =
(a, b)?
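Point 7 is easy to try out. A sketch (new code; synthetic 11-digit numbers stand in for the poster's phone-number keys, and the exact ratio will vary by machine):

```python
import time

# Synthetic (long, int) pairs; 11-digit numbers stand in for phone numbers.
pairs = [(10_000_000_000 + i, i % 10) for i in range(500_000)]

# Tuple keys: one tuple object is allocated and hashed per key.
start = time.perf_counter()
d_tuple = {(a, b): None for a, b in pairs}
t_tuple = time.perf_counter() - start

# Packed integer keys: valid as long as b always fits below 2**32.
start = time.perf_counter()
d_int = {a * 2**32 + b: None for a, b in pairs}
t_int = time.perf_counter() - start

print("tuple keys: %.2fs  packed int keys: %.2fs" % (t_tuple, t_int))

# Lookups must pack the pair the same way before probing.
a, b = pairs[0]
print((a * 2**32 + b) in d_int)  # True
```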

HTH,
John

May 15 '06 #11

Sounds like PyTables could be useful.

http://www.pytables.org

--
lpc

May 16 '06 #12

Roy Smith wrote:
In article <1147699064.107490@teuthos>, Chris Foote <ch***@foote.com.au>
wrote:
I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s
Are those clock times or CPU times?


User + system CPU time.
How much memory is your process using and how much is available on your
machine?
A very large amount of space is consumed, which is why I didn't give
times for the 10M version; it would have eaten my 1G of RAM and gone
into swap :-)
I'm guessing a integer takes 4 bytes and a long integer takes roughly one
byte per two decimal digits. Plus a few more bytes to bundle them up into
a tuple. You've probably got something like 20 bytes per key, so 5M of
them is 100 meg just for the keys.

To get reasonable hashing performance, the hash table needs to be maybe
half full, and figure a hash key is 4 bytes, so that's another 40 meg for
the hash table itself.

Plus whatever the values stored in the dictionary take up. Even if you're
not storing any values (i.e., there's probably 4 bytes for a null pointer
(or reference to None), so that's another 40 meg.

These are all vague guesses, based on what I think are probably
conservative estimates of what various objects must take up in memory, but
you see where this is going. We're already up to 180 meg.
Yep, that size sounds about right (my dictionary values are all None).
I wonder if the
whole problem is that you're just paging yourself to death. A few minutes
watching your system memory performance with ps or top while your program
is running might give you some idea if this is the case.


Definitely not.

Thanks anyway,

Chris

May 16 '06 #13

Aahz wrote:
In article <ro***********************@reader1.panix.com>,
Roy Smith <ro*@panix.com> wrote:
In article <1147699064.107490@teuthos>, Chris Foote <ch***@foote.com.au>
wrote:
I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Are those clock times or CPU times?


And what are these times measuring?


The loading of a file into a dictionary. i.e. no lookup operations.
Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.
Yep, that's why I probably need a dictionary where I can pre-specify
an approximate size at the time of its creation.
Once the dict is constructed, lookup times should be quite good.


Very good indeed with Psyco!

Cheers,
Chris
May 16 '06 #14

On Monday 15 May 2006 at 21:07, Diez B. Roggisch wrote:
d={}.fromkeys(xrange(5*10**6)) ?


That is a totally different beast. You don't want to insert arbitrary
keys, you want the internal hash-array to be of the right size.


But you can replace the xrange part by any generator function you want.

def get_mykeys(...):
    ...
    yield key

I just wonder whether the time-consuming part is the memory allocation of the
hash table (key by key) or the hashing operation itself.

I don't see a use case where a Python programmer would need a
dictionary that "will probably be big" but can't predict what keys will be
in it.

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
May 16 '06 #15

Chris Foote wrote:
Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.


Yep, that's why I probably need a dictionary where I can pre-specify
an approximate size at the time of its creation.


Try splitting the creation of the keys from the creation of the dictionary
and you'll see where the bottleneck really is.

On my system:
r = range(5000000)
d = dict.fromkeys(r)
takes less than 1 second to create the dictionary. So creating a dictionary
with 5 million elements is not in itself a terribly expensive operation
(and yes, even with fromkeys there is no attempt to work out in advance
what size the dictionary should be).

Trying to simulate your data:
scale = 2**32
data = [ (i*scale,i) for i in range(5000000) ]
d = dict.fromkeys(data)
the assignment to data took 42 seconds. Creating the dictionary took 6
seconds.

Trying the variation someone else suggested:
scale = 2**32
data = [ i*scale+i for i in range(5000000) ]
d = dict.fromkeys(data)
creating data took under 6 seconds and about 1 second for d.

so it looks like the issue is not creating a large dictionary, nor lots of
integers, but creating lots of tuples is expensive.

Oh, and if anyone tells you that the intermediate list I created in the
tests above is going to make it slow and you should use iterators instead:
r = range(5000000)
scale = 2**32
s = [ i*scale for i in r ]
from itertools import izip
d = dict.fromkeys(izip(s,r))
The first few lines took a couple of seconds, the last one 1 minute 50
seconds. The alternative version:
scale = 2**32
r = range(5000000)
d = dict.fromkeys((i*scale,i) for i in r)


takes a similar time.
May 16 '06 #16

Claudio Grondi wrote:
Chris Foote wrote:
p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.


However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created by
passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk may be
created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which
performance for my purposes is unsatisfactory:

# keys dictionary metakit bsddb (all using psyco)
------ ---------- ------- -----
1M 8.8s 22.2s 20m25s[3]
2M 24.0s 43.7s N/A
5M 115.3s 105.4s N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in string
form "long_integer:integer" since bsddb doesn't support keys that aren't
integers or strings.
May 16 '06 #17

Paul McGuire wrote:
"Claudio Grondi" <cl************@freenet.de> wrote in message
news:e4**********@newsreader3.netcologne.de...
Chris Foote wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.


sqlite also supports an in-memory database - use pysqlite
(http://initd.org/tracker/pysqlite/wiki) to access this from Python.


Hi Paul.

I tried that, but the overhead of parsing SQL queries is too high:

dictionary metakit sqlite[1]
---------- ------- ---------
1M numbers 8.8s 22.2s 89.6s
2M numbers 24.0s 43.7s 190.0s
5M numbers 115.3s 105.4s N/A

Thanks for the suggestion, but no go.

Cheers,
Chris

[1] pysqlite V1 & sqlite V3.
May 16 '06 #18

lcaamano wrote:
Sounds like PyTables could be useful.

http://www.pytables.org


In browsing their excellent documentation, it seems that all concepts
are built around storing and reading HDF5 format files.

Not suitable for this project unfortunately.

Cheers,
Chris

May 16 '06 #19

> (my dictionary values are all None).

So in fact all you need is a set. Have you experimented with the Python
2.5 alpha?

May 16 '06 #20

Chris Foote wrote:
lcaamano wrote:
Sounds like PyTables could be useful.

http://www.pytables.org


In browsing their excellent documentation, it seems that all concepts
are built around storing and reading HDF5 format files.

Not suitable for this project unfortunately.

Cheers,
Chris

How many distinct values do the long and the int take across your pairs?
It might be possible to use a dictionary of dictionaries:

d = {}
for long_val, int_val in list_vals:
    d.setdefault(long_val, {})[int_val] = None
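A membership test against that nested layout could then look like this (a sketch; the sample pairs and the `contains` helper are invented for illustration):

```python
# Sample (long, int) pairs, invented for illustration.
list_vals = [(88_612_345_678, 0), (88_612_345_678, 2), (61_412_345_678, 1)]

# Build the nested dict exactly as suggested above.
d = {}
for long_val, int_val in list_vals:
    d.setdefault(long_val, {})[int_val] = None

def contains(d, long_val, int_val):
    # A miss on the outer key returns early without creating an inner dict.
    inner = d.get(long_val)
    return inner is not None and int_val in inner

print(contains(d, 88_612_345_678, 2))  # True
print(contains(d, 88_612_345_678, 9))  # False
```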
May 16 '06 #21

John Machin wrote:
(my dictionary values are all None).


So in fact all you need is a set. Have you experimented with the Python
2.5 alpha?

I don't think that will help him: my timings work out about the same on
2.4.2 or 2.5a2. As I pointed out, the bottleneck is creating the tuples.
Creating either a dictionary or a set from 5 million tuples takes a similar
amount of time (about 6 seconds on my system).

BTW, a string key is quite a bit faster than the tuples, even though you
end up with as many temporary tuples during the process:
scale = 2**32
data = [ "%d:%d" % (i*scale,i) for i in range(5000000) ]
d = set(data)


takes about 18 seconds to create the data list and 5 for the set.
May 16 '06 #22

John Machin wrote:
1. Why is two minutes to insert 5M keys "bad" for you?
What would be "good"?
Hi John.

I'd ideally like insertion to be O(1) per key. If behaviour were
linear, then inserting 5M keys would take the same time as
5 x 1M keys, i.e. 44.0s. I see more than double that:

# keys dictionary metakit
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Maybe the dict resizing is the cause, in which case a dict
type which allows an approximate size to be pre-specified
would probably negate the need for resizing.
What would be good/bad look-up times? Have you measured the
typical look-up time?
The lookup time from all in-memory databases I've tried
is good (I need >15,000 lookups per second).
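That budget is easy to sanity-check against a plain dict (a sketch using `timeit`; the key shapes are invented and the rate depends entirely on the machine):

```python
import timeit

# One million synthetic (long, int) tuple keys, values all None.
d = dict.fromkeys((10_000_000_000 + i, i % 10) for i in range(1_000_000))

hit = (10_000_000_000, 0)   # present
miss = (1, 1)               # guaranteed absent, like most real lookups

n = 100_000
t = timeit.timeit(lambda: (hit in d, miss in d), number=n)
print("%.0f lookups/sec" % (2 * n / t))
```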
How often is the dict creation required to be
done?
At least once a day upon the process receiving a signal.
I was planning on using a separate thread to load a
temporary dictionary, and swap the old one for the new
one upon completion; that might be good in theory, but
I need a predictable load time over multiple machines,
which is what prompted me to ask about alternate hash
table libraries that are more suitable for large datasets.
How often does the data change?
Almost never, except during a full reload perhaps once
an hour through to once a day.
Is multi-user access required for
(a) look-up (b) updating?
Lookup - yes.
Updating - no.

Have you considered loading the dict from a pickle?
The pickling process consumes too much RAM to
save even a dict containing 1M keys (> 1 GB).
2. Assuming your code that is creating the dict looks in essence like
this:
adict = {}
for k, v in some_iterable:
adict[k] = v
I'm using tuples as keys, with None for the data - it's roughly
like this:

class NumberDB(object):
    def __init__(self):
        self.numbers = {}
    def add(self, number, wildcard_digits):
        self.numbers[(number, wildcard_digits)] = None

where add() is called after converting a line of text from a file
into integers to provide as arguments. i.e. keys are tuple of
form (long_integer, integer).
then any non-linear behaviour can only be in the actual CPython
insertion code. Psyco can't help you there. Psyco *may* help with the
linear part, *if* you have enough memory. What are the corresponding
times without Psyco?
Almost double, because I'm populating the dictionary from
a file containing strings which I have to convert.
In any case, if your code isn't (conceptually)
that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco -> Intel x86, but what
chip exactly? How much free memory?
My devel machine, used for this testing is:

Python 2.4.2 (#1, Feb 12 2006, 03:59:46)
[GCC 4.1.0 20060210 (Red Hat 4.1.0-0.24)] on linux2
Fedora Core 5 i686
CPU: AMD Athlon(tm) 64 Processor 2800+
1G RAM.
4. Consider printing time-so-far results, say every 100K keys. Multiple
step-ups might indicate dict resizings. A dog-leg probably means
running out of memory.
see attached GIF for this. The X axis represents 100k increments
and the Y axis the wall clock duration from startup.
Why "roughly" 5M keys???
Sorry, that was just a guess :-)
5. How large are your long_integers?
They're phone numbers (tests samples are 11 digits).
6. What is the nature of the value associated with each key?
Not used, so they're set to None.
7. Have you experimented with key = a * 2 ** 32 + b instead of key =
(a, b)?


No I haven't. I'll give that a try and report back, especially after
Duncan's posts suspecting tuple overhead.

Cheers,
Chris

May 16 '06 #23

Chris Foote wrote:
Claudio Grondi wrote:
Chris Foote wrote:
p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in
the standard Python 2.4 distribution.

However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created by
passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk
may be created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which
performance for my purposes is unsatisfactory:

# keys dictionary metakit bsddb (all using psyco)
------ ---------- ------- -----
1M 8.8s 22.2s 20m25s[3]
2M 24.0s 43.7s N/A
5M 115.3s 105.4s N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in string
form "long_integer:integer" since bsddb doesn't support keys that aren't
integers or strings.

I have to admit that I haven't written any code of my own to actually test
this, but if 20m25s for storing a single MByte of strings in a
database table index column is really what you are getting, I can't shake
the feeling that there is something fundamentally wrong with the way
you are doing it.

Posting the code for your test cases appears to me to be the only way
to see the reason for the mystery you are getting here (it would
also clarify the other mysterious things raised by the posters in this
thread so far).

Claudio
May 16 '06 #24

BTW why are python dicts implemented as hash tables and not judy arrays?

May 16 '06 #25

ja****@gmail.com wrote:
BTW why are python dicts implemented as hash tables and not judy
arrays?
Probably because:

Python predates judy arrays.
Nobody has proposed replacing the current dict implementation.
Nobody has demonstrated that it would actually be an advantage to
replace the current implementation.
Or maybe Python dicts are just more flexible?

You choose.

A quick scan of the Judy IV Shop Manual reveals:
Judy is a dessert topping and a floor wax, but it’s not intended for
use as a house paint. .... Judy is a programming library that provides a relatively simple
interface (API) for array-like storage of word or string indexes with
optional mapping of indexes to single-word values.
It doesn't sound to me as though they match the flexibility of Python
dictionaries. Without reading further through the documentation, the
implication is that keys are restricted to simple word or string values.
Remember, all that Python requires of a dict key is that it has a hash
value and can be compared for equal/not equal with other keys. It
doesn't require the key to be expressible as a sequence of bits which is
what the Judy manual seems to imply:
You can think of digital trees as peeling (decoding) leading bits off
a key until only one bit is left, but in the case of an unbounded
variable-size key there is no definite “bottom” (that is, a definite
last bit or maximum length for every key). However, there are always
unpopulated subexpanses, except with a fixed-size key where every
possible key value is stored in the data structure. When decoding keys
top-down in this way, each (sub)expanse is defined by the bits already
decoded and the number of bits remaining (if finite).

May 16 '06 #26

Duncan Booth wrote:
BTW why are python dicts implemented as hash tables and not judy
arrays?
Probably because:

Python predates judy arrays.
Nobody has proposed replacing the current dict implementation.
Nobody has demonstrated that it would actually be an advantage to
replace the current implementation.
Or maybe Python dicts are just more flexible?


Also, see http://www.dalkescientific.com/Python/PyJudy.html with the
conclusions drawn on performance:
The first conclusion is that Python dictionaries are wicked fast.
There are few cases where PyJudy is faster, though perhaps there might
be a few more if I knew more about the Python extension API. Bear in
mind though that the Judy arrays are in sorted order and the JudyL*
classes have ways to access elements by order.


So for a specialised use Judy arrays could be a good idea, but not as a
general replacement for Python dicts.
May 16 '06 #27

Maric Michaud <ma***@aristote.info> wrote:
I don't see a use case where a Python programmer would need a
dictionary that "will probably be big" but can't predict what keys will be
in it.


I've recently been working on an app with precisely this
characteristic, although there were enough other bottlenecks that the
cost of growing the dict as needed was irrelevant. (Read ~2m items
from csv file, compare them with items in a database and populate
assorted dictionaries based on the result of that comparison for
subsequent processing. You can make a reasonable guess, and have a
definite upper bound, on the sizes of the dictionaries cheaply right
at the start, but you don't know what the keys in each are going to be
until you discover them through the comparison of file and DB.)

--
\S -- si***@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
___ | "Frankly I have no feelings towards penguins one way or the other"
\X/ | -- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
May 16 '06 #28

Claudio Grondi wrote:
Chris Foote wrote:
However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created
by passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk
may be created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which
performance for my purposes is unsatisfactory:

# keys dictionary metakit bsddb (all using psyco)
------ ---------- ------- -----
1M 8.8s 22.2s 20m25s[3]
2M 24.0s 43.7s N/A
5M 115.3s 105.4s N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in
string form "long_integer:integer" since bsddb doesn't support keys
that aren't integers or strings.
I have to admit that I haven't written any code of my own to actually test
this, but if 20m25s for storing a single MByte of strings in a
database table index column is really what you are getting, I can't shake
the feeling that there is something fundamentally wrong with the way
you are doing it.


Hi Claudio.

1M is one million, referring to the number of insertions of keys;
not a Megabyte. I'm sorry that you took it that way :-(

Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.
Posting the code for your test cases appears to me to be the only option
to see what is the reason for the mystery you are getting here (this
will clarify also the other mysterious things considered by the posters
to this thread up to now).


I agree that posting some test code would have proved useful, but
the code is big and has too many interdependencies on external
things (e.g. databases, threads & pyro RPC calls) to allow me
to separate out examples easily. But if you go back to my
original posting, I think my question was quite clear.

Best regards,
Chris

May 16 '06 #29

P: n/a
Chris Foote wrote:
Claudio Grondi wrote:
Chris Foote wrote:
However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created
by passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk
may be created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which
performance for my purposes is unsatisfactory:

# keys dictionary metakit bsddb (all using psyco)
------ ---------- ------- -----
1M 8.8s 22.2s 20m25s[3]
2M 24.0s 43.7s N/A
5M 115.3s 105.4s N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in
string form "long_integer:integer" since bsddb doesn't support keys
that aren't integers or strings.

>

I have to admit, that I haven't wrote any own code to actually test
this, but if 20m25s for storing of a single MByte of strings in a
database table index column is really what you are getting, I can't
get rid of the feeling, that there is something elementary wrong with
your way doing it.

Hi Claudio.

1M is one million, referring to the number of insertions of keys;
not a Megabyte. I'm sorry that you took it that way :-(

Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.
Posting the code for your test cases appears to me to be the only
option to see what is the reason for the mystery you are getting here
(this will clarify also the other mysterious things considered by the
posters to this thread up to now).

I agree that posting some test code would have proved useful, but
the code is big and has too many interdependencies on external
things (e.g. databases, threads & pyro RPC calls) to allow me
to separate out examples easily. But if you go back to my
original posting, I think my question was quite clear.

Best regards,
Chris

I have some code demonstrating the mystery from my point of view
(including timings which differ from what you experience in the order of
a magnitude on my 2.8 GHz P4):

dctA = {}
import time
strt = time.clock()
lst = range(123456789010000000L, 123456789010000000L + 5000000)
import random
endt = time.clock()
print "lst = range(12345678901L, 12345678901L + 5000000) [s] : ", endt - strt
strt = time.clock()
random.shuffle(lst)
endt = time.clock()
print "random.shuffle(lst) [s] : ", endt - strt
random.shuffle(lst)
counter = 0
for item in lst:
    if (counter == 0):
        print "Listing of some of lst items:"
    if (counter > 4):
        print ":END of listing of lst items"
        break
    else:
        print "item no. %i "%(counter,), repr(item)
    counter += 1
#:for
# raw_input('continue with ENTER) #>')
strt = time.clock()
dctA.fromkeys(lst, None)
endt = time.clock()
print "dctA.fromkeys(lst, None) [s] : ", endt - strt
# raw_input('continue with ENTER) #>')
strt = time.clock()
strLst = []
for item in lst: strLst.append(str(item))
dctA = {}
endt = time.clock()
print "for item in lst: strLst.append(str(item)) [s] : ", endt - strt
# raw_input('continue with ENTER) #>')
strt = time.clock()
dctA = dctA.fromkeys(strLst, None)
endt = time.clock()
print "dctA.fromkeys(strLst, None) [s] : ", endt - strt
# raw_input('continue with ENTER) #>')
print "len(dctA) : %i " % (len(dctA),)
counter = 0
for key in dctA.keys():
    if (counter == 0):
        print "Listing of some of dctA items:"
    if (counter > 4):
        print ":END of listing of dctA items"
        break
    else:
        print "key no. %i "%(counter,), repr(key)
    counter += 1
#:for
raw_input('exit with ENTER) #>')

# Gives as output :
#
# lst = range(12345678901L, 12345678901L + 5000000) [s] : 0.81327347470
# random.shuffle(lst) [s] : 8.06178829991
# Listing of some of lst items:
# item no. 0 123456789011010733L
# item no. 1 123456789013585761L
# item no. 2 123456789013610266L
# item no. 3 123456789011311029L
# item no. 4 123456789010968853L
# :END of listing of lst items
# dctA.fromkeys(lst, None) [s] : 3.11773256098
# for item in lst: strLst.append(str(item)) [s] : 11.5454232312
# dctA.fromkeys(strLst, None) [s] : 3.98027849908
# len(dctA) : 5000000
# Listing of some of dctA items:
# key no. 0 '123456789014515319'
# key no. 1 '123456789014116699'
# key no. 2 '123456789014116698'
# key no. 3 '123456789010800915'
# key no. 4 '123456789014116696'
# :END of listing of dctA items

So the insertion into the dictionary is according to you:
# keys dictionary
5M 115.3s
according to me:
5M 3.9s

According to the task manager I need about 0.5 GByte of memory for this
(including the lists I keep alive).

I conclude from that, that your results with the Berkeley approach are
also a case of doing things in an unfortunate way (in Python and with
the DB), and that using Python dictionaries is the simplest way to go
here (I suppose your main problem is that you are not using .fromkeys()
on a list of your input data).
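For what it's worth, the fromkeys() difference is easy to measure; a minimal sketch (timings are machine-dependent; dict.fromkeys builds the whole dictionary in one C-level call instead of one interpreted loop iteration per key):

```python
import time

keys = list(range(1000000))

t0 = time.time()
d1 = dict.fromkeys(keys, None)      # one C-level call for all keys
t_fromkeys = time.time() - t0

t0 = time.time()
d2 = {}
for k in keys:                      # interpreted loop, same resulting dict
    d2[k] = None
t_loop = time.time() - t0

print("fromkeys %.2fs, loop %.2fs" % (t_fromkeys, t_loop))
```

Both build the identical dictionary; only the per-key bytecode overhead differs.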

Claudio
May 16 '06 #30

P: n/a
>22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

With bdb's it is crucial to insert keys in bytestring-sorted order.
Also, be sure to give it a decent amount of cache.

-Mike

May 16 '06 #31

P: n/a
Klaas wrote:
22.2s 20m25s[3]
20m to insert 1m keys? You are doing something wrong.


Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

test code is attached.
With bdb's it is crucial to insert keys in bytestring-sorted order.
For the bsddb test, I'm using a plain string. (The module docs list a
string being the only datatype supported for both keys & values).
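A round-trip sketch for that string key form (the helper names here are mine, not part of the posted test code):

```python
def encode_key(num, wildcard_digits):
    # flatten the (long_integer, integer) tuple into the 'num:wd' string form
    return '%d:%d' % (num, wildcard_digits)

def decode_key(s):
    # recover the original tuple from the string key
    num, wd = s.split(':')
    return (int(num), int(wd))

key = encode_key(61880123456, 3)
print(key)                  # -> '61880123456:3'
print(decode_key(key))      # -> (61880123456, 3)
```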
Also, be sure to give it a decent amount of cache.


The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

...
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize: method not permitted when environment specified')

I'll file a bug report on this if it isn't already fixed.

Cheers,
Chris

May 17 '06 #32

P: n/a
Chris Foote wrote:
Klaas wrote:
22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration
104.3s
Surprisingly, the dictionary population takes the same time here, but
BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2; what about you?
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration
665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

As I don't have SQLite installed, it is interesting to see if the factor
10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose that SQLite first adds all the
records and builds the index afterwards, once all the records are there
(with db.commit()).
Can the same be done in BerkeleyDB, or does BerkeleyDB not support
inserting records without updating the index on each single insert
command? If yes, how?

Claudio
test code is attached.
With bdb's it is crucial to insert keys in bytestring-sorted order.

For the bsddb test, I'm using a plain string. (The module docs list a
string being the only datatype supported for both keys & values).
Also, be sure to give it a decent amount of cache.

The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

...
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize: method not permitted when environment specified')

I'll file a bug report on this if it isn't already fixed.

Cheers,
Chris
------------------------------------------------------------------------

import bsddb
import random
import time
import sqlite

import psyco
psyco.full()

class WallClockTimer(object):
    '''Simple timer for measuring tests.'''
    def __init__(self, msg):
        self.start = time.time()
        self.msg = msg
        print time.ctime(self.start), self.msg, 'started'

    def stop(self):
        end = time.time()
        duration = end - self.start
        print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration

class NumberGen(object):
    '''Phone number generator for testing different methods of storage.'''
    def __init__(self, range_start, range_end, n, max_wildcards):

        print '''Number generator test for %d number ranges
with a maximum of %d wildcard digits.''' % (n, max_wildcards)

        wildcards = range(0, max_wildcards + 1)
        # generate n unique numbers and store as keys in number_hash
        timer = WallClockTimer('dictionary population')
        self.number_hash = {}

        for i in xrange(0, n):
            unique = False
            while not unique:
                wildcard_digits = self.gen_wildcard_digits(wildcards)
                num = self.gen_number(range_start, range_end)
                if wildcard_digits > 0:
                    num /= (10 ** wildcard_digits)
                key = (num, wildcard_digits)
                if self.number_hash.has_key(key):
                    unique = False
                else:
                    unique = True
                    self.number_hash[key] = None
        timer.stop()

    def gen_wildcard_digits(self, wildcards):
        return random.choice(wildcards)

    def gen_number(self, start, end):
        return int(random.uniform(start, end))

    def storage_test(self, StorageTestClass):
        test = StorageTestClass(self.number_hash)

class StorageTest(object):
    '''base class for testing storage.  Derive a test
    class and provide your own runtest() method.'''
    def __init__(self, number_hash):
        timer = WallClockTimer('%s population' % type(self).__name__)
        self.runtest(number_hash)
        timer.stop()

class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

class StorageSQLite(StorageTest):
    def runtest(self, number_hash):
        db = sqlite.connect(':memory:')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
        q = """insert into numbers (num, wd) values (%d, %d)"""
        for (num, wildcard_digits) in number_hash.keys():
            cursor.execute(q, num, wildcard_digits)
        db.commit()
        db.close()

if __name__ == '__main__':

    test_vals = {'range_start'   : 61880 * 10 ** 7,
                 'range_end'     : 61891 * 10 ** 7,
                 'n'             : 1 * 10 ** 4,
                 'max_wildcards' : 3}

    ngen = NumberGen(test_vals['range_start'], test_vals['range_end'],
                     test_vals['n'], test_vals['max_wildcards'])

    ngen.storage_test(StorageBerkeleyDB)
    ngen.storage_test(StorageSQLite)

May 17 '06 #33

P: n/a
Richie Hindle wrote:
[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?


PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."


Thanks for the suggestion Richie. PyJudy works brilliantly :-)

Cheers,
Chris
May 18 '06 #34

P: n/a
Claudio Grondi wrote:
Chris Foote wrote:
Klaas wrote:
22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.


I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration
104.3s

Surprising here, that the dictionary population gives the same time, but
the BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2, and you?


Fedora core 5 with ext3 filesystem. The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped,
duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

As I don't have SQLite installed, it is interesting to see if the factor
10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose, that SQLite first adds all the
records and builds the index afterwards with all the records there (with
db.commit()).


SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.
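With the sqlite3 module that later became part of the standard library (the attached test code uses the older pysqlite `sqlite` API), the same in-memory approach looks like this; executemany() also cuts the per-row Python overhead. Table name and sample data here just mirror the test code:

```python
import sqlite3

db = sqlite3.connect(':memory:')   # RAM only, no disk file involved
db.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')

# made-up sample rows in the same (num, wildcard_digits) shape
rows = [(618800000000 + i, i % 4) for i in range(100000)]
db.executemany('INSERT INTO numbers (num, wd) VALUES (?, ?)', rows)
db.commit()

count = db.execute('SELECT count(*) FROM numbers').fetchone()[0]
print(count)  # -> 100000
```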

Cheers,
Chris
May 18 '06 #35

P: n/a
Chris Foote wrote:
Claudio Grondi wrote:
Chris Foote wrote:
Klaas wrote:

> 22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:
I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped,
duration 104.3s
>
Surprising here, that the dictionary population gives the same time,
but the BerkeleyDB inserts the records 6 times faster on my computer
than on yours. I am running Python 2.4.2 on Windows XP SP2, and you?


Fedora core 5 with ext3 filesystem. The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).

Ok, according to the Windows task manager the Python process
reads/writes around 7 GByte(!) of data to the file system during the run
of the BerkeleyDB test and the hard drive is continuously busy, whereas
the size of the file I found in the Temp directory is always below 20
MByte. The hard drive access is probably the main reason for losing
time - here a question to BerkeleyDB experts:

Can BerkeleyDB, via the Python bsddb3 interface, be tuned to use only
RAM? Or, since BerkeleyDB can scale to much larger amounts of data, does
it make little sense to tweak it into RAM?

Chris, is maybe a RAM-disk the right way to go here to save time lost
for accessing the file stored in the file system on the hard drive?

The RAM requirements, according to Windows XP task manager, are below
100 MByte. I am using the NTFS file system (yes, I know, that FAT is in
some configurations faster than NTFS) and XP Professional SP2 without
any tuning of file system caching. The CPU is 100% busy.

What CPU and RAM (SIMM, DDR, DDR2) do you have? I have 2 GByte of fast
DDR PC400/3200 dual-channel RAM. It seems that you are still not getting
results within the range others experience running your code, so I
suppose it has something to do with the hardware you are using.
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped,
duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

As I don't have SQLite installed, it is interesting to see if the
factor 10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose, that SQLite first adds all the
records and builds the index afterwards with all the records there
(with db.commit()).


SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.

One of the reasons I have kept an eye on BerkeleyDB is that it claims to
scale to a huge amount (terabytes) of data, doesn't need as much RAM as
a Python dictionary, and makes it unnecessary to save/load a pickled
version of the data (i.e. here the dictionary) from/to RAM in order to
work with it.
I guess that in your case BerkeleyDB is, for the named reasons, probably
the right way to go, unless your data will stay small and the Python
dictionary holding it will always fit into RAM.

Now I am curious to know which path you have decided to go and why?

Claudio
Cheers,
Chris


May 18 '06 #36

P: n/a
Chris Foote wrote:
Richie Hindle wrote:
[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?


PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."


Thanks for the suggestion Richie. PyJudy works brilliantly :-)

Cheers,
Chris

It seems that there is no Microsoft Windows version of Judy available,
so PyJudy won't work on Windows - am I right?

Claudio
May 18 '06 #37

P: n/a
Chris:
Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.


This is absolutely false.

-Mike

May 24 '06 #38

P: n/a
Chris:
class StorageBerkeleyDB(StorageTest):
def runtest(self, number_hash):
db = bsddb.hashopen(None, flag='c', cachesize=8192)
for (num, wildcard_digits) in number_hash.keys():
key = '%d:%d' % (num, wildcard_digits)
db[key] = None
db.close()


BDBs can accomplish what you're looking to do, but they need to be
tuned carefully. I won't get into too many details here, but you have
a few fatal flaws in that code.

1. 8Kb of cache is _pathetic_. Give it a few hundred megs. This is by
far your biggest problem.
2. Use BTREE unless you have a good reason to use DBHASH
3. Use proper bdb env creation instead of the hash_open apis.
4. Insert your keys in sorted order.

-Mike

May 24 '06 #39

P: n/a
Klaas schrieb:
4. Insert your keys in sorted order.

This advice is questionable -
whether inserting pre-sorted values is better depends at least
on the db vendor and probably sometimes on the sort method.

My gut feeling on this matter is:
IF the insert times of pre-sorted values is far better
than the times of unsorted values, there is a chance
that the resulting tree is unbalanced: only 1 compare
operation after insert and no re-balancing of the tree.

re-indexing will probably give you far better access times
in this case. Another option is to drop non RI indexes used only
for query optimization and recreate them after batch insert.

my 0.02 EUR

thomas
May 29 '06 #40

P: n/a
Thomas Ganss wrote:
Klaas schrieb:
4. Insert your keys in sorted order.

This advice is questionable -....

My gut feeling on this matter is:
IF the insert times of pre-sorted values is far better
than the times of unsorted values, there is a chance
that the resulting tree is unbalanced: only 1 compare
operation after insert and no re-balancing of the tree.

re-indexing will probably give you far better access times
in this case. Another option is to drop non RI indexes used only
for query optimization and recreate them after batch insert.


Don't use your gut for such issues. Pre-sorted data is such
a common special case (in fact, the only easily describable
special case) that many systems handle this specially. For
example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

If you are using your gut without testing, go ahead and
presort. In any case, reading documents and testing beats
gut feels every time.
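A quick way to test rather than trust one's gut (absolute timings are machine-dependent, but the pre-sorted case is reliably faster because timsort detects existing runs):

```python
import random
import time

data = list(range(500000))
shuffled = data[:]
random.shuffle(shuffled)

t0 = time.time(); sorted(data);     t_sorted = time.time() - t0
t0 = time.time(); sorted(shuffled); t_random = time.time() - t0

print("pre-sorted: %.3fs  random: %.3fs" % (t_sorted, t_random))
# pre-sorted input needs O(N) compares; random input needs O(N log N)
```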

--Scott David Daniels
sc***********@acm.org
May 29 '06 #41

P: n/a
Roy Smith wrote:
My guess would be that each resize grows the table by a constant
factor, which IIRC, works out to amortized O(n).
Right. For <50000 entries, the size is multiplied by 4 each time,
and doubled each time for large dictionaries.
It's not as good as
creating the dict the right size in the first place, but it's really
not that bad. Somewhat more troubling is that it can lead to memory
fragmentation problems.
Depends on your operating system, of course. You get over a page size
fairly quickly, and then many systems use anonymous mappings, where
a range of pages gets mapped from swap on allocation, and unmapped
on deallocation. So while this can cause address space fragmentation,
it doesn't cause memory fragmentation (i.e. memory that is allocated
by the process but can't be used).
I don't understand why Python doesn't have a way to give a size hint
when creating a dict. It seems so obvious to be able to say "d = dict
(size=10*1000*1000)" if you know beforehand that's how many keys
you're going to add.


There is no need for it. If dicts don't work well in some cases, these
cases should be investigated and fixed, instead of working around the
problem by loading the problem onto the developer.

As you said, it *should* have linear time, with the number of
insertions. If it doesn't (and it appears not to), that may be due to
hash collisions, or due to the time for computing hashes not being
constant.
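If insertion really is amortized O(1) per key, the per-key cost should stay roughly flat as the dictionary grows, resizes included; a quick probe (the absolute figures are machine-dependent):

```python
import time

def per_key_cost(n):
    # time n insertions into a fresh dict and return the cost per key
    d = {}
    t0 = time.time()
    for i in range(n):
        d[i] = None
    return (time.time() - t0) / n

for n in (10**5, 10**6):
    print("%7d keys: %.0f ns/key" % (n, per_key_cost(n) * 1e9))
# the two figures should be of the same order of magnitude,
# since growth by a constant factor amortizes the resize cost away
```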

Regards,
Martin
May 30 '06 #42

P: n/a
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?
Jun 5 '06 #43

P: n/a
Lawrence D'Oliveiro wrote:
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:

For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


Isn't that just your definition of "reasonable"?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Love me, love my blog http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Jun 5 '06 #44

P: n/a
In article <ld***********************@lust.ihug.co.nz>,
Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> wrote:
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:

For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


Read some of the old discussions in the python-dev archives.
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

"I saw `cout' being shifted "Hello world" times to the left and stopped
right there." --Steve Gonedes
Jun 5 '06 #45

P: n/a

Lawrence D'Oliveiro wrote:
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


An already sorted list can be pathological for Quicksort, depending on
how you code it.
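For example, the textbook first-element-pivot formulation partitions maximally unevenly on already-sorted input, so the recursion depth grows to ~N (a deliberately naive sketch, not how any production sort is coded):

```python
import sys

def qsort(a):
    # naive quicksort with the first element as pivot
    if len(a) <= 1:
        return a
    pivot = a[0]
    less    = [x for x in a[1:] if x <  pivot]
    greater = [x for x in a[1:] if x >= pivot]
    return qsort(less) + [pivot] + qsort(greater)

sys.setrecursionlimit(20000)   # sorted input of length n recurses ~n deep
n = 2000
print(qsort(list(range(n))) == list(range(n)))  # -> True, but depth ~n
# random input keeps the expected depth at O(log n) instead
```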

Iain

Jun 5 '06 #46

P: n/a
[Scott David Daniels]
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro] But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)? O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Python's current sorting algorithm exploits pre-existing order of many
kinds. For example, take a sorted list, "cut it in half, and riffle
shuffle the two halves together"; e.g.,

[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

That turns out to be a supernaturally good case too.
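The riffle pattern above can be generated like so (the helper name is mine):

```python
def riffle(seq):
    # interleave the two halves of a sorted list
    half = len(seq) // 2
    out = []
    for a, b in zip(seq[:half], seq[half:]):
        out.extend((a, b))
    return out

print(riffle(list(range(16))))
# -> [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
```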
Jun 5 '06 #47

P: n/a
In article <ma***************************************@python. org>,
Tim Peters <ti********@gmail.com> wrote:
[Scott David Daniels]
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)? O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.


Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof, whereas some other
sorts, quicksort being a prime example, require great care to
avoid pathological cases.

--
Jim Segrave (je*@jes-2.demon.nl)

Jun 5 '06 #48

P: n/a
[Jim Segrave]
Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof,
Write a heapsort and time it. It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start. Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible). "good" and "bad" are both
O(N log N) here, though.
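The "good"/"bad" starting points are easy to check directly against the max-heap invariant (every element at most its parent at index (i - 1) // 2):

```python
def is_max_heap(a):
    # every element must be <= its parent
    return all(a[(i - 1) // 2] >= a[i] for i in range(1, len(a)))

print(is_max_heap(list(range(9, -1, -1))))  # reverse-sorted -> True (already a max-heap)
print(is_max_heap(list(range(10))))         # sorted ascending -> False
```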

BTW, most people have never heard of Smoothsort, which was Dijkstra's
excruciating attempt to make heapsort exploit pre-existing order:

http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796.PDF

That's one of a hundred algorithms I rejected for Python ;-)
whereas some other sorts, quicksort being a prime example, require
great care to avoid pathological cases.


Indeed, Python gave up on quicksort for that reason. While the
samplesort hybrid that came between quicksort and the current sort
also had theoretical O(N**2) worst-case behavior, it was exceedingly
difficult to create such an input, and (unlike as for every quicksort
variant Python tried) no real-life case of undue sloth was ever
reported against it.
Jun 5 '06 #49

P: n/a
In article <ma***************************************@python. org>,
Tim Peters <ti********@gmail.com> wrote:
[Jim Segrave]
Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof,


Write a heapsort and time it. It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start. Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible). "good" and "bad" are both
O(N log N) here, though.


I did implement a crude heapsort before posting this and measured the
number of comparisons and the number of swaps.

==================================================

#!/usr/local/bin/python

import random

class HeapSorter(object):
    def __init__(self, seq):
        self.compares = 0
        self.swaps = 0
        self.length = len(seq)
        self.seq = seq

    def __sift(self, i):
        while True:
            j = 2 * i + 1
            if j + 1 < self.length:
                self.compares += 1
                if self.seq[j + 1] > self.seq[j]:
                    j = j + 1
            if j >= self.length:
                return

            self.compares += 1
            if self.seq[i] > self.seq[j]:
                return

            self.swaps += 1
            self.seq[i], self.seq[j] = (self.seq[j], self.seq[i])
            i = j

    def __makeheap(self):
        for i in range(self.length / 2, -1, -1):
            self.__sift(i)

    def sort(self):
        self.__makeheap()
        for i in range(self.length - 1, -1, -1):
            self.swaps += 1
            self.seq[i], self.seq[0] = (self.seq[0], self.seq[i])
            self.length -= 1
            self.__sift(0)

if __name__ == '__main__':

    s = list(range(100000))
    random.shuffle(s)

    heap = HeapSorter(s)
    heap.sort()
    print "Random list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)
    heap = HeapSorter(s)
    heap.sort()
    print "Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

    s.reverse()
    heap = HeapSorter(s)
    heap.sort()
    print "Reverse Sorted list: compares %d, swaps %d" % \
          (heap.compares, heap.swaps)

==================================================

Which gave the following results:

/usr/home/jes% python ./heapsort
Random list: compares 3019969, swaps 1575263
Sorted list: compares 3112517, swaps 1650855
Reverse Sorted list: compares 2926640, swaps 1497435

Assuming compares and swaps dominate other bookkeeping, heapsort is
quite stable in its performance, although the constant in the O(N log N)
is much larger than for other sorts.

--
Jim Segrave (je*@jes-2.demon.nl)

Jun 5 '06 #50

57 Replies
