
Large Dictionaries

Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard Python
dictionary works well for small numbers of keys, but starts to
perform badly for me when inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M        8.8s          22.2s
2M        24.0s         43.7s
5M        115.3s        105.4s

Has anyone written a fast hash module which is better suited to
large datasets?
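
For context, here is a minimal sketch of the kind of insertion loop being
timed above (the data source and helper name are illustrative, not the
actual benchmark):

import time

def build_dict(pairs):
    # Insert (long_integer, integer) tuple keys one at a time;
    # only key membership matters here, so the value is just None.
    d = {}
    start = time.time()
    for a, b in pairs:
        d[(a, b)] = None
    print "inserted %d keys in %.1fs" % (len(d), time.time() - start)
    return d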

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris
May 15 '06
1. Why is two minutes to insert 5M keys "bad" for you? What would be
"good"? What would be good/bad look-up times? Have you measured the
typical look-up time? How often is the dict creation required to be
done? How often does the data change? Is multi-user access required for
(a) look-up (b) updating? Have you considered loading the dict from a
pickle?
2. Assuming your code that is creating the dict looks in essence like
this:
adict = {}
for k, v in some_iterable:
    adict[k] = v
then any non-linear behaviour can only be in the actual CPython
insertion code. Psyco can't help you there. Psyco *may* help with the
linear part, *if* you have enough memory. What are the corresponding
times without Psyco? In any case, if your code isn't (conceptually)
that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco -> Intel x86, but what
chip exactly? How much free memory?
4. Consider printing time-so-far results, say every 100K keys. Multiple
step-ups might indicate dict resizings. A dog-leg probably means
running out of memory. Why "roughly" 5M keys???
5. How large are your long_integers?
6. What is the nature of the value associated with each key?
7. Have you experimented with key = a * 2 ** 32 + b instead of key =
(a, b)? (A sketch of this comparison follows below.)
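
For illustration, a rough, self-contained comparison for point 7 (the data
is a stand-in for your real keys, and the timings will vary by machine):

import time

N = 5 * 10**6

start = time.time()
d1 = {}
for i in xrange(N):
    a, b = i * 2, i           # stand-in for (long_integer, integer) data
    d1[(a, b)] = None         # tuple key: one tuple object per entry
print "tuple keys:  %.1fs" % (time.time() - start)

start = time.time()
d2 = {}
for i in xrange(N):
    a, b = i * 2, i
    d2[a * 2**32 + b] = None  # packed key: a single integer per entry
print "packed keys: %.1fs" % (time.time() - start)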

HTH,
John

May 15 '06 #11
Sounds like PyTables could be useful.

http://www.pytables.org

--
lpc

May 16 '06 #12
Roy Smith wrote:
In article <1147699064.107490@teuthos>, Chris Foote <ch***@foote.com.au>
wrote:
I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M        8.8s          22.2s
2M        24.0s         43.7s
5M        115.3s        105.4s
Are those clock times or CPU times?


User + system CPU time.
How much memory is your process using and how much is available on your
machine?
A very large amount of space is consumed, which is why I didn't give
times for the 10M version - it would have eaten my 1G of RAM and
gone into swap :-)
I'm guessing an integer takes 4 bytes and a long integer takes roughly one
byte per two decimal digits, plus a few more bytes to bundle them up into
a tuple. You've probably got something like 20 bytes per key, so 5M of
them is 100 meg just for the keys.

To get reasonable hashing performance, the hash table needs to be maybe
half full, and figure a hash key is 4 bytes, so that's another 40 meg for
the hash table itself.

Plus whatever the values stored in the dictionary take up. Even if you're
not storing any values, there's probably 4 bytes for a null pointer
(or a reference to None), so that's another 40 meg.

These are all vague guesses, based on what I think are probably
conservative estimates of what various objects must take up in memory, but
you see where this is going. We're already up to 180 meg.
Yep, that size sounds about right (my dictionary values are all None).
I wonder if the
whole problem is that you're just paging yourself to death. A few minutes
watching your system memory performance with ps or top while your program
is running might give you some idea if this is the case.


Definitely not.

Thanks anyway,

Chris
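
Roy's back-of-envelope arithmetic above, written out as a sketch (all
per-object sizes are his estimates, not measured values):

n = 5 * 10**6
keys  = n * 20       # ~20 bytes per (long, int) tuple key   -> ~100 meg
table = n * 2 * 4    # table ~half full: 10M 4-byte slots    -> ~40 meg
vals  = n * 8        # per-entry value storage, Roy's figure -> ~40 meg
print "total: ~%d meg" % ((keys + table + vals) / 10**6)     # ~180 meg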

May 16 '06 #13
Aahz wrote:
In article <ro***********************@reader1.panix.com>,
Roy Smith <ro*@panix.com> wrote:
In article <1147699064.107490@teuthos>, Chris Foote <ch***@foote.com.au>
wrote:
I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M        8.8s          22.2s
2M        24.0s         43.7s
5M        115.3s        105.4s

Are those clock times or CPU times?


And what are these times measuring?


The loading of a file into a dictionary, i.e. no lookup operations.
Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.
Yep, that's why I probably need a dictionary where I can pre-specify
an approximate size at the time of its creation.
Once the dict is constructed, lookup times should be quite good.


Very good indeed with Psyco!

Cheers,
Chris
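
A sketch of timing look-ups once the dict is built (psyco.full() is the
standard way to enable Psyco; the hit/miss mix here is illustrative):

import time
# import psyco; psyco.full()   # uncomment to let Psyco compile the loop

d = dict.fromkeys(xrange(5 * 10**6))
start = time.time()
hits = 0
for k in xrange(10**6):
    if (k * 7) in d:           # a mix of hits and misses
        hits += 1
print "1M look-ups: %.2fs (%d hits)" % (time.time() - start, hits)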
May 16 '06 #14
On Monday 15 May 2006 at 21:07, Diez B. Roggisch wrote:
d = {}.fromkeys(xrange(5*10**6)) ?


That is a totally different beast. You don't want to insert arbitrary
keys, you want the internal hash-array to be of the right size.


But you can replace the xrange part by any generator function you want:

def get_mykeys(...):
    ...
    yield key

I just wonder whether the time-consuming part is the memory allocation of
the hash table (key by key) or the hash operation itself.

I don't see a use case where a Python programmer should need a
dictionary "that will probably be big" but can't predict what keys will be
in it.
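
One way to probe that question is to time a bulk build against a key-by-key
build over the same pre-built keys; a rough sketch (timings will vary):

import time

N = 5 * 10**6
keys = range(N)              # build the keys first so only dict work is timed

start = time.time()
d = dict.fromkeys(keys)      # bulk build: the insertion loop runs in C
print "fromkeys: %.1fs" % (time.time() - start)

start = time.time()
d = {}
for k in keys:               # key-by-key build: same hashing work, plus
    d[k] = None              # incremental resizes and bytecode overhead
print "loop:     %.1fs" % (time.time() - start)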

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
May 16 '06 #15
Chris Foote wrote:
Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.


Yep, that's why I probably need a dictionary where I can pre-specify
an approximate size at the time of its creation.


Try splitting the creation of the keys from the creation of the dictionary
and you'll see where the bottleneck really is.

On my system:
r = range(5000000)
d = dict.fromkeys(r)
takes less than 1 second to create the dictionary. So creating a dictionary
with 5 million elements is not in itself a terribly expensive operation
(and yes, even with fromkeys there is no attempt to work out in advance
what size the dictionary should be).

Trying to simulate your data:

scale = 2**32
data = [ (i*scale, i) for i in range(5000000) ]
d = dict.fromkeys(data)
the assignment to data took 42 seconds. Creating the dictionary took 6
seconds.

Trying the variation someone else suggested:

scale = 2**32
data = [ i*scale+i for i in range(5000000) ]
d = dict.fromkeys(data)
creating data took under 6 seconds, and creating d about 1 second.

so it looks like the issue is not creating a large dictionary, nor lots of
integers, but that creating lots of tuples is expensive.

Oh, and if anyone tells you that the intermediate list I created in the
tests above is going to make it slow and you should use iterators instead:
r = range(5000000)
scale = 2**32
s = [ i*scale for i in r ]
from itertools import izip
d = dict.fromkeys(izip(s, r))
The first few lines took a couple of seconds, the last one 1 minute 50
seconds. The alternative version:
scale = 2**32
r = range(5000000)
d = dict.fromkeys((i*scale, i) for i in r)


takes a similar time.
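
The measurements above, gathered into one runnable script (absolute times
will of course differ by machine):

import time
from itertools import izip

N = 5 * 10**6
scale = 2**32

def timed(label, fn):
    start = time.time()
    result = fn()
    print "%-20s %.1fs" % (label, time.time() - start)
    return result

r = timed("build range", lambda: range(N))
timed("fromkeys(ints)", lambda: dict.fromkeys(r))
data = timed("build tuples", lambda: [(i * scale, i) for i in r])
timed("fromkeys(tuples)", lambda: dict.fromkeys(data))
packed = timed("build packed", lambda: [i * scale + i for i in r])
timed("fromkeys(packed)", lambda: dict.fromkeys(packed))
s = timed("build scaled", lambda: [i * scale for i in r])
timed("fromkeys(izip)", lambda: dict.fromkeys(izip(s, r)))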
May 16 '06 #16
Claudio Grondi wrote:
Chris Foote wrote:
p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.


However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created by
passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk may be
created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), and its
performance is unsatisfactory for my purposes:

# keys    dictionary    metakit    bsddb        (all using psyco)
------    ----------    -------    -----
1M        8.8s          22.2s      20m25s[3]
2M        24.0s         43.7s      N/A
5M        115.3s        105.4s     N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in string
form "long_integer:integer", since bsddb doesn't support keys that aren't
integers or strings.
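
A minimal sketch of the approach in footnote [3], passing None as the
filename per the documentation quoted above (the key encoding is
illustrative):

import bsddb

# None as the filename gives a "never preserved on disk" database
# (which, as noted above, is still backed by a temp file, not RAM).
db = bsddb.hashopen(None, 'c')

def put(db, long_int, int_val):
    # bsddb keys must be strings, so encode the pair as "long:int"
    db["%d:%d" % (long_int, int_val)] = ""

put(db, 1234567890123456789, 42)
print db.has_key("1234567890123456789:42")   # membership test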
May 16 '06 #17
Paul McGuire wrote:
"Claudio Grondi" <cl************ @freenet.de> wrote in message
news:e4******** **@newsreader3. netcologne.de.. .
Chris Foote wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M        8.8s          22.2s
2M        24.0s         43.7s
5M        115.3s        105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.


sqlite also supports an in-memory database - use pysqlite
(http://initd.org/tracker/pysqlite/wiki) to access this from Python.
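
For reference, a minimal in-memory pysqlite sketch of this suggestion (the
schema is illustrative, and this uses the pysqlite 2 API):

from pysqlite2 import dbapi2 as sqlite

con = sqlite.connect(":memory:")   # in-memory database, no disk I/O
con.execute("CREATE TABLE keys (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
con.execute("INSERT INTO keys VALUES (?, ?)", (1234567890123456789, 42))
cur = con.execute("SELECT 1 FROM keys WHERE a = ? AND b = ?",
                  (1234567890123456789, 42))
print cur.fetchone() is not None   # True on a hit, False on a miss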


Hi Paul.

I tried that, but the overhead of parsing SQL queries is too high:

             dictionary    metakit    sqlite[1]
             ----------    -------    ---------
1M numbers   8.8s          22.2s      89.6s
2M numbers   24.0s         43.7s      190.0s
5M numbers   115.3s        105.4s     N/A

Thanks for the suggestion, but no go.

Cheers,
Chris

[1] pysqlite V1 & sqlite V3.
May 16 '06 #18
lcaamano wrote:
Sounds like PyTables could be useful.

http://www.pytables.org


From browsing their excellent documentation, it seems that all concepts
are built around storing and reading HDF5-format files.

Not suitable for this project unfortunately.

Cheers,
Chris

May 16 '06 #19
> (my dictionary values are all None).

So in fact all you need is a set. Have you experimented with the Python
2.5 alpha?
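
A sketch of the set-based version (a set stores just the keys, with no
per-entry value slot):

keys = set()
keys.add((1234567890123456789, 42))        # insert a (long, int) pair
print (1234567890123456789, 42) in keys    # membership hit  -> True
print (9, 9) in keys                       # membership miss -> False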

May 16 '06 #20
