473,769 Members | 7,355 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Large Dictionaries

Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris
May 15 '06 #1
57 10845

[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?


PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."

--
Richie Hindle
ri****@entrian. com
May 15 '06 #2
In article <1147699064.107 490@teuthos>, Chris Foote <ch***@foote.co m.au>
wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s


Are those clock times or CPU times?

How much memory is your process using and how much is available on your
machine?

I'm guessing a integer takes 4 bytes and a long integer takes roughly one
byte per two decimal digits. Plus a few more bytes to bundle them up into
a tuple. You've probably got something like 20 bytes per key, so 5M of
them is 100 meg just for the keys.

To get reasonable hashing performance, the hash table needs to be maybe
half full, and figure a hash key is 4 bytes, so that's another 40 meg for
the hash table itself.

Plus whatever the values stored in the dictionary take up. Even if you're
not storing any values (i.e., there's probably 4 bytes for a null pointer
(or reference to None), so that's another 40 meg.

These are all vague guesses, based on what I think are probably
conservative estimates of what various objects must take up in memory, but
you see where this is going. We're already up to 180 meg. I wonder if the
whole problem is that you're just paging yourself to death. A few minutes
watching your system memory performance with ps or top while your program
is running might give you some idea if this is the case.
May 15 '06 #3
Chris Foote wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris

Python Bindings (\Python24\Lib\ bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs \_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.

Claudio
May 15 '06 #4
In article <ro************ ***********@rea der1.panix.com> ,
Roy Smith <ro*@panix.co m> wrote:
In article <1147699064.107 490@teuthos>, Chris Foote <ch***@foote.co m.au>
wrote:

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s


Are those clock times or CPU times?


And what are these times measuring? Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.
Once the dict is constructed, lookup times should be quite good.
--
Aahz (aa**@pythoncra ft.com) <*> http://www.pythoncraft.com/

"I saw `cout' being shifted "Hello world" times to the left and stopped
right there." --Steve Gonedes
May 15 '06 #5
Aahz <aa**@pythoncra ft.com> wrote:
Don't forget that adding keys requires resizing the dict, which is a
moderately expensive operation.


My guess would be that each resize grows the table by a constant
factor, which IIRC, works out to amortized O(n). It's not as good as
creating the dict the right size in the first place, but it's really
not that bad. Somewhat more troubling is that it can lead to memory
fragmentation problems.

I don't understand why Python doesn't have a way to give a size hint
when creating a dict. It seems so obvious to be able to say "d = dict
(size=10*1000*1 000)" if you know beforehand that's how many keys
you're going to add.

Looking at the docs, I'm astounded to discover that you can indeed do
exactly that, but you get {size:10000000} . I am befuddled as to why
people thought creating a dict by keyword would be a useful thing to
do, but left out (and, indeed, eliminated the chance to add the syntax
later) the obviously useful ability to hint at the size. I suppose we
could still add:

d = {}
d.reserve (10*1000*1000)
May 15 '06 #6
Le Lundi 15 Mai 2006 19:24, Roy Smith a écrit*:
d = {}
d.reserve (10*1000*1000)


d={}.fromkeys(x range(5*10**6)) ?

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
May 15 '06 #7
"Claudio Grondi" <cl************ @freenet.de> wrote in message
news:e4******** **@newsreader3. netcologne.de.. .
Chris Foote wrote:
Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me inserting roughly 5M keys:

# keys dictionary metakit (both using psyco)
------ ---------- -------
1M 8.8s 22.2s
2M 24.0s 43.7s
5M 115.3s 105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris

Python Bindings (\Python24\Lib\ bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs \_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.

Claudio


sqlite also supports an in-memory database - use pysqlite
(http://initd.org/tracker/pysqlite/wiki) to access this from Python.

-- Paul
May 15 '06 #8
Maric Michaud schrieb:
Le Lundi 15 Mai 2006 19:24, Roy Smith a écrit :
d = {}
d.reserve (10*1000*1000)


d={}.fromkeys(x range(5*10**6)) ?


That is a totally different beast. You don't want to insert arbitrary
keys, you want the internal hash-array to be of the right size.

Diez
May 15 '06 #9
Roy Smith:
I am befuddled as to why
people thought creating a dict by keyword would be a useful thing to
do, but left out (and, indeed, eliminated the chance to add the syntax
later) the obviously useful ability to hint at the size.


Adding the keyword syntax doesn't require much memory to a Python
programmer because it's the same syntax used elsewhere. While adding
another different method to dicts increases their API. Python is
usually designed to have smaller APIs, to help programmers remember
most things.

I agree that a reserve method to Python dicts/sets can be a little
useful, but Python programmers are usually more concerned about doing
things in a short programmer time and short code space, than telling
the computer how to do such things in the faster way (C++ is more for
speed so the reserve of the STL hashes can be more useful).

Bye,
bearophile

May 15 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
3850
by: robin | last post by:
I need to do a search through about 50 million records, each of which are less than 100 bytes wide. A database is actually too slow for this, so I thought of optimising the data and putting it all in memory. There is a single key field, so a dictionary is an obvious choice for a structure, since Python optimises these nicely. But is there a better choice? Is it worth building some sort of tree?
8
1748
by: placid | last post by:
Hi all, Just wondering if anyone knows how to pop up the dialog that windows pops up when copying/moving/deleting files from one directory to another, in python ? Cheers
2
2112
by: Gabriel Genellina | last post by:
En Wed, 30 Jul 2008 21:29:39 -0300, <python@bdurham.comescribi�: You could use a different hash algorithm yielding a smaller value (crc32, by example, fits on an integer). At the expense of having more collisions, and more processing time to check those possible duplicates. -- Gabriel Genellina
0
9423
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10049
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9997
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9865
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8873
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7413
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5448
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
3965
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
2815
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.