
Large Dictionaries

Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer). The standard python
dictionary works well for small numbers of keys, but starts to
perform badly for me when inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M        8.8s          22.2s
2M        24.0s         43.7s
5M        115.3s        105.4s

Has anyone written a fast hash module which is more optimal for
large datasets ?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.
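[For reference, a minimal sketch - not the poster's actual benchmark - of the kind of
tuple-keyed dictionary population being timed above; the key ranges and counts here
are illustrative assumptions.]

import time
import random

def populate(n):
    # Build a dict keyed by (long_integer, integer) tuples; values are unused.
    d = {}
    start = time.time()
    for i in xrange(n):
        key = (random.randint(10 ** 10, 10 ** 11), random.randint(0, 3))
        d[key] = None
    print '%d keys inserted in %.1fs' % (len(d), time.time() - start)
    return d

if __name__ == '__main__':
    populate(1000000)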

Cheers,
Chris
May 15 '06
>22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

With bdb's it is crucial to insert keys in bytestring-sorted order.
Also, be sure to give it a decent amount of cache.

-Mike

May 16 '06 #31
Klaas wrote:
22.2s 20m25s[3]
20m to insert 1m keys? You are doing something wrong.


Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

test code is attached.
With bdb's it is crucial to insert keys in bytestring-sorted order.
For the bsddb test, I'm using a plain string. (The module docs list a
string being the only datatype supported for both keys & values).
Also, be sure to give it a decent amount of cache.


The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

...
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize:
method not permitted when environment specified')
I'll file a bug report on this if it isn't already fixed.
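[A possible workaround, untested: since hashopen() creates its own DBEnv internally,
the cache can instead be set on an explicitly created DBEnv before the database is
opened, via the lower-level bsddb.db wrapper. The directory, file name and cache
size below are illustrative assumptions.]

import os
from bsddb import db

envdir = '/tmp/bdb-cache-test'                  # hypothetical scratch directory
if not os.path.exists(envdir):
    os.makedirs(envdir)

env = db.DBEnv()
env.set_cachesize(0, 64 * 1024 * 1024)          # 0 GB + 64 MB of cache
env.open(envdir, db.DB_CREATE | db.DB_INIT_MPOOL)

d = db.DB(env)
d.open('numbers.db', None, db.DB_HASH, db.DB_CREATE)
d.put('61880000000:0', '')                      # example key in the '%d:%d' format
d.close()
env.close()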

Cheers,
Chris

May 17 '06 #32
Chris Foote wrote:
Klaas wrote:
22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 104.3s
Surprising here: the dictionary population takes about the same time, but
BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2 - and you?
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

As I don't have SQLite installed, it is interesting to see if the factor of
10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose that SQLite first adds all the
records and builds the index afterwards, once all the records are there (with
db.commit()).
Can the same be done in BerkeleyDB, or does BerkeleyDB not support
inserting records without updating the index on every single insert?
If so, how?

Claudio
test code is attached.
With bdb's it is crucial to insert keys in bytestring-sorted order.

For the bsddb test, I'm using a plain string. (The module docs list a
string being the only datatype supported for both keys & values).
Also, be sure to give it a decent amount of cache.

The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

...
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument --
DB->set_cachesize: method not permitted when environment specified')
I'll file a bug report on this if it isn't already fixed.

Cheers,
Chris
------------------------------------------------------------------------

import bsddb
import random
import time
import sqlite

import psyco
psyco.full()

class WallClockTimer(object):
    '''Simple timer for measuring tests.'''
    def __init__(self, msg):
        self.start = time.time()
        self.msg = msg
        print time.ctime(self.start), self.msg, 'started'

    def stop(self):
        end = time.time()
        duration = end - self.start
        print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration

class NumberGen(object):
    '''Phone number generator for testing different methods of storage.'''
    def __init__(self, range_start, range_end, n, max_wildcards):

        print '''Number generator test for %d number ranges
with a maximum of %d wildcard digits.''' % (n, max_wildcards)

        wildcards = range(0, max_wildcards + 1)
        # generate n unique numbers and store as keys in number_hash
        timer = WallClockTimer('dictionary population')
        self.number_hash = {}

        for i in xrange(0, n):
            unique = False
            while not unique:
                wildcard_digits = self.gen_wildcard_digits(wildcards)
                num = self.gen_number(range_start, range_end)
                if wildcard_digits > 0:
                    num /= (10 ** wildcard_digits)
                key = (num, wildcard_digits)
                if self.number_hash.has_key(key):
                    unique = False
                else:
                    unique = True
                    self.number_hash[key] = None
        timer.stop()

    def gen_wildcard_digits(self, wildcards):
        return random.choice(wildcards)

    def gen_number(self, start, end):
        return int(random.uniform(start, end))

    def storage_test(self, StorageTestClass):
        test = StorageTestClass(self.number_hash)

class StorageTest(object):
    '''base class for testing storage. Derive a test
    class and provide your own runtest() method.'''
    def __init__(self, number_hash):
        timer = WallClockTimer('%s population' % type(self).__name__)
        self.runtest(number_hash)
        timer.stop()

class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

class StorageSQLite(StorageTest):
    def runtest(self, number_hash):
        db = sqlite.connect(':memory:')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
        q = """insert into numbers (num, wd) values (%d, %d)"""
        for (num, wildcard_digits) in number_hash.keys():
            cursor.execute(q, num, wildcard_digits)
        db.commit()
        db.close()

if __name__ == '__main__':

    test_vals = {'range_start'   : 61880 * 10 ** 7,
                 'range_end'     : 61891 * 10 ** 7,
                 'n'             : 1 * 10 ** 4,
                 'max_wildcards' : 3}

    ngen = NumberGen(test_vals['range_start'], test_vals['range_end'],
                      test_vals['n'], test_vals['max_wildcards'])

    ngen.storage_test(StorageBerkeleyDB)
    ngen.storage_test(StorageSQLite)

May 17 '06 #33
Richie Hindle wrote:
[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?


PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."


Thanks for the suggestion Richie. PyJudy works brilliantly :-)

Cheers,
Chris
May 18 '06 #34
Claudio Grondi wrote:
Chris Foote wrote:
Klaas wrote:
22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.


I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 104.3s

Surprising here: the dictionary population takes about the same time, but
BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2 - and you?


Fedora core 5 with ext3 filesystem. The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

As I don't have SQLite installed, it is interesting to see if the factor of
10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose that SQLite first adds all the
records and builds the index afterwards, once all the records are there (with
db.commit()).


SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.

Cheers,
Chris
May 18 '06 #35
Chris Foote wrote:
Claudio Grondi wrote:
Chris Foote wrote:
Klaas wrote:

> 22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:
I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration 104.3s
>
Surprising here: the dictionary population takes about the same time,
but BerkeleyDB inserts the records 6 times faster on my computer
than on yours. I am running Python 2.4.2 on Windows XP SP2 - and you?


Fedora core 5 with ext3 filesystem. The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).

OK, according to the Windows Task Manager, the Python process
reads/writes around 7 GByte(!) of data to the file system during the run of
the BerkeleyDB test and the hard drive is continuously busy, while the
size of the file I found in the Temp directory stays below 20 MByte. The
hard drive access is probably the main reason for losing time - so here is a
question for the BerkeleyDB experts:

Can BerkeleyDB, via the Python bsddb3 interface, be tuned to use only RAM?
Or, since BerkeleyDB can scale to larger amounts of data, does it not make
much sense to tweak it into RAM?

Chris, is a RAM disk maybe the right way to go here, to save the time lost
accessing the file stored in the file system on the hard drive?

The RAM requirements, according to the Windows XP Task Manager, are below
100 MByte. I am using the NTFS file system (yes, I know that FAT is in
some configurations faster than NTFS) and XP Professional SP2 without
any tuning of file system caching. The CPU is 100% busy.

What CPU and RAM (SIMM, DDR, DDR2) do you have? I have 2 GByte of fast DDR
PC400/3200 dual-channel RAM. It seems that you are still not getting
results within the range others experience running your code, so I
suppose it has something to do with the hardware you are using.
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

As I don't have SQLite installed, it is interesting to see if the
factor of 10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose that SQLite first adds all the
records and builds the index afterwards, once all the records are there
(with db.commit()).


SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.

One of the reasons I keep an eye on BerkeleyDB is that it claims to
scale to a huge amount (terabytes) of data, does not need as much RAM as a
Python dictionary, and there is no need to save/load a pickled version
of the data (i.e. here the dictionary) from/to RAM in order to work with
it.
I guess that in your case BerkeleyDB is, for the named reasons, probably
the right way to go, unless your data stays small and the Python
dictionary holding it will always fit into RAM.

Now I am curious to know which path you have decided to take, and why.

Claudio
Cheers,
Chris


May 18 '06 #36
Chris Foote wrote:
Richie Hindle wrote:
[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?


PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."


Thanks for the suggestion Richie. PyJudy works brilliantly :-)

Cheers,
Chris

It seems that there is no Microsoft Windows version of Judy available,
so PyJudy won't work on Windows - am I right?

Claudio
May 18 '06 #37
Chris:
Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.


This is absolutely false.

-Mike

May 24 '06 #38
Chris:
class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()


BDBs can accomplish what you're looking to do, but they need to be
tuned carefully. I won't get into too many details here, but you have
a few fatal flaws in that code.

1. 8Kb of cache is _pathetic_. Give it a few hundred megs. This is by
far your biggest problem.
2. Use BTREE unless you have a good reason to use DBHASH
3. Use proper bdb env creation instead of the hash_open apis.
4. Insert your keys in sorted order.
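
[A rough, untested sketch of what points 1-4 might look like together, using the
lower-level bsddb.db wrapper; the cache size, paths and key format below are
illustrative assumptions rather than recommended settings.]

import os
from bsddb import db

envdir = '/tmp/bdb-tuned'                         # hypothetical env directory
if not os.path.exists(envdir):
    os.makedirs(envdir)

env = db.DBEnv()
env.set_cachesize(0, 256 * 1024 * 1024)           # 1. a few hundred megs of cache
env.open(envdir, db.DB_CREATE | db.DB_INIT_MPOOL)

d = db.DB(env)                                    # 3. explicit env, not hashopen()
d.open('numbers.db', None, db.DB_BTREE, db.DB_CREATE)   # 2. BTREE, not DBHASH

def store(number_hash):
    # 4. insert keys in bytestring-sorted order
    for key in sorted('%d:%d' % k for k in number_hash):
        d.put(key, '')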

-Mike

May 24 '06 #39
Klaas schrieb:
4. Insert your keys in sorted order.

This advice is questionable -
whether inserting pre-sorted values is better depends at least on
the db vendor and probably sometimes on the sort method.

My gut feeling on this matter is:
IF the insert times for pre-sorted values are far better
than the times for unsorted values, there is a chance
that the resulting tree is unbalanced: only one compare
operation per insert and no re-balancing of the tree.

Re-indexing will probably give you far better access times
in this case. Another option is to drop non-RI indexes used only
for query optimization and recreate them after the batch insert.
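
[An untested sketch of the drop-indexes / batch-insert / recreate idea, written
against the standard sqlite3 module rather than the older pysqlite API used in the
thread's test code; the table and index names are invented for illustration.]

import sqlite3

def batch_insert(db, rows):
    cur = db.cursor()
    # Drop the query-optimisation index so the batch insert does not pay
    # for per-row index maintenance...
    cur.execute('DROP INDEX IF EXISTS idx_numbers')
    cur.executemany('INSERT INTO numbers (num, wd) VALUES (?, ?)', rows)
    # ...then rebuild it once over the full data set.
    cur.execute('CREATE INDEX idx_numbers ON numbers (num, wd)')
    db.commit()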

my 0.02 EUR

thomas
May 29 '06 #40
