huge dictionary -> bsddb/pickle question

Hi,

I have a dictionary that looks something like this:

key1 => {key11 => [1,2], key12 => [6,7], ...}
For lack of better wording, I will call the outer dictionary dict1 and its
value (the inner dictionary) dict2, which is a dictionary of small
fixed-size lists (2 items each).

The key of the outer dictionary is a string and its value is another
dictionary (let's say dict2).
dict2 has string keys, and each value is a list of 2 integers.
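
Concretely, dict1 looks something like this (made-up values):

dict1 = {
    "key1": {"key11": [1, 2], "key12": [6, 7]},
    "key2": {"key21": [3, 4]},
}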

I'm processing a HUGE amount of data (~100M inserts into the dictionary).
I tried 2 options; both seem slow, and I'm seeking suggestions to
improve the speed. The code is in bits and pieces, so I'm just
giving the idea.

1) Use bsddb. When an insert is done, the db has key1 as the key, and
db[key1] is the pickled value of dict2. After every 1000 inserts, I
close and reopen the db in order to flush the contents to disk. Also,
when I try to insert a key that is already present, I unpickle the
value, change something in dict2, and then pickle it back into the
bsddb.
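
Roughly, option 1 looks like this (just a sketch; lines and process_line
stand in for my real input and parsing code):

import bsddb
import cPickle as pickle

db = bsddb.hashopen('data.db', 'c')
count = 0
for line in lines:                          # placeholder input
    key1, key2, pair = process_line(line)   # placeholder parser
    if db.has_key(key1):
        dict2 = pickle.loads(db[key1])      # unpickle the existing inner dict
    else:
        dict2 = {}
    dict2[key2] = pair                      # the small 2-item list
    db[key1] = pickle.dumps(dict2, 2)       # re-pickle the whole inner dict
    count += 1
    if count % 1000 == 0:
        db.sync()                           # (in my code I close and reopen)
db.close()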

2) Instead of pickling the value (dict2) and storing it in bsddb
immediately, I keep dict1 (the outer dictionary) in memory, and when it
reaches 1000 inserts, I store it to bsddb as before, pickling each
individual value. The advantage is that when an insert is done and the
key is already in memory, I just adjust the value and don't need to
unpickle and re-pickle it. If it's not present in memory, I still need
to look it up in bsddb.
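
The sketch for option 2, continuing from the sketch above (same db, same
placeholders):

cache = {}                                   # dict1, kept in memory
pending = 0
for line in lines:
    key1, key2, pair = process_line(line)
    if key1 in cache:
        cache[key1][key2] = pair             # adjust in memory, no pickling
    elif db.has_key(key1):
        dict2 = pickle.loads(db[key1])       # fall back to a bsddb lookup
        dict2[key2] = pair
        cache[key1] = dict2
    else:
        cache[key1] = {key2: pair}
    pending += 1
    if pending >= 1000:                      # flush the batch to bsddb
        for k, v in cache.iteritems():
            db[k] = pickle.dumps(v, 2)
        db.sync()
        cache.clear()
        pending = 0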

Even with option 2 this is not getting up to speed. Before inserting, I
do some processing on each line, so the bottleneck is not clear to me
(i.e. whether it is the processing or the inserting into the db). But
I guess it's mainly because of the pickling and unpickling.

Any suggestions will be appreciated :)

Jun 15 '07 #1
In <11*********************@o11g2000prd.googlegroups.com>, lazy wrote:
I have a dictionary that looks something like this:

key1 => {key11 => [1,2], key12 => [6,7], ...}
For lack of better wording, I will call the outer dictionary dict1 and its
value (the inner dictionary) dict2, which is a dictionary of small
fixed-size lists (2 items each).

The key of the outer dictionary is a string and its value is another
dictionary (let's say dict2).
dict2 has string keys, and each value is a list of 2 integers.

I'm processing a HUGE amount of data (~100M inserts into the dictionary).
I tried 2 options; both seem slow, and I'm seeking suggestions to
improve the speed. The code is in bits and pieces, so I'm just
giving the idea.

[…]

Even with option 2 this is not getting up to speed. Before inserting, I
do some processing on each line, so the bottleneck is not clear to me
(i.e. whether it is the processing or the inserting into the db). But
I guess it's mainly because of the pickling and unpickling.

Any suggestions will be appreciated :)
I guess your guess about pickling being the bottleneck is correct, but
measuring/profiling will give more confidence.

Maybe a database other than bsddb would be useful here: an SQL one like
SQLite, or maybe an object DB like ZODB or Durus.
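
For the measuring part, a quick sketch with the standard library profiler
(assuming the loading loop is wrapped in a top-level function, here called
run()):

import cProfile, pstats

cProfile.run('run()', 'load.profile')            # profile the whole load
stats = pstats.Stats('load.profile')
stats.sort_stats('cumulative').print_stats(20)   # top 20 time sinks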

Ciao,
Marc 'BlackJack' Rintsch
Jun 15 '07 #2
lazy <ar******@gmail.com> wrote:
...
key1 => {key11 => [1,2], key12 => [6,7], ...}
...
I'm processing a HUGE amount of data (~100M inserts into the dictionary).
What will you need from the "saved" version later? If you need lookups
by single keys or key pairs then a relational DB may be best; if you
need to recover the whole huge dictionary into memory (assuming it will
fit, e.g. you have a 64-bit machine with ample RAM) then it might be
more efficient to "dump" things into a custom-designed format. If I were
uncertain about future uses of the saved data, I'd go for the maximal
flexibility afforded by a relational DB -- if you save to a relational
DB you can later easily use the data in any way you like, including
ad-hoc ones (such flexibility is a key architectural feature of
relational databases). E.g., a single table with 4 fields might be
enough: key, subkey, leftint, rightint -- each field of the appropriate
type (you do tell us that the two ints are smallish integers, but we
don't know directly about the types of the keys -- besides a hint that
the main keys are presumably strings of some kind since that's what
bsddb likes as its own keys, but still, choosing the right kind of
string may be important in order to save space and obtain good
performance). The (key,subkey) pair would be the primary key on that
table, and unless and until you need peculiar "navigation" in the future
you may not require other indices, constraints, or whatever.
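
As a sketch, with the sqlite3 module that ships with Python 2.5 (the same
idea carries over to psycopg2/PostgreSQL; the table and column names here
are just examples):

import sqlite3

conn = sqlite3.connect('data.db')
conn.execute("""CREATE TABLE pairs (
                    key      TEXT,
                    subkey   TEXT,
                    leftint  INTEGER,
                    rightint INTEGER,
                    PRIMARY KEY (key, subkey))""")

def upsert(key, subkey, left, right):
    # one row per (key, subkey); REPLACE keeps the latest pair of ints
    conn.execute("INSERT OR REPLACE INTO pairs VALUES (?, ?, ?, ?)",
                 (key, subkey, left, right))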

At the scale you're talking about (multi-gigabyte, it appears) it's
probably worth using a powerful DB (such as PostgreSQL, interfacing it
to Python via psycopg2) rather than a lightweight one (such as sqlite,
which comes with Python 2.5); however, depending on how critical
performance is, such issues are often best resolved by benchmarking
(similarly, you might benchmark the performance for inserting into the
properly constrained table from the start, vs originally making the
table constraints-less and using an ALTER TABLE later to add the primary
key constraint/index that will be useful for later lookups; "bulk
insertions" into DBs can often benefit from the latter idea).
Alex
Jun 15 '07 #3
