473,602 Members | 2,846 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

huge dictionary -> bsddb/pickle question

Hi,

I have a dictionary something like this,

key1=>{key11=>[1,2] , key12=>[6,7] , .... }
For lack of wording, I will call outer dictionary as dict1 and its
value(inner dictionary) dict2 which is a dictionary of small fixed
size lists(2 items)

The key of the dictionary is a string and value is another dictionary
(lets say dict2)
dict2 has a string key and a list of 2 integers.

Im processesing HUGE(~100M inserts into the dictionary) data.
I tried 2 options both seem to be slower and Im seeking suggestions to
improve the speed. The code is sort of in bits and pieces, so Im just
giving the idea.

1) Use bsddb. when an insert is done, the db will have key1 as key and
the value(i.e db[key1] will be be pickleled value of dict2). after
1000 inserts , I close and open the db ,inorder to flush the contents
to disk. Also when I try to insert a key, if its already present, I
unpickle the value and change something in dict2 and then pickle it
back to the bsddb.

2)Instead of pickling the value(dict2) and storing in bsddb
immediately, I keep the dict1(outer dictionary in memory) and when it
reaches 1000 inserts, I store it to bsddb as before, pickling each
individual value. The advantage of this is, when one insert is done,
if its already present, I adjust the value and I dont need to unpickle
and pickle it back, if its the memory. If its not present in memory, I
will still need to lookup in bsddb

This is not getting to speed even with option 2. Before inserting, I
do some processing on the line, so the bottleneck is not clear to me,
(i.e in processing or inserting to db). But I guess its mainly because
of pickling and unpickling.

Any suggestions will be appreciated :)

Jun 15 '07 #1
2 2523
In <11************ *********@o11g2 000prd.googlegr oups.com>, lazy wrote:
I have a dictionary something like this,

key1=>{key11=>[1,2] , key12=>[6,7] , .... }
For lack of wording, I will call outer dictionary as dict1 and its
value(inner dictionary) dict2 which is a dictionary of small fixed
size lists(2 items)

The key of the dictionary is a string and value is another dictionary
(lets say dict2)
dict2 has a string key and a list of 2 integers.

Im processesing HUGE(~100M inserts into the dictionary) data.
I tried 2 options both seem to be slower and Im seeking suggestions to
improve the speed. The code is sort of in bits and pieces, so Im just
giving the idea.

[…]

This is not getting to speed even with option 2. Before inserting, I
do some processing on the line, so the bottleneck is not clear to me,
(i.e in processing or inserting to db). But I guess its mainly because
of pickling and unpickling.

Any suggestions will be appreciated :)
I guess your guess about the pickling as bottleneck is correct but
measuring/profiling will give more confidence.

Maybe another database than bsddb might be useful here. An SQL one like
SQLite or maybe an object DB like zodb or Durus.

Ciao,
Marc 'BlackJack' Rintsch
Jun 15 '07 #2
lazy <ar******@gmail .comwrote:
...
key1=>{key11=>[1,2] , key12=>[6,7] , .... }
...
Im processesing HUGE(~100M inserts into the dictionary) data.
What will you need from the "saved" version later? If you need lookups
by single keys or key pairs then a relational DB may be best; if you
need to recover the whole huge dictionary into memory (assuming it will
fit, e.g. you have a 64-bit machine with ample RAM) then it might be
more efficient to "dump" things into a custom-designed format. If I was
uncertain about future uses of the saved data, I'd go for the maximal
flexibility afforded by a relational DB -- if you save to a relational
DB you can later easily use the data in every which way, including
ad-hoc ones (such flexibility is a key architectural feature of
relational databases). E.g., a single table with 4 fields might be
enough: key, subkey, leftint, rightint -- each field of the appropriate
type (you do tell us that the two ints are smallish integers, but we
don't know directly about the types of the keys -- besides a hint that
the main keys are presumably strings of some kind since that's what
bsddb likes as its own keys, but still, choosing the right kind of
string may be important in order to save space and obtain good
performance). The (key,subkey) pair would be the primary key on that
table, and unless and until you need peculiar "navigation " in the future
you may not require other indices, constraints, or whatever.

At the scale you're talking about (multi-gigabyte, it appears) it's
probably worth using a powerful DB (such as PostgreSQL, interfacing it
to Python via psycopg2) rather than a lightweight one (such as sqlite,
which comes with Python 2.5); however, depending on how critical
performance is, such issues are often best resolved by benchmarking
(similarly, you might benchmark the performance for inserting into the
properly constrained table from the start, vs originally making the
table constraints-less and using an ALTER TABLE later to add the primary
key constraint/index that will be useful for later lookups; "bulk
insertions" into DBs can often benefit from the latter idea).
Alex
Jun 15 '07 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
3578
by: none | last post by:
or is it just me? I am having a problem with using a dictionary as an attribute of a class. This happens in python 1.5.2 and 2.2.2 which I am accessing through pythonwin builds 150 and 148 respectively In the sample code you see that I have class Item and class Dict class Dict contains a dictionary called items. The items dictionary will contain instances of Item that are keyed off of the Item name. In __main__ I create two...
1
1283
by: GrelEns | last post by:
hello, having almost 1,000 tar.gz files in different directories (could not change that) and these archives contain over 1,000,000 text files. I would like to build a tool to access as quickly as possible any or sub-collection of these text files to serve them by http upon user request. does anyone have ideas on the good way to do it ? (i was thinking of a mapping in a dictionary whose keys would be filename,
4
2511
by: brianobush | last post by:
# # My problem is that I want to create a # class, but the variables aren't known # all at once. So, I use a dictionary to # store the values in temporarily. # Then when I have a complete set, I want to # init a class from that dictionary. # However, I don't want to specify the # dictionary gets by hand # since it is error prone.
15
2495
by: barbaros | last post by:
Hello everybody, I need some advice on the following problem. I want to write a program (in C or C++) which will build a huge dictionary (hash array, if you prefer). The keys are strings, the values are floating point numbers. The dictionary will soon become larger than the RAM memory (I expect more than ten gygabytes). So I need to save it (or parts of it) to the disk, while still being able to
1
9243
by: john wright | last post by:
I have a dictionary oject I created and I want to bind a listbox to it. I am including the code for the dictionary object. Here is the error I am getting: "System.Exception: Complex DataBinding accepts as a data source either an IList or an IListSource at System.Windows.Forms.ListControl.set_DataSource(Object value)
8
13305
by: Brian L. Troutwine | last post by:
I've got a problem that I can't seem to get my head around and hoped somebody might help me out a bit: I've got a dictionary, A, that is arbitarily large and may contains ints, None and more dictionaries which themselves may contain ints, None and more dictionaries. Each of the sub-dictionaries is also arbitrarily large. When pretty printing A, in the context I'm using A for, it's rather helpful to remove all key:value pairs where value...
13
2341
by: vd12005 | last post by:
Hello, i would like to sort(ed) and reverse(d) the result of many huge dictionaries (a single dictionary will contain ~ 150000 entries). Keys are words, values are count (integer). i'm wondering if i can have a 10s of these in memory, or if i should proceed one after the other. but moreover i'm interested in saving theses as values, keys sorted and
3
2206
by: JamesB | last post by:
I have a config screen in my app that updates a dictionary<key,value>. I want to store a copy of this before going into the config screen so if the user wants to cancel all their changes I can simply set the working copies to be the backup. So, in my code, before the config screen is shown, I do this: Dictionary<string, myClassTempDic = MainDic;
8
7969
by: Bob Altman | last post by:
Hi all, I'm trying to do something that should be really easy, but I can't think of an obvious way to do it. I have a dictionary whose value is a "value type" (as opposed to a reference type -- in this case, a Boolean): Dim dic As New Dictionary(Of Int32, Boolean) dic(123) = True I want to set all of the existing entries in the dictionary to False. The
1
2270
by: sachin2 | last post by:
I am using 3 types of dictionaries. 1) Dictionary<string, string > d = new Dictionary<string, string>(); 2) Dictionary<string, List<string>> d = new Dictionary<string, List<string>>(); 3) Dictionary<string, Dictionary<string, string>> d = new Dictionary<string, Dictionary<string, string>>(); Now I am using GetDictionaryType Function where I an sending a dictionary object. In GetDictionaryType() I want to find out which dictionary...
0
7920
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
8401
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
8404
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
8268
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
6730
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
5867
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
3900
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
1
2418
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
0
1254
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.