472,984 Members | 1,927 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,984 software developers and data experts.

memory efficient set/dictionary

What is the best to go about using a large set (or dictionary) that
doesn't fit into main memory? What is Python's (2.5 let's say)
overhead for storing int in the set, and how much for storing int ->
int mapping in the dict?

Please recommend a module that allows persistent set/dict storage +
fast query that best fits my problem, and as lightweight as possible.
For queries, the hit ratio is about 10%. Fast updates would be nice,
but i can rewrite the algo so that the data is static, so update speed
is not critical.

Or am i better off not using Python here? Cheers.

Jun 10 '07 #1
6 5667
On Sun, 10 Jun 2007 07:27:56 -0700, koara wrote:
What is the best to go about using a large set (or dictionary) that
doesn't fit into main memory? What is Python's (2.5 let's say)
overhead for storing int in the set, and how much for storing int ->
int mapping in the dict?
How do you know it won't fit in main memory if you don't know the
overhead? A guess? You've tried it and your computer crashed?

Please recommend a module that allows persistent set/dict storage +
fast query that best fits my problem,
Usually I love guessing what people's problems are before making a
recommendation, but I'm feeling whimsical so I think I'll ask first.

What is the problem you are trying to solve? How many keys do you have?
Can you group them in some way, e.g. alphabetically? Do you need to search
on random keys, or can you queue them and do them in the order of your
choice?

and as lightweight as possible.
For queries, the hit ratio is about 10%. Fast updates would be nice,
but i can rewrite the algo so that the data is static, so update speed
is not critical.

Or am i better off not using Python here? Cheers.

--
Steven.

Jun 10 '07 #2
Hello Steven,

On Jun 10, 5:29 pm, Steven D'Aprano
<s...@REMOVE.THIS.cybersource.com.auwrote:
...
How do you know it won't fit in main memory if you don't know the
overhead? A guess? You've tried it and your computer crashed?
exactly
Please recommend a module that allows persistent set/dict storage +
fast query that best fits my problem,

Usually I love guessing what people's problems are before making a
recommendation, but I'm feeling whimsical so I think I'll ask first.

What is the problem you are trying to solve? How many keys do you have?
Corpus processing. There are in the order of billions to tens of
billions keys (64bit integers).
Can you group them in some way, e.g. alphabetically? Do you need to search
on random keys, or can you queue them and do them in the order of your
choice?

Yes, keys in sets and dictionaries can be grouped in any way, order
doesn't matter. Not sure what you mean.
Yes, i need fast random access (at least i do without having to
rethink and rewrite everything, which is what i'd like to avoid with
the help of this thread :-)

Thanks for the reply!
Jun 10 '07 #3
Please recommend a module that allows persistent set/dict storage +
fast query that best fits my problem,
What is the problem you are trying to solve? How many keys do you have?

Corpus processing. There are in the order of billions to tens of
billions keys (64bit integers).
I would recommend you to use a database since it meets your
requirements (off-memory, fast, persistent). The bsdddb module
(berkeley db) even gives you a dictionary like interface.
http://www.python.org/doc/lib/module-bsddb.html

regards,
Rafa
Jun 10 '07 #4
Rafael Darder Calvo wrote:
Please recommend a module that allows persistent set/dict storage +
fast query that best fits my problem,

What is the problem you are trying to solve? How many keys do you have?

Corpus processing. There are in the order of billions to tens of
billions keys (64bit integers).
I would recommend you to use a database since it meets your
requirements (off-memory, fast, persistent). The bsdddb module
(berkeley db) even gives you a dictionary like interface.
http://www.python.org/doc/lib/module-bsddb.html
Standard SQL databases can work for this, but generally your
recommendation of using bsddb works very well for int -int mappings.
In particular, I would suggest using a btree, if only because I have had
troubles in the past with colliding keys in the bsddb.hash (and recno is
just a flat file, and will attempt to create a file i*(record size) to
write to record number i .

As an alternative, there are many search-engine known methods for
mapping int -[int, int, ...], which can be implemented as int -int,
where the second int is a pointer to an address on disk. Looking into a
few of the open source search implementations may be worthwhile.

- Josiah
Jun 11 '07 #5
I would recommend you to use a database since it meets your
requirements (off-memory, fast, persistent). The bsdddb module
(berkeley db) even gives you a dictionary like interface.
http://www.python.org/doc/lib/module-bsddb.html

Standard SQL databases can work for this, but generally your
recommendation of using bsddb works very well for int -int mappings.
In particular, I would suggest using a btree, if only because I have had
troubles in the past with colliding keys in the bsddb.hash (and recno is
just a flat file, and will attempt to create a file i*(record size) to
write to record number i .

As an alternative, there are many search-engine known methods for
mapping int -[int, int, ...], which can be implemented as int -int,
where the second int is a pointer to an address on disk. Looking into a
few of the open source search implementations may be worthwhile.
Thanks guys! I will look into bsddb, hopefully this doesn't keep all
keys in memory, i couldn't find answer to that during my (very brief)
look into the documentation.

And how about the extra memory used for set/dict'ing of integers? Is
there a simple answer?

Jun 11 '07 #6
koara wrote:
>>I would recommend you to use a database since it meets your
requirements (off-memory, fast, persistent). The bsdddb module
(berkeley db) even gives you a dictionary like interface.
http://www.python.org/doc/lib/module-bsddb.html
Standard SQL databases can work for this, but generally your
recommendation of using bsddb works very well for int -int mappings.
In particular, I would suggest using a btree, if only because I have had
troubles in the past with colliding keys in the bsddb.hash (and recno is
just a flat file, and will attempt to create a file i*(record size) to
write to record number i .

As an alternative, there are many search-engine known methods for
mapping int -[int, int, ...], which can be implemented as int -int,
where the second int is a pointer to an address on disk. Looking into a
few of the open source search implementations may be worthwhile.

Thanks guys! I will look into bsddb, hopefully this doesn't keep all
keys in memory, i couldn't find answer to that during my (very brief)
look into the documentation.
No, bsddb does not keep all data in memory.

And how about the extra memory used for set/dict'ing of integers? Is
there a simple answer?
A non-long integer for a 32 bit Python uses 12 bytes. A non-long
integer for a 64 bit Python uses 24 bytes.

Each entry in a dictionary for a 32 bit Python uses 12 bytes; 4 for the
hash, 4 for the key pointer, 4 for the value pointer. Double that to 24
bytes each in the 64 bit Python (I don't know if the hash actually has
its range increased, but if one doesn't double the size of the pointers,
then you have alignment issues that can slow down dictionary access).

Sets only have the hash and key pointers, so only use 8/16 bytes per entry.
- Josiah
Jun 11 '07 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: Benjamin Scott | last post by:
Hello. I attempted to build a compound dictionary: len(Lst)=1000 len(nuerLst)=250 len(nuestLst)=500 Dict={}
12
by: Anon | last post by:
Hello all, I'm hoping for some guidance here... I am a c/c++ "expert", but a complete python virgin. I'm trying to create a program that loads in the entire FreeDB database (excluding the...
4
by: Thomas Rast | last post by:
Hello everyone My scenario is somewhat involved, so if you're in a hurry, here's an abstract: I have a daemon that needs about 80MB of RAM to build its internal data structures, but can pack...
1
by: DJTB | last post by:
zodb-dev@zope.org] Hi, I'm having problems storing large amounts of objects in a ZODB. After committing changes to the database, elements are not cleared from memory. Since the number of...
6
by: Ganesan selvaraj | last post by:
I using C# .net. i want to split the text files based of the some condition. my source text file size may be 4 kb to 4 gb. some time when i split the i got the "out of memory exception. when i...
3
by: RubenDV | last post by:
I am writing code for a dictionary type in c, but i have some problems allocating the memory for it. The dictionary type itself is a struct with the following contents: typedef struct { char **...
4
by: Kev | last post by:
Hello, I have an Access 2003 database running on an XP network. I have a datasheet subform containing a 28 day roster - shift1 to shift28. Each record has 1 RosterEmpID, 1 EmployeeNumber, 28...
7
by: Jim | last post by:
I have an application that will maintain an in-memory database in the form of a list of lists. Does anyone know of a way to search for and retreive "records" from such a structure? Many thanks,...
1
by: buu | last post by:
It's strange to me, but, create a dictionary and fill it with 1 mil. of some objects. then, see the memory consumption (arised, of course). then, clean the dictionary.... memory consumption is...
0
by: lllomh | last post by:
Define the method first this.state = { buttonBackgroundColor: 'green', isBlinking: false, // A new status is added to identify whether the button is blinking or not } autoStart=()=>{
2
by: DJRhino | last post by:
Was curious if anyone else was having this same issue or not.... I was just Up/Down graded to windows 11 and now my access combo boxes are not acting right. With win 10 I could start typing...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 4 Oct 2023 starting at 18:00 UK time (6PM UTC+1) and finishing at about 19:15 (7.15PM) The start time is equivalent to 19:00 (7PM) in Central...
0
tracyyun
by: tracyyun | last post by:
Hello everyone, I have a question and would like some advice on network connectivity. I have one computer connected to my router via WiFi, but I have two other computers that I want to be able to...
4
NeoPa
by: NeoPa | last post by:
Hello everyone. I find myself stuck trying to find the VBA way to get Access to create a PDF of the currently-selected (and open) object (Form or Report). I know it can be done by selecting :...
3
NeoPa
by: NeoPa | last post by:
Introduction For this article I'll be using a very simple database which has Form (clsForm) & Report (clsReport) classes that simply handle making the calling Form invisible until the Form, or all...
1
by: Teri B | last post by:
Hi, I have created a sub-form Roles. In my course form the user selects the roles assigned to the course. 0ne-to-many. One course many roles. Then I created a report based on the Course form and...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 1 Nov 2023 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM) Please note that the UK and Europe revert to winter time on...
4
by: GKJR | last post by:
Does anyone have a recommendation to build a standalone application to replace an Access database? I have my bookkeeping software I developed in Access that I would like to make available to other...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.