
Re: Optimizing size of very large dictionaries

On Wed, 30 Jul 2008 21:29:39 -0300, <py****@bdurham.com> wrote:
> Are there any techniques I can use to strip a dictionary data
> structure down to the smallest memory overhead possible?
>
> I'm working on a project where my available RAM is limited to 2G
> and I would like to use very large dictionaries vs. a traditional
> database.
>
> Background: I'm trying to identify duplicate records in very
> large text based transaction logs. I'm detecting duplicate
> records by creating a SHA1 checksum of each record and using this
> checksum as a dictionary key. This works great except for several
> files whose size is such that their associated checksum
> dictionaries are too big for my workstation's 2G of RAM.

You could use a different hash algorithm that yields a smaller value (crc32,
for example, fits in an integer), at the expense of more collisions
and more processing time to check those possible duplicates.
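A minimal sketch of the crc32 idea, assuming the records are bytes and that the log can be scanned twice. The function name `duplicates_by_crc32` is made up for illustration; `zlib.crc32` is the standard-library call. Because different records can share a CRC, the second pass compares the full records before reporting anything:

```python
import zlib


def duplicates_by_crc32(records):
    """Yield each record (bytes) that repeats an earlier one.

    `records` must be re-iterable (for example a list), because it is
    scanned twice.
    """
    # Pass 1: remember only 32-bit CRCs, never the records themselves.
    seen_once, candidates = set(), set()
    for record in records:
        key = zlib.crc32(record)
        if key in seen_once:
            candidates.add(key)      # CRC seen before: possible duplicate
        else:
            seen_once.add(key)
    del seen_once                    # no longer needed; free it before pass 2

    # Pass 2: keep full records only for candidate CRCs and compare them,
    # so that CRC collisions are not reported as duplicates.
    confirmed = set()
    for record in records:
        if zlib.crc32(record) in candidates:
            if record in confirmed:
                yield record
            else:
                confirmed.add(record)


logs = [b"a", b"b", b"a", b"c", b"b", b"a"]
for dup in duplicates_by_crc32(logs):
    print("Duplicate:", dup)
```

The extra processing time mentioned above shows up in the second pass: every record whose CRC is a candidate has to be stored and compared in full.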

--
Gabriel Genellina

Jul 31 '08 #1
> Are there any techniques I can use to strip a dictionary data
> structure down to the smallest memory overhead possible?

Sure. You can build your own version of a dict using
UserDict.DictMixin. The underlying structure can be as space
efficient as you want.
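As an illustration of a space-efficient underlying structure, here is a rough sketch that packs fixed-width digests into a single bytearray with linear probing. It is not built on UserDict.DictMixin (a Python 2 class whose Python 3 counterpart is collections.abc.MutableMapping) and it implements only a small set-like interface. The class name `DigestSet` and its parameters are invented for this example:

```python
from hashlib import sha1


class DigestSet:
    """Fixed-capacity set of equal-width digests packed into one bytearray."""

    def __init__(self, capacity, width=20):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity
        self.width = width                      # 20 bytes for a SHA-1 digest
        self.table = bytearray(capacity * width)
        self.used = bytearray(capacity)         # one occupancy flag per slot
        self.count = 0

    def _find(self, digest):
        """Return (slot, found), probing linearly from the digest's home slot."""
        if len(digest) != self.width:
            raise ValueError("digest has the wrong width")
        w = self.width
        slot = int.from_bytes(digest[:8], "big") % self.capacity
        while self.used[slot]:
            if self.table[slot * w:(slot + 1) * w] == digest:
                return slot, True
            slot = (slot + 1) % self.capacity
        return slot, False

    def add(self, digest):
        """Store digest; return True if it was already present."""
        slot, found = self._find(digest)
        if found:
            return True
        # Always keep one free slot so that probing is guaranteed to stop.
        if self.count >= self.capacity - 1:
            raise MemoryError("DigestSet is full")
        w = self.width
        self.table[slot * w:(slot + 1) * w] = digest
        self.used[slot] = 1
        self.count += 1
        return False

    def __contains__(self, digest):
        return self._find(digest)[1]

    def __len__(self):
        return self.count


digests = DigestSet(capacity=1024)
for record in [b"alpha\n", b"beta\n", b"alpha\n"]:
    if digests.add(sha1(record).digest()):
        print("Duplicate digest for:", record)
```

The table costs `capacity * (width + 1)` bytes regardless of how many digests are stored, and no separate Python object is kept per key. The trade-off is that the capacity has to be chosen up front and lookups run in pure Python.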

FWIW, dictionaries automatically become more space efficient at
larger sizes (50,000+ records): whenever a dict gets two-thirds full
it is resized, and above that size the quadrupling strategy falls
back to just doubling.
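The thresholds and growth factors quoted above describe the interpreter of that time and may differ in other versions. A small sketch for observing where a dict actually resizes on your own interpreter, using `sys.getsizeof` (the helper name `resize_points` is hypothetical):

```python
import sys


def resize_points(n):
    """Return (entries, bytes) pairs recorded each time the dict's size changes."""
    d = {}
    last = sys.getsizeof(d)
    points = [(0, last)]
    for i in range(n):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:             # the dict's reported size just changed
            points.append((len(d), size))
            last = size
    return points


points = resize_points(100_000)
for entries, size in points:
    print(f"{entries:>7} entries -> {size:>9} bytes")
```

Note that `sys.getsizeof` reports the dict's own storage only, not the key and value objects it refers to.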
> Background: I'm trying to identify duplicate records in very
> large text based transaction logs. I'm detecting duplicate
> records by creating a SHA1 checksum of each record and using this
> checksum as a dictionary key. This works great except for several
> files whose size is such that their associated checksum
> dictionaries are too big for my workstation's 2G of RAM.

Tons of memory can be saved by not storing the contents of the
record. Just make an initial pass to identify the digest values of
possible duplicates. Then use a set to identify actual dups, but only
store the records for those whose digest is a possible duplicate:

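import collections
from hashlib import sha1

# logs: a re-iterable sequence of records (bytes); it is scanned twice.
# Pass 1: count occurrences of each digest without storing any record.
bag = collections.defaultdict(int)
for record in logs:
    bag[sha1(record).digest()] += 1
# Keep digests seen more than once; popitem() releases each entry as we go.
possible_dups = set()
while bag:
    hashval, cnt = bag.popitem()
    if cnt > 1:
        possible_dups.add(hashval)
# Pass 2: store only records whose digest is a possible duplicate.
seen = set()
for record in logs:
    if record in seen:
        print('Duplicate:', record)
    elif sha1(record).digest() in possible_dups:
        seen.add(record)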

Raymond

P.S. If the log entries are one liners, maybe it would be better to
use the operating system's sort/uniq filters.
Jul 31 '08 #2


Raymond Hettinger wrote:
>> Background: I'm trying to identify duplicate records in very
>> large text based transaction logs. I'm detecting duplicate
>> records by creating a SHA1 checksum of each record and using this
>> checksum as a dictionary key. This works great except for several
>> files whose size is such that their associated checksum
>> dictionaries are too big for my workstation's 2G of RAM.
>
> Tons of memory can be saved by not storing the contents of the
> record. Just make an initial pass to identify the digest values of
> possible duplicates. Then use a set to identify actual dups, but only
> store the records for those whose digest is a possible duplicate:
>
> bag = collections.defaultdict(int)
> for record in logs:
>     bag[sha1(record).digest()] += 1
> possible_dups = set()
> while bag:
>     hashval, cnt = bag.popitem()
>     if cnt > 1:
>         possible_dups.add(hashval)

Since actual counts above 1 are not needed, I believe a bit more memory
could be saved by computing possible_dups incrementally. The logic is a
bit trickier, and seeing the above helps.

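# Track digests seen exactly once vs. more than once; no counts are kept.
once, dups = set(), set()
for record in logs:
    d = sha1(record).digest()
    if d in once:
        # Second sighting: promote the digest to a possible duplicate.
        once.remove(d)
        dups.add(d)
    elif d not in dups:
        once.add(d)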
> seen = set()
> for record in logs:
>     if record in seen:
>         print('Duplicate:', record)
>     elif sha1(record).digest() in possible_dups:
>         seen.add(record)
>
> Raymond
>
> P.S. If the log entries are one liners, maybe it would be better to
> use the operating system's sort/uniq filters.
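For reference, the incremental approach above can be combined with the quoted second pass into one self-contained sketch. The function names are made up for illustration, records are assumed to be bytes, and `records` is assumed to be re-iterable since it is scanned twice:

```python
from hashlib import sha1


def possible_dup_digests(records):
    """Return the set of digests that occur more than once in records."""
    once, dups = set(), set()
    for record in records:
        d = sha1(record).digest()
        if d in once:
            once.remove(d)           # second sighting: move it to dups
            dups.add(d)
        elif d not in dups:
            once.add(d)              # first sighting
    return dups


def find_duplicates(records):
    """Return every record that repeats an earlier record, in log order."""
    dups = possible_dup_digests(records)
    seen = set()
    found = []
    for record in records:
        if record in seen:
            found.append(record)
        elif sha1(record).digest() in dups:
            seen.add(record)         # store only candidate records
    return found


logs = [b"x\n", b"y\n", b"x\n", b"z\n", b"x\n"]
for record in find_duplicates(logs):
    print("Duplicate:", record)
```

Here the set called `dups` plays the role of `possible_dups` in the quoted second pass.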
Aug 1 '08 #3
