
Re: Optimizing size of very large dictionaries

On Wed, 30 Jul 2008 21:29:39 -0300, <py****@bdurham.com> wrote:
Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?

I'm working on a project where my available RAM is limited to 2G
and I would like to use very large dictionaries vs. a traditional
database.

Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.
You could use a different hash algorithm that yields a smaller value (crc32,
for example, fits in an integer), at the expense of more collisions and more
processing time to check those possible duplicates.
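
A rough sketch of that trade-off (my illustration, not code from the post),
assuming `logs` is an iterable of record strings: count occurrences of each
4-byte crc32, then treat repeated crc32 values only as candidates, since
different records can legitimately share a crc32.

import zlib

crc_counts = {}            # crc32 value (a plain int) -> occurrence count
for record in logs:
    crc = zlib.crc32(record)
    crc_counts[crc] = crc_counts.get(crc, 0) + 1

# Records whose crc32 appears more than once are only *possible*
# duplicates; a second pass over just those records is still needed to
# rule out crc32 collisions between different records.
suspects = set(crc for crc, n in crc_counts.iteritems() if n > 1)

Each dictionary key is then a small integer instead of a 20-byte digest,
which is where the memory saving comes from.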

--
Gabriel Genellina

Jul 31 '08 #1

Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?
Sure. You can build your own version of a dict using
UserDict.DictMixin. The underlying structure can be as space
efficient as you want.
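
As a minimal sketch of that idea (mine, not from the post; the class name and
layout are illustrative only), here is a DictMixin-backed mapping that keeps
its fixed-width digest keys in one sorted list and probes them with bisect:

import bisect
from UserDict import DictMixin

class CompactDigestDict(DictMixin):
    """Maps fixed-width digest strings to values without a hash table
    full of per-entry objects."""
    def __init__(self):
        self._keys = []      # sorted list of digest strings
        self._values = []    # parallel list of values
    def _index(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return i
        raise KeyError(key)
    def __getitem__(self, key):
        return self._values[self._index(key)]
    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)
    def __delitem__(self, key):
        i = self._index(key)
        del self._keys[i]
        del self._values[i]
    def keys(self):
        return list(self._keys)

Inserts into the middle of a list are O(n), so in practice you would batch or
pre-sort the keys; the point is only that DictMixin fills in the rest of the
mapping interface for whatever compact layout you choose.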

FWIW, dictionaries automatically become more space-efficient at
larger sizes (50,000+ records): a dict still resizes whenever it gets
two-thirds full, but the growth strategy falls back from quadrupling
to just doubling.
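
If you want to watch that happen, a small probe like this (my sketch; it
needs Python 2.6+ for sys.getsizeof, and the exact numbers are
CPython-specific) prints the dict's size each time its table grows; the jumps
shrink from roughly quadrupling to roughly doubling once the dict passes
about 50,000 entries:

import sys

d = {}
last = sys.getsizeof(d)
for i in xrange(200000):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:
        print '%8d entries -> %10d bytes' % (len(d), size)
        last = size
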
Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.
Tons of memory can be saved by not storing the contents of the
record. Just make an initial pass to identify the digest values of
possible duplicates. Then use a set to identify actual dups, but only
store the records for those whose digest is a possible duplicate:

import collections
from hashlib import sha1

# First pass: count how many times each digest appears.
bag = collections.defaultdict(int)
for record in logs:
    bag[sha1(record).digest()] += 1

# Keep only the digests seen more than once.
possible_dups = set()
while bag:
    hashval, cnt = bag.popitem()
    if cnt > 1:
        possible_dups.add(hashval)

# Second pass: store full records only for candidate digests.
seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in possible_dups:
        seen.add(record)

Raymond

P.S. If the log entries are one-liners, maybe it would be better to
use the operating system's sort/uniq filters.
Jul 31 '08 #2



Raymond Hettinger wrote:
Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.

Tons of memory can be saved by not storing the contents of the
record. Just make an initial pass to identify the digest values of
possible duplicates. Then use a set to identify actual dups, but only
store the records for those whose digest is a possible duplicate:

# First pass: count how many times each digest appears.
bag = collections.defaultdict(int)
for record in logs:
    bag[sha1(record).digest()] += 1

# Keep only the digests seen more than once.
possible_dups = set()
while bag:
    hashval, cnt = bag.popitem()
    if cnt > 1:
        possible_dups.add(hashval)
Since actual counts above 1 are not needed, I believe a bit more memory
could be saved by computing the set of possible duplicates incrementally.
The logic is a bit trickier, and seeing the above helps.

from hashlib import sha1

# One counting pass, with no counts kept: `once` holds digests seen
# exactly once so far, `dups` holds digests seen more than once.
once, dups = set(), set()
for record in logs:
    d = sha1(record).digest()
    if d in once:
        once.remove(d)
        dups.add(d)
    elif d not in dups:
        once.add(d)

# Second pass as above, but testing membership in dups.
seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in dups:
        seen.add(record)

Raymond

P.S. If the log entries are one-liners, maybe it would be better to
use the operating system's sort/uniq filters.
--
http://mail.python.org/mailman/listinfo/python-list
Aug 1 '08 #3
