
Re: Optimizing size of very large dictionaries

On Wed, 30 Jul 2008 21:29:39 -0300, <py****@bdurham.com> wrote:
Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?

I'm working on a project where my available RAM is limited to 2G
and I would like to use very large dictionaries vs. a traditional
database.

Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.
You could use a different hash algorithm that yields a smaller value (crc32,
for example, fits in an integer), at the expense of more collisions and more
processing time to check those possible duplicates.
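
A rough sketch of that trade-off (my illustration, not code from the post),
assuming `logs` is an iterable of record strings: count occurrences of each
4-byte crc32, then treat repeated crc32 values only as candidates, since
different records can legitimately share a crc32.

import zlib

crc_counts = {}            # crc32 value (a plain int) -> occurrence count
for record in logs:
    crc = zlib.crc32(record)
    crc_counts[crc] = crc_counts.get(crc, 0) + 1

# Records whose crc32 appears more than once are only *possible*
# duplicates; a second pass over just those records is still needed to
# rule out crc32 collisions between different records.
suspects = set(crc for crc, n in crc_counts.iteritems() if n > 1)

Each dictionary key is then a small integer instead of a 20-byte digest,
which is where the memory saving comes from.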

--
Gabriel Genellina

Jul 31 '08 #1

Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?
Sure. You can build your own version of a dict using
UserDict.DictMixin. The underlying structure can be as space
efficient as you want.
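
As a minimal sketch of that idea (mine, not from the post; the class name and
layout are illustrative only), here is a DictMixin-backed mapping that keeps
its fixed-width digest keys in one sorted list and probes them with bisect:

import bisect
from UserDict import DictMixin

class CompactDigestDict(DictMixin):
    """Maps fixed-width digest strings to values without a hash table
    full of per-entry objects."""
    def __init__(self):
        self._keys = []      # sorted list of digest strings
        self._values = []    # parallel list of values
    def _index(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return i
        raise KeyError(key)
    def __getitem__(self, key):
        return self._values[self._index(key)]
    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)
    def __delitem__(self, key):
        i = self._index(key)
        del self._keys[i]
        del self._values[i]
    def keys(self):
        return list(self._keys)

Inserts into the middle of a list are O(n), so in practice you would batch or
pre-sort the keys; the point is only that DictMixin fills in the rest of the
mapping interface for whatever compact layout you choose.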

FWIW, dictionaries automatically become more space-efficient at
larger sizes (50,000+ records): a dict still resizes whenever it gets
two-thirds full, but the growth strategy falls back from quadrupling
to just doubling.
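
If you want to watch that happen, a small probe like this (my sketch; it
needs Python 2.6+ for sys.getsizeof, and the exact numbers are
CPython-specific) prints the dict's size each time its table grows; the jumps
shrink from roughly quadrupling to roughly doubling once the dict passes
about 50,000 entries:

import sys

d = {}
last = sys.getsizeof(d)
for i in xrange(200000):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:
        print '%8d entries -> %10d bytes' % (len(d), size)
        last = size
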
Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.
Tons of memory can be saved by not storing the contents of the
record. Just make an initial pass to identify the digest values of
possible duplicates. Then use a set to identify actual dups, but only
store the records for those whose digest is a possible duplicate:

import collections
from hashlib import sha1

# First pass: count how many times each digest appears.
bag = collections.defaultdict(int)
for record in logs:
    bag[sha1(record).digest()] += 1

# Keep only the digests seen more than once.
possible_dups = set()
while bag:
    hashval, cnt = bag.popitem()
    if cnt > 1:
        possible_dups.add(hashval)

# Second pass: store full records only for candidate digests.
seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in possible_dups:
        seen.add(record)

Raymond

P.S. If the log entries are one-liners, maybe it would be better to
use the operating system's sort/uniq filters.
Jul 31 '08 #2



Raymond Hettinger wrote:
Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size is such that their associated checksum
dictionaries are too big for my workstation's 2G of RAM.

Tons of memory can be saved by not storing the contents of the
record. Just make an initial pass to identify the digest values of
possible duplicates. Then use a set to identify actual dups, but only
store the records for those whose digest is a possible duplicate:

# First pass: count how many times each digest appears.
bag = collections.defaultdict(int)
for record in logs:
    bag[sha1(record).digest()] += 1

# Keep only the digests seen more than once.
possible_dups = set()
while bag:
    hashval, cnt = bag.popitem()
    if cnt > 1:
        possible_dups.add(hashval)
Since actual counts above 1 are not needed, I believe a bit more memory
could be saved by computing the set of possible duplicates incrementally.
The logic is a bit trickier, and seeing the above helps.

from hashlib import sha1

# One counting pass, with no counts kept: `once` holds digests seen
# exactly once so far, `dups` holds digests seen more than once.
once, dups = set(), set()
for record in logs:
    d = sha1(record).digest()
    if d in once:
        once.remove(d)
        dups.add(d)
    elif d not in dups:
        once.add(d)

# Second pass as above, but testing membership in dups.
seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in dups:
        seen.add(record)

Raymond

P.S. If the log entries are one-liners, maybe it would be better to
use the operating system's sort/uniq filters.
--
http://mail.python.org/mailman/listinfo/python-list
Aug 1 '08 #3
