
Populating a dictionary, fast


The id2name.txt file is an index of primary keys to strings. They look like this:

11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n

The file's properties are:

# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt

I'm loading the file into memory with code like this:

id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*.

If I comment out the last line in the loop body it takes only about 30 _seconds_ to run.
This would seem to implicate the line id2name[id] = name as being excruciatingly slow.

Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash is not fast enough.)

Nov 10 '07 #1
3 Replies


On Sat, 10 Nov 2007 13:56:35 -0800, Michael Bacarella wrote:
> The id2name.txt file is an index of primary keys to strings. They look
> like this:
>
> 11293102971459182412:Descriptive unique name for this record\n
> 950918240981208142:Another name for another record\n
>
> The file's properties are:
>
> # wc -l id2name.txt
> 8191180 id2name.txt
> # du -h id2name.txt
> 517M id2name.txt
>
> I'm loading the file into memory with code like this:
>
> id2name = {}
> for line in iter(open('id2name.txt').readline, ''):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
That's an awfully complicated way to iterate over a file. Try this
instead:

id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name
On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.

> This takes about 45 *minutes*
>
> If I comment out the last line in the loop body it takes only about 30
> _seconds_ to run. This would seem to implicate the line id2name[id] =
> name as being excruciatingly slow.
No, dictionary access is one of the most highly-optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory, and is struggling to find room for 517MB worth of
data.
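A rough way to test that claim (a sketch, not from the thread; the key values and value sizes are synthetic) is to time the same number of plain dictionary insertions with no file I/O at all:

import time

start = time.time()
d = {}
for i in range(8191180):
    # synthetic large integer keys and roughly name-sized string values
    d[i * 2654435761] = 'x' * 40
print('inserted %d entries in %.1f seconds' % (len(d), time.time() - start))

If that loop finishes in seconds while the real load takes tens of minutes, the dictionary itself is unlikely to be the culprit.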

> Is there a fast, functionally equivalent way of doing this?
>
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash is
> not fast enough.)
You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you had to iterate over a file, I wonder what other less-than-efficient
parts of your code you are struggling under. Nine times out of ten, if a
program runs too slowly, it's because you're using the wrong algorithm.

--
Steven.
Nov 10 '07 #2

Michael Bacarella <mb**@gpshopper.com> writes:
> Is there a fast, functionally equivalent way of doing this?
>
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash
> is not fast enough.)
As Steven says, maybe you need to add more RAM to your system. The
memory overhead of dictionary cells is considerable. If worse comes
to worst, you could concoct some more storage-efficient representation.
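One such representation (a sketch under assumptions not stated in the thread: that every id fits in an unsigned 64-bit integer and that id2name.txt is sorted numerically by id) keeps the keys in a compact array and the names in a parallel list, then finds entries by binary search:

from array import array
from bisect import bisect_left

ids = array('Q')    # 8 bytes per key instead of a full dictionary entry
names = []

for line in open('id2name.txt'):
    key, name = line.rstrip('\n').split(':', 1)
    ids.append(int(key))
    names.append(name)

def lookup(key):
    # binary search over the sorted key array
    i = bisect_left(ids, key)
    if i != len(ids) and ids[i] == key:
        return names[i]
    raise KeyError(key)

Lookups become O(log n) instead of O(1), but each key costs only 8 bytes of storage.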
Nov 10 '07 #3

On Nov 10, 4:56 pm, Michael Bacarella <m...@gpshopper.com> wrote:
> This would seem to implicate the line id2name[id] = name as being excruciatingly slow.
As others have pointed out, there is no way that this takes 45
minutes. It must be something with your system or setup.

Functionally equivalent code runs for me in about 49 seconds!
(It ends up using about twice as much RAM as the data on disk.)
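One way to check that overhead yourself (a sketch, not from the original post; ru_maxrss is reported in kilobytes on Linux) is to print the peak resident set size once the load finishes:

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print('peak resident set size: %d kB' % usage.ru_maxrss)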

i.


Nov 11 '07 #4
