On Sat, 10 Nov 2007 13:56:35 -0800, Michael Bacarella wrote:
Quote:
The id2name.txt file is an index of primary keys to strings. They look
like this:
>
11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n
>
The file's properties are:
>
# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt
>
I'm loading the file into memory with code like this:
>
id2name = {}
for line in iter(open('id2name.txt').readline,''):
    id,name = line.strip().split(':')
    id = long(id)
    id2name[id] = name
That's an awfully complicated way to iterate over a file. Try this
instead:
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':')
    id = long(id)
    id2name[id] = name
On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.
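To see where the time goes on a given machine, the load can be timed directly. What follows is a minimal sketch rather than the code used for the figure above: the names load_id2name and timed_load are hypothetical, it targets Python 3 (where int takes the place of long), and it splits on the first colon only, on the assumption that a name might itself contain a colon.

```python
import time


def load_id2name(path):
    """Build a dict mapping integer ids to names from an 'id:name' file."""
    id2name = {}
    with open(path) as f:
        for line in f:  # a file object yields one line per iteration
            # Split on the first colon only (assumption: names may contain ':').
            key, name = line.strip().split(':', 1)
            id2name[int(key)] = name  # int replaces Python 2's long
    return id2name


def timed_load(path):
    """Return (dictionary, elapsed seconds) for a single load."""
    start = time.perf_counter()
    table = load_id2name(path)
    return table, time.perf_counter() - start
```

Calling timed_load('id2name.txt') returns the dictionary together with the elapsed time. If that time grows far out of proportion to the number of lines, memory pressure is a more likely suspect than the dictionary insert itself.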
Quote:
This takes about 45 *minutes*
>
If I comment out the last line in the loop body it takes only about 30
_seconds_ to run. This would seem to implicate the line id2name[id] =
name as being excruciatingly slow.
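No, dictionary access is one of the most highly optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory and is struggling to find room for the data. The
file alone is 517MB, and the dictionary built from it will be larger
still, since every key and every value is a separate Python object with
its own overhead.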
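To get a feel for how much room the data needs once it is in memory, the two example records from the quoted message can be measured. This is a rough sketch under stated assumptions: object_bytes is a hypothetical helper, sys.getsizeof reports shallow sizes that are specific to CPython, and the dictionary's own hash table is left out, so the result is only a lower bound.

```python
import sys


def object_bytes(key, value):
    """Shallow size of one key object plus one value object.

    A lower bound: the dictionary's own hash table is not counted,
    and sys.getsizeof figures are specific to CPython.
    """
    return sys.getsizeof(key) + sys.getsizeof(value)


# The two example records quoted earlier in the thread.
records = [
    (11293102971459182412, "Descriptive unique name for this record"),
    (950918240981208142, "Another name for another record"),
]

for key, name in records:
    on_disk = len("%d:%s\n" % (key, name))  # bytes in the text file (ASCII)
    in_memory = object_bytes(key, name)
    print("%d bytes on disk, at least %d bytes in memory" % (on_disk, in_memory))
```

Multiplying the in-memory figure by the 8191180 lines reported by wc gives a crude floor for the finished dictionary, which can then be compared against the physical memory available.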
Quote:
Is there a fast, functionally equivalent way of doing this?
>
(Yes, I really do need this cached. No, an RDBMS or disk-based hash is
not fast enough.)
You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you chose to iterate over a file, I wonder what other less-than-efficient
code you are struggling with. Nine times out of ten, if a program runs
too slowly, it's because you're using the wrong algorithm.
--
Steven.