
Lists, tuples and memory.

Hi, all!

Here is the problem:
I have a file which contains a common dictionary, one word per line
(appr. 700KB and 70000 words). I have to read it into memory for future
"spell checking" of the words coming from the customer. The file is
presorted. So here it goes:

lstdict = map(lambda x: x.lower().strip(),
              file("D:\\CommonDictionary.txt"))

Works like a charm. It takes 0.7 seconds on my machine, and python.exe
(from Task Manager data) was using 2636K before this line executed and
5520K after. The difference is 2884K. Not that bad, taking into account
that in C I'd read the file into memory (700K), scan for CRs, count
them, replace them with '\0', and allocate an index vector of the word
beginnings of the size found while counting CRs. In this particular
case the index vector would be almost 300K. So far so good!

Then I realized that a list is overkill for lstdict; a tuple is
enough in my case. So I modified the code:

t = tuple(file("D:\\CommonDictionary.txt"))
lstdict = map(lambda x: x.lower().strip(), t)

This code works a little bit faster: 0.5 sec, but takes 5550K memory.
And maybe this is understandable: after all the first line creates a
list and a tuple and the second another tuple (all of the same size).

But then I decided to compare these pieces:

t = tuple(file("D:\\CommonDictionary.txt")) # 1
lstdict = map(lambda x: x.lower().strip(), t) # 2

lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 3

As expected, after line 2 memory was 5550K, but after line 3 it jumped
to 7996K!!!

The question:

If reference counting is used, why did the second assignment to
lstdict (line 3) not free the memory allocated by the first one
(line 2) and reuse it?

So one more experiment:

t = tuple(file("D:\\CommonDictionary.txt")) # 1
lstdict = map(lambda x: x.lower().strip(), t) # 2
del lstdict # 3
lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 4

In this case executing line 4 did not add memory!

By the way, the search speed is very good when done by this function:

from bisect import bisect

def inDict(lstdict, word):
    try:
        return lstdict[bisect(lstdict, word) - 1] == word
    except IndexError:
        return False

500 words are tested in less than 0.03 seconds.
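Why the `bisect(lstdict, word) - 1` trick works may deserve a line: `bisect.bisect` returns the insertion point to the *right* of any equal elements, so the element just before that point is the largest one <= word. A small self-contained sketch (with illustrative data, not the poster's file; the explicit `i >= 0` guard is my addition to avoid the negative-index wraparound):

```python
import bisect

# A sorted word list, standing in for the dictionary file.
words = ["apple", "banana", "cherry", "date"]

def in_dict(sorted_words, word):
    """Return True if word is in sorted_words, via binary search.

    bisect.bisect returns the insertion point to the RIGHT of any
    equal elements, so index - 1 is the last element <= word.
    """
    i = bisect.bisect(sorted_words, word) - 1
    return i >= 0 and sorted_words[i] == word

assert in_dict(words, "cherry")
assert not in_dict(words, "grape")      # falls between "date" and nothing
assert not in_dict(words, "aardvark")   # insertion point 0, caught by i >= 0
```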
Jul 18 '05 #1
18 Replies


On 15 Jul 2004, Elbert Lev wrote:
But then I decieded to compare this pieces:

t = tuple(file("D:\\CommonDictionary.txt")) # 1
lstdict = map(lambda x: x.lower().strip(), t) # 2

lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 3

As expected, after line 2 memory was 5550K, but after line 3 it jumped
to 7996K!!!

The question:

If refference counting is used, why the second assignment to lstdict
(line 3) did not free the memory alocated by the first one (line 2)
and reuse it?


Because after line 2, you still have a 2MB tuple sitting around in 't'. ;)

Try this:

t = tuple(file("D:\\CommonDictionary.txt")) # 1
lstdict = map(lambda x: x.lower().strip(), t)
del t # 2

lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 3

Python doesn't know that you're not going to try to access t at some
future point, even though there are no more accesses in your code.

A couple of other thoughts on what you're doing:

1) Stick with lists. Tuples are meant to be used not really as immutable
lists, but as a way to group related items of different types. e.g. the
sequence 'apple','orange','pear','banana' should be stored in a list,
whereas the group of related items 'apple','red',3,5 (presumably
describing an apple in some predefined manner, say,
fruit,color,width,height) should be stored in a tuple. At least that's
what Guido wants us to do. ;)

2) Assuming you're using a newer version of Python, try using a list
comprehension instead of map(). It's a little bit cleaner, and will
probably be a bit faster too:

lstdict = [x.lower().strip() for x in file("D:\\CommonDictionary.txt")]

If you've never worked with them before, list comprehensions are a new
syntax intended to replace map(), filter(), and other such constructs with
one unified syntax. They're usually very easy to read and write (as the
one above is) since they help you avoid using lambda.
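A concrete side-by-side of the two spellings (sample lines stand in for the dictionary file; note that in the Python 2 of this thread map() returned a list directly, while modern Pythons return an iterator, hence the list() call):

```python
# Sample lines standing in for file("D:\\CommonDictionary.txt").
lines = ["Apple \n", "BANANA\n", " Cherry \n"]

# map() with a lambda, as in the original post:
via_map = list(map(lambda x: x.lower().strip(), lines))

# The equivalent list comprehension -- no lambda needed:
via_comp = [x.lower().strip() for x in lines]

assert via_map == via_comp == ["apple", "banana", "cherry"]
```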

Jul 18 '05 #2

Any reason not to put the words into a dictionary?

Then your code becomes:

lstdict = dict([(x.lower().strip(), None) for x in
                file("D:\\CommonDictionary.txt")])
if lstdict.has_key(word):
    do something

(not tested, but should be close)

I'll bet access is faster (even if loading is slightly slower).
You also benefit from the fact that the dictionary file
doesn't need to be kept sorted.

HTH,
Larry Bates
Syscon, Inc.

"Elbert Lev" <el*******@hotmail.com> wrote in message
news:94**************************@posting.google.com...
[...]

Jul 18 '05 #3

"Larry Bates" <lb****@swamisoft.com> wrote:
Any reason not to put the words into a dictionary?
[...]
I'll bet access is faster (even if loading is slightly slower).


One trick I've used when I have something with a long startup time
because it's parsing static files and building data structures is to
build the data structure once, then pickle it (see the pickle and
cPickle modules).

Your production code then just has to call cPickle.load(file) and
you're up and running way faster than re-parsing the original data
files.
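A minimal sketch of the trick (using an in-memory buffer in place of a real cache file; in modern Python, cPickle has been folded into the plain pickle module):

```python
import io
import pickle

# Build the expensive structure once (a small sample word list here).
words = ["apple", "banana", "cherry"]

# Dump it to a buffer standing in for a cache file on disk.
buf = io.BytesIO()
pickle.dump(words, buf, protocol=pickle.HIGHEST_PROTOCOL)

# Later (or in another run), load it back -- typically much faster
# than re-parsing and re-building from the original text file.
buf.seek(0)
restored = pickle.load(buf)

assert restored == words
```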
Jul 18 '05 #4

On Fri, 15 Jul 2004, Elbert Lev wrote:

{...}
But then I decided to compare these pieces:

t = tuple(file("D:\\CommonDictionary.txt")) # 1
lstdict = map(lambda x: x.lower().strip(), t) # 2

lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 3

As expected, after line 2 memory was 5550K, but after line 3 it jumped
to 7996K!!!

The question:

If reference counting is used, why did the second assignment to
lstdict (line 3) not free the memory allocated by the first one
(line 2) and reuse it?

So one more experiment:

t = tuple(file("D:\\CommonDictionary.txt")) # 1
lstdict = map(lambda x: x.lower().strip(), t) # 2
del lstdict # 3
lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 4

In this case executing line 4 did not add memory!


The reason the memory is not released without the explicit del is that the
reference count of the original object is not adjusted until the
rebinding of "lstdict" takes place, which is after the new object is fully
constructed. The del in your experiment forces this adjustment.
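This evaluation order can be observed directly with a weak reference (a sketch; `Payload` is a stand-in class, and the immediate-free behaviour relies on CPython's reference counting):

```python
import weakref

class Payload:
    """Stands in for the big list built from the dictionary file."""
    pass

obj = Payload()
alive = weakref.ref(obj)   # lets us observe when the old object dies

def build_new():
    # While the new object is being constructed, the old one is still
    # referenced by the name 'obj', so it cannot be freed yet.
    assert alive() is not None
    return Payload()

obj = build_new()          # rebinding happens only after the call returns
assert alive() is None     # CPython's refcounting frees the old object now
```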

--
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: an*****@bullseye.apana.org.au (pref) | Snail: PO Box 370
an*****@pcug.org.au (alt) | Belconnen ACT 2616
Web: http://www.andymac.org/ | Australia
Jul 18 '05 #5


"Larry Bates" <lb****@swamisoft.com> wrote in message
news:CL********************@comcast.com...
Any reason not to put the words into a dictionary?
for spell checking, I think I would...
Then your code becomes:

lstdict=dict([(x.lower().strip(), None) for x in
file("D:\\CommonDictionary.txt")])


The list comp creates an unnecessary (and, in this case, large)
intermediate list.
In 2.4, removing the [] pair would result in a generator expression.
In the meanwhile...
>>> def lostrip(src):
...     for w in src:
...         yield w.lower().strip(), None
...
>>> t = dict(lostrip(['abc ', 'DEG\n']))
>>> t
{'abc': None, 'deg': None}
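For completeness, the 2.4-style generator expression mentioned above would look like this (a sketch, with sample lines standing in for the file; modern Pythons accept it too):

```python
# Sample lines standing in for file("D:\\CommonDictionary.txt").
lines = ['abc ', 'DEG\n']

# The generator expression feeds dict() one (key, value) pair at a
# time, with no intermediate list.
t = dict((w.lower().strip(), None) for w in lines)

assert t == {'abc': None, 'deg': None}
```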

Terry J. Reedy

Jul 18 '05 #6

On Thu, 15 Jul 2004 15:52:53 -0400, Christopher T King
<sq******@WPI.EDU> declaimed the following in comp.lang.python:
whereas the group of related items 'apple','red',3,5 (presumably
describing an apple in some predefined manner, say,
fruit,color,width,height) should be stored in a tuple. At least that's
Must be a mutant red pear <G>

I don't think I've ever seen an apple with a 2" difference in
height vs width. Granted, a Red Delicious (being a common apple) does
have a bit of a height advantage...

Now, if it were 3 (per pound) and 5 (pounds) <G> (I think one
reference for carb count [diabetic here] has small apples at 3/pound,
large at 2/pound).

--
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net       | Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Jul 18 '05 #7

Andrew MacIntyre <an*****@bullseye.apana.org.au> wrote in message news:<ma*************************************@python.org>...
On Fri, 15 Jul 2004, Elbert Lev wrote:

{...}

The reason the memory is not released without the explicit del is that the
reference count of the original object is not adjusted until the
rebinding of "lstdict" takes place, which is after the new object is fully
constructed. The del in your experiment forces this adjustment.


Thanks!

This is what I wanted to know.
Jul 18 '05 #8

On Thu, 15 Jul 2004 15:52:53 -0400, Christopher T King
<sq******@WPI.EDU> wrote:


1) Stick with lists. Tuples are meant to be used not really as immutable
lists, but as a way to group related items of different types. e.g. the
sequence 'apple','orange','pear','banana' should be stored in a list,
whereas the group of related items 'apple','red',3,5 (presumably
describing an apple in some predefined manner, say,
fruit,color,width,height) should be stored in a tuple. At least that's
what Guido wants us to do. ;)
But with Fletcher's confession of sticking with tabs ... it's anarchy
out there.

2) Assuming you're using a newer version of Python, try using a list
comprehension instead of map(). It's a little bit cleaner, and will
probably be a bit faster too:


Faster is good.

Modern and progressive governance. The tax incentive concept at work.

Art
Jul 18 '05 #9

On 15 Jul 2004 12:09:47 -0700, Elbert Lev <el*******@hotmail.com> wrote:
[...]

Then I realized that a list is overkill for lstdict; a tuple is
enough in my case. So I modified the code:

t = tuple(file("D:\\CommonDictionary.txt"))
lstdict = map(lambda x: x.lower().strip(), t)

This code works a little bit faster: 0.5 sec, but takes 5550K memory.
And maybe this is understandable: after all the first line creates a
list and a tuple and the second another tuple (all of the same size).


any reason you didn't do it as

lstdict = tuple([x.strip().lower() for x in file("D:\\CommonDictionary.txt")])

?

also, you might find you can use chunks of the bisect module to speed
up your searching, if you find using a dict too expensive. That's
assuming the bisect module is written in C, which I haven't checked.

--
John Lenton (jl*****@gmail.com) -- Random fortune:
bash: fortune: command not found
Jul 18 '05 #10

"Larry Bates" <lb****@swamisoft.com> wrote in message news:<CL********************@comcast.com>...
Any reason not to put the words into a dictionary?

[...]


Dictionary access is almost 100 times faster than bisect on a list.
Build time is about 1 second (very good). Memory consumption is
almost 8 MB, 2.5 times higher than with the list (not too good).
Unless very fast access time is an imperative, I'd rather not use it.

Off topic: isn't this a great feature of Python? In almost no time I
checked several data representations and was able to choose the best
for my case. In C++, even with STL, such research would take at least
a day.
Jul 18 '05 #11

On Fri, 16 Jul 2004, Dennis Lee Bieber wrote:
On Thu, 15 Jul 2004 15:52:53 -0400, Christopher T King
<sq******@WPI.EDU> declaimed the following in comp.lang.python:
whereas the group of related items 'apple','red',3,5 (presumably
describing an apple in some predefined manner, say,
fruit,color,width,height) should be stored in a tuple. At least that's


Must be a mutant red pear <G>

I don't think I've ever seen an apple with a 2" difference in
height vs width. Granted, a Red Delicious (being a common apple) does
have a bit of a height advantage...


Let's just say spacetime is warped... ;)

Jul 18 '05 #12

On 16 Jul 2004, Elbert Lev wrote:
"Larry Bates" <lb****@swamisoft.com> wrote in message news:<CL********************@comcast.com>...
Any reason not to put the words into a dictionary?


Dictionary access is almost 100 times faster than bisect on a list.
Build time is about 1 second (very good). Memory consumption is
almost 8 MB, 2.5 times higher than with the list (not too good).
Unless very fast access time is an imperative, I'd rather not use
it.


Try using a set instead of a dictionary. You should get the good access
time of dictionaries with nearly the low memory usage of a list:

from sets import Set

lstdict = Set([x.lower().strip() for x in file("D:\\CommonDictionary.txt")])

if word in lstdict:
    <do something>

Combine the above with the suggestions to use generators for best
performance.

Also available is sets.ImmutableSet. I'm not sure if this provides
enhanced performance over Set though.
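With the builtin set type that later superseded sets.Set, the same idea is a one-liner (a sketch, with sample lines standing in for the dictionary file):

```python
# Sample lines standing in for file("D:\\CommonDictionary.txt").
lines = ["Apple \n", "BANANA\n", " Cherry \n"]

# Build the set directly from a generator -- no None placeholders,
# no intermediate list.
lstdict = set(x.lower().strip() for x in lines)

assert "banana" in lstdict
assert "grape" not in lstdict
```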

Jul 18 '05 #13

Christopher T King wrote:
Try using a set instead of a dictionary. You should get the good access
time of dictionaries with nearly the low memory usage of a list:


sets.Set() holds its data in a dict. I fear the same goes for 2.4's builtin
set type which is coded in C but also built on top of dict.

Peter

Jul 18 '05 #14

On Fri, 16 Jul 2004, Peter Otten wrote:
Christopher T King wrote:
Try using a set instead of a dictionary. You should get the good access
time of dictionaries with nearly the low memory usage of a list:


sets.Set() holds its data in a dict. I fear the same goes for 2.4's builtin
set type which is coded in C but also built on top of dict.


Ick. I thought part of the reason Set was created (aside from enabling
set operations) was to improve on the dict() storage method.

Jul 18 '05 #15

In article <Pi**************************************@ccc7.wpi.edu>,
Christopher T King <sq******@WPI.EDU> wrote:
On Fri, 16 Jul 2004, Peter Otten wrote:
Christopher T King wrote:
Try using a set instead of a dictionary. You should get the good access
time of dictionaries with nearly the low memory usage of a list:


sets.Set() holds its data in a dict. I fear the same goes for 2.4's builtin
set type which is coded in C but also built on top of dict.


Ick. I thought part of the reason Set was created (aside from enabling
set operations) was to improve on the dict() storage method.


What alternative storage method did you have in mind that would be as
efficient and that would allow the same types of objects to be collected
into sets?

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #16


"Peter Otten" <__*******@web.de> wrote in message
news:cd*************@news.t-online.com...
Christopher T King wrote:
Try using a set instead of a dictionary. You should get the good access
time of dictionaries with nearly the low memory usage of a list:
sets.Set() holds its data in a dict. I fear the same goes for 2.4's

builtin set type which is coded in C but also built on top of dict.


My impression was that it was based on the dict code, but without reading
the source, I don't really know.

TJR

Jul 18 '05 #17

On Fri, 16 Jul 2004, David Eppstein wrote:
In article <Pi**************************************@ccc7.wpi .edu>,
Christopher T King <sq******@WPI.EDU> wrote:
Ick. I thought part of the reason Set was created (aside from enabling
set operations) was to improve on the dict() storage method.


What alternative storage method did you have in mind that would be as
efficient and that would allow the same types of objects to be collected
into sets?


One that didn't have to store a pointer to None for every single item :/

Jul 18 '05 #18

Terry Reedy wrote:
"Peter Otten" <__*******@web.de> wrote in message
news:cd*************@news.t-online.com...
Christopher T King wrote:
> Try using a set instead of a dictionary. You should get the good access
> time of dictionaries with nearly the low memory usage of a list:


sets.Set() holds its data in a dict. I fear the same goes for 2.4's

builtin
set type which is coded in C but also built on top of dict.


My impression was that it was based on the dict code, but without reading
the source, I don't really know.


The "fear" was rhetoric, I actually looked it up. In setobject.c, as of
2.4a1:

make_new_set(PyTypeObject *type, PyObject *iterable)
{
    PyObject *data = NULL;
    PyObject *tmp;
    PySetObject *so = NULL;

    data = PyDict_New();
    if (data == NULL)
        return NULL;
    ....

Peter
Jul 18 '05 #19
