On 24/03/2007 8:11 AM, Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files. The files range in size from 25 MB to 200 MB.
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record.
An object with only 1 attribute and no useful methods seems a little
pointless; I presume you will elaborate it later.
> However, it seems that doing this
> causes the memory overhead to go up two or three times.
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2. (Memory usage is
> checked using top.)
> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
> Is this "just the way it is" or am I overlooking something obvious?
> Thanks,
> Matt
> Example 1: read lines into list:
> # begin readlines.py
Interesting name for the file :-)
How about using the file.readlines() method?
Why do you want all 200 MB in memory at once anyway?
> import sys, time
> filedata = list()
> file = open(sys.argv[1])
You have just clobbered the builtin file() function/type. In this case
it doesn't matter, but you should lose the habit, quickly.
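To make the clobbering concrete: once a builtin's name is rebound, the builtin is unreachable by that name for the rest of the module. A minimal sketch (the `file` builtin is gone in Python 3, so `list` stands in here to show the same trap):

```python
def demo():
    # Rebinding the builtin name `list` shadows the builtin list() type,
    # just as `file = open(...)` shadows the builtin file() in Python 2.
    list = [1, 2, 3]
    try:
        list("abc")  # raises TypeError: 'list' object is not callable
        return False
    except TypeError:
        return True

assert demo()
```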
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     filedata.append(line)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
How about using raw_input('Hit the Any key...') ?
> # end readlines.py
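As an aside, the readline-until-empty loop can be replaced by iterating over the file object itself, which yields one line at a time. A sketch, modernized to run on current Pythons (an in-memory file stands in for `sys.argv[1]`):

```python
import io

def read_lines(f):
    # Iterating a file object is lazy: each line is produced on demand,
    # so nothing forces all 200 MB into memory unless you keep it.
    return [line for line in f]

# usage sketch with an in-memory file standing in for a real one
data = io.StringIO("a|b|c\nd|e|f\n")
lines = read_lines(data)
assert lines == ["a|b|c\n", "d|e|f\n"]
```

If you only need one record at a time, drop the list and process each line inside the loop instead of accumulating.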
> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:
>     def __init__(self, line):
>         self.line = line
> records = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     rec = FileRecord(line)
>     records.append(rec)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
> # end readobjects.py
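To see where the 2-3x blowup comes from: each FileRecord instance carries its own attribute dict on top of the line it wraps, and that overhead is paid once per record. A rough sketch using sys.getsizeof (added in Python 2.6, so this is a modernized illustration the 2.3/2.4 interpreters above could not run; the class mirrors the one quoted):

```python
import sys

class FileRecord(object):
    def __init__(self, line):
        self.line = line

class SlottedRecord(object):
    __slots__ = ['line']  # no per-instance __dict__
    def __init__(self, line):
        self.line = line

line = 'field1|field2|field3\n'
rec = FileRecord(line)
# instance plus its attribute dict, versus a slotted instance
boxed = sys.getsizeof(rec) + sys.getsizeof(rec.__dict__)
slotted = sys.getsizeof(SlottedRecord(line))

# The dict overhead dominates for short lines, which is what
# multiplies memory usage across millions of records.
assert boxed > slotted
```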
After all that, you still need to split the lines into the multiple
fields that one would expect in a record.
A possibly faster alternative to reading lines (however fast your line
reader) and calling line.split('|') is to use the csv module, as in the
following example, which also shows one way of making an object out of a
row of data.
C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
    __slots__ = ['first', 'family', 'email']
    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def readpipe(fname):
    if hasattr(fname, 'read'):
        f = fname
    else:
        f = open(fname, 'rb')
        # 'b' is in case you'd like your script to be portable
    reader = csv.reader(
        f,
        delimiter='|',
        quoting=csv.QUOTE_NONE,
        # Set quotechar to a char that you don't expect in your data,
        # e.g. the ASCII control char BEL (0x07). This is necessary
        # for Python 2.3, whose csv module used the quoting arg only
        # when writing; otherwise your " characters may get stripped off.
        quotechar='\x07',
        skipinitialspace=True,
        )
    for row in reader:
        if row == ['']: # blank line
            continue
        c = Contacts(row)
        # do something useful with c, e.g.
        print [(x, getattr(c, x)) for x in dir(c)
               if not x.startswith('_')]

if __name__ == '__main__':
    if sys.argv[1:2]:
        readpipe(sys.argv[1])
    else:
        print '*** Testing ***'
        import cStringIO
        readpipe(cStringIO.StringIO('''\
Bi**************@aol.com
Joseph ("Joe")|B*********@acoy.com
"Joe"|B*********@acoy.com
Sa****************@northpole.org
'''))
C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', 'b***@aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 's*****@northpole.org'), ('family', 'Claus'), ('first', 'Santa')]
C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', 'b***@aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 's*****@northpole.org'), ('family', 'Claus'), ('first', 'Santa')]
C:\junk>
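For completeness: on current Pythons the same pipe-delimited parse can be written as a generator, so records are produced one at a time instead of being accumulated in a list. A sketch carrying over the field names from the example above; the sample names and example.com addresses are made up for illustration:

```python
import csv, io

FIELDS = ('first', 'family', 'email')

def iter_records(f):
    # Same csv.reader settings as the example above: pipe delimiter,
    # no quoting, and an unlikely quotechar (BEL) for old Pythons.
    reader = csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE,
                        quotechar='\x07', skipinitialspace=True)
    for row in reader:
        if not row or row == ['']:  # skip blank lines
            continue
        # yielding keeps only one record alive at a time
        yield dict(zip(FIELDS, row))

# usage sketch with made-up data
sample = io.StringIO('Biff|Bloggs|biff@example.com\n'
                     '\n'
                     'Joe|Blow|joe@example.com\n')
records = list(iter_records(sample))
assert len(records) == 2
assert records[0]['family'] == 'Bloggs'
```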
HTH,
John