Python object overhead?

I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

Thanks,
Matt
Example 1: read lines into list:
# begin readlines.py
import sys, time
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readlines.py
Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line
records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readobjects.py
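[For readers on a current interpreter: the per-record cost Matt is seeing can be made visible with sys.getsizeof, which was only added in Python 2.6 and so wasn't available to the original poster. A rough Python 3 sketch, with a made-up record line:]

```python
import sys

# A hypothetical record line like those in the original files.
line = "field1|field2|field3\n"

class FileRecord:
    def __init__(self, line):
        self.line = line

rec = FileRecord(line)

# The string costs its own size; wrapping it in an instance adds an
# object header plus a per-instance __dict__ holding the attribute.
string_cost = sys.getsizeof(line)
wrapper_cost = sys.getsizeof(rec) + sys.getsizeof(rec.__dict__)

# wrapper_cost is pure overhead: the string is stored by reference,
# not copied, so every one of those bytes is extra bookkeeping per record.
```

Exact numbers vary by Python version and platform, but the wrapper cost is paid once per record, which is why it dominates for millions of short lines.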
Mar 23 '07 #1
On Fri, 23 Mar 2007 15:11:35 -0600, Matt Garman wrote:

Is this "just the way it is" or am I overlooking something obvious?

Matt,

Instantiating even the smallest object a large number of times is
costly compared to a simple list append.

I don't think you can get around some overhead with the objects.

However, in terms of general efficiency not specifically related to
object instantiation, you should look into xreadlines().

I'd suggest doing the following instead of that while loop:

for line in open(sys.argv[1]).xreadlines():
    ...
--
Mark Nenadov -skype: marknenadov, web: http://www.marknenadov.com
-"Glory is fleeting, but obscurity is forever." -- Napoleon Bonaparte

Mar 23 '07 #2
On Fri, 23 Mar 2007 18:27:25 -0300, Mark Nenadov
<ma**@freelance-developer.com> wrote:
I'd suggest doing the following instead of that while loop:

for line in open(sys.argv[1]).xreadlines():
The poor xreadlines method had a short life: it was born in Python 2.1 and
was deprecated in 2.3 :(
A file is now its own line iterator:

f = open(...)
for line in f:
    ...

--
Gabriel Genellina

Mar 23 '07 #3
On Fri, 23 Mar 2007 19:11:23 -0300, Gabriel Genellina wrote:
Poor xreadlines method had a short life: it was born on Python 2.1 and got
deprecated on 2.3 :(
A file is now its own line iterator:

f = open(...)
for line in f:
    ...
Gabriel,

Thanks for pointing that out! I had completely forgotten about
that!

I've tested them before. readlines() is very slow. The deprecated
xreadlines() is close in speed to open() as an iterator. In my particular
test, I found the following:

readlines()     - 32 "time units"
xreadlines()    - 0.7 "time units"
open() iterator - 0.41 "time units"
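[A comparison along these lines can still be reproduced on a current interpreter. This is an illustrative sketch with made-up in-memory data, not the original benchmark; xreadlines() is gone in Python 3, so it only contrasts readlines() with direct iteration, and absolute timings will vary:]

```python
import io
import timeit

data = "a|b|c\n" * 10000  # made-up sample records

def with_readlines():
    f = io.StringIO(data)
    # readlines() materializes every line as a list before looping
    return [line for line in f.readlines()]

def with_iterator():
    f = io.StringIO(data)
    # the file object is its own iterator; lines are produced lazily
    return [line for line in f]

t_readlines = timeit.timeit(with_readlines, number=20)
t_iterator = timeit.timeit(with_iterator, number=20)
# Both produce identical lines; only the memory/time profile differs.
```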

--
Mark Nenadov -skype: marknenadov, web: http://www.marknenadov.com
-"They need not trust me right away simply because the British say
that I am O.K.; but they are so ridiculous. Microphones everywhere
and planted so obviously. Why, if I bend over to smell a bowl of
flowers, I scratch my nose on a microphone."
-- Tricycle (Dusko Popov) on American Intelligence

Mar 23 '07 #4
Matt Garman wrote:
Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing
this causes the memory overhead to go up two or three times.
(Note that almost everything in Python is an object!)
Example 1: read lines into list:
# begin readlines.py
import sys, time
filedata = list()
file = open(sys.argv[1])
while True:
line = file.readline()
if len(line) == 0: break # EOF
"one blank line" == "EOF"? That's strange. Intended?

The most common form for this would be "if not line: (do
something)".
Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
def __init__(self, line):
self.line = line
What's this class intended to do?

Regards,
Björn

--
BOFH excuse #1:

clock speed

Mar 24 '07 #5
Bjoern Schliessmann <us**************************@spamgourmet.com> writes:
if len(line) == 0: break # EOF
"one blank line" == "EOF"? That's strange. Intended?
A blank line would have length 1 (a newline character).
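[That distinction is easy to verify directly; a small sketch on a modern Python, using an in-memory file:]

```python
import io

# "first line\n", then a blank line ("\n"), then a final line
# with no trailing newline.
f = io.StringIO("first line\n\nlast line")
lengths = [len(f.readline()) for _ in range(4)]

# A blank line still carries its newline, so its length is 1; only
# true end-of-file makes readline() return '', whose length is 0.
```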
Mar 24 '07 #6
Bjoern Schliessmann wrote:

> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF

"one blank line" == "EOF"? That's strange. Intended?

The most common form for this would be "if not line: (do
something)".
"not line" and "len(line) == 0" are the same as long as "line" is a
string.

He's checking OK, because a "blank line" has length 1 (because of its
newline); only EOF gives length 0.

> class FileRecord:
>     def __init__(self, line):
>         self.line = line

What's this class intended to do?
Unless I understood it wrong, it's just an object that holds the line
inside.

Just OO purity, not practicality...

--
.. Facundo
..
Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
Mar 24 '07 #7
Matt Garman wrote:
I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.
Why do you want all the records in memory at once? Are you
doing some lookup on them, or what? If you're processing files
sequentially, don't keep them all in memory.

You're getting into the size range where it may be time to
use a database.

John Nagle
Mar 24 '07 #8
On 24/03/2007 8:11 AM, Matt Garman wrote:
I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record.
An object with only 1 attribute and no useful methods seems a little
pointless; I presume you will elaborate it later.
However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

Thanks,
Matt
Example 1: read lines into list:
# begin readlines.py
Interesting name for the file :-)
How about using the file.readlines() method?
Why do you want all 200Mb in memory at once anyway?
import sys, time
filedata = list()
file = open(sys.argv[1])
You have just clobbered the builtin file() function/type. In this case
it doesn't matter, but you should lose the habit, quickly.
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
How about using raw_input('Hit the Any key...') ?
# end readlines.py
Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line
records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readobjects.py
After all that, you still need to split the lines into the more-than-one
fieldS (plural) that one would expect in a record.

A possibly faster alternative to (fastest_line_reader_so_far plus
line.split('|')) is to use the csv module, as in the following example,
which also shows one way of making an object out of a row of data.

C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
    __slots__ = ['first', 'family', 'email']
    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def readpipe(fname):
    if hasattr(fname, 'read'):
        f = fname
    else:
        f = open(fname, 'rb')
        # 'b' is in case you'd like your script to be portable
    reader = csv.reader(
        f,
        delimiter='|',
        quoting=csv.QUOTE_NONE,
        # Set quotechar to a char that you don't expect in your data
        # e.g. the ASCII control char BEL (0x07). This is necessary
        # for Python 2.3, whose csv module used the quoting arg only when
        # writing, otherwise your " characters may get stripped off.
        quotechar='\x07',
        skipinitialspace=True,
        )
    for row in reader:
        if row == ['']: # blank line
            continue
        c = Contacts(row)
        # do something useful with c, e.g.
        print [(x, getattr(c, x)) for x in dir(c)
               if not x.startswith('_')]

if __name__ == '__main__':
    if sys.argv[1:2]:
        readpipe(sys.argv[1])
    else:
        print '*** Testing ***'
        import cStringIO
        readpipe(cStringIO.StringIO('''\
Bi**************@aol.com
Joseph ("Joe")|B*********@acoy.com
"Joe"|B*********@acoy.com

Sa****************@northpole.org
'''))

C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', 'b***@aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', 'Joseph
("Joe")')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 's*****@northpole.org'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', 'b***@aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', 'Joseph
("Joe")')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 's*****@northpole.org'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>

HTH,
John
Mar 24 '07 #9
On 3/23/07, Bjoern Schliessmann
<us**************************@spamgourmet.com> wrote:
(Note that almost everything in Python is an object!)
Could you tell me what in Python isn't an object? Are you counting
old-style classes and instances as "not object"s?

--
Felipe.
Mar 24 '07 #10
On Sat, 24 Mar 2007 18:07:57 -0300, Felipe Almeida Lessa
<fe**********@gmail.com> wrote:
On 3/23/07, Bjoern Schliessmann
<us**************************@spamgourmet.com> wrote:
>(Note that almost everything in Python is an object!)

Could you tell me what in Python isn't an object? Are you counting
old-style classes and instances as "not object"s?
The syntax, for example; an "if" statement is not an object.

--
Gabriel Genellina

Mar 24 '07 #11
Facundo Batista wrote:
"not line" and "len(line) == 0" are the same as long as "line" is a
string.

He's checking OK, because a "blank line" has length 1 (because of its
newline).
Ah, OK. Normally, I strip the read line and then test "if not line".
His check /is/ okay, but IMHO it's a little bit weird.
Unless I understood it wrong, it's just an object that holds the
line inside.
A Python string would technically be the same ;)
Just OO purity, not practicality...
:)

Regards,
Björn

--
BOFH excuse #378:

Operators killed by year 2000 bug bite.

Mar 25 '07 #12
Felipe Almeida Lessa wrote:
Could you tell me what in Python isn't an object?
Difficult ;) All data structures are (CMIIW). Functions and Types
are objects, too.
Are you counting old-style classes and instances as "not object"s?
No, both are.

Regards,
Björn

--
BOFH excuse #366:

ATM cell has no roaming feature turned on, notebooks can't connect

Mar 25 '07 #13
Matt Garman wrote:
I'm trying to use Python to work with large pipe ('|') delimited data
files.
Looks like a job for the csv module (in the standard lib).
The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)
Just for the record, *everything* in Python is an object - so the
problem is not about 'using objects'. Now, of course, a complex object
might eat up more space than a simple one...

Python has 2 simple types for structured data : tuples (like database
rows), and dicts (associative arrays). You can use the csv module to
parse a csv-like format into either tuples or dicts. If you want to save
memory, tuples may be the best choice.
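[A sketch of both options on pipe-delimited input. This assumes a modern Python and invented sample rows; csv.DictReader stands in for the dict variant:]

```python
import csv
import io
import sys

# Invented sample data in the pipe-delimited shape discussed above.
sample = "Biff|Bloggs|biff@example.com\nJoe|Blow|joe@example.com\n"

# Tuples: one compact, fixed-size object per record.
rows = [tuple(r) for r in csv.reader(io.StringIO(sample), delimiter='|')]

# Dicts: named access per field, at the price of a hash table per record.
fields = ['first', 'family', 'email']
dicts = list(csv.DictReader(io.StringIO(sample),
                            fieldnames=fields, delimiter='|'))

# The tuple for a record is smaller than the corresponding dict.
assert sys.getsizeof(rows[0]) < sys.getsizeof(dicts[0])
```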
This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?
What are you doing with your records ? Do you *really* need to keep the
whole list in memory ? Else you can just work line by line:

source = open(sys.argv[1])
for line in source:
    do_something_with(line)
source.close()

This will avoid building a huge in-memory list.

While we're at it, your snippets are definitely unpythonic and
overcomplicated:
(snip)
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
(snip)

filedata = open(sys.argv[1]).readlines()

Example 2: read lines into objects:
# begin readobjects.py
import sys, time
> class FileRecord:
class FileRecord(object):
>     def __init__(self, line):
>         self.line = line
If this is your real code, I don't see any reason why this should eat up
3 times more space than the original version.
records = list()
file = open(sys.argv[1])
while True:
line = file.readline()
if len(line) == 0: break # EOF
rec = FileRecord(line)
records.append(rec)
file.close()
records = map(FileRecord, open(sys.argv[1]).readlines())
Mar 26 '07 #14
Felipe Almeida Lessa wrote:
On 3/23/07, Bjoern Schliessmann
<us**************************@spamgourmet.com> wrote:
>(Note that almost everything in Python is an object!)

Could you tell me what in Python isn't an object?
statements and expressions ?-)

Mar 26 '07 #15
Bruno Desthuilliers wrote:
Matt Garman wrote:
(snip)
> class FileRecord(object):
>     def __init__(self, line):
>         self.line = line

If this is your real code, I don't see any reason why this should eat up
3 times more space than the original version.
Hem... Forget about this comment - not enough caffeine yet, I'm afraid.
Mar 26 '07 #16
On 3/23/07, Bjoern Schliessmann
<us**************************@spamgourmet.com> wrote:
"one blank line" == "EOF"? That's strange. Intended?
In my case, I know my input data doesn't have any blank lines.
However, I'm glad you (and others) clarified the issue, because I
wasn't aware of the better methods for checking for EOF.
Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line

What's this class intended to do?
Store a line :) I just wanted to post two runnable examples. So the
above class's real intention is just to be a (contrived) example.

In the program I actually wrote, my class structure was a bit more
interesting. After storing the input line, I'd call split("|") to
tokenize it. Each token was then assigned to a member variable, and
some of the member variables were converted to ints or floats as well.

My input data had three record types; all had a few common attributes.
So I created a parent class and three child classes.

Also, many folks have suggested operating on only one line at a time
(i.e. not storing the whole data set). Unfortunately, I'm constantly
"looking" forward and backward in the record set while I process the
data (i.e., to process any particular record, I sometimes need to know
the whole contents of the file). (This is purchased proprietary
vendor data that needs to be converted into our own internal format.)

Finally, for what it's worth: the total runtime memory requirement
of my program is roughly 20x the data file size. A 200MB file
literally requires 4GB of RAM to process effectively. Note that, in
addition to the class structure I defined above, I also create two
caches of all the data (two dicts with different keys from the
collection of objects). This is necessary to ensure the program runs
in a semi-reasonable amount of time.
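[One way to cut that 20x factor without giving up the object design. This is a sketch with invented field names and sample rows, not Matt's actual schema; __slots__ drops the per-instance __dict__, and the two caches share references to the same objects rather than copying them:]

```python
class Record(object):
    # __slots__ replaces the per-instance __dict__ with fixed storage,
    # which is usually the largest part of per-object overhead.
    __slots__ = ('rec_id', 'name', 'price')

    def __init__(self, line):
        rec_id, name, price = line.rstrip('\n').split('|')
        self.rec_id = rec_id
        self.name = name
        self.price = float(price)  # convert numeric fields once, up front

lines = ["1001|widget|9.99\n", "1002|gadget|24.50\n"]  # made-up data
records = [Record(line) for line in lines]

# Two lookup caches over the same objects: dict values are references,
# so each record is indexed twice but stored only once.
by_id = {r.rec_id: r for r in records}
by_name = {r.name: r for r in records}
```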

Thanks to all for your input and suggestions. I received many more
responses than I expected!

Matt
Mar 26 '07 #17
Matt Garman wrote:
(snip)
Also, many folks have suggested operating on only one line at a time
(i.e. not storing the whole data set). Unfortunately, I'm constantly
"looking" forward and backward in the record set while I process the
data (i.e., to process any particular record, I sometimes need to know
the whole contents of the file). (This is purchased proprietary
vendor data that needs to be converted into our own internal format.)
Don't know if this could solve your problem, but have you considered
using an intermediate (preferably embedded) SQL database (something
like SQLite)?
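[A minimal sketch of that idea with the standard-library sqlite3 module, using invented column names and sample rows. An in-memory database is shown here; a file-backed one would keep the working set off the Python heap entirely:]

```python
import sqlite3

lines = ["1001|widget|9.99\n", "1002|gadget|24.50\n"]  # made-up records

conn = sqlite3.connect(':memory:')  # or a filename for datasets > RAM
conn.execute("CREATE TABLE records (rec_id TEXT PRIMARY KEY, "
             "name TEXT, price REAL)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)",
                 (line.rstrip('\n').split('|') for line in lines))

# Records can now be looked up "forward and backward" by any column
# via SQL, without keeping one Python object alive per input line.
price = conn.execute("SELECT price FROM records WHERE name = ?",
                     ('gadget',)).fetchone()[0]
```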

Mar 26 '07 #18
Matt Garman wrote:
In my case, I know my input data doesn't have any blank lines.
8)

I work with a (not self-written) perl script that does funny things
with blank lines in input files. Yeah, blank lines "aren't supposed
to" be in the input data ...
However, I'm glad you (and others) clarified the issue, because I
wasn't aware of the better methods for checking for EOF.
The principle was okay (checking whether the string is totally empty).
I've always used readlines so far and didn't have the problem.
Thanks to all for your input and suggestions. I received many
more responses than I expected!
You're welcome. :)

Regards,
Björn

--
BOFH excuse #374:

It's the InterNIC's fault.

Mar 26 '07 #19
