On 24/03/2007 8:11 AM, Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files. The files range in size from 25 MB to 200 MB.
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record.
An object with only 1 attribute and no useful methods seems a little
pointless; I presume you will elaborate it later.
> However, it seems that doing this
> causes the memory overhead to go up two or three times.
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2. (Memory usage is
> checked using top.)
> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
> Is this "just the way it is" or am I overlooking something obvious?
> Thanks,
> Matt
> Example 1: read lines into list:
> # begin readlines.py
Interesting name for the file :-)
How about using the file.readlines() method?
Why do you want all 200 MB in memory at once anyway?
> import sys, time
> filedata = list()
> file = open(sys.argv[1])
You have just clobbered the builtin file() function/type. In this case
it doesn't matter, but you should lose the habit, quickly.
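To make the clobbering concrete: once a builtin's name is rebound, the builtin is unreachable by that name for the rest of the module. A minimal sketch (the `file` builtin is gone in Python 3, so `list` stands in here to show the same trap):

```python
def demo():
    # Rebinding the builtin name `list` shadows the builtin list() type,
    # just as `file = open(...)` shadows the builtin file() in Python 2.
    list = [1, 2, 3]
    try:
        list("abc")  # raises TypeError: 'list' object is not callable
        return False
    except TypeError:
        return True

assert demo()
```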
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     filedata.append(line)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
How about using raw_input('Hit the Any key...') ?
> # end readlines.py
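As an aside, the readline-until-empty loop can be replaced by iterating over the file object itself, which yields one line at a time. A sketch, modernized to run on current Pythons (an in-memory file stands in for `sys.argv[1]`):

```python
import io

def read_lines(f):
    # Iterating a file object is lazy: each line is produced on demand,
    # so nothing forces all 200 MB into memory unless you keep it.
    return [line for line in f]

# usage sketch with an in-memory file standing in for a real one
data = io.StringIO("a|b|c\nd|e|f\n")
lines = read_lines(data)
assert lines == ["a|b|c\n", "d|e|f\n"]
```

If you only need one record at a time, drop the list and process each line inside the loop instead of accumulating.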
> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:
>     def __init__(self, line):
>         self.line = line
> records = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     rec = FileRecord(line)
>     records.append(rec)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
> # end readobjects.py
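To see where the 2-3x blowup comes from: each FileRecord instance carries its own attribute dict on top of the line it wraps, and that overhead is paid once per record. A rough sketch using sys.getsizeof (added in Python 2.6, so this is a modernized illustration the 2.3/2.4 interpreters above could not run; the class mirrors the one quoted):

```python
import sys

class FileRecord(object):
    def __init__(self, line):
        self.line = line

class SlottedRecord(object):
    __slots__ = ['line']  # no per-instance __dict__
    def __init__(self, line):
        self.line = line

line = 'field1|field2|field3\n'
rec = FileRecord(line)
# instance plus its attribute dict, versus a slotted instance
boxed = sys.getsizeof(rec) + sys.getsizeof(rec.__dict__)
slotted = sys.getsizeof(SlottedRecord(line))

# The dict overhead dominates for short lines, which is what
# multiplies memory usage across millions of records.
assert boxed > slotted
```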
After all that, you still need to split the lines into the multiple
fields that one would expect in a record.
A possibly faster alternative to reading lines (however fast your line
reader) and calling line.split('|') is to use the csv module, as in the
following example, which also shows one way of making an object out of a
row of data.
C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
    __slots__ = ['first', 'family', 'email']
    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def readpipe(fname):
    if hasattr(fname, 'read'):
        f = fname
    else:
        f = open(fname, 'rb')
        # 'b' is in case you'd like your script to be portable
    reader = csv.reader(
        f,
        delimiter='|',
        quoting=csv.QUOTE_NONE,
        # Set quotechar to a char that you don't expect in your data,
        # e.g. the ASCII control char BEL (0x07). This is necessary
        # for Python 2.3, whose csv module used the quoting arg only
        # when writing; otherwise your " characters may get stripped off.
        quotechar='\x07',
        skipinitialspace=True,
        )
    for row in reader:
        if row == ['']: # blank line
            continue
        c = Contacts(row)
        # do something useful with c, e.g.
        print [(x, getattr(c, x)) for x in dir(c)
               if not x.startswith('_')]

if __name__ == '__main__':
    if sys.argv[1:2]:
        readpipe(sys.argv[1])
    else:
        print '*** Testing ***'
        import cStringIO
        readpipe(cStringIO.StringIO('''\
Bi**************@aol.com
Joseph ("Joe")|B*********@acoy.com
"Joe"|B*********@acoy.com
Sa****************@northpole.org
'''))
C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', 'b***@aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 's*****@northpole.org'), ('family', 'Claus'), ('first', 'Santa')]
C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', 'b***@aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'j****@acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 's*****@northpole.org'), ('family', 'Claus'), ('first', 'Santa')]
C:\junk>
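For completeness: on current Pythons the same pipe-delimited parse can be written as a generator, so records are produced one at a time instead of being accumulated in a list. A sketch carrying over the field names from the example above; the sample names and example.com addresses are made up for illustration:

```python
import csv, io

FIELDS = ('first', 'family', 'email')

def iter_records(f):
    # Same csv.reader settings as the example above: pipe delimiter,
    # no quoting, and an unlikely quotechar (BEL) for old Pythons.
    reader = csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE,
                        quotechar='\x07', skipinitialspace=True)
    for row in reader:
        if not row or row == ['']:  # skip blank lines
            continue
        # yielding keeps only one record alive at a time
        yield dict(zip(FIELDS, row))

# usage sketch with made-up data
sample = io.StringIO('Biff|Bloggs|biff@example.com\n'
                     '\n'
                     'Joe|Blow|joe@example.com\n')
records = list(iter_records(sample))
assert len(records) == 2
assert records[0]['family'] == 'Bloggs'
```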
HTH,
John