On Wed, 12 Dec 2007 16:44:01 -0800, igor.tatarinov wrote:
Here is some of my code. Tell me what's wrong with it :)
def loadFile(inputFile, loader):
# .zip files don't work with zlib
Pardon?
f = popen('zcat ' + inputFile)
for line in f:
loader.handleLine(line)
Do you really need to compress the file? Five million lines isn't a lot.
It depends on the length of each line, naturally, but I'd be surprised if
it were more than 100MB.
...
In Loader class:
def handleLine(self, line):
# filter out 'wrong' lines
if not self._dataFormat(line): return
Who knows what the _dataFormat() method does? How complicated is it? Why
is it a private method?
# add a new output record
rec = self.result.addRecord()
Who knows what this does? How complicated it is?
for col in self._dataFormat.colFormats:
Hmmm... a moment ago, _dataFormat seemed to be a method, or at least a
callable. Now it has grown a colFormats attribute. Complicated and
confusing.
value = parseValue(line, col)
rec[col.attr] = value
And here is parseValue (will using a hash-based dispatch make it much
faster?):
Possibly, but not enough to reduce 20 minutes to one or two.
But you know something? Your code looks like a bad case of over-
generalisation. I assume it's a translation of your C++ code -- no wonder
it takes an entire minute to process the file! (Oh lord, did I just say
that???) Object-oriented programming is a useful tool, but sometimes you
don't need a HyperDispatcherLoaderManagerCreator, you just need a hammer.
In your earlier post, you gave the data specification:
"The text file has a fixed format (like a punchcard). The columns contain
integer, real, and date values. The output files are the same values in
binary."
Easy-peasy. First, some test data:
fp = open('BIG', 'w')
for i in xrange(5000000):
anInt = i % 3000
aBool = ['TRUE', 'YES', '1', 'Y', 'ON',
'FALSE', 'NO', '0', 'N', 'OFF'][i % 10]
aFloat = ['1.12', '-3.14', '0.0', '7.42'][i % 4]
fp.write('%s %s %s\n' % (anInt, aBool, aFloat))
if i % 45000 == 0:
# Write a comment and a blank line.
fp.write('# this is a comment\n \n')
fp.close()
Now let's process it:
import struct
# Define converters for each type of value to binary.
def fromBool(s):
"""String to boolean byte."""
s = s.upper()
if s in ('TRUE', 'YES', '1', 'Y', 'ON'):
return struct.pack('b', True)
elif s in ('FALSE', 'NO', '0', 'N', 'OFF'):
return struct.pack('b', False)
else:
raise ValueError('not a valid boolean')
def fromInt(s):
"""String to integer bytes."""
return struct.pack('l', int(s))
def fromFloat(s):
"""String to floating point bytes."""
return struct.pack('f', float(s))
# Assume three fields...
DEFAULT_FORMAT = [fromInt, fromBool, fromFloat]
# And three files...
OUTPUT_FILES = ['ints.out', 'bools.out', 'floats.out']
def process_line(s, format=DEFAULT_FORMAT):
s = s.strip()
fields = s.split() # I assume the fields are whitespace separated
assert len(fields) == len(format)
return [f(x) for (x, f) in zip(fields, format)]
def process_file(infile, outfiles=OUTPUT_FILES):
out = [open(f, 'wb') for f in outfiles]
for line in file(infile, 'r'):
# ignore leading/trailing whitespace and comments
line = line.strip()
if line and not line.startswith('#'):
fields = process_line(line)
# now write the fields to the files
for x, fp in zip(fields, out):
fp.write(x)
for f in out:
f.close()
And now let's use it and see how long it takes:
>>import time
s = time.time(); process_file('BIG'); time.time() - s
129.58465385437012
Naturally if your converters are more complex (e.g. date-time), or if you
have more fields, it will take longer to process, but then I've made no
effort at all to optimize the code.
--
Steven.