
speed problems

^
Hi group,

I became interested in Python a while ago and just converted a simple
perl script to python. The script is very simple: it generates a list of
viruses found in some maillog files, for further processing.
I've found that there's a huge difference in execution time between the scripts,
in favor of perl, and I can't pinpoint what's going wrong;
perl runs:
0.07 real 0.05 user 0.01 sys
0.07 real 0.05 user 0.01 sys
0.07 real 0.04 user 0.02 sys
python runs:
0.27 real 0.23 user 0.03 sys
0.28 real 0.21 user 0.05 sys
0.27 real 0.19 user 0.06 sys

This was measured with a small uncompressed logfile (1.4M). The difference
grows much bigger whenever it needs to uncompress things.

Here are both scripts, could you please have a look and tell me where I
should look for optimizations?

perl:
my (@maillogs) = (
    "/home/logs/maillog", "/home/logs/maillog.0.gz",
    "/home/logs/maillog.1.gz", "/home/logs/maillog.2.gz",
    "/home/logs/maillog.3.gz",
);

my ($gzip) = "/usr/bin/gzip";
my ($bzip2)= "/usr/bin/bzip2";

my ($total) = 0.0;
my (%virstat);

foreach my $logfile (@maillogs)
{
    if ( -f $logfile )
    {
        # is it compressed?
        if ( $logfile =~ /\.[bg]z2?$/ )
        {
            if ( !open LF, "$gzip -cd $logfile|" )
            {
                open LF, "$bzip2 -cd $logfile|" or
                    die "unable to uncompress '$logfile'\n";
            }
        }
        else
        {
            open LF, "<$logfile" or die "couldn't open '$logfile'\n";
        }

        while (<LF>)
        {
            if (/INFECTED/)
            {
                # we need only the virus name
                $_ =~ s/.*INFECTED.*\((.*)\).*/$1/g;
                # if multiple viruses found
                if (/, /)
                {
                    # split them
                    my (@vir) = split /, /, $_;
                    foreach my $v (@vir)
                    {
                        chomp $v;
                        $virstat{$v}++;
                        $total++;
                    }
                }
                else
                {
                    chomp;
                    $virstat{$_}++;
                    $total++;
                }
            }
        }
        close LF;
    }
#   else
#   {
#       print STDERR "'$logfile' doesn't exist, skipping it.\n";
#   }
}

foreach my $v (sort keys %virstat)
{
    my $p = ($virstat{$v}/$total)*100;
    $p = sprintf "%s:\t%5.2f%%", $v, $p;
    print "$p\n";
}
#--- end of perl script ---

python:
import os
import string
import re

maillogs = [
    "/home/logs/maillog", "/home/logs/maillog.0.gz",
    "/home/logs/maillog.1.gz", "/home/logs/maillog.2.gz",
    "/home/logs/maillog.3.gz"
]
virstat = {}
total = 0.0  # keep this float

for logfile in maillogs:
    if os.path.isfile( logfile ):
        # is it compressed?
        if logfile[-3:] == '.gz':
            import gzip
            lf = gzip.GzipFile( logfile, "r" )
        else:
            if logfile[-4:] == '.bz2':
                import bz2
                lf = bz2.BZ2File( logfile, "r" )
            else:
                # uncompressed
                lf = open( logfile, "r" )

        for line in lf.readlines():
            if string.count( line, "INFECTED" ):
                vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)
                if string.count( vname, ", " ):
                    for vnam in string.split( vname, ", " ):
                        if vnam not in virstat:
                            virstat[vnam] = 1
                        else:
                            virstat[vnam] += 1
                        total += 1
                else:
                    if vname not in virstat:
                        virstat[vname] = 1
                    else:
                        virstat[vname] += 1
                    total += 1
        lf.close()
#    else:
#        print "logfile '%s' doesn't exist, skipping it." % logfile

for vname in virstat.keys():
    p = (virstat[vname]/total)*100
    print "%s: %5.2f%%" % (vname, p)
#--- End of python script ---
Thanks for any help you can provide,
Kind regards,

Axel
Jul 18 '05 #1
14 Replies


On 2004-06-03, ^ <ax**@axel.truedestiny.net> wrote:
Here are both scripts, could you please have a look and tell me where I
should look for optimizations?
Well, I see one major difference and one place I'd do something
differently.

Perl:
my ($gzip) = "/usr/bin/gzip";
my ($bzip2)= "/usr/bin/bzip2";

First off, you're using external programs here for decompression. This is a
trade-off of making a system call vs an internal implementation. Maybe Python's
implementation is slower? I don't know, just pointing out that it is a
difference. Personally when programming tools like this I try to keep
everything internal because I've had endless system calls kill the run-time.
However, with the few files you're iterating over, the cost might be the other
way 'round. :)

Python:
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)


If I read this correctly you're compiling this regex every time you
go through the for loop, so on every line the regex is compiled again. You
might want to compile the regex outside the loop and only use the compiled
version inside the loop.

I *think* that Perl caches compiled regexes, which is why it doesn't have
two different ways of calling the regex, while Python, in giving two different
calls to the regex, will compile it every time if you expressly call for a
compile. Again, just a guess based on how I presume the languages work and
how I'd write them differently.

--
Steve C. Lamb | I'm your priest, I'm your shrink, I'm your
PGP Key: 8B6E99C5 | main connection to the switchboard of souls.
-------------------------------+---------------------------------------------
Jul 18 '05 #2

> First off you're using external programs here for decompression. This is a
> trade off of making a system call vs internal implementation. Maybe Python's
> implementation is slower? I don't know, just pointing out that it is a
> difference. Personally when programming tools like this I try to keep
> everything internal because I've had endless system calls kill the run-time.
> However with the few files you're iterating over the cost might be the other
> way 'round. :)

I'll be looping over these files only, but I thought using python's gzip
module would be faster than spawning gzip itself the way I did in the perl
script.
Python:
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

If I read this correctly you're compiling this regex every time you're
going through the for loop. So every line the regex is compiled again. You
might want to compile the regex outside the loop and only use the compiled
version inside the loop.


Well, only for lines containing 'INFECTED', then. Good point. (I suddenly
remember some C stuff in which it made a huge difference.) I've placed it
outside the loop now, but the times are still the same.

Another difference might be while( <filehandle> ) versus line in lf.readlines().
The latter reads the whole file into memory if I'm not mistaken, whereas the
former will read the file line by line. Why that could make such a difference I
don't know.

Thanks for your quick reply,
Kind regards,

Axel
Jul 18 '05 #3

"^" <ax**@axel.truedestiny.net> writes:
could you please [...] tell me where I should look for
optimizations?


import profile
help(profile)

import hotshot
help(hotshot)

(Teach a man to fish ... and all that :-)
Jul 18 '05 #4


"Jacek Generowicz" <ja**************@cern.ch> wrote in message
news:ty*************@pcepsft001.cern.ch...
"^" <ax**@axel.truedestiny.net> writes:
could you please [...] tell me where I should look for
optimizations?


import profile
help(profile)

import hotshot
help(hotshot)

(Teach a man to fish ... and all that :-)


:-)

That's kewl!
Thanks!

Regards,
Axel
Jul 18 '05 #5

Axel Scheepers wrote:
Another difference might be while( <filehandle>) and line in lf.readlines().
The latter reads the whole file to memory if I'm not mistaken as the former
will read the file line by line. Why that could make such a difference I
don't know.


line in lf also reads the file line by line, if you're using 2.3. In 2.1
or 2.2 you can use xreadlines for that. I don't know if it makes any
difference in performance though, for small files.

--
"Codito ergo sum"
Roel Schroeven
Jul 18 '05 #6

On 2004-06-03, Roel Schroeven <rs****************@fastmail.fm> wrote:
Axel Scheepers wrote:
Another difference might be while( <filehandle>) and line in lf.readlines().
The latter reads the whole file to memory if I'm not mistaken as the former
will read the file line by line. Why that could make such a difference I
don't know.


line in lf also reads the file line by line, if you're using 2.3. In 2.1
or 2.2 you can use xreadlines for that. I don't know if it makes any
difference in performance though, for small files.


I would suspect that reading the entire file at once would yield
slightly better performance for non-huge files.

--
Grant Edwards grante Yow! My ELBOW is a remote
at FRENCH OUTPOST!!
visi.com
Jul 18 '05 #7

On Thursday, 3 June 2004 at 17:04, Axel Scheepers wrote:
Another difference might be while( <filehandle>) and line in lf.readlines().


In Python, while( <FH> ) becomes:

for line in fh:
    <something>

This will truly iterate over the lines of the file, not preload anything into
memory.
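A small illustration of the difference, using an in-memory file object in place of a real logfile:

```python
import io

# any file-like object works the same way as a real open file here
buf = io.StringIO(u"one\ntwo\nthree\n")

# iterating over the file object yields one line at a time,
# instead of building the whole list up front as readlines() does
lines = []
for line in buf:
    lines.append(line.rstrip("\n"))
```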

HTH!

Heiko.

Jul 18 '05 #8

In addition to the items Steve Lamb noted, I have a few suggestions:

Place the whole script in a function and call it. This will give you an
immediate speedup of some percent, because lookup of names that are
local to a function is faster than looking up names at the module level.
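A minimal sketch of that restructuring (the loop body is a placeholder, not the actual script):

```python
def main():
    # virstat and total are now locals: each access is a fast
    # indexed lookup instead of a module-level dict lookup
    virstat = {}
    total = 0.0
    for vnam in ["Worm.A", "Worm.B", "Worm.A"]:
        virstat[vnam] = virstat.get(vnam, 0) + 1
        total += 1
    return virstat, total

if __name__ == "__main__":
    main()
```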
for line in lf.readlines():

Unless the bzip2 or gzip modules don't support it, you should write

for line in lf:

instead. This is likely to improve memory consumption, and may improve
the program speed too.
if string.count( line, "INFECTED" ):
    vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

Unless you arrived at this two-step process through profiling, it's
probably better to write

m = infected_rx.search(line)
if m:
    vname = m.group(1)
    ...

if string.count( vname, ", " ):
    for vnam in string.split( vname, ", " ):
[...]
else:

If there is no ", " in vname, the split will produce a single item anyway.
Also, there's no reason to use the "string" module anymore, as
opposed to string methods. Finally, splitting on single characters is
likely to be optimized, but I'm not sure.

I'd just use

for vnam in vname.split(","):
    vnam = vnam.strip()
    if vnam not in virstat:
        virstat[vnam] = 1
    else:
        virstat[vnam] += 1

You have several alternatives here:

try:
    virstat[vnam] += 1
except KeyError:
    virstat[vnam] = 1

or

virstat[vnam] = virstat.get(vnam, 0) + 1
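Both alternatives produce identical counts; a quick side-by-side sketch:

```python
def count_with_get(names):
    virstat = {}
    for vnam in names:
        virstat[vnam] = virstat.get(vnam, 0) + 1
    return virstat

def count_with_except(names):
    virstat = {}
    for vnam in names:
        try:
            virstat[vnam] += 1
        except KeyError:  # first time we see this name
            virstat[vnam] = 1
    return virstat

names = ["Worm.A", "Worm.B", "Worm.A"]
```

Roughly speaking, the try/except form tends to win when KeyError is rare (most names repeat), while .get() has a steadier cost either way.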

Jeff


Jul 18 '05 #9

Steve Lamb <gr**@despair.dmiyu.org> wrote in
news:sl*****************@dmiyu.org:
Python:
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

If I read this correctly you're compiling this regex every time you're
going through the for loop. So every line the regex is compiled
again. You might want to compile the regex outside the loop and only
use the compiled version inside the loop.

I *think* that Perl caches compiled regexes which is why they don't
have two different ways of calling the regex while Python, in giving two
different calls to the regex, will compile it every time if you
expressly call for a compile. Again, just a guess based on how I
presume the languages work and how I'd write them differently.


No, Python will cache the calls to compile the regex so you won't get much
speed difference unless you have enough different regexes to overflow the
cache. Pulling the compile out of the loop is a good idea on general
principles though.
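Either way, the conventional shape is to bind the compiled pattern once, before the loop; a small sketch:

```python
import re

# compiled once, up front, instead of on every matching line
infected_rx = re.compile(r"INFECTED \((.*)\)")

def virus_fields(lines):
    found = []
    for line in lines:
        m = infected_rx.search(line)
        if m:
            found.append(m.group(1))
    return found
```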

The code you quoted does have one place to optimise: using readlines,
especially on a large file will be *much* slower than just iterating over
the file object directly.

i.e. use

for line in lf:
... whatever ...

Some other things that could be improved (although I suspect the real
problem was calling readlines):

The original code posted uses functions from the string module. Using
string methods instead ought to be faster, e.g. line.count("INFECTED")
instead of string.count(line, "INFECTED").

Use
if logfile.endswith('.gz'):
instead of:
if logfile[-3:] == '.gz':

Use:
if "INFECTED" in line:
instead of calling line.count

I don't understand why the inner loop needs two cases, one for vname
containing a ',' and one where it doesn't. It looks to me as though the
code could just split whether or not there is a comma. If there isn't one,
the split just returns a one-element list holding the original string.
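That is, the comma check is redundant; splitting on a separator that never occurs simply yields the whole string back as a single item:

```python
# multiple viruses: split produces one name per element
assert "Worm.A, Worm.B".split(", ") == ["Worm.A", "Worm.B"]
# single virus, no ", " present: still a list, with one element
assert "Worm.A".split(", ") == ["Worm.A"]
```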

Untested revised code:

INFECTEDPAT = re.compile( "INFECTED \((.*)\)" )
for line in lf:
    if "INFECTED" in line:
        vname = INFECTEDPAT.search(line).group(1)
        for vnam in vname.split(", "):
            if vnam not in virstat:
                virstat[vnam] = 1
            else:
                virstat[vnam] += 1
            total += 1
lf.close()
Jul 18 '05 #10

On Thu, 03 Jun 2004 14:33:58 GMT, "^" <ax**@axel.truedestiny.net>
declaimed the following in comp.lang.python:

for logfile in maillogs:
    if os.path.isfile( logfile ):
        # is it compressed?
        if logfile[-3:] == '.gz':
            import gzip

It's probably only costing a few milliseconds (vs a reload()),
but why not move the import (and the next one too) to the top of the
program, rather than having the interpreter do the internal lookup only
to discover that it may already have imported the module...

            lf = gzip.GzipFile( logfile, "r" )
        else:
            if logfile[-4:] == '.bz2':
                import bz2
-- ================================================== ============ <
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
wu******@dm.net | Bestiaria Support Staff <
================================================== ============ <
Home Page: <http://www.dm.net/~wulfraed/> <
Overflow Page: <http://wlfraed.home.netcom.com/> <

Jul 18 '05 #11

> I've become interested in Python a while ago and just converted a simple
> perl script to python.
> I've found that there's a huge difference in execution time for the scripts,
> in favor of perl and I can't pinpoint what's going wrong;


I had the same problem in dealing with a large (compressed) file using
python vs c++ (using the gzip library to open/read a file): the
results were in favour of python against c++ this time;)

I think the problem is in the flow:

gzip -> file/pipe -> perl
file -> zlib -> python

The decompression through zlib is far slower because it is carried out
using data chunks: there is no way to control where a chunk will
terminate (such as at an EOL).
Try using: zcat <myfile> | myprogram.py
In my case it solved all the puzzling speed problems: zcat
unconditionally dumps the data regardless of where the EOL happens (so
the upper layer does not wait for the next chunk to be
decompressed).
regards,
antonio cavallo
Jul 18 '05 #12

"^" <ax**@axel.truedestiny.net> wrote in message news:<qF****************@amsnews05.chello.com>...
Hi group,

I've become interested in Python a while ago and just converted a simple
perl script to python. The script is very simple, it generates a list of
found viruses from some maillog files for further processing.
I've found that there's a huge difference in execution time for the scripts,
in favor of perl and I can't pinpoint what's going wrong;
. . .
Thanks for any help you can provide,
Kind regards,

Axel


I've halved the python time on my test by changing the entire inner loop
to:

pat = re.compile( "MALWARE:\s+(.*)" )
for line in lf:
    mo = pat.search( line )
    if mo:
        for vnam in mo.group(1).split( ", "):
            virstat[vnam] = virstat.get(vnam,0) + 1
            total += 1
lf.close()

(with changes for my logfile format)

Of course, it's no longer the same as the perl version, and the
perl version would also probably benefit from something similar.
Jul 18 '05 #13

In article <63**************************@posting.google.com >,
nh****@docuweb.ca (Neal Holtz) wrote:

I've halved the python time on my test by changing the entire inner loop
to:

pat = re.compile( "MALWARE:\s+(.*)" )
for line in lf:
    mo = pat.search( line )
    if mo:
        for vnam in mo.group(1).split( ", "):
            virstat[vnam] = virstat.get(vnam,0) + 1
            total += 1
lf.close()


A few random thoughts...

1) How often is mo true? In other words, what percentage of the lines
in the file match the pattern? If it's very high or very low, that
might give you some ideas where to look. Running the profiler will help
at lot too!

2) It's probably a minor tweak, but you could factor out the
"pat.search" name lookup cost by doing something like:

pat = re.compile( "MALWARE:\s+(.*)" )
patSearch = pat.search
for line in lf:
    mo = patSearch(line)
    ....

3) There's a certain amount of duplication of effort going on between
the regex search and the split; two passes over the same data. Is it
possible to write a regex which parses it all in one pass? Of course,
if the answer to question #1 is "very low percentage", then this is
probably not the place to be looking.

4) What does virstat.get do?

Lastly, a general programming style comment. I'm a fan of longish
variable names which describe what they're doing. I had to think a bit
to figure out that vnam probably stands for "virus name". It would have
been easier for me to figure out if you named the variable something
like virusName (or even virus_name, if you're an underscore fan). Same
with the other variable names.
Jul 18 '05 #14

Hans-Peter Jansen <hp*@urpla.net> wrote:
if logfile.endswith('.gz'):
    #ifd, lfd = os.popen2("%s %s" % (gzip, logfile))
    #XXX: cheating
    ifd, lfd = os.popen2("%s %s | grep INFECTED" % (gzip, logfile))
elif logfile.endswith('.bz2'):
    #ifd, lfd = os.popen2("%s %s" % (bzip2, logfile))
    #XXX: cheating
    ifd, lfd = os.popen2("%s %s | grep INFECTED" % (bzip2, logfile))
else:
    # uncompressed
    lfd = open(logfile, "r")


Why stop there? You've stopped just on the verge of collapsing it into the
fully reduced (and regularized) form:

if logfile.endswith('.gz'):
    cat_command = 'zcat'
elif logfile.endswith('.bz2'):
    cat_command = 'bzcat'
else:
    cat_command = 'cat'
ifd, lfd = os.popen2("%s %s | grep INFECTED" % (cat_command, logfile))

(for that matter, is there some reason to use popen2 and the
unnecessary ifd?)

I've found it advantageous to preprocess large inputs with grep - the
tens of MB of squid logs that are skimmed by a useful little CGI script
really benefited from that! Python's raw I/O may be as good as
anything, but for line by line parsing where a majority of the (many)
lines are discarded, a grep prefilter is a big win.

Which may or may not bring us back to... well, it's not a corollary to
Steve Lamb's guideline for using shell script, though it's clearly
related. Maybe it's the contrapositive. Never get so wrapped up in
using that Python hammer that you forget about those carefully honed
specialized tools. Use that one line of shell to the best effect!
<grin>
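For comparison, the pure-Python shape of that grep stage is just a generator that drops uninteresting lines before the more expensive parsing runs; grep wins mainly because its scan happens in compiled code outside the interpreter:

```python
def prefilter(lines, needle="INFECTED"):
    # same job as the grep stage: discard lines that cannot match,
    # so the regex work only ever sees candidate lines
    for line in lines:
        if needle in line:
            yield line

kept = list(prefilter(["a INFECTED (X)", "clean line", "b INFECTED (Y)"]))
```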

--
Anyone who calls economics the dismal science
has never been exposed to educationist theories
at any length. An hour or two is a surfeit.
Jul 18 '05 #15
