
speed problems

^
Hi group,

I became interested in Python a while ago and just converted a simple
perl script to python. The script is very simple: it generates a list of
viruses found in some maillog files, for further processing.
I've found that there's a huge difference in execution time between the scripts,
in favor of perl, and I can't pinpoint what's going wrong;
perl runs:
0.07 real 0.05 user 0.01 sys
0.07 real 0.05 user 0.01 sys
0.07 real 0.04 user 0.02 sys
python runs:
0.27 real 0.23 user 0.03 sys
0.28 real 0.21 user 0.05 sys
0.27 real 0.19 user 0.06 sys

This was measured with a small uncompressed logfile (1.4M). The difference
grows much bigger whenever it needs to uncompress things.

Here are both scripts, could you please have a look and tell me where I
should look for optimizations?

perl:
my (@maillogs) = (
    "/home/logs/maillog", "/home/logs/maillog.0.gz",
    "/home/logs/maillog.1.gz", "/home/logs/maillog.2.gz",
    "/home/logs/maillog.3.gz",
);

my ($gzip) = "/usr/bin/gzip";
my ($bzip2)= "/usr/bin/bzip2";

my ($total) = 0.0;
my (%virstat);

foreach my $logfile (@maillogs)
{
    if ( -f $logfile )
    {
        # is it compressed?
        if ( $logfile =~ /\.[bg]z2?$/ )
        {
            if ( !open LF, "$gzip -cd $logfile|" )
            {
                open LF, "$bzip2 -cd $logfile|" or
                    die "unable to uncompress '$logfile'\n";
            }
        }
        else
        {
            open LF, "<$logfile" or die "couldn't open '$logfile'\n";
        }

        while (<LF>)
        {
            if (/INFECTED/)
            {
                # we need only the virus name
                $_ =~ s/.*INFECTED.*\((.*)\).*/$1/g;
                # if multiple viruses found
                if (/, /)
                {
                    # split them
                    my (@vir) = split /, /, $_;
                    foreach my $v (@vir)
                    {
                        chomp $v;
                        $virstat{$v}++;
                        $total++;
                    }
                }
                else
                {
                    chomp;
                    $virstat{$_}++;
                    $total++;
                }
            }
        }
        close LF;
    }
#   else
#   {
#       print STDERR "'$logfile' doesn't exist, skipping it.\n";
#   }
}

foreach my $v (sort keys %virstat)
{
    my $p = ($virstat{$v}/$total)*100;
    $p = sprintf "%s:\t%5.2f%%", $v, $p;
    print "$p\n";
}
#--- end of perl script ---

python:
import os
import string
import re

maillogs = [
    "/home/logs/maillog", "/home/logs/maillog.0.gz",
    "/home/logs/maillog.1.gz", "/home/logs/maillog.2.gz",
    "/home/logs/maillog.3.gz"
]
virstat = {}
total = 0.0  # keep this float

for logfile in maillogs:
    if os.path.isfile( logfile ):
        # is it compressed?
        if logfile[-3:] == '.gz':
            import gzip
            lf = gzip.GzipFile( logfile, "r" )
        else:
            if logfile[-4:] == '.bz2':
                import bz2
                lf = bz2.BZ2File( logfile, "r" )
            else:
                # uncompressed
                lf = open( logfile, "r" )

        for line in lf.readlines():
            if string.count( line, "INFECTED" ):
                vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)
                if string.count( vname, ", " ):
                    for vnam in string.split( vname, ", " ):
                        if vnam not in virstat:
                            virstat[vnam] = 1
                        else:
                            virstat[vnam] += 1
                        total += 1
                else:
                    if vname not in virstat:
                        virstat[vname] = 1
                    else:
                        virstat[vname] += 1
                    total += 1
        lf.close()
#    else:
#        print "logfile '%s' doesn't exist, skipping it." % logfile

for vname in virstat.keys():
    p = (virstat[vname]/total)*100
    print "%s: %5.2f%%" % (vname, p)
#--- End of python script ---
Thanks for any help you can provide,
Kind regards,

Axel
Jul 18 '05 #1
14 Replies


On 2004-06-03, ^ <ax**@axel.truedestiny.net> wrote:
Here are both scripts, could you please have a look and tell me where I
should look for optimizations?
Well, I see one major difference and one place I'd do something
differently.

Perl:
my ($gzip) = "/usr/bin/gzip";
my ($bzip2)= "/usr/bin/bzip2";

First off, you're using external programs here for decompression. This is a
trade-off of making a system call vs an internal implementation. Maybe Python's
implementation is slower? I don't know, just pointing out that it is a
difference. Personally when programming tools like this I try to keep
everything internal because I've had endless system calls kill the run-time.
However, with the few files you're iterating over, the cost might be the other
way 'round. :)

Python:
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)


If I read this correctly you're compiling this regex every time you
go through the for loop, so on every line the regex is compiled again. You
might want to compile the regex outside the loop and only use the compiled
version inside the loop.

I *think* that Perl caches compiled regexes, which is why it doesn't have
two different ways of calling the regex, while Python, in giving two different
calls to the regex, will compile it every time if you expressly call for a
compile. Again, just a guess based on how I presume the languages work and
how I'd write them differently.

--
Steve C. Lamb | I'm your priest, I'm your shrink, I'm your
PGP Key: 8B6E99C5 | main connection to the switchboard of souls.
-------------------------------+---------------------------------------------
Jul 18 '05 #2

> First off you're using external programs here for decompression. This is a
> trade off of making a system call vs internal implementation. Maybe Python's
> implementation is slower? I don't know, just pointing out that it is a
> difference. Personally when programming tools like this I try to keep
> everything internal because I've had endless system calls kill the run-time.
> However with the few files you're iterating over the cost might be the other
> way 'round. :)

I'll be looping over these files only, but I thought using python's gzip
module would be faster than spawning gzip itself the way I did in the perl
script.
Python:
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

If I read this correctly you're compiling this regex every time you're
going through the for loop. So every line the regex is compiled again. You
might want to compile the regex outside the loop and only use the compiled
version inside the loop.


Well, only for lines containing 'INFECTED', then. Good point. (I suddenly
remember some C stuff in which it made a huge difference.) I've placed it
outside the loop now, but the times are still the same.

Another difference might be while( <filehandle> ) versus line in lf.readlines().
The latter reads the whole file into memory if I'm not mistaken, whereas the
former will read the file line by line. Why that could make such a difference I
don't know.

Thanks for your quick reply,
Kind regards,

Axel
Jul 18 '05 #3

"^" <ax**@axel.truedestiny.net> writes:
could you please [...] tell me where I should look for
optimizations?


import profile
help(profile)

import hotshot
help(hotshot)

(Teach a man to fish ... and all that :-)
Jul 18 '05 #4


"Jacek Generowicz" <ja**************@cern.ch> wrote in message
news:ty*************@pcepsft001.cern.ch...
"^" <ax**@axel.truedestiny.net> writes:
could you please [...] tell me where I should look for
optimizations?


import profile
help(profile)

import hotshot
help(hotshot)

(Teach a man to fish ... and all that :-)


:-)

That's kewl!
Thanks!

Regards,
Axel
Jul 18 '05 #5

Axel Scheepers wrote:
Another difference might be while( <filehandle>) and line in lf.readlines().
The latter reads the whole file to memory if I'm not mistaken as the former
will read the file line by line. Why that could make such a difference I
don't know.


line in lf also reads the file line by line, if you're using 2.3. In 2.1
or 2.2 you can use xreadlines for that. I don't know if it makes any
difference in performance though, for small files.

--
"Codito ergo sum"
Roel Schroeven
Jul 18 '05 #6

On 2004-06-03, Roel Schroeven <rs****************@fastmail.fm> wrote:
Axel Scheepers wrote:
Another difference might be while( <filehandle>) and line in lf.readlines().
The latter reads the whole file to memory if I'm not mistaken as the former
will read the file line by line. Why that could make such a difference I
don't know.


line in lf also reads the file line by line, if you're using 2.3. In 2.1
or 2.2 you can use xreadlines for that. I don't know if it makes any
difference in performance though, for small files.


I would suspect that reading the entire file at once would yield
slightly better performance for non-huge files.

--
Grant Edwards grante Yow! My ELBOW is a remote
at FRENCH OUTPOST!!
visi.com
Jul 18 '05 #7

On Thursday, 3 June 2004 at 17:04, Axel Scheepers wrote:
Another difference might be while( <filehandle>) and line in lf.readlines().


In Python, while( <FH> ) becomes:

for line in fh:
    <something>

This will truly iterate over the lines of the file, not preload anything into
memory.
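A small illustration of the difference, using an in-memory file object in place of a real logfile:

```python
import io

# any file-like object works the same way as a real open file here
buf = io.StringIO(u"one\ntwo\nthree\n")

# iterating over the file object yields one line at a time,
# instead of building the whole list up front as readlines() does
lines = []
for line in buf:
    lines.append(line.rstrip("\n"))
```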

HTH!

Heiko.

Jul 18 '05 #8

In addition to the items Steve Lamb noted, I have a few suggestions:

Place the whole script in a function and call it. This will give you an
immediate speedup of some percent, because lookup of names that are
local to a function is faster than looking up names at the module level.
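A minimal sketch of that restructuring (the loop body is a placeholder, not the actual script):

```python
def main():
    # virstat and total are now locals: each access is a fast
    # indexed lookup instead of a module-level dict lookup
    virstat = {}
    total = 0.0
    for vnam in ["Worm.A", "Worm.B", "Worm.A"]:
        virstat[vnam] = virstat.get(vnam, 0) + 1
        total += 1
    return virstat, total

if __name__ == "__main__":
    main()
```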
for line in lf.readlines():

Unless the bzip2 or gzip modules don't support it, you should write

for line in lf:

instead. This is likely to improve memory consumption, and may improve
the program speed too.
if string.count( line, "INFECTED" ):
    vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

Unless you arrived at this two-step process through profiling, it's
probably better to write

m = infected_rx.search(line)
if m:
    vname = m.group(1)
    ...

if string.count( vname, ", " ):
    for vnam in string.split( vname, ", " ):
[...]
else:

If there is no ", " in vname, the split will produce a single item anyway.
Also, there's no reason to use the "string" module anymore, as
opposed to string methods. Finally, splitting on single characters is
likely to be optimized, but I'm not sure.

I'd just use

for vnam in vname.split(","):
    vnam = vnam.strip()
    if vnam not in virstat:
        virstat[vnam] = 1
    else:
        virstat[vnam] += 1

You have several alternatives here:

try:
    virstat[vnam] += 1
except KeyError:
    virstat[vnam] = 1

or

virstat[vnam] = virstat.get(vnam, 0) + 1
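Both alternatives produce identical counts; a quick side-by-side sketch:

```python
def count_with_get(names):
    virstat = {}
    for vnam in names:
        virstat[vnam] = virstat.get(vnam, 0) + 1
    return virstat

def count_with_except(names):
    virstat = {}
    for vnam in names:
        try:
            virstat[vnam] += 1
        except KeyError:  # first time we see this name
            virstat[vnam] = 1
    return virstat

names = ["Worm.A", "Worm.B", "Worm.A"]
```

Roughly speaking, the try/except form tends to win when KeyError is rare (most names repeat), while .get() has a steadier cost either way.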

Jeff


Jul 18 '05 #9

Steve Lamb <gr**@despair.dmiyu.org> wrote in
news:sl*****************@dmiyu.org:
Python:
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

If I read this correctly you're compiling this regex every time you're
going through the for loop. So every line the regex is compiled
again. You might want to compile the regex outside the loop and only
use the compiled version inside the loop.

I *think* that Perl caches compiled regexes which is why they don't
have two different ways of calling the regex while Python, in giving two
different calls to the regex, will compile it every time if you
expressly call for a compile. Again, just a guess based on how I
presume the languages work and how I'd write them differently.


No, Python will cache the calls to compile the regex so you won't get much
speed difference unless you have enough different regexes to overflow the
cache. Pulling the compile out of the loop is a good idea on general
principles though.
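Either way, the conventional shape is to bind the compiled pattern once, before the loop; a small sketch:

```python
import re

# compiled once, up front, instead of on every matching line
infected_rx = re.compile(r"INFECTED \((.*)\)")

def virus_fields(lines):
    found = []
    for line in lines:
        m = infected_rx.search(line)
        if m:
            found.append(m.group(1))
    return found
```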

The code you quoted does have one place to optimise: using readlines,
especially on a large file will be *much* slower than just iterating over
the file object directly.

i.e. use

for line in lf:
... whatever ...

Some other things that could be improved (although I suspect the real
problem was calling readlines):

The original code posted uses functions from the string module. Using
string methods instead ought to be faster, e.g. line.count("INFECTED")
instead of string.count(line, "INFECTED").

Use
if logfile.endswith('.gz'):
instead of:
if logfile[-3:] == '.gz':

Use:
if "INFECTED" in line:
instead of calling line.count

I don't understand why the inner loop needs two cases, one for vname
containing a ',' and one where it doesn't. It looks to me as though the
code could just split whether or not there is a comma. If there isn't one,
the split just returns a one-element list holding the original string.
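That is, the comma check is redundant; splitting on a separator that never occurs simply yields the whole string back as a single item:

```python
# multiple viruses: split produces one name per element
assert "Worm.A, Worm.B".split(", ") == ["Worm.A", "Worm.B"]
# single virus, no ", " present: still a list, with one element
assert "Worm.A".split(", ") == ["Worm.A"]
```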

Untested revised code:

INFECTEDPAT = re.compile( "INFECTED \((.*)\)" )
for line in lf:
    if "INFECTED" in line:
        vname = INFECTEDPAT.search(line).group(1)
        for vnam in vname.split(", "):
            if vnam not in virstat:
                virstat[vnam] = 1
            else:
                virstat[vnam] += 1
            total += 1
lf.close()
Jul 18 '05 #10

On Thu, 03 Jun 2004 14:33:58 GMT, "^" <ax**@axel.truedestiny.net>
declaimed the following in comp.lang.python:

for logfile in maillogs:
    if os.path.isfile( logfile ):
        # is it compressed?
        if logfile[-3:] == '.gz':
            import gzip

It's probably only costing a few milliseconds (vs a reload()),
but why not move the import (and the next one too) to the top of the
program, rather than having the interpreter do the internal lookup only
to discover that it may already have imported the module...

            lf = gzip.GzipFile( logfile, "r" )
        else:
            if logfile[-4:] == '.bz2':
                import bz2
-- ================================================== ============ <
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
wu******@dm.net | Bestiaria Support Staff <
================================================== ============ <
Home Page: <http://www.dm.net/~wulfraed/> <
Overflow Page: <http://wlfraed.home.netcom.com/> <

Jul 18 '05 #11

> I've become interested in Python a while ago and just converted a simple
> perl script to python.
> I've found that there's a huge difference in execution time for the scripts,
> in favor of perl and I can't pinpoint what's going wrong;


I had the same problem in dealing with a large (compressed) file using
python vs c++ (using the gzip library to open/read a file): the
results were in favour of python against c++ this time;)

I think the problem is in the flow:

gzip -> file/pipe -> perl
file -> zlib -> python

The decompression through zlib is far slower because it is carried out
using data chunks: there is no way to control where a chunk will
terminate (such as at an EOL).
Try using: zcat <myfile> | myprogram.py
In my case it solved all the puzzling speed problems: zcat
unconditionally dumps the data regardless of where the EOL happens (so
the upper layer does not wait for the next chunk to be
decompressed).
regards,
antonio cavallo
Jul 18 '05 #12

"^" <ax**@axel.truedestiny.net> wrote in message news:<qF****************@amsnews05.chello.com>...
Hi group,

I've become interested in Python a while ago and just converted a simple
perl script to python. The script is very simple, it generates a list of
found viruses from some maillog files for further processing.
I've found that there's a huge difference in execution time for the scripts,
in favor of perl and I can't pinpoint what's going wrong;
. . .
Thanks for any help you can provide,
Kind regards,

Axel


I've halved the python time on my test by changing the entire inner loop
to:

pat = re.compile( "MALWARE:\s+(.*)" )
for line in lf:
    mo = pat.search( line )
    if mo:
        for vnam in mo.group(1).split( ", "):
            virstat[vnam] = virstat.get(vnam,0) + 1
            total += 1
lf.close()

(with changes for my logfile format)

Of course, it's no longer the same as the perl version, and the
perl version would also probably benefit from something similar.
Jul 18 '05 #13

In article <63**************************@posting.google.com >,
nh****@docuweb.ca (Neal Holtz) wrote:

I've halved the python time on my test by changing the entire inner loop
to:

pat = re.compile( "MALWARE:\s+(.*)" )
for line in lf:
    mo = pat.search( line )
    if mo:
        for vnam in mo.group(1).split( ", "):
            virstat[vnam] = virstat.get(vnam,0) + 1
            total += 1
lf.close()


A few random thoughts...

1) How often is mo true? In other words, what percentage of the lines
in the file match the pattern? If it's very high or very low, that
might give you some ideas where to look. Running the profiler will help
at lot too!

2) It's probably a minor tweak, but you could factor out the
"pat.search" name lookup cost by doing something like:

pat = re.compile( "MALWARE:\s+(.*)" )
patSearch = pat.search
for line in lf:
    mo = patSearch(line)
    ....

3) There's a certain amount of duplication of effort going on between
the regex search and the split; two passes over the same data. Is it
possible to write a regex which parses it all in one pass? Of course,
if the answer to question #1 is "very low percentage", then this is
probably not the place to be looking.

4) What does virstat.get do?

Lastly, a general programming style comment. I'm a fan of longish
variable names which describe what they're doing. I had to think a bit
to figure out that vnam probably stands for "virus name". It would have
been easier for me to figure out if you named the variable something
like virusName (or even virus_name, if you're an underscore fan). Same
with the other variable names.
Jul 18 '05 #14

Hans-Peter Jansen <hp*@urpla.net> wrote:
if logfile.endswith('.gz'):
    #ifd, lfd = os.popen2("%s %s" % (gzip, logfile))
    #XXX: cheating
    ifd, lfd = os.popen2("%s %s | grep INFECTED" % (gzip, logfile))
elif logfile.endswith('.bz2'):
    #ifd, lfd = os.popen2("%s %s" % (bzip2, logfile))
    #XXX: cheating
    ifd, lfd = os.popen2("%s %s | grep INFECTED" % (bzip2, logfile))
else:
    # uncompressed
    lfd = open(logfile, "r")


Why stop there? You've stopped just on the verge of collapsing it into the
fully reduced (and regularized) form:

if logfile.endswith('.gz'):
    cat_command = 'zcat'
elif logfile.endswith('.bz2'):
    cat_command = 'bzcat'
else:
    cat_command = 'cat'
ifd, lfd = os.popen2("%s %s | grep INFECTED" % (cat_command, logfile))

(for that matter, is there some reason to use popen2 and the
unnecessary ifd?)

I've found it advantageous to preprocess large inputs with grep - the
tens of MB of squid logs that are skimmed by a useful little CGI script
really benefited from that! Python's raw I/O may be as good as
anything, but for line by line parsing where a majority of the (many)
lines are discarded, a grep prefilter is a big win.

Which may or may not bring us back to... well, it's not a corollary to
Steve Lamb's guideline for using shell script, though it's clearly
related. Maybe it's the contrapositive. Never get so wrapped up in
using that Python hammer that you forget about those carefully honed
specialized tools. Use that one line of shell to the best effect!
<grin>
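For comparison, the pure-Python shape of that grep stage is just a generator that drops uninteresting lines before the more expensive parsing runs; grep wins mainly because its scan happens in compiled code outside the interpreter:

```python
def prefilter(lines, needle="INFECTED"):
    # same job as the grep stage: discard lines that cannot match,
    # so the regex work only ever sees candidate lines
    for line in lines:
        if needle in line:
            yield line

kept = list(prefilter(["a INFECTED (X)", "clean line", "b INFECTED (Y)"]))
```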

--
Anyone who calls economics the dismal science
has never been exposed to educationist theories
at any length. An hour or two is a surfeit.
Jul 18 '05 #15
