Bytes | Software Development & Data Engineering Community
Help with script with performance problems

I have a script to parse a DNS querylog and generate some statistics.
For a 750MB file, a perl script using the same methods (splits) can
parse the file in 3 minutes. My python script takes 25 minutes. That
is enough of a difference that unless I can figure out what I did
wrong, or a better way of doing it, I might not be able to use python
(since most of what I do is parsing various logs). The main reason to
try python is that I had to look at some early scripts I wrote in perl
and had no idea what the hell I was thinking or what the script even
did! After some googling and reading Eric Raymond's essay on python I
jumped in. :) Here is my script. I am looking for constructive
comments - please don't bash my newbie code.

#!/usr/bin/python -u

import string
import sys

clients = {}
queries = {}
count = 0

print "Each dot is 100000 lines..."

f = sys.stdin

while 1:

    line = f.readline()

    if count % 100000 == 0:
        sys.stdout.write(".")

    if line:
        splitline = string.split(line)

        try:
            (month, day, time, stype, source, qtype, query, ctype,
             record) = splitline
        except:
            print "problem splitting line", count
            print line
            break

        try:
            words = string.split(source, '#')
            source = words[0]
        except:
            print "problem splitting source", count
            print line
            break

        if clients.has_key(source):
            clients[source] = clients[source] + 1
        else:
            clients[source] = 1

        if queries.has_key(query):
            queries[query] = queries[query] + 1
        else:
            queries[query] = 1

    else:
        print
        break

    count = count + 1

f.close()

print count, "lines processed"

for numclient, count in clients.items():
    if count > 100000:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.items():
    if count > 100000:
        print "%s,%s" % (numquery, count)
Jul 18 '05 #1
go**********@spacerodent.org (Dennis Roberts) writes:
is enough of a difference that unless I can figure out what I did
wrong or a better way of doing it I might not be able to use python
(since most of what I do is parsing various logs). The main reason to
Isn't parsing logs a batch-oriented thing, where 20 minutes more
wouldn't matter all that much? Log parsing is the home field of Perl,
so python probably can't match its performance there, but other
advantages of Python might make you still want to avoid going back to
Perl. As long as it's 'efficient enough', who cares?
f = sys.stdin
Have you tried using a normal file instead of stdin? BTW, you can
iterate over a file easily with "for line in open("mylog.log"):". ISTR
it's also more efficient than repeated readline() calls, because it
buffers the lines instead of reading them one by one. You can also get
the line numbers with "for linenum, line in enumerate(open("mylog.log")):"
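In modern Python syntax, that file-iteration idiom looks like this (a self-contained sketch; an in-memory file stands in for "mylog.log"):

```python
import io

# Stand-in for open("mylog.log") -- iteration works the same either way.
log = io.StringIO("alpha one\nbeta two\ngamma three\n")

counted = []
for linenum, line in enumerate(log):  # yields (0, first line), (1, second), ...
    counted.append((linenum, line.split()[0]))

print(counted)  # [(0, 'alpha'), (1, 'beta'), (2, 'gamma')]
```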

splitline = string.split(line)
Do not use 'string' module (it's deprecated), use string methods
instead: line.split()
clients[source] = clients[source] + 1


clients[source] += 1

or another way to handle the common 'add 1, might not exist' idiom:
clients[source] = 1 + clients.get(source,0)

See http://aspn.activestate.com/ASPN/Coo...n/Recipe/66516
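The 'add 1, might not exist' idiom can be seen end to end in a tiny sketch (modern print syntax, made-up sample addresses):

```python
sources = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]

clients = {}
for source in sources:
    # One lookup with a default, instead of has_key() plus two lookups.
    clients[source] = 1 + clients.get(source, 0)

print(clients["10.0.0.1"])  # 3
print(clients["10.0.0.2"])  # 1
```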
--
Ville Vainio http://www.students.tut.fi/~vainio24
Jul 18 '05 #2
Hello Dennis,

A general note: Use the "hotshot" module to find where you spend most of your time.
splitline = string.split(line)

My guess is that if you'll use the "re" module things will be much faster.

import re
ws_split = re.compile(r"\s+").split
....
splitline = ws_split(line)
....

HTH.

Miki
Jul 18 '05 #3
mi***@zoran.co.il (Miki Tebeka) wrote in message news:<62*************************@posting.google.com>...
Hello Dennis,

A general note: Use the "hotshot" module to find where you spend most of your time.
splitline = string.split(line)

My guess is that if you'll use the "re" module things will be much faster.

import re
ws_split = re.compile(r"\s+").split
...
splitline = ws_split(line)
...

HTH.

Miki

An alternative in python 2.3 is the timeit module; the following is
extracted from the docs:
import timeit

timer1 = timeit.Timer('unicode("abc")')
timer2 = timeit.Timer('"abc" + u""')

# Run three trials
print timer1.repeat(repeat=3, number=100000)
print timer2.repeat(repeat=3, number=100000)

# On my laptop this outputs:
# [0.36831796169281006, 0.37441694736480713, 0.35304892063140869]
# [0.17574405670166016, 0.18193507194519043, 0.17565798759460449]

Regards Paul Clinch
Jul 18 '05 #4
Ville Vainio <vi********************@spamtut.fi> wrote in message news:<du*************@amadeus.cc.tut.fi>...
f = sys.stdin


Have you tried using a normal file instead of stdin? BTW, you can
iterate over a file easily by "for line in open("mylog.log"):". ISTR
it's also more efficient than readline()'s, because it caches the
lines instead of reading them one by one. You can also get the line
numbers by doing "for linenum, line in enumerate(open("mylog.log")):"


I have a 240207-line sample log file that I test with. The script I
submitted parsed it in 18 seconds. My perl script parsed it in 4
seconds.

The new python script, using a normal file as suggested above, does it
in 3 seconds!

Changed "f = sys.stdin" to "f = open('sample', 'r')".

Thanks Ville!

Note (I made the other changes one at a time as well - the file open
change was the only one that made it faster)
Jul 18 '05 #5
In article <a9**************************@posting.google.com>,
Dennis Roberts <go**********@spacerodent.org> wrote:

I have a script to parse a dns querylog and generate some statistics.
For a 750MB file a perl script using the same methods (splits) can
parse the file in 3 minutes. My python script takes 25 minutes. It
is enough of a difference that unless I can figure out what I did
wrong or a better way of doing it I might not be able to use python
(since most of what I do is parsing various logs). The main reason to
try python is I had to look at some early scripts I wrote in perl and
had no idea what the hell I was thinking or what the script even did!
After some googling and reading Eric Raymonds essay on python I jumped
in:) Here is my script. I am looking for constructive comments -
please don't bash my newbie code.


If you haven't yet, make sure you upgrade to Python 2.3; there are a lot
of speed enhancements. Also, it allows you to switch to idioms that work
more like Perl's:

for line in f:
    fields = line.split()
    ...

Generally speaking, contrary to what another poster suggested, string
methods will almost always be faster than regexes (assuming that a
string method does what you want directly, of course; using multiple
string methods may or may not be faster than regexes).
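That claim is easy to measure with timeit; a small sketch in modern syntax (sample line made up, exact timings vary by machine, but str.split usually wins):

```python
import re
import timeit

line = "Jul 18 12:00:01 client 10.0.0.1#1234 query example.com IN A"

ws_split = re.compile(r"\s+").split

t_method = timeit.timeit(lambda: line.split(), number=10000)
t_regex = timeit.timeit(lambda: ws_split(line), number=10000)

# Both produce the same nine fields for this line.
assert line.split() == ws_split(line)
print("str.split: %.4fs  re.split: %.4fs" % (t_method, t_regex))
```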
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

Weinberg's Second Law: If builders built buildings the way programmers wrote
programs, then the first woodpecker that came along would destroy civilization.
Jul 18 '05 #6
Dennis Roberts wrote:
I have a script to parse a dns querylog and generate some statistics.
For a 750MB file a perl script using the same methods (splits) can
parse the file in 3 minutes. My python script takes 25 minutes. It
is enough of a difference that unless I can figure out what I did
wrong or a better way of doing it I might not be able to use python
(since most of what I do is parsing various logs). The main reason to
try python is I had to look at some early scripts I wrote in perl and
had no idea what the hell I was thinking or what the script even did!
After some googling and reading Eric Raymonds essay on python I jumped
in:) Here is my script. I am looking for constructive comments -
please don't bash my newbie code.


Below is my version of your script. It tries to use more idiomatic Python
and is about 20% faster on some bogus data - but nowhere near closing the
performance gap you claim against the perl script.
However, it took 143 seconds to process 10**7 lines generated by

<makesample.py>
import itertools, sys
sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
thousand = itertools.cycle(range(1000))
hundred = itertools.cycle(range(100))

out = file(sys.argv[1], "w")
try:
    try:
        count = int(sys.argv[2])
    except IndexError:
        count = 10**7
    for i in range(count):
        print >> out, sample % (i, thousand.next(), hundred.next())
finally:
    out.close()
</makesample.py>

with Python 2.3.2 on my 2.6GHz P4. Would that mean Perl would do it in 17
seconds? Anyway, the bottleneck would seem to be your computer rather than
Python :-) - Python should be fast enough for the purpose.

Peter

<parselog.py>
#!/usr/bin/python -u
#Warning, not seriously tested
import sys

#import time
#starttime = time.time()

clients = {}
queries = {}
lineNo = -1

threshold = 100
pointmod = 100000

f = file(sys.argv[1])
try:
    print "Each dot is %d lines..." % pointmod
    for lineNo, line in enumerate(f):
        if lineNo % pointmod == 0:
            sys.stdout.write(".")

        try:
            (month, day, timestr, stype, source, qtype, query, ctype,
             record) = line.split()
        except ValueError:
            raise Exception("problem splitting line %d\n%s" % (lineNo, line))

        source = source.split('#', 1)[0]

        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
finally:
    f.close()

print
print lineNo+1, "lines processed"

for numclient, count in clients.iteritems():
    if count > threshold:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.iteritems():
    if count > threshold:
        print "%s,%s" % (numquery, count)

#print "time:", time.time() - starttime
</parselog.py>
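For comparison, later Python versions shrink the counting pass further still; a sketch of the same logic with collections.Counter (modern print syntax, an in-memory sample in place of sys.argv):

```python
import io
from collections import Counter

# Made-up lines in the same nine-field querylog layout as the thread's data.
sample = io.StringIO(
    "Jul 18 12:00:01 client 10.0.0.1#1234 query example.com IN A\n"
    "Jul 18 12:00:02 client 10.0.0.2#5678 query example.org IN A\n"
    "Jul 18 12:00:03 client 10.0.0.1#1234 query example.com IN MX\n"
)

clients = Counter()
queries = Counter()

for line in sample:
    month, day, timestr, stype, source, qtype, query, ctype, record = line.split()
    clients[source.split("#", 1)[0]] += 1  # strip the "#port" suffix
    queries[query] += 1

print(clients.most_common(1))  # [('10.0.0.1', 2)]
print(queries["example.com"])  # 2
```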
Jul 18 '05 #7
Peter Otten wrote:
However, it took 143 seconds to process 10**7 lines generated by


I just downloaded psycho, oops, keep misspelling the name :-) and it brings
down the time to 92 seconds - almost for free. I must say I'm impressed,
the psycologist(s) did an excellent job.

Peter

#!/usr/bin/python -u
import psyco, sys
psyco.full()

def main():
    clients = {}
    queries = {}
    lineNo = -1

    threshold = 100
    pointmod = 100000

    f = file(sys.argv[1])
    try:
        print "Each dot is %d lines..." % pointmod
        for lineNo, line in enumerate(f):
            if lineNo % pointmod == 0:
                sys.stdout.write(".")

            try:
                (month, day, timestr, stype, source, qtype, query, ctype,
                 record) = line.split()
            except ValueError:
                raise Exception("problem splitting line %d\n%s" % (lineNo, line))

            source = source.split('#', 1)[0]

            clients[source] = clients.get(source, 0) + 1
            queries[query] = queries.get(query, 0) + 1
    finally:
        f.close()

    print
    print lineNo+1, "lines processed"

    for numclient, count in clients.iteritems():
        if count > threshold:
            print "%s,%s" % (numclient, count)

    for numquery, count in queries.iteritems():
        if count > threshold:
            print "%s,%s" % (numquery, count)

import time
starttime = time.time()
main()
print "time:", time.time() - starttime

Jul 18 '05 #8
