Help with script with performance problems

Dennis Roberts

I have a script to parse a DNS query log and generate some statistics.
For a 750MB file, a Perl script using the same methods (splits) can
parse the file in 3 minutes. My Python script takes 25 minutes. It
is enough of a difference that unless I can figure out what I did
wrong, or find a better way of doing it, I might not be able to use Python
(since most of what I do is parsing various logs). The main reason to
try Python is that I had to look at some early scripts I wrote in Perl and
had no idea what the hell I was thinking or what the script even did!
After some googling and reading Eric Raymond's essay on Python I jumped
in :) Here is my script. I am looking for constructive comments;
please don't bash my newbie code.

#!/usr/bin/python -u

import string
import sys

clients = {}
queries = {}
count = 0

print "Each dot is 100000 lines..."

f = sys.stdin

while 1:

    line = f.readline()

    if count % 100000 == 0:
        sys.stdout.write(".")

    if line:
        splitline = string.split(line)

        try:
            (month, day, time, stype, source, qtype, query, ctype,
             record) = splitline
        except:
            print "problem splitting line", count
            print line
            break

        try:
            words = string.split(source, '#')
            source = words[0]
        except:
            print "problem splitting source", count
            print line
            break

        if clients.has_key(source):
            clients[source] = clients[source] + 1
        else:
            clients[source] = 1

        if queries.has_key(query):
            queries[query] = queries[query] + 1
        else:
            queries[query] = 1

    else:
        print
        break

    count = count + 1

f.close()

print count, "lines processed"

for numclient, count in clients.items():
    if count > 100000:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.items():
    if count > 100000:
        print "%s,%s" % (numquery, count)

Jul 18 '05 #1


Ville Vainio

go**********@spacerodent.org (Dennis Roberts) writes:

> is enough of a difference that unless I can figure out what I did
> wrong or a better way of doing it I might not be able to use python
> (since most of what I do is parsing various logs). The main reason to

Isn't parsing logs a batch-oriented thing, where 20 minutes more
wouldn't matter all that much? Log parsing is the home field of Perl,
so Python probably can't match its performance there, but other
advantages of Python might make you still want to avoid going back to
Perl. As long as it's 'efficient enough', who cares?

> f = sys.stdin

Have you tried using a normal file instead of stdin? BTW, you can
iterate over a file easily by "for line in open("mylog.log"):". ISTR
it's also more efficient than calling readline(), because it caches the
lines instead of reading them one by one. You can also get the line
numbers by doing "for linenum, line in enumerate(open("mylog.log")):"

> splitline = string.split(line)

Do not use the 'string' module (it's deprecated); use string methods
instead: line.split()

> clients[source] = clients[source] + 1

clients[source] += 1

or another way to handle the common 'add 1, might not exist' idiom:

clients[source] = 1 + clients.get(source, 0)

See http://aspn.activestate.com/ASPN/Coo...n/Recipe/66516
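
Put together, the suggestions above might look like the following sketch. It is written for present-day Python 3 rather than the Python 2 used in the thread, and `count_sources` and the sample lines are hypothetical, made up only to show file-style iteration with enumerate(), the split() string method, and the get() counting idiom.

```python
def count_sources(lines):
    """Count how often each client address appears in an iterable of log lines."""
    clients = {}
    for linenum, line in enumerate(lines):
        fields = line.split()  # string method instead of string.split()
        if len(fields) != 9:
            raise ValueError("problem splitting line %d: %r" % (linenum, line))
        source = fields[4].split("#", 1)[0]  # drop the "#port" suffix
        # 'add 1, might not exist' idiom: get() supplies 0 for unseen keys
        clients[source] = clients.get(source, 0) + 1
    return clients


# Hypothetical sample in the nine-field layout the original script expects.
sample = [
    "Jul 18 12:00:00 client 10.0.0.1#1053 query: example.com IN A",
    "Jul 18 12:00:01 client 10.0.0.2#1054 query: example.org IN A",
    "Jul 18 12:00:02 client 10.0.0.1#1055 query: example.com IN MX",
]
print(count_sources(sample))
```

An open file object can be passed in place of the list, since files iterate line by line.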
--
Ville Vainio http://www.students.tut.fi/~vainio24

Jul 18 '05 #2

Miki Tebeka

Hello Dennis,

A general note: Use the "hotshot" module to find where you spend most of your time.

> splitline = string.split(line)

My guess is that if you use the "re" module things will be much faster.

import re
ws_split = re.compile(r"\s+").split
....
splitline = ws_split(line)
....
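
For anyone trying the profiling advice today: hotshot has since been removed from the standard library (it is gone in Python 3), so the sketch below swaps in cProfile, which is a different profiler serving the same purpose. `split_lines` and the repeated sample line are hypothetical stand-ins for the real parsing loop.

```python
import cProfile
import io
import pstats


def split_lines(lines):
    """Stand-in for the parsing loop: split every line on whitespace."""
    return [line.split() for line in lines]


# Hypothetical workload: the same made-up line repeated many times.
lines = ["Jul 18 12:00:00 client 10.0.0.1#1053 query: example.com IN A"] * 10000

profiler = cProfile.Profile()
profiler.enable()
result = split_lines(lines)
profiler.disable()

# Render up to five entries, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists each function with its call count and time, which shows whether the time goes into splitting, dictionary updates, or reading the input.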

HTH.

Miki

Jul 18 '05 #3

Paul Clinch

mi***@zoran.co.il (Miki Tebeka) wrote in message news:<62*************************@posting.google.com>...

> Hello Dennis,
>
> A general note: Use the "hotshot" module to find where you spend most of your time.
>
> > splitline = string.split(line)
>
> My guess is that if you use the "re" module things will be much faster.
>
> import re
> ws_split = re.compile(r"\s+").split
> ...
> splitline = ws_split(line)
> ...
>
> HTH.
>
> Miki

An alternative in Python 2.3 is the timeit module; the following is
extracted from the docs:

import timeit

timer1 = timeit.Timer('unicode("abc")')
timer2 = timeit.Timer('"abc" + u""')

# Run three trials
print timer1.repeat(repeat=3, number=100000)
print timer2.repeat(repeat=3, number=100000)

# On my laptop this outputs:
# [0.36831796169281006, 0.37441694736480713, 0.35304892063140869]
# [0.17574405670166016, 0.18193507194519043, 0.17565798759460449]

Regards Paul Clinch

Jul 18 '05 #4

Dennis Roberts

Ville Vainio <vi********************@spamtut.fi> wrote in message news:<du*************@amadeus.cc.tut.fi>...

> > f = sys.stdin
>
> Have you tried using a normal file instead of stdin? BTW, you can
> iterate over a file easily by "for line in open("mylog.log"):". ISTR
> it's also more efficient than readline()'s, because it caches the
> lines instead of reading them one by one. You can also get the line
> numbers by doing "for linenum, line in enumerate(open("mylog.log")):"

I have a 240207-line sample log file that I test with. The script I
submitted parsed it in 18 seconds. My Perl script parsed it in 4
seconds.

The new Python script, using a normal file as suggested above, does it
in 3 seconds!

Changed "f = sys.stdin" to "f = open('sample', 'r')".

Thanks Ville!

Note: I made the other changes one at a time as well. The file open
change was the only one that made it faster.
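
A sketch of the change described above, in present-day Python 3: read from a named file when one is given and fall back to standard input otherwise. `open_log` is a hypothetical helper, not something from the thread. One plausible reason the stdin version was so slow, though nobody in the thread confirms it, is the -u switch on the #! line, which in Python 2 forces stdin to be unbuffered.

```python
import os
import sys
import tempfile


def open_log(path=None):
    """Return an iterable of lines: the named file, or stdin when path is None."""
    if path is None:
        return sys.stdin
    return open(path, "r")


# Demonstration with a throwaway file standing in for the real query log.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")

with open_log(tmp.name) as f:
    line_count = sum(1 for _ in f)  # iterating a file reads it in buffered chunks
os.remove(tmp.name)

print(line_count, "lines processed")
```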

Jul 18 '05 #5

Aahz

In article <a9**************************@posting.google.com>,
Dennis Roberts <go**********@spacerodent.org> wrote:

> I have a script to parse a DNS query log and generate some statistics.
> For a 750MB file, a Perl script using the same methods (splits) can
> parse the file in 3 minutes. My Python script takes 25 minutes. It
> is enough of a difference that unless I can figure out what I did
> wrong, or find a better way of doing it, I might not be able to use Python
> (since most of what I do is parsing various logs). The main reason to
> try Python is that I had to look at some early scripts I wrote in Perl and
> had no idea what the hell I was thinking or what the script even did!
> After some googling and reading Eric Raymond's essay on Python I jumped
> in :) Here is my script. I am looking for constructive comments;
> please don't bash my newbie code.

If you haven't yet, make sure you upgrade to Python 2.3; there are a lot
of speed enhancements. Also, it allows you to switch to idioms that work
more like Perl's:

for line in f:
    fields = line.split()
    ...

Generally speaking, contrary to what another poster suggested, string
methods will almost always be faster than regexes (assuming that a
string method does what you want directly, of course; using multiple
string methods may or may not be faster than regexes).
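
The claim is easy to check on your own machine. The sketch below, in present-day Python 3, times the split() string method against a precompiled regex split on one made-up log line; the sample line is an assumption, and the timings will differ from system to system.

```python
import re
import timeit

# Hypothetical log line in the nine-field layout used in this thread.
line = "Jul 18 12:00:00 client 10.0.0.1#1053 query: example.com IN A"
ws_split = re.compile(r"\s+").split

# Both give the same fields for a line without leading or trailing whitespace.
same_fields = line.split() == ws_split(line)

# Best of three trials, 100000 calls each.
method_time = min(timeit.repeat(lambda: line.split(), repeat=3, number=100000))
regex_time = min(timeit.repeat(lambda: ws_split(line), repeat=3, number=100000))

print("str.split: %.3f s" % method_time)
print("re.split:  %.3f s" % regex_time)
```

Note that the two are not drop-in replacements: on a line that still ends in a newline, the regex split returns an extra empty string at the end, while split() with no arguments does not.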
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

Weinberg's Second Law: If builders built buildings the way programmers wrote
programs, then the first woodpecker that came along would destroy civilization.

Jul 18 '05 #6

Peter Otten

Dennis Roberts wrote:

> I have a script to parse a DNS query log and generate some statistics.
> For a 750MB file, a Perl script using the same methods (splits) can
> parse the file in 3 minutes. My Python script takes 25 minutes. It
> is enough of a difference that unless I can figure out what I did
> wrong, or find a better way of doing it, I might not be able to use Python
> (since most of what I do is parsing various logs). The main reason to
> try Python is that I had to look at some early scripts I wrote in Perl and
> had no idea what the hell I was thinking or what the script even did!
> After some googling and reading Eric Raymond's essay on Python I jumped
> in :) Here is my script. I am looking for constructive comments;
> please don't bash my newbie code.

Below is my version of your script. It tries to use more idiomatic Python
and is about 20% faster on some bogus data, but that comes nowhere near
closing the performance gap to the Perl script that you report.
However, it took 143 seconds to process 10**7 lines generated by

<makesample.py>
import itertools, sys
sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
thousand = itertools.cycle(range(1000))
hundred = itertools.cycle(range(100))

out = file(sys.argv[1], "w")
try:
    try:
        count = int(sys.argv[2])
    except IndexError:
        count = 10**7
    for i in range(count):
        print >> out, sample % (i, thousand.next(), hundred.next())
finally:
    out.close()
</makesample.py>

with Python 2.3.2 on my 2.6GHz P4. Would that mean Perl would do it in 17
seconds? Anyway, the performance problem would rather be your computer :-),
Python should be fast enough for the purpose.
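
The 17-second figure is presumably Peter's 143 seconds scaled by the 3:25 ratio Dennis reported between Perl and Python. That reading is an assumption, but the arithmetic works out:

```python
perl_minutes = 3            # Dennis's reported Perl time for the 750MB file
python_minutes = 25         # Dennis's reported Python time for the same file
peter_python_seconds = 143  # Peter's Python time for 10**7 generated lines

# Scale Peter's Python timing by the reported Perl/Python ratio.
estimated_perl_seconds = peter_python_seconds * perl_minutes / python_minutes
print(round(estimated_perl_seconds))
```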

Peter

<parselog.py>
#!/usr/bin/python -u
#Warning, not seriously tested
import sys

#import time
#starttime = time.time()

clients = {}
queries = {}
lineNo = -1

threshold = 100
pointmod = 100000

f = file(sys.argv[1])
try:
    print "Each dot is %d lines..." % pointmod
    for lineNo, line in enumerate(f):
        if lineNo % pointmod == 0:
            sys.stdout.write(".")

        try:
            (month, day, timestr, stype, source, qtype, query, ctype,
             record) = line.split()
        except ValueError:
            raise Exception("problem splitting line %d\n%s" % (lineNo,
                                                               line))

        source = source.split('#', 1)[0]

        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
finally:
    f.close()

print
print lineNo+1, "lines processed"

for numclient, count in clients.iteritems():
    if count > threshold:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.iteritems():
    if count > threshold:
        print "%s,%s" % (numquery, count)

#print "time:", time.time() - starttime
</parselog.py>
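
For readers on present-day Python 3, where file(), iteritems() and the print statement no longer exist, the same counting logic could be sketched as below. This is not Peter's code: `tally` and `over_threshold` are hypothetical names, and collections.Counter is swapped in for the plain dicts with get().

```python
from collections import Counter


def tally(lines):
    """Return (line_count, clients, queries) for an iterable of log lines."""
    clients = Counter()
    queries = Counter()
    line_no = -1
    for line_no, line in enumerate(lines):
        try:
            (month, day, timestr, stype, source,
             qtype, query, ctype, record) = line.split()
        except ValueError:
            raise ValueError("problem splitting line %d\n%s" % (line_no, line))
        clients[source.split("#", 1)[0]] += 1  # Counter defaults missing keys to 0
        queries[query] += 1
    return line_no + 1, clients, queries


def over_threshold(counter, threshold):
    """Yield "name,count" strings for entries seen more than threshold times."""
    for name, count in counter.items():
        if count > threshold:
            yield "%s,%s" % (name, count)


# Hypothetical data shaped like the output of makesample.py.
sample = ["%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
          % (i, i % 2, i % 3) for i in range(6)]
total, clients, queries = tally(sample)
print(total, "lines processed")
print(sorted(over_threshold(clients, 2)))
```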

Jul 18 '05 #7

Peter Otten

Peter Otten wrote:

> However, it took 143 seconds to process 10**7 lines generated by

I just downloaded psycho, oops, keep misspelling the name :-) and it brings
down the time to 92 seconds - almost for free. I must say I'm impressed,
the psycologist(s) did an excellent job.

Peter

#!/usr/bin/python -u
import psyco, sys
psyco.full()

def main():
    clients = {}
    queries = {}
    lineNo = -1

    threshold = 100
    pointmod = 100000

    f = file(sys.argv[1])
    try:
        print "Each dot is %d lines..." % pointmod
        for lineNo, line in enumerate(f):
            if lineNo % pointmod == 0:
                sys.stdout.write(".")

            try:
                (month, day, timestr, stype, source, qtype, query, ctype,
                 record) = line.split()
            except ValueError:
                raise Exception("problem splitting line %d\n%s" % (lineNo,
                                                                   line))

            source = source.split('#', 1)[0]

            clients[source] = clients.get(source, 0) + 1
            queries[query] = queries.get(query, 0) + 1
    finally:
        f.close()

    print
    print lineNo+1, "lines processed"

    for numclient, count in clients.iteritems():
        if count > threshold:
            print "%s,%s" % (numclient, count)

    for numquery, count in queries.iteritems():
        if count > threshold:
            print "%s,%s" % (numquery, count)

import time
starttime = time.time()
main()
print "time:", time.time() - starttime

Jul 18 '05 #8
