I have a script to parse a dns querylog and generate some statistics.
For a 750MB file a perl script using the same methods (splits) can
parse the file in 3 minutes. My python script takes 25 minutes. It
is enough of a difference that unless I can figure out what I did
wrong or a better way of doing it I might not be able to use python
(since most of what I do is parsing various logs). The main reason to
try python is I had to look at some early scripts I wrote in perl and
had no idea what the hell I was thinking or what the script even did!
After some googling and reading Eric Raymond's essay on Python I jumped
in :) Here is my script. I am looking for constructive comments -
please don't bash my newbie code.
#!/usr/bin/python -u
import string
import sys
clients = {}
queries = {}
count = 0
print "Each dot is 100000 lines..."
f = sys.stdin
while 1:
    line = f.readline()
    if count % 100000 == 0:
        sys.stdout.write(".")
    if line:
        splitline = string.split(line)
        try:
            (month, day, time, stype, source, qtype, query, ctype,
             record) = splitline
        except:
            print "problem spliting line", count
            print line
            break
        try:
            words = string.split(source,'#')
            source = words[0]
        except:
            print "problem splitting source", count
            print line
            break
        if clients.has_key(source):
            clients[source] = clients[source] + 1
        else:
            clients[source] = 1
        if queries.has_key(query):
            queries[query] = queries[query] + 1
        else:
            queries[query] = 1
    else:
        print
        break
    count = count + 1
f.close()
print count, "lines processed"
for numclient, count in clients.items():
    if count > 100000:
        print "%s,%s" % (numclient, count)
for numquery, count in queries.items():
    if count > 100000:
        print "%s,%s" % (numquery, count)

go**********@spacerodent.org (Dennis Roberts) writes:
> is enough of a difference that unless I can figure out what I did
> wrong or a better way of doing it I might not be able to use python
> (since most of what I do is parsing various logs). The main reason to

Isn't parsing logs a batch-oriented thing, where 20 minutes more
wouldn't matter all that much? Log parsing is the home field of Perl,
so python probably can't match its performance there, but other
advantages of Python might make you still want to avoid going back to
Perl. As long as it's 'efficient enough', who cares?
> f = sys.stdin

Have you tried using a normal file instead of stdin? BTW, you can
iterate over a file easily by "for line in open("mylog.log"):". ISTR
it's also more efficient than readline()'s, because it caches the
lines instead of reading them one by one. You can also get the line
numbers by doing "for linenum, line in enumerate(open("mylog.log")):"
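
A minimal sketch of that loop, written for a current Python 3 rather than the Python 2.3 of this thread. The helper name numbered_fields and the two sample lines are made up for illustration, and io.StringIO stands in for the open("mylog.log") call so the snippet runs without a log file:

```python
import io

def numbered_fields(log):
    """Yield (line number, fields) for each line of an open file-like object."""
    for linenum, line in enumerate(log):   # iterate the file object directly
        yield linenum, line.split()        # str method instead of string.split()

# io.StringIO stands in for open("mylog.log") here (hypothetical sample data).
sample = io.StringIO("Jan 1 00:00:01 a\nJan 1 00:00:02 b\n")
rows = list(numbered_fields(sample))
print(rows)
```

With a real log, the same function would be called as numbered_fields(open("mylog.log")), ideally inside a "with" block so the file is closed afterwards.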

> splitline = string.split(line)

Do not use the 'string' module (it's deprecated), use string methods
instead: line.split()

> clients[source] = clients[source] + 1

clients[source] += 1

or another way to handle the common 'add 1, might not exist' idiom:

clients[source] = 1 + clients.get(source,0)

See http://aspn.activestate.com/ASPN/Coo...n/Recipe/66516
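
As a rough illustration of the 'add 1, might not exist' idiom, here is a Python 3 sketch. The function names are hypothetical, and collections.Counter is a later standard-library addition that did not exist in the Python 2.3 discussed here:

```python
from collections import Counter

def tally_with_get(items):
    """Count occurrences with dict.get, as suggested in the post above."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1   # 0 when the key is new
    return counts

def tally_with_counter(items):
    """Same tally using collections.Counter (Python 2.7+ / 3.x only)."""
    return dict(Counter(items))

sources = ["10.0.0.1", "10.0.0.2", "10.0.0.1"]   # made-up client addresses
print(tally_with_get(sources))
print(tally_with_counter(sources))
```

Both functions return the same mapping; the dict.get form is the one that matches the thread.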
--
Ville Vainio http://www.students.tut.fi/~vainio24
Hello Dennis,
A general note: Use the "hotshot" module to find where you spend most of your time.

> splitline = string.split(line)

My guess is that if you'll use the "re" module things will be much faster.
import re
ws_split = re.compile(r"\s+").split
....
splitline = ws_split(line)
....
HTH.
Miki

mi***@zoran.co.il (Miki Tebeka) wrote in message news:<62*************************@posting.google.com>...
> Hello Dennis,
> A general note: Use the "hotshot" module to find where you spend most of your time.
>> splitline = string.split(line)
> My guess is that if you'll use the "re" module things will be much faster.
> import re
> ws_split = re.compile(r"\s+").split
> ...
> splitline = ws_split(line)
> ...
> HTH.
> Miki

An alternative in Python 2.3 is the timeit module; the following is
extracted from the docs:
import timeit
timer1 = timeit.Timer('unicode("abc")')
timer2 = timeit.Timer('"abc" + u""')
# Run three trials
print timer1.repeat(repeat=3, number=100000)
print timer2.repeat(repeat=3, number=100000)
# On my laptop this outputs:
# [0.36831796169281006, 0.37441694736480713, 0.35304892063140869]
# [0.17574405670166016, 0.18193507194519043, 0.17565798759460449]
Regards
Paul Clinch

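
For readers on a current Python 3: the hotshot module mentioned above was removed in Python 3, so the sketch below swaps in cProfile and pstats from the standard library to answer the same question (where does the time go?). The parse function and the sample log lines are hypothetical stand-ins for the real script and data:

```python
import cProfile
import io
import pstats

def parse(lines):
    """Tally client addresses; field 4 is assumed to be 'address#port'."""
    clients = {}
    for line in lines:
        source = line.split()[4].split("#", 1)[0]
        clients[source] = clients.get(source, 0) + 1
    return clients

# Made-up log lines with nine whitespace-separated fields each.
lines = ["Jan 1 00:00:01 client 10.0.0.%d#53 query: example.com IN A" % (i % 3)
         for i in range(1000)]

profiler = cProfile.Profile()
profiler.enable()
result = parse(lines)
profiler.disable()

# Print the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists call counts and time per function, which is usually enough to see whether splitting, dictionary updates, or reading dominates.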
Ville Vainio <vi********************@spamtut.fi> wrote in message news:<du*************@amadeus.cc.tut.fi>...
>> f = sys.stdin
> Have you tried using a normal file instead of stdin? BTW, you can
> iterate over a file easily by "for line in open("mylog.log"):". ISTR
> it's also more efficient than readline()'s, because it caches the
> lines instead of reading them one by one. You can also get the line
> numbers by doing "for linenum, line in enumerate(open("mylog.log")):"

I have a 240207-line sample log file that I test with. The script I
submitted parsed it in 18 seconds. My Perl script parsed it in 4
seconds.
The new Python script, using a normal file as suggested above, does it
in 3 seconds!
Changed "f = sys.stdin" to "f = open('sample', 'r')".
Thanks Ville!
Note: I made the other changes one at a time as well; the file open
change was the only one that made it faster.

In article <a9**************************@posting.google.com>,
Dennis Roberts <go**********@spacerodent.org> wrote:
> I have a script to parse a dns querylog and generate some statistics.
> For a 750MB file a perl script using the same methods (splits) can
> parse the file in 3 minutes. My python script takes 25 minutes. It
> is enough of a difference that unless I can figure out what I did
> wrong or a better way of doing it I might not be able to use python
> (since most of what I do is parsing various logs). The main reason to
> try python is I had to look at some early scripts I wrote in perl and
> had no idea what the hell I was thinking or what the script even did!
> After some googling and reading Eric Raymonds essay on python I jumped
> in:) Here is my script. I am looking for constructive comments -
> please don't bash my newbie code.

If you haven't yet, make sure you upgrade to Python 2.3; there are a lot
of speed enhancements. Also, it allows you to switch to idioms that work
more like Perl's:
for line in f:
    fields = line.split()
    ...
Generally speaking, contrary to what another poster suggested, string
methods will almost always be faster than regexes (assuming that a
string method does what you want directly, of course; using multiple
string methods may or may not be faster than regexes).
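
One way to check this claim on your own machine is the timeit module mentioned earlier in the thread. The snippet below is a Python 3 sketch with a made-up log line; it prints timings but makes no promise about which one wins on a given system:

```python
import re
import timeit

# Hypothetical log line with no trailing newline. With a trailing "\n" the
# regex split would return an extra empty field that str.split() does not.
line = "Jan 1 00:00:01 client 10.0.0.1#53 query: example.com IN A"
ws_split = re.compile(r"\s+").split

# Both approaches should yield the same nine fields for this line.
same_result = line.split() == ws_split(line)

# Best of three trials, 100000 calls each.
t_method = min(timeit.repeat(lambda: line.split(), repeat=3, number=100000))
t_regex = min(timeit.repeat(lambda: ws_split(line), repeat=3, number=100000))
print("str.split: %.3fs  regex split: %.3fs" % (t_method, t_regex))
```

Taking the minimum of several repeats reduces noise from other processes.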
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/
Weinberg's Second Law: If builders built buildings the way programmers wrote
programs, then the first woodpecker that came along would destroy civilization.

Dennis Roberts wrote:
> I have a script to parse a dns querylog and generate some statistics.
> For a 750MB file a perl script using the same methods (splits) can
> parse the file in 3 minutes. My python script takes 25 minutes. It
> is enough of a difference that unless I can figure out what I did
> wrong or a better way of doing it I might not be able to use python
> (since most of what I do is parsing various logs). The main reason to
> try python is I had to look at some early scripts I wrote in perl and
> had no idea what the hell I was thinking or what the script even did!
> After some googling and reading Eric Raymonds essay on python I jumped
> in:) Here is my script. I am looking for constructive comments -
> please don't bash my newbie code.

Below is my version of your script. It tries to use more idiomatic Python
and is about 20% faster on some bogus data - but nowhere near enough to
close the performance gap to the Perl script that you claim.
However, it took 143 seconds to process 10**7 lines generated by
<makesample.py>
import itertools, sys
sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
thousand = itertools.cycle(range(1000))
hundred = itertools.cycle(range(100))
out = file(sys.argv[1], "w")
try:
    try:
        count = int(sys.argv[2])
    except IndexError:
        count = 10**7
    for i in range(count):
        print >> out, sample % (i, thousand.next(), hundred.next())
finally:
    out.close()
</makesample.py>
with Python 2.3.2 on my 2.6GHz P4. Would that mean Perl would do it in 17
seconds? Anyway, the performance problem would rather be your computer :-);
Python should be fast enough for the purpose.
Peter
<parselog.py>
#!/usr/bin/python -u
#Warning, not seriously tested
import sys
#import time
#starttime = time.time()
clients = {}
queries = {}
lineNo = -1
threshold = 100
pointmod = 100000
f = file(sys.argv[1])
try:
    print "Each dot is %d lines..." % pointmod
    for lineNo, line in enumerate(f):
        if lineNo % pointmod == 0:
            sys.stdout.write(".")
        try:
            (month, day, timestr, stype, source, qtype, query, ctype,
             record) = line.split()
        except ValueError:
            raise Exception("problem splitting line %d\n%s"
                            % (lineNo, line))
        source = source.split('#', 1)[0]
        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
finally:
    f.close()
print
print lineNo+1, "lines processed"
for numclient, count in clients.iteritems():
    if count > threshold:
        print "%s,%s" % (numclient, count)
for numquery, count in queries.iteritems():
    if count > threshold:
        print "%s,%s" % (numquery, count)
#print "time:", time.time() - starttime
</parselog.py>
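
The script above is Python 2 and will not run unmodified on Python 3 (print statements, file(), iteritems()). The following is a hedged Python 3 sketch of the same tallying logic, not a drop-in replacement: the function names tally and over_threshold are invented here, and the nine-field line layout is assumed from the scripts in this thread:

```python
def tally(lines):
    """Count lines per client address and per query name."""
    clients, queries = {}, {}
    for line_no, line in enumerate(lines):
        fields = line.split()
        if len(fields) != 9:   # month day time stype source qtype query ctype record
            raise ValueError("problem splitting line %d\n%s" % (line_no, line))
        source = fields[4].split("#", 1)[0]   # drop the '#port' suffix
        query = fields[6]
        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
    return clients, queries

def over_threshold(counts, threshold):
    """Return sorted (name, count) pairs whose count exceeds threshold."""
    return sorted((name, n) for name, n in counts.items() if n > threshold)

# Two made-up lines in the format produced by makesample.py.
demo = ["0month day time stype source1#sowhat qtype 7query ctype record",
        "1month day time stype source1#sowhat qtype 8query ctype record"]
clients, queries = tally(demo)
print(over_threshold(clients, 1))
```

Because tally accepts any iterable of lines, an open file object can be passed in directly, for example "with open(path) as f: tally(f)".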

Peter Otten wrote:
> However, it took 143 seconds to process 10**7 lines generated by

I just downloaded psycho, oops, keep misspelling the name :-) and it brings
down the time to 92 seconds - almost for free. I must say I'm impressed,
the psycologist(s) did an excellent job.
Peter
#!/usr/bin/python -u
import psyco, sys
psyco.full()
def main():
    clients = {}
    queries = {}
    lineNo = -1
    threshold = 100
    pointmod = 100000
    f = file(sys.argv[1])
    try:
        print "Each dot is %d lines..." % pointmod
        for lineNo, line in enumerate(f):
            if lineNo % pointmod == 0:
                sys.stdout.write(".")
            try:
                (month, day, timestr, stype, source, qtype, query, ctype,
                 record) = line.split()
            except ValueError:
                raise Exception("problem splitting line %d\n%s"
                                % (lineNo, line))
            source = source.split('#', 1)[0]
            clients[source] = clients.get(source, 0) + 1
            queries[query] = queries.get(query, 0) + 1
    finally:
        f.close()
    print
    print lineNo+1, "lines processed"
    for numclient, count in clients.iteritems():
        if count > threshold:
            print "%s,%s" % (numclient, count)
    for numquery, count in queries.iteritems():
        if count > threshold:
            print "%s,%s" % (numquery, count)
import time
starttime = time.time()
main()
print "time:", time.time() - starttime