Parsing apache log files

I am pulling apart some big apache logs (800-1000MB) for some analysis,
and stuffing it into a MySQL database. Most of it goes ok, despite my
meager coding abilities. But every so often I run across "borken" bits
of data, like user agent strings that include "'/\ and such, although
they are escaped by apache in writing the log, they break up my somewhat
clunky splits.

A typical (good) line, looks like this
111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] "GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - "http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

which I can split fine, by splitting on the " first, then splitting each
bit up on the appropriate thing, mostly spaces. But occasionally I get
something like
11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] "GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 "http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\"]"

note the [ru\" at the end.
I am looking for a way to strip out the IP, day, time, requested url,
referrer, bytes, status, and user agent, and what I have, though a bit
crufty, works 99.99% of the time, but then something like this shows up.

I have a couple of approaches. Reject the bad entries, save them to a
file, then manually enter them, problem is, with 10 million entries, and
about 1 in 1000 being bad...
Although as I write this, I think maybe I can use the \ to warn me, and
behave accordingly? hm, I'll have to try that.

In the meantime, is there some obvious method, or module that I have
missed ?
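
A minimal sketch of the "use the \ to warn me" idea, assuming the quoted fields only ever contain backslash escapes such as \" and \\. The names TOKEN and split_log_line are made up for illustration; they do not come from any module mentioned in this thread.

```python
import re

# One token is either a "quoted string" (backslash escapes allowed inside),
# a [bracketed timestamp], or a run of non-space characters.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|\[([^\]]*)\]|(\S+)')

def split_log_line(line):
    """Split one access-log line into fields, honoring backslash-escaped quotes."""
    fields = []
    for m in TOKEN.finditer(line):
        quoted, bracketed, bare = m.groups()
        if quoted is not None:
            fields.append(quoted)       # text between the outer quotes
        elif bracketed is not None:
            fields.append(bracketed)    # text between [ and ]
        else:
            fields.append(bare)
    return fields

# The troublesome line from the post, written as raw strings so the
# backslash before the quote survives.
line = (r'11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] '
        r'"GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 '
        r'"http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" '
        r'"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\"]"')
fields = split_log_line(line)
```

The escaped quote stays inside the last field instead of ending it, so the user agent comes back as one piece with the backslash still in place.
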
--
Jim Richardson http://www.eskimo.com/~warlock
Windows is the answer, but only if the question was
'what is the intellectual equivalent of being a galley slave?'
--Larry Smith, in comp.os.linux.misc
Jul 18 '05 #1
9 Replies


> In the meantime, is there some obvious method, or module that I have
> missed ?


I use a regular expression:
import re

# Raw strings keep the backslashes for the regex engine; the timezone
# group is closed before the literal "]".
# Note: the " -" before the timezone only matches negative offsets (-0800).
rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[([^\[\]:]+):'
                  r'(\d+:\d+:\d+) -(\d\d\d\d)\] ("[^"]*") '
                  r'(\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

a = rexp.match(line)
if a is not None:
    a.group(1) #IP address
    a.group(2) #day/month/year
    a.group(3) #time of day
    a.group(4) #timezone
    a.group(5) #request
    a.group(6) #code 200 for success, 404 for not found, etc.
    a.group(7) #bytes transferred
    a.group(8) #referrer
    a.group(9) #browser
else:
    pass #this line did not match.

That should work for almost any line you get, but you may want to run it
over a few megs of your logs just to check whether the else branch is
ever reached for a non-empty line.
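
One way to run that check, as a rough sketch: count the matches and write every non-empty line that falls through to a rejects file for later inspection. The expression is the one from this post with the timezone group written as (\d\d\d\d)\]; the function name scan and the in-memory sample are assumptions made for the example, and a real run would pass an open log file and an open output file instead.

```python
import io
import re

LOG_RE = re.compile(
    r'(\d+\.\d+\.\d+\.\d+) - - \[([^\[\]:]+):'
    r'(\d+:\d+:\d+) -(\d\d\d\d)\] ("[^"]*") '
    r'(\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

def scan(lines, rejects):
    """Count matching lines; write non-empty non-matching lines to rejects."""
    matched = unmatched = 0
    for line in lines:
        if not line.strip():
            continue                      # ignore empty lines entirely
        if LOG_RE.match(line):
            matched += 1
        else:
            unmatched += 1
            rejects.write(line if line.endswith('\n') else line + '\n')
    return matched, unmatched

GOOD = ('111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] '
        '"GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - '
        '"http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"\n')

sample = [GOOD, '\n', 'this is not a log line\n']
rejects = io.StringIO()                   # stands in for a rejects file
counts = scan(sample, rejects)
```

Because the loop handles one line at a time, the same function should cope with an 800-1000MB file without loading it into memory.
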

- Josiah
Jul 18 '05 #2

"Jim Richardson" <wa*****@eskimo.com> wrote in message
news:oq************@grendel.myth...

> I am pulling apart some big apache logs (800-1000MB) for some analysis,
> and stuffing it into a MySQL database. Most of it goes ok, despite my
> meager coding abilities. But every so often I run across "borken" bits
> of data, like user agent strings that include "'/\ and such, although
> they are escaped by apache in writing the log, they break up my somewhat
> clunky splits.

pyparsing examples directory includes an HTTP server log parser. Using your
data, there was one minor error where the bytesSent field in the first line
was just a dash instead of an integer. After correcting that, I ran it
against your test lines and got this output:

fields.numBytesSent = -
fields.timestamp = ['16/Feb/2004:04:09:49', '-0800']
fields.clientSfw = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
fields.referrer = http://www.foobarp.org/theme_detail....at=0&mid=27512
fields.cmd = ['GET', '/ads/redirectads/336x280redirect.htm', 'HTTP/1.1']
fields.ipAddr = 111.111.111.11
fields.statusCode = 304

fields.numBytesSent = 541
fields.timestamp = ['16/Feb/2004:10:35:12', '-0800']
fields.clientSfw = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru
fields.referrer = http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56
fields.cmd = ['GET', '/ads/redirectads/468x60redirect.htm', 'HTTP/1.1']
fields.ipAddr = 11.111.11.111
fields.statusCode = 200

Download pyparsing at http://pyparsing.sourceforge.net.

Here's the change you'll have to make to the example:

Change:
integer.setResultsName("statusCode") +
integer.setResultsName("numBytesSent") +
to:
(integer | "-").setResultsName("statusCode") +
(integer | "-").setResultsName("numBytesSent") +
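
Once the fields are split out, the "-" placeholder and the two-part timestamp still have to be turned into values a database column will accept. A small sketch in plain Python, independent of pyparsing; the helper names int_or_none and parse_timestamp are hypothetical, and %b assumes English month abbreviations in the current locale.

```python
from datetime import datetime

def int_or_none(text):
    """Return int(text), or None when the log shows the '-' placeholder."""
    return None if text == '-' else int(text)

def parse_timestamp(stamp, offset):
    """Combine '16/Feb/2004:04:09:49' and '-0800' into an aware datetime."""
    # %z reads the numeric UTC offset; %b reads the month abbreviation.
    return datetime.strptime(stamp + ' ' + offset, '%d/%b/%Y:%H:%M:%S %z')

status = int_or_none('304')
num_bytes = int_or_none('-')              # placeholder becomes None
when = parse_timestamp('16/Feb/2004:04:09:49', '-0800')
```

Mapping the placeholder to None rather than 0 keeps "no body sent" distinguishable from a real zero-byte response when the rows are inserted.
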

-- Paul
Jul 18 '05 #3

On Thu, 19 Feb 2004 22:32:24 -0800,
Josiah Carlson <jc******@nospam.uci.edu> wrote:
<snip>

thanks, although reading that re makes my brain hurt! :), and I don't
think it handles the case where the dashes are something else (the dash
is a place holder for some data that wasn't there on this request,
bytelength, referrer, something) but I'll look into it, thanks for the
example.

--
Jim Richardson http://www.eskimo.com/~warlock
Ok, the guy who made the netfilter Makefile was probably on some really
interesting and probably highly illegal drugs when he wrote it.
-- Linus Torvalds
Jul 18 '05 #4

On Fri, 20 Feb 2004 09:18:01 GMT,
Paul McGuire <pt***@users.sourceforge.net> wrote:
"Jim Richardson" <wa*****@eskimo.com> wrote in message
news:oq************@grendel.myth...

<snip>


now *this* looks interesting. Thanks a lot!

--
Jim Richardson http://www.eskimo.com/~warlock
" ... a language is just an dialect with an army and a navy."
-- Paul Tomblin, in a.s.r.
Jul 18 '05 #5

> thanks, although reading that re makes my brain hurt! :), and I don't
> think it handles the case where the dashes are something else (the dash
> is a place holder for some data that wasn't there on this request,
> bytelength, referrer, something) but I'll look into it, thanks for the
> example.


It depends on which dash you were talking about. The dash immediately
after the response code is the number of bytes sent, and is handled by
the regular expression.

Unless you use identd checks, the first '-' will always be there, though
the second '-' is the identity of the client given through http auth,
which may or may not be important to you.

Modifying the regular expression:
import re

# Raw strings keep the backslashes for the regex engine; the timezone
# group is closed before the literal "]".
rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) (-|\w*) (-|\w*) '
                  r'\[([^\[\]:]+):(\d+:\d+:\d+) -(\d\d\d\d)\] '
                  r'("[^"]*") (\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

a = rexp.match(line)
if a is not None:
    a.group(1) #IP address
    a.group(2) #identd response (if any)
    a.group(3) #http auth user
    a.group(4) #day/month/year
    a.group(5) #time of day
    a.group(6) #timezone
    a.group(7) #request
    a.group(8) #code 200 for success, 404 for not found, etc.
    a.group(9) #bytes transferred
    a.group(10) #referrer
    a.group(11) #browser
else:
    pass #this line did not match.
There you go.
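
With eleven groups the numbering gets hard to follow, so here is a variation using named groups. This is a sketch, not the expression above: the group names are invented, \S+ replaces (-|\w*) for the identd and auth fields, the timezone accepts either sign, and the user "someuser" in the test line is hypothetical.

```python
import re

LOG_RE = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\[\]:]+):(?P<time>\d+:\d+:\d+) (?P<tz>[+-]\d{4})\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<bytes>-|\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>.*)"\s*\Z')

def parse(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

# First sample line from the thread, with a made-up http auth user
# in place of the second dash.
line = ('111.111.111.11 - someuser [16/Feb/2004:04:09:49 -0800] '
        '"GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - '
        '"http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
rec = parse(line)
```

The dict keys can then be mapped straight onto column names when building the MySQL insert.
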
- Josiah
Jul 18 '05 #6

On Fri, 20 Feb 2004 09:20:00 -0800,
Josiah Carlson <jc******@nospam.uci.edu> wrote:
>> thanks, although reading that re makes my brain hurt! :), and I don't
>> think it handles the case where the dashes are something else (the dash
>> is a place holder for some data that wasn't there on this request,
>> bytelength, referrer, something) but I'll look into it, thanks for the
>> example.
>
> It depends on which dash you were talking about. The dash immediately
> after the response code is the number of bytes sent, and is handled by
> the regular expression.
>
> Unless you use identd checks, the first '-' will always be there, though
> the second '-' is the identity of the client given through http auth,
> which may or may not be important to you.


<snip>

It was the http auth, which for some reason, show up from time to time,
may be a misconfigured router/proxy don't know. But this works, although
my brain is still parsing the regexp :) Thank you very much for your
help.

--
Jim Richardson http://www.eskimo.com/~warlock
One man's religion is another man's belly laugh.
Jul 18 '05 #7

> It was the http auth, which for some reason, show up from time to time,
> may be a misconfigured router/proxy don't know. But this works, although
> my brain is still parsing the regexp :) Thank you very much for your
> help.


It is relatively easy to generate the regular expression by hand, I did.
But I agree, it is a bit dense if you didn't do it. I need to pull
out the python docs every time I see a regular expression that I didn't
write.

- Josiah
Jul 18 '05 #8

On Fri, 20 Feb 2004 22:53:10 -0800,
Josiah Carlson <jc******@nospam.uci.edu> wrote:
>> It was the http auth, which for some reason, show up from time to time,
>> may be a misconfigured router/proxy don't know. But this works, although
>> my brain is still parsing the regexp :) Thank you very much for your
>> help.
>
> It is relatively easy to generate the regular expression by hand, I did.
> But I agree, it is a bit dense if you didn't do it. I need to pull
> out the python docs every time I see a regular expression that I didn't
> write.
>
> - Josiah

I can parse it if I think hard about what it does :) I guess that means
that the python interp is smarter than me :) Thanks again.

--
Jim Richardson http://www.eskimo.com/~warlock
"`If there's anything more important than my ego around, I
want it caught and shot now.'"
-- Zaphod
Jul 18 '05 #9

On Fri, 20 Feb 2004 23:36:45 -0800,
Jim Richardson <wa*****@eskimo.com> wrote:

> On Fri, 20 Feb 2004 22:53:10 -0800,
> Josiah Carlson <jc******@nospam.uci.edu> wrote:
>>> It was the http auth, which for some reason, show up from time to time,
>>> may be a misconfigured router/proxy don't know. But this works, although
>>> my brain is still parsing the regexp :) Thank you very much for your
>>> help.
>>
>> It is relatively easy to generate the regular expression by hand, I did.
>> But I agree, it is a bit dense if you didn't do it. I need to pull
>> out the python docs every time I see a regular expression that I didn't
>> write.
>>
>> - Josiah
>
> I can parse it if I think hard about what it does :) I guess that means
> that the python interp is smarter than me :) Thanks again.

Oh, and there was a bug! I found a bug! woot!

(hey, I'm not very good at this, It's cool when I fix a bug)

--
Jim Richardson http://www.eskimo.com/~warlock
Do I LOOK like a damn people person?
Jul 18 '05 #10
