Parsing space delimited records

I’m trying to parse Amazon S3 server logs, which are space delimited.
However, the date fields are in the following form:

[28/Oct/2008:21:44:21 +0000]

When I use the following code to split a record on the spaces, it
also splits the date field:

string[] fields = record.Split(' ');

What can I do to get around this?

Scott

Oct 29 '08 #1

Hi Scott,

I personally would use regular expressions to split the words in a smart
way. Below is a sample console application to demonstrate it. The regular
expression \[.*\]\s*|.+ matches one of two alternatives:

a) Text wrapped inside [ and ], plus any trailing whitespace
b) Any other text (your actual server log)

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Either a bracketed block (plus trailing whitespace) or everything else
        string expr = @"\[.*\]\s*|.+";
        string line = "[28/Oct/2008:21:44:21 +0000] Test with p~nctuat!ion word goes here!";

        Regex regex = new Regex(expr);

        foreach (Match m in regex.Matches(line))
        {
            string value = m.Value.Trim();

            if (value.StartsWith("[") && value.EndsWith("]"))
            {
                // This is the timestamp
                Console.WriteLine("TEST: time = " + value);
            }
            else
            {
                // This is the remaining log text
                Console.WriteLine("TEST: word = " + value);
            }
        }

        // Keep the console window open until a key is pressed
        Console.Read();
    }
}

Oct 30 '08 #2

I was hoping to avoid taking the time to create a regular expression, as there
are 17 fields per S3 record. It took me a while, but here is what I ended up
with:

(.*?)(\s+)(.*?)(\s+)(\[.*?\])(\s+)((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?![\d])(\s+)(.*?)(\s+)(.*?)(\s+)(.*?)(\s+)(.*)(\s+)(".*?")(\s+)(.*?)(\s+)(.*?)(\s+)(\d+|-)(\s+)(\d+|-)(\s+)(\d+|-)(\s+)(\d+|-)(\s+)(".*?")(\s+)(".*?")

Yuck, I'd rather be doing about a million other things, but oh well,
problem solved.

Oct 30 '08 #3

I am sure there is a *more* elegant solution to the problem. Can you post a
sample of the log output? And do you want to get the individual words out of
the log?

E.g. if the log line is
[28/Oct/2008:21:44:21 +0000] Test with p~nctuat!ion word goes here!
would you like to have the timestamp, "Test", "with", etc. as separate
matches? If so, you could split the text using string.Split() once you have
the actual log text (see my previous code example for the 'log text' case).

--
Stanimir Stoyanov
http://stoyanoff.info

Oct 30 '08 #4

Unless you're somehow married to the format, just drop the time zone:

string[] fields = record.Replace(' +0000','',Split(' ');

Oct 30 '08 #5

Er, make that:

string[] fields = record.Replace(" +0000", "").Split(' ');

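A minimal sketch of what dropping the time zone and then splitting does to one record from this thread. It assumes the offset is always the literal " +0000" (any other offset would not be removed); the class name is illustrative, and the 64-character owner and requester IDs are shortened to "owner" for readability.

```csharp
using System;

class DropTimeZoneDemo
{
    static void Main()
    {
        // Sample record from the log excerpt in this thread
        // (owner/requester IDs shortened to "owner").
        string record = "owner testBucket [28/Oct/2008:21:44:21 +0000] 127.0.0.1 owner "
            + "AAE9C2CCFFE5E6DB REST.GET.ACL - \"GET /?acl HTTP/1.1\" 200 - 556 - 488 - \"-\" \"-\"";

        // string.Replace takes string (double-quoted) arguments.
        string[] fields = record.Replace(" +0000", "").Split(' ');

        Console.WriteLine(fields[2]);      // [28/Oct/2008:21:44:21]
        Console.WriteLine(fields.Length);  // 19, not 17
    }
}
```

The timestamp survives as one field, but the quoted request line "GET /?acl HTTP/1.1" is still cut into three pieces, so this only helps if the quoted fields never contain spaces.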
Oct 30 '08 #6

Hello Stanimir,

If you do a Regex.Match with the following regex:

^((\[(?<result>[^\]]*)\]|(?<result>[^ ]*))([ ]|$))*

That should get you a Match object with one named group and 17 captures in it.
Exactly what you need...

You should also be able to use the Log Parser class that the IIS team once
published, but I cannot find a link at the moment...

Jesse
--
Jesse Houwing
jesse.houwing at sogeti.nl
Oct 30 '08 #7

Below is an example of what is in a log file. I'm just trying to read the
logs and dump the fields into a database.

4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:21 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 AAE9C2CCFFE5E6DB REST.GET.ACL - "GET /?acl HTTP/1.1" 200 - 556 - 488 - "-" "-"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:24 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 66FB31B05AFA84E9 REST.GET.LOGGING_STATUS - "GET /?logging HTTP/1.1" 200 - 244 - 171 - "-" "-"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:56 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 40AC4747CFF7ACFD REST.GET.BUCKET - "GET / HTTP/1.1" 200 - 1298 - 15 12 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:56 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 5938B6855868E040 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1298 - 642 473 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:33 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 16F565F75362B5A8 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1298 - 508 293 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:33 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 D61C9201C46617CF REST.PUT.OBJECT testFile.zip "PUT /testFile.zip HTTP/1.1" 200 - - 17428 334 11 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:34 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 B2FEB30917A1F050 REST.GET.BUCKET - "GET / HTTP/1.1" 200 - 1634 - 181 15 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:34 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 B41FCF38CD590562 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1634 - 15 13 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:11 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 C42BF5C887E61F18 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1634 - 476 299 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:12 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 A590228971F16081 REST.PUT.OBJECT testFile.zip "PUT /testFile.zip HTTP/1.1" 200 - - 1487163 20298 48 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:32 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 6528418F2CCABB59 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1969 - 312 309 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:33 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 EE65B98BD633E32C REST.GET.BUCKET - "GET / HTTP/1.1" 200 - 1969 - 16 14 "-" "Amazon S3 CSharp Library"
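For records shaped like the ones in this thread, one regex-free option is a small hand-written scanner that splits on spaces but keeps [...] and "..." together. The following is only a sketch under assumptions: Tokenize is a hypothetical helper (not part of any S3 or .NET library), it strips the surrounding brackets and quotes, and it does not handle escaped quotes inside a quoted field. The owner and requester IDs are shortened to "owner".

```csharp
using System;
using System.Collections.Generic;

class LogTokenizer
{
    // Hypothetical helper: splits one record on spaces while treating
    // "[...]" and "\"...\"" as single fields (delimiters are stripped).
    static List<string> Tokenize(string record)
    {
        var fields = new List<string>();
        int i = 0;
        while (i < record.Length)
        {
            char c = record[i];
            if (c == ' ') { i++; continue; }   // skip separators

            // Decide which character ends the current field.
            char closer = c == '[' ? ']' : c == '"' ? '"' : ' ';
            int end = record.IndexOf(closer, i + 1);
            if (end < 0) end = record.Length;  // unterminated: take the rest

            if (closer == ' ')
                fields.Add(record.Substring(i, end - i));          // plain field
            else
                fields.Add(record.Substring(i + 1, end - i - 1));  // drop [ ] or " "
            i = end + 1;
        }
        return fields;
    }

    static void Main()
    {
        string record = "owner testBucket [28/Oct/2008:21:44:21 +0000] 127.0.0.1 owner "
            + "AAE9C2CCFFE5E6DB REST.GET.ACL - \"GET /?acl HTTP/1.1\" 200 - 556 - 488 - \"-\" \"-\"";

        List<string> fields = Tokenize(record);
        Console.WriteLine(fields.Count);  // 17
        Console.WriteLine(fields[2]);     // 28/Oct/2008:21:44:21 +0000
        Console.WriteLine(fields[8]);     // GET /?acl HTTP/1.1
    }
}
```

Because every field keeps its position, the resulting list can be mapped to database columns by index.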

Oct 30 '08 #8

One of the following regular expressions might fit better:

\[.*\]|\"[^\"]*\"|[^\s-]+

or

\[.*\]|\"[^\"]*\"|[^\s]+

The difference is that the first skips the lone dashes found on some rows
(in between the figures), e.g.
200 - 1634 - 181 15

--
Stanimir Stoyanov
http://stoyanoff.info

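A minimal sketch of how the two expressions above might be applied with Regex.Matches, using one record from the posted sample (owner and requester IDs shortened to "owner"; the class and variable names are illustrative). It assumes each record contains exactly one [...] block, since \[.*\] is greedy and would otherwise run to the last closing bracket.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexTokenDemo
{
    static void Main()
    {
        string record = "owner testBucket [28/Oct/2008:21:44:21 +0000] 127.0.0.1 owner "
            + "AAE9C2CCFFE5E6DB REST.GET.ACL - \"GET /?acl HTTP/1.1\" 200 - 556 - 488 - \"-\" \"-\"";

        // Second expression; inside a verbatim string a quote is written as "".
        Regex all = new Regex(@"\[.*\]|""[^""]*""|[^\s]+");
        // First expression: the last alternative also excludes '-'.
        Regex noDashes = new Regex(@"\[.*\]|""[^""]*""|[^\s-]+");

        MatchCollection tokens = all.Matches(record);
        Console.WriteLine(tokens.Count);                    // 17
        Console.WriteLine(tokens[2].Value);                 // [28/Oct/2008:21:44:21 +0000]
        Console.WriteLine(tokens[8].Value);                 // "GET /?acl HTTP/1.1"
        Console.WriteLine(noDashes.Matches(record).Count);  // 13
    }
}
```

With the first expression the four bare "-" placeholders disappear, so the remaining fields shift position, and any unquoted value that contains a hyphen would be split in two. For loading fixed columns into a database, the second expression is probably the safer choice.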

Oct 31 '08 #9
