pyparsing question

hubritic

I am trying to parse data that looks like this:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
2BFA76F6 1208230607 T S SYSPROC SYSTEM
SHUTDOWN BY USER
A6D1BD62 1215230807 I
H Firmware Event

My problem is that sometimes there is a RESOURCE_NAME and sometimes
not, so I wind up with "Firmware" as my RESOURCE_NAME and "Event" as
my DESCRIPTION. The formatting seems to use a set number of spaces.

I have tried making RESOURCE_NAME an Optional(Word(alphanums)) and
DESCRIPTION OneOrMore(Word(alphas)) + LineEnd(). So the question is,
how can I avoid having the first word of DESCRIPTION sucked into
RESOURCE_NAME when that field should be blank?
The data I have has a fixed number of characters per field, so I could
split it up that way, but wouldn't that defeat the purpose of using a
parser? I am determined to become proficient with pyparsing so I am
using it even when it could be considered overkill; thus, it has gone
past mere utility now, this is a matter of principle!

thanks

Jan 1 '08 #1


John Machin

On Jan 2, 10:32 am, hubritic <colinland...@gmail.com> wrote:

The data I have has a fixed number of characters per field, so I could
split it up that way, but wouldn't that defeat the purpose of using a
parser?

The purpose of a parser is to parse. Data in fixed columns does not
need parsing.

I am determined to become proficient with pyparsing so I am
using it even when it could be considered overkill; thus, it has gone
past mere utility now, this is a matter of principle!

An extremely misguided "principle". Would you use an AK47 on the
flies around your barbecue? A better principle is to choose the best
tool for the job.

Jan 2 '08 #2

hubritic

On Jan 1, 4:18 pm, John Machin <sjmac...@lexicon.net> wrote:

On Jan 2, 10:32 am, hubritic <colinland...@gmail.com> wrote:

The data I have has a fixed number of characters per field, so I could
split it up that way, but wouldn't that defeat the purpose of using a
parser?

The purpose of a parser is to parse. Data in fixed columns does not
need parsing.

I am determined to become proficient with pyparsing so I am
using it even when it could be considered overkill; thus, it has gone
past mere utility now, this is a matter of principle!

An extremely misguided "principle". Would you use an AK47 on the
flies around your barbecue? A better principle is to choose the best
tool for the job.

Your principle is no doubt the saner one for the real world, but your
example of AK47 is a bit off.
We generally know enough about an AK47 to know that it is not
something to kill flies with. Consider, though, if
someone unfamiliar with the concept of guns and mayhem got an AK47 for
xmas and was only told that it was
really good for killing things. He would try it out and would discover
that indeed it kills all sorts of things.
So he might try killing flies. Then he would discover the limitations;
those already familiar with guns would wonder
why he would waste his time.

Jan 2 '08 #3

Paul McGuire

On Jan 1, 5:32 pm, hubritic <colinland...@gmail.com> wrote:

I am trying to parse data that looks like this:

IDENTIFIER    TIMESTAMP   T  C   RESOURCE_NAME   DESCRIPTION
2BFA76F6     1208230607   T   S   SYSPROC                    SYSTEM
SHUTDOWN BY USER
A6D1BD62   1215230807     I
H                                            Firmware Event

<snip>

The data I have has a fixed number of characters per field, so I could
split it up that way, but wouldn't that defeat the purpose of using a
parser?

I think you have this backwards. I use pyparsing for a lot of text
processing, but if it is not a good fit, or if str.split is all that
is required, there is no real rationale for using anything more
complicated.

I am determined to become proficient with pyparsing so I am
using it even when it could be considered overkill; thus, it has gone
past mere utility now, this is a matter of principle!

Well, I'm glad you are driven to learn pyparsing if it kills you, but
John Machin has a good point. This data is really so amenable to
something as simple as:

for line in logfile:
    id, timestamp, t, c, resource_and_description = line.split(None, 4)

that it is difficult to recommend pyparsing for this case. The sample
you posted was space-delimited, but if it is tab-delimited, and there
is a pair of tabs between the "H" and "Firmware Event" on the second
line, then just use split("\t") for your data and be done.
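To make the split approach concrete, here is a minimal sketch. It is not from the original post: the function name split_line is made up, and the rule used to separate the last two fields (a resource name is followed by two or more spaces, while description words are single-spaced) is an assumption that would need checking against the real log.

```python
import re

def split_line(line):
    """Split one log line into six fields with str.split and one regex."""
    # The first four fields never contain spaces, so split them off first.
    # (This raises ValueError on lines with fewer than five fields.)
    ident, timestamp, t, c, rest = line.split(None, 4)
    # Assumption: a resource name is followed by two or more spaces,
    # while the words of a description are separated by single spaces.
    parts = re.split(r"\s{2,}", rest.strip(), maxsplit=1)
    if len(parts) == 2:
        resource, description = parts
    else:
        resource, description = "", parts[0]
    return ident, timestamp, t, c, resource, description

# Sample lines modelled on the ones in the original post.
print(split_line("2BFA76F6   1208230607   T   S   SYSPROC      SYSTEM SHUTDOWN BY USER"))
print(split_line("A6D1BD62   1215230807   I   H                Firmware Event"))
```

On the second sample line the resource field comes back as an empty string and the whole of "Firmware Event" stays in the description.
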

Still, pyparsing may be helpful in disambiguating that RESOURCE_NAME
and DESCRIPTION text. One approach would be to enumerate (if
possible) the different values of RESOURCE_NAME. Something like this:

from pyparsing import Word, alphanums, nums, oneOf, Optional, restOfLine

ident = Word(alphanums)
timestamp = Word(nums,exact=10)

# I don't know what these are, I'm just getting the values
# from the sample text you posted
t_field = oneOf("T I")
c_field = oneOf("S H")

# I'm just guessing here, you'll need to provide the actual
# values from your log file
resource_name = oneOf("SYSPROC USERPROC IOSUBSYS whatever")

logline = ident("identifier") + timestamp("time") + \
    t_field("T") + c_field("C") + \
    Optional(resource_name, default="")("resource") + \
    Optional(restOfLine, default="")("description")

Another tack to take might be to use a parse action on the resource
name, to verify the column position of the found token by using the
pyparsing method col:

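from pyparsing import ParseException, Word, alphas, col

def matchOnlyAtCol(n):
    # Build a parse action that rejects a token unless it starts at column n
    def verifyCol(strg, locn, toks):
        if col(locn, strg) != n:
            raise ParseException(strg, locn, "matched token not at column %d" % n)
    return verifyCol

resource_name = Word(alphas).setParseAction(matchOnlyAtCol(35))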

This will only work if your data really is columnar - the example text
that you posted isn't. (Hmm, I like that matchOnlyAtCol method, I
think I'll add that to the next release of pyparsing...)
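
If the file really is columnar, as the original post says, the same column knowledge can be used without pyparsing at all. The following is only a sketch: the offsets in FIELDS and the helper names split_fixed and make_line are invented for illustration, since the thread never shows the real field widths.

```python
# Hypothetical layout: (field name, slice start, slice end), 0-based.
# These offsets are assumptions for illustration only.
FIELDS = [
    ("identifier", 0, 11),
    ("timestamp", 11, 24),
    ("t", 24, 28),
    ("c", 28, 32),
    ("resource", 32, 48),
    ("description", 48, None),
]

def split_fixed(line):
    """Slice a line at the assumed column boundaries and strip the padding."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}

def make_line(ident, ts, t, c, resource, description):
    """Build a sample line that follows the assumed layout (demo helper)."""
    return f"{ident:<11}{ts:<13}{t:<4}{c:<4}{resource:<16}{description}"

with_resource = make_line("2BFA76F6", "1208230607", "T", "S", "SYSPROC",
                          "SYSTEM SHUTDOWN BY USER")
without_resource = make_line("A6D1BD62", "1215230807", "I", "H", "",
                             "Firmware Event")

print(split_fixed(with_resource))
print(split_fixed(without_resource))
```

Because each field is taken from its own slice, a blank RESOURCE_NAME simply comes back as an empty string and nothing from the description can leak into it.
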

Here are some similar parsers that might give you some other ideas:
http://pyparsing.wikispaces.com/spac...erLogParser.py
http://mail.python.org/pipermail/pyt...ad.html#301450

In the second link, I made a similar remark, that pyparsing may not be
the first tool to try, but the variability of the input file made the
non-pyparsing options pretty hairy-looking with special case code, so
in the end, pyparsing was no more complex to use.

Good luck!
-- Paul

Jan 2 '08 #4
