"Intro to Pyparsing" Article at ONLamp

Paul McGuire

I just published my first article on ONLamp, a beginner's walkthrough for
pyparsing.

Please check it out at
http://www.onlamp.com/pub/a/python/2...pyparsing.html, and be sure to
post any questions or comments.

-- Paul

Jan 27 '06 #1

Anton Vredegoor

Paul McGuire wrote:

I just published my first article on ONLamp, a beginner's walkthrough for
pyparsing.

Please check it out at
http://www.onlamp.com/pub/a/python/2...pyparsing.html, and be sure to
post any questions or comments.

I like your article and pyparsing. But since you ask for comments I'll
give some. For unchanging data file formats pyparsing seems to be OK.
But for highly volatile data like videotext pages or maybe some HTML
tables, one often invests time in writing a grammar only to see it fail,
because the data format changes between the times one uses the script.
For example, I had this experience when parsing chess games from
videotext pages I grab from my videotext-enabled TV capture card. Maybe
once or twice a year there's a chess page with games on videotext, but
the display format always changes slightly in the meantime, so I have
to adapt my script. For such things I've switched back to 'hand' coding
because it seems to be more flexible.

(Or use a live internet connection to view the game instead of parsing
videotext, but that's a lot less fun, and I don't have internet in some
places.)

What I would like to see, in order to improve on this situation, is a
graphical (Tkinter) editor-highlighter in which it would be possible to
select blocks of text from an (example) page, 'name' each block of
text, and select a grammar which it complies with, in order to assign a
role to it later. That would be the perfect companion to pyparsing.

At the moment I don't even know if such a thing would be feasible, or
how hard it would be to make it, but I remember having seen data
analyzing tools based on fixed column width data files, which is of
course in a whole other league of difficulty of programming, but at
least it gives some encouragement to the idea that it would be
possible.

Thank you for your ONLamp article and for making pyparsing available. I
had some fun experimenting with it and it gave me some insights into
parsing grammars.

Anton

Jan 28 '06 #2

Paul McGuire

"Anton Vredegoor" <an*************@gmail.com> wrote in message
news:11*********************@g47g2000cwa.googlegroups.com...

I like your article and pyparsing. But since you ask for comments I'll
give some. For unchanging datafile formats pyparsing seems to be OK.
But for highly volatile data like videotext pages or maybe some html
tables one often has the experience of failure after investing some
time in writing a grammar because the dataformats seem to change
between the times one uses the script.

There are two types of parsers: design-driven and data-driven. With
design-driven parsing, you start with a BNF that defines your language or
data format, and then construct the corresponding grammar parser. As the
design evolves and expands (new features, keywords, additional options), the
parser has to be adjusted to keep up.

With data-driven parsing, you are starting with data to be parsed, and you
have to discern the patterns that structure this data. Data-driven parsing
usually shows this exact phenomenon that you describe, that new structures
that were not seen or recognized before arrive in new data files, and the
parser breaks. There are a number of steps you can take to make your parser
less fragile in the face of uncertain data inputs:
- using results names to access parsed tokens, instead of relying on simple
position within an array of tokens
- anticipating features that are not shown in the input data, but that are
known to be supported (for example, the grammar expressions returned by
pyparsing's makeHTMLTags method support arbitrary HTML attributes - this
creates a more robust parser than simply coding a parser or regexp to match
"'<A HREF=' + quotedString")
- accepting case-insensitive inputs
- accepting whitespace between adjacent tokens, but not requiring it -
pyparsing already does this for you
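
A minimal sketch of the habits listed above, written with Python's standard re module as a stand-in so that it runs without pyparsing installed (pyparsing's makeHTMLTags already handles this case for you). The patterns and the helper name anchor_attrs are hypothetical, chosen only for illustration: attributes are read by name rather than by position, any attribute is accepted in any order, case is ignored, and whitespace is allowed but not required.

```python
import re

# Hypothetical pattern for an opening anchor tag (illustration only).
ANCHOR = re.compile(
    r"""<\s*a\b            # tag name in any case, optional whitespace first
        (?P<attrs>[^>]*?)  # whatever attributes are present, in any order
        \s*>""",
    re.IGNORECASE | re.VERBOSE,
)

# name = "value" or name='value', with optional whitespace around '='.
ATTR = re.compile(
    r"""(?P<name>[\w-]+)\s*=\s*(?P<quote>["'])(?P<value>.*?)(?P=quote)"""
)


def anchor_attrs(html):
    """Return one dict per <a ...> tag, keyed by lower-cased attribute name."""
    found = []
    for tag in ANCHOR.finditer(html):
        attrs = {
            m.group("name").lower(): m.group("value")
            for m in ATTR.finditer(tag.group("attrs"))
        }
        found.append(attrs)
    return found
```

Calling anchor_attrs on a tag such as <A class="x" HREF = 'http://example.com' > gives a dictionary with the keys "class" and "href", so code that only cares about the link keeps working when new attributes show up in the data.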

For example, I had this
experience when parsing chess games from videotext pages I grab from my
videotext enabled TV capture card. Maybe once or twice in a year
there's a chess page with games on videotext, but videotext chess
display format always changes slightly in the meantime so I have to
adapt my script. For such things I've switched back to 'hand' coding
because it seems to be more flexible.

Do these chess games display in PGN format (for instance, "15. Bg5 Rf8 16.
a3 Bd5 17. Re1+ Nde5")? The examples directory that comes with pyparsing
includes a PGN parser (submitted by Alberto Santini).
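
For readers who have not seen PGN move text, here is a rough sketch of what pulling the moves out of a line like the one above involves. It is not the PGN parser from the pyparsing examples directory; it uses the standard re module, and the move pattern is a deliberate simplification (an assumption: no comments, annotations, variations, or game results).

```python
import re

# Simplified move notation (assumption): castling, or an optional piece
# letter, optional disambiguation, optional capture, target square,
# optional promotion, then an optional check or mate marker.
SAN = r"(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?"

# One numbered move: the number, White's move, and Black's move if present.
MOVE = re.compile(
    rf"(?P<number>\d+)\.\s*(?P<white>{SAN})(?:\s+(?P<black>{SAN}))?"
)


def parse_movetext(text):
    """Return a list of (move_number, white_move, black_move_or_None)."""
    return [
        (int(m["number"]), m["white"], m["black"])
        for m in MOVE.finditer(text)
    ]
```

Because the fields are named, a caller can ask for m["white"] or m["black"] directly instead of counting tokens, and the line break after "16." in the example is absorbed by the optional whitespace.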

What I would like to see, in order to improve on this situation is a
graphical (tkinter) editor-highlighter in which it would be possible to
select blocks of text from an (example) page and 'name' this block of
text and select a grammar which it complies with, in order to assign a
role to it later. That would be the perfect companion to pyparsing.

At the moment I don't even know if such a thing would be feasible...

There are some commercial parser generator products that work exactly this
way, so I'm sure it's feasible. Yes, this would be a huge enabler for
creating grammars.

Thank you for your ONLamp article and for making pyparsing available. I
had some fun experimenting with it and it gave me some insights in
parsing grammars.

Glad you enjoyed it, thanks for taking the time to reply!

-- Paul

Jan 28 '06 #3

Anton Vredegoor

Paul McGuire wrote:

There are two types of parsers: design-driven and data-driven. With
design-driven parsing, you start with a BNF that defines your language or
data format, and then construct the corresponding grammar parser. As the
design evolves and expands (new features, keywords, additional options), the
parser has to be adjusted to keep up.

With data-driven parsing, you are starting with data to be parsed, and you
have to discern the patterns that structure this data. Data-driven parsing
usually shows this exact phenomenon that you describe, that new structures
that were not seen or recognized before arrive in new data files, and the
parser breaks. There are a number of steps you can take to make your parser
less fragile in the face of uncertain data inputs:
- using results names to access parsed tokens, instead of relying on simple
position within an array of tokens
- anticipating features that are not shown in the input data, but that are
known to be supported (for example, the grammar expressions returned by
pyparsing's makeHTMLTags method support arbitrary HTML attributes - this
creates a more robust parser than simply coding a parser or regexp to match
"'<A HREF=' + quotedString")
- accepting case-insensitive inputs
- accepting whitespace between adjacent tokens, but not requiring it -
pyparsing already does this for you

I'd like to add another parser type; let's call it a natural language
parser type. Here we have to quickly adapt to human typing errors or
problems with the transmission channel. I think videotext pages offer
both kinds of challenges, so they could provide good training material.
Of course in such circumstances it seems to be hardly possible for a
computer alone to produce a correct parse. Sometimes I even have to
start up a chess program to inspect a game after parsing it into a PGN
file and correct unlikely or impossible move sequences. So since we're
now into human-assisted parsing anyway, would the most gain be made in
further improving the user interface?

For example, I had this
experience when parsing chess games from videotext pages I grab from my
videotext enabled TV capture card. Maybe once or twice in a year
there's a chess page with games on videotext, but videotext chess
display format always changes slightly in the meantime so I have to
adapt my script. For such things I've switched back to 'hand' coding
because it seems to be more flexible.

Do these chess games display in PGN format (for instance, "15. Bg5 Rf8 16.
a3 Bd5 17. Re1+ Nde5")? The examples directory that comes with pyparsing
includes a PGN parser (submitted by Alberto Santini).

Ah, now I remember, I think this was what got me started on pyparsing
some time ago. The Dutch videotext pages are online too (and there's a
game today):

http://teletekst.nos.nl/tekst/683-01.html

But as I said there can be transmission errors and human errors. And
the Dutch notation is used: for example, an L is a B (bishop), a P is
an N (knight), a D is a Q (queen), and a T is an R (rook). I'd be
interested in a parser that could make inferences about chess games
and use them to correct these pages!
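
The letter mapping itself is the easy part and can be sketched in a few lines. This assumes the usual Dutch piece names (loper, paard, dame, toren), with the knight written N in English notation and the king staying K; the function name is made up for the example. It does nothing about transmission errors or impossible moves.

```python
# Dutch piece letter -> English piece letter.
# K (koning / king) is the same in both, so it is left alone.
DUTCH_TO_ENGLISH = str.maketrans({"L": "B", "P": "N", "D": "Q", "T": "R"})


def dutch_to_english(movetext):
    """Translate Dutch piece letters in move text to English ones.

    Only upper-case piece letters are touched; files (a-h), ranks,
    move numbers and castling (O-O) pass through unchanged.
    """
    return movetext.translate(DUTCH_TO_ENGLISH)
```

For instance, dutch_to_english("17. Te1+ Pde5") gives "17. Re1+ Nde5", which is the form an English-notation PGN parser expects.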

What I would like to see, in order to improve on this situation is a
graphical (tkinter) editor-highlighter in which it would be possible to
select blocks of text from an (example) page and 'name' this block of
text and select a grammar which it complies with, in order to assign a
role to it later. That would be the perfect companion to pyparsing.

At the moment I don't even know if such a thing would be feasible...

There are some commercial parser generator products that work exactly this
way, so I'm sure it's feasible. Yes, this would be a huge enabler for
creating grammars.

And pave the way for a natural language parser. Maybe there's even some
(sketchy) path now to link computer languages and natural languages. In
my mind Python has always been closer to human languages than other
programming languages. From what I learned about it, language
recognition is the easy part, language production is what is hard. But
even the easy part has a long way to go, and since we're also using a
*visual* interface for something that in the end originates from sound
sequences (even what I type here is essentially a representation of a
verbal report) we have ultimately a difficult switch back to auditory
parsing ahead of us.

But in the meantime the tools produced (even if only for text parsing)
are already useful and entertaining. Keep up the good work.

Anton.

Jan 29 '06 #4

Christopher Subich

Anton Vredegoor wrote:

And pave the way for a natural language parser. Maybe there's even some
(sketchy) path now to link computer languages and natural languages. In
my mind Python has always been closer to human languages than other
programming languages. From what I learned about it, language
recognition is the easy part, language production is what is hard. But
even the easy part has a long way to go, and since we're also using a

I think you're underestimating just how far a "long" way to go is, for
natural language processing. I daresay that no current
computer-language parser will come even close to recognizing a
significant fraction of human language.

Using English, because that's the only language I'm fluent in, consider
the sentence:

"The horse raced past the barn fell."

It's just one of many "garden path sentences," where something that
occurs late in the sentence needs to trigger a reparse of the entire
sentence. This is made even worse because of the semantic meanings of
English words -- English, along with every other nonconstructed language
that I know of, is grammatically ambiguous, in that semantic meanings
are necessary to make 100% confident parses.

That's indeed the basis of a class of humour.

"Generating" human language -- turning concepts into words -- is the
easy part. A "concept->English" transformer would only need to
transform into a subset of English, and nobody will notice the difference.

--
It's just an object; it's not what you think.
:wq

Jan 30 '06 #5

Peter Hansen

Christopher Subich wrote:

Using English, because that's the only language I'm fluent in, consider
the sentence:

"The horse raced past the barn fell."

It's just one of many "garden path sentences," where something that
occurs late in the sentence needs to trigger a reparse of the entire
sentence.

I can't parse that at all. Are you sure it's correct? Aren't "raced"
and "fell" both trying to be verbs on the same subject? English surely
doesn't allow that forbids that sort of thing. (<wink>)

-Peter

Jan 30 '06 #6

Terry Reedy

"Peter Hansen" <pe***@engcorp.com> wrote in message
news:dr**********@sea.gmane.org...

Christopher Subich wrote:
Using English, because that's the only language I'm fluent in, consider
the sentence:

"The horse raced past the barn fell."

It's just one of many "garden path sentences," where something that
occurs late in the sentence needs to trigger a reparse of the entire
sentence.

I can't parse that at all.

Upon seeing 'fell' as the main verb, you have to reparse 'raced past the
barn' as not the predicate but as a past participle adjectival phrase, like
'bought last year' or 'expected to win'.
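
The reparse described here can be imitated with a toy backtracking parser. The grammar and lexicon below are invented for this one sentence and are in no way a model of English: "raced" is listed both as a main verb (V) and as a past participle (VPART), and the parser simply tries every rule and keeps the analyses that consume the whole sentence.

```python
# Toy lexicon (assumption): "raced" is ambiguous, everything else is not.
LEXICON = {
    "the": {"DET"},
    "horse": {"N"},
    "barn": {"N"},
    "raced": {"V", "VPART"},
    "fell": {"V"},
    "past": {"P"},
}

# Toy grammar (assumption). RRC is a reduced relative clause such as
# "raced past the barn" modifying "the horse".
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N", "RRC"], ["DET", "N"]],
    "RRC": [["VPART", "PP"]],
    "VP": [["V", "PP"], ["V"]],
    "PP": [["P", "NP"]],
}


def parse(symbol, words, start):
    """Yield (tree, next_position) for each way symbol matches words[start:]."""
    if symbol not in GRAMMAR:  # a word category such as DET or V
        if start < len(words) and symbol in LEXICON.get(words[start], ()):
            yield (symbol, words[start]), start + 1
        return
    for rule in GRAMMAR[symbol]:
        for children, end in match_sequence(rule, words, start):
            yield (symbol, *children), end


def match_sequence(symbols, words, start):
    """Yield (children, next_position) for each way to match symbols in order."""
    if not symbols:
        yield [], start
        return
    for tree, mid in parse(symbols[0], words, start):
        for rest, end in match_sequence(symbols[1:], words, mid):
            yield [tree] + rest, end


def parses(sentence):
    """Return every complete parse tree of the sentence under the toy grammar."""
    words = sentence.lower().rstrip(".").split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]
```

With this grammar, "The horse raced past the barn." parses with "raced" as the main verb, while "The horse raced past the barn fell." only succeeds once the parser backs up and takes "raced past the barn" as a modifier of "the horse", leaving "fell" as the verb.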

The phrase parsed as a predicate reparses as a modifier, as in this
sentence ;-)

Terry Jan Reedy

Jan 30 '06 #7

Dave Hansen

On Mon, 30 Jan 2006 16:39:51 -0500 in comp.lang.python, Peter Hansen
<pe***@engcorp.com> wrote:

Christopher Subich wrote:
Using English, because that's the only language I'm fluent in, consider
the sentence:

"The horse raced past the barn fell."

It's just one of many "garden path sentences," where something that
occurs late in the sentence needs to trigger a reparse of the entire
sentence.

I can't parse that at all. Are you sure it's correct? Aren't "raced"
and "fell" both trying to be verbs on the same subject? English surely
doesn't allow that forbids that sort of thing. (<wink>)

I had a heck of a time myself. Try "The horse that was raced..." and
see if it doesn't make more sense.

Regards,
-=Dave

--
Change is inevitable, progress is not.

Jan 30 '06 #8

Bengt Richter

On Mon, 30 Jan 2006 16:39:51 -0500, Peter Hansen <pe***@engcorp.com> wrote:

Christopher Subich wrote:
Using English, because that's the only language I'm fluent in, consider
the sentence:

"The horse raced past the barn fell."

It's just one of many "garden path sentences," where something that
occurs late in the sentence needs to trigger a reparse of the entire
sentence.

I can't parse that at all. Are you sure it's correct? Aren't "raced"
and "fell" both trying to be verbs on the same subject? English surely
doesn't allow that forbids that sort of thing. (<wink>)

The computer at CMU is pretty good at parsing. You can try it at
http://www.link.cs.cmu.edu/link/submit-sentence-4.html

Here's what it did with "The horse raced past the barn fell." :

++++Time 0.00 seconds (81.38 total)
Found 2 linkages (2 with no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=0 AND=0 LEN=13)

    +------------------------Xp------------------------+
    |            +----------------Ss---------------+   |
    +-----Wd-----+               +----Js----+      |   |
    |      +--Ds-+---Mv--+--MVp--+    +--Ds-+      |   |
    |      |     |       |       |    |     |      |   |
LEFT-WALL the horse.n raced.v past.p the barn.n fell.v .

Constituent tree:

(S (NP (NP The horse)
       (VP raced
           (PP past
               (NP the barn))))
   (VP fell)
   .)

IIUC, that's the way I parse it too ;-)

(I.e., "The horse [being] raced past the barn fell.")

BTW, the online response has some clickable elements in the diagram
to get to definitions of the terms.

Regards,
Bengt Richter

Jan 30 '06 #9

Steve Holden

Bengt Richter wrote:

On Mon, 30 Jan 2006 16:39:51 -0500, Peter Hansen <pe***@engcorp.com> wrote: [...]
The computer at CMU is pretty good at parsing. You can try it at
http://www.link.cs.cmu.edu/link/submit-sentence-4.html

Here's what it did with "The horse raced past the barn fell." :

[...]

I suppose we shouldn't torment these programs ...

++++Time 0.03 seconds (81.41 total)
Found 10 linkages (6 with no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=19)
                    +------Os-----+----------Bs*t---------+-----MVt-----+   +-
 +-Sp*i-+--Ce--+Sp*i+    +---D*u--+-----R----+--Cr--+--Ss-+--MVb-+      +Mpc+
 |      |      |    |    |        |          |      |     |      |      |   |
I.p thought.v I.p saw.v the langauge[?].n that.r Python is.v better.a than in

---Js----+
 +---Ds--+
 |       |
the corridor.n

Constituent tree:

(S (NP I)
   (VP thought
       (SBAR (S (NP I)
                (VP saw
                    (NP (NP the langauge)
                        (SBAR (WHNP that)
                              (S (NP Python)
                                 (VP is
                                     (ADVP better)
                                     (PP (NP than)
                                         (PP in
                                             (NP the corridor))))))) )))))
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Jan 30 '06 #10
