[pyparsing] make sure entire string was parsed

Steven Bethard

How do I make sure that my entire string was parsed when I call a
pyparsing element's parseString method? Here's a dramatically
simplified version of my problem:

py> import pyparsing as pp
py> match = pp.Word(pp.nums)
py> def parse_num(s, loc, toks):
...     n, = toks
...     return int(n) + 10
...
py> match.setParseAction(parse_num)
W:(0123...)
py> match.parseString('121abc')
([131], {})

I want to know (somehow) that when I called match.parseString(), there
was some of the string left over (in this case, 'abc') after the parse
was complete. How can I do this? (I don't think I can do character
counting; all my internal setParseAction() functions return non-strings).

STeVe

P.S. FWIW, I've included the real code below. I need to throw an
exception when I call the parseString method of cls._root_node or
cls._root_nodes and the entire string is not consumed.

----------------------------------------------------------------------
# some character classes
printables_trans = _pp.printables.translate
word_chars = printables_trans(_id_trans, '()')
syn_tag_chars = printables_trans(_id_trans, '()-=')
func_tag_chars = printables_trans(_id_trans, '()-=0123456789')

# basic tag components
sep = _pp.Literal('-').leaveWhitespace()
alt_sep = _pp.Literal('=').leaveWhitespace()
special_word = _pp.Combine(sep + _pp.Word(syn_tag_chars) + sep)
supp_sep = (alt_sep | sep).suppress()
syn_word = _pp.Word(syn_tag_chars).leaveWhitespace()
func_word = _pp.Word(func_tag_chars).leaveWhitespace()
id_word = _pp.Word(_pp.nums).leaveWhitespace()

# the different tag types
special_tag = special_word.setResultsName('tag')
syn_tag = syn_word.setResultsName('tag')
func_tags = _pp.ZeroOrMore(supp_sep + func_word)
func_tags = func_tags.setResultsName('funcs')
id_tag = _pp.Optional(supp_sep + id_word).setResultsName('id')
tags = special_tag | (syn_tag + func_tags + id_tag)
def get_tag(orig_string, tokens_start, tokens):
    tokens = dict(tokens)
    tag = tokens.pop('tag')
    if tag == '-NONE-':
        tag = None
    functions = list(tokens.pop('funcs', []))
    id = tokens.pop('id', None)
    return [dict(tag=tag, functions=functions, id=id)]
tags.setParseAction(get_tag)

# node parentheses
start = _pp.Literal('(').suppress()
end = _pp.Literal(')').suppress()

# words
word = _pp.Word(word_chars).setResultsName('word')

# leaf nodes
leaf_node = tags + _pp.Optional(word)
def get_leaf_node(orig_string, tokens_start, tokens):
    try:
        tag_dict, word = tokens
        word = cls._unescape(word)
    except ValueError:
        tag_dict, = tokens
        word = None
    return cls(word=word, **tag_dict)
leaf_node.setParseAction(get_leaf_node)

# node, recursive
node = _pp.Forward()

# branch nodes
branch_node = tags + _pp.OneOrMore(node)
def get_branch_node(orig_string, tokens_start, tokens):
    return cls(children=tokens[1:], **tokens[0])
branch_node.setParseAction(get_branch_node)

# node, recursive
node << start + (branch_node | leaf_node) + end

# root node may have additional parentheses
cls._root_node = node | start + node + end
cls._root_nodes = _pp.OneOrMore(cls._root_node)

Sep 10 '05 #1

Paul McGuire

Steven -

Thanks for giving pyparsing a try! To see whether your input text
consumes the whole string, add a StringEnd() element to the end of your
BNF. Then if there is more text after the parsed text, parseString
will throw a ParseException.
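
A minimal sketch of this suggestion, using the simplified number grammar from the question rather than the full treebank grammar. The names `strict_number`, `leftover_detected` and `parse_all_rejected` are made up for illustration. The last part assumes a reasonably recent pyparsing release, where parseString also accepts a parseAll=True argument that has the same effect as appending StringEnd().

```python
import pyparsing as pp

number = pp.Word(pp.nums)

# Without an end-of-string check, the leftover 'abc' is silently ignored.
print(number.parseString("121abc").asList())

# Appending StringEnd() turns any trailing text into a parse error.
strict_number = number + pp.StringEnd()
try:
    strict_number.parseString("121abc")
    leftover_detected = False
except pp.ParseException as err:
    leftover_detected = True
    print("parse failed:", err)

# Assumption: a pyparsing version that supports parseAll (not available in 2005).
try:
    number.parseString("121abc", parseAll=True)
    parse_all_rejected = False
except pp.ParseException:
    parse_all_rejected = True
```

A fully numeric input such as "121" still parses with either variant, so the check only rejects strings with unconsumed text.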

I notice you call leaveWhitespace on several of your parse elements, so
you may have to rstrip() the input text before calling parseString. I
am curious whether leaveWhitespace is really necessary for your
grammar. If it is, you can usually just call leaveWhitespace on the
root element, and this will propagate to all the sub elements.

Lastly, you may get caught up with operator precedence, I think your
node assignment statement may need to change from
node << start + (branch_node | leaf_node) + end
to
node << (start + (branch_node | leaf_node) + end)

HTH,
-- Paul

Sep 11 '05 #2

Steven Bethard

Paul McGuire wrote:

> Thanks for giving pyparsing a try! To see whether your input text
> consumes the whole string, add a StringEnd() element to the end of your
> BNF. Then if there is more text after the parsed text, parseString
> will throw a ParseException.

Thanks, that's exactly what I was looking for.

> I notice you call leaveWhitespace on several of your parse elements, so
> you may have to rstrip() the input text before calling parseString. I
> am curious whether leaveWhitespace is really necessary for your
> grammar. If it is, you can usually just call leaveWhitespace on the
> root element, and this will propagate to all the sub elements.

Yeah, sorry, I was still messing around with that part of the code. My
problem is that I have to differentiate between:

(NP -x-y)

and:

(NP-x -y)

I'm doing this now using Combine. Does that seem right?

> Lastly, you may get caught up with operator precedence, I think your
> node assignment statement may need to change from
> node << start + (branch_node | leaf_node) + end
> to
> node << (start + (branch_node | leaf_node) + end)

I think I'm okay:

py> 2 << 1 + 2
16
py> (2 << 1) + 2
6
py> 2 << (1 + 2)
16

Thanks for the help!

STeVe

Sep 11 '05 #3

Paul McGuire

Steve -

> I have to differentiate between:
> (NP -x-y)
> and:
> (NP-x -y)
> I'm doing this now using Combine. Does that seem right?

If your word char set is just alphanums+"-", then this will work
without doing anything unnatural with leaveWhitespace:

from pyparsing import *

thing = Word(alphanums + "-")
LPAREN = Literal("(").suppress()
RPAREN = Literal(")").suppress()
node = LPAREN + OneOrMore(thing) + RPAREN

print(node.parseString("(NP -x-y)"))
print(node.parseString("(NP-x -y)"))

will print:

['NP', '-x-y']
['NP-x', '-y']

Your examples helped me to see what my operator precedence concern was.
Fortunately, your usage was an And, composed using '+' operators. If
your construct had been a MatchFirst, composed using '|' operators,
things would not be so pretty:

print(2 << 1 | 3)
print(2 << (1 | 3))

7
16

So I've just gotten into the habit of parenthesizing anything I load
into a Forward using '<<'.
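
The grouping described here can be checked without pyparsing at all. The sketch below uses the standard-library ast module to ask Python which operator it applies last in an expression; the helper name `top_level_operator` and the identifiers inside the expression strings are illustrative only and are never evaluated.

```python
import ast

def top_level_operator(expression):
    """Return the name of the operator Python applies last in `expression`."""
    tree = ast.parse(expression, mode="eval")
    return type(tree.body.op).__name__

# '+' binds tighter than '<<', so the whole And is loaded into the Forward.
print(top_level_operator("node << start + body + end"))   # LShift

# '|' binds looser than '<<', so only 'branch' is loaded into the Forward.
print(top_level_operator("node << branch | leaf"))        # BitOr

# Parentheses restore the intended grouping.
print(top_level_operator("node << (branch | leaf)"))      # LShift
```

When the last operator is BitOr, the `<<` has already been applied to the first alternative alone, which is why parenthesizing the right-hand side is the safe habit.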

-- Paul

Sep 11 '05 #4

Steven Bethard

Paul McGuire wrote:

>> I have to differentiate between:
>> (NP -x-y)
>> and:
>> (NP-x -y)
>> I'm doing this now using Combine. Does that seem right?
>
> If your word char set is just alphanums+"-", then this will work
> without doing anything unnatural with leaveWhitespace:
>
> from pyparsing import *
>
> thing = Word(alphanums + "-")
> LPAREN = Literal("(").suppress()
> RPAREN = Literal(")").suppress()
> node = LPAREN + OneOrMore(thing) + RPAREN
>
> print(node.parseString("(NP -x-y)"))
> print(node.parseString("(NP-x -y)"))
>
> will print:
>
> ['NP', '-x-y']
> ['NP-x', '-y']

I actually need to break these into:

['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

I know the dict syntax afterwards isn't quite what pyparsing would
output, but hopefully my intent is clear. I need to use the dict-style
results from setResultsName() calls because in the full grammar, I have
a lot of optional elements. For example:

(NP-1 -a)
--> {'tag':'NP', 'id':'1', 'word':'-a'}
(NP-x-2 -B)
--> {'tag':'NP', 'functions':['x'], 'id':'2', 'word':'-B'}
(NP-x-y=2-3 -4)
--> {'tag':'NP', 'functions':['x', 'y'], 'coord':'2', 'id':'3',
'word':'-4'}
(-NONE- x)
--> {'tag':None, 'word':'x'}

STeVe

P.S. In case you're curious, here's my current draft of the code:

# some character classes
printables_trans = _pp.printables.translate
word_chars = printables_trans(_id_trans, '()')
word_elem = _pp.Word(word_chars)
syn_chars = printables_trans(_id_trans, '()-=')
syn_word = _pp.Word(syn_chars)
func_chars = printables_trans(_id_trans, '()-=0123456789')
func_word = _pp.Word(func_chars)
num_word = _pp.Word(_pp.nums)

# tag separators
dash = _pp.Literal('-')
tag_sep = dash.suppress()
coord_sep = _pp.Literal('=').suppress()

# tag types (use Combine to guarantee no spaces)
special_tag = _pp.Combine(dash + syn_word + dash)
syn_tag = syn_word
func_tags = _pp.ZeroOrMore(_pp.Combine(tag_sep + func_word))
coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
id_tag = _pp.Optional(_pp.Combine(tag_sep + num_word))

# give tag types result names
special_tag = special_tag.setResultsName('tag')
syn_tag = syn_tag.setResultsName('tag')
func_tags = func_tags.setResultsName('funcs')
coord_tag = coord_tag.setResultsName('coord')
id_tag = id_tag.setResultsName('id')

# combine tag types into a tags element
normal_tags = syn_tag + func_tags + coord_tag + id_tag
tags = special_tag | _pp.Combine(normal_tags)
def get_tag(orig_string, tokens_start, tokens):
    tokens = dict(tokens)
    tag = tokens.pop('tag')
    if tag == '-NONE-':
        tag = None
    functions = list(tokens.pop('funcs', []))
    coord = tokens.pop('coord', None)
    id = tokens.pop('id', None)
    return [dict(tag=tag, functions=functions,
                 coord=coord, id=id)]
tags.setParseAction(get_tag)

# node parentheses
start = _pp.Literal('(').suppress()
end = _pp.Literal(')').suppress()

# words
word = word_elem.setResultsName('word')

# leaf nodes
leaf_node = tags + _pp.Optional(word)
def get_leaf_node(orig_string, tokens_start, tokens):
    try:
        tag_dict, word = tokens
        word = cls._unescape(word)
    except ValueError:
        tag_dict, = tokens
        word = None
    return cls(word=word, **tag_dict)
leaf_node.setParseAction(get_leaf_node)

# node, recursive
node = _pp.Forward()

# branch nodes
branch_node = tags + _pp.OneOrMore(node)
def get_branch_node(orig_string, tokens_start, tokens):
    return cls(children=tokens[1:], **tokens[0])
branch_node.setParseAction(get_branch_node)

# node, recursive
node << start + (branch_node | leaf_node) + end

# root node may have additional parentheses
root_node = node | start + node + end
root_nodes = _pp.OneOrMore(root_node)

# make sure nodes start and end string
str_start = _pp.StringStart()
str_end = _pp.StringEnd()
cls._root_node = str_start + root_node + str_end
cls._root_nodes = str_start + root_nodes + str_end

Sep 12 '05 #5

Steven Bethard

Steven Bethard wrote:

> Paul McGuire wrote:
>>> I have to differentiate between:
>>> (NP -x-y)
>>> and:
>>> (NP-x -y)
>>> I'm doing this now using Combine. Does that seem right?
>>
>> If your word char set is just alphanums+"-", then this will work
>> without doing anything unnatural with leaveWhitespace:
>>
>> from pyparsing import *
>>
>> thing = Word(alphanums + "-")
>> LPAREN = Literal("(").suppress()
>> RPAREN = Literal(")").suppress()
>> node = LPAREN + OneOrMore(thing) + RPAREN
>>
>> print(node.parseString("(NP -x-y)"))
>> print(node.parseString("(NP-x -y)"))
>>
>> will print:
>>
>> ['NP', '-x-y']
>> ['NP-x', '-y']
>
> I actually need to break these into:
>
> ['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
> ['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

Oops, sorry, the last line should have been:

['NP', 'x', '-y'] {'tag':'NP', 'functions':['x'], 'word':'-y'}

Sorry to introduce confusion into an already confusing parsing problem. ;)

STeVe

Sep 12 '05 #6

Paul McGuire

Steve -

Wow, this is a pretty dense pyparsing program. You are really pushing
the envelope in your use of ParseResults, dicts, etc., but pretty much
everything seems to be working.

I still don't know the BNF you are working from, but here are some
other "shots in the dark":

1. I'm surprised func_word does not permit numbers anywhere in the
body. Is this just a feature you have not implemented yet? As long as
func_word does not start with a digit, you can still define one
unambiguously to allow numbers after the first character if you define
func_word as

func_word = _pp.Word(func_chars, func_chars + _pp.nums)

Perhaps similar for syn_word as well.

2. Is coord an optional sub-element of a func? If so, you might want
to group them so that they stay together, something like:

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
func_tags = _pp.ZeroOrMore(_pp.Group(tag_sep + func_word + coord_tag))

You might also add a default value for coord_tag if none is supplied,
to simplify your parse action?

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word), None)

Now the coords and funcs will be kept together.

3. Of course, you are correct in using Combine to ensure that you only
accept adjacent characters. But you only need to use it at the
outermost level.

4. You can use several dict-like functions directly on a ParseResults
object, such as keys(), items(), values(), in, etc. Also, the []
notation and the .attribute notation are nearly identical, except that
[] refs on a missing element will raise a KeyError, while .attribute
will always return something. For instance, in your example, the
get_tag() parse action uses dict.pop() to extract the 'coord' field. If
coord is present, you could retrieve it using "tokens['coord']" or
"tokens.coord". If coord is missing, "tokens['coord']" will raise a
KeyError, but tokens.coord will return an empty string. If you need to
"listify" a ParseResults, try calling asList().

It's not clear to me what if any further help you are looking for, now
that your initial question (about StringEnd()) has been answered. But
please let us know how things work out.
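
Point 4 can be tried on a toy grammar. This is only a sketch: the grammar below (a tag followed by an optional "=number" coord) is a hypothetical stand-in for the real tag grammar, and the variable names are made up for illustration.

```python
import pyparsing as pp

tag = pp.Word(pp.alphas).setResultsName("tag")
coord = pp.Optional(pp.Suppress("=") + pp.Word(pp.nums).setResultsName("coord"))
grammar = tag + coord

with_coord = grammar.parseString("NP=2")
without_coord = grammar.parseString("NP")

# Dict-style membership tests work directly on the ParseResults.
print("coord" in with_coord, "coord" in without_coord)

# Both notations agree when the name is present.
print(with_coord["coord"], with_coord.coord)

# Attribute access on a missing name returns an empty string...
missing_attr = without_coord.coord

# ...while [] access on a missing name raises KeyError.
try:
    without_coord["coord"]
    key_error_raised = False
except KeyError:
    key_error_raised = True

# asList() gives a plain Python list of the matched tokens.
print(with_coord.asList())
```

In a parse action, the attribute form can therefore replace a dict.pop(name, default) call whenever an empty string is an acceptable "missing" value.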

-- Paul

Sep 13 '05 #7

Steven Bethard

Paul McGuire wrote:

> I still don't know the BNF you are working from

Just to satisfy any curiosity you might have, it's the Penn Treebank
format: http://www.cis.upenn.edu/~treebank/
(Except that the actual Penn Treebank data unfortunately differs from
the format spec in a few ways.)

> 1. I'm surprised func_word does not permit numbers anywhere in the
> body. Is this just a feature you have not implemented yet? As long as
> func_word does not start with a digit, you can still define one
> unambiguously to allow numbers after the first character if you define
> func_word as
>
> func_word = _pp.Word(func_chars, func_chars + _pp.nums)

Ahh, very nice. The spec's vague, but this is probably what I want to do.
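
The two-argument form of Word quoted above takes the allowed first characters and then the allowed body characters. A small sketch, assuming plain letters in place of the thread's func_chars (which is built from a translation table not shown in full here):

```python
import pyparsing as pp

# Assumption: letters stand in for func_chars; digits are allowed only
# after the first character.
func_word = pp.Word(pp.alphas, pp.alphas + pp.nums)

# A word that starts with a letter may contain digits later on.
print(func_word.parseString("x2y").asList())

# A word that starts with a digit does not match at all.
try:
    func_word.parseString("2xy")
    leading_digit_rejected = False
except pp.ParseException:
    leading_digit_rejected = True
```

Because a leading digit can never start a func_word, a purely numeric field such as the id can still be told apart from a function tag.
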
> 2. Is coord an optional sub-element of a func?

No, functions, coord and id are optional sub-elements of the tags string.

> You might also add a default value for coord_tag if none is supplied,
> to simplify your parse action?

Oh, that's nice. I missed that functionality.
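
A short sketch of the default-value behaviour being discussed. The element names follow the thread, but `func_word` and `grammar` here are simplified, hypothetical versions and not the real treebank grammar.

```python
import pyparsing as pp

num_word = pp.Word(pp.nums)
coord_sep = pp.Suppress("=")
func_word = pp.Word(pp.alphas)   # simplified stand-in for the thread's func_word

# The second argument to Optional is the token to emit when nothing matches.
coord_tag = pp.Optional(pp.Combine(coord_sep + num_word), default=None)
grammar = func_word + coord_tag

# With a coord present, the matched digits are returned.
print(grammar.parseString("x=2").asList())

# Without one, the default fills the slot, so the result has a fixed length.
print(grammar.parseString("x").asList())
```

A parse action can then unpack the tokens positionally without first checking whether the optional part matched.
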
> It's not clear to me what if any further help you are looking for, now
> that your initial question (about StringEnd()) has been answered.

Yes, thanks, you definitely answered the initial question. And your
followup commentary was also very helpful. Thanks again!

STeVe

Sep 13 '05 #8
