By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
444,035 Members | 1,324 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 444,035 IT Pros & Developers. It's quick & easy.

PyParsing and Headaches

P: n/a
Hi,

I'm trying to construct a parser, but I'm stuck with some basic
stuff... For example, I want to match the following:

letter = "A"..."Z" | "a"..."z"
literal = letter+
include_bool := "+" | "-"
term = [include_bool] literal

So I defined this as:

literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal

The problem is that:

term.parseString("+a") -(['+', 'a'], {}) # OK
term.parseString("+ a") -(['+', 'a'], {}) # KO. It shouldn't
recognize any token since I didn't said the SPACE was allowed between
include_bool and literal.

Can anyone give me an hand here?

Cheers!

Hugo Ferreira

BTW, the following is the complete grammar I'm trying to implement with
pyparsing:

## L ::= expr | expr L
## expr ::= term | binary_expr
## binary_expr ::= term " " binary_op " " term
## binary_op ::= "*" | "OR" | "AND"
## include_bool ::= "+" | "-"
## term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~"
literal)
## modifier ::= (letter | "_")+
## literal ::= word | quoted_words
## quoted_words ::= '"' word (" " word)* '"'
## word ::= (letter | digit | "_")+
## number ::= digit+
## range ::= number (".." | "...") number
## letter ::= "A"..."Z" | "a"..."z"
## digit ::= "0"..."9"

And this is where I got so far:

word = Word(nums + alphas + "_")
binary_op = oneOf("* and or", caseless=True).setResultsName("operator")
include_bool = oneOf("+ -")
literal = (word | quotedString).setResultsName("literal")
modifier = Word(alphas + "_")
rng = Word(nums) + (Literal("..") | Literal("...")) + Word(nums)
term = ((Optional(include_bool) + Optional(modifier + ":") + (literal |
rng)) | ("~" + literal)).setResultsName("Term")
binary_expr = (term + binary_op + term).setResultsName("binary")
expr = (binary_expr | term).setResultsName("Expr")
L = OneOrMore(expr)
--
GPG Fingerprint: B0D7 1249 447D F5BB 22C5 5B9B 078C 2615 504B 7B85

Nov 22 '06 #1
Share this Question
Share on Google+
4 Replies


P: n/a
On Wed, Nov 22, 2006 at 11:17:52AM -0800, Bytter wrote:
Hi,

I'm trying to construct a parser, but I'm stuck with some basic
stuff... For example, I want to match the following:

letter = "A"..."Z" | "a"..."z"
literal = letter+
include_bool := "+" | "-"
term = [include_bool] literal

So I defined this as:

literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal
+ here means that you allow a space. You need to explicitly override this.
Try:

term = Combine(include_bool + literal)
>
The problem is that:

term.parseString("+a") -(['+', 'a'], {}) # OK
term.parseString("+ a") -(['+', 'a'], {}) # KO. It shouldn't
recognize any token since I didn't said the SPACE was allowed between
include_bool and literal.

Can anyone give me an hand here?

Cheers!

Hugo Ferreira

BTW, the following is the complete grammar I'm trying to implement with
pyparsing:

## L ::= expr | expr L
## expr ::= term | binary_expr
## binary_expr ::= term " " binary_op " " term
## binary_op ::= "*" | "OR" | "AND"
## include_bool ::= "+" | "-"
## term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~"
literal)
## modifier ::= (letter | "_")+
## literal ::= word | quoted_words
## quoted_words ::= '"' word (" " word)* '"'
## word ::= (letter | digit | "_")+
## number ::= digit+
## range ::= number (".." | "...") number
## letter ::= "A"..."Z" | "a"..."z"
## digit ::= "0"..."9"

And this is where I got so far:

word = Word(nums + alphas + "_")
binary_op = oneOf("* and or", caseless=True).setResultsName("operator")
include_bool = oneOf("+ -")
literal = (word | quotedString).setResultsName("literal")
modifier = Word(alphas + "_")
rng = Word(nums) + (Literal("..") | Literal("...")) + Word(nums)
term = ((Optional(include_bool) + Optional(modifier + ":") + (literal |
rng)) | ("~" + literal)).setResultsName("Term")
binary_expr = (term + binary_op + term).setResultsName("binary")
expr = (binary_expr | term).setResultsName("Expr")
L = OneOrMore(expr)
--
GPG Fingerprint: B0D7 1249 447D F5BB 22C5 5B9B 078C 2615 504B 7B85

--
http://mail.python.org/mailman/listinfo/python-list
Nov 22 '06 #2

P: n/a
"Bytter" <by****@gmail.comwrote in message
news:11**********************@j44g2000cwa.googlegr oups.com...
Hi,

I'm trying to construct a parser, but I'm stuck with some basic
stuff... For example, I want to match the following:

letter = "A"..."Z" | "a"..."z"
literal = letter+
include_bool := "+" | "-"
term = [include_bool] literal

So I defined this as:

literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal

The problem is that:

term.parseString("+a") -(['+', 'a'], {}) # OK
term.parseString("+ a") -(['+', 'a'], {}) # KO. It shouldn't
recognize any token since I didn't said the SPACE was allowed between
include_bool and literal.
As Chris pointed out in his post, the most direct way to fix this is to use
Combine. Note that Combine does two things: it requires the expressions to
be adjacent, and it combines the results into a single token. For instance,
when defining the expression for a real number, something like:

realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

Pyparsing would parse "3.14159" into the separate tokens ['', '3', '.',
'14159']. For this grammar, pyparsing would also accept "2. 23" as ['',
'2', '.', '23'], even though there is a space between the decimal point and
"23". But by wrapping it inside Combine, as in:

realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

we accomplish two things: pyparsing only matches if all the elements are
adjacent, with no whitespace or comments; and the matched token is returned
as ['3.14159']. (Yes, I left off scientific notation, but it is an
extension of the same issue.)

Pyparsing in general does implicit whitespace skipping; it is part of the
zen of pyparsing, and distinguishes it from conventional regexps (although I
think there is a new '?' switch for re's that puts '\s*'s between re terms
for you). This is to simplify the grammar definition, so that it doesn't
need to be littered with "optional whitespace or comments could go here"
expressions; instead, whitespace and comments (or "ignorables" in pyparsing
terminology) are parsed over before every grammar expression. I instituted
this out of recoil from a previous project, in which a co-developer
implemented a boolean parser by first tokenizing by whitespace, then parsing
out the tokens. Unfortunately, this meant that "color=='blue' &&
size=='medium'" would not parse successfully, instead requiring "color ==
'blue' && size == 'medium'". It doesn't seem like much, but our support
guys got many calls asking why the boolean clauses weren't matching. I
decided that when I wrote a parser, "y=m*x+b" would be just as parseable as
"y = m * x + b". For that matter, you'd be surprised where whitespace and
comments sneak in to people's source code: spaces after left parentheses and
comments after semicolons, for example, are easily forgotten when spec'ing
out the syntax for a C "for" statement; whitespace inside HTML tags is
another unanticipated surprise.

So looking at your grammar, you say you don't want to have this be a
successful parse:
term.parseString("+ a") -(['+', 'a'], {})

because, "It shouldn't recognize any token since I didn't said the SPACE was
allowed between include_bool and literal." In fact, pyparsing allows spaces
by default, that's why the given parse succeeds. I would turn this question
around, and ask you in terms of your grammar - what SHOULD be allowed
between include_bool and literal? If spaces are not a problem, then your
grammar as-is is sufficient. If spaces are absolutely verboten, then there
are 2 or 3 different techniques in pyparsing to disable the
whitespace-skipping behavior, depending on whether you want all whitespace
skipping disabled, just for literals of a certain type, or just for literals
when following a leading include_bool sign.

Thanks for giving pyparsing a try; if you want further help, you can post
here, or on the pyparsing wiki - the discussion threads on the Home page are
a pretty good support and message log.

-- Paul
Nov 22 '06 #3

P: n/a
(This message has already been sent to the mailing-list, but I don't
have sure this is arriving well since it doesn't come up in the usenet,
so I'm posting it through here now.)

Chris,

Thanks for your quick answer. That changes a lot of stuff, and now I'm
able to do my parsing as I intended to.

Still, there's a remaining problem. By using Combine(), everything is
interpreted as a single token. Though what I need is that
'include_bool' and 'literal' be parsed as separated tokens, though
without a space in the middle...

Paul,

Thanks for your detailed explanation. One of the things I think is
missing from the documentation (or that I couldn't find easy) is the
kind of explanation you give about 'The Way of PyParsing'. For example,
It took me a while to understand that I could easily implement simple
recursions using OneOrMany(Group()). Or maybe things were out there and
I didn't searched enough...

Still, fwiw, congratulations for the library. PyParsing allowed me to
do in just a couple of hours, including learning about it's API (minus
this little inconvenient) what would have taken me a couple of days
with, for example, ANTLR (in fact, I've already put aside ANTLR more
than once in the past for a built-from-scratch parser).

Cheers,

Hugo Ferreira

On Nov 22, 7:50 pm, Chris Lambacher <c...@kateandchris.netwrote:
On Wed, Nov 22, 2006 at 11:17:52AM -0800, Bytter wrote:
Hi,
I'm trying to construct a parser, but I'm stuck with some basic
stuff... For example, I want to match the following:
letter = "A"..."Z" | "a"..."z"
literal = letter+
include_bool := "+" | "-"
term = [include_bool] literal
So I defined this as:
literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal+ here means that you allow a space. You need to explicitly override this.
Try:

term = Combine(include_bool + literal)
The problem is that:
term.parseString("+a") -(['+', 'a'], {}) # OK
term.parseString("+ a") -(['+', 'a'], {}) # KO. It shouldn't
recognize any token since I didn't said the SPACE was allowed between
include_bool and literal.
Can anyone give me an hand here?
Cheers!
Hugo Ferreira
BTW, the following is the complete grammar I'm trying to implement with
pyparsing:
## L ::= expr | expr L
## expr ::= term | binary_expr
## binary_expr ::= term " " binary_op " " term
## binary_op ::= "*" | "OR" | "AND"
## include_bool ::= "+" | "-"
## term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~"
literal)
## modifier ::= (letter | "_")+
## literal ::= word | quoted_words
## quoted_words ::= '"' word (" " word)* '"'
## word ::= (letter | digit | "_")+
## number ::= digit+
## range ::= number (".." | "...") number
## letter ::= "A"..."Z" | "a"..."z"
## digit ::= "0"..."9"
And this is where I got so far:
word = Word(nums + alphas + "_")
binary_op = oneOf("* and or", caseless=True).setResultsName("operator")
include_bool = oneOf("+ -")
literal = (word | quotedString).setResultsName("literal")
modifier = Word(alphas + "_")
rng = Word(nums) + (Literal("..") | Literal("...")) + Word(nums)
term = ((Optional(include_bool) + Optional(modifier + ":") + (literal |
rng)) | ("~" + literal)).setResultsName("Term")
binary_expr = (term + binary_op + term).setResultsName("binary")
expr = (binary_expr | term).setResultsName("Expr")
L = OneOrMore(expr)
--
GPG Fingerprint: B0D7 1249 447D F5BB 22C5 5B9B 078C 2615 504B 7B85
--
http://mail.python.org/mailman/listinfo/python-list
Nov 23 '06 #4

P: n/a
Heya there,

Ok, found the solution. I just needed to use leaveWhiteSpace() in the
places I want pyparsing to take into consideration the spaces.
Thx for the help.

Cheers!

Hugo Ferreira

On Nov 23, 11:57 am, "Bytter" <byt...@gmail.comwrote:
(This message has already been sent to the mailing-list, but I don't
have sure this is arriving well since it doesn't come up in the usenet,
so I'm posting it through here now.)

Chris,

Thanks for your quick answer. That changes a lot of stuff, and now I'm
able to do my parsing as I intended to.

Still, there's a remaining problem. By using Combine(), everything is
interpreted as a single token. Though what I need is that
'include_bool' and 'literal' be parsed as separated tokens, though
without a space in the middle...

Paul,

Thanks for your detailed explanation. One of the things I think is
missing from the documentation (or that I couldn't find easy) is the
kind of explanation you give about 'The Way of PyParsing'. For example,
It took me a while to understand that I could easily implement simple
recursions using OneOrMany(Group()). Or maybe things were out there and
I didn't searched enough...

Still, fwiw, congratulations for the library. PyParsing allowed me to
do in just a couple of hours, including learning about it's API (minus
this little inconvenient) what would have taken me a couple of days
with, for example, ANTLR (in fact, I've already put aside ANTLR more
than once in the past for a built-from-scratch parser).

Cheers,

Hugo Ferreira

On Nov 22, 7:50 pm, Chris Lambacher <c...@kateandchris.netwrote:
On Wed, Nov 22, 2006 at 11:17:52AM -0800, Bytter wrote:
Hi,
I'm trying to construct a parser, but I'm stuck with some basic
stuff... For example, I want to match the following:
letter = "A"..."Z" | "a"..."z"
literal = letter+
include_bool := "+" | "-"
term = [include_bool] literal
So I defined this as:
literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal+ here means that you allow a space. You need to explicitly override this.
Try:
term = Combine(include_bool + literal)
The problem is that:
term.parseString("+a") -(['+', 'a'], {}) # OK
term.parseString("+ a") -(['+', 'a'], {}) # KO. It shouldn't
recognize any token since I didn't said the SPACE was allowed between
include_bool and literal.
Can anyone give me an hand here?
Cheers!
Hugo Ferreira
BTW, the following is the complete grammar I'm trying to implement with
pyparsing:
## L ::= expr | expr L
## expr ::= term | binary_expr
## binary_expr ::= term " " binary_op " " term
## binary_op ::= "*" | "OR" | "AND"
## include_bool ::= "+" | "-"
## term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~"
literal)
## modifier ::= (letter | "_")+
## literal ::= word | quoted_words
## quoted_words ::= '"' word (" " word)* '"'
## word ::= (letter | digit | "_")+
## number ::= digit+
## range ::= number (".." | "...") number
## letter ::= "A"..."Z" | "a"..."z"
## digit ::= "0"..."9"
And this is where I got so far:
word = Word(nums + alphas + "_")
binary_op = oneOf("* and or", caseless=True).setResultsName("operator")
include_bool = oneOf("+ -")
literal = (word | quotedString).setResultsName("literal")
modifier = Word(alphas + "_")
rng = Word(nums) + (Literal("..") | Literal("...")) + Word(nums)
term = ((Optional(include_bool) + Optional(modifier + ":") + (literal |
rng)) | ("~" + literal)).setResultsName("Term")
binary_expr = (term + binary_op + term).setResultsName("binary")
expr = (binary_expr | term).setResultsName("Expr")
L = OneOrMore(expr)
--
GPG Fingerprint: B0D7 1249 447D F5BB 22C5 5B9B 078C 2615 504B 7B85
--
>http://mail.python.org/mailman/listinfo/python-list
Nov 23 '06 #5

This discussion thread is closed

Replies have been disabled for this discussion.