By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
455,534 Members | 1,322 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 455,534 IT Pros & Developers. It's quick & easy.

regex for balanced parentheses?

P: n/a
There's no regex that detects balanced parentheses,
or is there?

That is, search('and so ((x+y)+z) = (x+(y+z))')
should return '((x+y)+z)'.

Not just a theoretical question, I'm cleaning up
a large body of TeX code and a regex that did that
would be very convenient. (Yes, I know it's not
hard to detect balanced parentheses by hand...)

I don't know that stuff, but I seen to recall reading
that there's a theoretical notion of "regular expression"
in CS, and a regular expression in that sense cannot
do this. But in the same place I read that the actual
regexes in programming languages include things which
are not regular expressions in that theoretical sense.
David C. Ullrich
Jun 27 '08 #1
Share this Question
Share on Google+
5 Replies


P: n/a
On Jun 12, 6:06*am, David C. Ullrich <dullr...@sprynet.comwrote:
There's no regex that detects balanced parentheses,
or is there?

That is, search('and so ((x+y)+z) = (x+(y+z))')
should return '((x+y)+z)'.

Not just a theoretical question, I'm cleaning up
a large body of TeX code and a regex that did that
would be very convenient. (Yes, I know it's not
hard to detect balanced parentheses by hand...)

I don't know that stuff, but I seen to recall reading
that there's a theoretical notion of "regular expression"
in CS, and a regular expression in that sense cannot
do this. But in the same place I read that the actual
regexes in programming languages include things which
are not regular expressions in that theoretical sense.

David C. Ullrich
Pyparsing includes several helper methods for building common
expression patterns, such as delimitedList, oneOf, operatorPrecedence,
countedArray - and a fairly recent addition, nestedExpr. nestedExpr
creates an expression for matching nested text within opening and
closing delimiters, such as ()'s, []'s, {}'s, etc. The default
delimiters are ()'s. You can also specify a content expression, so
that pyparsing will look for and construct meaningful results. The
default is to return any text nested within the delimiters, broken up
by whitespace.

Here is your sample string parsed using the default nestedExpr:
>>from pyparsing import nestedExpr
for e in nestedExpr().searchString('and so ((x+y)+z) = (x+(y+z))'):
... print e[0]
...
[['x+y'], '+z']
['x+', ['y+z']]

Pyparsing found 2 matches in your input string. Note that the parens
are gone from the results - nestedExpr returns a nested list
structure, with nesting corresponding to the ()'s found in the
original string.

Pyparsing supports parse-time callbacks, called 'parse actions', and
it comes with several commonly used methods, such as removeQuotes,
upcaseTokens, and keepOriginalText. The purpose of keepOriginalText
is to revert any structuring or parsing an expression or other parse
actions might do, and just return the originally matched text.

Here is how keepOriginalText gives you back just the nested
parenthetical expressions, without any additional processing or
grouping:
>>from pyparsing import keepOriginalText
matchedParens = nestedExpr().setParseAction(keepOriginalText)
for e in matchedParens.searchString('and so ((x+y)+z) = (x+(y+z))'):
... print e[0]
...
((x+y)+z)
(x+(y+z))

-- Paul
Jun 27 '08 #2

P: n/a
On Thu, 12 Jun 2008 06:38:16 -0700 (PDT), Paul McGuire
<pt***@austin.rr.comwrote:
>On Jun 12, 6:06*am, David C. Ullrich <dullr...@sprynet.comwrote:
>There's no regex that detects balanced parentheses,
or is there?

[...]

Pyparsing includes several helper methods for building common
expression patterns, such as delimitedList, oneOf, operatorPrecedence,
countedArray - and a fairly recent addition, nestedExpr. nestedExpr
creates an expression for matching nested text within opening and
closing delimiters, such as ()'s, []'s, {}'s, etc.
Keen. Howdya know I wanted that? Thanks.

TeX is one of the amazing things about free software. Knuth
is great in many ways. He totally blew it in one detail,
unfortunately one that comes up a lot: '$' is an opening
delimiter, for which the corresponding closing delimiter
is also '$'. Then better yet, '$$' is another double-duty
delimiter... what I've done with that is first split
on '$$', taking the odd-numbered bits to be the parts
enclosed in $$..$$, and then taking the remining
parts and splitting on $. Not hard but it gives me a
creepy feeling.

Hence the question: Can pyparsing tell the difference
between '$' and '$'? (heh-heh).
The default
delimiters are ()'s. You can also specify a content expression, so
that pyparsing will look for and construct meaningful results. The
default is to return any text nested within the delimiters, broken up
by whitespace.

Here is your sample string parsed using the default nestedExpr:
>>>from pyparsing import nestedExpr
for e in nestedExpr().searchString('and so ((x+y)+z) = (x+(y+z))'):
... print e[0]
...
[['x+y'], '+z']
['x+', ['y+z']]

Pyparsing found 2 matches in your input string. Note that the parens
are gone from the results - nestedExpr returns a nested list
structure, with nesting corresponding to the ()'s found in the
original string.

Pyparsing supports parse-time callbacks, called 'parse actions', and
it comes with several commonly used methods, such as removeQuotes,
upcaseTokens, and keepOriginalText. The purpose of keepOriginalText
is to revert any structuring or parsing an expression or other parse
actions might do, and just return the originally matched text.

Here is how keepOriginalText gives you back just the nested
parenthetical expressions, without any additional processing or
grouping:
>>>from pyparsing import keepOriginalText
matchedParens = nestedExpr().setParseAction(keepOriginalText)
for e in matchedParens.searchString('and so ((x+y)+z) = (x+(y+z))'):
... print e[0]
...
((x+y)+z)
(x+(y+z))

-- Paul
David C. Ullrich
Jun 27 '08 #3

P: n/a
Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)

I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.

-- Paul
Jun 27 '08 #4

P: n/a
"Paul McGuire" <pt***@austin.rr.comwrote in message
news:20**********************************@a1g2000h sb.googlegroups.com...
Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)

I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.

-- Paul
Definitely agree that TeX can get pretty complicated. My method (writing a
converter from one TeX tag system to another) was to pre-parse using string
match/replace for really simple stuff, regular expressions for the more
complex and pyparsing for the really tough stuff.

One thing that was surprisingly hard for me to figure out was filtering out
comments. I finally just looped through the file line by line, looking for a
'%' that wasn't in a verbatim environment and wasn't escaped, etc.
Funny how sometimes the simplest thing can be difficult to handle.

Definitely pyparsing made the job possible; I can't imagine that job without
it.

--Tim Arnold

Jun 27 '08 #5

P: n/a
In article
<20**********************************@a1g2000hsb.g ooglegroups.com>,
Paul McGuire <pt***@austin.rr.comwrote:
Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)
Thanks. Not actually parsing TeX, just going through and
making little changes to a few things. Easier since I wrote
the original crap so I know that various things don't come
up (for example _every_ % is the start of a comment and
_none_ of those comments is actually important, they're
all just old stuff commented out.)
I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.

-- Paul
--
David C. Ullrich
Jun 27 '08 #6

This discussion thread is closed

Replies have been disabled for this discussion.