468,140 Members | 1,406 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 468,140 developers. It's quick & easy.

regular expression for nested parentheses

I have been trying to write a regular expression that identifies a
block of text enclosed by (potentially nested) parentheses. I've found
solutions using other regular expression engines (for example, my text
editor, BBEdit, which uses the PCRE library), but have not been able
to replicate it using python's re module.

Here's a version that works using the PCRE syntax, along with the
python error message. I'm hoping for this to identify the string '(foo
(bar) (baz))'

% python -V
Python 2.5.1
% python
pyimport re
pytext = 'buh (foo (bar) (baz)) blee'
pyno_ws = lambda s: ''.join(s.split())
pyrexp = r"""(?P<parens>
.... \(
.... (?>
.... (?[^()]+ ) |
.... (?P>parens)
.... )*
.... \)
.... )"""
pyprint re.findall(no_ws(rexp), text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/
python2.5/re.py", line 167, in findall
return _compile(pattern, flags).findall(string)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/
python2.5/re.py", line 233, in _compile
raise error, v # invalid expression
sre_constants.error: unexpected end of pattern

From what I understand of the PCRE syntax, the (?>) construct is a non-
capturing subpattern, and (?P>parens) is a recursive call to the
enclosing (named) pattern. So my best guess at a python equivalent is
this:

pyrexp2 = r"""(?P<parens>
.... \(
.... (?=
.... (?= [^()]+ ) |
.... (?P=parens)
.... )*
.... \)
.... )"""
pyprint re.findall(no_ws(rexp2), text)
[]

....which results in no match. I've played around quite a bit with
variations on this theme, but haven't been able to come up with one
that works.

Can anyone help me understand how to construct a regular expression
that does the job in python?

Thanks -

Dec 9 '07 #1
5 7685
On Dec 10, 8:13 am, Noah Hoffman <noah.hoff...@gmail.comwrote:
I have been trying to write a regular expression that identifies a
block of text enclosed by (potentially nested) parentheses. I've found
solutions using other regular expression engines (for example, my text
editor, BBEdit, which uses the PCRE library), but have not been able
to replicate it using python's re module.
A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.

Looks like you need a parser; try pyparsing.

[snip]
pyrexp = r"""(?P<parens>
... \(
... (?>
... (?[^()]+ ) |
... (?P>parens)
... )*
... \)
... )"""
pyprint re.findall(no_ws(rexp), text)
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?

Dec 9 '07 #2
On Dec 9, 1:41 pm, John Machin <sjmac...@lexicon.netwrote:
A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.
Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?
Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?
Dec 9 '07 #3
On Dec 10, 8:53 am, Noah Hoffman <noah.hoff...@gmail.comwrote:
On Dec 9, 1:41 pm, John Machin <sjmac...@lexicon.netwrote:
A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.

Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?

Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?
Under a very limited definition of "work". That technique would not
produce correct answers on patterns that contain any *significant*
whitespace e.g. you want to match "foo" and "bar" separated by one or
more spaces (but not tabs, newlines etc) ....
pattern = r"""
foo
[ ]+
bar
"""
Dec 9 '07 #4
On Dec 9, 10:12 pm, John Machin <sjmac...@lexicon.netwrote:
On Dec 10, 8:53 am, Noah Hoffman <noah.hoff...@gmail.comwrote:
On Dec 9, 1:41 pm, John Machin <sjmac...@lexicon.netwrote:
A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.
Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?
Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?

Under a very limited definition of "work". That technique would not
produce correct answers on patterns that contain any *significant*
whitespace e.g. you want to match "foo" and "bar" separated by one or
more spaces (but not tabs, newlines etc) ....
pattern = r"""
foo
[ ]+
bar
"""
You can also escape a literal space:

pattern = r"""
foo
\ +
bar
"""
Dec 10 '07 #5
On Dec 10, 12:22 pm, MRAB <goo...@mrabarnett.plus.comwrote:
On Dec 9, 10:12 pm, John Machin <sjmac...@lexicon.netwrote:


On Dec 10, 8:53 am, Noah Hoffman <noah.hoff...@gmail.comwrote:
On Dec 9, 1:41 pm, John Machin <sjmac...@lexicon.netwrote:
A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.
Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?
Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?
Under a very limited definition of "work". That technique would not
produce correct answers on patterns that contain any *significant*
whitespace e.g. you want to match "foo" and "bar" separated by one or
more spaces (but not tabs, newlines etc) ....
pattern = r"""
foo
[ ]+
bar
"""

You can also escape a literal space:

pattern = r"""
foo
\ +
bar
"""
I know that. *Any* method of putting in a literal significant space is
clobbered by the OP's "trick" of removing *all* whitespace instead of
using the VERBOSE flag, which also permits comments:
pattern = r"""
\ + # ugly
[ ]+ # not quite so ugly
"""
Dec 10 '07 #6

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

5 posts views Thread by joemono | last post: by
9 posts views Thread by MJ | last post: by
10 posts views Thread by Lee Kuhn | last post: by
5 posts views Thread by Tony Marston | last post: by
5 posts views Thread by shawnmkramer | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.