regex for balanced parentheses?

David C. Ullrich

There's no regex that detects balanced parentheses,
or is there?

That is, search('and so ((x+y)+z) = (x+(y+z))')
should return '((x+y)+z)'.

Not just a theoretical question, I'm cleaning up
a large body of TeX code and a regex that did that
would be very convenient. (Yes, I know it's not
hard to detect balanced parentheses by hand...)

I don't know that stuff, but I seen to recall reading
that there's a theoretical notion of "regular expression"
in CS, and a regular expression in that sense cannot
do this. But in the same place I read that the actual
regexes in programming languages include things which
are not regular expressions in that theoretical sense.
David C. Ullrich

Jun 27 '08 #1

Subscribe Post Reply

4916

Paul McGuire

On Jun 12, 6:06*am, David C. Ullrich <dullr...@sprynet.comwrote:

There's no regex that detects balanced parentheses,
or is there?

That is, search('and so ((x+y)+z) = (x+(y+z))')
should return '((x+y)+z)'.

Not just a theoretical question, I'm cleaning up
a large body of TeX code and a regex that did that
would be very convenient. (Yes, I know it's not
hard to detect balanced parentheses by hand...)

I don't know that stuff, but I seen to recall reading
that there's a theoretical notion of "regular expression"
in CS, and a regular expression in that sense cannot
do this. But in the same place I read that the actual
regexes in programming languages include things which
are not regular expressions in that theoretical sense.

David C. Ullrich

Pyparsing includes several helper methods for building common
expression patterns, such as delimitedList, oneOf, operatorPrecedence,
countedArray - and a fairly recent addition, nestedExpr. nestedExpr
creates an expression for matching nested text within opening and
closing delimiters, such as ()'s, []'s, {}'s, etc. The default
delimiters are ()'s. You can also specify a content expression, so
that pyparsing will look for and construct meaningful results. The
default is to return any text nested within the delimiters, broken up
by whitespace.

Here is your sample string parsed using the default nestedExpr:

>>from pyparsing import nestedExpr
for e in nestedExpr().searchString('and so ((x+y)+z) = (x+(y+z))'):

... print e[0]
...
[['x+y'], '+z']
['x+', ['y+z']]

Pyparsing found 2 matches in your input string. Note that the parens
are gone from the results - nestedExpr returns a nested list
structure, with nesting corresponding to the ()'s found in the
original string.

Pyparsing supports parse-time callbacks, called 'parse actions', and
it comes with several commonly used methods, such as removeQuotes,
upcaseTokens, and keepOriginalText. The purpose of keepOriginalText
is to revert any structuring or parsing an expression or other parse
actions might do, and just return the originally matched text.

Here is how keepOriginalText gives you back just the nested
parenthetical expressions, without any additional processing or
grouping:

>>from pyparsing import keepOriginalText
matchedParens = nestedExpr().setParseAction(keepOriginalText)
for e in matchedParens.searchString('and so ((x+y)+z) = (x+(y+z))'):

... print e[0]
...
((x+y)+z)
(x+(y+z))

-- Paul

Jun 27 '08 #2

David C. Ullrich

On Thu, 12 Jun 2008 06:38:16 -0700 (PDT), Paul McGuire
<pt***@austin.rr.comwrote:

>On Jun 12, 6:06*am, David C. Ullrich <dullr...@sprynet.comwrote:
>There's no regex that detects balanced parentheses,
or is there?

[...]

Pyparsing includes several helper methods for building common
expression patterns, such as delimitedList, oneOf, operatorPrecedence,
countedArray - and a fairly recent addition, nestedExpr. nestedExpr
creates an expression for matching nested text within opening and
closing delimiters, such as ()'s, []'s, {}'s, etc.

Keen. Howdya know I wanted that? Thanks.

TeX is one of the amazing things about free software. Knuth
is great in many ways. He totally blew it in one detail,
unfortunately one that comes up a lot: '$' is an opening
delimiter, for which the corresponding closing delimiter
is also '$'. Then better yet, '$$' is another double-duty
delimiter... what I've done with that is first split
on '$$', taking the odd-numbered bits to be the parts
enclosed in $$..$$, and then taking the remining
parts and splitting on $. Not hard but it gives me a
creepy feeling.

Hence the question: Can pyparsing tell the difference
between '$' and '$'? (heh-heh).

The default
delimiters are ()'s. You can also specify a content expression, so
that pyparsing will look for and construct meaningful results. The
default is to return any text nested within the delimiters, broken up
by whitespace.

Here is your sample string parsed using the default nestedExpr:

>>>from pyparsing import nestedExpr
for e in nestedExpr().searchString('and so ((x+y)+z) = (x+(y+z))'):

... print e[0]
...
[['x+y'], '+z']
['x+', ['y+z']]

Pyparsing found 2 matches in your input string. Note that the parens
are gone from the results - nestedExpr returns a nested list
structure, with nesting corresponding to the ()'s found in the
original string.

Pyparsing supports parse-time callbacks, called 'parse actions', and
it comes with several commonly used methods, such as removeQuotes,
upcaseTokens, and keepOriginalText. The purpose of keepOriginalText
is to revert any structuring or parsing an expression or other parse
actions might do, and just return the originally matched text.

Here is how keepOriginalText gives you back just the nested
parenthetical expressions, without any additional processing or
grouping:

>>>from pyparsing import keepOriginalText
matchedParens = nestedExpr().setParseAction(keepOriginalText)
for e in matchedParens.searchString('and so ((x+y)+z) = (x+(y+z))'):

... print e[0]
...
((x+y)+z)
(x+(y+z))

-- Paul

David C. Ullrich

Jun 27 '08 #3

Paul McGuire

Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)

I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.

-- Paul

Jun 27 '08 #4

Tim Arnold

"Paul McGuire" <pt***@austin.rr.comwrote in message
news:20**********************************@a1g2000h sb.googlegroups.com...

Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)

I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.

-- Paul

Definitely agree that TeX can get pretty complicated. My method (writing a
converter from one TeX tag system to another) was to pre-parse using string
match/replace for really simple stuff, regular expressions for the more
complex and pyparsing for the really tough stuff.

One thing that was surprisingly hard for me to figure out was filtering out
comments. I finally just looped through the file line by line, looking for a
'%' that wasn't in a verbatim environment and wasn't escaped, etc.
Funny how sometimes the simplest thing can be difficult to handle.

Definitely pyparsing made the job possible; I can't imagine that job without
it.

--Tim Arnold

Jun 27 '08 #5

David C. Ullrich

In article
<20**********************************@a1g2000hsb.g ooglegroups.com>,
Paul McGuire <pt***@austin.rr.comwrote:

Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)

Thanks. Not actually parsing TeX, just going through and
making little changes to a few things. Easier since I wrote
the original crap so I know that various things don't come
up (for example _every_ % is the start of a comment and
_none_ of those comments is actually important, they're
all just old stuff commented out.)

I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.

-- Paul

--
David C. Ullrich

Jun 27 '08 #6

by: aevans1108 | last post by:

expanding this message to microsoft.public.dotnet.xml Greetings Please direct me to the right group if this is an inappropriate place to post this question. Thanks. I want to format a...

.NET Framework

How can I embed the *regex* engine into C program?

by: alphatan | last post by:

Is there relative source or document for this purpose? I've searched the index of "Mastering Regular Expression", but cannot get the useful information for C. Thanks in advanced. -- Learning...

C / C++

any regex gurus out there?

by: bill tie | last post by:

I'd appreciate it if you could advise. 1. How do I replace "\" (backslash) with anything? 2. Suppose I want to replace (a) every occurrence of characters "a", "b", "c", "d" with "x", (b)...

C# / C Sharp

Regex help, please - Recognize quoted strings?

by: Dave | last post by:

I'm struggling with something that should be fairly simple. I just don't know the regext syntax very well, unfortunately. I'd like to parse words out of what is basically a boolean search...

C# / C Sharp

regex: multiple matches after initial text

by: Chance Hopkins | last post by:

I'm trying to match a set of matches after some initial text: mytext: "something" "somethingelse" "another thing" "maybe another" (?:mytext: )(?<mymatch>{1,1}+{1,1}+)+ I only get the last one...

C# / C Sharp

Seeking regex optimizer

by: Kay Schluehr | last post by:

I have a list of strings ls = and want to create a regular expression sx from it, such that sx.match(s) yields a SRE_Match object when s starts with an s_i for one i in . There might be...

Python

boost regex format string

by: Aek | last post by:

Hi everyone, I am trying to construct a regular expression and format string to use with a boost::regex_replace() In my file the sample text is: // .fx shader file FLOAT JOE 3545f; FLOAT...

C / C++

RegEx to find a word not enclosed in paranthesis

by: vmoreau | last post by:

I have a text and I need to find a Word that are not enclosed in paranthesis. Can it be done with a regex? Is someone could help me? I am not familar with regex... Example looking for WORD:...

C# / C Sharp

Need a regex for this...

by: =?Utf-8?B?UmF5IE1pdGNoZWxs?= | last post by:

I'm trying to learn regex but since I've spent way too much time on the following "simple" case, there's obviously something I'm missing. I need to find all occurrences of a specific...

C# / C Sharp

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

regex for balanced parentheses?

Similar topics