432,651 Members | 1,757 Online + Ask a Question
Need help? Post your question and get tips & solutions from a community of 432,651 IT Pros & Developers. It's quick & easy.

# need some regular expression help

 P: n/a I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? Thanks for any help! Oct 7 '06 #1
14 Replies

 P: n/a Chris wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? This is not possible with regular expressions - they can't "remember" how many parens they already encountered. You will need a real parser for this - pyparsing seems to be the most popular choice today, I personally like spark. I'm sure you find an example-grammar that will parse simple arithmetical expressions like the one above. Diez Oct 7 '06 #2

 P: n/a Chris wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? No, there is so such pattern. You will have to code up a function. Consider what your spec really is: '42^((2x+2)sin(x)) + (log(2)/log(5))' has the same number of left and right parentheses; so does the zero-length string; so does ') + (' -- perhaps you need to add 'and starts with a "("' Consider what you are going to do with input like this: print '(' + some_text + ')' Maybe you need to do some lexical analysis and work at the level of tokens rather than individual characters. Which then raises the usual question: you have a perception that regular expressions are the solution -- to what problem?? HTH, John Oct 7 '06 #3

 P: n/a On 7 Oct 2006 15:00:29 -0700, Diez B. Roggisch Chris wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? This is not possible with regular expressions - they can't "remember" how many parens they already encountered. Remember that regular expressions are used to represent regular grammars. Most regex engines actually aren't regular in that they support fancy things like look-behind/ahead and capture groups...IIRC, these cannot be part of a true regular expression library. With that said, the quote-unquote regexes in Lua have a special feature that supports balanced expressions. I believe Python has a PCRE lib somewhere; you may be able to use the experimental ??{ } construct in that case. -- Theerasak Oct 8 '06 #4

 P: n/a In article <11*********************@e3g2000cwe.googlegroups.c om>, "Chris"

 P: n/a Why does it need to be a regex? There is a very simple and well-known algorithm which does what you want. Start with i=0. Walk the string one character at a time, incrementing i each time you see a '(', and decrementing it each time you see a ')'. At the end of the string, the count should be back to 0. If at any time during the process, the count goes negative, you've got mis-matched parentheses. The algorithm runs in O(n), same as a regex. Regex is a wonderful tool, but it's not the answer to all problems. Following Roy's suggestion, one could use something like: >>s = '42^((2x+2)sin(x)) + (log(2)/log(5))'d = {'(':1, ')':-1}sum(d.get(c, 0) for c in s) 0 If you get a sum() 0, then you have too many "(", and if you have sum() < 0, you have too many ")" characters. A sum() of 0 means there's the same number of parens. It still doesn't solve the aforementioned problem of things like ')))(((' which is balanced, but psychotic. :) -tkc Oct 8 '06 #6

 P: n/a hanumizzle wrote: On 7 Oct 2006 15:00:29 -0700, Diez B. Roggisch

 P: n/a On 8 Oct 2006 01:49:50 -0700, Diez B. Roggisch

 P: n/a Tim Chase: It still doesn't solve the aforementioned problem of things like ')))(((' which is balanced, but psychotic. :) This may solve the problem: def balanced(txt): d = {'(':1, ')':-1} tot = 0 for c in txt: tot += d.get(c, 0) if tot < 0: return False return tot == 0 print balanced("42^((2x+2)sin(x)) + (log(2)/log(5))") # True print balanced("42^((2x+2)sin(x) + (log(2)/log(5))") # False print balanced("42^((2x+2)sin(x))) + (log(2)/log(5))") # False print balanced(")))(((") # False A possibile alternative for Py 2.5. The dict solution looks better, but this may be faster: def balanced2(txt): tot = 0 for c in txt: tot += 1 if c=="(" else (-1 if c==")" else 0) if tot < 0: return False return tot == 0 Bye, bearophile Oct 8 '06 #9

 P: n/a be************@lycos.com wrote: The dict solution looks better, but this may be faster: it's slightly faster, but both your alternatives are about 10x slower than a straightforward: def balanced(txt): return txt.count("(") == txt.count(")") Oct 8 '06 #10

 P: n/a Thus spoke Diez B. Roggisch (on 2006-10-08 10:49): Certainly true, and it always gives me a hard time because I don't know to which extend a regular expression nowadays might do the job because of these extensions. It was so much easier back in the old times.... Right, in perl, this would be a no-brainer, its documented all over the place, like: my \$re; \$re = qr{ (?: (?[^\\()]+ | \\. ) | \( (??{ \$re }) \) )* }xs; where you have a 'delayed execution' of the (??{ \$re }) which in the end makes the whole a thing recursive one, it gets expanded and executed if the match finds its way to it. Above regex will match balanced parens, as in: my \$good = 'a + (b / (c - 2)) * (d ^ (e+f)) '; my \$bad1 = 'a + (b / (c - 2) * (d ^ (e+f)) '; my \$bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f) )'; if you do: print "ok \n" if \$good =~ /^\$re\$/; print "ok \n" if \$bad1 =~ /^\$re\$/; print "ok \n" if \$bad2 =~ /^\$re\$/; This in some depth documented e.g. in http://japhy.perlmonk.org/articles/tpj/2004-summer.html (topic: Recursive Regexes) Regards M. Oct 8 '06 #11

 P: n/a Mirco Wahab schrieb: Thus spoke Diez B. Roggisch (on 2006-10-08 10:49): >Certainly true, and it always gives me a hard time because I don't knowto which extend a regular expression nowadays might do the job becauseof these extensions. It was so much easier back in the old times.... Right, in perl, this would be a no-brainer, its documented all over the place, like: my \$re; \$re = qr{ (?: (?[^\\()]+ | \\. ) | \( (??{ \$re }) \) )* }xs; where you have a 'delayed execution' of the (??{ \$re }) which in the end makes the whole a thing recursive one, it gets expanded and executed if the match finds its way to it. Above regex will match balanced parens, as in: my \$good = 'a + (b / (c - 2)) * (d ^ (e+f)) '; my \$bad1 = 'a + (b / (c - 2) * (d ^ (e+f)) '; my \$bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f) )'; if you do: print "ok \n" if \$good =~ /^\$re\$/; print "ok \n" if \$bad1 =~ /^\$re\$/; print "ok \n" if \$bad2 =~ /^\$re\$/; This in some depth documented e.g. in http://japhy.perlmonk.org/articles/tpj/2004-summer.html (topic: Recursive Regexes) That clearly is a recursive grammar rule, and thus it can't be regular anymore :) But first of all, I find it ugly - the clean separation of lexical and syntactical analysis is better here, IMHO - and secondly, what are the properties of that parsing? Is it LL(k), LR(k), backtracking? Diez Oct 8 '06 #12

 P: n/a Fredrik Lundh wrote: it's slightly faster, but both your alternatives are about 10x slower than a straightforward: def balanced(txt): return txt.count("(") == txt.count(")") I know, but if you read my post again you see that I have shown those solutions to mark ")))(((" as bad expressions. Just counting the parens isn't enough. Bye, bearophile Oct 8 '06 #13

 P: n/a "Diez B. Roggisch"

 P: n/a On 10/8/06, Roy Smith

### This discussion thread is closed

Replies have been disabled for this discussion. 