regexp non-greedy matching bug?

John Hazen

I want to match one or two instances of a pattern in a string.

According to the docs for the 're' module
( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier
is greedy by default, and adding a '?' after a qualifier makes it
non-greedy.

The "*", "+", and "?" qualifiers are all greedy...
Adding "?" after the qualifier makes it perform the match in
non-greedy or minimal fashion...

In the following example, though my re is intended to allow for 1 or 2
instinces of 'foo', there are 2 in the string I'm matching. So, I would
expect group(1) and group(3) to both be populated. (When I remove the
conditional match on the 2nd foo, the grouping is as I expect.)

$ python2.4
Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
Type "help", "copyright" , "credits" or "license" for more information.

import re
foofoo = re.compile(r'^( foo)(.*?)(foo)? (.*?)$')
foofoo.match(s) .group(0) 'foobarbazfooba r' foofoo.match(s) .group(1) 'foo' foofoo.match(s) .group(2) '' foofoo.match(s) .group(3)
foofoo.match(s) .group(4) 'barbazfoobar' foofoo = re.compile(r'^( foo)(.*?)(foo)( .*?)$')
foofoo.match(s) .group(0) 'foobarbazfooba r' foofoo.match(s) .group(1) 'foo' foofoo.match(s) .group(2) 'barbaz' foofoo.match(s) .group(3) 'foo' foofoo.match(s) .group(4) 'bar'

So, is this a bug, or just a problem with my understanding? If it's my
brain that's broken, what's the proper way to do this with regexps?

And, if the above is expected behavior, should I submit a doc bug? It's
clear that the "?" qualifier (applied to the second foo group) is _not_
greedy in this situation.

-John

Dec 4 '05 #1

Subscribe Reply

2460

Sam Pointon

My understanding of .*? and its ilk is that they will match as little
as is possible for the rest of the pattern to match, like .* will match
as much as possible. In the first instance, the first (.*?) did not
have to match anything, because all of the rest of the pattern can be
matched 0 or more times. I think that such a situation (non-greedy
operator followed by operators that match 0 or more times) will never
match. However, in the second instance, it has to match something later
on in the string, so it will capture something.

I believe that this is an operator precedence problem (greedy ? losing
to .*?), but is to be expected. So, if this is the case, by all means
it should be added in a note to the docs.

However, I am not a regular expression expert, so my analysis of the
situation may be well off the mark.

Dec 4 '05 #2

Mike Meyer

John Hazen <jo**@hazen.net > writes:

I want to match one or two instances of a pattern in a string.
Then you should be using the split() method of the match object on the
pattern in question.
According to the docs for the 're' module
( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier
is greedy by default, and adding a '?' after a qualifier makes it
non-greedy.
The "*", "+", and "?" qualifiers are all greedy...
Adding "?" after the qualifier makes it perform the match in
non-greedy or minimal fashion...
The thing to understand is that regular expressions are *search*
functions, that return the first parsing that matches. They search a
space of possible matches to each term in the expression. If some term
fails to match, the preceeding term goes on to its next match, and you
try again. The "greedy" vs. "non-greedy" describes the order that the
term in question tries matches. If it's greedy, it will try the
longest possible match first. If it's non-greedy, it'll try the
shortest possible match first.

$ python2.4
Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
Type "help", "copyright" , "credits" or "license" for more information. import re
foofoo = re.compile(r'^( foo)(.*?)(foo)? (.*?)$')
foofoo.match(s) .group(0) 'foobarbazfooba r' foofoo.match(s) .group(1) 'foo' foofoo.match(s) .group(2) '' foofoo.match(s) .group(3)
foofoo.match(s) .group(4) 'barbazfoobar'
First, this pattern doesn't look for one or two instances of "foo" in
a string. It looks for a string that starts with "foo" and maybe has a
second "foo" in it as well.

Here's what it does:

^ must match the beginning of the string (BTW, you can get the same
behavior by leaving off the ^ and using search instead of match).

(foo) must match the first foo.

(.*?) is non-greedy, so it tries the shortest thing that matches, the
empty string.

(foo)? matches what's next in the string by matching 0 occurrences of
(foo).

(.*?) tries the empty string as well.

$ fails to match the end of the string, so we backtrack, and the
second (.*?) tries again, adding a character. This keeps on until the
(.*?) matches the rest of the string and the $ succeeds.

Here, the only pattern that gets retried is the last one. It keeps
adding characters until it swallows the remainder of the string. This
is exactly what's expected. You can't change this behavior with
greedy/non-greedy, as that last term will always try searching the
entire string before it fails and the preceeding "(foo)?" gets to try
something else.

foofoo = re.compile(r'^( foo)(.*?)(foo)( .*?)$')
foofoo.match(s) .group(0) 'foobarbazfooba r' foofoo.match(s) .group(1) 'foo' foofoo.match(s) .group(2) 'barbaz' foofoo.match(s) .group(3) 'foo' foofoo.match(s) .group(4) 'bar'

This time, the second (foo) keeps failing, forcing the first (.*?) to
add characters until the (foo) succeeds.
So, is this a bug, or just a problem with my understanding? If it's my
brain that's broken, what's the proper way to do this with regexps?

To do what you said you want to do, you want to use the split method:

foo = re.compile('foo ')
if 2 <= len(foo.split(s )) <= 3:
print "We had one or two 'foo's"

As the founder of SPARE, I'd say the proper way to do this is with
string methods:

if 2 <= len(s.split('fo o')) <= 3:
print "We had one or two 'foo's"

Where the two split's will produce exactly the same things.

You might be able to match "starts with 'foo' and has at most one
other 'foo'" with lookahead patternds, but I'm not going to go into
that. If you really need the string to start with foo, I'd just use
"s.startswith(' foo')" to check it.

<mike
--
Mike Meyer <mw*@mired.or g> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Dec 4 '05 #3

John Hazen

[Mike Meyer]

The thing to understand is that regular expressions are *search*
functions, that return the first parsing that matches. They search a
space of possible matches to each term in the expression. If some term
fails to match, the preceeding term goes on to its next match, and you
try again. The "greedy" vs. "non-greedy" describes the order that the
term in question tries matches. If it's greedy, it will try the
longest possible match first. If it's non-greedy, it'll try the
shortest possible match first.
That's a good explanation. Thanks.

[John]
I want to match one or two instances of a pattern in a string.
> foofoo = re.compile(r'^( foo)(.*?)(foo)? (.*?)$')
[Mike] First, this pattern doesn't look for one or two instances of "foo" in
a string. It looks for a string that starts with "foo" and maybe has a
second "foo" in it as well.
Right. In simplifying the expression for public consumption, one of the
terms I dropped was r'^.*?(foo)...' .
To do what you said you want to do, you want to use the split method:

foo = re.compile('foo ')
if 2 <= len(foo.split(s )) <= 3:
print "We had one or two 'foo's"
Well, this would solve my dumbed down example, but each foo in the
original expression was a stand-in for a more complex term. I was using
match groups to extract the parts of the match that I wanted. Here's an
example (using Tim's correction) that actually demonstrates what I'm
doing:
s = 'zzzfoo123barxx xfoo456baryyy'
s2 = 'zzzfoo123barxx xfooyyy'
foobar2 = re.compile(r'^. *?foo(\d+)bar(. *foo(\d+)bar)?. *$')
print foobar2.match(s ).group(1) 123 print foobar2.match(s ).group(3) 456 print foobar2.match(s 2).group(1) 123 print foobar2.match(s 2).group(3) None

Looking at re.split, it doesn't look like it returns the actual matching
text, so I don't think that fits my need.
As the founder of SPARE...

Hmm, not a very effective name. A google search didn't fing any obvious
hits (even after adding the "python" qualifier, and removing "spare time"
and "spare parts" hits). (I couldn't find it off your homepage,
either.)

Thanks for your help. If you have any suggestions about a non-re way to
do the above, I'd be interested.

-John

Dec 4 '05 #4

Fredrik Lundh

Mike Meyer wrote:

^ must match the beginning of the string (BTW, you can get the same
behavior by leaving off the ^ and using search instead of match).

that's backwards, isn't it? using ^ with match is usually pointless (since
match only looks at the first position anyway), and using ^ with search
is also usually pointless...

(the only time you really need ^ is in certain multiline searches; depending
on what flags you use, ^ can match not only at the beginning of a string,
but also after each newline character in the target string)

</F>

Dec 4 '05 #5

Mike Meyer

John Hazen <jo**@hazen.net > writes:

To do what you said you want to do, you want to use the split method:

foo = re.compile('foo ')
if 2 <= len(foo.split(s )) <= 3:
print "We had one or two 'foo's"
Well, this would solve my dumbed down example, but each foo in the
original expression was a stand-in for a more complex term.

That actually doesn't matter. Just replace 'foo' with your more
complex term.

foo2 = re.compile(r'fo o(\d+)bar') I was using
match groups to extract the parts of the match that I wanted. Here's an
example (using Tim's correction) that actually demonstrates what I'm
doing:
s = 'zzzfoo123barxx xfoo456baryyy'
s2 = 'zzzfoo123barxx xfooyyy'
foobar2 = re.compile(r'^. *?foo(\d+)bar(. *foo(\d+)bar)?. *$')
print foobar2.match(s ).group(1) 123 print foobar2.match(s ).group(3) 456 foo2.split('zzz foo123barxxxfoo 456baryyy') ['zzz', '123', 'xxx', '456', 'yyy'] print foobar2.match(s 2).group(1) 123 print foobar2.match(s 2).group(3) None foo2.split('zzz foo123barxxxfoo yyy')

['zzz', '123', 'xxxfooyyy']
Looking at re.split, it doesn't look like it returns the actual matching
text, so I don't think that fits my need.

split() returns the text matched by groups in the pattern used to do
the split, and is documented as doing so. The solution you gave
doesn't return "the actual matching text" for the instances, but just
the text in the groups inside that text - which is exactly what
split() returns.

While on that topic, I'll note that the solution Tim gave you doesn't
solve the problem as I originally understood it, either. You said you
wanted to match one or two instances, which I read as only one or two
instances, so that more than two instances would be treated as a
failure. On rereading it, I can see where I was wrong.
As the founder of SPARE...

Hmm, not a very effective name. A google search didn't fing any obvious
hits (even after adding the "python" qualifier, and removing "spare time"
and "spare parts" hits). (I couldn't find it off your homepage,
either.)

That's the Society for the Prevention of Abuse of Regular
Expressions. One of these days, my proof-reader will get back to me
and I'll add a link about it to my home page.

<mike
--
Mike Meyer <mw*@mired.or g> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Dec 4 '05 #6

Mike Meyer

"Fredrik Lundh" <fr*****@python ware.com> writes:

Mike Meyer wrote:
^ must match the beginning of the string (BTW, you can get the same
behavior by leaving off the ^ and using search instead of match).

that's backwards, isn't it? using ^ with match is usually pointless (since
match only looks at the first position anyway), and using ^ with search
is also usually pointless...

Yup, you're right. The '^' was redundant in the OP's post.

<mike
--
Mike Meyer <mw*@mired.or g> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Dec 4 '05 #7

Aahz

In article <ma************ *************** ************@py thon.org>,
Fredrik Lundh <fr*****@python ware.com> wrote:

that's backwards, isn't it? using ^ with match is usually pointless (since
match only looks at the first position anyway), and using ^ with search
is also usually pointless...

While you're technically correct, I've been bitten too many times by
forgetting whether to use match() or search(). I've fixed that problem
by choosing to always use search() and combine with ^ as appropriate.
--
Aahz (aa**@pythoncra ft.com) <*> http://www.pythoncraft.com/

"Don't listen to schmucks on USENET when making legal decisions. Hire
yourself a competent schmuck." --USENET schmuck (aka Robert Kern)

Dec 4 '05 #8

Fredrik Lundh

Aahz wrote:

While you're technically correct, I've been bitten too many times by
forgetting whether to use match() or search(). I've fixed that problem
by choosing to always use search() and combine with ^ as appropriate.

that's a bit suboptimal, though, at least for cases where failed matches
are rather common:

C:\>timeit -s "import re; p = re.compile('b') " "p.match('a'*10 0)"
100000 loops, best of 3: 6.14 usec per loop

C:\>timeit -s "import re; p = re.compile('^b' )" "p.match('a'*10 0)"
100000 loops, best of 3: 6.25 usec per loop

C:\>timeit -s "import re; p = re.compile('^b' )" "p.search('a'*1 00)"
100000 loops, best of 3: 15.4 usec per loop

(afaik, search doesn't have any heuristics for figuring out if it can skip
the search, so it'll check ^ against all available positions)

on the other hand, benchmarking RE:s always results in confusing
results:

C:\>timeit -s "import re; p = re.compile('b') " "p.search('a'*1 00)"
100000 loops, best of 3: 4.32 usec per loop

(should this really be *faster* than match for this case ?)

</F>

Dec 5 '05 #9

Similar topics

2349

Saving search results in a dictionary

by: Lukas Holcik | last post by:

Hi everyone! How can I simply search text for regexps (lets say <a href="(.*?)">(.*?)</a>) and save all URLs(1) and link contents(2) in a dictionary { name : URL}? In a single pass if it could. Or how can I replace the html &entities; in a string "blablabla&blablabal&balbalbal" with the chars they mean using re.sub? I found out they are stored in an dict . I though about this functionality:

Python

2050

Help with RegExp

by: Syed Ali | last post by:

Hello, I am trying to create a regexp to express non letters and space. I tried using: var myreg = new RegExp (""); However, it is not working. Basically I want to allow only words with letters in a textfield, space is ok, but no special characters such as $%^ or numbers such as

Javascript

24749

RegExp for remove all trailing CrLf's?

by: McKirahan | last post by:

How would I use a regular expression to remove all trailing Carriage Returns and Line Feeds (%0D%0A) from a textarea's value? Thanks in advance. Also, are they any great references for learning how to use Regular Expressions?

Javascript

2205

RegExp and how to use regular expressions

by: Jeff Sandler | last post by:

I have a page that accepts input from many textboxes. Many of the textboxes are intended to accept dates and times, thus, I expect only digits to be entered. I originally tested using parseInt and isNaN, but I'm not even sure that the results are as perfect as I need. I am expecting to use RegExp.test(string), but I'm not 100% sure about that, either. Here is a test program with a textbox that has a maxlength of 2 characters. The...

Javascript

3528

RegExp for hyphen to camelCase?

by: RobG | last post by:

I'm messing with getPropertyValue (Mozilla et al) and currentStyle (IE) and have a general function (slightly modified from one originally posted by Steve van Dongen) for getting style properties: function GetCurrentStyle( el, prop ) { if ( window.getComputedStyle ) { // Mozilla et al return window.getComputedStyle(el, '').getPropertyValue(prop) ); } // IE5+ else if ( el.currentStyle ) {

Javascript

5013

Validate EMail with RegExp...

by: Dag Sunde | last post by:

My understanding of regular expressions is rudimentary, at best. I have this RegExp to to a very simple validation of an email-address, but it turns out that it refuses to accept mail-addresses with hypens in them. Can anybody please help me adjust it so it will accept addresses like dag-sunde@test-domain.net too?

Javascript

1408

Regexp problem with form element

by: Christoph | last post by:

I'm trying to set up client side validation for a textarea form element to ensure that the data entered does not exceed 200 characters. I'm using the following code but it doesn't seem to be working correctly: if( this.value != '' ) { if( !( RegExp( '^{0,200}$' ).test( this.value ))) { alert( 'Information provided must not be more than 200 characters.' ); this.focus(); } }

Javascript

2072

regexp questoin

by: micklee74 | last post by:

hi i created a script to ask user for an input that can be a pattern right now, i use re to compile that pattern pat = re.compile(r"%s" %(userinput) ) #userinput is passed from command line argument if the user key in a pattern , eg , and my script will search some lines that contains pat.findall(lines)

Python

1619

Assign a PERL style regex in the RegExp constructor?

by: jgarrard | last post by:

Hi, I have an array of strings which are regular expressions in the PERL syntax (ie / / delimeters). I wish to create a RegExp in order to do some useful work, but am stuck for a way of getting these strings into the RegExp object. The RegExp constructor seems to want two parameters - the non / delimited expression, and the global modifiers.

Javascript

1200

regexp qns

by: eight02645999 | last post by:

hi suppose i have a string like test1?test2t-test3*test4*test5$test6#test7*test8 how can i construct the regexp to get test3*test4*test5 and test7*test8, ie, i want to match * and the words before and after? thanks

Python

8443

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

8356

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...

Windows Server

8550

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

8639

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

7385

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

5663

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4198

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

4366

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

2769

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp