re module non-greedy matches broken

re:
4.2.1 Regular Expression Syntax
http://docs.python.org/lib/re-syntax.html

*?, +?, ??
Adding "?" after the qualifier makes it perform the match in non-greedy or
minimal fashion; as few characters as possible will be matched.

the regular expression module fails to perform non-greedy matches as
described in the documentation: more than "as few characters as possible"
are matched.

this is a bug and it needs to be fixed.

examples follow.

lothar@erda /ntd/vl
$ cat vwre.py
#! /usr/bin/env python

import re

vwre = re.compile("V.*?W")
vwlre = re.compile("V.*?WL")

if __name__ == "__main__":

    newdoc = "V1WVVV2WWW"
    vwli = re.findall(vwre, newdoc)
    print "vwli[], expect", ['V1W', 'V2W']
    print "vwli[], return", vwli

    newdoc = "V1WLV2WV3WV4WLV5WV6WL"
    vwlli = re.findall(vwlre, newdoc)
    print "vwlli[], expect", ['V1WL', 'V4WL', 'V6WL']
    print "vwlli[], return", vwlli

lothar@erda /ntd/vl
$ python vwre.py
vwli[], expect ['V1W', 'V2W']
vwli[], return ['V1W', 'VVV2W']
vwlli[], expect ['V1WL', 'V4WL', 'V6WL']
vwlli[], return ['V1WL', 'V2WV3WV4WL', 'V5WV6WL']

lothar@erda /ntd/vl
$ python -V
Python 2.3.3
Jul 18 '05 #1
* lothar wrote:
> re:
> 4.2.1 Regular Expression Syntax
> http://docs.python.org/lib/re-syntax.html
>
> *?, +?, ??
> Adding "?" after the qualifier makes it perform the match in non-greedy or
> minimal fashion; as few characters as possible will be matched.
>
> the regular expression module fails to perform non-greedy matches as
> described in the documentation: more than "as few characters as possible"
> are matched.
>
> this is a bug and it needs to be fixed.


The documentation is just incomplete. Non-greedy regexps still start
matching at the leftmost position. So instead of the longest of the leftmost
matches, you get the shortest of the leftmost. One may consider this a
documentation bug,
yes.
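
A small sketch of "longest of the leftmost" versus "shortest of the leftmost", written in Python 3 syntax (the thread itself uses Python 2.3) and reusing the first test string from the original post. The variable names are only illustrative.

```python
import re

text = "V1WVVV2WWW"

greedy = re.search(r"V.*W", text)   # longest match from the leftmost start
lazy = re.search(r"V.*?W", text)    # shortest match from the same start

print(greedy.start(), greedy.group())
print(lazy.start(), lazy.group())

# findall resumes after each match and again takes the leftmost start,
# so the second result begins at the first unconsumed "V".
print(re.findall(r"V.*?W", text))
```

Both searches report the same start position; only the end of the match moves.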

nd
--
# André Malo, <http://www.perlig.de/> #
Jul 18 '05 #2
this response is nothing but a description of the behavior i reported.

as to whether this behaviour was intended, one would have to ask the module
writer about that.
because of the statement in the documentation, which places no qualification
on how the scan for the shortest possible match is to be done, my guess is
that this problem was overlooked.

to produce a non-greedy (minimal length) match it is required that the start
of the non-greedy part of the match repeatedly be moved right with the last
match of the left-hand part of the pattern (preceding the .*?).
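
The scan described above is not what the re module does, but it can be approximated on top of it. The following is only a sketch in Python 3 syntax: `innermost_matches` is a hypothetical helper, not part of any library, and it assumes that "moving the start right" means re-searching inside the span of each leftmost match.

```python
import re

def innermost_matches(pattern, text):
    """Hypothetical helper: shrink each leftmost match from the left."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        # Look for a match that starts later but still ends inside m.
        while True:
            inner = regex.search(text, m.start() + 1, m.end())
            if inner is None:
                break
            m = inner
        results.append(m.group())
        # Step past the match; move one extra character after an empty match.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results

print(innermost_matches("V.*?W", "V1WVVV2WWW"))
print(innermost_matches("V.*?WL", "V1WLV2WV3WV4WLV5WV6WL"))
```

On the two test strings from the first post this yields the lists given there as "expect".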

why would someone want a non-greedy (minimal length) match that was not
always non-greedy (minimal length)?
Jul 18 '05 #3
* "lothar" <lo****@ultimathule.nul> wrote:
> this response is nothing but a description of the behavior i reported.

Then you have not read my response carefully enough.

> as to whether this behaviour was intended, one would have to ask the module
> writer about that.

No, I've responded with a view on regexes, not on the module. That is the way
_regexes_ work. Non-greedy regexes do not match the minimal-length at all, they
are just ... non-greedy (technically the backtracking just stacks the longest
instead of the shortest). They *may* match the shortest match, but it's a
special case. Therefore I've stated that the documentation is incomplete.

Actually your expectations go a bit beyond the documentation. From a certain
point of view (matches always start at the leftmost position) the matches
you're seeing *are* the minimal-length matches.

> because of the statement in the documentation, which places no qualification
                                                              ^^^^^^^^^^^^^^^^
                                                              that's the point.
> on how the scan for the shortest possible match is to be done, my guess is
> that this problem was overlooked.


In the docs, yes. But buy yourself a regex book and learn for yourself ;-)
The first thing you should learn about regexes is that the source of pain
of most regex implementations is the documentation, which is very likely
to be wrong.

Finally let me ask a question:

import re
x = re.compile('<.*?>')
print x.search('<title>...</title><body>...</body>').group(0)

What would you expect to be printed out? <title> or <body>? Why?
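
For anyone who wants to check, here is the same snippet as a sketch in Python 3 syntax, with `findall` added for comparison. Note that `<body>` is the shorter string (6 characters against 7), yet it is not what `search` reports.

```python
import re

x = re.compile(r"<.*?>")
text = "<title>...</title><body>...</body>"

first = x.search(text)
print(first.group(0), first.span())   # the leftmost tag, not the shortest

# findall walks on from the end of each match, always taking the leftmost.
print(x.findall(text))
```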

nd
Jul 18 '05 #4
"lothar" wrote:
> this is a bug and it needs to be fixed.


it's not a bug, and it's not going to be "fixed". search, findall, finditer, sub,
etc. all scan the target string from left to right, and process the first location
(or all locations) where the pattern matches.
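
A rough model of that left-to-right scan, as a sketch in Python 3 syntax. This is not the engine's actual code; `leftmost_search` is a made-up helper that simply tries an anchored match at each position in turn, which for a simple pattern like this one should agree with `search`.

```python
import re

def leftmost_search(regex, text):
    """Try the pattern at each start position, left to right (simplified model)."""
    for start in range(len(text) + 1):
        m = regex.match(text, start)   # anchored attempt at this position
        if m is not None:
            return m                   # the first position that works wins
    return None

vwre = re.compile(r"V.*?W")
m = leftmost_search(vwre, "xxVVV2WWW")
print(m.span(), m.group())
```

The non-greedy operator only decides where the match ends once a start position has been accepted; it never causes a later start to be preferred.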

</F>

Jul 18 '05 #5
how then, do i specify a non-greedy regex
<1st-pat><not-1st-pat>*?<follow-pat>

that is, such that non-greedy part <not-1st-pat>*?
excludes a match of <1st-pat>

in other words, how do i write regexes for my examples?

what book or books on regexes or with a good section on regexes would you
recommend?
Hopcroft and Ullman?
Jul 18 '05 #6
> what book or books on regexes

A standard is Mastering Regular Expressions, 2nd ed, by xxx (sorry, forget)

TJR

Jul 18 '05 #7
On Apr 4, 2005 10:06 PM, Terry Reedy <tj*****@udel.edu> wrote:
> > what book or books on regexes
>
> A standard is Mastering Regular Expressions, 2nd ed, by xxx (sorry, forget)


Mastering Regular Expressions, by Jeffrey Friedl
See http://www.regex.info/

Regards,
--
Swaroop C H
Blog: http://www.swaroopch.info
Book: http://www.byteofpython.info
Jul 18 '05 #8
with respect to the documentation, the module is broken.

the module does not necessarily deliver a "minimal length" match for a
non-greedy pattern.
Jul 18 '05 #9
On 04/04/2005-04:20PM, lothar wrote:

> how then, do i specify a non-greedy regex
> <1st-pat><not-1st-pat>*?<follow-pat>
>
> that is, such that non-greedy part <not-1st-pat>*?
> excludes a match of <1st-pat>


jet% cat vwre2.py
#! /usr/bin/env python

import re

# [^V] matches exactly one character that is not "V", which is enough for
# the test strings below; [^V]*? would allow a run of any length.
vwre = re.compile("V[^V]W")
vwlre = re.compile("V[^V]WL")

if __name__ == "__main__":

    newdoc = "V1WVVV2WWW"
    vwli = re.findall(vwre, newdoc)
    print "vwli[], expect", ['V1W', 'V2W']
    print "vwli[], return", vwli

    newdoc = "V1WLV2WV3WV4WLV5WV6WL"
    vwlli = re.findall(vwlre, newdoc)
    print "vwlli[], expect", ['V1WL', 'V4WL', 'V6WL']
    print "vwlli[], return", vwlli

jet% ./vwre2.py
vwli[], expect ['V1W', 'V2W']
vwli[], return ['V1W', 'V2W']
vwlli[], expect ['V1WL', 'V4WL', 'V6WL']
vwlli[], return ['V1WL', 'V4WL', 'V6WL']
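
The same idea can be stretched to the general `<1st-pat><not-1st-pat>*?<follow-pat>` form asked about. This is a sketch in Python 3 syntax under two assumptions: for a one-character first pattern a negated character class is enough, and for a longer first pattern a negative lookahead in front of each consumed character does the excluding. The `START`/`END` markers and the test strings are invented for illustration.

```python
import re

# Single-character first pattern: a negated class keeps "V" out of the middle.
vwlre = re.compile(r"V[^V]*?WL")
print(vwlre.findall("V1WLV22WV3WV444WLV5WV6WL"))

# Multi-character first pattern: a negative lookahead before each consumed
# character keeps "START" out of the middle.
block = re.compile(r"START(?:(?!START).)*?END")
print(block.findall("START a START b END START c END"))
```

In both cases a start position whose middle part would have to swallow another first-pattern simply fails, and the scan moves on to the next start.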

Jul 18 '05 #10
"lothar" wrote:
> with respect to the documentation, the module is broken.

nope.

> the module does not necessarily deliver a "minimal length" match for a
> non-greedy pattern.


it isn't supposed to: a regular expression describes a *set* of matching
strings, and the engine is free to return any string from that set. Python's
engine returns the *first* string it finds that belongs to the set. if you use
a non-greedy operator, the engine will return the first non-greedy match
it finds, not the overall shortest non-greedy match.
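
If the overall shortest match really is what is wanted, it has to be picked out by hand. One possible sketch in Python 3 syntax: a capturing group inside a lookahead collects one candidate per start position, and `min` chooses among them. This assumes a simple pattern with a single non-greedy operator, where the first match found at a given start is also the shortest one from that start.

```python
import re

text = "<title>...</title><body>...</body>"

# A capturing group inside a lookahead reports a candidate at every start
# position without consuming text, so candidates may overlap.
candidates = re.findall(r"(?=(<.*?>))", text)
shortest = min(candidates, key=len)

print(candidates)
print(shortest)
```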

if you don't want to understand how regular expressions work, don't use
them.

</F>

Jul 18 '05 #11
a non-greedy match is implicitly defined in the documentation to be one such
that there is no proper substring in the return which could also match the
regex.

the documentation implies the module will return a non-greedy match.

the module does not return a non-greedy match.
Jul 18 '05 #12
"lothar" wrote:
> a non-greedy match is implicitly defined in the documentation to be one such
> that there is no proper substring in the return which could also match the
> regex.

no, that's not what it says. this is what it says:

Adding "?" after the qualifier makes it perform the match in non-greedy
or minimal fashion; as few characters as possible will be matched.

note that it says "qualifier" (that is, the preceding *, +, or ? operator). it
doesn't say that the *entire* regex should be non-greedy. it does not say
that search, findall, sub etc. should look for the shortest possible overall
match. all it says is that the preceding operator, and that operator only,
should look for the shortest possible match, rather than the longest.
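
To make the "that operator only" point concrete, here is a short sketch in Python 3 syntax with an invented pattern. The start position and, in the first two lines, the overall match are the same; the "?" only changes how much the operator it is attached to takes.

```python
import re

# Same start position, same overall match; only the split between the
# two groups changes with the "?" on the first quantifier.
print(re.match(r"(a+)(a*)$", "aaaa").groups())
print(re.match(r"(a+?)(a*)$", "aaaa").groups())

# Only the first quantifier is non-greedy; the second is still greedy,
# so the overall match is not the shortest possible one.
print(re.match(r"(a+?)(a*)", "aaaa").group(0))
```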

> the module does not return a non-greedy match.


it does. the problem is all in your head.

</F>

Jul 18 '05 #13
