re module non-greedy matches broken

re:
4.2.1 Regular Expression Syntax
http://docs.python.org/lib/re-syntax.html

*?, +?, ??
Adding "?" after the qualifier makes it perform the match in non-greedy or
minimal fashion; as few characters as possible will be matched.

the regular expression module fails to perform non-greedy matches as
described in the documentation: more than "as few characters as possible"
are matched.

this is a bug and it needs to be fixed.

examples follow.

lothar@erda /ntd/vl
$ cat vwre.py
#! /usr/bin/env python

import re

vwre = re.compile("V.*?W")
vwlre = re.compile("V.*?WL")

if __name__ == "__main__":

    newdoc = "V1WVVV2WWW "
    vwli = re.findall(vwre, newdoc)
    print "vwli[], expect", ['V1W', 'V2W']
    print "vwli[], return", vwli

    newdoc = "V1WLV2WV3WV4WLV5WV6WL"
    vwlli = re.findall(vwlre, newdoc)
    print "vwlli[], expect", ['V1WL', 'V4WL', 'V6WL']
    print "vwlli[], return", vwlli

lothar@erda /ntd/vl
$ python vwre.py
vwli[], expect ['V1W', 'V2W']
vwli[], return ['V1W', 'VVV2W']
vwlli[], expect ['V1WL', 'V4WL', 'V6WL']
vwlli[], return ['V1WL', 'V2WV3WV4WL', 'V5WV6WL']

lothar@erda /ntd/vl
$ python -V
Python 2.3.3
Jul 18 '05 #1
* lothar wrote:
> re:
> 4.2.1 Regular Expression Syntax
> http://docs.python.org/lib/re-syntax.html
>
> *?, +?, ??
> Adding "?" after the qualifier makes it perform the match in non-greedy or
> minimal fashion; as few characters as possible will be matched.
>
> the regular expression module fails to perform non-greedy matches as
> described in the documentation: more than "as few characters as possible"
> are matched.
>
> this is a bug and it needs to be fixed.


The documentation is just incomplete. Non-greedy regexps still start
matching at the leftmost position. So instead of the longest of the
leftmost matches you get the shortest of the leftmost matches. One may
consider this a documentation bug, yes.

nd
--
# André Malo, <http://www.perlig.de/> #
Jul 18 '05 #2
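
The "shortest of the leftmost" rule from the reply above can be checked with a minimal sketch. It is written for Python 3 (the thread itself ran Python 2.3.3), and the variable names are illustrative, not taken from the original posts.

```python
import re

text = "V1WVVV2WWW"

# Non-greedy: each match begins at the leftmost "V" that can start a match,
# then stops at the first "W" reachable from that start.
lazy = [(m.start(), m.group()) for m in re.finditer(r"V.*?W", text)]

# Greedy: same leftmost start, but the match runs to the last usable "W".
greedy = [(m.start(), m.group()) for m in re.finditer(r"V.*W", text)]

print(lazy)    # [(0, 'V1W'), (3, 'VVV2W')]
print(greedy)  # [(0, 'V1WVVV2WWW')]
```

The second non-greedy match starts at index 3, the first "V" after the previous match, which is why it comes back as 'VVV2W' rather than 'V2W'.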
this response is nothing but a description of the behavior i reported.

as to whether this behaviour was intended, one would have to ask the module
writer about that.
because of the statement in the documentation, which places no qualification
on how the scan for the shortest possible match is to be done, my guess is
that this problem was overlooked.

to produce a non-greedy (minimal length) match it is required that the start
of the non-greedy part of the match repeatedly be moved right with the last
match of the left-hand part of the pattern (preceding the .*?).

why would someone want a non-greedy (minimal length) match that was not
always non-greedy (minimal length)?
Jul 18 '05 #3
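
The procedure described in the post above (moving the start of the match right to the last occurrence of the left-hand pattern) is not what the re module does, but it can be approximated in user code. The following is only a sketch under stated assumptions: `shortest_spans` is a hypothetical helper, it handles literal strings rather than arbitrary sub-patterns, and it targets Python 3.

```python
import re

def shortest_spans(first, follow, text):
    """Hypothetical helper: `first` and `follow` are literal strings."""
    pattern = re.compile(re.escape(first) + r".*?" + re.escape(follow))
    results = []
    for m in pattern.finditer(text):
        # Move the start right to the last `first` that still precedes
        # the closing `follow` inside this match.
        start = text.rfind(first, m.start(), m.end() - len(follow))
        results.append(text[start:m.end()])
    return results

print(shortest_spans("V", "W", "V1WVVV2WWW"))              # ['V1W', 'V2W']
print(shortest_spans("V", "WL", "V1WLV2WV3WV4WLV5WV6WL"))  # ['V1WL', 'V4WL', 'V6WL']
```

This reproduces the "expect" lists from the first post by trimming each leftmost match after the fact.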
* "lothar" <lo****@ultimathule.nul> wrote:

> this response is nothing but a description of the behavior i reported.

Then you have not read my response carefully enough.

> as to whether this behaviour was intended, one would have to ask the module
> writer about that.

No, I've responded with a view on regexes, not on the module. That is the way
_regexes_ work. Non-greedy regexes do not match the minimal length at all; they
are just ... non-greedy (technically, the backtracking just stacks the longest
instead of the shortest). They *may* match the shortest match, but that is a
special case. Therefore I've stated that the documentation is incomplete.

Actually your expectations go a bit beyond the documentation. From a certain
point of view (matches always start leftmost) the matches you're seeing
*are* the minimal-length matches.

> because of the statement in the documentation, which places no qualification
                                                              ^^^^^^^^^^^^^^^^
that's the point.

> on how the scan for the shortest possible match is to be done, my guess is
> that this problem was overlooked.

In the docs, yes. But buy yourself a regex book and learn for yourself ;-)
The first thing you should learn about regexes is that the source of pain
of most regex implementations is the documentation, which is very likely
to be wrong.

Finally let me ask a question:

import re
x = re.compile('<.*?>')
print x.search('<title>...</title><body>...</body>').group(0)

What would you expect to be printed out? <title> or <body>? Why?

nd
Jul 18 '05 #4
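
For readers who want to try the question above, here is a small sketch in Python 3 syntax (the post uses the Python 2 print statement). The names `html`, `lazy` and `greedy` are illustrative only.

```python
import re

html = "<title>...</title><body>...</body>"
lazy = re.compile(r"<.*?>")
greedy = re.compile(r"<.*>")

# search() reports the leftmost match, so the shorter "<body>" further right
# is never considered.
print(lazy.search(html).group(0))    # <title>
print(greedy.search(html).group(0))  # <title>...</title><body>...</body>
print(lazy.findall(html))            # ['<title>', '</title>', '<body>', '</body>']
```

Note that `<body>` is one character shorter than `<title>`, so a strictly "shortest match anywhere" rule would have to skip the first tag.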
"lothar" wrote:
> this is a bug and it needs to be fixed.


it's not a bug, and it's not going to be "fixed". search, findall, finditer, sub,
etc. all scan the target string from left to right, and process the first location
(or all locations) where the pattern matches.

</F>

Jul 18 '05 #5
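
The left-to-right scan described above can be modelled roughly as follows. This is a simplified sketch, not the module's actual implementation: `scan_all` is a hypothetical name, it assumes a pattern without capturing groups, and its handling of empty matches may differ from the real `findall`.

```python
import re

def scan_all(pattern, text):
    """Simplified model of findall for group-free patterns (illustration only)."""
    compiled = re.compile(pattern)
    found = []
    pos = 0
    while pos <= len(text):
        m = compiled.match(text, pos)  # try to match starting exactly at pos
        if m is None:
            pos += 1                   # no match here, try the next position
            continue
        found.append(m.group())
        # Resume after the match; step forward by one after an empty match.
        pos = m.end() if m.end() > pos else pos + 1
    return found

print(scan_all(r"V.*?W", "V1WVVV2WWW"))  # ['V1W', 'VVV2W']
```

Because the scan commits to the first position where the pattern matches at all, the non-greedy quantifier only shortens the match from that fixed starting point.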
how then, do i specify a non-greedy regex
<1st-pat><not-1st-pat>*?<follow-pat>

that is, such that non-greedy part <not-1st-pat>*?
excludes a match of <1st-pat>

in other words, how do i write regexes for my examples?

what book or books on regexes or with a good section on regexes would you
recommend?
Hopcroft and Ullman?
Jul 18 '05 #6
> what book or books on regexes

A standard is Mastering Regular Expressions, 2nd ed, by xxx (sorry, forget)

TJR

Jul 18 '05 #7
On Apr 4, 2005 10:06 PM, Terry Reedy <tj*****@udel.edu> wrote:
> > what book or books on regexes
>
> A standard is Mastering Regular Expressions, 2nd ed, by xxx (sorry, forget)


Mastering Regular Expressions, by Jeffrey Friedl
See http://www.regex.info/

Regards,
--
Swaroop C H
Blog: http://www.swaroopch.info
Book: http://www.byteofpython.info
Jul 18 '05 #8
with respect to the documentation, the module is broken.

the module does not necessarily deliver a "minimal length" match for a
non-greedy pattern.
Jul 18 '05 #9
On 04/04/2005-04:20PM, lothar wrote:

> how then, do i specify a non-greedy regex
> <1st-pat><not-1st-pat>*?<follow-pat>
>
> that is, such that non-greedy part <not-1st-pat>*?
> excludes a match of <1st-pat>


jet% cat vwre2.py
#! /usr/bin/env python

import re

vwre = re.compile("V[^V]W")
vwlre = re.compile("V[^V]WL")

if __name__ == "__main__":

    newdoc = "V1WVVV2WWW "
    vwli = re.findall(vwre, newdoc)
    print "vwli[], expect", ['V1W', 'V2W']
    print "vwli[], return", vwli

    newdoc = "V1WLV2WV3WV4WLV5WV6WL"
    vwlli = re.findall(vwlre, newdoc)
    print "vwlli[], expect", ['V1WL', 'V4WL', 'V6WL']
    print "vwlli[], return", vwlli

jet% ./vwre2.py
vwli[], expect ['V1W', 'V2W']
vwli[], return ['V1W', 'V2W']
vwlli[], expect ['V1WL', 'V4WL', 'V6WL']
vwlli[], return ['V1WL', 'V4WL', 'V6WL']

Jul 18 '05 #10
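
The patterns in vwre2.py allow exactly one character between "V" and "W". A possible generalization, sketched here for Python 3 and not taken from the thread, is to repeat the excluded class, or to use a negative lookahead when the first pattern is longer than one character. The "<<" and ">>" delimiters and the `doc*` names are made up for illustration.

```python
import re

doc1 = "V1WVVV2WWW"
doc2 = "V1WLV2WV3WV4WLV5WV6WL"

# Character class: any number of non-"V" characters between "V" and "W".
print(re.findall(r"V[^V]*?W", doc1))   # ['V1W', 'V2W']
print(re.findall(r"V[^V]*?WL", doc2))  # ['V1WL', 'V4WL', 'V6WL']

# Negative lookahead: each consumed character must not begin another "<<",
# which plays the role of <not-1st-pat> for a multi-character first pattern.
doc3 = "<<a<<b>>c<<d>>"
print(re.findall(r"<<(?:(?!<<).)*?>>", doc3))  # ['<<b>>', '<<d>>']
```

Both forms exclude the first pattern from the repeated part, so a match can no longer begin at an earlier "V" (or "<<") and run across a later one.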
