473,800 Members | 2,613 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

leftmost longest match (of disjunctions)

Hello,

The program given below returns the lines:

a
ab

Is there a way to use python regular expressions such that the program
would return the following lines?

ab
ab

############### ############### ############### ############### ############

import re

rx1 = re.compile("(a| ab)")
rx2 = re.compile("(ab |a)")

out1 = rx1.search("ab" )
out2 = rx2.search("ab" )

print out1.group(1)
print out2.group(1)

############### ############### ############### ############### ############

Jörg
Jul 18 '05 #1
12 2974
Joerg Schuster wrote:

The program given below returns the lines:

a
ab

Is there a way to use python regular expressions such that the program
would return the following lines?

ab
ab

############### ############### ############### ############### ############

import re

rx1 = re.compile("(a| ab)")
rx2 = re.compile("(ab |a)")


Have you checked the documentation for "re"?

It reads:
"|" A|B, where A and B can be arbitrary REs, creates a regular expression
that will match either A or B. An arbitrary number of REs can be separated
by the "|" in this way. This can be used inside groups (see below) as well.
As the target string is scanned, REs separated by "|" are tried from left to
right. When one pattern completely matches, that branch is accepted. This
means that once A matches, B will not be tested further, even if it would
produce a longer overall match. In other words, the "|" operator is never
greedy.

------

Seems pretty clear and explicit to me. Your example is basically a working
proof of the above code, so I'm not sure what you were expecting differently.

-Peter
Jul 18 '05 #2
Peter Hansen <pe***@engcorp. com> writes:
produce a longer overall match. In other words, the "|" operator is never
greedy.


O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?
Jörg
Jul 18 '05 #3
Peter Hansen <pe***@engcorp. com> writes:
produce a longer overall match. In other words, the "|" operator is never
greedy.


O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?
Jörg
Jul 18 '05 #4
Joerg Schuster wrote:
Peter Hansen <pe***@engcorp. com> writes:

produce a longer overall match. In other words, the "|" operator is never
greedy.

O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?


sort the re by size first?

Pádraig.

Jul 18 '05 #5
Pa*****@Linux.i e writes:
Joerg Schuster wrote:
Peter Hansen <pe***@engcorp. com> writes:
produce a longer overall match. In other words, the "|" operator
is never greedy.

O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?


sort the re by size first?


The point is not to get the match of the longest part of the
disjunction, but to get the match of that part of the disjunction
which is the longest one. (The match of ".*" may be much longer than
the match of "abc", although the latter regex contains more
characters.)

Jörg

Jul 18 '05 #6
Joerg Schuster <js@cis.uni-muenchen.de> writes:
Pa*****@Linux.i e writes:
Joerg Schuster wrote:
Peter Hansen <pe***@engcorp. com> writes:

> produce a longer overall match. In other words, the "|" operator
> is never greedy.
O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?


sort the re by size first?


The point is not to get the match of the longest part of the
disjunction, but to get the match of that part of the disjunction
which is the longest one. (The match of ".*" may be much longer than
the match of "abc", although the latter regex contains more
characters.)

Jörg


I should have read your mail more carefully. You wrote about sorting
the *re*, not about sorting the regex. Sorry.

Jörg
Jul 18 '05 #7
Joerg Schuster wrote:
Peter Hansen <pe***@engcorp. com> writes:
produce a longer overall match. In other words, the "|" operator is
never greedy.


O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?


I don't think so - as regexps are finite state automata, their capabilities
are limited to what these machines can do. So they can't backtrack to a
part of the disjunction that was a shorter match if the longer match
suddenly will fail. Consider this example:

rex: a|abc

string: abd

If the state-machine was greedy, it would follow the second path and fail.

The only thing you can do to avoid is is to split your disjunctions and
search separately, and afterwards takt the longest part. That could be done
using a small grammar, that allows you to detect the disjunctions and
create separate rexes for it.

Diez
Jul 18 '05 #8
Joerg Schuster wrote:
I should have read your mail more carefully. You wrote about sorting
the *re*, not about sorting the regex. Sorry.


what's the difference?

</F>


Jul 18 '05 #9
Joerg Schuster wrote:
O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?


sort the re by size first?


The point is not to get the match of the longest part of the
disjunction, but to get the match of that part of the disjunction
which is the longest one. (The match of ".*" may be much longer
than the match of "abc", although the latter regex contains more
characters.)


you can use "sre_parse.pars e(x).getwidth() " on a subexpression, to
get the shortest/longest possible match.
from sre_parse import parse
parse("a?").get width() (0, 1) parse("ab").get width() (2, 2) parse(".+").get width()

(1, 65535)

(where >=65535 should be interpreted as "any number")

</F>


Jul 18 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
1644
by: alexk | last post by:
Hi, I would like to request your help. My problem is as follows. I want to match urls, and therefore I have a group of long valid domain names in my regex: ..... (?:com|org|net|biz|info|ac|cc|gs|ms| sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
2
2582
by: Keith Chadwick | last post by:
XML <data> <option>this is test 1</option> <option>this is test 11</option> <option>this is test 111</option> <option>this is test 1111</option> <option>this is test 11111</option> </data>
2
1480
by: Horizon | last post by:
Hi, I would like to build a regex that returns me the table list of a SQL SELECT request like : -SELECT * FROM tab1, tab2, tab3 -SELECT * FROM tab1, tab2 WHERE col="abc" -SELECT * FROM tab1, tab2 LIMIT 1 I tried something like that : ---
5
3516
by: Brian Henry | last post by:
I have a messaging application that has a data grid with information like an email list would have (from, subject, time sent, size) but the subject could be very long in theory, and then it would word wrap.. the subject is a variable width column (resizes to fit space) while the others are fixed size... the problem is that i dont want the text to ever word wrap.. but truncate the visible subject to match the current cell width... (kinda...
32
14714
by: Licheng Fang | last post by:
Basically, the problem is this: 'do' Python's NFA regexp engine trys only the first option, and happily rests on that. There's another example: 'oneself' The Python regular expression engine doesn't exaust all the
6
2728
by: EXI-Andrews, Jack | last post by:
the '*' and '+' don't seem to be greedy.. they will consume less in order to match a string: ('aa', 'ab') this is the sort of behaviour i'd expect from '(a+?)(ab)' a+ should greedily consume a's at the expense of the string not matching
1
2605
by: hn.ft.pris | last post by:
"Forward maximum match" mean that there is a dictionary contains thousands of items which are single words. For example, "kid", "nice", "best"... And here get a string "kidnicebs", the forward maximum match algorithm need to find the longest string that matches the string in dictionary. Because the longest string in the dictionary has a length of 4, so the algoritm pick the first four characters of the string(that's why it's called...
4
3057
by: Thomas Mlynarczyk | last post by:
Hello, I have two arrays like this: $aSearch = array( 'A', 'B', 'C.D', 'E', 'F' ); $aSubject = array( 'A' =0, 'A.B' =1, 'X' =2, 'C.D.E' =3, 'A.B.C' => 4 ); Now I want to search $aSubject for the longest key that consists of the first x elements of $aSearch concatenated by "." (where x = 1...count(
4
5293
by: Marek Kubica | last post by:
Hi, I am trying to get this stuff working, but I still fail. I have a format which consists of three elements: \d{4}M?-\d (4 numbers, optional M, dash, another number) EMPTY (the <EMPTYtoken) (the <PAGEBREAKtoken. The line may contain whitespaces, but nothing else)
0
9551
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10279
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10255
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9092
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7582
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6815
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5473
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
2
3765
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2948
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.