473,396 Members | 1,783 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

greedy match wanted

Hi,

I would like to request your help.

My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

..... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

Thanks for any help,
Alex.

Jul 18 '05 #1
3 1629
alexk wrote:
My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

.... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?


According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one
pattern completely matches, that branch is accepted. This means that once A matches, B will not be
tested further, even if it would produce a longer overall match. In other words, the "|" operator is
never greedy."

So putting "com.to" before "com" does what you want.
import re
re.search(r'com|com\.to', 'kuku.com.to').group() 'com' re.search(r'com\.to|com', 'kuku.com.to').group()

'com.to'

Kent
Jul 18 '05 #2
This is very similar to some common problems in developing pyparsing
grammars. One of my early attempts at parsing CORBA IDL listed the
valid parameter passing mechanisms as ('in' | 'out' | 'inout'). As you
can imagine, I never successfully matched an 'inout', since the 'in'
match came first. (Pyparsing offers an alternative greedy alternation
operator '^', which tests all alternatives and then selects the longest
match - however, '^' can have performance issues.) Of course, the
simplest recourse is just to change the definition to ('inout' | 'in' |
'out'), but I wanted to provide a more general mechanism (since I was
getting e-mails from pyparsing users also having this same problem).
The simplest approach is to just sort the list by order, longest to
shortest, but this removes any option the developer may want to try to
front-load the list with options that are expected to occur more
frequently (for example, if defining a list of potential relational
operators, forcing ('=' | '<' | '>' | '<=' | '>=' | '!=') to become
('<=' | '>='| '!=' | '=' | '<' | '>') may cause unnecessary tests of
infrequently used operators).

I ended up adding the oneOf helper to pyparsing, which takes a single
string of space-delimited literals, and then returns a list of literals
separated by '|' operators, with literals minimally reordered, that is,
only reorderded if a longer literal masks some shorter literal. Also,
accidental duplicates are stripped. Here is the implementation of oneOf
('|' operators generate a MatchFirst object):

..def oneOf( strs, caseless=False ):
.. """Helper to quickly define a set of alternative Literals, and
makes sure to do
.. longest-first testing when there is a conflict, regardless of
the input order,
.. but returns a MatchFirst for best performance.
.. """
.. if caseless:
.. isequal = ( lambda a,b: a.upper() == b.upper() )
.. parseElementClass = CaselessLiteral
.. else:
.. isequal = ( lambda a,b: a == b )
.. parseElementClass = Literal
..
.. symbols = strs.split()
.. i = 0
.. while i < len(symbols)-1:
.. cur = symbols[i]
.. for j,other in enumerate(symbols[i+1:]):
.. if ( isequal(other, cur) ):
.. del symbols[i+j+1]
.. break
.. elif ( isequal(other[:len(cur)],cur) ):
.. del symbols[i+j+1]
.. symbols.insert(i,other)
.. cur = other
.. break
.. else:
.. i += 1
..
.. return MatchFirst( [ parseElementClass(sym) for sym in symbols ] )
..

You could pre-process your regex with something similar, so you can
avoid having to do this (error-prone) checking yourself.

-- Paul

Jul 18 '05 #3
Thanks, I'll try your solution.
Alex.

Jul 18 '05 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
by: Jack Smith | last post by:
Hello, any help appreciated with following problem. I figured out the algorithm (I think), just having trouble proving it is optimal. Suppose we are given n tasks each of which takes 1 unit...
12
by: lothar | last post by:
re: 4.2.1 Regular Expression Syntax http://docs.python.org/lib/re-syntax.html *?, +?, ?? Adding "?" after the qualifier makes it perform the match in non-greedy or minimal fashion; as few...
3
by: Jane Doe | last post by:
Hello, I need to browse a list of hyperlinks, each followed by an author, and remove the links only for certain authors. 1. I searched the archives on Google, but didn't find how to tell the...
8
by: John Hazen | last post by:
I want to match one or two instances of a pattern in a string. According to the docs for the 're' module ( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier is greedy by...
6
by: EXI-Andrews, Jack | last post by:
the '*' and '+' don't seem to be greedy.. they will consume less in order to match a string: ('aa', 'ab') this is the sort of behaviour i'd expect from '(a+?)(ab)' a+ should greedily...
3
by: =?Utf-8?B?c2JwYXJzb25z?= | last post by:
I have a scenario where a string is sent in chunks to my app. I need to be able to identify certain tags in this partial string as it arrives. eg <DALFile>xxxxxxxxx</DALFile> I need to be able...
5
by: vbgunz | last post by:
/* * BEGIN EXAMPLES */ var text = 'A Cats Catalog of Cat Catastrophes and Calamities'; /*** * EXAMPLE 1: negative lookahead assertion logic ***/
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.