468,170 Members | 1,905 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 468,170 developers. It's quick & easy.

greedy match wanted

Hi,

I would like to request your help.

My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

..... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

Thanks for any help,
Alex.

Jul 18 '05 #1
3 1439
alexk wrote:
My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

.... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?


According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one
pattern completely matches, that branch is accepted. This means that once A matches, B will not be
tested further, even if it would produce a longer overall match. In other words, the "|" operator is
never greedy."

So putting "com.to" before "com" does what you want.
import re
re.search(r'com|com\.to', 'kuku.com.to').group() 'com' re.search(r'com\.to|com', 'kuku.com.to').group()

'com.to'

Kent
Jul 18 '05 #2
This is very similar to some common problems in developing pyparsing
grammars. One of my early attempts at parsing CORBA IDL listed the
valid parameter passing mechanisms as ('in' | 'out' | 'inout'). As you
can imagine, I never successfully matched an 'inout', since the 'in'
match came first. (Pyparsing offers an alternative greedy alternation
operator '^', which tests all alternatives and then selects the longest
match - however, '^' can have performance issues.) Of course, the
simplest recourse is just to change the definition to ('inout' | 'in' |
'out'), but I wanted to provide a more general mechanism (since I was
getting e-mails from pyparsing users also having this same problem).
The simplest approach is to just sort the list by order, longest to
shortest, but this removes any option the developer may want to try to
front-load the list with options that are expected to occur more
frequently (for example, if defining a list of potential relational
operators, forcing ('=' | '<' | '>' | '<=' | '>=' | '!=') to become
('<=' | '>='| '!=' | '=' | '<' | '>') may cause unnecessary tests of
infrequently used operators).

I ended up adding the oneOf helper to pyparsing, which takes a single
string of space-delimited literals, and then returns a list of literals
separated by '|' operators, with literals minimally reordered, that is,
only reorderded if a longer literal masks some shorter literal. Also,
accidental duplicates are stripped. Here is the implementation of oneOf
('|' operators generate a MatchFirst object):

..def oneOf( strs, caseless=False ):
.. """Helper to quickly define a set of alternative Literals, and
makes sure to do
.. longest-first testing when there is a conflict, regardless of
the input order,
.. but returns a MatchFirst for best performance.
.. """
.. if caseless:
.. isequal = ( lambda a,b: a.upper() == b.upper() )
.. parseElementClass = CaselessLiteral
.. else:
.. isequal = ( lambda a,b: a == b )
.. parseElementClass = Literal
..
.. symbols = strs.split()
.. i = 0
.. while i < len(symbols)-1:
.. cur = symbols[i]
.. for j,other in enumerate(symbols[i+1:]):
.. if ( isequal(other, cur) ):
.. del symbols[i+j+1]
.. break
.. elif ( isequal(other[:len(cur)],cur) ):
.. del symbols[i+j+1]
.. symbols.insert(i,other)
.. cur = other
.. break
.. else:
.. i += 1
..
.. return MatchFirst( [ parseElementClass(sym) for sym in symbols ] )
..

You could pre-process your regex with something similar, so you can
avoid having to do this (error-prone) checking yourself.

-- Paul

Jul 18 '05 #3
Thanks, I'll try your solution.
Alex.

Jul 18 '05 #4

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

6 posts views Thread by Jack Smith | last post: by
12 posts views Thread by lothar | last post: by
8 posts views Thread by John Hazen | last post: by
6 posts views Thread by EXI-Andrews, Jack | last post: by
3 posts views Thread by =?Utf-8?B?c2JwYXJzb25z?= | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.