greedy match wanted

alexk

Hi,

I would like to request your help.

My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

..... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

Thanks for any help,
Alex.

Jul 18 '05 #1

Subscribe Post Reply

1629

Kent Johnson

alexk wrote:

My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

.... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one
pattern completely matches, that branch is accepted. This means that once A matches, B will not be
tested further, even if it would produce a longer overall match. In other words, the "|" operator is
never greedy."

So putting "com.to" before "com" does what you want.

import re
re.search(r'com|com\.to', 'kuku.com.to').group() 'com' re.search(r'com\.to|com', 'kuku.com.to').group()

'com.to'

Kent

Jul 18 '05 #2

Paul McGuire

This is very similar to some common problems in developing pyparsing
grammars. One of my early attempts at parsing CORBA IDL listed the
valid parameter passing mechanisms as ('in' | 'out' | 'inout'). As you
can imagine, I never successfully matched an 'inout', since the 'in'
match came first. (Pyparsing offers an alternative greedy alternation
operator '^', which tests all alternatives and then selects the longest
match - however, '^' can have performance issues.) Of course, the
simplest recourse is just to change the definition to ('inout' | 'in' |
'out'), but I wanted to provide a more general mechanism (since I was
getting e-mails from pyparsing users also having this same problem).
The simplest approach is to just sort the list by order, longest to
shortest, but this removes any option the developer may want to try to
front-load the list with options that are expected to occur more
frequently (for example, if defining a list of potential relational
operators, forcing ('=' | '<' | '>' | '<=' | '>=' | '!=') to become
('<=' | '>='| '!=' | '=' | '<' | '>') may cause unnecessary tests of
infrequently used operators).

I ended up adding the oneOf helper to pyparsing, which takes a single
string of space-delimited literals, and then returns a list of literals
separated by '|' operators, with literals minimally reordered, that is,
only reorderded if a longer literal masks some shorter literal. Also,
accidental duplicates are stripped. Here is the implementation of oneOf
('|' operators generate a MatchFirst object):

..def oneOf( strs, caseless=False ):
.. """Helper to quickly define a set of alternative Literals, and
makes sure to do
.. longest-first testing when there is a conflict, regardless of
the input order,
.. but returns a MatchFirst for best performance.
.. """
.. if caseless:
.. isequal = ( lambda a,b: a.upper() == b.upper() )
.. parseElementClass = CaselessLiteral
.. else:
.. isequal = ( lambda a,b: a == b )
.. parseElementClass = Literal
..
.. symbols = strs.split()
.. i = 0
.. while i < len(symbols)-1:
.. cur = symbols[i]
.. for j,other in enumerate(symbols[i+1:]):
.. if ( isequal(other, cur) ):
.. del symbols[i+j+1]
.. break
.. elif ( isequal(other[:len(cur)],cur) ):
.. del symbols[i+j+1]
.. symbols.insert(i,other)
.. cur = other
.. break
.. else:
.. i += 1
..
.. return MatchFirst( [ parseElementClass(sym) for sym in symbols ] )
..

You could pre-process your regex with something similar, so you can
avoid having to do this (error-prone) checking yourself.

-- Paul

Jul 18 '05 #3

alexk

Thanks, I'll try your solution.
Alex.

Jul 18 '05 #4

Similar topics

greedy algorithm

by: Jack Smith | last post by:

Hello, any help appreciated with following problem. I figured out the algorithm (I think), just having trouble proving it is optimal. Suppose we are given n tasks each of which takes 1 unit...

Java

re module non-greedy matches broken

by: lothar | last post by:

re: 4.2.1 Regular Expression Syntax http://docs.python.org/lib/re-syntax.html *?, +?, ?? Adding "?" after the qualifier makes it perform the match in non-greedy or minimal fashion; as few...

Python

[RegExp] Making non-greedy; Escaping parentheses?

by: Jane Doe | last post by:

Hello, I need to browse a list of hyperlinks, each followed by an author, and remove the links only for certain authors. 1. I searched the archives on Google, but didn't find how to tell the...

Javascript

regexp non-greedy matching bug?

by: John Hazen | last post by:

I want to match one or two instances of a pattern in a string. According to the docs for the 're' module ( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier is greedy by...

Python

re.match -- not greedy?

by: EXI-Andrews, Jack | last post by:

the '*' and '+' don't seem to be greedy.. they will consume less in order to match a string: ('aa', 'ab') this is the sort of behaviour i'd expect from '(a+?)(ab)' a+ should greedily...

Python

Regex greedy/lazy problem

by: =?Utf-8?B?c2JwYXJzb25z?= | last post by:

I have a scenario where a string is sent in chunks to my app. I need to be able to identify certain tags in this partial string as it arrives. eg <DALFile>xxxxxxxxx</DALFile> I need to be able...

.NET Framework

positive/negative lookahead issue. greedy = problems?

by: vbgunz | last post by:

/* * BEGIN EXAMPLES */ var text = 'A Cats Catalog of Cat Catastrophes and Calamities'; /*** * EXAMPLE 1: negative lookahead assertion logic ***/

Javascript

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server