473,769 Members | 2,345 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Seeking regex optimizer

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith( s_1) -> True or
s_k.endswith(s_ 1) -> True. An extreme case would be ls = ['a', 'aa',
....,'aaaa...ab ']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay

Jun 18 '06 #1
15 3232

Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith( s_1) -> True or
s_k.endswith(s_ 1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay


A start would be:
regexp = "^(" + "|".join(sorted (ls, reverse=True)) + ")"
But the above does not work if you have special characters in your
strings.

You say you want something that is optimised. What have have you tried?
- Pad.

Jun 18 '06 #2
Paddy wrote:
Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith( s_1) -> True or
s_k.endswith(s_ 1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay
A start would be:
regexp = "^(" + "|".join(sorted (ls, reverse=True)) + ")"
But the above does not work if you have special characters in your
strings.


For special characters there might be a workaround using escapes. This
is indeed important, but I do think one should split the problem into
separate parts.
You say you want something that is optimised. What have have you tried?


Sorting the list and checking for successor inclusions. Say you have ls
= ['x','a', 'aa', 'aab' ,'ab']

This can be mapped to:

'x|a(?:(?:ab)?| b?|a?)'

or to:

'^(x|ab|aab|aa| a)'

with reverse sorting as in your proposal.The naive solution is easy to
generate but I'm sceptical about its cost effectiveness. On the other
hand I do not want to investigate this matter if somebody else already
did it thoroughly.

Regards,
Kay

Jun 18 '06 #3
On 19/06/2006 6:30 AM, Paddy wrote:
Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith( s_1) -> True or
s_k.endswith(s_ 1) -> True.
Kay, what is the relevance of one string being a suffix of another? I
don't see how that could affect the outcome.

An extreme case would be ls = ['a', 'aa', ...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?
Optimised with respect to what? speed? ease of construction?

I presume that you will ensure that the members of the list are unique.

Note that the Python regex engine will consider each candidate in
Paddy's solution left to right until it gets a match or reaches the end
(that's why the reverse sort is needed to get longest possible match).
This is worst-case O(N) where N is the total of the lengths of all the
strings in your list.

As far as I can see, this is the only basic solution (modulo one or two
twiddles -- see below) using Python's re.

You could possibly consider producing "zzz|foo(?:bar) ?|aaa" instead of
"zzz|foobar|foo |aaa" -- but whether that would run sufficiently faster
to offset the cost of construction is anybody's guess.

How many strings in your list? Average/maximum length? Likelihood of
ls[i].startswith(ls[j]) == True? unicode or str?

Your requirements are rather constrained: "sx.match(s ) yields a
SRE_Match object" ... why do you need this? Surely all you need is
matched_len (which may be zero) such that s[:matched_len] is the matched
prefix.

I would have thought the way to approach this would be a simple
character-by-character tree/trie-driven lookup. This would be worst case
O(n) where n is the length of the longest string in your list. There may
well be a Python-callable gadget for this on the net somewhere. Google
"Danny Yoo ahocorasick" for a Python-callable solution to a similar but
more complex problem.

A cheap hack using Python's re: divide the problem according to first
character:

prefix_dict_mat ch = {
'a': re.compile('alp aca|alligator') .match,
'f': re.compile('foo bar|foo').match ,
'z': re.compile('zoo |zebra').match,
}
if s and s[0] in prefix_dict_mat ch:
match_obj = prefix_dict_mat ch[s[0]](s)
else:
match_obj = None

Regards,
Kay


A start would be:
regexp = "^(" + "|".join(sorted (ls, reverse=True)) + ")"
But the above does not work if you have special characters in your
strings.


Paddy, fixing that problem, and "optimising " by removing the redundant
^() metacharacters:

regexp = "|".join(map(re .escape, sorted(ls, reverse=True)))
Hoping some of this helps,
Regards,
John
Jun 18 '06 #4
Thus spoke Kay Schluehr (on 2006-06-18 19:07):
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith( s_1) -> True or
s_k.endswith(s_ 1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.


With some ideas from Kay and Paddy, it tried to get
along with Python in doing this.

If its allowed to spread the individual strings
into alterations, the following would imho do:
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
text = r'this is a text containing aaaöüöaaaµaaa and more';
lstr = [
'a',
'aa',
'aaaaa',
'aaaöüöaaaµaaa' ,
'aaaaaaaaaaaaaa ab'
]

import re
pat = re.compile( \
'(' + \
'|'.join(sorted (lstr,lambda a,b: len(b)-len(a))) + \
')', re.VERBOSE);

hits = sorted( pat.findall(tex t), lambda a,b: len(b)-len(a) )
print 'Longest: ', hits[0]

This will print 'aaaöüöaaaµaaa' from the text
and won't complain about specuial characters used.

in Perl, you could build up the regex
by dynamic evaluation (??{..}), but I
didn't manage to get this working in Python,
so here is (in Perl) how I thougt it would work:

my $text = "this is a text containing aaaöüöaaaµaaa and more";
my @lstr = (
'a',
'aa',
'aaaaa',
'aaaöüöaaaµaaa' ,
'aaaaaaaaaaaaaa ab',
);

my $re = qr{
(??{
join '|',
map { quotemeta }
sort{ length $b <=> length $a }
@lstr
})
}x;

$_ = $text;
print "Longest: ", (my $hit) = reverse sort /$re/g;
Maybe the experts can bring some light to it.

Regards

Mirco
Jun 18 '06 #5
Kay Schluehr wrote:
<SNIP>
with reverse sorting as in your proposal.The naive solution is easy to
generate but I'm sceptical about its cost effectiveness. On the other
hand I do not want to investigate this matter if somebody else already
did it thoroughly.

Regards,
Kay


Hi Kay,
The only way to know if something is optimised enough, is for you to
test it in your application. If you haven't finished your application,
then don't sweat it. take a note of your options, stick one in, then
get the application finished.
Its only when its finished that you can really say if further
optimisations are necessary, and you would have to profile the complete
prog to see where it's spending its time.

I usually find myself using string methods where possible, then regular
expressions only for things that string methods cannot do. When I
finish my script I usually find that Pythons speed is adequate for most
of my text processing tasks, or if speed IS an issue, then i needed to
profile the completed application anyway.

- Pad.

Jun 18 '06 #6
Mirco,

with "special characters" I mentioned control characters of regular
expressions i.e. one of ".^$()?[]{}\|+*" but not non ascii-127
characters.

For a workaround you simply have to "mangle" those using an escape
control character:

REGEXCHAR = ".^$()?[]{}\\|+*"
def mangle(s):
pattern = []
for c in s:
if c in REGEXCHAR:
pattern.append( "\\")
pattern.append( c)
return "".join(pattern )

Regards,
Kay

Jun 19 '06 #7
On 19/06/2006 7:06 PM, Kay Schluehr wrote:
Mirco,

with "special characters" I mentioned control characters of regular
expressions i.e. one of ".^$()?[]{}\|+*" but not non ascii-127
characters.

For a workaround you simply have to "mangle" those using an escape
control character:

REGEXCHAR = ".^$()?[]{}\\|+*"
def mangle(s):
pattern = []
for c in s:
if c in REGEXCHAR:
pattern.append( "\\")
pattern.append( c)
return "".join(pattern )


What's wrong with re.escape()?
Have you not read (a) my response to Paddy's first posting (b) the manual?
Jun 19 '06 #8
gry

Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith( s_1) -> True or
s_k.endswith(s_ 1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.
In a very similar case I used a simple tree of dictionaries, one node
per letter, to represent the strings.
This naturally collapses cases like ['a','aa','aaa']. Then a recursive
function finds
the desired prefix. This was WAY faster than the "re" module for large
n (tradeoff point for me was n~1000). It requires a bit more coding,
but I think it is the natural data structure for this problem.

As others have suggested, you should first try the most naive
implementation before making a hard optimization problem out of this.
Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay


Jun 19 '06 #9
Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n].


Why do you want to use a regex for this? When you have constant
strings there are other solutions. For example, this uses Nicolas
Lehuen's
pytst module from http://nicolas.lehuen.com/index.php/Pytst
import tst
tree = tst.TST()
tree["aa"] = (1, "String with 'aa'")
tree["aahhh"] = (2, "At the doctor's")
tree["a"] = (3, "indefinite article")
tree.scan("This is a bunch of text. Have you seen an aardvark?", .... tst.TupleListAc tion())
[('This is ', -8, None), ('a', 1, (3, 'indefinite article')), (' bunch
of text. H', -18, None), ('a', 1, (3, 'indefinite article')), ('ve you
seen ', -12, None), ('a', 1, (3, 'indefinite article')), ('n ', -2,
None), ('aa', 2, (1, "String with 'aa'")), ('rdv', -3, None), ('a', 1,
(3, 'indefinite article')), ('rk?', -3, None)]


Each returned tuple has three elements:
For a substring which is not a match these are:
- the unmatched substring
- the negative length of the substring
- None

For a substring which matches:
- the matched substring
- the positive length of the substring
- the value of the match in the TST (the tree)

It uses Aho-Corasick for the implementation which is fast and does what
you expect it to do. Nor does it have a problem of matching more than
99 possible strings as the regexp approach may have.

Andrew
da***@dalkescie ntific.com

Jun 19 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
2873
by: Josef Sachs | last post by:
Is Andrew Kuchling's regex-to-re HOWTO available anywhere? I've found the following (dead) links on various Web pages: http://py-howto.sourceforge.net/regex-to-re/regex-to-re.html http://starship.skyport.net/crew/amk/regex/regex-to-re.html http://www.python.org/doc/howto/regex-to-re/ http://www.amk.ca/python/howto/regex-to-re/ Thanks in advance.
9
4750
by: hemal | last post by:
I came across a very strange situation at work. There is an order of magnitude difference in execution time for the following two queries (10 v/s ~130 msec): select count(*) from table_name where column_name = 'value' or 1 = 0 select count(*) from table_name where column_name = 'value' I do not want to go into the reason why the redundent condition exists, the example is representative of the real query where it
5
2277
by: bluesrift | last post by:
Using the WYSIWYG contenteditable feature, Internet Explorer will often add a style to the image tag to define its display size after the image has been dragged to display at something other than its natural size. For example: style="WIDTH: 432px; HEIGHT: 344px" The values contained within the style are the correct ones resulting from the drag and override the standard img tag height= and width= parameters still remaining in the image...
3
2096
by: Peter Arrenbrecht | last post by:
Hi all We ran into a very annoying optimizer problem recently. We had added a new key and self-join relation to a large production table. That key will be filled very rarely and having just being added was practically never filled in when the first user tried to delete a row from the table. Now, the optimizer tried to enforce RI on the new relation. But instead of using the index generated by the relation, it used a table scan!
2
8317
by: UJ | last post by:
Does anybody have lying around a Regex for a strong password (or a routine to check it). I'd like the normal First character has to be a character, total of 8 characters and has to have at least one number in it. TIA - Jeff.
3
2376
by: aj | last post by:
DB2 LUW v8.2 FP 14 RHAS 2.1 Sorry if these are newbie questions. Optimizer stuff is black magic to me. For both of these, assume stats are current and an even distribution of data. ------------------------- Lets say I have a table FOO1 that has, among other columns, a column named A. There is a non-unique index on A that has medium selectivity. Lets also say I have a table FOO2 that has, among other columns, a column named B. ...
4
2185
by: Henrik Dahl | last post by:
Hello! In my application I have a need for using a regular expression now and then. Often the same regular expression must be used multiple times. For performance reasons I use the RegexOptions.Compiled when I instantiate it. It must be obvious that it takes some time to instantiate such an object. My question is, does the Regex instantiation somehow deal with some caching internally so instantiating a Regex object multiple times...
16
2253
by: Mark Chambers | last post by:
Hi there, I'm seeking opinions on the use of regular expression searching. Is there general consensus on whether it's now a best practice to rely on this rather than rolling your own (string) pattern search functions. Where performance is an issue you can alway write your own specialized routine of course. However, for the occasional pattern search where performance isn't an issue, would most seasoned .NET developers rely on "Regex" and...
0
1012
by: Reedick, Andrew | last post by:
<snip> I have a Perl background and have found the O'Reilly books to be useful. The Learning Python book (or whatever it's called) is good because it covers the paradigm shifts and potential gotchas that you won't even consider thinking about otherwise. Only downside is wading through the novice 'how to program' parts. The Cookbook is also good for getting 'standard' apps up and running quickly (meaning you know how to do it in Perl,...
0
9589
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9423
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10050
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9999
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8876
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7413
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6675
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
1
3967
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
2815
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.