Bytes | Software Development & Data Engineering Community

Seeking regex optimizer

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields an SRE_Match
object when s starts with an s_i for some i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
..., 'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?
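(For reference, a naive construction that gives longest-match behaviour out of Python's re, which tries alternatives left to right rather than longest-first, is to sort the escaped literals longest-first before joining them. A minimal sketch; the function name is invented for illustration:

```python
import re

# Sketch of the naive approach: escape each literal, then sort longest-first
# so the alternation prefers the longest prefix (Python's engine takes the
# first alternative that matches, scanning left to right).
def make_prefix_regex(strings):
    escaped = sorted((re.escape(s) for s in strings), key=len, reverse=True)
    return re.compile("|".join(escaped))

rx = make_prefix_regex(["a", "aa", "aaab"])
print(rx.match("aaab...").group(0))  # "aaab", the longest matching prefix
```

This does not deduplicate shared prefixes, it merely orders the alternation correctly.)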

Regards,
Kay

Jun 18 '06
Thus spoke an*********@gmail.com (on 2006-06-19 22:51):
It uses Aho-Corasick for the implementation, which is fast and does what
you expect it to do. Nor does it have a problem with matching more than
99 possible strings, as the regexp approach may have.


If you pull the strings into (?>( ... )) (atomic groups),
this wouldn't happen.

http://www.regular-expressions.info/atomic.html
...
Everything between (?>) is treated as one single token by the
regex engine, once the regex engine leaves the group.
Because the entire group is one token, no backtracking can take
place once the regex engine has found a match for the group.
...

Maybe Py.2.5 will have them?

Regards

Mirco
Jun 19 '06 #11
Replying to me, Mirco Wahab wrote:
If you pull the strings into (?>( ... )) (atomic groups),
this wouldn't happen.


Given that Python's re engine doesn't support this feature
it doesn't really help the original poster's problem.

Even if some future Python did support it, the limit
to 100 named groups is unaffected by backtracking.
>>> import re
>>> re.compile("(.)"*100)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 180, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 225, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre_compile.py",
line 506, in compile
    raise AssertionError(
AssertionError: sorry, but this version only supports 100 named groups


There was no backtracking in "(.)"*100.

Andrew
da***@dalkescientific.com

Jun 19 '06 #12

an*********@gmail.com wrote:
Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n].


Why do you want to use a regex for this?


Because it is part of a tokenizer that already uses regexps and I do
not intend to rewrite / replace it. Certain groups of tokens
(operators, braces and special characters) should be user extensible.
All others will stay as they are. I found that certain groups of tokens
might be represented in a more compact manner: for matching ['+',
'++'] one might generate '\+|\+\+' or '\+\+?' and I wanted to know if
there is some generic approach to solve the "inverse regexp" problem in
a non-trivial fashion.
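To make the ordering issue in the ['+', '++'] example concrete: Python's engine takes the first alternative that matches, not the longest one, so the order of the alternation matters:

```python
import re

# The first alternative wins, even if a later one would match more text.
print(re.match(r"\+|\+\+", "++").group(0))   # '+'  - short alternative listed first
print(re.match(r"\+\+|\+", "++").group(0))   # '++' - longest alternative listed first
print(re.match(r"\+\+?", "++").group(0))     # '++' - greedy optional quantifier
```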

Regards,
Kay

Jun 20 '06 #13
Kay Schluehr replied to my question:
Why do you want to use a regex for this?
Because it is part of a tokenizer that already uses regexps and I do
not intend to rewrite / replace it.


Switching to pytst is not a big change - there will be little
impact on the rest of your code. On the other hand, it does
change your build and deployment process because of the
need for another package.

Because you are constraining yourself to regex there isn't
much of an improvement you can do.
I found that certain groups of tokens
might be represented in a more compact manner: for matching ['+',
'++'] one might generate '\+|\+\+' or '\+\+?' and I wanted to know if
there is some generic approach to solve the "inverse regexp" problem in
a non-trivial fashion.
There are a few questions which come up. Why do you want to
have a "more compact" representation? Do you think it will
make the result faster? Likely it will, because of the reduced
need for backtracking; but unless you have a lot of strings with
the same prefix, the number of backtracks should grow as
something like the log of the number of words, which scales
pretty well.

Are you speed limited by the existing implementation? Likely
not. In which case the performance isn't an issue. Consider
this anecdote from http://www.technicat.com/writing/programming.html

I was asked in a phone interview with Google how I would
search for a number in an array of ordered numbers.
Obviously, the questioner was asking for a CS 101 answer -
binary search. But in real life, I would probably do the "wrong"
thing - search the array from beginning to end. There's no point
in taking twice the time to write twice as much code that has
to be maintained and debugged if the application performance
is good enough, and in particular if that piece of code is not
the bottleneck in the application. (And I seriously doubt you'd
have that data laid out linearly in a fixed array like that if it
was the bottleneck.)

As it turns out, the regexp implementation could do the optimization
you want. That is, it could recognize that the input sequence is
a set of fixed words (no special characters) and implement something
like Aho-Corasick, in which case you wouldn't need to change
your input.

I don't know if Python's regexp implementation does this already.
Neither do you. Test the functionality and performance first
before starting other work.

I am not the first to give you this advice:

John Machin: "Optimised with respect to what? speed? ease of construction?"
gr*@ll.mit.edu: "As others have suggested, you should first try the most naive
implementation before making a hard optimization problem out of this."

As for the general question of the existence of a "generic approach
to solve the 'inverse regexp' problem in a non-trivial fashion":

Yes, there is. Kind of. All (formal/theoretical) regular expressions
can be written as a regular expression without need for backtracking.
If you look up the theory of regular expressions, backtracking occurs
when there are ambiguities in traversing the state machine. These
are called nondeterministic finite state machines, or NFAs. It turns
out that every NFA can be written as a DFA, which has no need for
backtracking.
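The "no backtracking" property can be illustrated without a full subset construction. For a set of literal words anchored at the start of the input, tracking the set of live (word, position) states is essentially what the NFA-to-DFA idea buys you. A toy sketch (all names invented here), scanning the input exactly once:

```python
# Simulate the "set of live states" idea behind NFA->DFA subset construction,
# specialized to a set of non-empty literal words matched at the start of the
# text. Each live state is a (word, position) pair; advancing over one input
# character never backtracks.
def longest_match(words, text):
    live = [(w, 0) for w in words if w]
    best = None
    for ch in text:
        nxt = []
        for w, pos in live:
            if w[pos] == ch:
                if pos + 1 == len(w):
                    best = w  # completed word; keep scanning for longer ones
                else:
                    nxt.append((w, pos + 1))
        live = nxt
        if not live:
            break
    return best

print(longest_match(["a", "aa", "aaab"], "aaab..."))  # "aaab"
```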

Except. Actually, two excepts. The normal algorithm for converting
an NFA to a DFA doesn't support grouping, which you'll need to
figure out which group matched. Wikipedia says there is an alternate,
slower algorithm which does support groups.

The second except is that regexps in Python, Perl, etc. are not
regular expressions. Some patterns, like r"([A-Z][a-z]+)\1", which
matches "WikiWiki", are not "regular grammars" under the definition
used in language theory. It's possible to use regexps to test whether
a string's length is prime, for example.
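The prime-length trick is the well-known folklore pattern (usually credited to Abigail on comp.lang.perl.misc): a unary string of composite length matches, so primality shows up as a failure to match. Backreferences make this possible, which is exactly why such "regexes" exceed regular grammars:

```python
import re

# "1" * n fails to match this pattern exactly when n is prime: the
# backreference \1 forces the string to be some block of >= 2 ones
# repeated >= 2 times, i.e. a composite length.
def is_prime(n):
    return not re.match(r"^1?$|^(11+?)\1+$", "1" * n)

print([n for n in range(2, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```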

So in the general case there is no solution. There are many
possible optimizations, but no general solution.

For your problem, which is a limited subset of the problem as
it involves only constant strings, then the solution is a prefix
tree, also called a trie. See http://en.wikipedia.org/wiki/Prefix_tree
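A longest-prefix lookup over a trie takes only a few lines of plain Python (dict-of-dicts; the names are illustrative, not from any particular library):

```python
# Minimal trie sketch for longest-prefix matching over literal strings.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w  # mark the end of a stored word
    return root

def longest_prefix(trie, s):
    node, best = trie, None
    for ch in s:
        if ch not in node:
            break
        node = node[ch]
        if "$" in node:
            best = node["$"]  # remember the longest word completed so far
    return best

trie = build_trie(["a", "aa", "aaab", "+", "++"])
print(longest_prefix(trie, "aaab..."))  # "aaab"
print(longest_prefix(trie, "++x"))      # "++"
```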

John Machin in an earlier post suggested you search along these
lines when he said:

I would have thought the way to approach this would be a simple
character-by-character tree/trie-driven lookup. This would be worst case
O(n) where n is the length of the longest string in your list. There may
well be a Python-callable gadget for this on the net somewhere. Google
"Danny Yoo ahocorasick" for a Python-callable solution to a similar but
more complex problem.


For that matter I suggested it when I pointed you to pytst. The tst
means "ternary search tree" (some write it "ternary search trie") which
is a different implementation of the same idea.

Your posts read like you ignored those pointers because they didn't
fall into the category "implemented with a regular expression". If
that's the case then no, there is no solution. You're letting a sense
of purity overwhelm practicality.

If you are really interested in this problem, try taking a CS
course titled something like "Automata Theory" or "Foundations
of Computer Science". The one which covers DFAs, NFAs,
CFGs, Turing machines, etc. It's the CS class I enjoyed the most.

Andrew
da***@dalkescientific.com

Jun 20 '06 #14
Thus spoke an*********@gmail.com (on 2006-06-20 01:39):

Hi, are you the A.Dalke from the Schulten group (VMD) as
listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi
Replying to me Mirco Wahab wrote:
If you pull the strings into (?>( ... )) (atomic groups),
this would't happen.
Given that Python's re engine doesn't support this feature
it doesn't really help the original poster's problem.


Yeah, right. I misunderstood the problem - it was meant literally
as a problem of capture group ids, but I took it metaphorically,
as 'problems caused by backtracking' (which was wrong).
Even if some future Python did support it, the limit
to 100 named groups is unaffected by backtracking.


Right for that one, what I had in mind was something
like the following:

# naive regex '\d+9'
# find some number only if it ends by 9
my $str = "1009900000000000000000000000000000000000";

print "non-atomic backtracks: $1\n" if $str =~ / ( \d+9 )/x;
print "atomic: matches $1\n" if $str =~ / ( (?>\d+)9 )/x;

Second line would not backtrack and not find the 9 behind \d+
because atomic grouping prevents that, so line 1 matches
10099 and line 2 nothing (but very fast, which is the point).
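As a side note for later readers: atomic groups did eventually land in Python's re module, in 3.11 rather than 2.5. The Perl example above translates directly (the atomic line requires Python >= 3.11, hence the version guard):

```python
import re
import sys

s = "10099" + "0" * 35

# Non-atomic: \d+ first swallows every digit, then backtracks until the
# trailing 9 can match, so this succeeds.
print(re.match(r"\d+9", s).group(0))      # '10099'

if sys.version_info >= (3, 11):           # atomic groups need Python 3.11+
    # Atomic: (?>\d+) refuses to give back any digits, so the 9 can never
    # match and the whole pattern fails fast.
    print(re.match(r"(?>\d+)9", s))       # None
```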
>>> import re
>>> re.compile("(.)"*100)
Traceback (most recent call last):
...
AssertionError: sorry, but this version only supports 100 named groups


I tried:

my $re = '(\d+)\D' x 10000;
my $text = join ' ', 0..10000;

print $+ if $text =~ /$re/;

print "\n", $1001;

which prints:
9999
1000

Why is there a named group count
restriction in Python? Any reasons
one should know?

Regards & thanks

Mirco
Jun 20 '06 #15
Mirco Wahab wrote:
Hi, are you the A.Dalke from the Schulten group (VMD) as
listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi
Yes. But I left there nearly a decade ago.
# naive regex '\d+9'
# find some number only if it ends by 9
my $str = "1009900000000000000000000000000000000000";

print "non-atomic backtracks: $1\n" if $str =~ / ( \d+9 )/x;
print "atomic: matches $1\n" if $str =~ / ( (?>\d+)9 )/x;
What makes me less than enthusiastic about various new regexp
features is that they can be written other ways. For example,
>>> import re
>>> pat = re.compile("([0-8]*9)+")
>>> pat.match("1009900000000000000000000000000000000000")
<_sre.SRE_Match object at 0x590a0>
>>> _.group(0)
'10099'
>>> pat = re.compile("([0-8]*9)+(?!\d)")
>>> pat.match("1009900000000000000000000000000000000000")

They are different ways to reach the same goal. Understanding
the alternatives requires some non-trivial understanding of
how regexp engines work. The diversity of solutions, with no
clear preference of one over the other, overly promotes
idiosyncratic patterns and requires that everyone reading the
regexps understand all the special syntax constructions
even when a person's working vocabulary only uses a subset
of those.
Second line would not backtrack and not find the 9 behind \d+
because atomic grouping prevents that, so line 1 matches
10099 and line 2 nothing (but very fast, which is the point).
My pattern is not as fast - it's linearly slower than yours,
as it does the backtracking (unless the engine is very smart).
That factor is not something I'm concerned with.

I can also get performance without backtracking (though there
is setup for backtracking) using

>>> pat = re.compile("([0-8]*9)+([0-8]*)")
>>> m = pat.match("1009900000000000000000000000000000000000")
>>> if m.group(2): print "does not end in a 9"
...
does not end in a 9
>>> m = pat.match("10099000000000000000000000000000000000009a")
>>> if m.group(2): print "does not end in a 9"
...

Why is there a named group count
restriction in Python? Any reasons
one should know?


While not definite, I think it introduces an ambiguity with octal
escape codes. (*That's* something to omit in Python 3000!)
Here's the quote from perlre:

There is no limit to the number of captured substrings that you may
use. However Perl also uses \10, \11, etc. as aliases for \010, \011,
etc. (Recall that 0 means octal, so \011 is the character at number 9
in your coded character set; which would be the 10th character, a
horizontal tab under ASCII.) Perl resolves this ambiguity by
interpreting \10 as a backreference only if at least 10 left
parentheses have opened before it. Likewise \11 is a backreference
only if at least 11 left parentheses have opened before it. And so
on. \1 through \9 are always interpreted as backreferences.

Python doesn't have "\10", "\11", etc. as aliases for "\010", "\011", etc.

>>> re.compile("\10")
<_sre.SRE_Pattern object at 0x575c0>
>>> pat = re.compile(r"\10")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 180, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 227, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape: '\\10'
>>> pat = re.compile(r"\010")

so there is no ambiguity until reaching \100.

>>> pat = re.compile(r"\100")
>>> pat.match("@")
<_sre.SRE_Match object at 0x232560>


Most likely /F (who implemented the regex engine) decided
to refuse the temptation to guess the meaning, eg by
parens counting as Perl does.
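Python's way around the octal-versus-backreference ambiguity in replacement strings is the explicit \g<number> syntax, which has no digit-count limit and no octal reading:

```python
import re

# Ten capture groups; "\g<10>" in the replacement unambiguously means
# group 10, and can never be misread as the octal escape \010.
pat = re.compile("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)")
print(pat.sub(r"\g<10>", "abcdefghij"))  # prints "j"
```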

Try this variation of your program to see how Perl changes
the meaning of "\100" based on the input string

for (my $i=98; $i<=103; $i++) {
my $re = '(\d+)\D' x $i . '\100';
my $text = join ' ', 0..$i;
$text .= " @";

print "Testing $i\n";
if ($text =~ /$re/) {
print " Number of matches: $+\n";
print " Last group: ${$i}\n";
} else {
print "No matches\n";
}
}

I get

Testing 98
Number of matches: 98
Last group: 98
Testing 99
Number of matches: 99
Last group: 99
Testing 100
No matches
Testing 101
No matches
Testing 102
No matches
Testing 103
No matches
Andrew
da***@dalkescientific.com

Jun 20 '06 #16
