By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
435,148 Members | 774 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 435,148 IT Pros & Developers. It's quick & easy.

Compare regular expressions

P: n/a
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...
Apr 16 '07 #1
Share this Question
Share on Google+
10 Replies


P: n/a
Thomas Dybdahl Ahle schrieb:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...
It's not. For the simplest of expressions one might come up with a
relation between them, but even that would be hard. General case? No chance.

Diez
Apr 16 '07 #2

P: n/a
On Apr 16, 2:50 am, Thomas Dybdahl Ahle <lob...@gmail.comwrote:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...

What you want is finding if R2 is a superset of R1 for two given
regular languages R1 and R2. I know of some methods for finding
intersection of two regular languages; and I think the time/space
complexity is big.

So the simple answer is it is not feasible to provide such support for
two generic r.e.s without a large time/space usage. You may consult
any of the math/theory groups for more insights.

If you know already R2 >= R1 (that is you precompute and remember),
then it's a trivial to skip checking for R1 if R2 turned up negative.
You can even arrange all the Rs in a binary tree like fashion and skip
checking a whole subtree if the sub-tree's root node gave negative for
r.e. match.

Karthik

Apr 16 '07 #3

P: n/a
On Apr 16, 10:50 am, Thomas Dybdahl Ahle <lob...@gmail.comwrote:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...
you could OR all the individual RE's test them all at once then find
out which matched.

big_re = "|".join( r"(?P<__match_%i__>%s)" % (i, r)
for i,r in enumerate(regexps) )

now if one of the regexps matches then the corresponding named
group should have a non-empty string value.

- Paddy.

Apr 16 '07 #4

P: n/a
On Apr 16, 1:50 pm, "Diez B. Roggisch" <d...@nospam.web.dewrote:
It's not. For the simplest of expressions one might come up with a
relation between them, but even that would be hard. General case? No chance.
I wouldn't say there's 'no chance'. It would require external parsing,
for sure, but if anything, it may only be impossible for unusual
cases.

(Maybe I'll try this the next time I feel like writing parsers for fun
(which is surprisingly frequently).)

Apr 16 '07 #5

P: n/a
Adam Atlas schrieb:
On Apr 16, 1:50 pm, "Diez B. Roggisch" <d...@nospam.web.dewrote:
>It's not. For the simplest of expressions one might come up with a
relation between them, but even that would be hard. General case? No chance.

I wouldn't say there's 'no chance'. It would require external parsing,
for sure, but if anything, it may only be impossible for unusual
cases.
The external parsing of regular expressions is cheap. But once you are
there, how do you proceed? That's where it becomes hard. I can think of
some simple strategies, but nothing that can't be easily found a
counter-example for.

Diez
Apr 17 '07 #6

P: n/a
Paddy schrieb:
On Apr 16, 10:50 am, Thomas Dybdahl Ahle <lob...@gmail.comwrote:
>Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.

Do anybody know if such a test is possible?
if exp0.contains(exp1): ...

you could OR all the individual RE's test them all at once then find
out which matched.

big_re = "|".join( r"(?P<__match_%i__>%s)" % (i, r)
for i,r in enumerate(regexps) )

now if one of the regexps matches then the corresponding named
group should have a non-empty string value.
This doesn't answer the question if two of the sub-expressions matched.

Diez
Apr 17 '07 #7

P: n/a
Den Mon, 16 Apr 2007 11:50:40 +0200 skrev Thomas Dybdahl Ahle:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a
line, as most of the time, one of them having a match means none of the
others can have so.

But ofcource there are also cases where a regular expression can
"contain" another expression, like in: "^strange line (\w+) and (\w+)$"
and "^strange line (\w+) (?:.*?)$" in which case I'd like to first test
the seccond and only if it mathces test the seccond.

Do anybody know if such a test is possible? if exp0.contains(exp1): ...
I found this link: http://terpstra.ca/compare.html which seams to compare
expressions. Not python expressions though.
Sadly it writes nothing about the way it does the thing, and if it will
always work.
Apr 17 '07 #8

P: n/a
On Apr 17, 9:17 am, "Diez B. Roggisch" <d...@nospam.web.dewrote:
Paddy schrieb:
On Apr 16, 10:50 am, Thomas Dybdahl Ahle <lob...@gmail.comwrote:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.
Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.
But ofcource there are also cases where a regular expression can
"contain" another expression, like in:
"^strange line (\w+) and (\w+)$" and "^strange line (\w+) (?:.*?)$" in
which case I'd like to first test the seccond and only if it mathces test
the seccond.
Do anybody know if such a test is possible?
if exp0.contains(exp1): ...
you could OR all the individual RE's test them all at once then find
out which matched.
big_re = "|".join( r"(?P<__match_%i__>%s)" % (i, r)
for i,r in enumerate(regexps) )
now if one of the regexps matches then the corresponding named
group should have a non-empty string value.

This doesn't answer the question if two of the sub-expressions matched.

Diez
Or three, or four...
If the frequencies of occurence are right and the OP wants all, then
he could use the above to get any then follow up with a search for
all
the rest to the right of the matching regexp portion to get any more.
But thats complex. Better to just do individual matches in a loop
I'd think.

- Paddy.

Apr 17 '07 #9

P: n/a
Den Tue, 17 Apr 2007 11:59:15 -0700 skrev Paddy:
On Apr 17, 9:17 am, "Diez B. Roggisch" <d...@nospam.web.dewrote:
>Paddy schrieb:
you could OR all the individual RE's test them all at once then find
out which matched.
big_re = "|".join( r"(?P<__match_%i__>%s)" % (i, r)
for i,r in enumerate(regexps) )

This doesn't answer the question if two of the sub-expressions matched.
Hm, if I were to OR them all the time, then I wouldn't get any boost from
compiling.

I'll probably just stick with a try them all solution, and then change it
if I run into something that does the right thing. I believe it can make
the code something like 7 or 10 times faster.
Apr 17 '07 #10

P: n/a
Thomas Dybdahl Ahle wrote:
Hi, I'm writing a program with a large data stream to which modules can
connect using regular expressions.

Now I'd like to not have to test all expressions every time I get a line,
as most of the time, one of them having a match means none of the others
can have so.
Not what you want, and would be a LOT of work, but if I
remember correctly (from university more than 20 years ago)
there is an algorithm that could be adapted to return a list
of all the regular expressions that match a string.

I thing the base algorithms were documented in Aho and Ullman
("The Dragon book" if I remember correctly). There was one
for converting a regular expression into a Non-deterministic
Finite-state Automaton, and another for converting the NFA
to a Deterministic Finite-state Automaton. The book mentioned
that this also handles multiple regular expressions which can
be treated as the sub-expressions combined using the or operator.
Other, newer books on compiler design would probably contain
similar information in their sections on lexical analysis.

The algorithm given (if my memory is correct) only handled the
case where a true/false match/no-match result was required, but
mentioned how to generalise it to return which of several
expressions separated by or operators was matched. I think
an additional generalisation to return a set of matches rather
than one would be relatively straight-forward.

I believe these algorithms are the basis of lex/flex and
similar utilities, as well as the regular expression handling
of many languages.

Of course, I would like to emphasise that all this is based on
memory of university courses more than 20 years ago, so the
details could be wrong.

Charles
Apr 20 '07 #11

This discussion thread is closed

Replies have been disabled for this discussion.