469,631 Members | 1,681 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,631 developers. It's quick & easy.

Regular Expression: Matching substring

Hi,

I'm currently running into a confusion on regex and hopefully you guys
can clear it up for me.

Suppose I have a regular expression (0|(1(01*0)*1))* and two test
strings: 110_1011101_ and _101101_1. (The underscores are not part of
the string. They are added to show that both string has a substring
that matches the pattern.) Applying a match() function on the first
string returns true while false for the second. The difference is the
first one has unmatched chunk in the beginning while the second at the
end. How's the regex rule work here?

Thanks.

Apr 13 '06 #1
7 3070
Hi Kevin,

You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
most
three characters of a string must not be ``101" while not followed by
an `0'.
After reading the first `1', automata expects `1' or ``00" or ``010"
or ``11",
right?:)
Kevin CH 寫道:
Hi,

I'm currently running into a confusion on regex and hopefully you guys
can clear it up for me.

Suppose I have a regular expression (0|(1(01*0)*1))* and two test
strings: 110_1011101_ and _101101_1. (The underscores are not part of
the string. They are added to show that both string has a substring
that matches the pattern.) Applying a match() function on the first
string returns true while false for the second. The difference is the
first one has unmatched chunk in the beginning while the second at the
end. How's the regex rule work here?

Thanks.


Apr 13 '06 #2

Leon wrote:
Hi Kevin,

You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
most
three characters of a string must not be ``101" while not followed by
an `0'.
After reading the first `1', automata expects `1' or ``00" or ``010"
or ``11",
right?:) Why it must expect "010"? Why not say "0110", since 1* can represent 0
or more repetitions.



Kevin CH 寫道:
Hi,

I'm currently running into a confusion on regex and hopefully you guys
can clear it up for me.

Suppose I have a regular expression (0|(1(01*0)*1))* and two test
strings: 110_1011101_ and _101101_1. (The underscores are not part of
the string. They are added to show that both string has a substring
that matches the pattern.) Applying a match() function on the first
string returns true while false for the second. The difference is the
first one has unmatched chunk in the beginning while the second at the
end. How's the regex rule work here?

Thanks.


Apr 13 '06 #3
You are right. In fact the procedure is as follows:
The substr ``101101" is no problem, if stop here, match will
successful.
But the tailing `1' occurs, so we may imagine the working automata move
to a state, which according to the regexp's outer most `)', and ready
to repeat
the whole regexp again. In this case, the answer is ``yes" only when
there exists
at least two ``1", but only one here.

BTW, the first string is matched exactly, according to your notion, it
should be written as: _11_0_1011101

Apr 13 '06 #4
Oh yea, I see what's my confusion now. In the first string, I didn't
consider 11 and 0 matches the pattern without the repetition. Actually
what I did is I entered the pattern and the test strings into kudos (a
python regexp debugger) and got the match groups, which didn't have 11
and 0 as matches, although they do match the pattern "(0|1(01*0)*1)".

Thank you for the help.
Leon wrote:
You are right. In fact the procedure is as follows:
The substr ``101101" is no problem, if stop here, match will
successful.
But the tailing `1' occurs, so we may imagine the working automata move
to a state, which according to the regexp's outer most `)', and ready
to repeat
the whole regexp again. In this case, the answer is ``yes" only when
there exists
at least two ``1", but only one here.

BTW, the first string is matched exactly, according to your notion, it
should be written as: _11_0_1011101


Apr 13 '06 #5
On 13/04/2006 12:33 PM, Kevin CH wrote:
Hi,

I'm currently running into a confusion on regex and hopefully you guys
can clear it up for me.

Suppose I have a regular expression (0|(1(01*0)*1))* and two test
strings: 110_1011101_ and _101101_1. (The underscores are not part of
the string. They are added to show that both string has a substring
that matches the pattern.) Applying a match() function on the first
string returns true while false for the second.
Perhaps you are using grep, or you have stumbled on the old deprecated
"regex" module and are using that instead of the "re" module. Perhaps
not as you are using only 2 plain vanilla RE operations which should
work the same way everywhere. Perhaps you are having trouble with
search() versus match() -- if so, read the section on this topic in the
re docs. It's rather hard to tell what you are doing without seeing the
code you are using.
The difference is the
first one has unmatched chunk in the beginning
With re's match(), the whole string matches.
while the second at the
end.
With re's match(), the part you marked with underscores (at the
*beginning*) matches.

How's the regex rule work here?


Let's abbreviate your pattern as (0|X)*
This means 0 or more occurrences of strings that match either 0 or X.

Case 1 gives us 11 matching X [it's a 1 followed by zero occurrences of
(01*0) followed by a 1], then a 0, then 1011101 matching X [it's a 1
foll. by 1 occ. of (01110) followed by a 1].

Case 2 gives us 101101 matching X [it's a 1 foll. by 1 occ of (0110)
foll by a 1] -- then there's a 1 that doesn't match anything.

Here's some code and its output:

C:\junk>type kevinch.py
import re

rx = re.compile(r"(0|(1(01*0)*1))*")

def doit(n, s):
print "Case", n
m = rx.match(s)
if m:
print "0123456789"
print s
for k in range(4):
print "span(%d) -> %r" % (k, m.span(k))
else:
print "... no match"

s1 = "110_1011101_".replace('_', '')
s2 = "_101101_1".replace('_', '')
doit(1, s1)
doit(2, s2)

C:\junk>kevinch.py
Case 1
0123456789
1101011101
span(0) -> (0, 10)
span(1) -> (3, 10)
span(2) -> (3, 10)
span(3) -> (4, 9)
Case 2
0123456789
1011011
span(0) -> (0, 6)
span(1) -> (0, 6)
span(2) -> (0, 6)
span(3) -> (1, 5)

HTH,
John
Apr 13 '06 #6
Thank you for your reply.
Perhaps you are using grep, or you have stumbled on the old deprecated
"regex" module and are using that instead of the "re" module. Perhaps
not as you are using only 2 plain vanilla RE operations which should
work the same way everywhere. Perhaps you are having trouble with
search() versus match() -- if so, read the section on this topic in the
re docs. It's rather hard to tell what you are doing without seeing the
code you are using.


Sorry I should have said it up front. I'm using Kudos (which I'm sure
uses re module) to test these strings on the pattern, and had the match
results as I stated. (search() of course gives me true since the
pattern appears in the substrings of both strings.)

Apr 13 '06 #7
"Kevin CH" wrote:

news:11**********************@v46g2000cwv.googlegr oups.com...
Thank you for your reply.
Perhaps you are using grep, or you have stumbled on the old deprecated
"regex" module and are using that instead of the "re" module. Perhaps
not as you are using only 2 plain vanilla RE operations which should
work the same way everywhere. Perhaps you are having trouble with
search() versus match() -- if so, read the section on this topic in the
re docs. It's rather hard to tell what you are doing without seeing the
code you are using.


Sorry I should have said it up front. I'm using Kudos (which I'm sure
uses re module) to test these strings on the pattern, and had the match
results as I stated. (search() of course gives me true since the
pattern appears in the substrings of both strings.)


Python's "match" function doesn't return "true" or "false"; it returns a match
object if the string matches the pattern, and None if not. since your pattern
can match the empty string, it'll match any target string (all strings start with
an empty string), and will never return false.

looks like the "debugger" does a great job of hiding how things really work...

</F>

Apr 13 '06 #8

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

7 posts views Thread by Patient Guy | last post: by
2 posts views Thread by Jonas | last post: by
8 posts views Thread by Natalia DeBow | last post: by
25 posts views Thread by Mike | last post: by
12 posts views Thread by =?Utf-8?B?SlA=?= | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.