Say I have some string that begins with an arbitrary sequence of characters
and then alternates repeating the letters 'a' and 'b' any number of times,
e.g.
"xyz123aaabbaabbbbababbbbaaabb"
I'm looking for a regular expression that matches the first, and only the
first, sequence of the letter 'a', and only if the length of the sequence is
exactly 3.
Does such a regular expression exist? If so, any ideas as to what it could
be?
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
Hello Roger,
> I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
import sys, re, os
if __name__=='__main__':
    m = re.search('a{3}', 'xyz123aaabbaaabbbbababbbbaabb')
    print m.group(0)
    print "Preceded by: \"" + m.string[0:m.start(0)] + "\""
Best wishes,
Christoph
Roger L. Cauvin enlightened us with: I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
Your request is ambiguous:
1) You're looking for the first, and only the first, sequence of the
letter 'a'. If the length of this first, and only the first,
sequence of the letter 'a' is not 3, no match is made at all.
2) You're looking for the first, and only the first, sequence of
length 3 of the letter 'a'.
What is it?
Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
> Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g. "xyz123aaabbaabbbbababbbbaaabb"
I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
Does such a regular expression exist? If so, any ideas as to what it could be?
I'm not quite sure what your intent here is, as the
resulting find would obviously be "aaa", of length 3.
If you mean that you want to test against a number of
things, and only find items where "aaa" is the first "a" on
the line, you might try something like
import re
listOfStringsToTest = [
'helloworld',
'xyz123aaabbaabababbab',
'cantalopeaaabababa',
'baabbbaaabbbbb',
'xyzaa123aaabbabbabababaa']
r = re.compile("[^a]*(a{3})b+(a+b+)*")
matches = [s for s in listOfStringsToTest if r.match(s)]
print repr(matches)
If you just want the *first* triad of "aaa", you can change
the regexp to
r = re.compile(".*?(a{3})b+(a+b+)*")
With a little more detail as to the gist of the problem,
perhaps a better solution can be found. In particular, are
there items in the listOfStringsToTest that should be found
but aren't with either of the regexps?
-tkc
Tim Chase <py*********@tim.thechases.com> wrote:
... I'm not quite sure what your intent here is, as the resulting find would obviously be "aaa", of length 3.
But that would also match 'aaaa'; I think he wants negative lookbehind
and lookahead assertions around the 'aaa' part. But then there's the
spec about matching only if the sequence is the first occurrence of
'a's, so maybe he wants '^[^a]*' instead of the lookbehind (and maybe
parentheses around the 'aaa' to somehow 'match' it specially?).
It's definitely not very clear what exactly the intent is, no...
Alex
"Sybren Stuvel" <sy*******@YOURthirdtower.com.imagination> wrote in message
news:sl**********************@schuimige.unrealtowe r.org... Roger L. Cauvin enlightened us with: I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
Your request is ambiguous:
1) You're looking for the first, and only the first, sequence of the letter 'a'. If the length of this first, and only the first, sequence of the letter 'a' is not 3, no match is made at all.
2) You're looking for the first, and only the first, sequence of length 3 of the letter 'a'.
What is it?
The first option describes what I want, with the additional restriction that
the "first sequence of the letter 'a'" is defined as 1 or more consecutive
occurrences of the letter 'a', followed directly by the letter 'b'.
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
"Christoph Conrad" <no****@spamgourmet.com> wrote in message
news:up***********@ID-24456.user.uni-berlin.de... Hello Roger,
I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
import sys, re, os
if __name__=='__main__':
    m = re.search('a{3}', 'xyz123aaabbaaabbbbababbbbaabb')
    print m.group(0)
    print "Preceded by: \"" + m.string[0:m.start(0)] + "\""
The correct pattern should reject the string:
'xyz123aabbaaab'
since the length of the first sequence of the letter 'a' is 2. Yours
accepts it, right?
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
"Alex Martelli" <al***@mail.comcast.net> wrote in message
news:1h9reyq.z7u4ziv8itblN%al***@mail.comcast.net. .. Tim Chase <py*********@tim.thechases.com> wrote: ... I'm not quite sure what your intent here is, as the resulting find would obviously be "aaa", of length 3.
But that would also match 'aaaa'; I think he wants negative lookbehind and lookahead assertions around the 'aaa' part. But then there's the spec about matching only if the sequence is the first occurrence of 'a's, so maybe he wants '^[^a]*' instead of the lookbehind (and maybe parentheses around the 'aaa' to somehow 'match' it specially?).
It's definitely not very clear what exactly the intent is, no...
Sorry for the confusion. The correct pattern should reject all strings
except those in which the first sequence of the letter 'a' that is followed
by the letter 'b' has a length of exactly three.
Hope that's clearer . . . .
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that "Roger L. Cauvin"
<ro***@deadspam.com> might have written: Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g.
"xyz123aaabbaabbbbababbbbaaabb"
I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
Does such a regular expression exist? If so, any ideas as to what it could be?
Is this what you mean?
^[^a]*(a{3})(?:[^a].*)?$
This fits your description.
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
Hello Roger,
> since the length of the first sequence of the letter 'a' is 2. Yours accepts it, right?
Yes, I misunderstood your requirements. So it must be modified to
essentially what Tim Chase wrote:
m = re.search('^[^a]*a{3}b', 'xyz123aabbaaab')
Best wishes from Germany,
Christoph
> Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
Ah...a little more clear.
r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]
or (as you've only got 3 of 'em)
r = re.compile("[^a]*aaab+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]
should do the trick. To exposit:
[^a]* a bunch of stuff that's not "a"
a{3} or aaa three letter "a"s
b+ one or more "b"s
(a+b*) any number of "a"s followed optionally by "b"s
Hope this helps,
-tkc
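Tim's piece-by-piece breakdown maps naturally onto Python's re.VERBOSE flag, which lets the pattern carry its own commentary. The following is only a sketch in Python 3 syntax; the names `pattern` and `samples` are made up for illustration and do not come from the thread.

```python
import re

# Tim's pattern, spelled out with re.VERBOSE so each piece is annotated.
# Whitespace and '#' comments inside the pattern are ignored in this mode.
pattern = re.compile(r"""
    [^a]*      # a bunch of stuff that's not 'a'
    a{3}       # three 'a's
    b+         # one or more 'b's
    (a+b*)*    # any number of 'a' runs, each optionally followed by 'b's
""", re.VERBOSE)

# Illustrative inputs (assumed, not an exhaustive test suite).
samples = ["xyz123aaabbab", "helloworld", "xaaayz123abab"]
for s in samples:
    # match() anchors at the start of the string, as in Tim's list comprehension.
    print(s, bool(pattern.match(s)))
```

Because match() is used, the leading [^a]* has to cover everything before the three 'a's, so a stray 'a' earlier in the string makes the whole match fail.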
Tim Chase <py*********@tim.thechases.com> wrote: Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
Ah...a little more clear.
r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]
Unfortunately, the OP's spec is even more complex than this, if we are
to take to the letter what you just quoted; e.g.
aazaaab
SHOULD match, because the sequence 'aaz' (being 'a' NOT followed by the
letter 'b') should not invalidate the match that follows. I don't think
he means the strings contain only a's and b's.
Locating 'the first sequence of a followed by b' is easy, and reasonably
easy to check the sequence is exactly of length 3 (e.g. with a negative
lookbehind) -- but I don't know how to tell a RE to *stop* searching for
more if the check fails.
If a little more than just REs and matching was allowed, it would be
reasonably easy, but I don't know how to fashion a RE r such that
r.match(s) will succeed if and only if s meets those very precise and
complicated specs. That doesn't mean it just can't be done, just that I
can't do it so far. Perhaps the OP can tell us what constrains him to
use r.match ONLY, rather than a little bit of logic around it, so we can
see if we're trying to work in an artificially overconstrained domain?
Alex
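Alex's point can be tried out directly. The sketch below (Python 3 syntax; the name `exactly_three` is invented here) uses a negative lookbehind to insist that the run of 'a's is exactly three long. It handles 'aazaaab' and a four-letter run correctly, but it illustrates the difficulty he describes: when the first run fails the check, the search simply moves on to a later one.

```python
import re

# Negative lookbehind: the three 'a's must not be preceded by another 'a',
# so a run of four or more cannot satisfy the pattern.
exactly_three = re.compile(r"(?<!a)a{3}b")

for s in ["aazaaab", "xaaaab", "xyz123aabbaaab"]:
    print(s, bool(exactly_three.search(s)))

# The last string is the problem case: its first run 'aab' has length 2 and
# should cause a rejection, but search() skips past it and finds 'aaab' later.
```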
Christoph Conrad <no****@spamgourmet.com> wrote: Hello Roger,
since the length of the first sequence of the letter 'a' is 2. Yours accepts it, right?
Yes, I misunderstood your requirements. So it must be modified to essentially what Tim Chase wrote:
m = re.search('^[^a]*a{3}b', 'xyz123aabbaaab')
....but that rejects 'aazaaab' which should apparently be accepted.
Alex
Hello Alex,
>> r = re.compile("[^a]*a{3}b+(a+b*)*")
>> matches = [s for s in listOfStringsToTest if r.match(s)]
> Unfortunately, the OP's spec is even more complex than this, if we are to take to the letter what you just quoted; e.g. aazaaab SHOULD match,
Then it's again "a{3}b", isn't it?
Kind regards,
Christoph
"Tim Chase" <py*********@tim.thechases.com> wrote in message
news:ma***************************************@pyt hon.org... Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
Ah...a little more clear.
r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]
Wow, I like it, but it allows some strings it shouldn't. For example:
"xyz123aabbaaab"
(It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
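Whether the pattern "skips over" the two-letter run depends on how it is applied. A small sketch (Python 3 syntax) suggests that Roger's observation corresponds to search-style matching, whereas the r.match call in Tim's snippet only ever tries position 0:

```python
import re

r = re.compile("[^a]*a{3}b+(a+b*)*")
s = "xyz123aabbaaab"

# match() is anchored at the start of the string, so it cannot skip 'aab'.
print(r.match(s))                          # None

# search() retries at every starting position and finds a later match.
found = r.search(s)
print(found.group() if found else None)    # bbaaab
```

If the configuration-file infrastructure applies patterns with search semantics, an explicit ^ anchor is needed to get the match() behaviour.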
"Christos Georgiou" <tz**@sil-tec.gr> wrote in message
news:bo********************************@4ax.com... On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that "Roger L. Cauvin" <ro***@deadspam.com> might have written:
Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g.
"xyz123aaabbaabbbbababbbbaaabb"
I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
Does such a regular expression exist? If so, any ideas as to what it could be?
Is this what you mean?
^[^a]*(a{3})(?:[^a].*)?$
Close, but the pattern should allow "arbitrary sequence of characters" that
precede the alternating a's and b's to contain the letter 'a'. In other
words, the pattern should accept:
"xayz123aaabbab"
since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.
Your proposed pattern rejects this string.
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
>> r = re.compile("[^a]*a{3}b+(a+b*)*")
>> matches = [s for s in listOfStringsToTest if r.match(s)]
Wow, I like it, but it allows some strings it shouldn't. For example:
"xyz123aabbaaab"
(It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)
Anchoring it to the beginning/end might solve that:
r = re.compile("^[^a]*a{3}b+(a+b*)*$")
this ensures that no "a"s come before the first 3x"a" and nothing
but "b" and "a" follows it.
-tkc
(who's translating from vim regexps which are just diff. enough
to throw a wrench in works...)
Roger L. Cauvin wrote: Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
Hope that's clearer . . . .
Examples are a *really* good way to clarify ambiguous or complex
requirements. In fact, when made executable they're called "test cases"
:-), and supplying a few of those (showing input values and expected
output values) would help, not only to clarify your goals for the
humans, but also to let the proposed solutions easily be tested.
(After all, are you going to just trust that whatever you are handed
here is correctly implemented, and based on a perfect understanding of
your apparently unclear requirements?)
-Peter
"Tim Chase" <py*********@tim.thechases.com> wrote in message
news:ma***************************************@pyt hon.org... r = re.compile("[^a]*a{3}b+(a+b*)*") matches = [s for s in listOfStringsToTest if r.match(s)]
Wow, I like it, but it allows some strings it shouldn't. For example:
"xyz123aabbaaab"
(It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)
Anchoring it to the beginning/end might solve that:
r = re.compile("^[^a]*a{3}b+(a+b*)*$")
this ensures that no "a"s come before the first 3x"a" and nothing but "b" and "a" follows it.
Anchoring may be the key here, but this pattern rejects
"xayz123aaabab"
which it should accept, since the 'a' between the 'x' and the 'y' is not
directly followed by the letter 'b'.
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
"Peter Hansen" <pe***@engcorp.com> wrote in message
news:ma***************************************@pyt hon.org... Roger L. Cauvin wrote: Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
Hope that's clearer . . . .
Examples are a *really* good way to clarify ambiguous or complex requirements. In fact, when made executable they're called "test cases" :-), and supplying a few of those (showing input values and expected output values) would help, not only to clarify your goals for the humans, but also to let the proposed solutions easily be tested.
Good suggestion. Here are some "test cases":
"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
On Thu, 26 Jan 2006 16:26:57 GMT, rumours say that "Roger L. Cauvin"
<ro***@deadspam.com> might have written: "Christos Georgiou" <tz**@sil-tec.gr> wrote in message news:bo********************************@4ax.com.. . On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that "Roger L. Cauvin" <ro***@deadspam.com> might have written:
Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g.
"xyz123aaabbaabbbbababbbbaaabb"
I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
Does such a regular expression exist? If so, any ideas as to what it could be?
Is this what you mean?
^[^a]*(a{3})(?:[^a].*)?$
Close, but the pattern should allow "arbitrary sequence of characters" that precede the alternating a's and b's to contain the letter 'a'. In other words, the pattern should accept:
"xayz123aaabbab"
since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.
Your proposed pattern rejects this string.
1.
(a{3})(?:b[ab]*)?$
This finds the first (leftmost) "aaa" either at the end of the string or
followed by 'b' and then arbitrary sequences of 'a' and 'b'.
This will also match "aaaa" (from second position on).
2.
If you insist on only three 'a's and you can add the constraint that:
* let s be the "arbitrary sequence of characters" at the start of your
searched text
* len(s) >= 1 and not s.endswith('a')
then you'll have this reg.ex.
(?<=[^a])(a{3})(?:b[ab]*)?$
3.
If you want to allow for a possible empty "arbitrary sequence of characters"
at the start and you don't mind search speed
^(?:.*?[^a])?(a{3})(?:b[ab]*)?$
This should cover you:
>>> s="xayzbaaa123aaabbab"
>>> r=re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")
>>> m= r.match(s)
>>> m.group(1)
'aaa'
>>> m.start(1)
11
>>> s[11:]
'aaabbab'
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
"Alex Martelli" <al***@mail.comcast.net> wrote in message
news:1h9rhxf.vlrqwp1l99n66N%al***@mail.comcast.net ... Tim Chase <py*********@tim.thechases.com> wrote:
> Sorry for the confusion. The correct pattern should reject > all strings except those in which the first sequence of the > letter 'a' that is followed by the letter 'b' has a length of > exactly three.
....
.... If a little more than just REs and matching was allowed, it would be reasonably easy, but I don't know how to fashion a RE r such that r.match(s) will succeed if and only if s meets those very precise and complicated specs. That doesn't mean it just can't be done, just that I can't do it so far. Perhaps the OP can tell us what constrains him to use r.match ONLY, rather than a little bit of logic around it, so we can see if we're trying to work in an artificially overconstrained domain?
Alex, you seem to grasp exactly what the requirements are in this case. I
of course don't *have* to use regular expressions only, but I'm working with
an infrastructure that uses regexps in configuration files so that the code
doesn't have to change to add or change patterns. Before throwing up my
hands and re-architecting, I wanted to see if regexps would handle the job
(they have in every case but one).
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
On Thu, 26 Jan 2006 16:41:08 GMT, rumours say that "Roger L. Cauvin"
<ro***@deadspam.com> might have written: Good suggestion. Here are some "test cases":
"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
Applying my last regex to your test cases:
>>> r.match("xyz123aaabbab")
<_sre.SRE_Match object at 0x00B47F60>
>>> r.match("xyz123aabbaab")
>>> r.match("xayz123aaabab")
<_sre.SRE_Match object at 0x00B50020>
>>> r.match("xaaayz123abab")
>>> r.match("xaaayz123aaabab")
<_sre.SRE_Match object at 0x00B47F60>
>>> print r.pattern
^(?:.*?[^a])?(a{3})(?:b[ab]*)?$
You should also remember to check the (match_object).start(1) to verify that
it matches the "aaa" you want.
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
On Thu, 26 Jan 2006 16:26:57 GMT in comp.lang.python, "Roger L.
Cauvin" <ro***@deadspam.com> wrote: "Christos Georgiou" <tz**@sil-tec.gr> wrote in message news:bo********************************@4ax.com.. .
[...] Is this what you mean?
^[^a]*(a{3})(?:[^a].*)?$
Close, but the pattern should allow "arbitrary sequence of characters" that precede the alternating a's and b's to contain the letter 'a'. In other words, the pattern should accept:
"xayz123aaabbab"
since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.
I don't know that an RE is the best solution to this problem. If I
understand the problem correctly, building a state machine to solve
this is trivial. The following took about 5 minutes of coding:
---begin included file
# Define our states.
# state[0] is next state if character is 'a'
# state[1] is next state if character is 'b'
# state[2] is next state for any other character
# Accept state means we've found a match
Accept = []
for i in range(3):
    Accept.append(Accept)
# Reject state means the string cannot match
Reject = []
for i in range(3):
    Reject.append(Reject)
# Remaining states: Start, 'a' found, 'aa', 'aaa', and 'aaaa'
Start = [0,1,2]
a1 = [0,1,2]
a2 = [0,1,2]
a3 = [0,1,2]
a4 = [0,1,2]
# Start: looking for first 'a'
Start[0] = a1
Start[1] = Start
Start[2] = Start
# a1: 1 'a' found so far
a1[0] = a2
a1[1] = Reject
a1[2] = Start
# a2: 'aa' found
a2[0] = a3
a2[1] = Reject
a2[2] = Start
# a3: 'aaa' found
a3[0] = a4
a3[1] = Accept
a3[2] = Start
# a4: four or more 'a' in a row
a4[0] = a4
a4[1] = Reject
a4[2] = Start
def detect(s):
    """
    Return 1 if first substring aa*b has exactly 3 a's
    Return 0 otherwise
    """
    state = Start
    for c in s:
        if c == 'a':
            state = state[0]
        elif c == 'b':
            state = state[1]
        else:
            state = state[2]
    if state is Accept:
        return 1
    return 0
print detect("xyza123abc")
print detect("xyzaaa123aabc")
print detect("xyzaa123aaabc")
print detect("xyza123aaaabc")
--- end included file ---
And I'm pretty sure it does what you need, though it's pretty naive.
Note that if '3' isn't a magic number, states a1, a2, a3, and a4 could
be re-implemented as a single state with a counter, but the logic
inside detect gets a little hairier.
I haven't timed it, but it's not doing anything other than simple
comparisons and assignments. It's a little (OK, a lot) more code than
a simple RE, but I know it works.
HTH,
-=Dave
--
Change is inevitable, progress is not.
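Dave's remark about folding states a1 through a4 into a single state with a counter could look roughly like this. It is a hypothetical rewrite in Python 3 syntax, not Dave's code; the function name `detect_with_counter` and the parameter `n` are invented for the sketch.

```python
def detect_with_counter(s, n=3):
    """Return True if the first run of 'a's directly followed by 'b'
    has exactly n letters, False otherwise (including when no such run exists)."""
    run = 0  # length of the current run of consecutive 'a's
    for c in s:
        if c == 'a':
            run += 1
        elif c == 'b' and run > 0:
            # First run of 'a's followed by 'b': the decision is final,
            # mirroring the absorbing Accept/Reject states above.
            return run == n
        else:
            # Any other character (or a 'b' with no preceding 'a') resets the count.
            run = 0
    return False


for s in ["xyza123abc", "xyzaaa123aabc", "xyzaa123aaabc", "xyza123aaaabc"]:
    print(s, detect_with_counter(s))
```

The early return plays the role of the Accept and Reject states: once the first 'a' run followed by 'b' has been seen, nothing later in the string can change the answer.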
Roger L. Cauvin wrote: Good suggestion. Here are some "test cases":
"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
$ more test.py
import re
print "got expected"
print "------ --------"
testsuite = (
("xyz123aaabbab", "accept"),
("xyz123aabbaab", "reject"),
("xayz123aaabab", "accept"),
("xaaayz123abab", "reject"),
("xaaayz123aaabab", "accept"),
)
for string, result in testsuite:
    m = re.search("aaab", string)
    if m:
        print "accept",
    else:
        print "reject",
    print result
$ python test.py
got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept
</F>
"Fredrik Lundh" <fr*****@pythonware.com> wrote in message
news:ma***************************************@pyt hon.org... Roger L. Cauvin wrote:
Good suggestion. Here are some "test cases":
"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
$ more test.py
import re
print "got expected"
print "------ --------"
testsuite = (
    ("xyz123aaabbab", "accept"),
    ("xyz123aabbaab", "reject"),
    ("xayz123aaabab", "accept"),
    ("xaaayz123abab", "reject"),
    ("xaaayz123aaabab", "accept"),
)
for string, result in testsuite:
    m = re.search("aaab", string)
    if m:
        print "accept",
    else:
        print "reject",
    print result
$ python test.py
got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept
Thanks, but the second test case I listed contained a typo. It should have
contained a sequence of three of the letter 'a'. The test cases should be:
"xyz123aaabbab" accept
"xyz123aabbaaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
Your pattern fails the second test.
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
On Thu, 26 Jan 2006 18:01:07 +0100, rumours say that "Fredrik Lundh"
<fr*****@pythonware.com> might have written: Roger L. Cauvin wrote:
Good suggestion. Here are some "test cases":
"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
$ more test.py
[snip of code]
m = re.search("aaab", string)
[snip of more code]
$ python test.py
got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept
You're right, Fredrik, but we (graciously as a group :) take also notice of
the other requirements that the OP has provided elsewhere and that are not
covered by the simple test that he specified.
The code above works for "aaaab" too, which the OP has already ruled out,
and it doesn't work for "aaa".
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
"Christos Georgiou" <tz**@sil-tec.gr> wrote in message
news:ie********************************@4ax.com... On Thu, 26 Jan 2006 16:41:08 GMT, rumours say that "Roger L. Cauvin" <ro***@deadspam.com> might have written:
Good suggestion. Here are some "test cases":
"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
Applying my last regex to your test cases:
>>> r.match("xyz123aaabbab")
<_sre.SRE_Match object at 0x00B47F60>
>>> r.match("xyz123aabbaab")
>>> r.match("xayz123aaabab")
<_sre.SRE_Match object at 0x00B50020>
>>> r.match("xaaayz123abab")
>>> r.match("xaaayz123aaabab")
<_sre.SRE_Match object at 0x00B47F60>
>>> print r.pattern
^(?:.*?[^a])?(a{3})(?:b[ab]*)?$
You should also remember to check the (match_object).start(1) to verify that it matches the "aaa" you want.
Thanks, but the second test case I listed contained a typo. It should have
contained a sequence of three of the letter 'a'. The test cases should be:
"xyz123aaabbab" accept
"xyz123aabbaaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
Your pattern fails the second test.
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
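Roger's verdict can be reproduced by running Christos's third pattern over the corrected test cases. This is a sketch in Python 3 syntax; the `cases` and `results` names are invented here.

```python
import re

# Christos's pattern, copied from his post.
r = re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")

# Corrected test cases from Roger's post: (string, should the pattern match?)
cases = [
    ("xyz123aaabbab", True),
    ("xyz123aabbaaab", False),
    ("xayz123aaabab", True),
    ("xaaayz123abab", False),
    ("xaaayz123aaabab", True),
]

results = {}
for s, expected in cases:
    got = r.match(s) is not None
    results[s] = got
    print(s, got, expected, "ok" if got == expected else "MISMATCH")
```

Only the second string comes out wrong: the lazy `.*?` is free to swallow 'xyz123aabb', including the two-letter run, before the three 'a's are matched.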
"Christos Georgiou" <tz**@sil-tec.gr> wrote in message
news:t9********************************@4ax.com... On Thu, 26 Jan 2006 16:26:57 GMT, rumours say that "Roger L. Cauvin" <ro***@deadspam.com> might have written:
"Christos Georgiou" <tz**@sil-tec.gr> wrote in message news:bo********************************@4ax.com. ..
On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that "Roger L. Cauvin" <ro***@deadspam.com> might have written:Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g.
"xyz123aaabbaabbbbababbbbaaabb"
I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
Does such a regular expression exist? If so, any ideas as to what it could be?
Is this what you mean?
^[^a]*(a{3})(?:[^a].*)?$
Close, but the pattern should allow "arbitrary sequence of characters" that precede the alternating a's and b's to contain the letter 'a'. In other words, the pattern should accept:
"xayz123aaabbab"
since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.
Your proposed pattern rejects this string.
1.
(a{3})(?:b[ab]*)?$
This finds the first (leftmost) "aaa" either at the end of the string or followed by 'b' and then arbitrary sequences of 'a' and 'b'.
This will also match "aaaa" (from second position on).
2.
If you insist in only three 'a's and you can add the constraint that:
* let s be the "arbitrary sequence of characters" at the start of your searched text * len(s) >= 1 and not s.endswith('a')
then you'll have this reg.ex.
(?<=[^a])(a{3})(?:b[ab]*)?$
3.
If you want to allow for a possible empty "arbitrary sequence of characters" at the start and you don't mind search speed
^(?:.*?[^a])?(a{3})(?:b[ab]*)?$
This should cover you:
>>> s="xayzbaaa123aaabbab"
>>> r=re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")
>>> m= r.match(s)
>>> m.group(1)
'aaa'
>>> m.start(1)
11
>>> s[11:]
'aaabbab'
Thanks for continuing to follow up, Christos. Please see my reply to your
other post (in which you applied the test cases).
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
On Thu, 26 Jan 2006 17:09:18 GMT, rumours say that "Roger L. Cauvin"
<ro***@deadspam.com> might have written: Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:
"xyz123aaabbab" accept "xyz123aabbaaab" reject
Here I object to either you or your need for a regular expression. You see,
before the "aaa" in your second test case, you have an "arbitrary sequence
of characters", so your requirements are met.
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
"Christos Georgiou" <tz**@sil-tec.gr> wrote in message
news:cf********************************@4ax.com... On Thu, 26 Jan 2006 18:01:07 +0100, rumours say that "Fredrik Lundh" <fr*****@pythonware.com> might have written:
Roger L. Cauvin wrote:
Good suggestion. Here are some "test cases":
"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
$ more test.py
[snip of code]
m = re.search("aaab", string)
[snip of more code]
$ python test.py
got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept
You're right, Fredrik, but we (graciously as a group :) take also notice of the other requirements that the OP has provided elsewhere and that are not covered by the simple test that he specified.
My fault, guys. The second test case should be
"xyz123aabbaaab" reject
instead of
"xyz123aabbaab" reject
Fredrik's pattern fails this test case.
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
Roger L. Cauvin wrote:
> $ python test.py
> got expected
> ------ --------
> accept accept
> reject reject
> accept accept
> reject reject
> accept accept
Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:
"xyz123aaabbab" accept
"xyz123aabbaaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
Your pattern fails the second test.
$ more test.py
import re
print "got expected"
print "------ --------"
testsuite = (
("xyz123aaabbab", "accept"),
("xyz123aabbaaab", "reject"),
("xayz123aaabab", "accept"),
("xaaayz123abab", "reject"),
("xaaayz123aaabab", "accept"),
)
for string, result in testsuite:
    m = re.search("a+b", string)
    if m and len(m.group()) == 4:
        print "accept",
    else:
        print "reject",
    print result
$ python test.py
got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept
</F>
"Christos Georgiou" <tz**@sil-tec.gr> wrote in message
news:9m********************************@4ax.com... On Thu, 26 Jan 2006 17:09:18 GMT, rumours say that "Roger L. Cauvin" <ro***@deadspam.com> might have written:
Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:
"xyz123aaabbab" accept "xyz123aabbaaab" reject
Here I object to either you or your need for a regular expression. You see, before the "aaa" in your second test case, you have an "arbitrary sequence of characters", so your requirements are met.
Well, thank you for your efforts so far, Christos.
My purpose is to determine whether it's possible to do this using regular
expressions, since my application is already architected around
configuration files that use regular expressions. It may not be the best
architecture, but I still don't know the answer to my question. Is it
*possible* to fulfill my requirements with regular expressions, even if it's
not the best way to do it?
The requirements are not met by your regular expression, since by definition
the "arbitrary sequence of characters" stops once the sequences of a's and
b's starts.
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
"Fredrik Lundh" <fr*****@pythonware.com> wrote in message
news:ma***************************************@python.org...
> for string, result in testsuite:
>     m = re.search("a+b", string)
>     if m and len(m.group()) == 4:
>         print "accept",
>     else:
>         print "reject",
>     print result
Thanks, but I'm looking for a solution in terms of a regular expression
only. In other words, "accept" means the regular expression matched, and
"reject" means the regular expression did not match. I want to see if I can
fulfill the requirements without additional code (such as checking
"len(m.group())").
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
The below seems to pass all the tests you threw at it (taking the
modified 2nd test into consideration)
One other test that occurs to me would be
"xyz123aaabbaaabab"
where you have "aaab" in there twice.
-tkc
import re

tests = [
    ("xyz123aaabbab", True),
    ("xyz123aabbaaab", False),
    ("xayz123aaabab", True),
    ("xaaayz123abab", False),
    ("xaaayz123aaabab", True)
]

exp = '^([^b]|((?<!a)b))*aaab+[ab]*$'
r = re.compile(exp)
print "Using regexp: %s" % exp
for test, expectedResult in tests:
    if r.match(test):
        result = True
    else:
        result = False
    if result == expectedResult:
        print "%s passed" % test
    else:
        print "%s failed (expected %s, got %s)" % (
            test, expectedResult, result)
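As a quick check of the extra case suggested above (the string with "aaab" in it twice), here is a minimal Python 3 sketch that runs the posted expression against it. The variable names are illustrative only.

```python
import re

# Expression quoted from the post above.
pattern = re.compile(r'^([^b]|((?<!a)b))*aaab+[ab]*$')

# Suggested extra test: "aaab" occurs twice, and the first run of a's has length 3.
extra = "xyz123aaabbaaabab"
accepted = pattern.match(extra) is not None
print("accept" if accepted else "reject")
```

The expression accepts this string, which is the outcome the poster assumed to be valid.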
--
--
Roger L. Cauvin wrote:
> Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:
> "xyz123aaabbab" accept
> "xyz123aabbaaab" reject
> "xayz123aaabab" accept
> "xaaayz123abab" reject
> "xaaayz123aaabab" accept
This passes your tests. I haven't closely followed the thread for other
requirements:
>>> pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b
>>> test(pattern)
got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept
>>> import re
>>> testsuite = (
... ("xyz123aaabbab", "accept"),
... ("xyz123aabbaaab", "reject"),
... ("xayz123aaabab", "accept"),
... ("xaaayz123abab", "reject"),
... ("xaaayz123aaabab", "accept"),
... )
>>> def test(pattern):
...     print "got expected"
...     print "------ --------"
...     for string, result in testsuite:
...         m = re.match(pattern, string)
...         if m:
...             print "accept",
...         else:
...             print "reject",
...         print result
...
Michael
"Michael Spencer" <ma**@telcopartners.com> wrote in message
news:ma***************************************@python.org...
> Roger L. Cauvin wrote:
>> "xyz123aaabbab" accept
>> "xyz123aabbaaab" reject
>> "xayz123aaabab" accept
>> "xaaayz123abab" reject
>> "xaaayz123aaabab" accept
> This passes your tests. I haven't closely followed the thread for other requirements:
> >>> pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b
Very interesting. I think you may have solved the problem. The key seems
to be the "not preceded by" part. I'm unfamiliar with some of the notation.
Can you explain what "[a+b]" and the "(?<!" do?
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
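On the notation asked about: in Python's re syntax a "+" inside square brackets is a literal character, so "[a+b]" is a character class matching a single 'a', '+', or 'b' (it is not "one or more a's then b"). "(?<!...)" is a negative lookbehind: it succeeds only where the text just before the current position does not match "...". The following Python 3 sketch illustrates both; the sample strings are made up for illustration.

```python
import re

# Inside [...] the '+' is literal: the class matches exactly one of 'a', '+', 'b'.
char_class = re.compile(r"[a+b]")
single_chars = [c for c in "a+bxy" if char_class.fullmatch(c)]
print(single_chars)

# Pattern quoted from the post above: "aaab" not immediately preceded by 'a', '+' or 'b'.
pattern = re.compile(r".*?(?<![a+b])aaab")
results = {text: pattern.match(text) is not None
           for text in ("x3aaab", "baaab", "aaaab", "+aaab")}
for text, accepted in results.items():
    print(text, "accept" if accepted else "reject")
```

Only "x3aaab" is accepted here, because in the other three strings the "aaab" is directly preceded by one of the characters in the class.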
"Tim Chase" <py*********@tim.thechases.com> wrote in message
news:ma***************************************@python.org...
> The below seems to pass all the tests you threw at it (taking the modified 2nd test into consideration)
> One other test that occurs to me would be
> "xyz123aaabbaaabab"
> where you have "aaab" in there twice.
Good suggestion.
> ^([^b]|((?<!a)b))*aaab+[ab]*$
Looks good, although I've been unable to find a good explanation of the
"negative lookbehind" construct "(?<!". How does it work?
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
>> "xyz123aaabbaaabab" where you have "aaab" in there twice.
Good suggestion.
I assumed that this would be a valid case. If not, the
expression would need tweaking.
>> ^([^b]|((?<!a)b))*aaab+[ab]*$
> Looks good, although I've been unable to find a good explanation of the "negative lookbehind" construct "(?<!". How does it work?
The beginning part of the expression
    ([^b]|((?<!a)b))*
breaks down as
    [^b]      anything that isn't a "b"
    |         or
    (...)     this other thing
where "this other thing" is
    (?<!a)b   a "b" as long as it isn't immediately
              preceded by an "a"
The "(?<!...)" construct means that the "..." portion can't come
immediately before the following token in the regexp...in this case, before a
"b".
There's also a "negative lookahead" (rather than "lookbehind")
which prevents items from following. This should be usable in
this scenario as well and works with the aforementioned tests, using
    "^([^a]|(a(?!b)))*aaab+[ab]*$"
which would be "anything that's not an 'a'; or an 'a' as long as
it's not followed by a 'b'"
The gospel is at: http://docs.python.org/lib/re-syntax.html
but is a bit terse. O'Reilly has a fairly good book on regexps if
you want to dig a bit deeper.
-tkc
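The two constructs described above can be tried in isolation. This is a minimal Python 3 sketch on made-up strings, showing where a negative lookbehind and a negative lookahead do and do not match.

```python
import re

# Negative lookbehind: a 'b' that is NOT immediately preceded by an 'a'.
not_after_a = re.compile(r"(?<!a)b")
# Negative lookahead: an 'a' that is NOT immediately followed by a 'b'.
not_before_b = re.compile(r"a(?!b)")

lookbehind_hits = [m.start() for m in not_after_a.finditer("abcbb")]
lookahead_hits = [m.start() for m in not_before_b.finditer("abacaa")]

print(lookbehind_hits)  # [3, 4]: the 'b' at index 1 follows an 'a' and is skipped
print(lookahead_hits)   # [2, 4, 5]: the 'a' at index 0 is followed by 'b' and is skipped
```

Neither construct consumes the neighbouring character; each only tests it, which is why the reported positions are those of the 'b' or 'a' itself.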
Roger L. Cauvin wrote:
> "Michael Spencer" <ma**@telcopartners.com> wrote in message
> news:ma***************************************@python.org...
>> This passes your tests. I haven't closely followed the thread for other requirements:
>> >>> pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b
> Very interesting. I think you may have solved the problem. The key seems to be the "not preceded by" part. I'm unfamiliar with some of the notation. Can you explain what "[a+b]" and the "(?<!" do?
I think you might need to add a test case involving a pattern of aaaab
prior to another aaab. From what I gather (not reading too closely),
you would want this to be rejected. Is that true?
xyz123aaaababaaabab
-Peter
"Peter Hansen" <pe***@engcorp.com> wrote in message
news:ma***************************************@python.org...
> I think you might need to add a test case involving a pattern of aaaab prior to another aaab. From what I gather (not reading too closely), you would want this to be rejected. Is that true?
> xyz123aaaababaaabab
Adding that test would be a good idea. You're right; I would want that
string to be rejected, since in that string the first sequence of 'a'
directly preceding a 'b' is of length 4 instead of 3.
Thanks for the solution!
--
Roger L. Cauvin no**********@cauvin.org (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research http://www.cauvin-inc.com
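With that extra test agreed on, the expressions posted so far can be run against it. The sketch below is Python 3 and assumes each expression is applied with re.match, as in the posts that introduced them; the dictionary keys are just labels chosen here.

```python
import re

# First run of a's that directly precedes a 'b' has length 4, so this should be rejected.
extra_case = "xyz123aaaababaaabab"

candidates = {
    "lookbehind": r"^([^b]|((?<!a)b))*aaab+[ab]*$",
    "lookahead": r"^([^a]|(a(?!b)))*aaab+[ab]*$",
    "class lookbehind": r".*?(?<![a+b])aaab",
}

results = {name: re.match(expr, extra_case) is not None
           for name, expr in candidates.items()}
for name, accepted in results.items():
    print(name, "accept" if accepted else "reject")
```

The first two expressions accept the string, because their leading group can swallow the first 'a' of "aaaab" and leave "aaab" behind; only the third rejects it. This checks a single string and says nothing about other inputs.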
Christoph Conrad <no****@spamgourmet.com> wrote: Hallo Alex,
r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]
Unfortunately, the OP's spec is even more complex than this, if we are to take to the letter what you just quoted; e.g. aazaaab SHOULD match,
Then it's again "a{3}b", isn't it?
Except that this one would also match aazaaaaab, which it shouldn't.
Alex
How about :
pattern = re.compile('^([^a]|(a+[^ab]))*aaab')
Which basically says, "precede with arbitrarily many non-a's
or a sequences ending in non-b, then must have 3 as followed by a b."
cases = ["xyz123aaabbab", "xayz123aaabab", "xaaayz123aaabab",
"xyz123aaaababaaabab", "xyz123aabbaaab", "xaaayz123abab"]
[re.search(pattern, case) is not None for case in cases]
[True, True, True, False, False, False]
--Scott David Daniels sc***********@acm.org
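For anyone rerunning this on Python 3, here is a minimal sketch of the same check, extended with the two strings discussed just above ("aazaaab", which should match, and "aazaaaaab", which should not). The variable names are illustrative.

```python
import re

# Expression quoted from the post above.
pattern = re.compile(r'^([^a]|(a+[^ab]))*aaab')

cases = ["xyz123aaabbab", "xayz123aaabab", "xaaayz123aaabab",
         "xyz123aaaababaaabab", "xyz123aabbaaab", "xaaayz123abab"]
outcomes = [pattern.search(case) is not None for case in cases]
print(outcomes)

# The two strings from the earlier exchange.
extra = {text: pattern.search(text) is not None
         for text in ("aazaaab", "aazaaaaab")}
print(extra)
```

The leading group can only skip a run of a's when that run ends in something other than 'a' or 'b', so the first run that ends in a 'b' must be exactly "aaa" for the match to succeed.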
Alex Martelli wrote:
> Christoph Conrad <no****@spamgourmet.com> wrote:
>> Hello Roger,
>>> since the length of the first sequence of the letter 'a' is 2. Yours accepts it, right?
>> Yes, I misunderstood your requirements. So it must be modified essentially to what Tim Chase wrote:
>> m = re.search('^[^a]*a{3}b', 'xyz123aabbaaab')
> ...but that rejects 'aazaaab' which should apparently be accepted.
... and that is OK. That was the request:
> I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
--Armin
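The disagreement here is about which reading of the request applies. As a neutral illustration, this Python 3 sketch only shows what the expression '^[^a]*a{3}b' does on a few strings from the thread; it does not settle which behaviour is the intended one.

```python
import re

# Expression quoted from the post above: the very first 'a' in the string
# must begin a run of exactly three a's followed by a 'b'.
pattern = re.compile(r'^[^a]*a{3}b')

results = {text: pattern.search(text) is not None
           for text in ("xyz123aaabbab", "xyz123aabbaaab",
                        "aazaaab", "xayz123aaabab")}
for text, accepted in results.items():
    print(text, "accept" if accepted else "reject")
```

Note that it rejects "xayz123aaabab", which appears as an "accept" case in the test list given earlier in the thread, since the first 'a' there is a lone 'a' followed by 'y'.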