Regex anomaly


Hello,

Has anyone had issues with compiled re's vis-a-vis the re.I (ignore
case) flag? I can't make sense of this compiled re producing a
different match when given the flag, odd both in its difference from
the uncompiled regex (as I thought the uncompiled API was a wrapper
around a compile-and-execute block) and its difference from the
compiled version with no flag specified. The match given is utter
nonsense given the input re.

In [48]: import re
In [49]: reStr = r"([a-z]+)://"
In [51]: against = "http://www.hello.com"
In [53]: re.match(reStr, against).groups()
Out[53]: ('http',)
In [54]: re.match(reStr, against, re.I).groups()
Out[54]: ('http',)
In [55]: reCompiled = re.compile(reStr)
In [56]: reCompiled.match(against).groups()
Out[56]: ('http',)
In [57]: reCompiled.match(against, re.I).groups()
Out[57]: ('tp',)

cheers,
-Mike

Jan 3 '06 #1
<mi********@gmail.com> wrote:

> Hello,
>
> Has anyone had issues with compiled re's vis-a-vis the re.I (ignore
> case) flag? I can't make sense of this compiled re producing a
> different match when given the flag, odd both in its difference from
> the uncompiled regex (as I thought the uncompiled API was a wrapper
> around a compile-and-execute block) and its difference from the
> compiled version with no flag specified. The match given is utter
> nonsense given the input re.
>
> In [48]: import re
> In [49]: reStr = r"([a-z]+)://"
> In [51]: against = "http://www.hello.com"
> In [53]: re.match(reStr, against).groups()
> Out[53]: ('http',)
> In [54]: re.match(reStr, against, re.I).groups()
> Out[54]: ('http',)
> In [55]: reCompiled = re.compile(reStr)
> In [56]: reCompiled.match(against).groups()
> Out[56]: ('http',)
> In [57]: reCompiled.match(against, re.I).groups()
> Out[57]: ('tp',)

LOL, and you'll be LOL too when you see the problem :-)

You can't give the re.I flag to reCompiled.match(). You have to give
it to re.compile(). The second argument to reCompiled.match() is the
position where to start searching. I'm guessing re.I is defined as 2,
which explains the match you got.
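
The guess above can be checked directly. This is a minimal sketch reusing the names from the original post; it assumes CPython, where re.I has the integer value 2, so the call behaves exactly like passing pos=2.

```python
import re

reStr = r"([a-z]+)://"
against = "http://www.hello.com"
reCompiled = re.compile(reStr)

# re.I is an integer-valued flag; in CPython its value is 2.
flag_value = int(re.I)

# The second positional argument of a compiled pattern's match() is pos,
# not flags, so matching starts at index 2 of the string ("tp://...").
via_flag = reCompiled.match(against, re.I).groups()
via_pos = reCompiled.match(against, 2).groups()

print(flag_value, via_flag, via_pos)
```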

This is actually one of those places where duck typing let us down.
If we had type bondage, re.I would be an instance of RegExFlags, and
reCompiled.match() would have thrown a TypeError when the second
argument wasn't an integer. I'm not saying type bondage is inherently
better than duck typing, just that it has its benefits at times.
Jan 3 '06 #2
On 2 Jan 2006 21:00:53 -0800, mi********@gmail.com <mi********@gmail.com> wrote:

> Has anyone had issues with compiled re's vis-a-vis the re.I (ignore
> case) flag? I can't make sense of this compiled re producing a
> different match when given the flag, odd both in its difference from
> the uncompiled regex (as I thought the uncompiled API was a wrapper
> around a compile-and-execute block) and its difference from the
> compiled version with no flag specified. The match given is utter
> nonsense given the input re.

The re.compile and re.match functions take a flags parameter:

compile( pattern[, flags])
match( pattern, string[, flags])

But the match method of a compiled regular expression object takes different parameters:

match( string[, pos[, endpos]])

It's more than a little confusing that the parameters to re.match() and
re.compile().match() are so different, but that's the cause of what
you're seeing.

You need to do:

reCompiled = re.compile(reStr, re.I)
reCompiled.match(against).groups()

to get the behaviour you want.
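
To see the flag actually take effect when given to re.compile(), here is a small sketch with an upper-case input. The string `shouting` is made up for illustration; the inline `(?i)` form is just another way to spell the same flag inside the pattern.

```python
import re

reStr = r"([a-z]+)://"
shouting = "HTTP://WWW.HELLO.COM"  # hypothetical input, not from the thread

# Without the flag, [a-z] does not match upper-case letters.
no_flag = re.compile(reStr).match(shouting)

# Flag supplied at compile time, as suggested above.
with_flag = re.compile(reStr, re.I).match(shouting)

# Same effect with the inline flag at the start of the pattern.
inline = re.compile(r"(?i)" + reStr).match(shouting)

print(no_flag, with_flag.groups(), inline.groups())
```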

Andrew
Jan 3 '06 #3
>>>>> mike klaas <mi********@gmail.com> writes:
> In [48]: import re
> In [49]: reStr = r"([a-z]+)://"
> In [51]: against = "http://www.hello.com"
> In [53]: re.match(reStr, against).groups()
> Out[53]: ('http',)
> In [54]: re.match(reStr, against, re.I).groups()
> Out[54]: ('http',)
> In [55]: reCompiled = re.compile(reStr)
> In [56]: reCompiled.match(against).groups()
> Out[56]: ('http',)
> In [57]: reCompiled.match(against, re.I).groups()
> Out[57]: ('tp',)

I can reproduce this on Debian Linux testing, with both Python 2.3 and
Python 2.4. Seems like a bug. search() also exhibits the same behavior.

Ganesan
--
Ganesan Rajagopal (rganesan at debian.org) | GPG Key: 1024D/5D8C12EA
Web: http://employees.org/~rganesan | http://rganesan.blogspot.com
Jan 3 '06 #4
Thanks guys, that is probably the most ridiculous mistake I've made in
years <g>

-Mike

Jan 3 '06 #5
In article <11**********************@f14g2000cwb.googlegroups.com>,
mi********@gmail.com wrote:
> Thanks guys, that is probably the most ridiculous mistake I've made in
> years <g>
>
> -Mike

If that's the most ridiculous you can come up with, you're not trying hard
enough. I've done much worse.
Jan 3 '06 #6
>>>>> mike klaas <mi********@gmail.com> writes:
> Thanks guys, that is probably the most ridiculous mistake I've made in
> years <g>

I was taken in too :-). This is quite embarrassing, considering that I remember
reading a big thread on the python-dev list about this a while back!

Ganesan

--
Ganesan Rajagopal (rganesan at debian.org) | GPG Key: 1024D/5D8C12EA
Web: http://employees.org/~rganesan | http://rganesan.blogspot.com
Jan 3 '06 #7
Would this particular inconsistency be a candidate for change in Py3k?
Seems to me the pos and endpos arguments are redundant with slicing,
and the re.match function would benefit from having the same arguments
as pattern.match. Of course, this is a backwards-incompatible change;
that's why I suggested Py3k.

Jan 3 '06 #8
On 3 Jan 2006 02:20:52 -0800, Sam Pointon <fr*************@gmail.com> wrote:
> Would this particular inconsistency be a candidate for change in Py3k?
> Seems to me the pos and endpos arguments are redundant with slicing,

Being able to specify the start and end indices for a search is
important when working with very large strings (multimegabyte).
Where slicing would create a copy, specifying pos and endpos allows
for memory-efficient searching in limited areas of a string.

> and the re.match function would benefit from having the same arguments
> as pattern.match.

Not at all; the flags need to be specified when the regex is compiled,
as they affect the compiled representation (finite state automaton I
expect) of the regex. If the flags were given in pattern.match(), then
there'd be no performance benefit gained from precompiling the regex.
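
The pos/endpos point can be illustrated with a rough sketch. The buffer size, contents, and window bounds below are made up; the only claim is that search() with pos and endpos scans a window of the original string, while slicing first builds a new string.

```python
import re

pattern = re.compile(r"needle")

# Hypothetical large buffer; size and contents are invented for the example.
haystack = "x" * 1_000_000 + "needle" + "x" * 1_000_000
start, end = 999_000, 1_001_000

# Search only the window [start, end) without building a new string.
in_place = pattern.search(haystack, start, end)

# Slicing finds the same text but copies the window first, and the
# reported offset is relative to the slice rather than the full string.
sliced = pattern.search(haystack[start:end])

print(in_place.start(), sliced.start(), sliced.start() + start)
```

Note that the two forms are not fully interchangeable: with pos, a '^' in the pattern still refers to the real start of the string, not to the index where the search begins.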

Andrew
Jan 3 '06 #9
In article <11**********************@f14g2000cwb.googlegroups.com>,
"Sam Pointon" <fr*************@gmail.com> wrote:
> Would this particular inconsistency be a candidate for change in Py3k?
> Seems to me the pos and endpos arguments are redundant with slicing,
> and the re.match function would benefit from having the same arguments
> as pattern.match. Of course, this is a backwards-incompatible change;
> that's why I suggested Py3k.

I don't see any way to implement re.I at match time; it's something that
needs to get done at regex compile time. It's available in the
module-level match() call, because that one is really compile-then-match().
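
As a sketch of the "compile-then-match" description: the module-level call with a flag gives the same result as compiling with that flag and then matching, and the flag is recorded on the compiled pattern object. The input string here is made up for illustration.

```python
import re

reStr = r"([a-z]+)://"
text = "HTTP://www.hello.com"  # hypothetical input

# Module-level call: the flag is applied while compiling behind the scenes.
module_level = re.match(reStr, text, re.I)

# The explicit two-step equivalent.
two_step = re.compile(reStr, re.I).match(text)

# The flag is stored on the compiled pattern, not supplied per match.
flags_with = re.compile(reStr, re.I).flags
flags_without = re.compile(reStr).flags

print(module_level.groups(), two_step.groups())
print(bool(flags_with & re.I), bool(flags_without & re.I))
```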
Jan 3 '06 #10
In article <ro***********************@reader2.panix.com>,
Roy Smith <ro*@panix.com> wrote:
> In article <11**********************@f14g2000cwb.googlegroups.com>,
> "Sam Pointon" <fr*************@gmail.com> wrote:
>> Would this particular inconsistency be a candidate for change in Py3k?
>> Seems to me the pos and endpos arguments are redundant with slicing,
>> and the re.match function would benefit from having the same arguments
>> as pattern.match. Of course, this is a backwards-incompatible change;
>> that's why I suggested Py3k.
>
> I don't see any way to implement re.I at match time;

It's easy: just compile two machines, one with re.I and one without and
package them as if they were one. Then use the flag to pick a compiled
machine at run time.
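
A rough sketch of that idea, for illustration only. `DualCasePattern` and its `ignore_case` argument are hypothetical names, not part of the re module; the wrapper simply compiles the pattern twice and picks one compiled object per call.

```python
import re


class DualCasePattern:
    """Hypothetical wrapper holding a case-sensitive and a re.I pattern."""

    def __init__(self, pattern):
        # Both variants are compiled up front, so this costs two compilations.
        self._exact = re.compile(pattern)
        self._ignore_case = re.compile(pattern, re.I)

    def match(self, string, ignore_case=False):
        # Choose the precompiled pattern at match time.
        chosen = self._ignore_case if ignore_case else self._exact
        return chosen.match(string)


scheme = DualCasePattern(r"([a-z]+)://")
strict_result = scheme.match("HTTP://www.hello.com")
relaxed_result = scheme.match("HTTP://www.hello.com", ignore_case=True)

print(strict_result, relaxed_result.groups())
```

The trade-off is that every additional flag doubles the number of compiled variants, so this only scales to a small number of flags.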

rg
Jan 3 '06 #11
Roy Smith wrote:
> LOL, and you'll be LOL too when you see the problem :-)
>
> You can't give the re.I flag to reCompiled.match(). You have to give
> it to re.compile(). The second argument to reCompiled.match() is the
> position where to start searching. I'm guessing re.I is defined as 2,
> which explains the match you got.
>
> This is actually one of those places where duck typing let us down.
> If we had type bondage, re.I would be an instance of RegExFlags, and
> reCompiled.match() would have thrown a TypeError when the second
> argument wasn't an integer. I'm not saying type bondage is inherently
> better than duck typing, just that it has its benefits at times.

Even with duck-typing, we could cut our users a break. Making
our flags instances of a distinct class doesn't actually require
type bondage.

We could define the __or__ method for RegExFlags, but really,
or-ing together integer flags is old habit from low-level
languages. Really we should pass a set of flags.
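
Both ideas can be sketched under some assumptions. `strict_match` and `compile_with_flags` are hypothetical helpers, not part of re. The type check relies on re.RegexFlag, the class the flag constants belong to in recent Python 3 releases; in the Python 2.x versions discussed in this thread the flags were plain integers, so the check would not work there.

```python
import re
from functools import reduce
from operator import or_


def strict_match(pattern, string, pos=0):
    """Like pattern.match(string, pos), but refuse a flag passed as pos."""
    if isinstance(pos, re.RegexFlag):
        raise TypeError("flags belong in re.compile(), not in match()")
    return pattern.match(string, pos)


def compile_with_flags(pattern, flags=frozenset()):
    """Accept a set of flags and combine them before compiling."""
    return re.compile(pattern, reduce(or_, flags, 0))


compiled = compile_with_flags(r"([a-z]+)://", {re.I, re.M})
print(strict_match(compiled, "HTTP://www.hello.com").groups())

try:
    strict_match(compiled, "HTTP://www.hello.com", re.I)
except TypeError as exc:
    print("rejected:", exc)
```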
--
--Bryan
Jan 5 '06 #12

Bryan> We could define the __or__ method for RegExFlags, but really,
Bryan> or-ing together integer flags is old habit from low-level
Bryan> languages. Really we should pass a set of flags.

Good idea. Added to the Python3.0Suggestions wiki page:

http://wiki.python.org/moin/Python3%2e0Suggestions

Skip
Jan 5 '06 #13
