regex confusion


In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

to match the string contents of the file. But it doesn't. Here is
a complete example:

import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = rgxPrev.match(s)
print m
print s.find('a')

m is None (no match), and s.find('a') reports an 'a' at index 48.

I read the regex to mean non-greedy match of anything up to an a,
followed by non-greedy match of anything following an a, which this
file should match.

Or am I insane?

John Hunter
hunter:~/python/projects/poker/data/pokerroom> uname -a
Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux
hunter:~/python/projects/poker/data/pokerroom> python
Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
[GCC 3.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to rlcompleter2 0.95
for nice experiences hit <tab> multiple times

Jul 18 '05 #1
Maybe you meant:

import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = re.match(rgxPrev, s)  # changed line
print m
print s.find('a')

The module-level re.match takes two arguments: the pattern and the string.
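
For comparison, here is a minimal sketch of the two call forms, written in Python 3 syntax with a made-up test string ('banana' and the name rgx are only for illustration). The match method of a compiled pattern takes just the string, while the module-level re.match takes the pattern and then the string. As far as I know, both return the same result for the same pattern.

```python
import re

rgx = re.compile('.*?a.*?')
s = 'banana'  # hypothetical test string, not the file from the post

m_method = rgx.match(s)        # method on the compiled pattern: one argument
m_function = re.match(rgx, s)  # module-level function: pattern, then string

print(m_method.group())    # ba
print(m_function.group())  # ba
```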

Jul 18 '05 #2
John Hunter wrote:

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')


This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the expression to
the left. I'm not exactly sure why this is not working, but it's definitely
redundant. Eliminating the redundancy gives you this:

rgxPrev = re.compile('.*a.*')

Works perfectly.

Regards,

Diez

Jul 18 '05 #3
On Tue, 09 Dec 2003 09:43:24 -0600,
John Hunter <jd******@ace.bsd.uchicago.edu> wrote:
rgxPrev = re.compile('.*?a.*?')


'.' doesn't match newlines unless you specify the re.DOTALL / (?s) flag, so it
won't match unless 'a' is on the very first line. Add (?s) to your
expression, and it should work (though it'll be much slower than the .find()
method).
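
A minimal sketch of the point above, in Python 3 syntax. The string s is a made-up stand-in for the downloaded page (the real file contents are not shown in the thread); its first 'a' sits on the second line.

```python
import re

# Hypothetical stand-in for the page: the first 'a' is on line 2.
s = "<html>\n<head>\n<title>showdown</title>\n"

plain = re.compile('.*?a.*?')              # '.' stops at '\n'
dotall = re.compile('.*?a.*?', re.DOTALL)  # '.' also matches '\n'
inline = re.compile('(?s).*?a.*?')         # same flag, written inline

m_plain = plain.match(s)    # match() only tries position 0
m_dotall = dotall.match(s)
m_inline = inline.match(s)
m_search = plain.search(s)  # search() tries every starting position

print(s.find('a'))              # 10
print(m_plain)                  # None
print(repr(m_dotall.group()))   # '<html>\n<hea'
print(repr(m_inline.group()))   # '<html>\n<hea'
print(m_search.span())          # (7, 11)
```

Without the flag, match() fails because it has to start at index 0 and '.' cannot get past the first newline. search() succeeds even without the flag, since it is allowed to start on the line that contains the 'a'.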

--amk
Jul 18 '05 #4
"Diez B. Roggisch" wrote:

John Hunter wrote:

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')


This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the expression to
the left.


Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. .... Adding "?" after the qualifier makes it perform the match
in non-greedy or minimal fashion; as few characters as possible will be
matched. ....
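
A small sketch of the difference the documentation describes, in Python 3 syntax. The test string is made up for illustration.

```python
import re

s = '<b>bold</b>'  # hypothetical test string

greedy = re.match('<.*>', s)   # '*' takes as much as it can
lazy = re.match('<.*?>', s)    # '*?' stops at the first place where '>' fits

print(greedy.group())  # <b>bold</b>
print(lazy.group())    # <b>
```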

-Peter
Jul 18 '05 #5
John Hunter wrote:

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

to match the string contents of the file. But it doesn't. Here is
[...]
I read the regex to mean non-greedy match of anything up to an a,
followed by non-greedy match of anything following an a, which this
file should match.

There is a nice example where non-greedy regexes are really useful in A. M.
Kuchling's Regex Howto (http://www.amk.ca/python/howto/regex/regex.html)

Or am I insane?


This may be off-topic, but the easiest if not fastest way to find multiple
occurrences of a string in a text is:

>>> import re
>>> r = re.compile("a")
>>> for m in r.finditer("abca\na"):
...     print m.start()
...
0
3
5


Peter
Jul 18 '05 #6
This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the expression to
the left.


Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. .... Adding "?" after the qualifier makes it perform the
match in non-greedy or minimal fashion; as few characters as possible will
be matched. ....


Hmm. But if that's true, what does ".??" mean then? The first ? is not
greedy, so nothing is matched at all. The same is true for ".*?", and
".+?" is then equal to ".". So what makes this useful? The regex in question
definitely didn't work with it.

Diez
Jul 18 '05 #7
Hmm. But if that's true, what does ".??" mean then? The first ? is not
greedy, so nothing is matched at all. The same is true for ".*?", and
".+?" is then equal to ".". So what makes this useful? The regex in
question definitely didn't work with it.


Ok - I just found out - it makes sense when taking into account what follows
in the regex, as that will be matched earlier. Neat - didn't know that such
things existed.
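
A short sketch of that behaviour, in Python 3 syntax with made-up test strings. On their own the non-greedy forms settle for the minimum, but when something follows them in the pattern they grow until the rest can match.

```python
import re

# On their own, the non-greedy forms match as little as possible.
alone_star = re.match('.*?', 'xxa').group()   # ''
alone_opt = re.match('.??', 'xxa').group()    # ''
alone_plus = re.match('.+?', 'xxa').group()   # 'x'

# With something after them, they expand until the rest of the pattern fits.
with_star = re.match('.*?a', 'xxa').group()   # 'xxa'
with_opt = re.match('.??a', 'xa').group()     # 'xa'
with_plus = re.match('.+?a', 'xxa').group()   # 'xxa'

# The pattern from the original post: the trailing '.*?' has nothing after
# it, so it matches the empty string and the match ends right after the 'a'.
original = re.match('.*?a.*?', 'xxabc').group()  # 'xxa'

print(repr(alone_star), repr(alone_opt), repr(alone_plus))
print(repr(with_star), repr(with_opt), repr(with_plus))
print(repr(original))
```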

Diez
Jul 18 '05 #8
>>>>> "Peter" == Peter Otten <__*******@web.de> writes:

Peter> This may be off-topic, but the easiest if not fastest way
Peter> to find multiple occurrences of a string in a text is:

Right, I actually am using regex matching and not literal char
matching, but in trying to debug why my regex wasn't working, I
simplified it to the simplest case I could, which was a string
literal.

Thanks for the DOTALL pointer above.

JDH

Jul 18 '05 #9
