regex confusion

John Hunter

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

to match the string contents of the file. But it doesn't. Here is
a complete example

import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = rgxPrev.match(s)
print m
print s.find('a')

m is None (no match) and the s.find('a') reports an 'a' at index 48.

I read the regex to mean non-greedy match of anything up to an a,
followed by non-greedy match of anything following an a, which this
file should match.

Or am I insane?

John Hunter
hunter:~/python/projects/poker/data/pokerroom> uname -a
Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686
i686 i386 GNU/Linux
hunter:~/python/projects/poker/data/pokerroom> python
Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
[GCC 3.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to rlcompleter2 0.95
for nice experiences hit <tab> multiple times

Jul 18 '05 #1

Luther Barnum

Maybe you meant:
import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = re.match(rgxPrev, s)  # changed line: module-level re.match
print m
print s.find('a')

match takes two arguments
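
For reference, a minimal sketch of the two calling conventions, written in Python 3 syntax. The string `s` is a made-up stand-in for the downloaded page, and the names `m_method` and `m_function` are only illustrative. The method on a compiled pattern takes the string alone, while the module-level `re.match` takes the pattern and the string, so both forms should give the same result:

```python
import re

# Hypothetical stand-in for the page contents; the real file is not reproduced here.
s = "banana split"

rgx = re.compile('.*?a.*?')

m_method = rgx.match(s)        # method on the compiled pattern: string only
m_function = re.match(rgx, s)  # module-level function: pattern, then string

print(m_method.group(), m_function.group())
```

Since both calls run the same match, switching between them would not by itself explain a `None` result.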

Jul 18 '05 #2

Diez B. Roggisch

John Hunter wrote:

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the exp to
the left. I'm not exactly sure why this is not working, but it's definitely
redundant. Eliminating the redundancy gives you this:

rgxPrev = re.compile('.*a.*')

Works perfect.

Regards,

Diez
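
A quick way to compare the two patterns, sketched in Python 3 syntax. The string `s` is invented for illustration and assumes the first line of the input contains no 'a'. Under that assumption both the non-greedy and the greedy pattern fail with `match()`, which suggests greediness is not what decides the outcome here:

```python
import re

# Hypothetical multi-line input; the first line contains no 'a'.
s = "<html>\n<body>a table</body>\n"

patterns = ('.*?a.*?', '.*a.*')
results = {p: re.match(p, s) for p in patterns}
for p in patterns:
    print(p, results[p])  # both None: '.' cannot cross the newline

# With an 'a' on the first line, both variants match.
first_line_hits = [re.match(p, "banana\n") is not None for p in patterns]
print(first_line_hits)
```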

Jul 18 '05 #3

A.M. Kuchling

On Tue, 09 Dec 2003 09:43:24 -0600,
John Hunter <jd******@ace.bsd.uchicago.edu> wrote:

rgxPrev = re.compile('.*?a.*?')

'.' doesn't match newlines unless you specify the re.DOTALL / (?s) flag, so it
won't match unless 'a' is on the very first line. Add (?s) to your
expression, and it should work (though it'll be much slower than the .find()
method).

--amk
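
The DOTALL behavior described above can be sketched as follows, in Python 3 syntax. The string `s` is a made-up example in which the first 'a' only appears on the second line; it is not the original file:

```python
import re

# Hypothetical input: no 'a' on the first line, one on the second.
s = "<html>\n<head>\n"

m_plain = re.match('.*?a.*?', s)             # '.' stops at the newline
m_flag = re.match('.*?a.*?', s, re.DOTALL)   # flag passed as an argument
m_inline = re.match('(?s).*?a.*?', s)        # same flag, written inline

print(m_plain)
print(repr(m_flag.group()), repr(m_inline.group()))
print(s.find('a'))
```

With the flag set, the match runs from the start of the string up to and including the first 'a', so its end position is one past the index that `s.find('a')` reports.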

Jul 18 '05 #4

Peter Hansen

"Diez B. Roggisch" wrote:

John Hunter wrote:

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the exp to
the left.

Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. .... Adding "?" after the qualifier makes it perform the match
in non-greedy or minimal fashion; as few characters as possible will be
matched. ....

-Peter
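
A small sketch of the difference the quoted documentation describes, in Python 3 syntax. The sample `text` and the variable names are invented for illustration:

```python
import re

text = "<b>bold</b>"  # hypothetical sample text

greedy = re.match('<.*>', text).group()    # '*' takes as much as it can
minimal = re.match('<.*?>', text).group()  # '*?' takes as little as it can

print(greedy)
print(minimal)
```

The greedy form runs to the last '>' in the string, while the non-greedy form stops at the first one.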

Jul 18 '05 #5

Peter Otten

John Hunter wrote:

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

to match the string contents of the file. But it doesn't. Here is
[...]
I read the regex to mean non-greedy match of anything up to an a,
followed by non-greedy match of anything following an a, which this
file should match.

There is a nice example where non-greedy regexes are really useful in A. M.
Kuchling's Regex Howto (http://www.amk.ca/python/howto/regex/regex.html)

Or am I insane?

This may be off-topic, but the easiest if not fastest way to find multiple
occurrences of a string in a text is:

>>> import re
>>> r = re.compile("a")
>>> for m in r.finditer("abca\na"):
...     print m.start()
...
0
3
5

Peter
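
The same idea as a short script in Python 3 syntax, collecting the start offsets into a list. It reuses the sample string from the session above; the second pattern, `[ab]`, is only an illustrative assumption to show that `finditer` works the same way for a real regex as for a literal:

```python
import re

text = "abca\na"  # same sample string as in the session above

literal_starts = [m.start() for m in re.finditer("a", text)]
class_starts = [m.start() for m in re.finditer("[ab]", text)]

print(literal_starts)
print(class_starts)
```

Because `finditer` scans the whole string rather than anchoring at the start, the newline does not get in the way here.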

Jul 18 '05 #6

Diez B. Roggisch

This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the exp to
the left.

Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. .... Adding "?" after the qualifier makes it perform the
match in non-greedy or minimal fashion; as few characters as possible will
be matched. ....

Hmm. But if that's true, what does ".??" then mean - the first ? is not
greedy, so nothing is matched at all. The same is true for ".*?", and
".+?" is then equal to "." So what makes this useful? The regex in question
definitely didn't work with it.

Diez

Jul 18 '05 #7

Diez B. Roggisch

Hmm. But if that's true, what does ".??" then mean - the first ? is not
greedy, so nothing is matched at all. The same is true for ".*?", and
".+?" is then equal to "." So what makes this useful? The regex in
question definitely didn't work with it.

Ok - I just found out - it makes sense when taking into account what follows
in the regex, as that will be matched earlier. Neat - didn't know that such
things existed.

Diez
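
A minimal sketch of that observation, in Python 3 syntax with an invented sample string `s`. On their own the non-greedy qualifiers match as little as they are allowed to, but once something follows them in the pattern they expand just far enough for the rest to match:

```python
import re

s = "xxay"  # hypothetical sample string

alone_star = re.match('.*?', s).group()      # zero characters suffice
alone_plus = re.match('.+?', s).group()      # one character is the minimum
alone_opt = re.match('.??', s).group()       # zero characters suffice
followed = re.match('.*?a', s).group()       # expands until 'a' can match

print(repr(alone_star), repr(alone_plus), repr(alone_opt), repr(followed))
```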

Jul 18 '05 #8

John Hunter

>>>>> "Peter" == Peter Otten <__*******@web.de> writes:

Peter> This may be off-topic, but the easiest if not fastest way
Peter> to find multiple occurrences of a string in a text is:

Right, I actually am using regex matching and not literal char
matching, but in trying to debug why my regex wasn't working, I
simplified it to the simplest case I could, which was a string
literal.

Thanks for the DOTALL pointer above.

JDH

Jul 18 '05 #9
