
unexpected behaviour for python regexp: caret symbol almost useless?

This regexp
'<widget class=".*" id=".*">'

works well with 'grep' for matching lines of the kind
<widget class="GtkWindow" id="window1">

in an XML .glade file.

However, that's not true for the re module in Python, which treats
the regexp as if it had been specified this way:
'^<widget class=".*" id=".*">'

For some reason, regexps in Python seem to match from the start of the
line, whether or not you use the caret symbol '^'.

It took me a long time to see why this regexp wasn't working:
regexp = re.compile(r'<widget class=".*" id="(.*)">')

The solution was to allow for leading and trailing whitespace:
regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

To reproduce the behaviour, just take a .glade file and this Python script:
<code>
import re

glade_file_name = 'some.glade'

bad_regexp = re.compile(r'<widget class=".*" id="(.*)">')
good_regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

with open(glade_file_name) as glade_file:
    for line in glade_file:
        # match() only tries the pattern at the very start of each line
        if bad_regexp.match(line):
            print('bad:', line.strip())
        if good_regexp.match(line):
            print('good:', line.strip())
</code>
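
For anyone without a .glade file at hand, the same behaviour can probably be reproduced with an in-memory sample. This is only a sketch: the sample XML and the helper name `classify` are made up for illustration, on the assumption that nested widgets in a .glade file are indented.

```python
import io
import re

# Hypothetical sample; nested widgets are assumed to be indented like this.
SAMPLE = (
    '<widget class="GtkWindow" id="window1">\n'
    '  <child>\n'
    '    <widget class="GtkButton" id="button1">\n'
)

bad_regexp = re.compile(r'<widget class=".*" id="(.*)">')
good_regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')


def classify(text):
    """Return (bad_ids, good_ids): the ids each pattern finds with match()."""
    bad_ids, good_ids = [], []
    for line in io.StringIO(text):
        bad = bad_regexp.match(line)
        good = good_regexp.match(line)
        if bad:
            bad_ids.append(bad.group(1))
        if good:
            good_ids.append(good.group(1))
    return bad_ids, good_ids


bad_ids, good_ids = classify(SAMPLE)
print('bad: ', bad_ids)
print('good:', good_ids)
```

Only the unindented line is picked up by the first pattern, which suggests the leading whitespace is what makes the difference.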

The thing is, I would have expected to have to put the caret in explicitly
to tell the regexp to match at the start of the line, something like:
r'^<widget class=".*" id="(.*)">'
However, Python's regexp is taking care of that for me. This is not the
behaviour I'd expect from what I know about regexps, but maybe I'm missing
something.

May 28 '06 #1
conan wrote:
> The thing is, I would have expected to have to put the caret in explicitly
> to tell the regexp to match at the start of the line, something like:
> r'^<widget class=".*" id="(.*)">'
> However, Python's regexp is taking care of that for me. This is not the
> behaviour I'd expect from what I know about regexps, but maybe I'm missing
> something.


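You want search(), not match(). match() only succeeds if the pattern
matches at the very beginning of the string, while search() scans the
string for the first position where it matches.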

http://docs.python.org/lib/matching-searching.html
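
A minimal sketch of the difference, assuming an indented line like the ones found in a .glade file (the sample line is made up):

```python
import re

pattern = re.compile(r'<widget class=".*" id="(.*)">')
line = '    <widget class="GtkButton" id="button1">\n'

# match() only succeeds if the pattern matches at position 0 of the string.
print(pattern.match(line))             # None: the line starts with spaces

# search() scans forward and reports the first position that matches.
found = pattern.search(line)
print(found.group(1), found.start())   # button1 4

# With search(), the caret is what pins the match to the start again.
anchored = re.compile(r'^<widget class=".*" id="(.*)">')
print(anchored.search(line))           # None
```

So the caret is not useless: it matters as soon as search() is used instead of match().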

Peter
May 28 '06 #2
"conan" <co************@gmail.com> wrote in message
news:11**********************@38g2000cwa.googlegroups.com...
> This regexp
> '<widget class=".*" id=".*">'
>
> works well with 'grep' for matching lines of the kind
> <widget class="GtkWindow" id="window1">
>
> in an XML .glade file


As Peter Otten has already mentioned, this is the difference between the re
"match" and "search" methods.

As purely a lateral exercise, here is a pyparsing rendition of your program:

------------------------------------
from pyparsing import makeXMLTags, line

# define pyparsing patterns for begin and end XML tags
widgetStart,widgetEnd = makeXMLTags("widget")

# read the file contents
glade_file_name = 'some.glade'
gladeContents = open(glade_file_name).read()

# scan the input string for matching tags
for widget, start, end in widgetStart.scanString(gladeContents):
    # line() returns the full line of text containing position 'start'
    print("good:", line(start, gladeContents).strip())
    print(widget["class"], widget["id"])
    print("Class: %(class)s; Id: %(id)s" % widget)
------------------------------------
Not quite an exact match: only the good lines get listed. But also check
out some of the other capabilities. To do this with re's, you have to
clutter up the re expression with field names, as in:

re.compile(r'<widget class="(?P<class>.*)" id="(?P<id>.*)">')
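
As a rough sketch of how the named groups could then be read back, here is one way to do it with `groupdict()`. The sample line is made up, and the quotes are kept outside both groups so the captured values come back unquoted:

```python
import re

# Named groups: (?P<name>...) makes each capture available by name.
widget_re = re.compile(r'<widget class="(?P<class>.*)" id="(?P<id>.*)">')
line = '  <widget class="GtkWindow" id="window1">'

m = widget_re.search(line)
if m:
    fields = m.groupdict()   # {'class': 'GtkWindow', 'id': 'window1'}
    print("Class: %(class)s; Id: %(id)s" % fields)
```

The dict returned by `groupdict()` works with the same `%(name)s` formatting used in the pyparsing version.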

The parsing patterns generated by makeXMLTags give dict-like and
attribute-like access to any attributes included with the tag. If not for
the unfortunate attribute name "class" (which is a Python keyword), you
could also reference these values as widget.class and widget.id.

If you are parsing HTML, there is also a makeHTMLTags method, which creates
patterns that are less rigid about upper/lower case and other XML
strictnesses.

-- Paul
May 28 '06 #3
Thank you. I had read this, but somehow I missed it when the issue
arose.

May 29 '06 #4
Thank you Paul.

Since the only thing I'm doing is extracting these fields, and I have no
plans to include other stuff, a regexp is fine. However, I will keep
'pyparsing' in mind when I need to do more complex parsing.

As you can see in the example I sent, I was trying to get info from a
glade file. In particular, I was tired of doing this every time I need to
access a widget:

some_var = xml.get_widget('some_id')

(doing this is tiresome when you have more than 10 widgets)

So I wrote a little module that instantiates all the widgets as attributes
of an object. For anyone interested, it is at:

http://www.lugmen.org.ar/~p10n/sourc.../GetWidgets.py
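
For readers who just want the general idea, here is a rough sketch of one way widgets could be exposed as attributes. It is not the GetWidgets.py module linked above: it swaps in a lazy `__getattr__` lookup rather than scanning the file with a regexp, and `FakeGladeXML` and `Widgets` are hypothetical names standing in for the real glade object and the wrapper.

```python
class FakeGladeXML:
    """Hypothetical stand-in for the object whose get_widget() is used above."""

    def __init__(self, widgets):
        self._widgets = widgets

    def get_widget(self, name):
        # This stand-in returns None for unknown ids.
        return self._widgets.get(name)


class Widgets:
    """Expose widgets as attributes, looking each one up on first access."""

    def __init__(self, xml):
        self._xml = xml

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)   # avoid recursion on private names
        widget = self._xml.get_widget(name)
        if widget is None:
            raise AttributeError('no widget with id %r' % name)
        setattr(self, name, widget)      # cache so later accesses skip the lookup
        return widget


xml = FakeGladeXML({'window1': 'a window', 'button1': 'a button'})
widgets = Widgets(xml)
print(widgets.window1)
```

With something like this, `widgets.some_id` would replace each `xml.get_widget('some_id')` call, and an unknown id raises AttributeError instead of silently giving None.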

However, it is still pretty immature, since it lacks some checks.

May 29 '06 #5
