Bytes IT Community

unexpected behaviour for python regexp: caret symbol almost useless?

This regexp
'<widget class=".*" id=".*">'

works well with 'grep' for matching lines of the kind
<widget class="GtkWindow" id="window1">

in an XML .glade file.

However, that is not true for the re module in Python, which seems to
treat the regexp as if it had been specified this way:
'^<widget class=".*" id=".*">'

For some reason, regexps in Python appear to match from the start of the
line, whether or not you use the caret symbol '^'.

I had a hard time working out why this regexp wasn't matching:
regexp = re.compile(r'<widget class=".*" id="(.*)">')

The solution was to account for the surrounding whitespace:
regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

To reproduce the behaviour, just take a .glade file and this Python script:
<code>
import re

glade_file_name = 'some.glade'

# pattern that starts directly at the tag
bad_regexp = re.compile(r'<widget class=".*" id="(.*)">')
# pattern that also allows whitespace around the tag
good_regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

with open(glade_file_name) as glade_file:
    for line in glade_file:
        if bad_regexp.match(line):
            print('bad:', line.strip())
        if good_regexp.match(line):
            print('good:', line.strip())
</code>

The thing is, I would have expected to need an explicit caret to tell
the regexp to match at the start of the line, something like:
r'^<widget class=".*" id="(.*)">'
However, the Python regexp is taking care of that for me. This is not the
behaviour I would expect from what I know about regexps, but maybe I'm
missing something.

May 28 '06 #1
4 Replies


conan wrote:
The thing is, I would have expected to need an explicit caret to tell
the regexp to match at the start of the line, something like:
r'^<widget class=".*" id="(.*)">'
However, the Python regexp is taking care of that for me. This is not the
behaviour I would expect from what I know about regexps, but maybe I'm
missing something.


You want search(), not match().

http://docs.python.org/lib/matching-searching.html
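To illustrate the difference, here is a minimal sketch using a made-up sample line (the indentation and the variable names are assumptions, not taken from the original file). match() only tries the very start of the string, while search() scans the whole string the way grep scans a line; with search(), an explicit caret becomes meaningful again.

```python
import re

# Sample line as it might appear in a .glade file (note the leading spaces).
line = '    <widget class="GtkWindow" id="window1">\n'

pattern = re.compile(r'<widget class=".*" id="(.*)">')
anchored = re.compile(r'^<widget class=".*" id="(.*)">')

# match() only tries position 0, so the leading spaces make it fail.
match_result = pattern.match(line)

# search() looks for the pattern anywhere in the string.
search_result = pattern.search(line)

# With search(), the caret anchors to the start, so the indented line fails.
anchored_result = anchored.search(line)

print(match_result)            # None
print(search_result.group(1))  # window1
print(anchored_result)         # None
```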

Peter
May 28 '06 #2

"conan" <co************@gmail.com> wrote in message
news:11**********************@38g2000cwa.googlegroups.com...
This regexp
'<widget class=".*" id=".*">'

works well with 'grep' for matching lines of the kind
<widget class="GtkWindow" id="window1">

in an XML .glade file.


As Peter Otten has already mentioned, this is the difference between the re
"match" and "search" methods.

As purely a lateral exercise, here is a pyparsing rendition of your program:

------------------------------------
from pyparsing import makeXMLTags, line

# define pyparsing patterns for begin and end XML tags
widgetStart,widgetEnd = makeXMLTags("widget")

# read the file contents
glade_file_name = 'some.glade'
gladeContents = open(glade_file_name).read()

# scan the input string for matching tags
for widget, start, end in widgetStart.scanString(gladeContents):
    print("good:", line(start, gladeContents).strip())
    print(widget["class"], widget["id"])
    print("Class: %(class)s; Id: %(id)s" % widget)
------------------------------------
Not quite an exact match: only the good lines get listed. But also check
out some of the other capabilities. To do this with re's, you have to
clutter up the re expression with field names, as in:

(r'<widget class="(?P<class>.*)" id="(?P<id>.*)">')
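As a rough sketch of how such named groups might be used with the re module: the sample line is made up, and `[^"]*` is swapped in for the original `.*` (an assumption on my part) so that each value stops at its own closing quote.

```python
import re

# Assumption: [^"]* instead of .* keeps each value inside its own quotes.
widget_re = re.compile(r'<widget class="(?P<class>[^"]*)" id="(?P<id>[^"]*)">')

# Hypothetical sample line, indented as in a typical .glade file.
line = '  <widget class="GtkWindow" id="window1">'

# search() rather than match(), so the indentation does not matter.
m = widget_re.search(line)

# groupdict() returns the named groups as a plain dict.
fields = m.groupdict()
print("Class: %(class)s; Id: %(id)s" % fields)
```

The values are reached through groupdict() or m.group('class'), since attribute-style access is not available on match objects.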

The parsing patterns generated by makeXMLTags give dict-like and
attribute-like access to any attributes included with the tag. If not for
the unfortunate attribute name "class" (which is a Python keyword), you
could also reference these values as widget.class and widget.id.

If you are parsing HTML, there is also a makeHTMLTags function, which creates
patterns that are less rigid about upper/lower case and other XML
strictnesses.

-- Paul
May 28 '06 #3

Thank you. I had read this, but somehow I missed it when the issue
arose.

May 29 '06 #4

Thank you Paul.

Since the only thing I'm doing is extracting these fields, and I have no
plans to include other stuff, a regexp is fine. However, I will keep
'pyparsing' in mind when I need to do more complex parsing.

As you can see in the example I sent, I was trying to get info from a
glade file. In particular, I was tired of doing this every time I need to
access a widget:

some_var = xml.get_widget('some_id')

(doing this is tiresome when you have more than 10 widgets)

So I wrote a little module that has all the widgets instantiated as attributes
of the object. For anyone interested, it is at:

http://www.lugmen.org.ar/~p10n/sourc.../GetWidgets.py

However, it is still pretty immature, since it lacks some checks.
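For readers who cannot reach that file, the idea might look roughly like the sketch below. This is not the GetWidgets.py module itself: the names `widget_ids`, `Widgets` and `fake_get_widget` are hypothetical, and the stand-in lookup function replaces xml.get_widget so that the example runs without GTK.

```python
import re

# Assumption: [^"]* is used so that each attribute value stops at its closing quote.
ID_RE = re.compile(r'<widget class="[^"]*" id="([^"]*)">')


def widget_ids(glade_text):
    """Return every widget id found in the glade XML text."""
    return ID_RE.findall(glade_text)


class Widgets(object):
    """Expose each widget as an attribute, looked up through get_widget."""

    def __init__(self, glade_text, get_widget):
        for widget_id in widget_ids(glade_text):
            setattr(self, widget_id, get_widget(widget_id))


def fake_get_widget(widget_id):
    """Stand-in for xml.get_widget; returns a label instead of a real widget."""
    return 'widget:' + widget_id


# Hypothetical fragment of a .glade file.
sample = '''
<widget class="GtkWindow" id="window1">
  <child>
    <widget class="GtkButton" id="button1">
'''

widgets = Widgets(sample, fake_get_widget)
print(widgets.window1, widgets.button1)
```

In real use, the second argument would be the bound xml.get_widget method, so that widgets.window1 replaces xml.get_widget('window1').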

May 29 '06 #5
