472,331 Members | 1,604 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,331 software developers and data experts.

unexpected behaviour for python regexp: caret symbol almost useless?

This regexp
'<widget class=".*" id=".*">'

works well with 'grep' for matching lines of the kind
<widget class="GtkWindow" id="window1">

on a XML .glade file

However that's not true for the re module in python, since this one
takes the regexp as if were specified this way: '^<widget class=".*"
id=".*">'

For some reason regexp on python decide to match from the start of the
line, no matter if you used or not the caret symbol '^'.

I have a hard time to note why this regexp wasn't working:
regexp = re.compile(r'<widget class=".*" id="(.*)">')

The solution was to consider spaces:
regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

To reproduce behaviour just take a .glade file and this python script:
<code>
import re

glade_file_name = 'some.glade'

bad_regexp = re.compile(r'<widget class=".*" id="(.*)">')
good_regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

for line in open(glade_file_name):
if bad_regexp.match(line):
print 'bad:', line.strip()
if good_regexp.match(line):
print 'good:', line.strip()
</code>

The thing is i should expected to have to put caret explicitly to tell
the regexp to match at the start of the line, something like:
r'^<widget class=".*" id="(.*)">'
however python regexp is taking care of that for me. This is not a
desired behaviour for what i know about regexp, but maybe i'm missing
something.

May 28 '06 #1
4 2668
conan wrote:
The thing is i should expected to have to put caret explicitly to tell
the regexp to match at the start of the line, something like:
r'^<widget class=".*" id="(.*)">'
however python regexp is taking care of that for me. This is not a
desired behaviour for what i know about regexp, but maybe i'm missing
something.


You want search(), not match().

http://docs.python.org/lib/matching-searching.html

Peter
May 28 '06 #2
"conan" <co************@gmail.com> wrote in message
news:11**********************@38g2000cwa.googlegro ups.com...
This regexp
'<widget class=".*" id=".*">'

works well with 'grep' for matching lines of the kind
<widget class="GtkWindow" id="window1">

on a XML .glade file


As Peter Otten has already mentioned, this is the difference between the re
"match" and "search" methods.

As purely a lateral exercise, here is a pyparsing rendition of your program:

------------------------------------
from pyparsing import makeXMLTags, line

# define pyparsing patterns for begin and end XML tags
widgetStart,widgetEnd = makeXMLTags("widget")

# read the file contents
glade_file_name = 'some.glade'
gladeContents = open(glade_file_name).read()

# scan the input string for matching tags
for widget,start,end in widgetStart.scanString(gladeContents):
print "good:", line(start, gladeContents).strip()
print widget["class"], widget["id"]
print "Class: %(class)s; Id: %(id)s" % widget
------------------------------------
Not quite an exact match, only the good lines get listed. But also check
out some of the other capabilities. To do this with re's, you have to
clutter up the re expression with field names, as in:

(r'<widget class=(?P<class>".*") id="(?P<id>.*)">')

The parsing patterns generated by makeXMLTags give dict-like and
attribute-like access to any attributes included with the tag. If not for
the unfortunate attribute name "class" (which is a Python keyword), you
could also reference these values as widget.class and widget.id.

If you are parsing HTML, there is also a makeHTMLTags method, which creates
patterns that are less rigid about upper/lower case and other XML
strictnesses.

-- Paul
May 28 '06 #3
Thank you, i have read this but somehow a missed it when the issue
arose.

May 29 '06 #4
Thank you Paul.

Since the only thing i'm doing is extracting this fields, and have no
plans to include other stuff, a regexp is fine. However i will take
into account 'pyparsing' when i need to do more complex parsing.

As you can see in the example i send, i was trying to get info from a
glade file, in particular i was tired of doing this everytime i need to
access a widget:

some_var = xml.get_widget('some_id')

(doing this is tiresome when you have more than 10 widgets)

So i do a little module to have all widgets instanciated as attributes
of the object, for anyone interested it is on:

http://www.lugmen.org.ar/~p10n/sourc.../GetWidgets.py

However is still pretty unmature, since it lacks some checks.

May 29 '06 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
by: Helmut Zeisel | last post by:
I want to build a static extension of Python using SWIG and VC++ 6.0 as described in http://www.swig.org/Doc1.3/Python.html#n8 for gcc. My file...
7
by: Paul Gorodyansky | last post by:
Hi, Say I have a text in my TEXTAREA box - 01234567890 I want - using script - insert say "abc" in the middle. Works almost OK in Internet...
2
by: Gerhard Esterhuizen | last post by:
Hi, I am observing unexpected behaviour, in the form of a corrupted class member access, from a simple C++ program that accesses an attribute...
3
by: rimmer | last post by:
I'm writing an extension module in C in which I'm passing an array of floats from C to python. The code below illustrates a simple C function...
852
by: Mark Tarver | last post by:
How do you compare Python to Lisp? What specific advantages do you think that one has over the other? Note I'm not a Python person and I have no...
23
by: gu | last post by:
hi to all! after two days debugging my code, i've come to the point that the problem was caused by an unexpected behaviour of python. or by lack of...
4
Jezternz
by: Jezternz | last post by:
Find the Caret starting and ending position in a textarea. I have been looking for a function that will do this for a long time. Its amazing how...
162
by: Sh4wn | last post by:
Hi, first, python is one of my fav languages, and i'll definitely keep developing with it. But, there's 1 one thing what I -really- miss: data...
4
by: Matt | last post by:
Hello all, I have just discovered (the long way) that using a RegExp object with the 'global' flag set produces inconsistent results when its...
0
by: tammygombez | last post by:
Hey everyone! I've been researching gaming laptops lately, and I must say, they can get pretty expensive. However, I've come across some great...
0
better678
by: better678 | last post by:
Question: Discuss your understanding of the Java platform. Is the statement "Java is interpreted" correct? Answer: Java is an object-oriented...
0
by: teenabhardwaj | last post by:
How would one discover a valid source for learning news, comfort, and help for engineering designs? Covering through piles of books takes a lot of...
0
by: Kemmylinns12 | last post by:
Blockchain technology has emerged as a transformative force in the business world, offering unprecedented opportunities for innovation and...
0
by: CD Tom | last post by:
This happens in runtime 2013 and 2016. When a report is run and then closed a toolbar shows up and the only way to get it to go away is to right...
0
by: CD Tom | last post by:
This only shows up in access runtime. When a user select a report from my report menu when they close the report they get a menu I've called Add-ins...
0
by: Naresh1 | last post by:
What is WebLogic Admin Training? WebLogic Admin Training is a specialized program designed to equip individuals with the skills and knowledge...
0
by: antdb | last post by:
Ⅰ. Advantage of AntDB: hyper-convergence + streaming processing engine In the overall architecture, a new "hyper-convergence" concept was...
0
by: AndyPSV | last post by:
HOW CAN I CREATE AN AI with an .executable file that would suck all files in the folder and on my computerHOW CAN I CREATE AN AI with an .executable...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.