Matching XML Tag Contents with Regex

I'm trying to find the contents of an XML tag. Nothing fancy. I don't
care about parsing child tags or anything. I just want to get the raw
text. Here's my script:

import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text!
</div>
<div class='default'>
here&apos;s some text!
</div>
<div class='default'>
here&apos;s some text!
</div>
</body>
"""

tagName = 'div'
pattern = re.compile(r'<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*' % dict(tagName=tagName))

matches = pattern.finditer(data)
for m in matches:
    contents = data[m.start():m.end()]
    print(repr(contents))
    assert tagName not in contents

The problem I'm running into is that the [^(%(tagName)s)]* portion of my
regex is being ignored, so only one match is being returned, starting
at the first <div> and ending at the end of the text, when it should
end at the first </div>. For this example, it should return three
matches, one for each div.

Is what I'm trying to do possible with Python's Regex library? Is
there an error in my Regex?

Thanks,
Chris
Dec 11 '07 #1
Search for '*?' on http://docs.python.org/lib/re-syntax.html.

To get around the greedy single match, you can add a question mark
after the asterisk in the 'content' portion of the markup. This
causes it to take the shortest match instead of the longest, e.g.

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*

There's still some funkiness in the regex and logic, but this gives
you the three matches.
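
A minimal sketch of the greedy versus lazy difference, using a simplified pattern rather than the exact one above (the sample data, the explicit </div> terminator, and the variable names are assumptions for illustration). It also shows part of the "funkiness": after substitution, [^(div)] is a character class that excludes the single characters (, d, i, v and ), not the string "div".

```python
import re

data = """
<body>
<div class='default'>
first
</div>
<div class='default'>
second
</div>
</body>
"""

# Greedy: [\s\S]* runs to the end of the input and only backs off as far
# as the last </div>, so both divs end up in a single match.
greedy = re.findall(r"<div\s[^>]*>[\s\S]*</div>", data)

# Lazy: [\s\S]*? stops at the first </div> it can reach, one match per div.
lazy = re.findall(r"<div\s[^>]*>([\s\S]*?)</div>", data)
lazy_texts = [text.strip() for text in lazy]

print(len(greedy))
print(lazy_texts)

# [^(div)] is a character class, so it stops at the first 'v' in "save".
class_match = re.match(r"[^(div)]*", "save").group()
print(repr(class_match))
```

Ending the pattern with a literal </div> is usually easier to reason about than a negated character class, because it states exactly where the match has to stop.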
Dec 11 '07 #2
print(re.findall(r'<%s(?=[\s/>])[^>]*>' % 'div', data))

["<div class='default'>", "<div class='default'>", "<div class='default'>"]
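
To illustrate what the lookahead in that pattern is for, here is a small sketch. The sample string and the names loose and strict are made up for the example. The (?=[\s/>]) part checks that the tag name really ends after "div" without consuming the next character.

```python
import re

sample = "<div class='a'>x</div><division>y</division><div>z</div><div/>"

# Without the lookahead, the prefix "<div" also matches "<division>".
loose = re.findall(r"<div[^>]*>", sample)

# (?=[\s/>]) asserts that whitespace, "/" or ">" follows the tag name.
strict = re.findall(r"<div(?=[\s/>])[^>]*>", sample)

print(loose)
print(strict)
```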

HTH

Harvey
Dec 11 '07 #3
Thanks, that's pretty close to what I was looking for. How would I
filter out tags that don't have certain text in the contents? I'm
running into the same issue again. For instance, if I use the regex:

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*

each match will include "targettext". However, some matches will still
include </%(tagName)s>, presumably from the tags which didn't contain
targettext.
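
The behaviour described here can be reproduced with a reduced example. This is only a sketch: the sample data, the simplified patterns, and the names leaky, body and hits are assumptions, and the negative-lookahead workaround is one possible approach rather than a general solution.

```python
import re

data = """
<div class='default'>
nothing interesting
</div>
<div class='default'>
has targettext inside
</div>
"""

# Lazy matching still starts at the first <div ...> and stretches across
# its </div> until it finally reaches "targettext" in the second div.
leaky = re.search(r"<div\s[^>]*>[\s\S]*?targettext", data).group()
print("</div>" in leaky)

# One workaround: forbid </div> at every step with a negative lookahead,
# so the body can never cross a closing tag.
body = r"(?:(?!</div>)[\s\S])*?"
tempered = re.compile(r"<div\s[^>]*>(" + body + "targettext" + body + r")</div>")
hits = [m.strip() for m in tempered.findall(data)]
print(hits)
```

Matching every div first and then filtering the results with a plain `in` test in Python would probably be simpler still, at the cost of a second pass.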

Dec 11 '07 #4
Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
access HTML.
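
A rough sketch of the parser route. Note the substitution: it uses the standard library's xml.etree.ElementTree instead of lxml or BeautifulSoup, purely so it runs without extra packages (lxml's etree module is designed to be largely compatible with it). The sample data is adapted from the original post, and "targettext" and the variable names are placeholders.

```python
import xml.etree.ElementTree as ET

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text!
</div>
<div class='default'>
here&apos;s the targettext!
</div>
</body>
"""

# The XML declaration must be the very first thing in the document,
# so drop the leading newline before parsing.
root = ET.fromstring(data.strip())

# itertext() yields the element's text plus that of any descendants.
texts = ["".join(div.itertext()).strip() for div in root.iter("div")]
wanted = [t for t in texts if "targettext" in t]

print(texts)
print(wanted)
```

The parser also decodes entities such as &apos; along the way, which the regex approach leaves untouched.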

Diez
Dec 11 '07 #5
I was hoping a simple pattern like <tag>.*text.*</tag> wouldn't be too
complicated for Regex, but now I'm starting to agree with you. Parsing
the entire XML DOM would probably be a lot easier.
Dec 11 '07 #6
That's one of the common problems with regexes and XML/HTML. They start
out fast and easy, but at some point they blow up - or fail to fulfill
the task.
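
One concrete way this can go wrong is nesting. The sketch below uses a made-up nested document and the standard library's xml.etree.ElementTree as a stand-in parser, so treat the names and data as assumptions.

```python
import re
import xml.etree.ElementTree as ET

nested = "<body><div class='outer'>a<div class='inner'>b</div>c</div></body>"

# The lazy pattern stops at the first </div>, which belongs to the inner div,
# so the outer element's trailing text "c" is lost.
regex_result = re.findall(r"<div\s[^>]*>([\s\S]*?)</div>", nested)
print(regex_result)

# A parser tracks nesting, so each div keeps its full text.
root = ET.fromstring(nested)
parser_result = ["".join(div.itertext()) for div in root.iter("div")]
print(parser_result)
```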

Diez
Dec 11 '07 #7
