Matching XML Tag Contents with Regex

I'm trying to find the contents of an XML tag. Nothing fancy. I don't
care about parsing child tags or anything. I just want to get the raw
text. Here's my script:

import re

data = """
<?xml version='1.0'?>
<body>
<div class='default' >
here&apos;s some text!
</div>
<div class='default' >
here&apos;s some text!
</div>
<div class='default' >
here&apos;s some text!
</div>
</body>
"""

tagName = 'div'
pattern = re.compile('<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S \W]*[^(%(tagName)s)]*' % dict(tagName=tagName))

matches = pattern.finditer(data)
for m in matches:
    contents = data[m.start():m.end()]
    print(repr(contents))
    assert tagName not in contents

The problem I'm running into is that the [^%(tagName)s]* portion of my
regex is being ignored, so only one match is being returned, starting
at the first <div and ending at the end of the text, when it should
end at the first </div>. For this example, it should return three
matches, one for each div.

Is what I'm trying to do possible with Python's Regex library? Is
there an error in my Regex?

Thanks,
Chris
Dec 11 '07 #1
Search for '*?' on http://docs.python.org/lib/re-syntax.html.

To get around the greedy single match, you can add a question mark
after the asterisk in the 'content' portion of the markup pattern. This
causes it to take the shortest match instead of the longest, e.g.

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S \W]*?[^(%(tagName)s)]*

There's still some funkiness in the regex and logic, but this gives
you the three matches.
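
To see the greedy and non-greedy quantifiers side by side, here is a minimal sketch. It is not the pattern from the post: it assumes a simplified pattern that names the closing tag explicitly, a made-up two-element sample string, and the re.DOTALL flag so that '.' also matches newlines.

```python
import re

# Hypothetical sample input, shortened from the data in the question.
sample = "<div class='a'>one</div>\n<div class='b'>two</div>"

# Greedy: '.*' runs to the end, then backtracks to the LAST </div>.
greedy = re.findall(r"<div\s[^>]*>.*</div>", sample, re.DOTALL)

# Non-greedy: '.*?' stops at the FIRST </div> it can reach.
lazy = re.findall(r"<div\s[^>]*>(.*?)</div>", sample, re.DOTALL)

print(len(greedy))  # 1
print(lazy)         # ['one', 'two']
```

As for the funkiness: [^(div)] is a character class, so it excludes the single characters '(', 'd', 'i', 'v' and ')', not the word "div". Spelling out the closing tag, as in the sketch, avoids relying on it.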
Dec 11 '07 #2
print(re.findall(r'<%s(?=[\s/>])[^>]*>' % 'div', data))

["<div class='default' >", "<div class='default' >", "<div class='default' >"]
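
A small sketch of what the lookahead (?=[\s/>]) buys here: it requires the tag name to be followed by whitespace, '/' or '>', so a longer tag name that merely starts with "div" is skipped. The <divider> element and the sample string are made up for illustration.

```python
import re

# Hypothetical input: one real <div>, one self-closing <div/>, and a
# <divider> tag that only shares the prefix "div".
sample = "<div class='x'>a</div><divider>b</divider><div/>"

with_lookahead = re.findall(r"<%s(?=[\s/>])[^>]*>" % "div", sample)
without_lookahead = re.findall(r"<%s[^>]*>" % "div", sample)

print(with_lookahead)     # ["<div class='x'>", '<div/>']
print(without_lookahead)  # ["<div class='x'>", '<divider>', '<div/>']
```

Note that this pattern only finds the opening tags, not the text between them.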

HTH

Harvey
Dec 11 '07 #3
Thanks, that's pretty close to what I was looking for. How would I
filter out tags that don't have certain text in the contents? I'm
running into the same issue again. For instance, if I use the regex:

<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S \W]*?(targettext)+[^(%(tagName)s)]*

each match will include "targettext". However, some matches will still
include </%(tagName)s>, presumably from the tags which didn't contain
targettext.
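
One possible way to keep each match inside a single element is to forbid the closing tag from appearing in the content, using a negative lookahead before every character. This is only a sketch: the sample data and the "targettext" literal are placeholders, and it still assumes the <div> elements are not nested.

```python
import re

tag_name = "div"
target = "targettext"  # placeholder for the text being searched for

# Hypothetical sample: only the second element contains the target.
sample = (
    "<div class='a'>\nno match here\n</div>\n"
    "<div class='b'>\nhas targettext inside\n</div>\n"
)

# (?:(?!</div>).)* matches a run of characters that never starts "</div>",
# so the content cannot spill over into the next element.
inner = r"(?:(?!</%(tag)s>).)*"
pattern = re.compile(
    (r"<%(tag)s\b[^>]*>(" + inner + r"%(target)s" + inner + r")</%(tag)s>")
    % {"tag": re.escape(tag_name), "target": re.escape(target)},
    re.DOTALL,
)

contents = pattern.findall(sample)
print(contents)  # ['\nhas targettext inside\n']
```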

Dec 11 '07 #4
Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
access HTML.
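
As a rough illustration of the parser route, here is a sketch that uses the standard library's xml.etree.ElementTree rather than lxml or BeautifulSoup (a substitution made only to keep the example free of third-party installs). The second <div> and the search string are made up. Note the strip(): the sample string starts with a newline, and an XML declaration has to be the very first thing in the document.

```python
import xml.etree.ElementTree as ET

# Sample based on the question, with a hypothetical second element
# that should NOT be selected.
data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text!
</div>
<div class='other'>
nothing relevant
</div>
</body>
"""

root = ET.fromstring(data.strip())

# Keep only the <div> elements whose text contains the target string.
wanted = [
    "".join(div.itertext()).strip()
    for div in root.iter("div")
    if "some text" in "".join(div.itertext())
]

print(wanted)  # ["here's some text!"]
```

The parser also decodes &apos; into an apostrophe, which a regex over the raw text would not do.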

Diez
Dec 11 '07 #5
I was hoping a simple pattern like <tag>.*text.*</tag> wouldn't be too
complicated for regex, but now I'm starting to agree with you. Parsing
the entire XML DOM would probably be a lot easier.
Dec 11 '07 #6
That's one of the common problems with regexes and XML/HTML. They start
out fast and easy, but at some point they blow up, or fail to fulfill
the task.
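
A small, made-up example of where that happens: as soon as elements nest, a non-greedy pattern stops at the first closing tag it meets, while a parser still sees the whole element. The nested sample and the pattern are illustrative assumptions, and xml.etree.ElementTree from the standard library stands in for whichever parser is used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input with one <div> nested inside another.
nested = "<div class='outer'>a<div class='inner'>b</div>c</div>"

# The regex stops at the first </div>, cutting the outer element short.
match = re.search(r"<div\b[^>]*>(.*?)</div>", nested, re.DOTALL)
regex_contents = match.group(1)

# The parser tracks nesting, so the outer element's text is complete.
outer = ET.fromstring(nested)
parsed_text = "".join(outer.itertext())

print(regex_contents)  # a<div class='inner'>b
print(parsed_text)     # abc
```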

Diez
Dec 11 '07 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
3197
by: Xah Lee | last post by:
# -*- coding: utf-8 -*- # Python # Matching string patterns # # Sometimes you want to know if a string is of # particular pattern. Let's say in your website # you have converted all images files from gif # format to png format. Now you need to change the # html code to use the .png files. So, essentially
3
2742
by: Day Of The Eagle | last post by:
Jeff_Relf wrote: > ...yet you don't even know what RegEx is. > I'm looking at the source code for mono's Regex implementation right now. You can download that source here ( use the class libraries download ). http://www.mono-project.com/Downloads
9
3446
by: Martijn | last post by:
Hi, Which is the prevalent way of matching a filename to a mask in runtime? The best I can think of, is sscanf. Thanks for the help! <OT> It's for the Windows platform, so any functions specific to that platform are welcome too
1
1592
by: George Durzi | last post by:
Consider this excerpt from some HTML. (This is a copy from View->Source, except for the comment) <TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0> <?xml version="1.0" encoding="UTF-16"?> <!-- need to extract whatever is here --> </TABLE> I need to extract all the HTML that would be in the <!-- need to extract whatever is here -->...
7
3265
by: Kevin CH | last post by:
Hi, I'm currently running into a confusion on regex and hopefully you guys can clear it up for me. Suppose I have a regular expression (0|(1(01*0)*1))* and two test strings: 110_1011101_ and _101101_1. (The underscores are not part of the string. They are added to show that both string has a substring that matches the pattern.) ...
8
2231
by: Xah Lee | last post by:
the Python regex documentation is available at: http://xahlee.org/perl-python/python_re-write/lib/module-re.html Note that, i've just made the terms of use clear. Also, can anyone answer what is the precise terms of license of the official python documentation? The official python.org doc site is not clear. Note also, that the regex...
0
1537
by: Tidane | last post by:
Visual Basic.NET Framework 2.0 I've created a program to parse out text as the program recieved it and use Regex matching to decide what should be done. My problem is that the text is matching when it shouldn't be, if that makes any sense. If Regex.IsMatch(Text, "You find (a|an)" & MoneyMatch) Then Other code here that doesn't matter....
11
4816
by: tech | last post by:
Hi, I need a function to specify a match pattern including using wildcard characters as below to find chars in a std::string. The match pattern can contain the wildcard characters "*" and "?", where "*" matches zero or more consecutive occurrences of any character and "?" matches a single occurrence of any character. Does boost or some...
1
3424
by: Joe Strout | last post by:
Wow, this was harder than I thought (at least for a rusty Pythoneer like myself). Here's my stab at an implementation. Remember, the goal is to add a "match" method to Template which works like Template.substitute, but in reverse: given a string, if that string matches the template, then it should return a dictionary mapping each template...
0
8032
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
1
7796
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...
0
8074
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
1
5601
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
3734
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
0
3739
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
2223
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1310
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
1044
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.