<stevebread@yahoo.com> wrote in message
news:1156153916.849933.178790@75g2000cwc.googlegroups.com...
Quote:
Hi, I am having some difficulty trying to create a regular expression.
>
Consider:
>
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
>
Whenever a tag1 is followed by a tag2, I want to retrieve the values
of the tag1:name and tag2:value attributes. So my end result here
should be
john, tall
jack, short
>
A pyparsing solution may not be a speed demon at run time, but it doesn't
take long to write. Some short explanatory comments:
- makeHTMLTags returns a tuple of opening and closing tag expressions, but
this example does not use any closing tags, so it is simpler to discard them
(use only the zeroth return value)
- Your example includes not only <tag1> and <tag2> tags, but also a <br>
tag, which is presumably ignorable.
- The value returned from calling the searchString generator includes named
fields for the different tag attributes, making it easy to access the name
and value tag attributes.
- The expression generated by makeHTMLTags will also handle tags with other
surprising attributes that we didn't anticipate (such as "<br clear='all'/>"
or "<tag2 value='adj__short__' modifier='adv__very__'/>")
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some
simple string slicing gets us the data we want
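A minimal sketch of that slicing step, in case the markers ever vary. The helper name strip_markers and its default prefix and suffix are made up for illustration and are not part of pyparsing; it only removes the markers when both are actually present, whereas a bare [5:-2] slice assumes they always are.

```python
def strip_markers(value, prefix="adj__", suffix="__"):
    """Return value without prefix and suffix, or unchanged if either is missing."""
    has_both = value.startswith(prefix) and value.endswith(suffix)
    # Guard against strings too short to hold both markers, e.g. "adj__"
    if has_both and len(value) >= len(prefix) + len(suffix):
        return value[len(prefix):len(value) - len(suffix)]
    return value


print(strip_markers("adj__tall__"))
print(strip_markers("adv__very__", prefix="adv__"))
```

For the fixed "adj__...__" values in this example, value[5:-2] gives the same result with less ceremony.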
The pyparsing home page is at
http://pyparsing.wikispaces.com.
-- Paul
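from pyparsing import makeHTMLTags
# sample input from the original question
data = """<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>"""
# makeHTMLTags returns (openTag, closeTag); keep only the opening tag
tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]
# define the pattern we're looking for, in terms of tag1 and tag2
# and specify that we wish to ignore <br> tags
patt = tag1 + tag2
patt.ignore(br)
for tokens in patt.searchString(data):
    # [5:-2] strips the leading "adj__" and the trailing "__"
    print("%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2]))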
Prints:
john, tall
jack, short
Printing tokens.dump() gives:
['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
- empty: True
- name: jack
- startTag1: ['tag1', ['name', 'jack'], True]
  - empty: True
  - name: jack
- startTag2: ['tag2', ['value', 'adj__short__'], True]
  - empty: True
  - value: adj__short__
- value: adj__short__
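For comparison, since the original question asked for a regular expression, here is a rough standard-library sketch. It is not the pyparsing approach above, and it is more brittle: it assumes double-quoted attributes, exactly one attribute per tag, and only whitespace or <br> tags between tag1 and tag2. The names data and TAG_PAIR are illustrative.

```python
import re

# Sample input from the original question
data = '''<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>'''

TAG_PAIR = re.compile(
    r'<tag1\s+name="([^"]*)"\s*/>'            # tag1, capturing its name attribute
    r'(?:\s|<br\s*/?>)*'                      # skip whitespace and <br> tags
    r'<tag2\s+value="adj__([^"]*?)__"\s*/>'   # tag2, capturing the text between the markers
)

pairs = TAG_PAIR.findall(data)
for name, value in pairs:
    print("%s, %s" % (name, value))
```

Unlike the makeHTMLTags expressions, this pattern stops matching as soon as a tag gains an extra attribute such as modifier='adv__very__'.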