Regular Expression question

stevebread@yahoo.com
#1: Aug 21 '06
Hi, I am having some difficulty trying to create a regular expression.

Consider:

<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>

Whenever a tag1 is followed by a tag2, I want to retrieve the values
of the tag1:name and tag2:value attributes. So my end result here
should be
john, tall
jack, short

My low quality regexp
re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__',
re.DOTALL)

cannot handle the case where there is a tag1 that is not followed by a
tag2. findall returns
john, tall
joe, short

Ideas?

Thanks.


Rob Wolfe
#2: Aug 21 '06
stevebread@yahoo.com wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
> <tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
> <tag1 name="joe"/>
> <tag1 name="jack"/>
> <tag2 value="adj__short__"/>
>
> Whenever a tag1 is followed by a tag2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short
>
> My low quality regexp
> re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__',
> re.DOTALL)
>
> cannot handle the case where there is a tag1 that is not followed by a
> tag2. findall returns
> john, tall
> joe, short
>
> Ideas?

Have you tried this:

'tag1.+?name="(.+?)".*?(?=tag2).*?="adj__(.*?)__'

?

HTH,
Rob

stevebread@yahoo.com
#3: Aug 21 '06
Thanks, I just tried it but I got the same result.

I've been thinking about it for a few hours now, and the problem with
this approach is that the .*? before the (?=tag2) may have matched a
tag1, and I don't know how to detect it.

And even if I could, how would I make the search reset its start
position to the second tag1 it found?
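
One way to detect that case, sketched here for Python 3 and not taken from any reply in this thread: replace the free-running .*? with a "tempered" dot, (?:(?!<tag1\b).)*?, which refuses to step over the start of another tag1. When a tag1 has no adj__ value before the next tag1, that attempt simply fails and the scan resumes at the next tag1, so no manual reset of the start position is needed. The names DATA, LAZY and TEMPERED are illustrative, and the sample assumes the closing > characters are present.

```python
import re

# Sample markup from the first post (assumed well-formed).
DATA = '''<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>'''

# Lazy version: ".*?" is free to run past a second tag1.
LAZY = re.compile(r'tag1.+?name="(.+?)".*?="adj__(.*?)__', re.DOTALL)

# Tempered version: each "." is consumed only if "<tag1" does not start
# at that position, so one match can never span two tag1 elements.
TEMPERED = re.compile(
    r'<tag1\b[^>]*?\bname="([^"]*)"'   # a tag1 and its name attribute
    r'(?:(?!<tag1\b).)*?'              # anything except the start of a tag1
    r'="adj__(.*?)__"',                # the first adj__ value after it
    re.DOTALL,
)

lazy_pairs = LAZY.findall(DATA)
tempered_pairs = TEMPERED.findall(DATA)
print(lazy_pairs)      # pairs joe with short, as described above
print(tempered_pairs)  # skips joe, because another tag1 comes first
```

The tempered pattern does not care which tag or attribute carries the adj__ value; it only insists that no other tag1 starts in between.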

bearophileHUGS@lycos.com
#4: Aug 21 '06
I am not an RE expert yet, but this is my first possible solution:

import re

txt = """
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>"""

tfinder = r"""<     # The opening < of the tag to find
    \s*             # Possible space or newline
    (tag[12])       # First subgroup, the identifier, tag1 or tag2
    \s+             # There must be one or more spaces or newlines
    (?:name|value)  # Name or value, non-grouping
    \s*             # Possible space or newline
    =               # The =
    \s*             # Possible space or newline
    "               # Opening "
    ([^"]*)         # Second subgroup, the tag string, it can't contain "
    "               # Closing " of the string
    \s*             # Possible space or newline
    /?              # One optional ending /
    \s*             # Possible space or newline
    >?              # The closing > of the tag; greedy ?, matches it when present
    """
patt = re.compile(tfinder, flags=re.I | re.X)

prec_type = ""
prec_string = ""
for mobj in patt.finditer(txt):
    curr_type, curr_string = mobj.groups()
    if curr_type == "tag2" and prec_type == "tag1":
        print prec_string, curr_string.replace("adj__", "").strip("_")
    prec_type = curr_type
    prec_string = curr_string

Bye,
bearophile

Rob Wolfe
#5: Aug 21 '06
stevebread@yahoo.com wrote:
> Thanks, I just tried it but I got the same result.
>
> I've been thinking about it for a few hours now, and the problem with
> this approach is that the .*? before the (?=tag2) may have matched a
> tag1, and I don't know how to detect it.

Maybe like this:
'tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__'

HTH,
Rob

stevebread@yahoo.com
#6: Aug 21 '06
got zero results on this one :)

Fredrik Lundh
#7: Aug 21 '06
stevebread@yahoo.com wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
> <tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
> <tag1 name="joe"/>
> <tag1 name="jack"/>
> <tag2 value="adj__short__"/>
>
> Whenever a tag1 is followed by a tag2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short

import re

data = """
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

elems = re.findall("<(tag1|tag2)\s+(\w+)=\"([^\"]*)\"/>", data)

for i in range(len(elems)-1):
    if elems[i][0] == "tag1" and elems[i+1][0] == "tag2":
        print elems[i][2], elems[i+1][2]

</F>

Paddy
#8: Aug 21 '06
stevebread@yahoo.com wrote:
> Hi, I am having some difficulty trying to create a regular expression.

Steve,
I find this tool is great for debugging regular expressions.
http://kodos.sourceforge.net/

Just put some sample text in one window, your trial RE in another, and
Kodos displays a wealth of information on what matches.

Try it.

- Paddy.

Neil Cerutti
#9: Aug 21 '06
On 2006-08-21, stevebread@yahoo.com <stevebread@yahoo.com> wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
> <tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
> <tag1 name="joe"/>
> <tag1 name="jack"/>
> <tag2 value="adj__short__"/>
>
> Whenever a tag1 is followed by a tag2, I want to retrieve the
> values of the tag1:name and tag2:value attributes. So my end
> result here should be
>
> john, tall
> jack, short
>
> Ideas?

It seems to me that an HTML parser might be a better solution.

Here's a slapped-together example. It uses a simple state
machine.

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.state = "get name"
        self.name_attrs = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if self.state == "get name":
            if tag == "tag1":
                self.name_attrs = attrs
                self.state = "found name"
        elif self.state == "found name":
            if tag == "tag2":
                name = None
                for attr in self.name_attrs:
                    if attr[0] == "name":
                        name = attr[1]
                adj = None
                for attr in attrs:
                    if attr[0] == "value" and attr[1][:3] == "adj":
                        adj = attr[1][5:-2]
                if name == None or adj == None:
                    print "Markup error: expected attributes missing."
                else:
                    self.result[name] = adj
                self.state = "get name"
            elif tag == "tag1":
                # A new tag1 overrides the old one
                self.name_attrs = attrs

p = MyHTMLParser()
p.feed("""
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
""")
print repr(p.result)
p.close()

There's probably a better way to search for attributes in attr
than "for attr in attrs", but I didn't think of it, and the
example I found on the net used the same idiom. The format of
attrs seems strange. Why isn't it a dictionary?
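
On that last point: attrs arrives as a list of (name, value) tuples, so dict(attrs) turns it into a lookup table directly (if an attribute is repeated, the last value wins). Below is a minimal sketch of the same state machine using that shortcut. It assumes Python 3, where the module is html.parser, and the class name PairParser and its attributes are made up for illustration. It collects pairs in a list instead of a dictionary so that duplicate names would not overwrite each other.

```python
from html.parser import HTMLParser

DATA = '''<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>'''

class PairParser(HTMLParser):
    """Pair each tag2 value with the closest preceding tag1 name."""

    def __init__(self):
        super().__init__()
        self.pending_name = None   # name of the most recent unpaired tag1
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)        # list of (name, value) tuples -> dict
        if tag == "tag1":
            # A new tag1 overrides the old one.
            self.pending_name = attrs.get("name")
        elif tag == "tag2" and self.pending_name is not None:
            value = attrs.get("value") or ""
            if value.startswith("adj__"):
                # Strip the leading "adj__" and the trailing "__".
                self.pairs.append((self.pending_name, value[5:-2]))
            self.pending_name = None

parser = PairParser()
parser.feed(DATA)
parser.close()
print(parser.pairs)
```

Self-closing tags such as <tag1 .../> reach handle_starttag through the default handle_startendtag, so no extra handler is needed here.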

--
Neil Cerutti
Sermon Outline: I. Delineate your fear II. Disown your fear III.
Displace your rear --Church Bulletin Blooper

Paul McGuire
#10: Aug 21 '06
<stevebread@yahoo.com> wrote in message
news:1156153916.849933.178790@75g2000cwc.googlegroups.com...
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
> <tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
> <tag1 name="joe"/>
> <tag1 name="jack"/>
> <tag2 value="adj__short__"/>
>
> Whenever a tag1 is followed by a tag2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short

A pyparsing solution may not be a speed demon to run, but it doesn't take
too long to write. Some short explanatory comments:
- makeHTMLTags returns a tuple of opening and closing tags, but this example
does not use any closing tags, so it is simpler to just discard them (only
use the zeroth return value).
- Your example includes not only <tag1> and <tag2> tags, but also a <br>
tag, which is presumably ignorable.
- The value returned from calling the searchString generator includes named
fields for the different tag attributes, making it easy to access the name
and value tag attributes.
- The expression generated by makeHTMLTags will also handle tags with other
surprising attributes that we didn't anticipate (such as "<br clear='all'/>"
or "<tag2 value='adj__short__' modifier='adv__very__'/>").
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some
simple string slicing gets us the data we want.

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul


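from pyparsing import makeHTMLTags

# sample data (not defined in the original post; assumed to be the same
# string as in Fredrik Lundh's example)
data = """
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]

# define the pattern we're looking for, in terms of tag1 and tag2
# and specify that we wish to ignore <br> tags
patt = tag1 + tag2
patt.ignore(br)

for tokens in patt.searchString(data):
    print "%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2])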


Prints:
john, tall
jack, short


Printing tokens.dump() gives:
['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
- empty: True
- name: jack
- startTag1: ['tag1', ['name', 'jack'], True]
  - empty: True
  - name: jack
- startTag2: ['tag2', ['value', 'adj__short__'], True]
  - empty: True
  - value: adj__short__
- value: adj__short__


Rob Wolfe
#11: Aug 21 '06
stevebread@yahoo.com wrote:
> got zero results on this one :)

Really?

>>> import re
>>> s = '''<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
... <tag1 name="joe"/>
... <tag1 name="jack"/>
... <tag2 value="adj__short__"/>'''
>>> pat = re.compile('tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__', re.DOTALL)
>>> m = re.findall(pat, s)
>>> m
[('john', 'tall'), ('joe', 'short')]


Regards,
Rob

stevebread@yahoo.com
#12: Aug 21 '06
Hi, thanks everyone for the information! Still going through it :)

The reason I did not match on tag2 in my original expression (and I
apologize because I should have mentioned this before) is that other
tags could also have an attribute with the value of "adj__" and the
attribute name may not be the same for the other tags. The only thing I
can be sure of is that the value will begin with "adj__".

I need to match the "adj__" value with the closest preceding tag1
irrespective of what tag the "adj__" is in, or what the attribute
holding it is called, or the order of the attributes (there may be
others). This data will be inside an html page and so there will be
plenty of html tags in the middle all of which I need to ignore.

Thanks very much!
Steve
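
A rough sketch of that requirement, assuming Python 3 and nothing beyond the description above: scan once for either a tag1 name or any attribute value that starts with "adj__", remember the most recent name, and emit a pair when a value turns up. Clearing the name after a pair is an assumption (so one tag1 pairs with at most one value), as are the names SCAN and pair_names_with_adjectives and the span, p, class, id and other markup in the sample.

```python
import re

# One pass over the page: either a tag1 with a name attribute,
# or any attribute (in any tag) whose value starts with adj__.
SCAN = re.compile(
    r'<tag1\b[^>]*?\bname="([^"]*)"'   # group 1: the name of a tag1
    r'|=\s*"adj__(.*?)__"',            # group 2: an adj__ value
    re.DOTALL,
)

def pair_names_with_adjectives(html):
    """Pair each adj__ value with the closest preceding tag1 name."""
    pairs = []
    pending_name = None
    for match in SCAN.finditer(html):
        name, adjective = match.groups()
        if name is not None:
            pending_name = name        # a newer tag1 replaces the old one
        elif pending_name is not None:
            pairs.append((pending_name, adjective))
            pending_name = None        # assumption: one value per tag1
    return pairs

# Hypothetical sample: other tags, other attribute names, other orderings.
sample = '''<tag1 name="john"/> <br/> <span class="adj__tall__">x</span>
<tag1 name="joe"/>
<p>unrelated</p>
<tag1 id="7" name="jack"/>
<tag2 other="1" value="adj__short__"/>'''

pairs = pair_names_with_adjectives(sample)
print(pairs)
```

Keeping the pairing logic in the loop instead of in the pattern makes it easy to change the rule later, for example to allow several values per tag1 by not clearing pending_name.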

Anthra Norell
#13: Aug 22 '06
Steve,
I thought Fredrik Lundh's proposal was perfect. Are you now saying it
doesn't solve your problem because your description of the problem was
incomplete? If so, could you post a worst-case piece of HTML, one that
contains all possible complications, or a collection of different cases,
all of which you need to handle?

Frederic

----- Original Message -----
From: <stevebread@yahoo.com>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Monday, August 21, 2006 11:35 PM
Subject: Re: Regular Expression question

> Hi, thanks everyone for the information! Still going through it :)
>
> The reason I did not match on tag2 in my original expression (and I
> apologize because I should have mentioned this before) is that other
> tags could also have an attribute with the value of "adj__" and the
> attribute name may not be the same for the other tags. The only thing I
> can be sure of is that the value will begin with "adj__".
>
> I need to match the "adj__" value with the closest preceding tag1
> irrespective of what tag the "adj__" is in, or what the attribute
> holding it is called, or the order of the attributes (there may be
> others). This data will be inside an html page and so there will be
> plenty of html tags in the middle all of which I need to ignore.
>
> Thanks very much!
> Steve