
Problem with regex: how to exclude more than one character

I never know how to express "exclude an entire word" with a regexp.
While using Python, I ran into this problem again...

For example, if I want to match "string" in "test a string",
re.findall(r"[^a]* (\w+)","test a string") will find it, but what if
the word is not "a" but "an" ("test an string")? [^an] will fail,
because it is a character class and will stop at the first "a" (or "n")
character rather than at the word "an".
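
A minimal sketch of the difference, assuming Python 3 and only the standard `re` module (the variable names are made up for illustration). A character class always describes a single character, while a negative lookahead, `(?!...)`, can reject a whole word:

```python
import re

text = "test an string"

# [^an] means "one character that is neither 'a' nor 'n'",
# not "anything except the word 'an'".
print(re.findall(r"[^an]+", text))

# A negative lookahead rejects the whole word "an" and keeps the others.
words_except_an = re.findall(r"\b(?!an\b)\w+", text)
print(words_except_an)

# If the goal is just the word after "an", it can be matched directly.
after_an = re.findall(r"\ban (\w+)", text)
print(after_an)
```

The `\b` anchors keep the lookahead from rejecting longer words that merely start with "an".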

I guess people don't usually filter words this way?
Here is the real problem I ran into:
I want to extract the text in each "<td>" block as well as the
"<span>" element's title attribute.
###################### code #############################
import re

# One table row. Adjacent string literals are joined by Python,
# so no stray line breaks end up inside the HTML.
content = (
    '<tr align="center" valign="middle" class="CellCss">'
    '<td valign="middle">LA</td><td valign="middle">11/10/2008</td>'
    '<td valign="middle">1340/1430</td><td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td><td title="Charisma" '
    'valign="middle">Charisma</td><td valign="middle">Booked</td>'
    '<td valign="middle">'
)

# Four plain <td> cells, then the title attribute of the first <span>.
pattern = (
    r'<td valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td>'
    r'<td valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td>'
    r'<td valign="middle"><span title="([^"]*)"'
)
print(re.findall(pattern, content))

#################### code end ############################
As you can see above,
the result is "LA, 11/10/2008, 1340/1430, PF1/5, Understanding
the stock market".
There are two "title" attributes, but with this regexp I can only get
the first one, on the "<span>".
For the second, which should be "Charisma", I need something like
[^</td>]* to skip over "class="MouseCursor">Understand....</span></td>",
and then I can go on to match the second "title".
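
One possible way to skip ahead, sketched under the assumption that the two titles always appear in this order. `[^</td>]*` is a character class (it excludes the single characters `<`, `/`, `t`, `d` and `>`), so this sketch uses the non-greedy `.*?` instead. The row is shortened here for brevity:

```python
import re

# Shortened copy of the row from the post (only the last cells are kept).
content = (
    '<td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
)

# .*? matches as little as possible, so it stops at the next 'title="'.
pattern = r'<span title="([^"]*)".*?title="([^"]*)"'
both_titles = re.findall(pattern, content, re.DOTALL)
print(both_titles)
```

`re.DOTALL` only matters if the real HTML contains line breaks between the two attributes.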

Maybe I didn't describe this clearly; if so, feel free to tell me :)
Thanks for any replies!
Nov 7 '08 #1
3 Replies


And by the way, I've tried both (!</td>) and (?:!</td>), and many other
ways; none of them work... so sad...
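
For comparison, a hedged sketch of what those two attempts do and what the lookahead spelling looks like. In Python's `re`, `(!</td>)` and `(?:!</td>)` are ordinary groups that match the literal text `!</td>`; the negative lookahead is written `(?!</td>)`. The sample string is a shortened copy of the row and the variable names are illustrative:

```python
import re

html = (
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
)

# An ordinary group: looks for the literal characters "!</td>".
literal_hit = re.search(r'(!</td>)', html)

# A negative lookahead paired with "." consumes one character at a time,
# as long as "</td>" does not start at that position.
not_td_end = r'(?:(?!</td>).)*'
second_title = re.search(
    r'<span title="[^"]*"' + not_td_end + r'</td><td title="([^"]*)"', html
).group(1)
print(literal_hit, second_title)
```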
Nov 7 '08 #2

On Thu, Nov 6, 2008 at 11:06 PM, <te*******@gmail.com> wrote:
Here is the real problem I ran into:
I want to extract the text in each "<td>" block as well as the
"<span>" element's title attribute.

Is there any particularly good reason why you're using regexps for
this rather than, say, an actual (X)HTML parser?
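
A rough sketch of what the parser route could look like, using only `html.parser` from the Python 3 standard library. The `CellCollector` class is a hypothetical helper written for this illustration, not something taken from the thread, and the trailing unclosed `<td>` of the sample row is left out:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: gathers <td> text and every title attribute."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")
        title = dict(attrs).get("title")
        if title is not None:
            self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Text inside nested tags such as <span> still belongs to the cell.
        if self.in_td:
            self.cells[-1] += data


content = (
    '<tr align="center" valign="middle" class="CellCss">'
    '<td valign="middle">LA</td><td valign="middle">11/10/2008</td>'
    '<td valign="middle">1340/1430</td><td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td><td title="Charisma" '
    'valign="middle">Charisma</td><td valign="middle">Booked</td>'
)

collector = CellCollector()
collector.feed(content)
collector.close()
print(collector.cells)
print(collector.titles)
```

This is longer than a single regexp, but it does not depend on the exact attribute order or spacing of the tags.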

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
Nov 7 '08 #3

On Nov 7, 3:13 pm, "Chris Rebert" <c...@rebertia.com> wrote:

Is there any particularly good reason why you're using regexps for
this rather than, say, an actual (X)HTML parser?

Thanks a lot for the quick reply, Chris!
Actually I tried BeautifulSoup and it's great.
But I'm not very familiar with it, and it needs more code to parse the
HTML and get the right text.
I think a regexp is more convenient if there is a way to extract the
list in just one line :)
I got this far with the regexp but am stuck here...
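
If a single `re.findall` call is the goal, one possible sketch is to match one cell at a time with optional groups, rather than chaining the whole row into one pattern. This assumes the title, when present, is the first attribute of the `<td>` or `<span>`, as in the sample; the row is shortened and the names are illustrative:

```python
import re

# Shortened copy of the row from the post.
content = (
    '<td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
    '<td valign="middle">Booked</td>'
)

# Per cell: optional title on the <td>, optional <span title="...">, then text.
cell = r'<td(?: title="([^"]*)")?[^>]*>(?:<span title="([^"]*)"[^>]*>)?([^<]*)'
rows = re.findall(cell, content)

# Prefer a title when there is one, otherwise fall back to the cell text.
values = [td_title or span_title or text for td_title, span_title, text in rows]
print(values)
```

Groups that do not take part in a match come back from `findall` as empty strings, which is what makes the `or` chain work.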
Nov 7 '08 #4
