Problem with regex: how to exclude more than one character

I never know how to express "exclude an entire word" with a regexp,
and while using Python I ran into this problem again.

For example, if I want to match "string" in "test a string", I can use
re.findall(r"[^a]* (\w+)","test a string"). But what if the word is
"an" rather than "a" ("test an string")? [^an] fails, because it
excludes the single characters "a" and "n" rather than the word "an",
so it still stops at the first "a".
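
A minimal sketch of the difference, assuming Python's standard re module. The "banana split" string and the variable names are made up for illustration; the other two strings are the ones from the question. A character class only ever lists single characters, so a whole word has to be spelled out literally or rejected with a negative lookahead:

```python
import re

# [^an] means "any one character that is neither 'a' nor 'n'",
# not "anything except the word 'an'".
runs = re.findall(r"[^an]+", "banana split")

# Spell the word out instead; "n?" makes the "n" optional,
# so the same pattern covers both "a" and "an".
after_article = re.compile(r"\ban? (\w+)")
word_after_a = after_article.findall("test a string")
word_after_an = after_article.findall("test an string")

# To reject a whole word, use a negative lookahead: (?!an\b) succeeds
# only where the word "an" does not start.
not_an = re.compile(r"\b(?!an\b)\w+")
other_words = not_an.findall("test an string")

print(runs, word_after_a, word_after_an, other_words)
```

The lookahead consumes no characters, which is why it can test for a multi-character word where a character class cannot.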

I guess people don't usually filter words this way?
Here is the real problem I ran into:
I want to extract both the text in the "<td>" blocks and the title
attribute of the "<span>".
###################### code #############################
import re

# Adjacent string literals are concatenated, so the row stays on one logical
# line and no stray newlines end up inside the tags.
content = (
    '<tr align="center" valign="middle" class="CellCss">'
    '<td valign="middle">LA</td><td valign="middle">11/10/2008</td>'
    '<td valign="middle">1340/1430</td><td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
    '<td valign="middle">Booked</td><td valign="middle">'
)

# Four plain cells, then the title attribute of the <span> in the fifth cell.
pattern = (
    r'<td valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td>'
    r'<td valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td>'
    r'<td valign="middle"><span title="([^"]*)"'
)

print(re.findall(pattern, content))
#################### code end ############################
As you can see above, the result I get is
"LA, 11/10/2008, 1340/1430, PF1/5, Understanding the stock market".
There are two title attributes (one on the "<span>", one on the "<td>"
that follows it), but with this regexp I can only get the first one.
For the second, which should be "Charisma", I need something like
[^</td>]* to skip over "class="MouseCursor">Understand....</span></td>",
so that I can go on to match the second title.

Maybe I didn't describe this clearly; if so, feel free to tell me :)
Thanks for any replies!
Nov 7 '08 #1
On Nov 7, 3:06 pm, tecspr...@gmail.com wrote:
By the way, I've tried both (!</td>) and (?:!</td>), along with many
other variations, and none of them work... so sad...
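
For reference, the syntax for "not followed by" in Python's re module is the negative lookahead (?!...), with the question mark before the exclamation mark. [^</td>] is a character class that excludes the individual characters "<", "/", "t", "d" and ">". A minimal sketch, assuming a shortened single-line copy of the row from the question (the names row and NOT_TD_CLOSE are only for illustration):

```python
import re

# Shortened, single-line stand-in for the row from the question.
row = (
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
)

# (?!</td>) succeeds only where "</td>" does NOT start. Pairing it with "."
# and repeating the pair consumes one character at a time and stops just
# before the next "</td>".
NOT_TD_CLOSE = r'(?:(?!</td>).)*'

pattern = (
    r'<span title="([^"]*)"' + NOT_TD_CLOSE + r'</td>'
    r'<td title="([^"]*)"'
)

titles = re.findall(pattern, row)
print(titles)
```

Note that "." does not match a newline unless re.DOTALL is passed, so this only works as written when the row is on a single line.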
Nov 7 '08 #2
On Thu, Nov 6, 2008 at 11:06 PM, <te*******@gmail.com> wrote:
Here is the real problem I ran into:
I want to extract both the text in the "<td>" blocks and the title
attribute of the "<span>".
Is there any particularly good reason why you're using regexps for
this rather than, say, an actual (X)HTML parser?

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
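
A minimal sketch of what the parser route could look like, assuming Python 3's standard html.parser module (not a library named in this thread). The class name CellCollector and the rule "use the span title if there is one, else the cell text" are illustrative assumptions, and content is the row from the question joined onto one line:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """For each closed <td>, keep the <span> title if present, else the text."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._text = []
        self._title = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Start a fresh cell.
            self._in_td, self._text, self._title = True, [], None
        elif tag == "span" and self._in_td:
            self._title = dict(attrs).get("title")

    def handle_data(self, data):
        if self._in_td:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            text = "".join(self._text).strip()
            self.cells.append(self._title or text)
            self._in_td = False


content = (
    '<tr align="center" valign="middle" class="CellCss">'
    '<td valign="middle">LA</td><td valign="middle">11/10/2008</td>'
    '<td valign="middle">1340/1430</td><td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
    '<td valign="middle">Booked</td><td valign="middle">'
)

parser = CellCollector()
parser.feed(content)
cells = parser.cells
print(cells)
```

The trailing "<td>" in the sample is never closed, so this sketch simply does not record it.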
Nov 7 '08 #3
On Nov 7, 3:13 pm, "Chris Rebert" <c...@rebertia.com> wrote:
Is there any particularly good reason why you're using regexps for
this rather than, say, an actual (X)HTML parser?

Thanks a lot for the quick reply, Chris!
I did try BeautifulSoup, and it's great.
But I'm not very familiar with it, and it takes more code to parse the
HTML and get the right text.
I think a regexp is more convenient if there is a way to pull out the
whole list in a single line :)
I got this far on my own but am stuck here...
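
One possible single-pattern approach, offered as a sketch rather than a robust solution: use alternation so that each match is either a title="..." attribute or the text of a "<td>" cell. It assumes the row is on one line and looks exactly like the sample. The names pairs and values are only for illustration:

```python
import re

content = (
    '<tr align="center" valign="middle" class="CellCss">'
    '<td valign="middle">LA</td><td valign="middle">11/10/2008</td>'
    '<td valign="middle">1340/1430</td><td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
    '<td valign="middle">Booked</td><td valign="middle">'
)

# With two groups, findall returns (title, text) tuples; the group from the
# alternative that did not match is an empty string.
pairs = re.findall(r'title="([^"]*)"|<td[^>]*>([^<]+)</td>', content)
values = [title or text for title, text in pairs]
print(values)
```

For the "Charisma" cell this picks up the cell text rather than the title attribute, because the "<td>" alternative matches the whole cell first; in the sample the two happen to be identical.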
Nov 7 '08 #4
