Problem with regex: how to exclude more than one character

tecspring

I never know how to express "exclude an entire word" with a regexp,
and while using Python I ran into this problem again...

For example, if I want to match "string" in "test a string",
re.findall(r"[^a]* (\w+)","test a string") will work, but what if
there is not "a" but "an" ("test an string")? [^an] will fail
because it stops at the first character "a".
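
A small sketch of why the character class behaves this way, and one possible alternative. The variable names are only for illustration: `[^an]` means "one character that is neither 'a' nor 'n'", so it cannot stand for "anything except the word an". A lazy `.*?` followed by the literal word is one way to skip ahead to it.

```python
import re

text = "test an string"

# [^an]+ matches runs of characters that are neither 'a' nor 'n',
# so every 'a' or 'n' in the text breaks the run.
runs = re.findall(r"[^an]+", text)

# Lazily skip everything up to the whole word "an", then capture the next word.
match = re.search(r".*?\ban (\w+)", text)
word_after_an = match.group(1)

print(runs)           # ['test ', ' stri', 'g']
print(word_after_an)  # string
```

The `\b` keeps "an" from matching inside a longer word such as "plan".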

I guess people don't usually filter words this way?
Here is the real problem I ran into:
I want to extract both the text in each "<td>" block and the title
attribute of each "<span>"
###################### code #############################
import re
content='''<tr align="center" valign="middle" class="CellCss"><td
valign="middle">LA</td><td valign="middle">11/10/2008</td><td
valign="middle">1340/1430</td><td valign="middle">PF1/5</td><td
valign="middle">Understand....</td><td title="Charisma"
valign="middle">Charisma</td><td valign="middle">Booked</td><td
valign="middle">'''

re.findall(r'''<td valign="middle">([^<]+)</td><td
valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td><td
valign="middle">([^<]+)</td><td valign="middle"><span
title="([^"]*)"''',content)

#################### code end ############################
As you can see above,
I get the result "LA,11/10/2008,1340/1430,PF1/5,Understanding
the stock market".
There are two "<span>" blocks, but with this regexp I can only get the
"title" attribute of the first "<span>".
For the second, which should be "Charisma", I need something like
[^</td>]* to match "class="MouseCursor">Understand....</td>",
so that I can go on to match the second "<span>" block.

Maybe I didn't describe this clearly; if so, feel free to tell me :)
Thanks for any replies!
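
One possible way to get every cell and every title without listing each "<td>" by hand is a lazy `.*?` between the opening and closing tag. This is only a sketch: the `row` string below is a hypothetical, shortened row put together from the fragments quoted in this post (the span markup is an assumption), not the real page.

```python
import re

# Hypothetical sample row; the <span> markup is assumed, not copied from the page.
row = (
    '<tr><td valign="middle">LA</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td valign="middle"><span title="Charisma" '
    'class="MouseCursor">Charisma</span></td>'
    '<td valign="middle">Booked</td></tr>'
)

# .*? is lazy: it stops at the first </td> instead of the last one.
cells = re.findall(r"<td[^>]*>(.*?)</td>", row, re.DOTALL)

# Every title attribute of a <span>, wherever it sits in the tag.
titles = re.findall(r'<span[^>]*\btitle="([^"]*)"', row)

print(cells[0])  # LA
print(titles)    # ['Understanding the stock market', 'Charisma']
```

Note that [^</td>]* is a character class too: it excludes the single characters <, /, t, d and >, not the string "</td>".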

Nov 7 '08 #1

tecspring

And by the way, I've tried both (!</td>) and (?:!</td>), and neither
works.... so sad...
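
For reference, a sketch of what those two attempts do and what the negative lookahead syntax in Python's re module looks like. (!</td>) and (?:!</td>) are ordinary groups that match the literal text "!</td>"; the lookahead is written (?!</td>), and it consumes nothing, so it is usually paired with a dot that does the consuming. The `html` string is a hypothetical fragment based on the text quoted above.

```python
import re

# Hypothetical fragment; the <span> markup is assumed for illustration.
html = (
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td valign="middle">Booked</td>'
)

# (!</td>) is a plain group: it only matches a literal '!' followed by '</td>'.
literal_hit = re.search(r"(!</td>)", "oops!</td>")
literal_miss = re.search(r"(!</td>)", html)

# (?!</td>) checks that '</td>' does NOT start here; the dot then eats one character.
pattern = re.compile(r'class="MouseCursor">((?:(?!</td>).)*)</td>')
skipped = pattern.search(html).group(1)

print(literal_hit.group(1))  # !</td>
print(literal_miss)          # None
print(skipped)               # Understand....</span>
```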

Nov 7 '08 #2

Chris Rebert

Is there any particularly good reason why you're using regexps for
this rather than, say, an actual (X)HTML parser?

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
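
To illustrate the parser suggestion above, here is a minimal sketch using `html.parser` from the Python 3 standard library (in the Python 2 of this thread's era the module was named `HTMLParser`). The `CellCollector` class and the `row` sample are hypothetical names and data made up for this example; the span markup is an assumption.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each <td> and any title attribute inside it."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")
        if self.in_td:
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data


# Hypothetical sample row, not the real page.
row = (
    '<tr><td valign="middle">LA</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td valign="middle"><span title="Charisma" '
    'class="MouseCursor">Charisma</span></td>'
    '<td valign="middle">Booked</td></tr>'
)

parser = CellCollector()
parser.feed(row)
parser.close()

print(parser.cells)   # ['LA', 'Understand....', 'Charisma', 'Booked']
print(parser.titles)  # ['Understanding the stock market', 'Charisma']
```

It is longer than a one-line regexp, but it does not depend on attribute order or on how the tags are wrapped across lines.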

Nov 7 '08 #3

tecspring

Thanks a lot for the quick reply, Chris!
Actually, I tried BeautifulSoup and it's great.
But I'm not very familiar with it, and it needs more code to parse the
HTML and get the right text.
I think a regexp is more convenient if there is a way to extract the
list in just one line :)
I got this far but I'm stuck here...

Nov 7 '08 #4
