How can I exclude a word by using re?

In re, the punctuation "^" can exclude a single character, but I want
to exclude a whole word now. For example, I have a string "hi, how are
you. hello", and I want to extract the part before the word "hello".
I can't use ".*[^hello]" because "^" only excludes a single char "h" or
"e" or "l" or "o". Will somebody tell me how to do it? Thanks.
Aug 14 '05 #1
15 Replies
re.findall('(.*)hello|(.*)', 'hi, how are you. hello')
re.findall('(.*)hello|(.*)', 'hi, how are you. ello')
take a look at the outputs of these.
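
For reference, a minimal sketch of what those two calls return, written in Python 3 syntax (the variable names are only illustrative). Because the pattern has two groups, findall returns one tuple per match, and only the group belonging to the branch that matched is non-empty. Each list may also end with an extra empty match at the end of the string, so only the first tuple is shown.

```python
import re

PATTERN = '(.*)hello|(.*)'

with_hello = re.findall(PATTERN, 'hi, how are you. hello')
without_hello = re.findall(PATTERN, 'hi, how are you. ello')

# First branch matched: group 1 holds the text before "hello".
print(with_hello[0])     # ('hi, how are you. ', '')
# Second branch matched: group 2 holds the whole string.
print(without_hello[0])  # ('', 'hi, how are you. ello')
```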

Aug 14 '05 #2
import re

def demonstrate(regex, text):
    pattern = re.compile(regex)
    match = pattern.search(text)

    print " ", text
    if match:
        print " Matched '%s'" % match.group(0)
        print " Captured '%s'" % match.group(1)
    else:
        print " Did not match"

# Option 1: Match it all, but capture only the part before "hello." The (.*?)
# matches as few characters as possible, so that this pattern would end before
# the first hello in "hello hello".

pattern = r"(.*?)hello"
print "Option 1:", pattern
demonstrate( pattern, "hi, how are you. hello" )

# Option 2: Don't even match the "hello," but make sure it's there.
# The first of these calls will match, but the second will not. The
# (?=...) construct is using a feature called "forward look-ahead."

pattern = r"(.*)(?=hello)"
print "\nOption 2:", pattern
demonstrate( pattern, "hi, how are you. hello" )
demonstrate( pattern, "hi, how are you. ", )
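
The two options can also be sketched in Python 3 syntax against a string that contains several hellos. This is only an illustration: the sample text is borrowed from later in the thread, and Option 2 is shown with a lazy (.*?) because a greedy (.*) would stop before the last hello instead of the first.

```python
import re

text = "hi, how are you? hello I'm fine, thank you hello. that's it hello"

# Option 1: lazy quantifier; the match includes "hello", group 1 does not.
lazy = re.search(r"(.*?)hello", text)
print(repr(lazy.group(0)))   # 'hi, how are you? hello'
print(repr(lazy.group(1)))   # 'hi, how are you? '

# Option 2, made lazy: the look-ahead checks for "hello" without consuming it.
ahead = re.search(r"(.*?)(?=hello)", text)
print(repr(ahead.group(0)))  # 'hi, how are you? '

# With no "hello" in the text, the pattern does not match at all.
missing = re.search(r"(.*?)(?=hello)", "hi, how are you. ")
print(missing)               # None
```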
Aug 14 '05 #3
Thank you.
But what should I do if there is more than one hello and I only want
to extract what's before the first "hello"? For example, if the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", how do I extract all the stuff before the first hello?

Aug 14 '05 #4
could ildg wrote:
> Thank you.
> But what should I do if there is more than one hello and I only want
> to extract what's before the first "hello".

Read The Fine Manual ?-)

> For example, the raw
> string is "hi, how are you? hello I'm fine, thank you hello. that's it
> hello", I want to extract all the stuff before the first hello?

re.findall(r'^(.*)hello', your_string_full_of_hellos)
Aug 14 '05 #5
could ildg wrote:
> But what should I do if there is more than one hello and I only want
> to extract what's before the first "hello". For example, the raw
> string is "hi, how are you? hello I'm fine, thank you hello. that's it
> hello", I want to extract all the stuff before the first hello?

The simplest solution is to use str.split():

>>> helo = "hi, how are you? HELLO I'm fine, thank you hello. that's it"
>>> helo.split("hello", 1)[0]
"hi, how are you? HELLO I'm fine, thank you "

But regular expressions offer a similar feature:

>>> import re
>>> re.compile("hello", re.IGNORECASE).split(helo, 1)[0]
'hi, how are you? '

Peter
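
The two split variants above could be wrapped in one small helper. This is a sketch in Python 3 syntax; before_first and its ignore_case parameter are hypothetical names, not anything from the re module. Note that, like split itself, it returns the whole string when the word is absent.

```python
import re

def before_first(text, word, ignore_case=False):
    """Return the part of text before the first occurrence of word.

    If word does not occur, the whole text is returned (split's behaviour).
    """
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as "." or "<" in word literal.
    return re.split(re.escape(word), text, maxsplit=1, flags=flags)[0]

helo = "hi, how are you? HELLO I'm fine, thank you hello. that's it"
print(repr(before_first(helo, "hello")))
# "hi, how are you? HELLO I'm fine, thank you "
print(repr(before_first(helo, "hello", ignore_case=True)))
# 'hi, how are you? '
```

Escaping the word matters once the word is something like "</td>" or contains a dot, since those would otherwise be read as pattern syntax.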

Aug 15 '05 #6
Bruno Desthuilliers wrote:
> could ildg wrote:
>> Thank you.
>> But what should I do if there is more than one hello and I only want
>> to extract what's before the first "hello".
>
> Read The Fine Manual ?-)
>
>> For example, the raw
>> string is "hi, how are you? hello I'm fine, thank you hello. that's it
>> hello", I want to extract all the stuff before the first hello?
>
> re.findall(r'^(.*)hello', your_string_full_of_hellos)

Nice try, but it needs a little refinement to do what the OP asked for:

>>> import re
>>> h = "hi g'day hello hello hello"
>>> re.findall(r'^(.*)hello', h)
["hi g'day hello hello "]
>>> re.findall(r'^(.*?)hello', h)
["hi g'day "]
>>> re.findall(r'^(.*?)hello', h)[0]
"hi g'day "
Aug 15 '05 #7
(1) Why must you use re? It's often a good idea to use string methods
where they can do the job you want.
(2) What do you want to have happen if "hello" is not in the string?

Example:

C:\junk>type upto.py
def upto(strg, what):
    k = strg.find(what)
    if k > -1:
        return strg[:k]
    return None # or raise an exception

helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"

print repr(upto(helo, "HELLO"))
print repr(upto(helo, "hello"))
print repr(upto(helo, "hi"))
print repr(upto(helo, "goodbye"))
print repr(upto("", "goodbye"))
print repr(upto("", ""))

C:\junk>upto.py
'hi, how are you? '
"hi, how are you? HELLO I'm fine, thank you "
''
None
None
''

HTH,
John
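
The same idea can be sketched with str.partition (available since Python 2.5, so after this thread), which returns the text before the separator, the separator itself, and the rest. This variant is an illustration in Python 3 syntax, not a replacement for the function above; the empty-string case is handled separately because partition rejects an empty separator.

```python
def upto(strg, what):
    """Return the text before the first occurrence of what, or None if absent."""
    if what == "":
        return ""  # mirrors the find-based version, where find("") is 0
    head, sep, _tail = strg.partition(what)
    # sep is empty when `what` was not found.
    return head if sep else None

helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"
print(repr(upto(helo, "HELLO")))    # 'hi, how are you? '
print(repr(upto(helo, "goodbye")))  # None
```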
Aug 15 '05 #8
I want to use re because I want to extract something from an HTML page. It
would be very complicated without re. But while using re, I found that I
must exclude a whole word, "</td>", and of course there are many, many
"</td>" in this HTML.

My re is as below:
_____________________________________________
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
             ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
             ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)
_____________________________________________
There should be over 30 matches in the HTML, but I find nothing with
r.finditer(html) because the last line of my re is wrong. I can't use
"(?P<name>.+)</td>" because there are many, many "</td>" in the HTML
and I just want the ".+" to match what comes before the first "</td>".
So I think that if there were some way to exclude a word, this would be
done. Assuming a "NOT(WORD)" could do it, I would just need to write the
last line of the re as "(?P<name>(NOT(</td>))+)</td>".
But I still have no idea after thinking and trying for a very long time.

In other words, I want the "</td>" of "(?P<name>.+)</td>" to be
exactly the first "</td>" in this match. And there is more than one
match in this HTML, so this must be done by using re.

And I can't use any of your ideas because what I want to deal with is a
very complicated HTML page, not just a single line of text.

I could copy part of the HTML here, but it's kind of too lengthy.
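
The hypothetical "NOT(WORD)" can be approximated in re with a negative look-ahead, (?:(?!</td>).)+, or more simply with the lazy quantifier .+? suggested earlier in the thread. The sketch below is in Python 3 syntax and runs against made-up sample HTML, since the real page is not shown; the sample rows and variable names are assumptions.

```python
import re

# Hypothetical sample: two table rows on a single line, as in dense HTML.
html = (
    '<tr><td valign=top>1</td><td> <a href="songs/one.mp3" target=_blank>'
    'First song</a></td></tr>'
    '<tr><td valign=top>2</td><td> <a href="songs/two.mp3" target=_blank>'
    'Second song</a></td></tr>'
)

PREFIX = (
    r'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
    r'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
)

greedy = re.compile(PREFIX + r'(?P<name>.+)</td>', re.IGNORECASE)
lazy = re.compile(PREFIX + r'(?P<name>.+?)</td>', re.IGNORECASE)
# (?:(?!</td>).)+ reads: one or more characters, none of which starts "</td>".
not_word = re.compile(PREFIX + r'(?P<name>(?:(?!</td>).)+)</td>', re.IGNORECASE)

greedy_count = len(list(greedy.finditer(html)))
lazy_names = [m.group('name') for m in lazy.finditer(html)]
not_word_names = [m.group('name') for m in not_word.finditer(html)]

print(greedy_count)    # 1: the name runs on to the last </td>
print(lazy_names)      # ['First song</a>', 'Second song</a>']
print(not_word_names)  # ['First song</a>', 'Second song</a>']
```

If the name can span several lines, re.DOTALL would also be needed, because "." does not match a newline by default.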
Aug 16 '05 #9
could ildg said:
> I want to use re because I want to extract something from a html. It
> will be very complicated without using re. But while using re, I
> found that I must exclude a whole word "</td>", certainly, there are
> many many "</td>" in this html.

Actually, for properly processing HTML, you shouldn't really be using
regular expressions, precisely because the problem is complicated:
regular expressions are too simple and can't properly model a language
like HTML, which is generated by a context-free grammar.

If that's only meaningless technical mumbo-jumbo to you, never mind;
the important point is you shouldn't really use an re. Trust me :)

What you want for a job like this is an HTML parser. There's one in the
standard library; if it doesn't suit, there are plenty of third-party
ones. I like Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you insist on using an re, well, I'm sure someone on this group will
figure out a solution to your issue that's as good as you're going to
get...
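
As a rough sketch of the standard-library route, the following uses html.parser.HTMLParser (the Python 3 name of the module) to collect the href and link text of every .mp3 link. The Mp3LinkParser class and the sample HTML are hypothetical, made up for illustration; a real page may need more care.

```python
from html.parser import HTMLParser

class Mp3LinkParser(HTMLParser):
    """Collect (href, link text) pairs for <a> tags whose href ends in .mp3."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".mp3"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical sample rows, not the OP's actual page.
html = (
    '<tr><td valign=top>1</td><td> <a href="songs/one.mp3" target=_blank>'
    'First song</a></td></tr>'
    '<tr><td valign=top>2</td><td> <a href="songs/two.mp3" target=_blank>'
    'Second song</a></td></tr>'
)

parser = Mp3LinkParser()
parser.feed(html)
parser.close()
print(parser.links)
# [('songs/one.mp3', 'First song'), ('songs/two.mp3', 'Second song')]
```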


Aug 16 '05 #10
Given the example re that you've been trying to get working, here is a
pyparsing approach that might be more, um, approachable.
Unfortunately, since I don't have the URL of the page you are working
with, I'm unable to test this before posting.

Good luck,
-- Paul

# getMP3s.py
# get pyparsing at http://pyparsing.sourceforge.net
#

from pyparsing import *
import urllib

#~ r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
#~              ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
#~              ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

number = Word(nums)
valign = CaselessLiteral("valign=top>")

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd) + tdEnd

# get list of mp3's
targetURL = "http://whatever"
targetPage = urllib.urlopen( targetURL )
targetHTML = targetPage.read()
targetPage.close()

for toks,s,e in mp3Entry.scanString(targetHTML):
    print toks.number, toks.starta.href

Aug 16 '05 #11
Thank you,
your code using pyparsing works very well. Now I've got the "number" and
the "url", but I still want to get the "name".
I'll turn to pyparsing and see how to get the "name" from the HTML.
But I hope you can enlighten me one more time, since I'm not
familiar with the pyparsing module.

Aug 16 '05 #12
Just as with re you were using "?P<xxx>" to assign the matching text to
the variable "xxx", pyparsing allows you to associate a name with an
element of your grammar using setResultsName.

Here is your original re:
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
             ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
             ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

Here is the pyparsing expression:
valign + number.setResultsName("number") + tdEnd + \
    tdStart + SkipTo(aStart) + aStart + \
    SkipTo(tdEnd) + tdEnd

Here are the re and pyparsing pieces side by side:
re                      => pyparsing
-------------------------------------------------------------------
valign=top>             => valign = CaselessLiteral("valign=top>")
(?P<number>\d{1,2})     => number = Word(nums),
                           number.setResultsName("number")
</td>                   => tdEnd
<td[^>]*>               => tdStart
\s{0,2}                 => I don't know what this re does, so I just
                           used SkipTo(aStart)
<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>
                        => aStart (which returns a value whose named
                           attributes correspond to the HTML
                           attributes, such as href)
(?P<name>.+)            => SkipTo(tdEnd)
                           *** here is where we'll make our change ***
</td>                   => tdEnd

To capture the body of the second <td></td> tag pair, we'll add
setResultsName("name") to the pyparsing expression:
mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd)setResultsName("name") + tdEnd

Now you should be able to extract the data using:
for toks,s,e in mp3Entry.scanString(targetHTML):
    print toks.number, toks.starta.href, toks.name

Good luck!
-- Paul
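
For the re half of that comparison, here is a minimal sketch (Python 3 syntax) of how "?P<xxx>" names a group and how the named text is read back. The pattern and sample string are made up for illustration; the pyparsing half is not repeated here.

```python
import re

m = re.search(r'(?P<number>\d{1,2})\. (?P<name>.+)', '12. Some Title')

print(m.group('number'))  # 12
print(m.group('name'))    # Some Title
print(m.groupdict())      # {'number': '12', 'name': 'Some Title'}
```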

Aug 16 '05 #13
Oof! That should be:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd

Aug 16 '05 #14
On Tue, 16 Aug 2005 09:09:11 +0800, could ildg <co*******@gmail.com>
declaimed the following in comp.lang.python:
> I want to use re because I want to extract something from a html. It
> will be very complicated without using re. But while using re, I
> found that I must exclude a whole word "</td>", certainly, there are
> many many "</td>" in this html.

Yeesh... Wouldn't it be faster to use an HTML parser (I think there
is one in the standard library) that just doesn't emit anything for the
particular tags in question (and, at the simplest, just copies
everything else to the output unchanged)?
--
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net | Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Aug 16 '05 #15
I just reviewed what the re "\s" signifies: whitespace. This is easy;
pyparsing ignores all intervening whitespace by default. So mp3Entry
simplifies to:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd

which leads me to another question - isn't there a closing </a> in
there somewhere, probably at the end of the name? If so, then you
might be better off with:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(aEnd).setResultsName("name") + aEnd + tdEnd

-- Paul

Aug 16 '05 #16
