
How can I exclude a word by using re?

In re, the "^" inside a character class can exclude a single character, but I want
to exclude a whole word now. For example, I have a string "hi, how are
you. hello", and I want to extract everything before the word "hello".
I can't use ".*[^hello]" because "^" only excludes the single characters "h",
"e", "l", or "o". Will somebody tell me how to do it? Thanks.
Aug 14 '05 #1
re.findall('(.*)hello|(.*)', 'hi, how are you. hello')
re.findall('(.*)hello|(.*)', 'hi, how are you. ello')
take a look at the outputs of these.
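
For reference, here is roughly what those two calls return. This is a small sketch written for Python 3 (the thread itself predates it), and the variable names are only for illustration. Because the pattern has two groups, findall returns tuples, and the group belonging to the alternative that did not match shows up as an empty string.

```python
import re

pattern = '(.*)hello|(.*)'

# "hello" is present: the first alternative matches, so group 1 holds
# everything before the last "hello" and group 2 is empty.
with_hello = re.findall(pattern, 'hi, how are you. hello')
print(with_hello[0])      # ('hi, how are you. ', '')

# "hello" is absent: only the second alternative can match, so group 2
# holds the whole string and group 1 is empty.
without_hello = re.findall(pattern, 'hi, how are you. ello')
print(without_hello[0])   # ('', 'hi, how are you. ello')
```

Each result list also ends with an extra tuple of empty strings, because (.*) can match the empty string at the end of the input.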

Aug 14 '05 #2


import re

def demonstrate(regex, text):
    pattern = re.compile(regex)
    match = pattern.search(text)

    print " ", text
    if match:
        print " Matched '%s'" % match.group(0)
        print " Captured '%s'" % match.group(1)
    else:
        print " Did not match"

# Option 1: Match it all, but capture only the part before "hello." The (.*?)
# matches as few characters as possible, so that this pattern would end before
# the first hello in "hello hello".

pattern = r"(.*?)hello"
print "Option 1:", pattern
demonstrate(pattern, "hi, how are you. hello")

# Option 2: Don't even match the "hello," but make sure it's there.
# The first of these calls will match, but the second will not. The
# (?=...) construct is using a feature called "forward look-ahead."

pattern = r"(.*)(?=hello)"
print "\nOption 2:", pattern
demonstrate(pattern, "hi, how are you. hello")
demonstrate(pattern, "hi, how are you. ")
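
The point about (.*?) can be checked directly. The following is a minimal sketch in Python 3; before_first and before_last are hypothetical helper names, not part of the re module. re.escape is used so that the word is treated literally, and re.DOTALL lets the dot cross line breaks.

```python
import re

def before_first(word, text):
    """Text before the first occurrence of word, or None if absent."""
    # Lazy (.*?) grows one character at a time, so it stops at the first hit.
    m = re.match(r'(.*?)' + re.escape(word), text, re.DOTALL)
    return m.group(1) if m else None

def before_last(word, text):
    """Text before the last occurrence of word, or None if absent."""
    # Greedy (.*) grabs everything, then backs off to the last hit.
    m = re.match(r'(.*)' + re.escape(word), text, re.DOTALL)
    return m.group(1) if m else None

s = "hello hello"
print(repr(before_first("hello", s)))  # ''
print(repr(before_last("hello", s)))   # 'hello '
```

Only the quantifier differs between the two helpers, which is why a stray greedy (.*) so often captures more than intended.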
Aug 14 '05 #3
Thank you.
But what should I do if there is more than one "hello" and I only want
to extract what's before the first one? For example, if the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", I want to extract everything before the first hello.


Aug 14 '05 #4
could ildg wrote:
Thank you.
But what should I do if there are more than one hello and I only want
to extract what's before the first "hello".
Read The Fine Manual ?-)

For example, the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", I want to extract all the stuff before the first hello?


re.findall(r'^(.*)hello', your_string_full_of_hellos)
Aug 14 '05 #5
could ildg wrote:
But what should I do if there are more than one hello and I only want
to extract what's before the first "hello". For example, the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", I want to extract all the stuff before the first hello?


The simplest solution is to use str.split():
>>> import re
>>> helo = "hi, how are you? HELLO I'm fine, thank you hello. that's it"
>>> helo.split("hello", 1)[0]
"hi, how are you? HELLO I'm fine, thank you "

But regular expressions offer a similar feature:

>>> re.compile("hello", re.IGNORECASE).split(helo, 1)[0]
'hi, how are you? '
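
A related string method is str.partition, which splits at the first occurrence only and also reports whether the separator was found. The sketch below is written for Python 3; head_before is a hypothetical helper name chosen for this example.

```python
def head_before(text, word):
    """Return the part of text before the first word, or None if absent."""
    head, sep, _tail = text.partition(word)
    # sep is the empty string when word was not found in text
    return head if sep else None

helo = "hi, how are you? hello I'm fine, thank you hello. that's it"
print(repr(head_before(helo, "hello")))    # 'hi, how are you? '
print(repr(head_before(helo, "goodbye")))  # None
```

Unlike split(...)[0], this distinguishes "word not found" from "word found at the very start". Note that partition raises ValueError for an empty separator.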

Peter

Aug 15 '05 #6
Bruno Desthuilliers wrote:
re.findall(r'^(.*)hello', your_string_full_of_hellos)


Nice try, but it needs a little refinement to do what the OP asked for:
>>> import re
>>> h = "hi g'day hello hello hello"
>>> re.findall(r'^(.*)hello', h)
["hi g'day hello hello "]
>>> re.findall(r'^(.*?)hello', h)
["hi g'day "]
>>> re.findall(r'^(.*?)hello', h)[0]
"hi g'day "
Aug 15 '05 #7


(1) Why must you use re? It's often a good idea to use string methods
where they can do the job you want.
(2) What do you want to have happen if "hello" is not in the string?

Example:

C:\junk>type upto.py
def upto(strg, what):
    k = strg.find(what)
    if k > -1:
        return strg[:k]
    return None # or raise an exception

helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"

print repr(upto(helo, "HELLO"))
print repr(upto(helo, "hello"))
print repr(upto(helo, "hi"))
print repr(upto(helo, "goodbye"))
print repr(upto("", "goodbye"))
print repr(upto("", ""))

C:\junk>upto.py
'hi, how are you? '
"hi, how are you? HELLO I'm fine, thank you "
''
None
None
''

HTH,
John
Aug 15 '05 #8
I want to use re because I want to extract something from an HTML page. It
would be very complicated without re. But while using re, I
found that I must exclude a whole word, "</td>", and there are
a great many "</td>" in this HTML.

My re is as below:
_____________________________________________
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)
_____________________________________________
There should be over 30 matches in the HTML, but r.finditer(html) finds
nothing because the last line of my re is wrong. I can't use
"(?P<name>.+)</td>" because there are a great many "</td>" in the HTML
and I just want the ".+" to match what comes before the first "</td>".
So I think that if there were some way to exclude a word, this would be
solved. Assuming a "NOT(WORD)" construct existed, I would just write the
last line of the re as "(?P<name>(NOT(</td>))+)</td>".
But I still have no idea how to do it after thinking and trying for a very long time.

In other words, I want the "</td>" of "(?P<name>.+)</td>" to be
exactly the first "</td>" after the name in each match. There is more than one
match in this HTML, so this must be done with re.

I can't use any of your ideas because what I have to deal with is
very complicated HTML, not just a single line of words.

I could paste part of the HTML here, but it's rather lengthy.
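
The wished-for "NOT(WORD)" can be approximated in re in two ways: a lazy quantifier, or a negative lookahead applied before every character. The sketch below is written for Python 3 and uses a made-up two-row HTML fragment, since the real page was never posted; the variable names are hypothetical, and only the tail of the pattern is reproduced.

```python
import re

# Hypothetical stand-in for the real page, which is not shown in the thread.
html = (
    '<tr><td valign=top>1</td><td><a href="a.mp3" target=_blank>First song</a></td>'
    '<td>3:05</td></tr>\n'
    '<tr><td valign=top>2</td><td><a href="b.mp3" target=_blank>Second song</a></td>'
    '<td>4:10</td></tr>\n'
)

# Greedy: .+ runs to the LAST </td> on the line, swallowing the next cell.
greedy = re.compile(r'target=_blank>(?P<name>.+)</td>', re.IGNORECASE)

# Lazy: .+? stops at the first </td> that lets the pattern succeed.
lazy = re.compile(r'target=_blank>(?P<name>.+?)</td>',
                  re.IGNORECASE | re.DOTALL)

# "NOT(</td>)": take one character at a time, but only while the text
# at the current position does not start with </td>.
tempered = re.compile(r'target=_blank>(?P<name>(?:(?!</td>).)+)</td>',
                      re.IGNORECASE | re.DOTALL)

greedy_names = [m.group('name') for m in greedy.finditer(html)]
lazy_names = [m.group('name') for m in lazy.finditer(html)]
tempered_names = [m.group('name') for m in tempered.finditer(html)]

print(greedy_names[0])   # First song</a></td><td>3:05
print(lazy_names)        # ['First song</a>', 'Second song</a>']
print(tempered_names)    # ['First song</a>', 'Second song</a>']
```

In this fragment the captured name still carries the closing </a>, because the pattern only stops at </td>. The lazy form is usually the simpler choice; the lookahead form guarantees that the captured text can never contain the excluded word.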

Aug 16 '05 #9
could ildg said:
I want to use re because I want to extract something from an HTML page. It
would be very complicated without re. But while using re, I
found that I must exclude a whole word, "</td>", and there are
a great many "</td>" in this HTML.

Actually, for properly processing HTML, you shouldn't really be using
regular expressions, precisely because the problem is complicated:
regular expressions are too simple and can't properly model a language
like HTML, which is generated by a context-free grammar.

If that's only meaningless technical mumbo-jumbo to you, never mind;
the important point is that you shouldn't really use an re. Trust me :)

What you want for a job like this is an HTML parser. There's one in the
standard library; if it doesn't suit, there are plenty of third-party
ones. I like Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/
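
As a rough illustration of the standard-library route, here is a minimal sketch using html.parser from Python 3 (the module name differs in the Python 2 of this thread). CellCollector and the sample markup are hypothetical, and the class ignores nested tables for brevity.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of each <td> cell and the href of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.cells = []      # text content of every finished <td>
        self.links = []      # href values in document order
        self._in_td = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value).
        if tag == "td":
            self._in_td = True
            self._buf = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._buf).strip())
            self._in_td = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> is still delivered here.
        if self._in_td:
            self._buf.append(data)

sample = (
    '<table><tr><td valign=top>1</td>'
    '<td><a href="a.mp3" target=_blank>First song</a></td></tr></table>'
)
collector = CellCollector()
collector.feed(sample)
collector.close()
print(collector.cells)  # ['1', 'First song']
print(collector.links)  # ['a.mp3']
```

The parser decides where each cell ends, so the question of which "</td>" is the first one never comes up.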

If you insist on using an re, well, I'm sure someone on this group will
figure out a solution to your issue that's as good as you're going to
get...



Aug 16 '05 #10
Given the example re that you've been trying to get working, here is a
pyparsing approach that might be more, um, approachable.
Unfortunately, since I don't have the URL of the page you are working
with, I'm unable to test this before posting.

Good luck,
-- Paul

# getMP3s.py
# get pyparsing at http://pyparsing.sourceforge.net
#

from pyparsing import *
import urllib

#~ r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
#~ ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
#~ ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

number = Word(nums)
valign = CaselessLiteral("valign=top>")

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd) + tdEnd

# get list of mp3's
targetURL = "http://whatever"
targetPage = urllib.urlopen( targetURL )
targetHTML = targetPage.read()
targetPage.close()

for toks,s,e in mp3Entry.scanString(targetHTML):
    print toks.number, toks.starta.href

Aug 16 '05 #11
Thank you,
your code using pyparsing works very well. Now I have the "number" and
the "url", but I still want to get the "name".
I'll look into pyparsing and see how to get the "name" from the HTML,
but I hope you can enlighten me one more time since I'm not
familiar with the pyparsing module.


Aug 16 '05 #12
Just as with re you were using "?P<xxx>" to assign the matching text to
the variable "xxx", pyparsing allows you to associate a name with an
element of your grammar using setResultsName.

Here is your original re:
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
             ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
             ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

Here is the pyparsing expression:
valign + number.setResultsName("number") + tdEnd + \
    tdStart + SkipTo(aStart) + aStart + \
    SkipTo(tdEnd) + tdEnd

Here are the re and pyparsing pieces side by side:
re                   => pyparsing
------------------------------------------------------------
valign=top>          => valign = CaselessLiteral("valign=top>")
(?P<number>\d{1,2})  => number = Word(nums), number.setResultsName("number")
</td>                => tdEnd
<td[^>]*>            => tdStart
\s{0,2}              => I don't know what this re does, so I just used SkipTo(aStart)
<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>
                     => aStart (which returns a value whose named attributes
                        correspond to the HTML attributes, such as href)
(?P<name>.+)         => SkipTo(tdEnd) *** here is where we'll make our change ***
</td>                => tdEnd

To capture the body of the second <td></td> tag pair, we'll add
setResultsName("name") to the pyparsing expression:
mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd)setResultsName("name") + tdEnd

Now you should be able to extract the data using:
for toks,s,e in mp3Entry.scanString(targetHTML):
    print toks.number, toks.starta.href, toks.name

Good luck!
-- Paul

Aug 16 '05 #13
Oof! That should be:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd

Aug 16 '05 #14
On Tue, 16 Aug 2005 09:09:11 +0800, could ildg <co*******@gmail.com>
declaimed the following in comp.lang.python:
I want to use re because I want to extract something from an HTML page. It
would be very complicated without re. But while using re, I
found that I must exclude a whole word, "</td>", and there are
a great many "</td>" in this HTML.

Yeesh... Wouldn't it be faster to use an HTML parser (I think there
is one in the standard library) that just doesn't emit anything for the
particular tags in question (and, at the simplest, just copies
everything else to the output unchanged)?
--
Wulfraed Dennis Lee Bieber KD6MOG
Home Page: <http://www.dm.net/~wulfraed/>

Aug 16 '05 #15
I just reviewed what the re "\s" signifies: whitespace. This is easy,
since pyparsing ignores all intervening whitespace by default. So mp3Entry
simplifies to:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd

which leads me to another question - isn't there a closing </a> in
there somewhere, probably at the end of the name? If so, then you
might be better off with:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(aEnd).setResultsName("name") + aEnd + tdEnd

-- Paul

Aug 16 '05 #16
