How to extract a part of html file

Joe
I'm trying to extract part of an HTML file, from one tag to another. The
part I want begins with <span class="boldyellow"><B><U> and ends with
TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>

I was thinking of using a regular expression, but I'm having a hard time
getting the desired string. I use

htmlSource = urllib.urlopen("http://address/")
s = htmlSource.read()
htmlSource.close()

to get the HTML into a string. Now I want to match string s from the <span
class tag to <img src="http://whatever/some.gif"> </TD></TR></TABLE> and
store that in a new string.
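
For anyone reading this on Python 3: urllib.urlopen no longer exists there, and urllib.request.urlopen is the closest replacement. A minimal sketch follows; fetch_html is a hypothetical helper name, and falling back to UTF-8 when the server declares no charset is an assumption, not something from the original post.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the body at `url` as text (hypothetical helper, not from the post)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Assumption: fall back to UTF-8 if the response declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

With this, s = fetch_html("http://address/") would play the role of the three lines above.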

Thanks
Oct 20 '05 #1
3 Replies


Joe <di******@lycos.com> wrote:
> I'm trying to extract part of html code from a tag to a tag


For tag soup, use BeautifulSoup:

<URL:http://www.crummy.com/software/BeautifulSoup/>

Available as a package in Debian, probably other decent OSen also.

Ben Finney
Oct 20 '05 #2

Ben Finney <bi****************@benfinney.id.au> writes:
> Joe <di******@lycos.com> wrote:
>> I'm trying to extract part of html code from a tag to a tag
>
> For tag soup, use BeautifulSoup:
> <URL:http://www.crummy.com/software/BeautifulSoup/>


Except he's trying to extract an apparently random part of the
file. BeautifulSoup is a wonderful thing for dealing with X/HTML
documents as structured documents, which is how you want to deal with
them most of the time.

In this case, an re works nicely:
>>> import re
>>> s = '<span class="boldyellow"><B><U> and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
>>> r = re.match('<span class="boldyellow"><B><U>(.*)TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', s)
>>> r.group(1)
' and ends with '
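
One caveat when moving from this toy string to a real page: re.match only matches at the very start of the string, and "." does not cross newlines unless re.DOTALL is given. A hedged variant for a whole document might look like the sketch below. The names START, END, extract_with_re and the sample page are made up for illustration; the non-greedy (.*?) stops at the first end marker rather than the last.

```python
import re

# The two delimiters from the question, kept as plain strings.
START = '<span class="boldyellow"><B><U>'
END = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'


def extract_with_re(html):
    """Return the text between START and END, or None if either is missing."""
    # re.escape makes characters such as "." in the URL match literally.
    pattern = re.escape(START) + r"(.*?)" + re.escape(END)
    # re.search scans the whole string; re.DOTALL lets "." match newlines.
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


# Made-up sample page: the wanted part sits mid-document and spans two lines.
page = "<html><body>\n" + START + "first line\nsecond line" + END + "\n</body></html>"
print(repr(extract_with_re(page)))
```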
String.find also works really well:
>>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
>>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
>>> s[start:stop]
' and ends with '
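
Note that str.find returns -1 when the marker is absent, so on a page that lacks either delimiter the slice above would quietly return the wrong text instead of failing. A small sketch that checks for that case; extract_between is a hypothetical helper name, not part of any library.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:  # start marker not present
        return None
    start += len(start_marker)  # skip past the marker itself
    stop = text.find(end_marker, start)  # only look after the start marker
    if stop == -1:  # end marker not present
        return None
    return text[start:stop]


# Same sample string as in the session above.
s = ('<span class="boldyellow"><B><U> and ends with TD><TD> '
     '<img src="http://whatever/some.gif"> </TD></TR></TABLE>')
part = extract_between(
    s,
    '<span class="boldyellow"><B><U>',
    'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>',
)
print(repr(part))
```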


Not a lot to choose between them.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 20 '05 #3

Joe
Thanks Mike, that is just what I was looking for. I have looked at
BeautifulSoup, but it doesn't really do what I want it to do; maybe I'm
just new to Python and don't know exactly what it is doing yet.
However, string find works. Thanks

On Thu, 20 Oct 2005 09:47:37 -0400, Mike Meyer wrote:
> Ben Finney <bi****************@benfinney.id.au> writes:
>> Joe <di******@lycos.com> wrote:
>>> I'm trying to extract part of html code from a tag to a tag
>>
>> For tag soup, use BeautifulSoup:
>> <URL:http://www.crummy.com/software/BeautifulSoup/>
>
> Except he's trying to extract an apparently random part of the file.
> BeautifulSoup is a wonderful thing for dealing with X/HTML documents as
> structured documents, which is how you want to deal with them most of
> the time.
>
> In this case, an re works nicely:
> >>> import re
> >>> s = '<span class="boldyellow"><B><U> and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
> >>> r = re.match('<span class="boldyellow"><B><U>(.*)TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', s)
> >>> r.group(1)
> ' and ends with '
> String.find also works really well:
> >>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
> >>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
> >>> s[start:stop]
> ' and ends with '
>
> Not a lot to choose between them.
>
> <mike

Oct 20 '05 #4
