How to extract a part of html file

Joe
I'm trying to extract part of an HTML page, from one tag to another. The part
begins with <span class="boldyellow"><B><U> and ends with
TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>

I was thinking of using a regular expression, but I'm having a hard time
getting the desired string. I use

import urllib  # Python 2; in Python 3 this function is urllib.request.urlopen

htmlSource = urllib.urlopen("http://address/")
s = htmlSource.read()
htmlSource.close()

to get the HTML into a string. Now I want to match the part of s from the
<span class="boldyellow"> tag to
<img src="http://whatever/some.gif"> </TD></TR></TABLE> and store it in a
new string.

Thanks
Oct 20 '05 #1
3 Replies
Joe <di******@lycos.com> wrote:
I'm trying to extract part of html code from a tag to a tag


For tag soup, use BeautifulSoup:

<URL:http://www.crummy.com/software/BeautifulSoup/>

Available as a package in Debian, probably other decent OSen also.

--
\ "I think it would be a good idea." -- Mahatma Gandhi (when |
`\ asked what he thought of Western civilization) |
_o__) |
Ben Finney
Oct 20 '05 #2
Ben Finney <bi****************@benfinney.id.au> writes:
Joe <di******@lycos.com> wrote:
I'm trying to extract part of html code from a tag to a tag

For tag soup, use BeautifulSoup:
<URL:http://www.crummy.com/software/BeautifulSoup/>


Except he's trying to extract an apparently random part of the
file. BeautifulSoup is a wonderful thing for dealing with X/HTML
documents as structured documents, which is how you want to deal with
them most of the time.

In this case, an re works nicely:

>>> import re
>>> s = '<span class="boldyellow"><B><U> and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
>>> r = re.match('<span class="boldyellow"><B><U>(.*)TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', s)
>>> r.group(1)
' and ends with '

String.find also works really well:

>>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
>>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
>>> s[start:stop]
' and ends with '


Not a lot to choose between them.
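
One caveat when moving from the one-line sample to a real page: re.match only matches at the very start of the string, and '.' does not cross newlines unless re.DOTALL is given. Here is a minimal sketch for Python 3, assuming the two markers from the question appear literally in the page. The function name extract_with_re and the sample page are made up for illustration.

```python
import re

# Markers taken from the question; everything else here is illustrative.
START = '<span class="boldyellow"><B><U>'
END = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'


def extract_with_re(page, start=START, end=END):
    """Return the text between start and end, or None if there is no match."""
    # re.escape treats the markers as literal text (the '.' in the URL too).
    # (.*?) is non-greedy, so the match stops at the first end marker.
    # re.DOTALL lets '.' match newlines, since real pages span many lines.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    m = re.search(pattern, page, re.DOTALL)
    return m.group(1) if m else None


# Hypothetical multi-line page with the markers somewhere in the middle.
page = ('<html><body>\n' + START + 'Title</U></B></span>\n'
        '<TABLE><TR><' + END + '\n</body></html>')

print(repr(extract_with_re(page)))
```

If the markers themselves are wanted in the result, m.group(0) would return the whole matched span instead of just the middle.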

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 20 '05 #3
Joe
Thanks Mike, that is just what I was looking for. I have looked at
BeautifulSoup, but it doesn't really do what I want it to do; maybe I'm
just new to Python and don't know exactly what it is doing yet.
However, string find works. Thanks
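
For anyone reusing the find approach: str.find returns -1 when the marker is missing, so the bare slice can quietly return the wrong text instead of failing. A small sketch of a helper that checks for that, written for Python 3. The name extract_between is hypothetical, not something from the thread.

```python
def extract_between(page, start, end):
    """Return the text between the first start marker and the next
    end marker, or None if either marker is missing."""
    i = page.find(start)
    if i == -1:          # start marker not in the page
        return None
    i += len(start)      # skip past the start marker itself
    j = page.find(end, i)
    if j == -1:          # no end marker after the start marker
        return None
    return page[i:j]


# Same sample string and markers as in the reply above.
s = ('<span class="boldyellow"><B><U> and ends with TD><TD> '
     '<img src="http://whatever/some.gif"> </TD></TR></TABLE>')
start_marker = '<span class="boldyellow"><B><U>'
end_marker = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'

print(repr(extract_between(s, start_marker, end_marker)))
```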

Oct 20 '05 #4
