How to extract a part of html file

Joe
I'm trying to extract part of an HTML page, from one tag to another. The
part I want begins with <span class="boldyellow"><B><U> and ends with
TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>

I was thinking of using a regular expression, but I'm having a hard time
getting the desired string. I use

htmlSource = urllib.urlopen("http://address/")
s = htmlSource.read()
htmlSource.close()

to get the HTML into a string. Now I want to match string s from the <span
class=...> tag to <img src="http://whatever/some.gif"> </TD></TR></TABLE> and
store that part in a new string.

Thanks
Oct 20 '05 #1
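
The snippet in the post above is Python 2, where urlopen lived directly in urllib. In Python 3 it is urllib.request.urlopen, and read() returns bytes rather than a string. A minimal sketch of the same fetch follows; the function name fetch_html and the UTF-8 fallback are assumptions made for illustration, not part of the original post.

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the body at `url` decoded to str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Prefer the charset named in the Content-Type header, if any.
        # The UTF-8 fallback is an assumption, not something the server promises.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


# Usage, with the placeholder address from the post:
# s = fetch_html("http://address/")
```

Decoding with errors="replace" keeps the call from failing on a stray byte, at the cost of silently substituting characters; that trade-off may or may not suit a given page.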
Joe <di******@lycos.com> wrote:
> I'm trying to extract part of html code from a tag to a tag


For tag soup, use BeautifulSoup:

<URL:http://www.crummy.com/software/BeautifulSoup/>

Available as a package in Debian, probably other decent OSen also.

--
\ "I think it would be a good idea." -- Mahatma Gandhi (when |
`\ asked what he thought of Western civilization) |
_o__) |
Ben Finney
Oct 20 '05 #2
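
To illustrate what treating the page as a structured document can look like: BeautifulSoup is a third-party package, so the sketch below swaps in the standard library's html.parser instead. It is not BeautifulSoup's API and not necessarily what the poster had in mind. The class name SpanTextCollector and the helper text_in_spans are hypothetical, and the code assumes the span's class attribute is exactly the name given (no multiple classes).

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect text found inside <span class="..."> elements (hypothetical)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # > 0 while inside a matching span
        self.chunks = []  # text pieces seen inside it

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <SPAN> and <span> both arrive as "span".
        if tag == "span":
            if self.depth:
                self.depth += 1  # nested span inside a matching one
            elif dict(attrs).get("class") == self.css_class:
                self.depth = 1   # assumption: exact class match only

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_in_spans(html, css_class):
    """Return the concatenated text inside spans of the given class."""
    parser = SpanTextCollector(css_class)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)
```

This returns only the text inside the span, so it fits when the wanted content is a well-defined element. It does not cover a cut that starts in one element and ends in the middle of another, which is the case discussed in the next reply.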
Ben Finney <bi****************@benfinney.id.au> writes:
> Joe <di******@lycos.com> wrote:
>> I'm trying to extract part of html code from a tag to a tag
>
> For tag soup, use BeautifulSoup:
> <URL:http://www.crummy.com/software/BeautifulSoup/>


Except he's trying to extract an apparently random part of the
file. BeautifulSoup is a wonderful thing for dealing with X/HTML
documents as structured documents, which is how you want to deal with
them most of the time.

In this case, an re works nicely:

>>> import re
>>> s = '<span class="boldyellow"><B><U> and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
>>> r = re.match('<span class="boldyellow"><B><U>(.*)TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', s)
>>> r.group(1)
' and ends with '

String.find also works really well:

>>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
>>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
>>> s[start:stop]
' and ends with '


Not a lot to choose between them.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 20 '05 #3
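
A note on applying the session in post #3 to a whole page rather than the one-line test string: re.match only matches at the start of the string, "." does not cross line breaks unless re.DOTALL is given, and str.find returns -1 when a marker is missing, which turns the slice into a silently wrong result. The sketch below adjusts both approaches for those cases. The names START, END, between_re and between_find are made up for illustration, and the sample page is invented.

```python
import re

# Markers copied from the thread; the names START and END are hypothetical.
START = '<span class="boldyellow"><B><U>'
END = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'


def between_re(text, start=START, end=END):
    """Regex version: text between the first `start` and the next `end`, or None."""
    # re.escape makes "." and other metacharacters in the markers literal.
    # (.*?) is non-greedy, so the match stops at the first `end` after `start`.
    # re.DOTALL lets "." match newlines; re.search scans the whole string.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None


def between_find(text, start=START, end=END):
    """str.find version with the same behavior, checking for missing markers."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j]


# Invented sample: the wanted part sits in the middle and spans two lines.
page = "<html>\n" + START + " news\nitem <" + END + "</html>"
assert between_re(page) == between_find(page)
```

Both functions return None when either marker is absent, so the caller can tell "not found" apart from an empty match. As the post says, there is not a lot to choose between them; the find version avoids regex escaping altogether.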
Joe
Thanks Mike, that is just what I was looking for. I have looked at
BeautifulSoup, but it doesn't really do what I want it to do; maybe I'm
just new to Python and don't know exactly what it is doing yet.
However, string find works. Thanks

On Thu, 20 Oct 2005 09:47:37 -0400, Mike Meyer wrote:
[full quote of post #3 snipped]

Oct 20 '05 #4
