How to extract a part of html file

Joe

I'm trying to extract part of an HTML file, from one tag to another. The part
begins with <span class="boldyellow"><B><U> and ends with
TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>

I was thinking of using a regular expression, but I'm having a hard time
getting the desired string. I use

htmlSource = urllib.urlopen("http://address/")
s = htmlSource.read()
htmlSource.close()

to get the HTML into a string. Now I want to match string s from the
<span class="boldyellow"> tag to <img src="http://whatever/some.gif"> </TD></TR></TABLE>
and store that into a new string.
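On Python 3, where urllib.urlopen no longer exists, the same fetch could be sketched roughly as follows. The helper name fetch_html and the UTF-8 fallback are assumptions made for illustration; they are not part of the original snippet.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Download url and return the response body as text.

    Hypothetical helper: the name and the fallback encoding are
    illustrative choices, not something the original post specifies.
    """
    # The context manager closes the connection, replacing the explicit
    # htmlSource.close() call in the Python 2 version.
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_encoding
    # Real-world pages are often mis-encoded, so replace bad bytes
    # instead of raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

With this in place, s = fetch_html("http://address/") would stand in for the three lines above.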

Thanks

Oct 20 '05 #1


Ben Finney

Joe <di******@lycos.com> wrote:

I'm trying to extract part of html code from a tag to a tag

For tag soup, use BeautifulSoup:

<URL:http://www.crummy.com/software/BeautifulSoup/>

Available as a package in Debian, probably other decent OSen also.

--
\ "I think it would be a good idea." -- Mahatma Gandhi (when |
`\ asked what he thought of Western civilization) |
_o__) |
Ben Finney

Oct 20 '05 #2

Mike Meyer

Ben Finney <bi****************@benfinney.id.au> writes:

Joe <di******@lycos.com> wrote:
I'm trying to extract part of html code from a tag to a tag

For tag soup, use BeautifulSoup:
<URL:http://www.crummy.com/software/BeautifulSoup/>

Except he's trying to extract an apparently random part of the
file. BeautifulSoup is a wonderful thing for dealing with X/HTML
documents as structured documents, which is how you want to deal with
them most of the time.

In this case, an re works nicely:

>>> import re
>>> s = '<span class="boldyellow"><B><U> and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
>>> r = re.match('<span class="boldyellow"><B><U>(.*)TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', s)
>>> r.group(1)
' and ends with '

String.find also works really well:

>>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
>>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
>>> s[start:stop]
' and ends with '

Not a lot to choose between them.
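On a real page the two markers sit somewhere in the middle of a multi-line document, and either of them may be missing. A minimal sketch of both approaches under those conditions is below, assuming Python 3. The helper names extract_with_re and extract_with_find and the sample page are made up for illustration. The regex version uses re.search rather than re.match (which only matches at the start of the string), re.DOTALL so that "." also matches newlines, a non-greedy (.*?) so the match stops at the first end marker, and re.escape so the markers are treated as literal text.

```python
import re

START = '<span class="boldyellow"><B><U>'
END = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'


def extract_with_re(page, start=START, end=END):
    """Return the text between start and end, or None if not found."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, page, re.DOTALL)
    return match.group(1) if match else None


def extract_with_find(page, start=START, end=END):
    """Same result using str.find; returns None if a marker is missing."""
    begin = page.find(start)
    if begin == -1:
        return None
    begin += len(start)           # skip past the start marker itself
    stop = page.find(end, begin)  # look for the end marker after it
    if stop == -1:
        return None
    return page[begin:stop]


# Made-up sample page with the wanted part spanning two lines.
page = (
    "<html><body>\n"
    + START + "first line\nsecond line</U></B></span>\n<"
    + END + "\n</body></html>"
)
print(repr(extract_with_re(page)))
print(repr(extract_with_find(page)))
```

Both calls return the same substring for this sample. The str.find version avoids regex escaping entirely, which may be the simpler choice when the markers are fixed strings.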

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Oct 20 '05 #3

Joe

Thanks Mike, that is just what I was looking for. I have looked at
BeautifulSoup, but it doesn't really do what I want it to do; maybe I'm
just new to Python and don't know exactly what it is doing yet.
However, string find works. Thanks

On Thu, 20 Oct 2005 09:47:37 -0400, Mike Meyer wrote:

Ben Finney <bi****************@benfinney.id.au> writes:
Joe <di******@lycos.com> wrote:
I'm trying to extract part of html code from a tag to a tag

For tag soup, use BeautifulSoup:
<URL:http://www.crummy.com/software/BeautifulSoup/>

Except he's trying to extract an apparently random part of the file.
BeautifulSoup is a wonderful thing for dealing with X/HTML documents as
structured documents, which is how you want to deal with them most of
the time.

In this case, an re works nicely:
>>> import re
>>> s = '<span class="boldyellow"><B><U> and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
>>> r = re.match('<span class="boldyellow"><B><U>(.*)TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', s)
>>> r.group(1)
' and ends with '

String.find also works really well:

>>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
>>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
>>> s[start:stop]
' and ends with '

Not a lot to choose between them.

<mike

Oct 20 '05 #4
