Extract Title from HTML documents

Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

p = MyParser()
p.feed(file('test.html').read())
p.close()
print p.title.strip()

---------------- END SCRIPT -------------------------
Many thanks in advance!

Best regards,
Nickolay Kolev
Jul 18 '05 #1
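
The script above depends on sgmllib, which exists only in Python 2. A minimal sketch of the same idea for Python 3 follows, assuming the standard-library html.parser module is an acceptable stand-in. The names TitleParser and extract_title are illustrative and do not come from the thread.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.done = False      # set once the first <title> has been closed
        self.chunks = []       # pieces of title text, in arrival order

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.inside_title:
            self.inside_title = False
            self.done = True

    def handle_data(self, data):
        if self.inside_title:
            self.chunks.append(data)

    @property
    def title(self):
        # Join the pieces as they arrived, then collapse runs of whitespace.
        return " ".join("".join(self.chunks).split())


def extract_title(html_text):
    """Return the document title, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

Usage would be along the lines of extract_title(open('test.html', encoding='utf-8').read()), where the encoding is an assumption about the file.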
You may find BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/)
useful.

from BeautifulSoup import BeautifulSoup
b = BeautifulSoup()
b.feed(file('test.html').read())
print b.first('title').renderContents()

HTH

--
Anakim Border
http://pydc.sourceforge.net
ab*****@users.sourceforge.net
Jul 18 '05 #2
Nickolay Kolev <nm*****@uni-bonn.de> writes:
> def handle_data(self, data):
>     if self.inside_title and data:
>         self.title = self.title + data + ' '

I'm pretty sure the trailing "+ ' '" is wrong. At least I never needed
it when I was using sgmllib for this kind of thing.

<mike

--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jul 18 '05 #3
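
A small illustration of why the trailing "+ ' '" can be harmful, assuming the parser hands the title text to handle_data in several pieces, for example around an entity reference. The chunk list below is a hypothetical delivery order, not captured parser output, and the function names are illustrative.

```python
# Hypothetical pieces for the title "AT&amp;T News":
# text, the decoded entity, then the remaining text.
chunks = ["AT", "&", "T News"]


def join_with_spaces(pieces):
    """Mimic the original handle_data: append every piece plus a space."""
    title = ""
    for piece in pieces:
        title = title + piece + " "
    return title.strip()


def join_plain(pieces):
    """Concatenate the pieces exactly as they arrived."""
    return "".join(pieces).strip()


print(join_with_spaces(chunks))  # AT & T News
print(join_plain(chunks))        # AT&T News
```

Under that assumption, the extra space changes the title itself, which is why plain concatenation is the safer default.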
Nickolay Kolev wrote:
> Hi all,
>
> I am looking for a way to extract the titles of HTML documents. I have
> made an honest attempt at doing it, and it even works. Is there an
> easier (faster / more efficient / clearer) way?

You only need one tag here, so using a regex is ok.

import re

linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
match = linkPattern.search(source)
if match is None:
    result = ''
else:
    result = match.group(0)

If you need more than just the title I would definitely go with
BeautifulSoup.

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
Jul 18 '05 #4
Max M <ma**@mxm.dk> writes:
> linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
                                               ^^^^
Shouldn't that be </title>?

<mike

--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jul 18 '05 #5
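
Putting the correction above together with a guard for the no-match case, a regex version might look like the sketch below. It is written for Python 3, and the names TITLE_RE and title_from_regex are illustrative. Decoding entities with html.unescape is an addition that the thread does not discuss.

```python
import re
from html import unescape

# Match <title ...>text</title>, case-insensitively and across newlines.
# Group 1 is the text between the tags.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>",
                      re.IGNORECASE | re.DOTALL)


def title_from_regex(source):
    """Return the first title found in source, or '' if there is none."""
    match = TITLE_RE.search(source)
    if match is None:
        return ""
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(unescape(match.group(1)).split())
```

A regex like this is only a reasonable shortcut for the single-tag case. It knows nothing about comments or scripts that happen to contain the text "<title>".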
Nickolay Kolev wrote:
> Hi all,
>
> I am looking for a way to extract the titles of HTML documents. I have
> made an honest attempt at doing it, and it even works. Is there an
> easier (faster / more efficient / clearer) way?

You might try XIST (http://www.livinglogic.de/Python/xist):
---
from ll.xist import parsers, xfind
from ll.xist.ns import html

e = parsers.parseFile("test.html", tidy=True)
print unicode(xfind.first(e//html.title))
---
(This uses libxml2's HTML parser internally).

Bye,
Walter Dörwald

Jul 18 '05 #6
