Extract Title from HTML documents

Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

p = MyParser()
p.feed(file('test.html').read())
p.close()
print p.title.strip()

---------------- END SCRIPT -------------------------
Many thanks in advance!

Best regards,
Nickolay Kolev
Jul 18 '05 #1
You may find BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/)
useful.

from BeautifulSoup import BeautifulSoup
b = BeautifulSoup()
b.feed(file('test.html').read())
print b.first('title').renderContents()

HTH

--
Anakim Border
http://pydc.sourceforge.net
ab*****@users.sourceforge.net
Jul 18 '05 #2
Nickolay Kolev <nm*****@uni-bonn.de> writes:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

I'm pretty sure the trailing "+ ' '" is wrong. At least I never needed
it when I was using sgmllib for this kind of thing.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jul 18 '05 #3
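
A note for anyone reading this on Python 3: sgmllib was removed there, but the same idea can be sketched with html.parser.HTMLParser from the standard library. The names TitleParser and extract_title below are made up for illustration and do not come from the thread. Following Mike's remark, the pieces are joined without the extra ' ', since a parser may hand over a single title in several handle_data calls.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.done = False      # set once the first </title> has been seen
        self.pieces = []       # text chunks, in the order they arrive

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "title" and not self.done:
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.inside_title:
            self.inside_title = False
            self.done = True

    def handle_data(self, data):
        if self.inside_title:
            self.pieces.append(data)

    @property
    def title(self):
        # Plain concatenation: no separator is inserted between chunks.
        return "".join(self.pieces).strip()


def extract_title(html_text):
    """Hypothetical helper: parse a whole document and return its title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

Something like extract_title(open('test.html').read()) would then replace the last four lines of the original script. Collecting the chunks in a list and joining them once at the end also avoids repeated string concatenation.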
Nickolay Kolev wrote:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?


You only need one tag here, so using a regex is ok.

import re

linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
match = linkPattern.search(source)
if match is None:
    result = ''
else:
    result = match.group(0)

If you need more than just the title I would definitely go with
BeautifulSoup.

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
Jul 18 '05 #4
Max M <ma**@mxm.dk> writes:
Nickolay Kolev wrote:
Hi all,
I am looking for a way to extract the titles of HTML documents. I
have made an honest attempt at doing it, and it even works. Is there
an easier (faster / more efficient / clearer) way?
You only need one tag here, so using a regex is ok.

linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
                                             ^^^^
Shouldn't that be </title>?

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jul 18 '05 #5
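
Putting Mike's correction together with Max's suggestion, a regex-only version might look like the sketch below. It assumes the page has already been read into a string. The names TITLE_RE and extract_title are invented for this example. The pattern closes on </title>, the function returns the captured text rather than the whole match, and the no-match case yields an empty string.

```python
import re

# Non-greedy match of everything between <title ...> and </title>.
# re.I ignores the case of the tag; re.S lets '.' span newlines in the title.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.I | re.S)


def extract_title(source):
    """Return the text of the first <title> element, or '' if there is none."""
    match = TITLE_RE.search(source)
    if match is None:
        return ""
    # group(1) is the text between the tags; group(0) would include the tags.
    # split/join collapses runs of whitespace, including line breaks.
    return " ".join(match.group(1).split())
```

This does not decode entities such as &amp;amp; and can be fooled by a commented-out title, so for anything beyond the title a real parser is probably the safer choice, as Max says.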
Nickolay Kolev wrote:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?


You might try XIST (http://www.livinglogic.de/Python/xist):
---
from ll.xist import parsers, xfind
from ll.xist.ns import html

e = parsers.parseFile("test.html", tidy=True)
print unicode(xfind.first(e//html.title))
---
(This uses libxml2's HTML parser internally).

Bye,
Walter Dörwald

Jul 18 '05 #6
