Hi all,
I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?
------------ START SCRIPT --------------------
#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

p = MyParser()
p.feed(file('test.html').read())
p.close()
print p.title.strip()
---------------- END SCRIPT -------------------------
Many thanks in advance!
Best regards,
Nickolay Kolev
Nickolay Kolev <nm*****@uni-bonn.de> writes:

Hi all,
I am looking for a way to extract the titles of HTML documents. I have made an honest attempt at doing it, and it even works. Is there an easier (faster / more efficient / clearer) way?
------------ START SCRIPT --------------------
#!/usr/bin/python
import sgmllib
class MyParser(sgmllib.SGMLParser):
    inside_title = False
    title = ''
    def start_title(self, attrs):
        self.inside_title = True
    def end_title(self):
        self.inside_title = False
    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

I'm pretty sure the trailing "+ ' '" is wrong. At least I never needed
it when I was using sgmllib for this kind of thing.

<mike

p = MyParser()
p.feed(file('test.html').read())
p.close()
print p.title.strip()
---------------- END SCRIPT -------------------------
Many thanks in advance!
Best regards,
Nickolay Kolev
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
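
To illustrate Mike's remark: a parser may hand the title text to handle_data in several pieces, so appending a space after every piece can put a space in the middle of a word. The sketch below is an illustration under stated assumptions, not the original poster's code. It uses Python 3's html.parser instead of sgmllib (which was removed in Python 3), and the names TitleParser and extract_title are made up for this example. It collects the pieces unchanged and normalizes whitespace once at the end.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.parts = []  # raw text pieces, kept exactly as delivered

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        if tag == "title":
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.inside_title = False

    def handle_data(self, data):
        # May be called more than once for a single title,
        # so no separator is added between pieces.
        if self.inside_title:
            self.parts.append(data)

    @property
    def title(self):
        # Join the pieces as-is, then collapse runs of whitespace.
        return " ".join("".join(self.parts).split())


def extract_title(html_text):
    """Hypothetical helper: return the title of an HTML string, or ''."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

Feeding the document in two chunks that split a word still yields the word intact, which is the case the trailing "+ ' '" would get wrong.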
Nickolay Kolev wrote: Hi all,
I am looking for a way to extract the titles of HTML documents. I have made an honest attempt at doing it, and it even works. Is there an easier (faster / more efficient / clearer) way?
You only need one tag here, so using a regex is OK.
import re
linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
match = linkPattern.search(source)
if match is None:
    result = ''
else:
    result = match.group(0)
If you need more than just the title I would definitely go with
BeautifulSoup.
--
hilsen/regards Max M, Denmark http://www.mxm.dk/
IT's Mad Science
Max M <ma**@mxm.dk> writes:

Nickolay Kolev wrote:
Hi all, I am looking for a way to extract the titles of HTML documents. I have made an honest attempt at doing it, and it even works. Is there an easier (faster / more efficient / clearer) way?

You only need one tag here, so using a regex is OK.
linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
                                             ^^^^
Shouldn't that be </title>?

<mike

match = linkPattern.search(source)
if match is None:
    result = ''
else:
    result = match.group(0)
If you need more than just the title I would definitely go with BeautifulSoup.
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
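
Following Mike's correction, a version of the regex approach that stops at </title> might look like the sketch below. It is not Max's exact code: the names TITLE_RE and title_from_regex are hypothetical, it returns only the text between the tags (group 1) rather than the whole match, and it does not decode entities such as &amp;.

```python
import re

# Non-greedy match between the opening and closing title tags.
# re.IGNORECASE accepts <TITLE>, re.DOTALL lets "." span line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_regex(source):
    """Return the title text of an HTML string, or '' if none is found."""
    match = TITLE_RE.search(source)
    if match is None:
        return ""
    # Collapse line breaks and repeated spaces inside the title.
    return " ".join(match.group(1).split())
```

This is fine for pulling one tag out of reasonably well-formed pages. For anything beyond the title, a real parser is likely the safer choice, as suggested above.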
Nickolay Kolev wrote: Hi all,
I am looking for a way to extract the titles of HTML documents. I have made an honest attempt at doing it, and it even works. Is there an easier (faster / more efficient / clearer) way?
You might try XIST (http://www.livinglogic.de/Python/xist):
---
from ll.xist import parsers, xfind
from ll.xist.ns import html
e = parsers.parseFile("test.html", tidy=True)
print unicode(xfind.first(e//html.title))
---
(This uses libxml2's HTML parser internally).
Bye,
Walter Dörwald