Hi everyone,
I am trying to write a simple web crawler. It has to do the following:
1) Open a URL and retrieve an HTML file.
2) Extract news headlines from the HTML file.
3) Put the headlines into an RSS file.
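To make steps 1 and 3 concrete, here is a rough sketch of the parts I think I understand. It is in Python rather than PHP only to keep it short. The function names fetch_html and build_rss are my own, and the list of (title, link) pairs is assumed to come from step 2, which is the part I am stuck on.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_html(url, timeout=10):
    """Step 1: download the page and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_rss(channel_title, channel_link, headlines):
    """Step 3: turn (title, link) pairs into an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Headlines from " + channel_link
    for title, link in headlines:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(rss, encoding="unicode")
```

Building the feed through an XML library instead of string concatenation should keep the output well-formed even when a headline contains characters like "&".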
For example, I want to go to this site and extract the headlines:
www.unstrung.com/section.asp?section_id=86
The problem is that I do not know how to extract a headline from an
HTML file.
HTML is not as strictly structured as XML, so I do not really know how
to solve this problem. I notice that PHP has URL functions that deal
with HTML files. For example, get_meta_tags() extracts the meta tag
content attributes from an HTML file. But extracting meta tags is easy,
because they always sit in a known place. With headlines, I do not know
where they are in an HTML file. Could anyone give me some input on
this?
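The only idea I have so far is a heuristic: treat every link that carries a particular CSS class as a headline. Below is a minimal sketch of that idea in Python (not PHP), using the standard html.parser module. The class name "headline" is purely a guess that I have not checked against the real page, and HeadlineParser and extract_headlines are names I made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadlineParser(HTMLParser):
    """Collect (text, absolute URL) for <a> tags that carry a guessed class."""

    def __init__(self, base_url, headline_class="headline"):
        super().__init__()
        self.base_url = base_url
        self.headline_class = headline_class  # hypothetical class name
        self.headlines = []
        self._href = None  # set only while inside a matching <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.headline_class in classes and attrs.get("href"):
            # Resolve relative links against the page URL.
            self._href = urljoin(self.base_url, attrs["href"])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            if title:
                self.headlines.append((title, self._href))
            self._href = None


def extract_headlines(html, base_url, headline_class="headline"):
    parser = HeadlineParser(base_url, headline_class)
    parser.feed(html)
    parser.close()
    return parser.headlines
```

The obvious weakness is that the rule has to be written per site and breaks whenever the page layout changes, so I would be glad to hear of a more general approach.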
This cannot be an impossible problem. Google News
(http://news.google.com/) crawls the web and sorts the headlines it
finds on its own site.
Thanks,
P. Ho