Bytes IT Community

HTML Parsing and Indexing

Hi All,

I am involved in a project that collects news information published on
selected, known web sites in the form of HTML, RSS, etc., shortlists it,
and creates bookmarks on our website for the news content (we will use
Django for the web development). The project is currently under heavy
development.

I need help with an HTML parser.

I can download the web pages from the target sites; then I have to
parse them. Since they are all HTML pages with different styles and
tags, it is very hard to extract the data. So our plan is to have one
or more rules for each website and parse based on those rules. We can
even write a small amount of code for each site if required, but the
crawler, parser, and indexer need to run unattended. I don't know how
to proceed next.

I looked at a couple of Python parsers such as pyparsing, yappy, and
yapps, but they don't give any examples of HTML parsing. Someone
recommended using "lynx" to convert each page to text and parsing that.
That also looks good, but I would still end up writing a huge chunk of
code for each web page.

What we need is a parser that works on an HTML or text file (lynx
output), applies a given set of rules, and returns the results. (Do I
need magic to do this? :-( )
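The "one rule per web site" idea above can be sketched as a dispatch table: each known site gets its own small extraction function, and the unattended crawler looks the function up by hostname. Everything here (`RULES`, `extract_example`, the return format) is hypothetical, just to illustrate the shape:

```python
from urllib.parse import urlparse

def extract_example(html):
    # Site-specific extraction logic would go here; for the sketch we
    # just return a fixed (title, link) pair.
    return [("Sample headline", "http://example.com/story")]

# One rule (a small function) per known site, keyed by hostname.
RULES = {
    "example.com": extract_example,
}

def parse_page(url, html):
    host = (urlparse(url).hostname or "").removeprefix("www.")
    rule = RULES.get(host)
    if rule is None:
        return []  # unknown site: skip it rather than crash unattended
    return rule(html)
```

Unknown sites fall through to an empty result, which is what lets the crawler keep running unattended when it meets a page it has no rule for.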

Sorry about my English.

Thanks & Regards,

Krish

Nov 13 '06 #1
5 Replies


ma********@gmail.com wrote:
I need help with an HTML parser.
http://www.effbot.org/pyfaq/tutor-ho...ut-of-html.htm

</F>

Nov 13 '06 #2

A combination of urllib, urllib2, and BeautifulSoup should do it.
Read BeautifulSoup's documentation to learn how to browse through the
DOM.
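In today's terms that suggestion might look like the following (a sketch using BeautifulSoup 4, which postdates this thread; the `div.news` class and the markup are invented for illustration):

```python
from bs4 import BeautifulSoup

doc = """
<div class="news">
  <a href="/story1">First headline</a>
  <a href="/story2">Second headline</a>
</div>
"""

# Parse the snippet and pull out (text, href) pairs for the headlines.
soup = BeautifulSoup(doc, "html.parser")
links = [(a.get_text(), a["href"]) for a in soup.select("div.news a")]
print(links)  # [('First headline', '/story1'), ('Second headline', '/story2')]
```

In practice the `doc` string would come from `urllib`, and the CSS selector is what changes per site.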

ma********@gmail.com wrote:
> [original question snipped]
Nov 13 '06 #3


ma********@gmail.com wrote:
I am involved in a project that collects news
information published on selected, known web sites in the form of
HTML, RSS, etc.
I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear. Are you _sure_ there's
still a need to do this thoroughly awkward task? How many sites are
there that are worth scraping, permit scraping, and don't yet offer
RSS?
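For comparison, the RSS side really is close to trivial. A sketch using only the standard library's `xml.etree` (assuming a well-formed RSS 2.0 feed; a real crawler would fetch the XML with urllib, and the feed content here is invented):

```python
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Story one</title><link>http://example.com/1</link></item>
  <item><title>Story two</title><link>http://example.com/2</link></item>
</channel></rss>"""

# Every story is an <item> with a <title> and a <link> -- no per-site
# rules needed, which is the whole argument for preferring RSS.
root = ET.fromstring(rss)
items = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]
print(items)  # [('Story one', 'http://example.com/1'), ('Story two', 'http://example.com/2')]
```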

Nov 13 '06 #4

ma********@gmail.com wrote:
I am involved in a project that collects news
information published on selected, known web sites in the form of
HTML, RSS, etc., shortlists it, and creates bookmarks on our website
for the news content (we will use Django for web development). The
project is currently under heavy development.

I need help with an HTML parser.
lxml includes an HTML parser which can parse straight from URLs.

http://codespeak.net/lxml/
http://cheeseshop.python.org/pypi/lxml
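A small sketch of what that looks like with lxml's HTML support (`lxml.html.parse(url)` reads straight from a URL; here the page is an inline string so the example is self-contained, and the `class="story"` markup is invented):

```python
from lxml import html

# Parse an HTML fragment; fromstring() tolerates real-world tag soup.
page = html.fromstring(
    '<html><body><h1>News</h1>'
    '<a class="story" href="/a">Alpha</a> '
    '<a class="story" href="/b">Beta</a></body></html>'
)

# XPath expressions are one natural way to express per-site rules.
stories = [(a.text, a.get("href"))
           for a in page.xpath('//a[@class="story"]')]
print(stories)  # [('Alpha', '/a'), ('Beta', '/b')]
```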

Stefan
Nov 14 '06 #5

On Nov 13, 1:12 pm, mailtog...@gmail.com wrote:
>
I need help with an HTML parser.
<snip>
>
I looked at a couple of Python parsers such as pyparsing, yappy, and
yapps, but they don't give any examples of HTML parsing.
Geez, how hard did you look? pyparsing's wiki menu includes an
'Examples' link, which takes you to a page of examples, including 3
having to do with scraping HTML. You can view the examples right in
the wiki without even having to download the package (of course, you
*would* have to download it to actually run the examples).
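For reference, the kind of HTML scraping those examples cover can be quite short with pyparsing's `makeHTMLTags` helper, which tolerates case and attribute variations (the markup below is invented for illustration):

```python
import pyparsing as pp

# makeHTMLTags returns matchers for the opening and closing tag;
# the opening-tag match exposes attributes like href as named results.
a_start, a_end = pp.makeHTMLTags("a")
link = a_start + pp.SkipTo(a_end)("text") + a_end

page = "<p>See <A HREF='/news/1'>Today's news</a> and <a href='/news/2'>more</a>.</p>"

# scanString finds every anchor, even with the inconsistent casing above.
hits = [(t.text, t.href) for t, _, _ in link.scanString(page)]
print(hits)  # [("Today's news", '/news/1'), ('more', '/news/2')]
```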

-- Paul

Nov 16 '06 #6

This discussion thread is closed
