
HTML Parsing and Indexing

mailtogops@gmail.com
#1: Nov 13 '06
Hi All,

I am involved in a project which aims to collect news
information published on selected, known web sites in the format of
HTML, RSS, etc., shortlist it, and create a bookmark on our website
for the news content (we will use Django for web development). Currently
this project is under heavy development.

I need help with an HTML parser.

I can download the web pages from the target sites. Then I have to start
parsing. Since they are all HTML pages with different styles and tags,
it is very hard for me to parse the data. So our plan is to have one or
more rules for each website and run the parser based on those rules. We
can even write a small amount of code for each web site if required. But
the crawler, parser and indexer need to run unattended. I don't know how
to proceed.
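
As an illustration of the per-site rule idea described above, here is a minimal sketch that uses only Python's standard-library html.parser. The site names, the SITE_RULES table, and the RuleExtractor and extract names are all hypothetical and do not come from this thread; a real rule set would probably need more than a tag name and a class.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: which tag/class holds a headline on each site.
SITE_RULES = {
    "example-news.com": {"tag": "h2", "class": "headline"},
    "example-daily.org": {"tag": "a", "class": "story-link"},
}


class RuleExtractor(HTMLParser):
    """Collect the text of every element matching one site's rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.depth = 0      # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested tags of the same name so we stop at the right close.
            if tag == self.rule["tag"]:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.rule["tag"] and self.rule["class"] in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.rule["tag"]:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(site, html):
    """Apply the rule registered for `site` to an HTML string."""
    parser = RuleExtractor(SITE_RULES[site])
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]
```

With this shape, supporting a new site means adding one entry to the table, and the crawler can call extract(site, html) unattended.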

I saw a couple of Python parsers like pyparsing, yappy, yapps, etc. but
they haven't given any example for HTML parsing. Someone recommended
using "lynx" to convert the page into text and then parse the data. That
also looks good, but I still end up writing a huge chunk of code for
each web page.

What we need is:

One nice parser which works on an HTML/text file (lynx output),
works based on certain rules, and returns us a result (do I need magic to
do this? :-( )

Sorry about my English.

Thanks & Regards,

Krish


Fredrik Lundh
#2: Nov 13 '06

re: HTML Parsing and Indexing


mailtogops@gmail.com wrote:
Quote:
> I need help with an HTML parser.
http://www.effbot.org/pyfaq/tutor-ho...ut-of-html.htm

</F>

Bernard
#3: Nov 13 '06

re: HTML Parsing and Indexing


A combination of urllib, urllib2 and BeautifulSoup should do it.
Read BeautifulSoup's documentation to learn how to browse through the
DOM.
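
A minimal sketch of that combination, assuming Python 3 (where urllib2 was merged into urllib.request) and the bs4 package. The fetch and headlines helpers and the h2/"headline" markup are assumptions for illustration; each real site would need its own selectors.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def headlines(html, tag="h2", css_class="headline"):
    """Return (text, href) pairs for the first link inside each matching element."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for node in soup.find_all(tag, class_=css_class):
        link = node.find("a", href=True)
        if link is not None:
            found.append((link.get_text(strip=True), link["href"]))
    return found
```

Typical use would be headlines(fetch(url)), with the tag and class arguments taken from whatever per-site rule applies.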

Andy Dingley
#4: Nov 13 '06

re: HTML Parsing and Indexing



mailtogops@gmail.com wrote:
Quote:
> I am involved in a project which aims to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc.
I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear. Are you _sure_ there's
still a need to do this thoroughly awkward task? How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?
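
To illustrate why the RSS case is the easy one, here is a minimal sketch using the standard-library xml.etree.ElementTree. It assumes a plain RSS 2.0 feed (no Atom, no namespaced elements); the rss_items name is hypothetical.

```python
import xml.etree.ElementTree as ET


def rss_items(xml_text):
    """Return a list of {title, link} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is missing.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items
```

The same function works for every feed that follows the format, which is the contrast with HTML, where each site needs its own extraction rules.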

Stefan Behnel
#5: Nov 14 '06

re: HTML Parsing and Indexing


mailtogops@gmail.com wrote:
Quote:
> I am involved in a project which aims to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc., shortlist it, and create a bookmark on our website
> for the news content (we will use Django for web development). Currently
> this project is under heavy development.
>
> I need help with an HTML parser.
lxml includes an HTML parser which can parse straight from URLs.

http://codespeak.net/lxml/
http://cheeseshop.python.org/pypi/lxml
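
A minimal sketch of what this could look like with the lxml.html module. The helper names and the XPath expression are assumptions for illustration. lxml.html.parse() accepts a filename or URL, but URL loading is delegated to the underlying libxml2 library and may not cover HTTPS, so downloading the page separately and using fromstring() may be the more reliable route.

```python
import lxml.html


def links_from_html(html, xpath="//h2[@class='headline']/a"):
    """Return (text, href) pairs for anchors selected by an XPath rule."""
    doc = lxml.html.fromstring(html)
    return [(a.text_content().strip(), a.get("href")) for a in doc.xpath(xpath)]


def links_from_url(url, xpath="//h2[@class='headline']/a"):
    """Same as above, but let lxml fetch and parse the URL itself."""
    doc = lxml.html.parse(url).getroot()
    return [(a.text_content().strip(), a.get("href")) for a in doc.xpath(xpath)]
```

Because the rule is just an XPath string, one expression per site could be stored as configuration rather than code.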

Stefan
Paul McGuire
#6: Nov 16 '06

re: HTML Parsing and Indexing


On Nov 13, 1:12 pm, mailtog...@gmail.com wrote:
Quote:
> I need help with an HTML parser.
<snip>
Quote:
> I saw a couple of Python parsers like pyparsing, yappy, yapps, etc. but
> they haven't given any example for HTML parsing.
Geez, how hard did you look? pyparsing's wiki menu includes an
'Examples' link, which takes you to a page of examples, including 3
having to do with scraping HTML. You can view the examples right in
the wiki, without even having to download the package (of course, you
*would* have to download it to actually run the examples).
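
For readers who cannot reach the wiki, here is a minimal sketch in the same spirit, not a copy of any of those examples. It uses pyparsing's makeHTMLTags and SkipTo helpers (pyparsing 3 also offers snake_case spellings such as make_html_tags); the scrape_links name and the link_text result name are assumptions.

```python
from pyparsing import SkipTo, makeHTMLTags

# Start/end expressions for <a> tags; tag attributes become named results.
a_start, a_end = makeHTMLTags("a")
link = a_start + SkipTo(a_end)("link_text") + a_end


def scrape_links(html):
    """Return (text, href) pairs for every <a ...>...</a> found in html."""
    found = []
    # scanString skips over anything that does not match the expression.
    for tokens, _start, _end in link.scanString(html):
        found.append((tokens.link_text.strip(), tokens.href))
    return found
```

Since scanString ignores non-matching text, the surrounding page markup does not have to be described at all, only the pattern of interest.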

-- Paul
