HTML Parsing and Indexing

Hi All,

I am involved in a project that collects news published on selected,
known web sites in formats such as HTML and RSS, shortlists the items,
and creates bookmarks to the news content on our website (we will use
Django for web development). The project is currently under heavy
development.

I need help with an HTML parser.

I can download the web pages from the target sites. Then I have to
parse them. Since they are all HTML pages with different styles and
tags, it is very hard for me to parse the data. So our plan is to have
one or more rules for each website and run the parser based on those
rules. We can even write a small amount of code for each web site if
required. But the crawler, parser and indexer need to run unattended. I
don't know how to proceed next.

I saw a couple of Python parsers like pyparsing, yappy, yapps, etc., but
they haven't given any examples of HTML parsing. Someone recommended
using "lynx" to convert the page into text and then parsing that. That
also looks good, but I would still end up writing a huge chunk of code
for each web page.

What we need is:

One nice parser that works on an HTML or text file (lynx output),
applies certain rules, and returns a result (do I need magic to
do this? :-( )

Sorry about my English.

Thanks & Regards,

Krish

Nov 13 '06 #1
ma********@gmail.com wrote:
> I need help with an HTML parser.

http://www.effbot.org/pyfaq/tutor-ho...ut-of-html.htm

</F>

Nov 13 '06 #2
A combination of urllib, urllib2 and BeautifulSoup should do it.
Read BeautifulSoup's documentation to know how to browse through the
DOM.
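
To make the "one or more rules for each website" idea from the question concrete, here is a minimal sketch. It uses only the standard library's html.parser instead of BeautifulSoup, so it is a stand-in for the approach suggested above, not the same thing. The site names, tag names and class names in SITE_RULES are invented for illustration; real sites would each need their own entries.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: which tag/class wraps a headline on each site.
SITE_RULES = {
    "example-news": {"tag": "h2", "class": "headline"},
    "example-blog": {"tag": "a", "class": "post-title"},
}


class RuleExtractor(HTMLParser):
    """Collect the text of every element matching one (tag, class) rule."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0        # > 0 while inside a matching element
        self.buffer = []      # text fragments of the current match
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested tags of the same name so we stop at the right close.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract(site, html):
    """Apply the rule registered for `site` to one downloaded page."""
    rule = SITE_RULES[site]
    parser = RuleExtractor(rule["tag"], rule["class"])
    parser.feed(html)
    parser.close()
    return parser.results


sample = """
<html><body>
  <h2 class="headline">First <b>story</b></h2>
  <h2 class="other">Not a headline</h2>
  <h2 class="headline big">Second story</h2>
</body></html>
"""
headlines = extract("example-news", sample)
print(headlines)
```

Adding a site then means adding one dictionary entry rather than new code, which is what lets the crawler run unattended. Pages with badly broken markup are where a more forgiving parser such as BeautifulSoup is likely to be the better choice.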

Nov 13 '06 #3

ma********@gmail.com wrote:
> I am involved in one project which tends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear. Are you _sure_ there's
still a need to do this thoroughly awkward task? How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?
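
As a rough illustration of why the RSS case is the easy one, the sketch below pulls titles and links out of a feed with the standard library's xml.etree.ElementTree. It assumes plain RSS 2.0 with un-namespaced item, title and link elements; Atom feeds use different element names and would need different lookups. The sample feed and its URLs are made up.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document; a real crawler would download this text instead.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def parse_rss(xml_text):
    """Return a list of (title, link) pairs, one per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext returns None when the child element is missing.
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
    return items


entries = parse_rss(SAMPLE_RSS)
for title, link in entries:
    print(title, link)
```

The same function works for any site that publishes this feed shape, so no per-site rules are needed, unlike the HTML case.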

Nov 13 '06 #4
ma********@gmail.com wrote:
> I am involved in one project which tends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc and shortlist them and create a bookmark on our website
> for the news content (we will use Django for web development). Currently
> this project is under heavy development.
>
> I need help with an HTML parser.

lxml includes an HTML parser which can parse straight from URLs.

http://codespeak.net/lxml/
http://cheeseshop.python.org/pypi/lxml

Stefan
Nov 14 '06 #5
On Nov 13, 1:12 pm, mailtog...@gmail.com wrote:
> I need help with an HTML parser.
> <snip>
> I saw a couple of Python parsers like pyparsing, yappy, yapps, etc. but
> they haven't given any example for HTML parsing.

Geez, how hard did you look? pyparsing's wiki menu includes an
'Examples' link, which takes you to a page of examples, including 3
having to do with scraping HTML. You can view the examples right in
the wiki, without even having to download the package (of course, you
*would* have to download it to actually run the examples).

-- Paul

Nov 16 '06 #6
