HTML Parsing and Indexing

mailtogops

Hi All,

I am involved in one project which tends to collect news
information published on selected, known web sites inthe format of
HTML, RSS, etc and sortlist them and create a bookmark on our website
for the news content(we will use django for web development). Currently
this project is under heavy development.

I need a help on HTML parser.

I can download the web pages from target sites. Then I have to start
doing parsing. Since they all html web pages, they will have different
styles, tags, it is very hard for me to parse the data. So what we plan
is to have one or more rules for each website and run based on rule. We
can even write some small amount of code for each web site if
required. But Crawler, Parser and Indexer need to run unattended. I
don't know how to proceed next..

I saw a couple of python parsers like pyparsing, yappy, yapps, etc but
they havn't given any example for HTML parsing. Someone recommended
using "lynx" to convert the page into the text and parse the data. That
also looks good but still i end of writing a huge chunk of code for
each web page.

What we need is,

One nice parser which should work on HTML/text file (lynx output) and
work based on certain rules and return us a result (Am I need magix to
do this :-( )

Sorry about my english..

Thanks & Regards,

Krish

Nov 13 '06 #1

Subscribe Post Reply

4140

Fredrik Lundh

ma********@gmail.com wrote:

I need a help on HTML parser.

http://www.effbot.org/pyfaq/tutor-ho...ut-of-html.htm

</F>

Nov 13 '06 #2

Bernard

a combination of urllib, urlib2 and BeautifulSoup should do it.
Read BeautifulSoup's documentation to know how to browse through the
DOM.

ma********@gmail.com a écrit :

Hi All,

I am involved in one project which tends to collect news
information published on selected, known web sites inthe format of
HTML, RSS, etc and sortlist them and create a bookmark on our website
for the news content(we will use django for web development). Currently
this project is under heavy development.

I need a help on HTML parser.

I can download the web pages from target sites. Then I have to start
doing parsing. Since they all html web pages, they will have different
styles, tags, it is very hard for me to parse the data. So what we plan
is to have one or more rules for each website and run based on rule. We
can even write some small amount of code for each web site if
required. But Crawler, Parser and Indexer need to run unattended. I
don't know how to proceed next..

I saw a couple of python parsers like pyparsing, yappy, yapps, etc but
they havn't given any example for HTML parsing. Someone recommended
using "lynx" to convert the page into the text and parse the data. That
also looks good but still i end of writing a huge chunk of code for
each web page.

What we need is,

One nice parser which should work on HTML/text file (lynx output) and
work based on certain rules and return us a result (Am I need magix to
do this :-( )

Sorry about my english..

Thanks & Regards,

Krish

Nov 13 '06 #3

Andy Dingley

ma********@gmail.com wrote:

I am involved in one project which tends to collect news
information published on selected, known web sites inthe format of
HTML, RSS, etc

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear. Are you _sure_ there's
still a need to do this thoroughly awkward task? How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS
?

Nov 13 '06 #4

Stefan Behnel

ma********@gmail.com wrote:

I am involved in one project which tends to collect news
information published on selected, known web sites inthe format of
HTML, RSS, etc and sortlist them and create a bookmark on our website
for the news content(we will use django for web development). Currently
this project is under heavy development.

I need a help on HTML parser.

lxml includes an HTML parser which can parse straight from URLs.

http://codespeak.net/lxml/
http://cheeseshop.python.org/pypi/lxml

Stefan

Nov 14 '06 #5

Paul McGuire

On Nov 13, 1:12 pm, mailtog...@gmail.com wrote:

>
I need a help on HTML parser.

<snip>

>
I saw a couple of python parsers like pyparsing, yappy, yapps, etc but
they havn't given any example for HTML parsing.

Geez, how hard did you look? pyparsing's wiki menu includes an
'Examples' link, which take you to a page of examples including 3
having to do with scraping HTML. You can view the examples right in
the wiki, without even having to download the package (of course, you
*would* have to download to actually run the examples).

-- Paul

Nov 16 '06 #6

Similar topics

Help with a Simple Question

by: Terry | last post by:

Hi, This is a newbie's question. I want to preload 4 images and only when all 4 images has been loaded into browser's cache, I want to start a slideshow() function. If images are not completed...

Javascript

html source to prevent web bot searching

by: Ludwig77 | last post by:

I read that there are some tags that can be entered in a web page's meta tags in order to prevent web bot searching and indexing of the web page for search engines. What is the tagging that I...

HTML / CSS

execute asp.net file in .html file

by: Rajiv Gupta | last post by:

Hi, We are moving from asp to asp.net. In our existing model we execute the asp by including them in .html files. For example: In abc.html file we include following directive: <!--exec...

ASP.NET

Understanding simplest HTML page

by: Eric Lindsay | last post by:

I have been trying to get a better understanding of simple HTML, but I am finding conflicting information is very common. Not only that, even in what seemed elementary and without any possibility...

HTML / CSS

PHP-Yes, HTML-No --- Why?

by: Lennart Björk | last post by:

Hi All, I have a tiny program: <!doctype HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <title>MyTitle</title> <meta...

PHP

xml parsing innovation

by: cactus | last post by:

Eight years after the invention of XML, DOM and SAX, despite their respective issues, are still the mainstays of application developers. So is it the end of road for XML parsing innovation? ...

.NET Framework

html parsing / regular expressions

by: yonido | last post by:

hello, my goal is to get patterns out of email files - say "message forwarding" patterns (message forwarded from: xx to: yy subject: zz) now lets say there are tons of these patterns (by gmail,...

.NET Framework

parsing an ifstream to get some specific text

by: toton | last post by:

Hi, I have some ascii files, which are having some formatted text. I want to read some section only from the total file. For that what I am doing is indexing the sections (denoted by .START in...

C / C++

convert xhtml back to html

by: Tim Arnold | last post by:

hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to create CHM files. That application really hates xhtml, so I need to convert self-ending tags (e.g. <br />) to plain html...

Python

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing