467,073 Members | 1,262 Online
Bytes | Developer Community
Ask Question

Home New Posts Topics Members FAQ

Post your question to a community of 467,073 developers. It's quick & easy.

HTML parsing/scraping & python

We are looking to select the language & toolset more suitable for a
project that requires getting data from several web-sites in real-
time....html parsing/scraping. It would require full emulation of the
browser, including handling cookies, automated logins & following
multiple web-link paths. Multiple threading would be a plus but not
requirement.

Some solutions were suggested:

Perl:

LWP::Simple
WWW::Mechanize
HTML::Parser

Curl & libcurl:

Can you suggest solutions for python? Pros & Cons using Perl vs. Python?
Why Python?

Pointers to various other tools & their comparisons with python
solutions will be most appreciated. Anyone who is knowledgeable about
the application subject, please do share your knowledge to help us do
this right.

With best regards.
Sanjay.

Dec 1 '05 #1
  • viewed: 2570
Share:
3 Replies
Sanjay Arora <sa************@gmail.com> writes:
We are looking to select the language & toolset more suitable for a
project that requires getting data from several web-sites in real-
time....html parsing/scraping. It would require full emulation of the
browser, including handling cookies, automated logins & following
multiple web-link paths. Multiple threading would be a plus but not
requirement.
Believe it or not, everything you ask for can be done by Python out of
the box. But there are limitations.

For one, the HTML parsing module that comes with Python doesn't handle
invalid HTML very well. Thanks to Netscape, invalid HTML is the rule
rather than the exception on the web. So you probably want to use a
third party module for that. I use BeautifulSoup, which handles XML,
HTML, has a *lovely* API (going from BeautifulSoup to DOM is always a
major dissapointment), and works well with broken X/HTML.

That sufficient for my needs, but I haven't been asked to do a lot of
automated form filling, so the facilities in the standard library work
for me. There are third party tools to help with that. I'm sure
someone willsuggest them.
Can you suggest solutions for python? Pros & Cons using Perl vs. Python?
Why Python?


Because it's beautiful. Seriously, Python code is very readable, by
design. Of course, some of the features that make that happen drive
some people crazy. If you're one of them, then Python isn't the
language for you.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 1 '05 #2
The standard library module for fetching HTML is urllib2.

The best module for scraping the HTML is BeautifulSoup.

There is a project called mechanize, built by John Lee on top of
urllib2 and other standard modules.

It will emulate a browsers behaviour - including history, cookies,
basic authentication, etc.

There are several modules for automated form filling - FormEncode being
one.

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Dec 1 '05 #3
"Fuzzyman" <fu******@gmail.com> writes:
The standard library module for fetching HTML is urllib2.
Does urllib2 replace everything in urllib? I thought there was some
urllib functionality that urllib2 didn't do.
There is a project called mechanize, built by John Lee on top of
urllib2 and other standard modules.
It will emulate a browsers behaviour - including history, cookies,
basic authentication, etc.


urllib2 handles cookies and authentication. I use those features
daily. I'm not sure history would apply, unless you're also handling
javascript. Is there some other way to ask the browser to go back in
history?

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 1 '05 #4

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

6 posts views Thread by Tuang | last post: by
4 posts views Thread by David Jones | last post: by
1 post views Thread by mustafa | last post: by
2 posts views Thread by George Durzi | last post: by
1 post views Thread by pigge79@yahoo.com | last post: by
5 posts views Thread by mailtogops@gmail.com | last post: by
2 posts views Thread by s. d. rose | last post: by
5 posts views Thread by mtuller | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.