HTML parsing/scraping & python

Sanjay Arora

We are looking to select the language & toolset more suitable for a
project that requires getting data from several web-sites in real-
time....html parsing/scraping. It would require full emulation of the
browser, including handling cookies, automated logins & following
multiple web-link paths. Multiple threading would be a plus but not
requirement.

Some solutions were suggested:

Perl:

LWP::Simple
WWW::Mechanize
HTML::Parser

Curl & libcurl:

Can you suggest solutions for python? Pros & Cons using Perl vs. Python?
Why Python?

Pointers to various other tools & their comparisons with python
solutions will be most appreciated. Anyone who is knowledgeable about
the application subject, please do share your knowledge to help us do
this right.

With best regards.
Sanjay.

Dec 1 '05 #1

Subscribe Post Reply

2865

Mike Meyer

Sanjay Arora <sa************@gmail.com> writes:

We are looking to select the language & toolset more suitable for a
project that requires getting data from several web-sites in real-
time....html parsing/scraping. It would require full emulation of the
browser, including handling cookies, automated logins & following
multiple web-link paths. Multiple threading would be a plus but not
requirement.
Believe it or not, everything you ask for can be done by Python out of
the box. But there are limitations.

For one, the HTML parsing module that comes with Python doesn't handle
invalid HTML very well. Thanks to Netscape, invalid HTML is the rule
rather than the exception on the web. So you probably want to use a
third party module for that. I use BeautifulSoup, which handles XML,
HTML, has a *lovely* API (going from BeautifulSoup to DOM is always a
major dissapointment), and works well with broken X/HTML.

That sufficient for my needs, but I haven't been asked to do a lot of
automated form filling, so the facilities in the standard library work
for me. There are third party tools to help with that. I'm sure
someone willsuggest them.
Can you suggest solutions for python? Pros & Cons using Perl vs. Python?
Why Python?

Because it's beautiful. Seriously, Python code is very readable, by
design. Of course, some of the features that make that happen drive
some people crazy. If you're one of them, then Python isn't the
language for you.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Dec 1 '05 #2

Fuzzyman

The standard library module for fetching HTML is urllib2.

The best module for scraping the HTML is BeautifulSoup.

There is a project called mechanize, built by John Lee on top of
urllib2 and other standard modules.

It will emulate a browsers behaviour - including history, cookies,
basic authentication, etc.

There are several modules for automated form filling - FormEncode being
one.

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Dec 1 '05 #3

Mike Meyer

"Fuzzyman" <fu******@gmail.com> writes:

The standard library module for fetching HTML is urllib2.
Does urllib2 replace everything in urllib? I thought there was some
urllib functionality that urllib2 didn't do.
There is a project called mechanize, built by John Lee on top of
urllib2 and other standard modules.
It will emulate a browsers behaviour - including history, cookies,
basic authentication, etc.

urllib2 handles cookies and authentication. I use those features
daily. I'm not sure history would apply, unless you're also handling
javascript. Is there some other way to ask the browser to go back in
history?

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Dec 1 '05 #4

Similar topics

Parsing strings -> numbers

by: Tuang | last post by:

I've been looking all over in the docs, but I can't figure out how you're *supposed* to parse formatted strings into numbers (and other data types, for that matter) in Python. In C#, you can say...

Python

Web Scraping/Site Scraping

by: David Jones | last post by:

Hi, I'm interested in learning about web scraping/site scraping using Python. Does anybody know of some online resources or have any modules that are available to help out. O'Reilly published an...

Python

HTML Scraping??

by: mustafa | last post by:

anyone know some good reliable html scraping (with python) tutorials. i have looked around and found a few. one uses urllib2 and beautifull soap modules for scraping and parsing...

Python

HTML Screen Scraping Q

by: George Durzi | last post by:

I'd like to screen-scrape company news from cbsmarketwatch. Consider this URL as an example: http://cbs.marketwatch.com/tools/quotes/news.asp?symb=MSFT When you browse there, there's two...

ASP.NET

Parsing data from printer file

by: pigge79 | last post by:

Hi there, We are thinking of developing a product that needs to be able to parse data printed via some printer driver. The data will then be used to fill a PDF form. We believe the printer...

Visual Basic .NET

How to read contents of html table with .net?

by: Jim S | last post by:

I have a need to read the contents of an html table on a remote web page into a variable. I guess this is called screen scraping but not sure. I'm not sure where to start or what the best...

.NET Framework

HTML Parsing and Indexing

by: mailtogops | last post by:

Hi All, I am involved in one project which tends to collect news information published on selected, known web sites inthe format of HTML, RSS, etc and sortlist them and create a bookmark on our...

Python

Help extracting info from HTML source ..

by: s. d. rose | last post by:

Hello All. I am learning Python, and have never worked with HTML. However, I would like to write a simple script to audit my 100+ Netware servers via their web portal. I was reading Chapter 8...

Python

Parsing HTML

by: mtuller | last post by:

I am trying to parse a webpage and extract information. I am trying to use pyparser. Here is what I have: from pyparsing import * import urllib # define basic text pattern spanStart =...

Python

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++