How do I enter/receive webpage information?

Mudcat

Hi,

I'm wondering the best way to do the following.

I would like to use a map webpage (like yahoo maps) to find the
distance between two places that are pulled in from a text file. I want
to accomplish this without displaying the browser.

I am looking at several options right now, including urllib, httplib,
packet trace, etc. But I don't know where to start with it or if there
are existing tools that I could incorporate.

Can someone explain how to do this or point me in the right direction?

Thanks,
Marc

Jul 18 '05 #1


Jorgen Grahn

On 4 Feb 2005 15:33:50 -0800, Mudcat <mn******@gmail.com> wrote:

> Hi,
>
> I'm wondering the best way to do the following.
>
> I would like to use a map webpage (like yahoo maps) to find the
> distance between two places that are pulled in from a text file. I want
> to accomplish this without displaying the browser.

That's called "web scraping", in case you want to Google for info.

> I am looking at several options right now, including urllib, httplib,
> packet trace, etc. But I don't know where to start with it or if there
> are existing tools that I could incorporate.
>
> Can someone explain how to do this or point me in the right direction?

I did it this way successfully once ... it's probably the wrong approach in
some ways, but It Works For Me.

- used httplib.HTTPConnection for the HTTP parts, building my own requests
with headers and all, calling h.send() and h.getresponse() etc.

- created my own cookie container class (because there was a session
involved, and logging in and such things, and all of it used cookies)

- subclassed sgmllib.SGMLParser once for each kind of page I expected to
receive. This class knew how to pull the information from a HTML document,
provided it looked as I expected it to. Very tedious work. It can be easier
and safer to just use module re in some cases.

Wrapped in classes this ended up as (fictive):

client = Client('somehost:80')
client.login('me', 'secret')
a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
foo = theFoo(client, 'yesterday')

I had to look deeply into the HTTP RFCs to do this, and also snoop the
traffic for a "real" session to see what went on between server and client.
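
For readers on current Python: sgmllib was removed in Python 3, and html.parser.HTMLParser in the standard library fills the same role. Below is a minimal sketch of the "one parser class per kind of page" idea, assuming (hypothetically) that the page puts the distance in a <span class="distance"> element. The class name, the markup and the sample HTML are made up for illustration and are not taken from any real map site.

```python
from html.parser import HTMLParser


class DistanceParser(HTMLParser):
    """Collect the text found inside <span class="distance"> elements.

    The element and class name are hypothetical; a real page needs its
    own inspection to find where the value actually lives.
    """

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.distances = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "distance") in attrs:
            self._in_target = True
            self.distances.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_target = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry
        if self._in_target:
            self.distances[-1] += data


def extract_distances(html):
    """Return the stripped text of every matching element in the page."""
    parser = DistanceParser()
    parser.feed(html)
    parser.close()
    return [d.strip() for d in parser.distances]


# Made-up page fragment standing in for a real response body
SAMPLE = ('<html><body><p>Route</p>'
          '<span class="distance"> 12.4 miles </span></body></html>')
print(extract_distances(SAMPLE))
```

As the post says, this only works as long as the page looks the way the parser expects; a layout change on the site means the class has to be revisited.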

/Jorgen

--
// Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
\X/ algonet.se> R'lyeh wgah'nagl fhtagn!

Jul 18 '05 #2

John J. Lee

Jorgen Grahn <jg*********@algonet.se> writes:
[...]

> I did it this way successfully once ... it's probably the wrong approach in
> some ways, but It Works For Me.
>
> - used httplib.HTTPConnection for the HTTP parts, building my own requests
> with headers and all, calling h.send() and h.getresponse() etc.
>
> - created my own cookie container class (because there was a session
> involved, and logging in and such things, and all of it used cookies)
>
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from a HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
>
> Wrapped in classes this ended up as (fictive):
>
> client = Client('somehost:80')
> client.login('me', 'secret')
> a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
> foo = theFoo(client, 'yesterday')
>
> I had to look deeply into the HTTP RFCs to do this, and also snoop the
> traffic for a "real" session to see what went on between server and client.

I see little benefit and significant loss in using httplib instead of
urllib2, unless and until you get a particularly stubborn problem and
want to drop down a level to debug. It's easy to see and modify
urllib2's headers if you need to get low level.

One starting point for web scraping with Python:

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

There are some modules you may find useful there, too.

Google Groups for urlencode. Or use my module ClientForm, if you
prefer. Experiment a little with an HTML form in a local file and
(eg.) the 'ethereal' sniffer to see what happens when you click
submit.
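
A rough sketch of the urlencode step, written for Python 3, where urllib2 and urllib were reorganised into urllib.request and urllib.parse. The host maps.example.com, the parameter names "from" and "to", and the User-Agent string are all placeholders; the real names have to come from looking at the site's form or sniffing a real session as described above. Nothing is sent over the network here; the request is only built and inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_route_request(start, end):
    """Build (but do not send) a GET request for a hypothetical route page."""
    # urlencode escapes spaces, commas etc. so the values are URL-safe
    query = urlencode({"from": start, "to": end})
    url = "http://maps.example.com/route?" + query
    # Headers can be set explicitly if the site is picky about them
    return Request(url, headers={"User-Agent": "distance-script/0.1"})


req = build_route_request("Salisbury, Wiltshire", "Bath")
print(req.full_url)
print(req.header_items())
```

Passing the Request object to an opener's open() method would perform the actual fetch.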

The stdlib now has cookie support (in Python 2.4):

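import cookielib, urllib2
# the CookieJar stores cookies set by the server and sends them back
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# requests made through this opener share the same cookie jar
r = opener.open("http://example.com/")
print r.read()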
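
For anyone reading this on Python 3: the same modules were renamed, cookielib to http.cookiejar and urllib2 to urllib.request. A sketch of the equivalent setup follows; the helper name make_cookie_opener is just for illustration, and the fetch itself is left as a comment so the snippet runs without network access.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that keeps cookies between requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# r = opener.open("http://example.com/")   # would perform the fetch
# print(r.read())
print(len(jar))  # number of cookies currently held in the jar
```

Every request made through the same opener shares the jar, which is what keeps a login session alive across pages.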

Unfortunately, it's true that network sniffing and a reasonable
smattering of knowledge about HTTP &c., does often turn out to be
necessary to scrape stuff. A few useful tips:

http://wwwsearch.sourceforge.net/Cli...html#debugging
John

Jul 18 '05 #3

John J. Lee

Jorgen Grahn <jg*********@algonet.se> writes:
[...]

> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from a HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.

[...]

BeautifulSoup is often recommended (never tried it myself).

Remember HTMLtidy and its offshoots (eg. tidylib, mxTidy) are
available for cleaning horrid HTML while-u-scrape, too.

Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...
John

Jul 18 '05 #4

Jorgen Grahn

On 05 Feb 2005 22:58:52 +0000, John J. Lee <jj*@pobox.com> wrote:

> Jorgen Grahn <jg*********@algonet.se> writes:
> [...]
>> I did it this way successfully once ... it's probably the wrong approach in
>> some ways, but It Works For Me.
>>
>> - used httplib.HTTPConnection for the HTTP parts, building my own requests
>> with headers and all, calling h.send() and h.getresponse() etc.
>>
>> - created my own cookie container class (because there was a session
>> involved, and logging in and such things, and all of it used cookies)
....
> I see little benefit and significant loss in using httplib instead of
> urllib2, unless and until you get a particularly stubborn problem and
> want to drop down a level to debug. It's easy to see and modify
> urllib2's headers if you need to get low level.

That's quite possibly true. I remember looking at and rejecting
urllib/urllib2, but I cannot remember my reasons. Maybe I didn't feel they
were documented well enough (in Python 2.1, which is where I live).

[more useful info snipped]

/Jorgen

--
// Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
\X/ algonet.se> R'lyeh wgah'nagl fhtagn!

Jul 18 '05 #5
