How do I enter/receive webpage information?

Hi,

I'm wondering the best way to do the following.

I would like to use a map webpage (like yahoo maps) to find the
distance between two places that are pulled in from a text file. I want
to accomplish this without displaying the browser.

I am looking at several options right now, including urllib, httplib,
packet trace, etc. But I don't know where to start with it or if there
are existing tools that I could incorporate.

Can someone explain how to do this or point me in the right direction?

Thanks,
Marc

Jul 18 '05 #1
On 4 Feb 2005 15:33:50 -0800, Mudcat <mn******@gmail.com> wrote:
> Hi,
>
> I'm wondering the best way to do the following.
>
> I would like to use a map webpage (like yahoo maps) to find the
> distance between two places that are pulled in from a text file. I want
> to accomplish this without displaying the browser.

That's called "web scraping", in case you want to Google for info.

> I am looking at several options right now, including urllib, httplib,
> packet trace, etc. But I don't know where to start with it or if there
> are existing tools that I could incorporate.
>
> Can someone explain how to do this or point me in the right direction?


I did it this way successfully once ... it's probably the wrong approach in
some ways, but It Works For Me.

- used httplib.HTTPConnection for the HTTP parts, building my own requests
with headers and all, calling h.send() and h.getresponse() etc.

- created my own cookie container class (because there was a session
involved, and logging in and such things, and all of it used cookies)

- subclassed sgmllib.SGMLParser once for each kind of page I expected to
receive. This class knew how to pull the information from a HTML document,
provided it looked as I expected it to. Very tedious work. It can be easier
and safer to just use module re in some cases.

Wrapped in classes this ended up as (fictive):

client = Client('somehost:80')
client.login('me', 'secret')
a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
foo = theFoo(client, 'yesterday')

I had to look deeply into the HTTP RFCs to do this, and also snoop the
traffic for a "real" session to see what went on between server and client.
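
For anyone trying the same route today, here is a minimal sketch of the idea, assuming Python 3 (where httplib is now http.client). CookieBox and fetch are hypothetical names made up for illustration, not the code described above, and the cookie handling deliberately ignores path, domain and expiry.

```python
import http.client
from http.cookies import SimpleCookie


class CookieBox:
    """Tiny hand-rolled cookie container (hypothetical; ignores path, domain, expiry)."""

    def __init__(self):
        self.cookies = {}

    def store(self, set_cookie_values):
        # Each item is the raw text of one Set-Cookie response header.
        for raw in set_cookie_values:
            parsed = SimpleCookie()
            parsed.load(raw)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def header(self):
        # Value for the Cookie request header, in "name=value; name=value" form.
        return "; ".join(f"{k}={v}" for k, v in self.cookies.items())


def fetch(host, path, box):
    """GET path from host over plain HTTP, sending and collecting cookies."""
    conn = http.client.HTTPConnection(host, timeout=10)
    headers = {"User-Agent": "example-scraper/0.1"}  # assumed value
    if box.cookies:
        headers["Cookie"] = box.header()
    conn.request("GET", path, headers=headers)
    resp = conn.getresponse()
    box.store(resp.headers.get_all("Set-Cookie") or [])
    body = resp.read()
    conn.close()
    return resp.status, body
```

Calling fetch("somehost", "/login", box) repeatedly with the same box would carry the session cookie from one request to the next, which is roughly what a login flow needs.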

/Jorgen

--
// Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
\X/ algonet.se> R'lyeh wgah'nagl fhtagn!
Jul 18 '05 #2
Jorgen Grahn <jg*********@algonet.se> writes:
[...]
> I did it this way successfully once ... it's probably the wrong approach in
> some ways, but It Works For Me.
>
> - used httplib.HTTPConnection for the HTTP parts, building my own requests
> with headers and all, calling h.send() and h.getresponse() etc.
>
> - created my own cookie container class (because there was a session
> involved, and logging in and such things, and all of it used cookies)
>
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from a HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
>
> Wrapped in classes this ended up as (fictive):
>
> client = Client('somehost:80')
> client.login('me', 'secret')
> a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
> foo = theFoo(client, 'yesterday')
>
> I had to look deeply into the HTTP RFCs to do this, and also snoop the
> traffic for a "real" session to see what went on between server and client.


I see little benefit and significant loss in using httplib instead of
urllib2, unless and until you get a particularly stubborn problem and
want to drop down a level to debug. It's easy to see and modify
urllib2's headers if you need to get low level.
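
A small sketch of what seeing and modifying the headers can look like, assuming Python 3, where urllib2 has become urllib.request. The URL and header values are made up for illustration, and nothing is sent over the network here.

```python
import urllib.request

# Hypothetical URL and header values, for illustration only.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "example-scraper/0.1"},
)
req.add_header("Accept-Language", "en")

# Inspect what would be sent. Note that Request stores header names via
# str.capitalize(), so "User-Agent" shows up as "User-agent".
sent_headers = dict(req.header_items())

# For a wire-level trace, debuglevel=1 makes the HTTP handler print the
# request and response headers whenever this opener is used.
opener = urllib.request.build_opener(urllib.request.HTTPHandler(debuglevel=1))
```

Using opener.open(req) instead of urlopen(req) is then enough to get the trace on stdout.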

One starting point for web scraping with Python:

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

There are some modules you may find useful there, too.

Search Google Groups for urlencode. Or use my module ClientForm, if you
prefer. Experiment a little with an HTML form in a local file and
(e.g.) the 'ethereal' sniffer to see what happens when you click
submit.
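
To make the urlencode hint concrete, here is a rough sketch assuming Python 3 (urllib.parse and urllib.request). The field names "start" and "end" and the URL are invented; a real map site uses its own names, which is exactly what the sniffer or the page's HTML source will tell you.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields and URL, not taken from any real map service.
fields = {"start": "Austin, TX", "end": "Dallas, TX"}

# urlencode produces the same key=value&key=value body a browser sends
# when a form is submitted; Request wants it as bytes.
body = urllib.parse.urlencode(fields).encode("ascii")

# Supplying data turns the request into a POST with a form-encoded body.
req = urllib.request.Request("http://example.com/route", data=body)

# urllib.request.urlopen(req) would submit it; not called in this sketch.
```

For a form that uses GET instead, the same urlencode output would be appended to the URL after a "?" rather than passed as data.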

The stdlib now has cookie support (in Python 2.4):

import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

r = opener.open("http://example.com/")
print r.read()
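
For readers on Python 3, the snippet above translates roughly as follows, assuming the renamed modules http.cookiejar and urllib.request. The fetch_page helper is a hypothetical wrapper added for illustration.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from responses and replays them on later requests.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))


def fetch_page(url):
    """Fetch url through the cookie-aware opener and return the decoded body."""
    with opener.open(url, timeout=10) as r:
        return r.read().decode("utf-8", errors="replace")
```

Every call that goes through the same opener shares the jar, so a login request followed by a data request keeps its session.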

Unfortunately, it's true that network sniffing and a reasonable
smattering of knowledge about HTTP &c. do often turn out to be
necessary to scrape stuff. A few useful tips:

http://wwwsearch.sourceforge.net/Cli...html#debugging
John
Jul 18 '05 #3
Jorgen Grahn <jg*********@algonet.se> writes:
[...]
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from a HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.

[...]

BeautifulSoup is often recommended (never tried it myself).

Remember HTMLtidy and its offshoots (eg. tidylib, mxTidy) are
available for cleaning horrid HTML while-u-scrape, too.
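
For comparison, the parser-subclass idea quoted above can be sketched with the standard library alone. This assumes Python 3, where sgmllib no longer exists and html.parser is the closest stand-in. The HTML fragment, the "distance" class name and DistanceParser are all invented for illustration; a real map page will look different.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; a real map site's markup will differ.
SAMPLE = '<html><body><p>Total</p><span class="distance">195.3 miles</span></body></html>'


class DistanceParser(HTMLParser):
    """Collects the text inside <span class="distance"> elements."""

    def __init__(self):
        super().__init__()
        self.in_distance = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "distance") in attrs:
            self.in_distance = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_distance = False

    def handle_data(self, data):
        if self.in_distance:
            self.found.append(data.strip())


parser = DistanceParser()
parser.feed(SAMPLE)
parser.close()

# The "just use module re" shortcut: fine for one fixed pattern, brittle otherwise.
match = re.search(r'class="distance">([\d.]+)\s*miles', SAMPLE)
```

The parser version survives attribute reordering and extra whitespace in the markup; the regular expression is shorter but breaks as soon as the page layout shifts.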

Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...
John
Jul 18 '05 #4
On 05 Feb 2005 22:58:52 +0000, John J. Lee <jj*@pobox.com> wrote:
> Jorgen Grahn <jg*********@algonet.se> writes:
> [...]
>> I did it this way successfully once ... it's probably the wrong approach in
>> some ways, but It Works For Me.
>>
>> - used httplib.HTTPConnection for the HTTP parts, building my own requests
>> with headers and all, calling h.send() and h.getresponse() etc.
>>
>> - created my own cookie container class (because there was a session
>> involved, and logging in and such things, and all of it used cookies)
....
> I see little benefit and significant loss in using httplib instead of
> urllib2, unless and until you get a particularly stubborn problem and
> want to drop down a level to debug. It's easy to see and modify
> urllib2's headers if you need to get low level.


That's quite possibly true. I remember looking at and rejecting
urllib/urllib2, but I cannot remember my reasons. Maybe I didn't feel they
were documented well enough (in Python 2.1, which is where I live).

[more useful info snipped]

/Jorgen

--
// Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
\X/ algonet.se> R'lyeh wgah'nagl fhtagn!
Jul 18 '05 #5
