waiting for html to load: a followup

Hi - A couple days ago I posted asking for help on how to download a
pushed file. I am trying to write a script to download a bunch of links
from a page that takes a while to load.

I managed to get just about everything done using Python to drive IE, but
aside from not really liking that style, I couldn't figure out how to
have Python download the pushed file, or how to read the IE headers into
Python (the headers point to the download location).

Anyway, I decided to forget IE and I am now trying to use urllib2 to
open up the page, read it, etc. My problem is the page has a built-in
refresh and I don't know how to have python re-read the page until it's
ready to hand over the links.

An example of the page is:
http://edcw2ks23.cr.usgs.gov/Website...8&prodList=NED,

I believe I need to read the header, grab the cookie session id, and add
it back to the header. I can do all this, but I'm stuck on what is probably
very simple syntax to re-read the page rather than open a new
connection, if that makes sense (I'm new to HTTP as well as Python).
My code snippet:

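import urllib2
from time import sleep

# url holds the address of the waiting page shown above
myreq = urllib2.Request(url)
myreq.add_header('User-Agent',
                 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
opener = urllib2.build_opener()
cookie = None
x = 0
while x < 10:
    if cookie:
        # send the session cookie back on every re-read
        myreq.add_header('Cookie', cookie)
    feeddata = opener.open(myreq)
    headers = feeddata.info()
    data = feeddata.read()
    if cookie is None and headers.get('set-cookie'):
        # same trimming as before: drop the last 8 characters of the value
        cookie = headers['set-cookie'][:-8]
    print data[1600:1650]
    print '\n\n\n\n*****************Using Cookie: %s' % cookie
    print '****************Header info: \n', headers
    sleep(3)
    x = x + 1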

Any help greatly appreciated. Thanks in advance, and when I know what
I'm doing I'll repay the favors.

-Josh
Jul 18 '05 #1
Josh <jo***@commenspace.org> writes:
[...]
> Anyway, I decided to forget IE and I am now trying to use urllib2 to
> open up the page, read it, etc. My problem is the page has a built-in
> refresh and I don't know how to have python re-read the page until
> it's ready to hand over the links.

ClientCookie does that (HTTPRefreshProcessor and HTTPEquivProcessor in
particular).

http://wwwsearch.sf.net/ClientCookie

I recommend using the alpha release. The interface will change a
little soon, but you almost certainly won't notice.

> An example of the page is:
> http://edcw2ks23.cr.usgs.gov/Website...8&prodList=NED,
>
> I believe I need to read the header, grab the cookie session id, and
> add it back to the header. I can do all this, but I'm stuck on

It'll do the cookies too :-)

[...]
> probably very simple syntax to re-read the page rather than open a new
> connection, if that makes sense (I'm new to http as well as python).


You don't need to ensure it's the same connection. In fact, you can't
easily do that with urllib2 (or ClientCookie) as it is currently.
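
To illustrate the point about connections: what carries the session from one request to the next is the cookie, not the socket. Below is a minimal sketch of that idea using the standard library of a current Python 3 (`urllib.request` and `http.cookiejar`) instead of urllib2 or ClientCookie. The URL, the cookie value, and the `FakeResponse` and `make_cookie_opener` helpers are hypothetical and exist only so the example runs offline.

```python
import email.message
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); the jar keeps cookies between separate requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


class FakeResponse:
    """Hypothetical stand-in for a server reply, so no network is needed."""

    def __init__(self, set_cookie):
        self._headers = email.message.Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        return self._headers


# Hypothetical URL and cookie value, used only for this demonstration.
URL = "http://www.example.com/waiting.jsp"

opener, jar = make_cookie_opener()

# First request: the jar records the cookie the "server" sets.
first = urllib.request.Request(URL)
jar.extract_cookies(FakeResponse("JSESSIONID=abc123; Path=/"), first)

# A brand-new request (hence a new connection) still carries the cookie.
second = urllib.request.Request(URL)
jar.add_cookie_header(second)
print(second.get_header("Cookie"))
```

In real use you would simply call `opener.open(url)` repeatedly; the cookie processor attached to the opener does the extract and add steps shown here on every request.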

HTH
John
Jul 18 '05 #2
Josh <jo***@commenspace.org> writes:
[...]
> Anyway, I decided to forget IE and I am now trying to use urllib2 to
> open up the page, read it, etc. My problem is the page has a built-in
> refresh and I don't know how to have python re-read the page until
> it's ready to hand over the links.
>
> An example of the page is:
> http://edcw2ks23.cr.usgs.gov/Website...8&prodList=NED,

Example, with some debugging turned on so you can see some of what's
going on:

import ClientCookie

opener = ClientCookie.build_opener(
    # max_time=None: no upper limit on the refresh delay that is followed
    ClientCookie.HTTPRefreshProcessor(max_time=None),
    ClientCookie.HTTPResponseDebugProcessor(),
    ClientCookie.HTTPRedirectDebugProcessor(),
    )
ClientCookie.getLogger("ClientCookie").setLevel(ClientCookie.DEBUG)

r = opener.open('http://edcw2ks23.cr.usgs.gov/Website/zipship/waiting.jsp?areaList=49.0,47.0,-122.0,-124.08&prodList=NED,')
# save the final page (the one that no longer refreshes)
f = open('out.html', 'w')
f.write(r.read())
f.close()

Don't mix ClientCookie and urllib2, BTW.
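
For readers who want to see roughly what following a refresh involves, here is a sketch in current Python 3 using only the standard library. It is not ClientCookie's implementation, and it assumes the waiting page signals the refresh with a `<meta http-equiv="refresh">` tag, which the thread does not confirm. The names `parse_refresh`, `MetaRefreshFinder`, and `poll_until_ready`, along with the example URL and pages, are all hypothetical; `fetch` is any function that returns the HTML for a URL.

```python
import re
import time
from html.parser import HTMLParser
from urllib.parse import urljoin


def parse_refresh(value, base_url):
    """Split a refresh value like '3; URL=next.jsp' into (delay, absolute URL)."""
    delay_part, _, target = value.partition(";")
    delay = float(delay_part.strip())
    target = target.strip()
    match = re.match(r"url\s*=\s*(.*)", target, re.IGNORECASE)
    if match:
        target = match.group(1)
    target = target.strip().strip("'\"")
    # An empty target means "reload the same URL"; urljoin then returns base_url.
    return delay, urljoin(base_url, target)


class MetaRefreshFinder(HTMLParser):
    """Record the content of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_refresh = (attrs.get("http-equiv") or "").lower() == "refresh"
        if tag == "meta" and is_refresh and self.refresh is None:
            self.refresh = attrs.get("content") or ""


def poll_until_ready(fetch, url, max_tries=10, sleep=time.sleep):
    """Re-fetch until a page without a meta refresh arrives."""
    for _ in range(max_tries):
        html = fetch(url)
        finder = MetaRefreshFinder()
        finder.feed(html)
        if finder.refresh is None:
            return url, html
        delay, url = parse_refresh(finder.refresh, url)
        sleep(delay)
    raise RuntimeError("page still refreshing after %d tries" % max_tries)


# Offline demonstration with two made-up pages instead of a real server.
pages = iter([
    '<html><head><meta http-equiv="refresh" '
    'content="0; URL=waiting.jsp?n=2"></head></html>',
    '<html><body><a href="data.zip">download</a></body></html>',
])
final_url, html = poll_until_ready(
    lambda u: next(pages),
    "http://www.example.com/zipship/waiting.jsp",
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(final_url)
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are retrieved, so the same function can be driven by a cookie-aware opener in real use.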
John
Jul 18 '05 #3
John,

I really appreciate your reply. I actually grabbed the ClientCookie
module last night and spent a long time trying to figure out how to
write this; your snippet of code was incredibly helpful to me. Nothing
quite like being totally new to a subject, I must say.

Thanks again,

-Josh

John J. Lee wrote:
[...]

Jul 18 '05 #4
