How to use urllib2.BaseHandler class

Hi all,

I'm trying to build a web page crawler to help us build our websites,
which are driven by static pages after they are called the first time.
Anyway, I can use urllib2.urlopen() no problem, but I'd like to have
more control over the process. In particular I'd like to get back the
HTTP status code from the request, even if it's a 200. It looks like I
can do that by deriving my own class from HTTPHandler, but I'm not
sure how to go about it. Can anyone direct me to some useful example
code for this kind of thing?

Thanks in advance,
Doug Farrell
Jul 18 '05 #1



In 2.3, urllib2 only ever *returns* a response if the code is 200. In
other cases, HTTPError exceptions are *raised*. HTTPError instances
satisfy the normal response interface, so you can catch them and use
them just as you would the return value of urlopen(). As you've
noticed, they also have .code and .msg attributes (unlike normal
response objects, in 2.3 -- since it's always 200, they weren't really
necessary!).
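A minimal sketch of the catch-and-reuse approach described above. It assumes Python 3, where urllib2's functionality lives in urllib.request and urllib.error; the helper name fetch_status and its open_func parameter are made up for illustration and are not part of the library:

```python
import urllib.error
import urllib.request


def fetch_status(url, open_func=urllib.request.urlopen):
    """Return (status_code, body) for url, for error responses too.

    fetch_status and open_func are hypothetical names used for this
    sketch; open_func only exists so a different opener can be passed in.
    """
    try:
        response = open_func(url)
    except urllib.error.HTTPError as err:
        # HTTPError satisfies the response interface, so it can be used
        # in place of the object urlopen() would have returned.
        response = err
    try:
        return response.code, response.read()
    finally:
        response.close()
```

Calling fetch_status("http://www.python.org/") should then hand back the status code and body whether the server answered with a success or an error status. Failures that never produce an HTTP response (DNS errors, refused connections) still raise URLError.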

Now for 2.4, where things have changed a bit.

I *think* the 2.4 CVS urllib2.py will work fine with Python 2.3 (the
annoying Python test suite runner makes it a mild pain to check).

As I mentioned in another thread, don't use the urllib2 from 2.4a1 --
it's broken.

In 2.4, some successful responses other than 200 are also returned (at
present, only 200 and 206). Also, all response objects have .code and
.msg attributes -- not only HTTPError, but those that get returned,
too (i.e. 200 and 206 at the moment). If you want all responses returned
rather than raised as exceptions, or vice versa, it's much easier to
achieve that in 2.4 than in 2.3. It's easier because the interface of
handler objects has been extended to allow pre- and post-processing of
requests and responses respectively, and that feature is now used by
urllib2 to implement HTTP error handling separately from the rest of
HTTP fetching. Snip from CVS urllib2.py:

    class HTTPErrorProcessor(BaseHandler):
        """Process HTTP error responses."""
        handler_order = 1000  # after all other processing

        def http_response(self, request, response):
            code, msg, hdrs = response.code, response.msg, response.info()

            if code not in (200, 206):
                response = self.parent.error(
                    'http', request, response, code, msg, hdrs)

            return response

        https_response = http_response

So, to get all responses returned without error handling, regardless
of error code (this will disable things like authentication and
redirection, of course, so you might want to be a bit more
restrictive, by still passing on selected error codes to
self.parent.error()):

    import urllib2

    class NullHTTPErrorProcessor(urllib2.HTTPErrorProcessor):
        def http_response(self, request, response):
            return response

        https_response = http_response

    opener = urllib2.build_opener(NullHTTPErrorProcessor())
    opener.open("http://www.python.org/")  # never raises HTTPError

You should probably only do this if you have good reason, because you
may confuse people reading your code.
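The more restrictive variant mentioned above (still passing selected codes on to self.parent.error()) might look roughly like this. It is a sketch under assumptions: it is written for Python 3's urllib.request rather than urllib2, the class name SelectiveHTTPErrorProcessor and the helper build_selective_opener are invented for illustration, and the particular set of codes is just one plausible choice for keeping redirection and authentication working:

```python
import urllib.request


class SelectiveHTTPErrorProcessor(urllib.request.HTTPErrorProcessor):
    """Return most responses as-is, but keep redirects and auth working.

    Hypothetical class for illustration; handled_codes is an assumed
    selection, adjust it to what your crawler needs.
    """

    handled_codes = frozenset([301, 302, 303, 307, 401, 407])

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()
        if code in self.handled_codes:
            # Hand these to the usual error handlers (redirect, auth, ...).
            return self.parent.error(
                'http', request, response, code, msg, hdrs)
        # Everything else (200, 404, 500, ...) is returned, never raised.
        return response

    https_response = http_response


def build_selective_opener():
    # build_opener() leaves out the default HTTPErrorProcessor when given
    # an instance of a subclass, so only this processor is installed.
    return urllib.request.build_opener(SelectiveHTTPErrorProcessor())
```

With this opener, a 404 comes back as an ordinary response whose .code is 404, while a 302 is still followed by the redirect handler.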

Use urllib2.install_opener() if you want to use urllib2.urlopen().
Usually there's no real point, though.
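For completeness, installing the opener globally could be sketched like this, again assuming Python 3's urllib.request in place of urllib2 (the NullHTTPErrorProcessor class is repeated so the snippet stands alone):

```python
import urllib.request


class NullHTTPErrorProcessor(urllib.request.HTTPErrorProcessor):
    """Return every HTTP response unchanged instead of raising HTTPError."""

    def http_response(self, request, response):
        return response

    https_response = http_response


opener = urllib.request.build_opener(NullHTTPErrorProcessor())

# After this call, urllib.request.urlopen() goes through `opener`
# everywhere in the process, so other code sharing the interpreter
# also stops seeing HTTPError for 4xx/5xx responses.
urllib.request.install_opener(opener)
```

Because the change is process-wide, calling opener.open() directly is usually the less surprising choice.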

If you want to stick with pre-2.4 code, look at ClientCookie for
example code. That code is full of cruft though, since it's supposed
to work back to 1.5.2, and has to cut and paste a fair amount as a
result ;-)

HTH
John
Jul 18 '05 #2
