
404 errors

I'm probably a bit off topic with this, but I'm not sure where else to ask.
Hopefully someone here will know the answer.

I'm writing a script (in Python) which reads a webpage from a user supplied
URL using urllib.urlopen. I want to detect an error from the server. If the
server doesn't exist, that's easy - catch the IOError. However, if the
server exists, but the path in the URL is wrong, how do I detect the error?
Some servers respond with a nicely formatted bit of HTML explaining the
problem, which is fine for a human, but not for a script. Is there some
flag or something definitive on the response which says "this is a 404
error"?
Jul 18 '05 #1


Tut
Tue, 27 Apr 2004 11:00:57 +0800, Derek Fountain wrote:
Some servers respond with a nicely formatted bit of HTML explaining the
problem, which is fine for a human, but not for a script. Is there some
flag or something definitive on the response which says "this is a 404
error"?


Maybe catch the urllib2.HTTPError?

Jul 18 '05 #2

On Tue, 27 Apr 2004 10:46:47 +0200, Tut wrote:
Tue, 27 Apr 2004 11:00:57 +0800, Derek Fountain wrote:
Some servers respond with a nicely formatted bit of HTML explaining the
problem, which is fine for a human, but not for a script. Is there some
flag or something definitive on the response which says "this is a 404
error"?


Maybe catch the urllib2.HTTPError?


This kind of answers the question. urllib will let you read whatever it
receives, regardless of the HTTP status; you need to use urllib2 if you
want to find out the status code when a request results in an error (any
HTTP status beginning with a 4 or 5). This can be done like so:

import urllib2
try:
    asock = urllib2.urlopen("http://www.foo.com/qwerty.html")
except urllib2.HTTPError, e:
    print e.code

The value in urllib2.HTTPError.code comes from the first line of the web
server's HTTP response, just before the headers begin, e.g. "HTTP/1.1 200
OK", or "HTTP/1.1 404 Not Found".
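
To make the status line concrete, here is a minimal sketch of how such a line splits into its parts. The helper name parse_status_line is made up for illustration and is not part of urllib2; the library does this parsing internally.

```python
def parse_status_line(line):
    """Split an HTTP status line into (version, code, reason).

    Hypothetical helper for illustration only.
    """
    # At most three fields: protocol version, numeric code, free-text reason.
    version, code, *reason = line.strip().split(" ", 2)
    return version, int(code), reason[0] if reason else ""


print(parse_status_line("HTTP/1.1 404 Not Found"))
```

The numeric middle field is the value a script should test; the reason phrase is only there for humans and can vary between servers.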

One thing you need to be aware of is that some web sites don't behave as
you would expect them to; e.g. responding with a redirection rather than a
404 error when you request a page that doesn't exist. In these
cases you might still have to rely on some clever scripting.
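
One possible piece of such scripting, as a rough sketch: compare the URL you asked for with the URL you actually ended up at after any redirects. The function name landed_elsewhere is an assumption for illustration, and the rule it applies (same host and same path means "not redirected away") is only a heuristic, not something the server guarantees.

```python
from urllib.parse import urlsplit


def landed_elsewhere(requested_url, final_url):
    """Return True if the final URL's host or path differs from the requested one.

    Hypothetical heuristic: a scheme-only change (http to https) or a
    missing trailing "/" on a bare host does not count as landing elsewhere.
    """
    req, fin = urlsplit(requested_url), urlsplit(final_url)
    return (req.hostname, req.path or "/") != (fin.hostname, fin.path or "/")


print(landed_elsewhere("http://www.foo.com/qwerty.html",
                       "http://www.foo.com/index.html"))
```

A True result does not prove the page is missing (plenty of redirects are legitimate), so treat it as a hint to look closer rather than as a 404.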

Cheers,

Ivan
Jul 18 '05 #3

Ivan Karajas <my***********************@myrealbox.com> writes:
On Tue, 27 Apr 2004 10:46:47 +0200, Tut wrote:
Tue, 27 Apr 2004 11:00:57 +0800, Derek Fountain wrote:
Some servers respond with a nicely formatted bit of HTML explaining the
problem, which is fine for a human, but not for a script. Is there some
flag or something definitive on the response which says "this is a 404
error"?
Maybe catch the urllib2.HTTPError?


This kind of answers the question. urllib will let you read whatever it
receives, regardless of the HTTP status; you need to use urllib2 if you
want to find out the status code when a request results in an error (any
HTTP status beginning with a 4 or 5). This can be done like so:


FWIW, note that urllib2's own idea of an error (ie. something for
which it throws a response object as an HTTPError exception rather
than returning it) is: 'anything other than 200 is an error'. The
only exceptions are where some responses happen to be handled by
urllib2 handlers (eg. 302), or at a lower level by httplib (eg. 100).
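
Because the exception is itself a response object, a script can treat both outcomes uniformly. The sketch below uses the Python 3 module names (urllib2 was split into urllib.request and urllib.error there); fetch_status and fake_opener are made-up names, and fake_opener only simulates a 404 so the example runs without a network.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status_code, body) for url, whether or not the server reported an error."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object: it carries the status code
        # and can still be read for the server's error page.
        return err.code, err.read()
    with response:
        return response.status, response.read()


def fake_opener(url):
    # Stand-in for a server that answers 404, so the sketch runs offline.
    raise urllib.error.HTTPError(
        url, 404, "Not Found", Message(), io.BytesIO(b"no such page"))


print(fetch_status("http://www.foo.com/qwerty.html", opener=fake_opener))
```

With the default opener the same function goes to the network, so a nonexistent server would still surface as a separate exception that this sketch does not catch.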

import urllib2
try:
    asock = urllib2.urlopen("http://www.foo.com/qwerty.html")
except urllib2.HTTPError, e:
    print e.code

The value in urllib2.HTTPError.code comes from the first line of the web
server's HTTP response, just before the headers begin, e.g. "HTTP/1.1 200
OK", or "HTTP/1.1 404 Not Found".

One thing you need to be aware of is that some web sites don't behave as
you would expect them to; e.g. responding with a redirection rather than a
404 error when you request a page that doesn't exist. In these
cases you might still have to rely on some clever scripting.


The following kind of functionality is in urllib2 in Python 2.4 (there
are some loose ends, which I will tie up soon). It's slightly simpler
in 2.4 than in my ClientCookie clone of that module, but (UNTESTED):

import ClientCookie
from ClientCookie._Util import response_seek_wrapper

class BadResponseProcessor(ClientCookie.BaseProcessor):
    # Convert apparently-successful 200 OK or 30x redirection responses to 404s
    # iff they contain tell-tale text that indicates failure.

    def __init__(self, diagnostic_text):
        self.diagnostic_text = diagnostic_text

    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = response_seek_wrapper(response)

        if response.code in [200, 301, 302, 303, 307]:
            ct = response.info().getheaders("content-type")
            if ct and ct[0].startswith("text/html"):
                try:
                    data = response.read(4096)
                    if self.diagnostic_text in data:
                        response.code = 404
                finally:
                    response.seek(0)
        return response

    https_response = http_response

brp = BadResponseProcessor("Whoops, an error occurred.")
opener = ClientCookie.build_opener(brp)

r = opener.open("http://nonstandard.com/bad/url")
assert r.code == 404

Hmm, looking at that, I suppose it would be better done *after*
redirection (which is quite possible, with the modifications I've
made, without needing any heavy subclassing or other hacks -- use the
processor_order attribute). You'd then just check for 200 rather than
200 or 30x in the code above.
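
For comparison, the "check only for 200, after redirection" variant might look roughly like this with the Python 3 standard library, where the processor hooks live on urllib.request.BaseHandler and the ordering attribute is called handler_order. This is a sketch under assumptions, not the ClientCookie code: SoftErrorProcessor, diagnostic_text and peek_bytes are illustrative names, and the body is re-wrapped instead of seeked because Python 3 responses are not seekable.

```python
import io
import urllib.request
import urllib.response
from email.message import Message


class SoftErrorProcessor(urllib.request.BaseHandler):
    """Relabel a 200 HTML response as 404 when it contains tell-tale error text.

    Hypothetical class for illustration; diagnostic_text is whatever phrase
    the misbehaving site is known to put on its error pages.
    """

    # Sort after the stock HTTPErrorProcessor (handler_order 1000), so
    # redirects and genuine error statuses are dealt with first.
    handler_order = 1100

    def __init__(self, diagnostic_text, peek_bytes=4096):
        self.diagnostic_text = diagnostic_text.encode("utf-8")
        self.peek_bytes = peek_bytes

    def http_response(self, request, response):
        content_type = response.headers.get("Content-Type", "")
        if response.code != 200 or not content_type.startswith("text/html"):
            return response
        # Read the body once and hand back a fresh response object wrapping
        # the same bytes, so the caller can still read it from the start.
        body = response.read()
        replacement = urllib.response.addinfourl(
            io.BytesIO(body), response.headers, response.url, response.code)
        replacement.msg = getattr(response, "msg", "")
        if self.diagnostic_text in body[:self.peek_bytes]:
            replacement.code = 404
        return replacement

    https_response = http_response


opener = urllib.request.build_opener(
    SoftErrorProcessor("Whoops, an error occurred."))
```

As in the original, the caller gets back a response whose code reads 404 rather than an exception, and would fetch pages with opener.open(url). Reading the whole body into memory is a simplification that may not suit very large pages.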

A similar problem: as I mention above, by default, urllib2 only
returns 200 responses, and always raises an exception for other HTTP
response codes. Occasionally, it's much more convenient to have an
OpenerDirector that behaves differently:

class HTTPErrorProcessor(ClientCookie.HTTPErrorProcessor):
    # return most error responses rather than raising an exception

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        category = divmod(code, 100)[0]  # eg. 200 --> 2
        if category not in [2, 4, 5] or code in [401, 407]:
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response

    https_response = http_response

John
Jul 18 '05 #4
