
Link Checking Issues - Sub domains

Hi,

I have written this script to run as a cron job that loops through a
text file containing a list of URLs. It works fine for most of the links;
however, a number of the URLs are subdomains (they are government
sites), such as http://basename.airforce.mil, and these links always
throw 400 errors even though the sites exist.

Is there a way to get around this?

Here is the script:

import httplib
from urlparse import urlparse

class LinkChecker:

    def oldStuff(self, url):
        # Unused legacy helper kept from an earlier version of the script.
        p = urlparse(url)
        h = httplib.HTTP(p[1])
        h.putrequest('HEAD', p[2])
        h.endheaders()
        if h.getreply()[0] == 200: return 1
        else: return 0

    def check(self):
        print "\nLooping through the file, line by line."

        # define default values for the parameters
        text_file = open("/home/jjaffe/pythonModules/JAMRSscripts/urls.txt", "r")
        output = ""
        errors = "=================== ERRORS (website exists but 404, 503 etc): ===================\n"
        failures = "\n=================== FAILURES (cannot connect to website at all): ===================\n"
        eCount = 0
        fCount = 0

        # loop through each line and see what the response code is
        for line in text_file:
            p = urlparse(line)
            try:
                conn = httplib.HTTPConnection(p[1])
                conn.request("GET", p[2])
                r1 = conn.getresponse()
                # if the response code was not success (200) then report the error
                if r1.status != 200:
                    errors += "\n " + str(r1.status) + " error for: " + p[1] + p[2]
                    eCount = eCount + 1
                data1 = r1.read()
                conn.close()
            # the connection attempt failed - hence the website doesn't even exist
            except:
                failures += "\n Could not create connection object: " + p[1] + p[2]
                fCount = fCount + 1
        text_file.close()

        # see if there were errors and create output string
        if (eCount == 0) and (fCount == 0):
            output = "No errors or failures to report"
        else:
            output = errors + "\n\n" + failures

        print output

if __name__ == '__main__':
    lc = LinkChecker()
    lc.check()
    del lc
Thanks in advance.
Aug 5 '08 #1


rpupkin77 wrote:
Hi,

I have written this script to run as a cron job that loops through a
text file containing a list of URLs. It works fine for most of the links;
however, a number of the URLs are subdomains (they are government
sites), such as http://basename.airforce.mil, and these links always
throw 400 errors even though the sites exist.
Have you looked at urllib/urllib2 (urllib.request in 3.0)
for checking links?
If 'http://basename.airforce.mil' works when typed into your browser,
this passage from the documentation for urllib.request.Request might be relevant:

"headers should be a dictionary, and will be treated as if add_header()
was called with each key and value as arguments. This is often used to
“spoof” the User-Agent header, which is used by a browser to identify
itself – some HTTP servers only allow requests coming from common
browsers as opposed to scripts. For example, Mozilla Firefox may
identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127
Firefox/2.0.0.11", while urllib‘s default user agent string is
"Python-urllib/2.6" (on Python 2.6)."
Aug 5 '08 #2

