web crawler error: connection timed out

rhitam30111985
Hi all, I am testing a web crawler on a site passed as a command line argument. It works fine until it hits a server that is down or returns some other error. Here is my code:

  1. #! /usr/bin/python
  2. import urllib
  3. import re
  4. import sys
  5.  
  6.  
  7. def crawl(urllist,done):
  8.  
  9.     curl=urllist[0].upper()
  10.  
  11.     f = urllib.urlopen(curl)
  12.     rx=re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")    
  13.  
  14.     src=f.read()
  15.     src.replace('\n',' ')
  16.  
  17.     ma =rx.findall(src)
  18.  
  19.     for i in range(0,len(ma)):
  20.         ma[i]=ma[i].upper()
  21.  
  22.     urllist=urllist+ma
  23.  
  24.     done.append(curl.upper())        
  25.  
  26.     print "**Done**"+curl
  27.  
  28.  
  29.     for i in range(0,len(done)):        
  30.         while urllist.count(done[i]):
  31.             urllist.pop(urllist.index(done[i]))
  32.  
  33.  
  34.     if len(urllist)>0:
  35.         crawl(urllist,done)
  36.  
  37. url=sys.argv[1]
  38. url=url.upper()
  39.  
  40. print "Seed="+url
  41.  
  42. urllist=[url]
  43. done=[]
  44. crawl(urllist,done)
  45.  
After a certain amount of crawling, the program crashes, giving the following error:

File "./crawler.py", line 35, in crawl
crawl(urllist,done)
File "./crawler.py", line 11, in crawl
f = urllib.urlopen(curl)
File "/usr/lib/python2.4/urllib.py", line 82, in urlopen
return opener.open(url)
File "/usr/lib/python2.4/urllib.py", line 190, in open
return getattr(self, name)(url)
File "/usr/lib/python2.4/urllib.py", line 313, in open_http
h.endheaders()
File "/usr/lib/python2.4/httplib.py", line 798, in endheaders
self._send_output()
File "/usr/lib/python2.4/httplib.py", line 679, in _send_output
self.send(msg)
File "/usr/lib/python2.4/httplib.py", line 646, in send
self.connect()
File "/usr/lib/python2.4/httplib.py", line 630, in connect
raise socket.error, msg
IOError: [Errno socket error] (110, 'Connection timed out')


Is there a way around this problem?
Sep 17 '07 #1
3 Replies


Expert
You can wrap the part where you open the URL for reading in a try/except clause.
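For example, a minimal sketch of that suggestion (the fetch helper name and the example URL are illustrative, not part of the original crawler): catch the IOError that urllib.urlopen raises when a server is down or times out, report it, and return None so the caller can skip that URL.

#!/usr/bin/python
# Sketch: skip URLs whose servers are unreachable instead of crashing.
import urllib

def fetch(url):
    # Return the page source, or None if the connection fails.
    try:
        f = urllib.urlopen(url)
        return f.read()
    except IOError, e:   # e.g. (110, 'Connection timed out')
        print "**Skipped**", url, e
        return None

src = fetch("http://example.com/")
if src is not None:
    print "fetched", len(src), "bytes"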
Sep 17 '07 #2

rhitam30111985
Well, I tried the following modification using try/except:
  1. try:    
  2.         f = urllib.urlopen(curl)
  3.         rx=re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")    
  4.  
  5.         src=f.read()
  6.         src.replace('\n',' ')
  7. except IOError:
  8.         pass
  9.  
  10.  
It gives the following error after reaching http://freenode.net (the seed URL being Wikipedia):

File "./crawler.py", line 19, in crawl
ma =rx.findall(src)
UnboundLocalError: local variable 'rx' referenced before assignment
Sep 17 '07 #3

Expert
Check whether your regular expression is correct. You can also use
  1. try:
  2. ....
  3. except Exception,e:
  4.     print e
  5.  
so that try/except catches all errors and you can print them out.
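Putting the two points together: the UnboundLocalError arises because, when urllib.urlopen raises an IOError, the except: pass branch skips the assignments to rx and src, yet the code after the try block still uses them. Below is a minimal sketch of one way to restructure the crawl function from the original post; the early bail-out on failure, the module-level regex, and the list-comprehension deduplication are editorial choices, not the poster's code.

#!/usr/bin/python
# Sketch only: a variant of the posted crawl() that keeps going when a
# fetch fails, so rx and src are never referenced before assignment.
import urllib
import re
import sys

# The pattern from the original post, compiled once so it is always defined.
rx = re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")

def crawl(urllist, done):
    curl = urllist[0].upper()
    try:
        src = urllib.urlopen(curl).read()
    except Exception, e:      # print the error instead of crashing
        print "**Failed**", curl, e
        src = ""              # nothing to parse for this URL

    ma = [m.upper() for m in rx.findall(src)]
    urllist = urllist + ma
    done.append(curl)
    print "**Done**" + curl

    # Drop every URL that has already been crawled, then recurse.
    urllist = [u for u in urllist if u not in done]
    if len(urllist) > 0:
        crawl(urllist, done)

url = sys.argv[1].upper()
print "Seed=" + url
crawl([url], [])

With that structure, a dead server just prints its error and the crawl moves on to the remaining URLs.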
Sep 17 '07 #4
