web crawler error: connection timed out

rhitam30111985

Hi all, I am testing a web crawler on a site passed as a command-line argument. It works fine until it hits a server that is down or some other error occurs. Here is my code:

  1. #! /usr/bin/python
  2. import urllib
  3. import re
  4. import sys
  5.
  6.
  7. def crawl(urllist, done):
  8.
  9.     curl = urllist[0].upper()
  10.
  11.     f = urllib.urlopen(curl)
  12.     rx = re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")
  13.
  14.     src = f.read()
  15.     src = src.replace('\n', ' ')
  16.
  17.     ma = rx.findall(src)
  18.
  19.     for i in range(0, len(ma)):
  20.         ma[i] = ma[i].upper()
  21.
  22.     urllist = urllist + ma
  23.
  24.     done.append(curl)
  25.
  26.     print "**Done** " + curl
  27.
  28.
  29.     for i in range(0, len(done)):
  30.         while urllist.count(done[i]):
  31.             urllist.pop(urllist.index(done[i]))
  32.
  33.
  34.     if len(urllist) > 0:
  35.         crawl(urllist, done)
  36.
  37. url = sys.argv[1]
  38. url = url.upper()
  39.
  40. print "Seed=" + url
  41.
  42. urllist = [url]
  43. done = []
  44. crawl(urllist, done)
After a certain amount of crawling, the program crashes with the following error:

File "./crawler.py", line 35, in crawl
    crawl(urllist,done)
File "./crawler.py", line 11, in crawl
    f = urllib.urlopen(curl)
File "/usr/lib/python2.4/urllib.py", line 82, in urlopen
    return opener.open(url)
File "/usr/lib/python2.4/urllib.py", line 190, in open
    return getattr(self, name)(url)
File "/usr/lib/python2.4/urllib.py", line 313, in open_http
    h.endheaders()
File "/usr/lib/python2.4/httplib.py", line 798, in endheaders
    self._send_output()
File "/usr/lib/python2.4/httplib.py", line 679, in _send_output
    self.send(msg)
File "/usr/lib/python2.4/httplib.py", line 646, in send
    self.connect()
File "/usr/lib/python2.4/httplib.py", line 630, in connect
    raise socket.error, msg
IOError: [Errno socket error] (110, 'Connection timed out')


is there a way around this problem?
Sep 17 '07 #1
ghostdog74

You can wrap the part where you open the URL for reading in a try/except clause.
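A minimal sketch of that pattern, written for modern Python 3 (`urllib.request`, since the Python 2 `urllib.urlopen` used in the thread no longer exists). The `opener` parameter is only there for illustration and testing; `socket.setdefaulttimeout` is optional but keeps a dead server from stalling the crawl for minutes:

```python
import socket
import urllib.request
from urllib.error import URLError

# Optional: give up on an unresponsive server after 10 seconds instead of
# waiting for the OS-level timeout (the errno 110 in the traceback above).
socket.setdefaulttimeout(10)

def fetch(url, opener=urllib.request.urlopen):
    """Return the body of url, or None if the server is down or unreachable."""
    try:
        return opener(url).read()
    except (URLError, OSError):  # URLError wraps socket errors in Python 3
        return None
```

The crawler can then check for `None` and skip that URL instead of crashing.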
Sep 17 '07 #2
rhitam30111985

Well, I tried the following modification using try/except:
try:
    f = urllib.urlopen(curl)
    rx = re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")

    src = f.read()
    src = src.replace('\n', ' ')
except IOError:
    pass
It gives the following error after reaching http://freenode.net (the seed URL being Wikipedia):

File "./crawler.py", line 19, in crawl
    ma = rx.findall(src)
UnboundLocalError: local variable 'rx' referenced before assignment
Sep 17 '07 #3
ghostdog74
Check that your regular expression is correct. Also look at what that traceback is telling you: when urlopen fails, your except IOError: pass swallows the error but the function keeps running, and rx and src were never assigned, hence the UnboundLocalError. Skip that URL (return, or continue in a loop) instead of just passing. You can also use
try:
    ....
except Exception, e:
    print e
so that the try/except catches all errors and you can print them out.
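Putting the two suggestions together: mark the URL as done and move on when the fetch fails, so nothing after the except touches unassigned variables. Below is a rough Python 3 rewrite of the crawler in that shape (not the original poster's code: `urllib.request` replaces the Python 2 `urllib`, the recursion becomes a queue loop, and the `fetch` parameter and `limit` are made up for illustration and testing):

```python
import re
import urllib.request
from urllib.error import URLError

# Same link pattern as the thread's crawler.
LINK_RX = re.compile(r'href="(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s"]')

def crawl(seed, fetch=None, limit=50):
    """Breadth-first crawl that skips unreachable pages instead of crashing."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read().decode("utf-8", "replace")
    queue, done = [seed], []
    while queue and len(done) < limit:
        url = queue.pop(0)
        if url in done:
            continue
        done.append(url)            # mark first, so a failed URL is never retried
        try:
            src = fetch(url)
        except (URLError, OSError):
            continue                # server down or timed out: skip it, keep crawling
        queue.extend(LINK_RX.findall(src))
    return done
```

A fake fetch that raises OSError for a dead host shows the loop carrying on past the failure rather than dying with a traceback.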
Sep 17 '07 #4
