web crawler error: connection timed out

rhitam30111985
Hi all, I am testing a web crawler on a site passed as a command line argument. It works fine until it hits a server that is down or returns some other error. Here is my code:

  1. #! /usr/bin/python
  2. import urllib
  3. import re
  4. import sys
  5.  
  6.  
  7. def crawl(urllist,done):
  8.  
  9.     curl=urllist[0].upper()
  10.  
  11.     f = urllib.urlopen(curl)
  12.     rx=re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")    
  13.  
  14.     src=f.read()
  15.     src.replace('\n',' ')
  16.  
  17.     ma =rx.findall(src)
  18.  
  19.     for i in range(0,len(ma)):
  20.         ma[i]=ma[i].upper()
  21.  
  22.     urllist=urllist+ma
  23.  
  24.     done.append(curl.upper())        
  25.  
  26.     print "**Done**"+curl
  27.  
  28.  
  29.     for i in range(0,len(done)):        
  30.         while urllist.count(done[i]):
  31.             urllist.pop(urllist.index(done[i]))
  32.  
  33.  
  34.     if len(urllist)>0:
  35.         crawl(urllist,done)
  36.  
  37. url=sys.argv[1]
  38. url=url.upper()
  39.  
  40. print "Seed="+url
  41.  
  42. urllist=[url]
  43. done=[]
  44. crawl(urllist,done)
  45.  
After a certain amount of crawling, the program crashes, giving the following error:

File "./crawler.py", line 35, in crawl
crawl(urllist,done)
File "./crawler.py", line 11, in crawl
f = urllib.urlopen(curl)
File "/usr/lib/python2.4/urllib.py", line 82, in urlopen
return opener.open(url)
File "/usr/lib/python2.4/urllib.py", line 190, in open
return getattr(self, name)(url)
File "/usr/lib/python2.4/urllib.py", line 313, in open_http
h.endheaders()
File "/usr/lib/python2.4/httplib.py", line 798, in endheaders
self._send_output()
File "/usr/lib/python2.4/httplib.py", line 679, in _send_output
self.send(msg)
File "/usr/lib/python2.4/httplib.py", line 646, in send
self.connect()
File "/usr/lib/python2.4/httplib.py", line 630, in connect
raise socket.error, msg
IOError: [Errno socket error] (110, 'Connection timed out')


Is there a way around this problem?
Sep 17 '07 #1
3 Replies


Expert
You can wrap the part where you open the URL for reading in a try/except clause.
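For example, a minimal sketch of that suggestion (the fetch helper name and the example URL are illustrative, not part of the original crawler): catch the IOError that urllib.urlopen raises when a server is down or times out, report it, and return None so the caller can skip that URL.

#!/usr/bin/python
# Sketch: skip URLs whose servers are unreachable instead of crashing.
import urllib

def fetch(url):
    # Return the page source, or None if the connection fails.
    try:
        f = urllib.urlopen(url)
        return f.read()
    except IOError, e:   # e.g. (110, 'Connection timed out')
        print "**Skipped**", url, e
        return None

src = fetch("http://example.com/")
if src is not None:
    print "fetched", len(src), "bytes"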
Sep 17 '07 #2

rhitam30111985
Well, I tried the following modification using try/except:
  1. try:    
  2.         f = urllib.urlopen(curl)
  3.         rx=re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")    
  4.  
  5.         src=f.read()
  6.         src.replace('\n',' ')
  7. except IOError:
  8.         pass
  9.  
  10.  
It gives the following error after reaching http://freenode.net (the seed URL being Wikipedia):

File "./crawler.py", line 19, in crawl
ma =rx.findall(src)
UnboundLocalError: local variable 'rx' referenced before assignment
Sep 17 '07 #3

Expert
Check whether your regular expression is correct. You can also use
  1. try:
  2. ....
  3. except Exception,e:
  4.     print e
  5.  
so that try/except catches all errors and you can print them out.
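Putting the two points together: the UnboundLocalError arises because, when urllib.urlopen raises an IOError, the except: pass branch skips the assignments to rx and src, yet the code after the try block still uses them. Below is a minimal sketch of one way to restructure the crawl function from the original post; the early bail-out on failure, the module-level regex, and the list-comprehension deduplication are editorial choices, not the poster's code.

#!/usr/bin/python
# Sketch only: a variant of the posted crawl() that keeps going when a
# fetch fails, so rx and src are never referenced before assignment.
import urllib
import re
import sys

# The pattern from the original post, compiled once so it is always defined.
rx = re.compile("href=\"(http://[a-zA-Z0-9_\./\?&%=#\-]+)[\s\"]")

def crawl(urllist, done):
    curl = urllist[0].upper()
    try:
        src = urllib.urlopen(curl).read()
    except Exception, e:      # print the error instead of crashing
        print "**Failed**", curl, e
        src = ""              # nothing to parse for this URL

    ma = [m.upper() for m in rx.findall(src)]
    urllist = urllist + ma
    done.append(curl)
    print "**Done**" + curl

    # Drop every URL that has already been crawled, then recurse.
    urllist = [u for u in urllist if u not in done]
    if len(urllist) > 0:
        crawl(urllist, done)

url = sys.argv[1].upper()
print "Seed=" + url
crawl([url], [])

With that structure, a dead server just prints its error and the crawl moves on to the remaining URLs.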
Sep 17 '07 #4
