urllib2 pinger : insight as to use, cause of hang-up?

EP
Hello patient and tolerant Pythonistas,

Iterating through a long list of arbitrary (and possibly syntactically flawed) URLs with a urllib2 pinging function, I get a hang. No exception is raised; however, according to Windows Task Manager, python.exe stops using any CPU time, its memory use neither increases nor decreases, and the script does not progress (permanently stalled, it seems). As an example, the function below has been stuck on URL number 364 for ~40 minutes.

Does this simply indicate the need for a time-out function, or could there be something else going on (error in my usage) I've overlooked?

If it requires a time-out control, is there a way to implement that without using separate threads? Any best practice recommendations?

Here's my function:

--------------------------------------------------
def testLinks2(urlList=[]):
    import urllib2
    goodLinks=[]
    badLinks=[]
    user_agent = 'mySpider Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    print len(urlList), " links to test"
    count=0
    for url in urlList:
        count+=1
        print count,
        try:
            request = urllib2.Request(url)
            request.add_header('User-Agent', user_agent)
            handle = urllib2.urlopen(request)
            goodLinks.append(url)
        except urllib2.HTTPError, e:
            badLinks.append({url:e.code})
            print e.code,": ",url
        except:
            print "unknown error: ",url
            badLinks.append({url:"unknown error"})
    print len(goodLinks)," working links found"
    return goodLinks, badLinks

good, bad=testLinks2(linkList)
--------------------------------------------------

Thanks in advance for your thoughts.

Eric Pederson

Jul 19 '05 #1
Timing it out will probably solve it.
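
A minimal sketch of what timing it out could look like, written for Python 3, where `urllib2` became `urllib.request` and `urlopen` accepts a `timeout` argument (the Python 2.4 used in this thread has no such argument). The name `check_links` and the injectable `opener` parameter are assumptions made for illustration; they are not part of the original function.

```python
import http.client
import urllib.error
import urllib.request

USER_AGENT = "mySpider Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def check_links(url_list, timeout=10.0, opener=urllib.request.urlopen):
    """Return (good_links, bad_links), giving up on any stalled socket operation."""
    good_links = []
    bad_links = []
    for url in url_list:
        try:
            # Request() itself raises ValueError for URLs without a scheme.
            request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            # `timeout` applies to each blocking socket operation (connect, read).
            with opener(request, timeout=timeout):
                pass  # opening is enough; the response is closed on exit
            good_links.append(url)
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status such as 404.
            bad_links.append({url: exc.code})
        except (OSError, ValueError, http.client.HTTPException) as exc:
            # OSError covers URLError and socket timeouts.
            bad_links.append({url: str(exc)})
    return good_links, bad_links
```

No threads are involved: a stalled server surfaces as an exception on that one URL and the loop moves on. Note that the timeout bounds each individual socket operation rather than the total time spent on a URL.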

Jul 19 '05 #2
EP
"Mahesh" advised:

> Timing it out will probably solve it.

Thanks.

Follow-on question regarding implementing a timeout for use by urllib2. I am guessing the simplest way to do this is via socket.setdefaulttimeout(), but I am not sure if this sets a global parameter, and if so, whether it might be reset via instantiations of urllib, urllib2, httplib, etc. I assume socket and the timeout parameter are in the global namespace and that I can just reset it at will for application to all the socket module 'users'. Is that right?

(TIA)
[experimenting]

>>> import urllib2plus
>>> urllib2plus.setSocketTimeOut(1)
>>> urllib2plus.urlopen('http://zomething.com')
Traceback (most recent call last):
  File "<pyshell#52>", line 1, in -toplevel-
    urllib2plus.urlopen('http://zomething.com')
  File "C:\Python24\lib\urllib2plus.py", line 130, in urlopen
    return _opener.open(url, data)
  File "C:\Python24\lib\urllib2plus.py", line 361, in open
    response = self._open(req, data)
  File "C:\Python24\lib\urllib2plus.py", line 379, in _open
    '_open', req)
  File "C:\Python24\lib\urllib2plus.py", line 340, in _call_chain
    result = func(*args)
  File "C:\Python24\lib\urllib2plus.py", line 1024, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python24\lib\urllib2plus.py", line 999, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
>>> urllib2plus.setSocketTimeOut(10)
>>> urllib2plus.urlopen('http://zomething.com')
<addinfourl at 12449152 whose fp = <socket._fileobject object at 0x00BE1340>>
>>> import socket
>>> socket.setdefaulttimeout(0)
>>> urllib2plus.urlopen('http://zomething.com')
Traceback (most recent call last):
  File "<pyshell#60>", line 1, in -toplevel-
    urllib2plus.urlopen('http://zomething.com')
  File "C:\Python24\lib\urllib2plus.py", line 130, in urlopen
    return _opener.open(url, data)
  File "C:\Python24\lib\urllib2plus.py", line 361, in open
    response = self._open(req, data)
  File "C:\Python24\lib\urllib2plus.py", line 379, in _open
    '_open', req)
  File "C:\Python24\lib\urllib2plus.py", line 340, in _call_chain
    result = func(*args)
  File "C:\Python24\lib\urllib2plus.py", line 1024, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python24\lib\urllib2plus.py", line 999, in do_open
    raise URLError(err)
URLError: <urlopen error (10035, 'The socket operation could not complete without blocking')>
>>> socket.setdefaulttimeout(1)
>>> urllib2plus.urlopen('http://zomething.com')
<addinfourl at 12449992 whose fp = <socket._fileobject object at 0x00BE1420>>
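
The 10035 error in the session above is consistent with how the socket module documents timeout values: `None` means blocking with no limit, `0` means non-blocking, and a positive number means blocking with a deadline. Below is a small sketch that checks this without touching the network. It is written for Python 3, and the helper name `mode_for` is made up for illustration.

```python
import socket


def mode_for(default_timeout):
    """Create a socket under the given default timeout and describe its mode."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(default_timeout)
    try:
        # New sockets pick up the default that is in force when they are created.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            timeout = sock.gettimeout()
    finally:
        # Restore the earlier default so the experiment has no side effects.
        socket.setdefaulttimeout(previous)
    if timeout is None:
        return "blocking, no time limit"
    if timeout == 0:
        return "non-blocking"
    return "blocking, gives up after %g s" % timeout


for value in (None, 0, 1):
    print(repr(value), "->", mode_for(value))
```

So a default of 0 does not mean "no timeout"; any operation that cannot finish immediately fails, which is what the 10035 message reports. Restoring "wait forever" takes `socket.setdefaulttimeout(None)`.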

Jul 19 '05 #3
socket.setdefaulttimeout() is what I have used in the past and it has
worked well. I think it is set in the global namespace, though I could
be wrong. I think it retains its value within the module it is called
in. If you use it in a different module, it will probably get reset,
though it is easy enough to test that out.
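
One way to test that out, sketched for Python 3 (`urllib.request` in place of `urllib2`). The default timeout is documented as applying to all new socket objects, so the expectation is that a value set once through the `socket` module is honoured by urllib without being passed along or reset. The helper name `probe_with_default_timeout` and the silent local server are assumptions made for this experiment.

```python
import socket
import time
import urllib.request


def probe_with_default_timeout(seconds):
    """Set the default timeout, then fetch from a local server that never replies."""
    # A listening socket that never calls accept(): the OS completes the TCP
    # handshake, but no HTTP response is ever sent, so a client with no
    # timeout would wait indefinitely.
    silent_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    silent_server.bind(("127.0.0.1", 0))
    silent_server.listen(1)
    port = silent_server.getsockname()[1]

    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)  # set once, through the socket module only
    # An empty ProxyHandler keeps any configured proxy out of the experiment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    started = time.monotonic()
    try:
        opener.open("http://127.0.0.1:%d/" % port)  # no timeout argument passed
        outcome = "responded"
    except OSError as exc:  # URLError and socket timeouts both derive from OSError
        outcome = type(exc).__name__
    finally:
        socket.setdefaulttimeout(previous)
        silent_server.close()
    return outcome, time.monotonic() - started


outcome, elapsed = probe_with_default_timeout(0.5)
print(outcome, "after %.1f s" % elapsed)
```

If the call comes back with a timeout error after roughly the chosen number of seconds, the setting was picked up by sockets created inside urllib, in a different module from the one that set it.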

Jul 19 '05 #4
