Bytes | Software Development & Data Engineering Community

urllib (54, 'Connection reset by peer') error

Hi,

I have a small Python script to fetch some pages from the internet.
There are a lot of pages and I am looping through them and then
downloading the page using urlretrieve() in the urllib module.

The problem is that after 110 pages or so the script sort of hangs and
then I get the following traceback:
>>>>
Traceback (most recent call last):
  File "volume_archiver.py", line 21, in <module>
    urllib.urlretrieve(remotefile, localfile)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 89, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 222, in retrieve
    fp = self.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 328, in open_http
    errcode, errmsg, headers = h.getreply()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 1195, in getreply
    response = self._conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 924, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 385, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 343, in _read_status
    line = self.fp.readline()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/socket.py", line 331, in readline
    data = recv(1)
IOError: [Errno socket error] (54, 'Connection reset by peer')
>>>>>>
My script code is as follows:
-----------------------------------------
import os
import urllib

volume_number = 149 # The volumes number 150 to 544

while volume_number < 544:
    volume_number = volume_number + 1
    localfile = '/Users/Chris/Desktop/Decisions/' + str(volume_number) + '.html'
    remotefile = 'http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=us&navby=vol&vol=' + str(volume_number)
    print 'Getting volume number:', volume_number
    urllib.urlretrieve(remotefile, localfile)

print 'Download complete.'
-----------------------------------------

Once I get the error, running the script again doesn't do much
good. It usually gets two or three pages and then hangs again.

What is causing this?


Jun 27 '08 #1
On Jun 13, 4:21 pm, chrispoliq...@gmail.com wrote:
> [...]
> What is causing this?
The server is causing it. You could just alter your code:

import os
import urllib
import time

volume_number = 149 # The volumes number 150 to 544
localfile = '/Users/Chris/Desktop/Decisions/%s.html'
remotefile = 'http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=us&navby=vol&vol=%s'
while volume_number < 544:
    volume_number += 1
    print 'Getting volume number:', volume_number
    try:
        urllib.urlretrieve(remotefile % volume_number, localfile % volume_number)
    except IOError:
        # Roll back so the same volume is retried after a pause.
        volume_number -= 1
        time.sleep(5)

print 'Download complete.'

That way, if the attempt fails, it rolls back the volume number, pauses
for a few seconds and tries again.
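
One caveat with this retry loop: if the server never recovers, it retries the same volume forever. Below is a minimal sketch of a bounded retry, written for Python 3 rather than the 2.5 used in the thread (in Python 3, urlretrieve lives in urllib.request and IOError is an alias of OSError). The helper name fetch_with_retry and its max_attempts, pause, fetch and sleep parameters are hypothetical, introduced only for illustration.

```python
import time
import urllib.request


def fetch_with_retry(url, dest, max_attempts=5, pause=5.0,
                     fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Download url to dest, retrying on OSError up to max_attempts times.

    Returns the number of attempts used and re-raises the last error if
    every attempt fails. fetch and sleep are injectable (an assumption
    made here so the helper can be exercised without network access).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            fetch(url, dest)
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            sleep(pause)
```

In the download loop, a call such as fetch_with_retry(remotefile % volume_number, localfile % volume_number) would take the place of the try/except, and the counter would no longer need to be rolled back.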
Jun 27 '08 #2
It means your client received a TCP segment with the reset bit set.
The 'peer' will toss one your way if it determines that a connection
is no longer valid or if it receives a bad sequence number. If I had
to hazard a guess, I'd say it's probably a network device on the
server side trying to stop you from running a mass download
(especially if it's easily repeatable and happens at about the same
byte range).

-Jeff
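
The 54 in the traceback is the platform's errno value for ECONNRESET (54 on Mac OS X and the BSDs, 104 on Linux), so comparing against errno.ECONNRESET is more portable than hard-coding the number. Here is a minimal sketch of that check, assuming Python 3, where a reset is raised as ConnectionResetError. The helper name is_connection_reset is hypothetical.

```python
import errno


def is_connection_reset(exc):
    """Return True if exc reports a TCP reset (ECONNRESET) from the peer."""
    if isinstance(exc, ConnectionResetError):
        return True
    # urllib.error.URLError keeps the underlying error in its .reason attribute.
    reason = getattr(exc, "reason", None)
    if isinstance(reason, ConnectionResetError):
        return True
    # Fall back to the errno attribute for other OSError instances.
    return isinstance(exc, OSError) and exc.errno == errno.ECONNRESET
```

A retry loop could use this to back off only on resets and let other errors (a bad URL, a full disk) propagate immediately.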


On Fri, Jun 13, 2008 at 10:21 AM, <ch***********@gmail.com> wrote:
> [...]
> What is causing this?
Jun 27 '08 #3
Thanks for the help. The error handling worked to a certain extent
but after a while the server does seem to stop responding to my
requests.

I have a list of about 7,000 links to pages I want to parse the HTML
of (it's basically a web crawler) but after a certain number of
urlretrieve() or urlopen() calls the server just stops responding.
Anyone know of a way to get around this? I don't own the server so I
can't make any modifications on that side.
Jun 27 '08 #4
ch***********@gmail.com wrote:
> [...]
> Anyone know of a way to get around this? I don't own the server so I
> can't make any modifications on that side.
I think someone's already mentioned this, but it's almost
certainly explicit or implicit throttling on the remote server.
If you're pulling 7,000 pages from a single server you need to
be sure that you're within the Terms of Use of that service, or
at the least you need to contact the maintainers as a courtesy to
confirm that this is acceptable.

If you don't, you may well cause your IP block to be banned on
their network, which could affect others as well as yourself.

TJG
Jun 27 '08 #5
Tim Golden wrote:
> [...]
> If you're pulling 7,000 pages from a single server you need to
> be sure that you're within the Terms of Use of that service, [...]
Interestingly, "lp.findlaw.com" doesn't have any visible terms of service.
The information being downloaded is case law, which is public domain, so
there's no copyright issue. Some throttling and retry is needed to slow
down the process, but it should be fixable.

Try this: put in the retry code someone else suggested. Use a variable
retry delay, and wait one retry delay between downloading files. Whenever
a download fails, double the retry delay and try again; don't let it get
bigger than, say, 256 seconds. When a download succeeds, halve the retry
delay, but don't let it get smaller than 1 second. That will make your
downloader self-tune to the throttling imposed by the server.

John Nagle
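
The self-tuning delay described in this post can be sketched as follows. This is a minimal sketch in Python 3, not John Nagle's own code: the class name AdaptiveDelay, the download_all helper, and the injectable fetch and sleep parameters are all hypothetical. The bounds of 1 and 256 seconds are the ones suggested in the post.

```python
import time
import urllib.request


class AdaptiveDelay:
    """Retry delay that doubles on failure and halves on success."""

    def __init__(self, initial=1.0, minimum=1.0, maximum=256.0):
        self.minimum = minimum
        self.maximum = maximum
        self.value = initial

    def on_failure(self):
        # Double the delay, capped at the maximum.
        self.value = min(self.value * 2, self.maximum)

    def on_success(self):
        # Halve the delay, but never go below the minimum.
        self.value = max(self.value / 2, self.minimum)


def download_all(jobs, delay=None, fetch=urllib.request.urlretrieve,
                 sleep=time.sleep):
    """Fetch each (url, dest) pair, waiting one delay after every attempt."""
    delay = delay or AdaptiveDelay()
    for url, dest in jobs:
        while True:
            try:
                fetch(url, dest)
            except OSError:
                delay.on_failure()
                sleep(delay.value)  # back off, then retry the same URL
            else:
                delay.on_success()
                sleep(delay.value)  # pause before the next file
                break
```

As written, a URL that keeps failing is retried indefinitely at the maximum delay, so a real crawler would likely add an attempt limit as well.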
Jun 27 '08 #6