
Python "robots.txt" parser broken since 2003

This bug, "[ 813986 ] robotparser interactively prompts for username and
password", has been open since 2003. It killed a big batch job of ours
last night.

Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
If the server asks for basic authentication on that file, "robotparser"
prompts for the password on standard input. Which is rarely what you
want. You can demonstrate this with:

import robotparser
url = 'http://mueblesmoraleda.com' # this site is password-protected.
parser = robotparser.RobotFileParser()
parser.set_url(url)
parser.read() # Prompts for password

That's the standard, although silly, "urllib" behavior.

This was reported in 2003, and a patch was uploaded in 2005, but the patch
never made it into Python 2.4 or 2.5.

A temporary workaround is this:

import robotparser

def prompt_user_passwd(self, host, realm):
    # Supply no credentials instead of prompting on standard input.
    return None, None

robotparser.URLopener.prompt_user_passwd = prompt_user_passwd # temp patch

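
For readers on Python 3, where the module lives at `urllib.robotparser`, one way to avoid any dependence on the parser's own fetching is to download robots.txt yourself and hand the text to `parse()`. The sketch below is only an illustration, not the patch from the tracker: the helper names `fetch_robots` and `parser_for` are made up for this example, and the status policy (401/403 means "deny all", any other non-200 means "allow all") is an assumption you may want to change.

```python
import urllib.error
import urllib.request
import urllib.robotparser

# Synthetic rules used when the server refuses access to robots.txt.
DENY_ALL = ["User-agent: *", "Disallow: /"]


def fetch_robots(url, timeout=10):
    """Return (status, text) for a robots.txt URL without ever prompting."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 401 arrives here as an exception, not as a password prompt.
        return err.code, ""


def parser_for(status, text):
    """Build a RobotFileParser from an already-fetched response."""
    parser = urllib.robotparser.RobotFileParser()
    if status in (401, 403):
        parser.parse(DENY_ALL)          # assumed policy: treat as off limits
    elif status == 200:
        parser.parse(text.splitlines())
    else:
        parser.parse([])                # assumed policy: no rules, allow all
    return parser
```

Typical use would be `parser_for(*fetch_robots(robots_url))` followed by `can_fetch(agent, page_url)`. Keeping the fetch separate from the parse also makes the policy easy to test without touching the network.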
John Nagle
Apr 21 '07 #1

"John Nagle" <na***@animats.com> wrote in message
news:Fv******************@newssvr29.news.prodigy.net...
| This was reported in 2003, and a patch was uploaded in 2005, but the patch
| never made it into Python 2.4 or 2.5.

If the patch is still open, perhaps you could review it.

tjr

Apr 22 '07 #2
Terry Reedy wrote:
> "John Nagle" <na***@animats.com> wrote in message
> news:Fv******************@newssvr29.news.prodigy.net...
> | This was reported in 2003, and a patch was uploaded in 2005, but the patch
> | never made it into Python 2.4 or 2.5.
>
> If the patch is still open, perhaps you could review it.

I tried it on Python 2.4 and it's in our production system now.
But someone who regularly does check-ins should do this.

John Nagle
Apr 22 '07 #3
John Nagle wrote:
> Terry Reedy wrote:
>> "John Nagle" <na***@animats.com> wrote in message
>> news:Fv******************@newssvr29.news.prodigy.net...
>> | This was reported in 2003, and a patch was uploaded in 2005, but the patch
>> | never made it into Python 2.4 or 2.5.
>>
>> If the patch is still open, perhaps you could review it.
>
> I tried it on Python 2.4 and it's in our production system now.
> But someone who regularly does check-ins should do this.

If you post such a review (even just the short sentence above) to the
patch tracker, it often increases the chance of someone committing the
patch.

Steve
Apr 22 '07 #4
In article <Fv******************@newssvr29.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> This bug, "[ 813986 ] robotparser interactively prompts for username and
> password", has been open since 2003. It killed a big batch job of ours
> last night.
>
> Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
> If the server asks for basic authentication on that file, "robotparser"
> prompts for the password on standard input. Which is rarely what you
> want. You can demonstrate this with:
>
> import robotparser
> url = 'http://mueblesmoraleda.com' # this site is password-protected.
> parser = robotparser.RobotFileParser()
> parser.set_url(url)
> parser.read() # Prompts for password
>
> That's the standard, although silly, "urllib" behavior.

John,
robotparser is (IMO) suboptimal in a few other ways, too.
- It doesn't handle non-ASCII characters. (They're infrequent but when
writing a spider which sees thousands of robots.txt files in a short
time, "infrequent" can become "daily").
- It doesn't account for BOMs in robots.txt (which are rare).
- It ignores any Expires header sent with the robots.txt.
- It handles some ambiguous return codes (e.g. 503) that it ought to
pass up to the caller.

I wrote my own parser to address these problems. It probably suffers
from the same urllib hang that you've found (I have not encountered it
myself) and I appreciate you posting a fix. Here's the code &
documentation in case you're interested:
http://NikitaTheSpider.com/python/rerp/
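
To make the points in the list above concrete, here is a minimal Python 3 sketch of how a caller might handle them before handing lines to a robots.txt parser. It is not the code from the linked page, which is not reproduced here; the names `RobotsUnavailable`, `check_status`, `decode_robots` and `expiry_from_headers` are hypothetical, and the sketch assumes the file is UTF-8 (only a UTF-8 BOM is stripped).

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Mapping, Optional


class RobotsUnavailable(Exception):
    """Raised for 5xx responses so the caller can decide whether to retry."""


def check_status(status: int) -> None:
    # Pass server errors such as 503 up to the caller instead of guessing.
    if 500 <= status < 600:
        raise RobotsUnavailable("robots.txt fetch failed with HTTP %d" % status)


def decode_robots(raw: bytes) -> List[str]:
    # "utf-8-sig" drops a leading UTF-8 BOM if present; bytes that cannot
    # be decoded become U+FFFD instead of raising an exception.
    return raw.decode("utf-8-sig", errors="replace").splitlines()


def expiry_from_headers(headers: Mapping[str, str]) -> Optional[datetime]:
    # Return the Expires time, or None if it is missing or malformed.
    value = headers.get("Expires")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
```

The decoded lines can be passed straight to a parser's `parse()` method, and the expiry time gives a spider a reason to refetch robots.txt rather than cache it indefinitely.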

Cheers

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Apr 22 '07 #5
Steven Bethard wrote:
> John Nagle wrote:
>> Terry Reedy wrote:
>>> "John Nagle" <na***@animats.com> wrote in message
>>> news:Fv******************@newssvr29.news.prodigy.net...
>>> | This was reported in 2003, and a patch was uploaded in 2005, but
>>> | the patch never made it into Python 2.4 or 2.5.
>>>
>>> If the patch is still open, perhaps you could review it.
>>
>> I tried it on Python 2.4 and it's in our production system now.
>> But someone who regularly does check-ins should do this.
>
> If you post such a review (even just the short sentence above) to the
> patch tracker, it often increases the chance of someone committing the
> patch.
>
> Steve

OK, updated the tracker comments.

John Nagle
Apr 22 '07 #6
