
Python "robots.txt" parser broken since 2003

This bug, "[ 813986 ] robotparser interactively prompts for username and
password", has been open since 2003. It killed a big batch job of ours
last night.

Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
If the server asks for basic authentication on that file, "robotparser"
prompts for the password on standard input. Which is rarely what you
want. You can demonstrate this with:

import robotparser
url = 'http://mueblesmoraleda.com' # this site is password-protected.
parser = robotparser.RobotFileParser()
parser.set_url(url)
parser.read() # Prompts for password

That's the standard, although silly, "urllib" behavior.

This was reported in 2003, and a patch was uploaded in 2005, but the patch
never made it into Python 2.4 or 2.5.

A temporary workaround is this:

import robotparser
def prompt_user_passwd(self, host, realm):
    # Never ask on stdin; returning no credentials makes the request fail instead.
    return None, None
robotparser.URLopener.prompt_user_passwd = prompt_user_passwd # temp patch
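
Another way to keep a batch job away from the prompt is to fetch "robots.txt" with your own HTTP code and hand the text to the parser through its parse() method, so the parser never opens a URL itself. The following is only a sketch under assumptions: it uses the Python 3 module name (urllib.robotparser) rather than the Python 2 one discussed above, the helper name rules_from_text and the sample rules are made up for illustration, and the fetching step is left out entirely.

```python
from urllib import robotparser


def rules_from_text(text):
    """Build a parser from robots.txt text that was fetched elsewhere.

    Hypothetical helper: no network access happens here, so nothing
    in this function can prompt for a password.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Made-up rules, standing in for a file downloaded by other code.
sample = "User-agent: *\nDisallow: /private/\n"
parser = rules_from_text(sample)
print(parser.can_fetch("mybot", "http://example.com/private/page.html"))
print(parser.can_fetch("mybot", "http://example.com/index.html"))
```

With this split, how to react to a 401 or a timeout is decided by the fetching code rather than by the parser.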
John Nagle
Apr 21 '07 #1


"John Nagle" <na***@animats.com> wrote in message
news:Fv******************@newssvr29.news.prodigy.net...
| This was reported in 2003, and a patch was uploaded in 2005, but the
| patch never made it into Python 2.4 or 2.5.

If the patch is still open, perhaps you could review it.

tjr

Apr 22 '07 #2

Terry Reedy wrote:
> "John Nagle" <na***@animats.com> wrote in message
> news:Fv******************@newssvr29.news.prodigy.net...
> | This was reported in 2003, and a patch was uploaded in 2005, but the
> | patch never made it into Python 2.4 or 2.5.
>
> If the patch is still open, perhaps you could review it.

I tried it on Python 2.4 and it's in our production system now.
But someone who regularly does check-ins should do this.

John Nagle
Apr 22 '07 #3

John Nagle wrote:
> Terry Reedy wrote:
>> "John Nagle" <na***@animats.com> wrote in message
>> news:Fv******************@newssvr29.news.prodigy.net...
>> | This was reported in 2003, and a patch was uploaded in 2005, but the
>> | patch never made it into Python 2.4 or 2.5.
>>
>> If the patch is still open, perhaps you could review it.
>
> I tried it on Python 2.4 and it's in our production system now.
> But someone who regularly does check-ins should do this.

If you post such a review (even just the short sentence above) to the
patch tracker, it often increases the chance of someone committing the
patch.

Steve
Apr 22 '07 #4

In article <Fv******************@newssvr29.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> This bug, "[ 813986 ] robotparser interactively prompts for username and
> password", has been open since 2003. It killed a big batch job of ours
> last night.
>
> Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
> If the server asks for basic authentication on that file, "robotparser"
> prompts for the password on standard input. Which is rarely what you
> want. You can demonstrate this with:
>
> import robotparser
> url = 'http://mueblesmoraleda.com' # this site is password-protected.
> parser = robotparser.RobotFileParser()
> parser.set_url(url)
> parser.read() # Prompts for password
>
> That's the standard, although silly, "urllib" behavior.

John,
robotparser is (IMO) suboptimal in a few other ways, too.
- It doesn't handle non-ASCII characters. (They're infrequent but when
writing a spider which sees thousands of robots.txt files in a short
time, "infrequent" can become "daily").
- It doesn't account for BOMs in robots.txt (which are rare).
- It ignores any Expires header sent with the robots.txt
- It handles some ambiguous return codes (e.g. 503) that it ought to
pass up to the caller.
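
The BOM and non-ASCII items can be handled before the parser ever sees the file, by decoding the raw bytes yourself. A rough sketch only, assuming Python 3 and assuming the file is meant to be UTF-8 (robots.txt files in the wild may not be); the helper name decode_robots and the sample bytes are invented for illustration:

```python
import codecs
from urllib import robotparser


def decode_robots(raw):
    """Decode raw robots.txt bytes, dropping a leading UTF-8 BOM if present.

    Assumption: the content is UTF-8. Undecodable bytes are replaced
    rather than raising, so one odd file cannot stop a whole crawl.
    """
    return raw.decode("utf-8-sig", errors="replace")


# Made-up file: a BOM, a comment with a non-ASCII character, one rule.
raw = codecs.BOM_UTF8 + "# caf\u00e9 rules\nUser-agent: *\nDisallow: /private/\n".encode("utf-8")

parser = robotparser.RobotFileParser()
parser.parse(decode_robots(raw).splitlines())
print(parser.can_fetch("mybot", "http://example.com/private/x"))
```

The "utf-8-sig" codec is used because a BOM left in place would otherwise end up glued to the first token of the first line.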

I wrote my own parser to address these problems. It probably suffers
from the same urllib hang that you've found (I have not encountered it
myself) and I appreciate you posting a fix. Here's the code &
documentation in case you're interested:
http://NikitaTheSpider.com/python/rerp/

Cheers

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Apr 22 '07 #5

Steven Bethard wrote:
> John Nagle wrote:
>> Terry Reedy wrote:
>>> "John Nagle" <na***@animats.com> wrote in message
>>> news:Fv******************@newssvr29.news.prodigy.net...
>>> | This was reported in 2003, and a patch was uploaded in 2005, but
>>> | the patch never made it into Python 2.4 or 2.5.
>>>
>>> If the patch is still open, perhaps you could review it.
>>
>> I tried it on Python 2.4 and it's in our production system now.
>> But someone who regularly does check-ins should do this.
>
> If you post such a review (even just the short sentence above) to the
> patch tracker, it often increases the chance of someone committing the
> patch.
>
> Steve

OK, updated the tracker comments.

John Nagle
Apr 22 '07 #6
