Problem with Python's "robots.txt" file parser in module robotparser

Python's "robots.txt" file parser may be misinterpreting a
special case. Given a robots.txt file like this:

User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
...

the Python library "robotparser.RobotFileParser()" considers all pages of the
site to be disallowed. Apparently "Disallow: //" is being interpreted as
"Disallow: /". Even the home page of the site is locked out. This may be incorrect.

This is the robots.txt file for "http://ibm.com".
Some IBM operating systems recognize filenames starting with "//"
as a special case, something like a network root, so the entry may
be there to deal with a problem like that.

The spec for "robots.txt", at

http://www.robotstxt.org/wc/norobots.html

says "Disallow: The value of this field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL that starts with
this value will not be retrieved." That suggests that "//" should only disallow
paths beginning with "//".
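
For what it's worth, here is a minimal way to check how robotparser treats these
rules without touching the network, by feeding them straight to parse(). (The
trimmed rule set and the "TestBot" agent name are just for illustration.)

import robotparser

rules = """\
User-agent: *
Disallow: //
Disallow: /account/registration
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Per the spec quoted above, only URLs whose path starts with "//"
# should be blocked by "Disallow: //".
print rp.can_fetch("TestBot", "http://www.ibm.com/")             # home page
print rp.can_fetch("TestBot", "http://www.ibm.com//foo.html")    # double-slash path
print rp.can_fetch("TestBot", "http://www.ibm.com/account/registration")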

John Nagle
SiteTruth
Jul 11 '07 #1
In article <0T*****************@newssvr19.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> Python's "robots.txt" file parser may be misinterpreting a
> special case. Given a robots.txt file like this:
>
> User-agent: *
> Disallow: //
> Disallow: /account/registration
> Disallow: /account/mypro
> Disallow: /account/myint
> ...
>
> the python library "robotparser.RobotFileParser()" considers all pages of the
> site to be disallowed. Apparently "Disallow: //" is being interpreted as
> "Disallow: /". Even the home page of the site is locked out. This may be
> incorrect.
>
> This is the robots.txt file for "http://ibm.com".

Hi John,
Are you sure you're not confusing your sites? The robots.txt file at
www.ibm.com contains the double slashed path. The robots.txt file at
ibm.com is different and contains this which would explain why you
think all URLs are denied:
User-agent: *
Disallow: /

I don't see the bug to which you're referring:
>>> import robotparser
>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.ibm.com/robots.txt")
>>> r.read()
>>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")
1
>>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")
0
>>>

I'll use this opportunity to shamelessly plug an alternate robots.txt
parser that I wrote to address some small bugs in the parser in the
standard library.
http://NikitaTheSpider.com/python/rerp/

Cheers

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 11 '07 #2
Nikita the Spider wrote:
> Hi John,
> Are you sure you're not confusing your sites? The robots.txt file at
> www.ibm.com contains the double slashed path. The robots.txt file at
> ibm.com is different and contains this which would explain why you
> think all URLs are denied:
> User-agent: *
> Disallow: /

Ah, that's it. The problem is that "ibm.com" redirects to
"http://www.ibm.com", but "ibm.com/robots.txt" does not
redirect. For comparison, try "microsoft.com/robots.txt",
which does redirect.
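
A quick way to see which of the two URLs actually redirects (a throwaway
sketch; urllib2 follows redirects by default, and geturl() reports the final
address):

import urllib2

for url in ("http://ibm.com/robots.txt", "http://www.ibm.com/robots.txt"):
    f = urllib2.urlopen(url)
    # geturl() gives the URL actually retrieved, after any redirects.
    print url, "->", f.geturl()
    f.close()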

John Nagle
Jul 11 '07 #3
In article <IE******************@newssvr23.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> Nikita the Spider wrote:
>> Hi John,
>> Are you sure you're not confusing your sites? The robots.txt file at
>> www.ibm.com contains the double slashed path. The robots.txt file at
>> ibm.com is different and contains this which would explain why you
>> think all URLs are denied:
>> User-agent: *
>> Disallow: /
>
> Ah, that's it. The problem is that "ibm.com" redirects to
> "http://www.ibm.com", but "ibm.com/robots.txt" does not
> redirect. For comparison, try "microsoft.com/robots.txt",
> which does redirect.

Strange thing for them to do, isn't it? Especially with two such
different robots.txt files.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 12 '07 #4
Nikita the Spider wrote:
> In article <IE******************@newssvr23.news.prodigy.net>,
> John Nagle <na***@animats.com> wrote:
>> Nikita the Spider wrote:
>>> Hi John,
>>> Are you sure you're not confusing your sites? The robots.txt file at
>>> www.ibm.com contains the double slashed path. The robots.txt file at
>>> ibm.com is different and contains this which would explain why you
>>> think all URLs are denied:
>>> User-agent: *
>>> Disallow: /
>>
>> Ah, that's it. The problem is that "ibm.com" redirects to
>> "http://www.ibm.com", but "ibm.com/robots.txt" does not
>> redirect. For comparison, try "microsoft.com/robots.txt",
>> which does redirect.
>
> Strange thing for them to do, isn't it? Especially with two such
> different robots.txt files.

I asked over at Webmaster World, and they recommend against using
redirects on robots.txt files, because they question whether all of
the major search engines understand them. Does a redirect for
"foo.com/robots.txt" mean that the robots.txt file applies to the domain
being redirected from, or the domain being redirected to?

John Nagle
Jul 13 '07 #5
In article <yP*******************@newssvr21.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> I asked over at Webmaster World, and they recommend against using
> redirects on robots.txt files, because they question whether all of
> the major search engines understand them. Does a redirect for
> "foo.com/robots.txt" mean that the robots.txt file applies to the domain
> being redirected from, or the domain being redirected to?

Good question. I'd guess the latter, but it's a little ambiguous. I
agree that redirecting a request for robots.txt is probably not a good
idea. Given that the robots.txt standard isn't as standard as it could
be, I think it's a good idea in general to apply the KISS principle when
dealing with things robots.txt-y.
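
For a crawler that has to live with such redirects anyway, one conservative
reading (just a sketch of the idea, nothing the robots.txt "standard" itself
mandates; robots_for() and "TestBot" are made-up names) is to note where the
robots.txt request ends up and treat the rules as covering both hosts:

import urllib2
import urlparse
import robotparser

def robots_for(start_url):
    """Fetch robots.txt and return a parser plus the hosts it might govern."""
    f = urllib2.urlopen(start_url)          # follows redirects by default
    final_url = f.geturl()                  # URL actually retrieved
    rp = robotparser.RobotFileParser()
    rp.parse(f.read().splitlines())
    f.close()
    hosts = set([urlparse.urlparse(start_url)[1],
                 urlparse.urlparse(final_url)[1]])
    return rp, hosts

rp, hosts = robots_for("http://ibm.com/robots.txt")
print hosts
print rp.can_fetch("TestBot", "http://www.ibm.com/foo.html")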

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 13 '07 #6
