Bytes | Software Development & Data Engineering Community

Problem with Python's "robots.txt" file parser in module robotparser

Python's "robots.txt" file parser may be misinterpreting a
special case. Given a robots.txt file like this:

User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
...

the Python library "robotparser.RobotFileParser()" considers all pages of the
site to be disallowed. Apparently "Disallow: //" is being interpreted as
"Disallow: /". Even the home page of the site is locked out. This may be incorrect.

This is the robots.txt file for "http://ibm.com".
Some IBM operating systems treat filenames starting with "//"
as a special case (similar to a network root), so the rule may be
an attempt to guard against something like that.

The spec for "robots.txt", at

http://www.robotstxt.org/wc/norobots.html

says: "Disallow: The value of this field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL that starts with
this value will not be retrieved." That suggests "Disallow: //" should only
disallow paths beginning with "//".

John Nagle
SiteTruth
Jul 11 '07 #1
In article <0T*****************@newssvr19.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> Python's "robots.txt" file parser may be misinterpreting a
> special case. Given a robots.txt file like this:
>
> User-agent: *
> Disallow: //
> Disallow: /account/registration
> Disallow: /account/mypro
> Disallow: /account/myint
> ...
>
> the Python library "robotparser.RobotFileParser()" considers all pages of the
> site to be disallowed. Apparently "Disallow: //" is being interpreted as
> "Disallow: /". Even the home page of the site is locked out. This may be
> incorrect.
>
> This is the robots.txt file for "http://ibm.com".
Hi John,
Are you sure you're not confusing your sites? The robots.txt file at
www.ibm.com contains the double-slashed path. The robots.txt file at
ibm.com is different and contains this, which would explain why you
think all URLs are denied:

User-agent: *
Disallow: /

I don't see the bug to which you're referring:

>>> import robotparser
>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.ibm.com/robots.txt")
>>> r.read()
>>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")
1
>>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")
0
>>>
I'll use this opportunity to shamelessly plug an alternate robots.txt
parser that I wrote to address some small bugs in the parser in the
standard library.
http://NikitaTheSpider.com/python/rerp/

Cheers

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 11 '07 #2
Nikita the Spider wrote:
> Hi John,
> Are you sure you're not confusing your sites? The robots.txt file at
> www.ibm.com contains the double-slashed path. The robots.txt file at
> ibm.com is different and contains this, which would explain why you
> think all URLs are denied:
>
> User-agent: *
> Disallow: /
Ah, that's it. The problem is that "ibm.com" redirects to
"http://www.ibm.com", but "ibm.com/robots.txt" does not
redirect. For comparison, try "microsoft.com/robots.txt",
which does redirect.

John Nagle
Jul 11 '07 #3
In article <IE******************@newssvr23.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> Nikita the Spider wrote:
>> Hi John,
>> Are you sure you're not confusing your sites? The robots.txt file at
>> www.ibm.com contains the double-slashed path. The robots.txt file at
>> ibm.com is different and contains this, which would explain why you
>> think all URLs are denied:
>>
>> User-agent: *
>> Disallow: /
>
> Ah, that's it. The problem is that "ibm.com" redirects to
> "http://www.ibm.com", but "ibm.com/robots.txt" does not
> redirect. For comparison, try "microsoft.com/robots.txt",
> which does redirect.
Strange thing for them to do, isn't it? Especially with two such
different robots.txt files.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 12 '07 #4
Nikita the Spider wrote:
> In article <IE******************@newssvr23.news.prodigy.net>,
> John Nagle <na***@animats.com> wrote:
>
>> Nikita the Spider wrote:
>>
>>> Hi John,
>>> Are you sure you're not confusing your sites? The robots.txt file at
>>> www.ibm.com contains the double-slashed path. The robots.txt file at
>>> ibm.com is different and contains this, which would explain why you
>>> think all URLs are denied:
>>>
>>> User-agent: *
>>> Disallow: /
>>
>> Ah, that's it. The problem is that "ibm.com" redirects to
>> "http://www.ibm.com", but "ibm.com/robots.txt" does not
>> redirect. For comparison, try "microsoft.com/robots.txt",
>> which does redirect.
>
> Strange thing for them to do, isn't it? Especially with two such
> different robots.txt files.
I asked over at Webmaster World, where they recommend against
using redirects on robots.txt files, because they questioned whether all of
the major search engines understand them. Does a redirect for
"foo.com/robots.txt" mean that the robots.txt file applies to the domain
being redirected from, or the domain being redirected to?

John Nagle
Jul 13 '07 #5
In article <yP*******************@newssvr21.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> I asked over at Webmaster World, where they recommend against
> using redirects on robots.txt files, because they questioned whether all of
> the major search engines understand them. Does a redirect for
> "foo.com/robots.txt" mean that the robots.txt file applies to the domain
> being redirected from, or the domain being redirected to?
Good question. I'd guess the latter, but it's a little ambiguous. I
agree that redirecting a request for robots.txt is probably not a good
idea. Given that the robots.txt standard isn't as standard as it could
be, I think it's a good idea in general to apply the KISS principle when
dealing with things robots.txt-y.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 13 '07 #6


