Having problems with urlparser concatenation

i80and

I'm working on a basic web spider, and I'm having problems with the
urlparser.
This is the affected function:
------------------------------
def FindLinks(Website):
    WebsiteLen = len(Website)+1
    CurrentLink = ''
    i = 0
    SpliceStart = 0
    SpliceEnd = 0

    LinksString = ""
    LinkQueue = open('C:/LinkQueue.txt', 'a')

    while (i < WebsiteLen) and (i != -1):

        #Debugging info
        #print '-----'
        #print 'Length = ' + str(WebsiteLen)
        #print 'SpliceStart = ' + str(SpliceStart)
        #print 'SpliceEnd = ' + str(SpliceEnd)
        #print 'i = ' + str(i)

        SpliceStart = Website.find('<a href="', (i+1))
        SpliceEnd = (Website.find('">', SpliceStart))

        ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
        robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
        robotparser.read()
        if (robotparser.can_fetch("*", (Website[SpliceStart+9:(SpliceEnd+1)])) == False):
            i = i - 1
        else:
            LinksString = LinksString + "\n" + (Website[SpliceStart+9:(SpliceEnd+1)])
            LinksString = LinksString[:(len(LinksString) - 1)]
            #print 'found ' + LinksString
            i = SpliceEnd

    LinkQueue.write(LinksString)
    LinkQueue.close()
------------------------------
Sorry if it's uncommented. When I run my program, I get this error:
-----
Traceback (most recent call last):
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
    FindLinks(Website)
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
    robotparser.read()
  File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
    f = opener.open(self.url)
  File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
    return self.open_local_file(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'

Note the last line 'en.wikipedia.org\\robots.txt'. I want
'en.wikipedia.org/robots.txt'! What am I doing wrong?

If this has been answered before, please just give me a link to the
proper thread. If you need more contextual code, I can post more.

Nov 9 '06 #1


Marc 'BlackJack' Rintsch

i80and wrote:

>     return self.open_local_file(url)
>   File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
>     raise IOError(e.errno, e.strerror, e.filename)
> IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'
>
> Note the last line 'en.wikipedia.org\\robots.txt'. I want
> 'en.wikipedia.org/robots.txt'! What am I doing wrong?

You don't have that file on your local computer. :-)

If you look at the messages above you'll see there's a function
`open_local_file()` involved. This function is chosen by `urllib` because
your path looks like a local file, i.e. it lacks the protocol information.
You don't want 'en.wikipedia.org/robots.txt', you want
'http://en.wikipedia.org/robots.txt'!
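
A minimal sketch of why the missing protocol matters, assuming a modern Python 3 where `urlparse` lives in `urllib.parse` (the thread itself uses Python 2.5's `urlparse` module). The helper name `has_scheme` is made up for illustration and is not part of any library.

```python
from urllib.parse import urlparse


def has_scheme(url):
    """Return True if the URL names a protocol such as http (hypothetical helper)."""
    return urlparse(url).scheme != ""


bare = urlparse("en.wikipedia.org/robots.txt")
full = urlparse("http://en.wikipedia.org/robots.txt")

# Without a scheme, the whole string is parsed as a path and no host is found.
print(repr(bare.scheme), repr(bare.netloc), repr(bare.path))

# With the scheme, host and path are separated as intended.
print(repr(full.scheme), repr(full.netloc), repr(full.path))
```

A check like `has_scheme(url)` before handing a URL to a fetcher is one way to catch this kind of mistake early.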

Ciao,
Marc 'BlackJack' Rintsch

Nov 9 '06 #2

Gabriel Genellina

At Thursday 9/11/2006 20:23, i80and wrote:

> I'm working on a basic web spider, and I'm having problems with the
> urlparser.
> [...]
>     SpliceStart = Website.find('<a href="', (i+1))
>     SpliceEnd = (Website.find('">', SpliceStart))
>
>     ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
>     robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
> -----
> Traceback (most recent call last):
>   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
>     FindLinks(Website)
>   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
>     robotparser.read()
>   File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
>     f = opener.open(self.url)
>   File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
>     return getattr(self, name)(url)
>   File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
>     return self.open_local_file(url)
>   File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
>     raise IOError(e.errno, e.strerror, e.filename)
> IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'
>
> Note the last line 'en.wikipedia.org\\robots.txt'. I want
> 'en.wikipedia.org/robots.txt'! What am I doing wrong?

No, you don't want 'en.wikipedia.org/robots.txt'; you want
'http://en.wikipedia.org/robots.txt'.
urllib treats the former as a file: request, hence the \\ in the
normalized path.
You are parsing the link and then building a new URI using ONLY the
hostname part; that's wrong. Use urljoin() on the original link string
instead, e.g. urljoin(link, '/robots.txt'); urljoin expects a URL
string, not the parsed result.
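
A minimal sketch of the urljoin approach, assuming Python 3 module names (`urllib.parse` and `urllib.robotparser`, where the thread's Python 2.5 code uses `urlparse` and `robotparser`). The function name `robots_url` and the sample link are illustrative only. Note that `urljoin` is given the link as a string.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(link):
    """Build the robots.txt URL for the site that `link` points to (hypothetical helper)."""
    # The leading slash makes urljoin replace the whole path while
    # keeping the scheme and host of the original link.
    return urljoin(link, "/robots.txt")


link = "http://en.wikipedia.org/wiki/Main_Page"
print(robots_url(link))

parser = RobotFileParser()
parser.set_url(robots_url(link))
# parser.read() would fetch the file over the network, so it is left out here.
```

Because the scheme is carried over from the link itself, the same helper should also work for https links without any change.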

You may try Beautiful Soup for better HTML parsing.
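
Beautiful Soup is a third-party package, so to stay self-contained the sketch below swaps in the standard library's `html.parser` (Python 3) instead. It is not the library recommended above, only an illustration of the same idea: let a real HTML parser find the links rather than searching for the literal text `<a href="`. The class name `LinkCollector` and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


# Attribute order and quote style vary, which trips up plain string searching.
page = '<p><a class="x" href="http://example.org/a">A</a> <a href=\'/b\'>B</a></p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Relative links such as `/b` would still need to be resolved against the page's own URL before being queued.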

--
Gabriel Genellina
Softlab SRL


Nov 9 '06 #3

i80and

Thank you! Fixed my problem perfectly!

Nov 10 '06 #4
