
Having problems with urlparser concatenation

I'm working on a basic web spider, and I'm having problems with the
urlparser.
This is the affected function:
------------------------------
def FindLinks(Website):
    WebsiteLen = len(Website)+1
    CurrentLink = ''
    i = 0
    SpliceStart = 0
    SpliceEnd = 0

    LinksString = ""
    LinkQueue = open('C:/LinkQueue.txt', 'a')

    while (i < WebsiteLen) and (i != -1):

        #Debugging info
        #print '-----'
        #print 'Length = ' + str(WebsiteLen)
        #print 'SpliceStart = ' + str(SpliceStart)
        #print 'SpliceEnd = ' + str(SpliceEnd)
        #print 'i = ' + str(i)

        SpliceStart = Website.find('<a href="', (i+1))
        SpliceEnd = (Website.find('">', SpliceStart))

        ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
        robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
        robotparser.read()
        if (robotparser.can_fetch("*", (Website[SpliceStart+9:(SpliceEnd+1)])) == False):
            i = i - 1
        else:
            LinksString = LinksString + "\n" + (Website[SpliceStart+9:(SpliceEnd+1)])
            LinksString = LinksString[:(len(LinksString) - 1)]
            #print 'found ' + LinksString
        i = SpliceEnd

    LinkQueue.write(LinksString)
    LinkQueue.close()
------------------------------
Sorry if it's uncommented. When I run my program, I get this error:
-----
Traceback (most recent call last):
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
    FindLinks(Website)
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
    robotparser.read()
  File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
    f = opener.open(self.url)
  File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
    return self.open_local_file(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'

Note the last line 'en.wikipedia.org\\robots.txt'. I want
'en.wikipedia.org/robots.txt'! What am I doing wrong?

If this has been answered before, please just give me a link to the
proper thread. If you need more contextual code, I can post more.

Nov 9 '06 #1
3 Replies


In <11**********************@b28g2000cwb.googlegroups .com>, i80and wrote:
> return self.open_local_file(url)
> File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
> raise IOError(e.errno, e.strerror, e.filename)
> IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'
>
> Note the last line 'en.wikipedia.org\\robots.txt'. I want
> 'en.wikipedia.org/robots.txt'! What am I doing wrong?

You don't have that file on your local computer. :-)

If you look at the messages above you'll see there's a function
`open_local_file()` involved. This function is chosen by `urllib` because
your path looks like a local file, i.e. it lacks the protocol information.
You don't want 'en.wikipedia.org/robots.txt', you want
'http://en.wikipedia.org/robots.txt'!
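
The difference between the two strings shows up as soon as they are split into URL components. The following is only a minimal sketch, assuming a modern Python 3 interpreter where `urlparse` lives in `urllib.parse` (in the Python 2.5 used in this thread it is the separate `urlparse` module). `has_network_scheme` is a hypothetical helper name chosen for illustration, not part of any library.

```python
from urllib.parse import urlparse


def has_network_scheme(url):
    """Return True if the URL explicitly names the http or https scheme."""
    return urlparse(url).scheme in ("http", "https")


bare = "en.wikipedia.org/robots.txt"
full = "http://en.wikipedia.org/robots.txt"

# Without a scheme, the whole string is parsed as a path and the host is empty,
# which is why it ends up being handled like a local file name.
bare_parts = urlparse(bare)
print(repr(bare_parts.scheme), repr(bare_parts.netloc), repr(bare_parts.path))

# With the scheme, the host and the path are separated as intended.
full_parts = urlparse(full)
print(repr(full_parts.scheme), repr(full_parts.netloc), repr(full_parts.path))

print(has_network_scheme(bare), has_network_scheme(full))
```

A check like this could be run before handing a URL to the robots parser, so that scheme-less strings are caught early instead of surfacing as an IOError.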

Ciao,
Marc 'BlackJack' Rintsch
Nov 9 '06 #2

At Thursday 9/11/2006 20:23, i80and wrote:
> I'm working on a basic web spider, and I'm having problems with the
> urlparser.
> [...]
> SpliceStart = Website.find('<a href="', (i+1))
> SpliceEnd = (Website.find('">', SpliceStart))
>
> ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
> robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
> -----
> Traceback (most recent call last):
>   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
>     FindLinks(Website)
>   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
>     robotparser.read()
>   File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
>     f = opener.open(self.url)
>   File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
>     return getattr(self, name)(url)
>   File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
>     return self.open_local_file(url)
>   File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
>     raise IOError(e.errno, e.strerror, e.filename)
> IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'
>
> Note the last line 'en.wikipedia.org\\robots.txt'. I want
> 'en.wikipedia.org/robots.txt'! What am I doing wrong?

No, you don't want 'en.wikipedia.org/robots.txt'; you want
'http://en.wikipedia.org/robots.txt'.
urllib treats the former as a file: request, hence the \\ in the
normalized path.
You are parsing the link and then building a new URI using ONLY the
hostname part; that's wrong. Use urljoin(link, '/robots.txt') instead,
where link is the original absolute URL string (urljoin takes strings,
not the parsed result).
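
The urljoin approach can be sketched as follows. This is only an illustration under stated assumptions: it uses the Python 3 module names (`urllib.parse`, `urllib.robotparser`) rather than the Python 2.5 ones from the thread, `robots_url_for` is a hypothetical helper name, and the sample link is made up. Note that `urljoin` is given the link as a string.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def robots_url_for(link):
    """Build the robots.txt URL for the site an absolute link points to."""
    # A leading slash replaces the whole path but keeps scheme and host.
    return urljoin(link, "/robots.txt")


link = "http://en.wikipedia.org/wiki/Web_crawler"  # sample link, for illustration

# Hostname-only concatenation drops the scheme, as in the original function.
broken = urlparse(link).hostname + "/robots.txt"
fixed = robots_url_for(link)
print(broken)
print(fixed)

parser = RobotFileParser()
parser.set_url(fixed)
# parser.read() would fetch the file over the network; it is left out here
# so the sketch runs offline.
```

This only helps for absolute links. A relative link such as '/wiki/Foo' has no scheme or host to keep, so it would first need to be joined against the URL of the page it was found on.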

You may try Beautiful Soup for better HTML parsing.
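
Beautiful Soup is a third-party package, so the sketch below swaps in the standard library's `html.parser.HTMLParser` (Python 3) to show the same idea: let a real parser find the anchors instead of searching for the literal text '<a href="'. `LinkCollector`, `find_links`, and the sample markup are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, whatever the page used.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_links(html):
    """Return all anchor hrefs in the given HTML text, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Neither anchor matches the literal text '<a href="': the first has another
# attribute before href, the second uses upper case and single quotes.
sample = (
    '<p><a class="ext" href="http://en.wikipedia.org/wiki/Web_crawler">spider</a> '
    "<A HREF='/wiki/Robots.txt'>robots</A></p>"
)
print(find_links(sample))
```

A parser-based approach also removes the need for manual index arithmetic such as the +9 offset and the trailing-quote trimming in the original function.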

--
Gabriel Genellina
Softlab SRL

Nov 9 '06 #3

Thank you! Fixed my problem perfectly!
Nov 10 '06 #4
