H!
I thought I was ready with my own spider...
But then there was a bug, or in other words a missing part in my code.
I forget that people do this in website html:
<a href="http://www.nic.nl/monkey.html">is oke</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>
So now i'm trying to fix my spider but it fails and it fails.
I tryed something like this.
------------------------
import urlparse
import string
def fixPath(urlpath,deep):
path=''
test = urlpath.split('/',deep)
for this in test:
if this<>'' and this.count('.')==0:
path=path+'/'+this
return path
def fixUrl2(src,url):
url = urlparse.urlparse('http://'+url)
src = urlparse.urlparse('http://'+src)
if url[2]:
thepath = fixPath(url[2],url[2].count('/')-(src[2].count('/')))
if src[1] == '..':
if url[1]<>'':
theurl = url[1]+''+thepath+''+src[2].replace('../','')
print theurl
fixUrl2('../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../monkey2.html','www.nic.nl/info/monkey1.html')
------------------------
info:
fixUrl2('a new link found','in this page')
I hope someone knows a professional working code for this,
Thanks a lot,
GC-Martijn