Spider - path conflict [../test.htm,www.nic.nl/index.html]

Hi!

I thought I was done with my own spider...
But then I found a bug, or in other words a missing part in my code.

I forgot that people do this in website HTML:
<a href="http://www.nic.nl/monkey.html">is oke</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>

So now I'm trying to fix my spider, but it keeps failing.
I tried something like this.

------------------------
import urlparse
import string

def fixPath(urlpath, deep):
    # Keep only the leading directory parts of the path (no file names).
    path = ''
    test = urlpath.split('/', deep)
    for this in test:
        if this != '' and this.count('.') == 0:
            path = path + '/' + this
    return path

def fixUrl2(src, url):
    # src is the link that was found, url is the page it was found in.
    url = urlparse.urlparse('http://' + url)
    src = urlparse.urlparse('http://' + src)

    if url[2]:
        thepath = fixPath(url[2], url[2].count('/') - (src[2].count('/')))

    if src[1] == '..':
        if url[1] != '':
            theurl = url[1] + '' + thepath + '' + src[2].replace('../', '')

            print theurl

fixUrl2('../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../monkey2.html','www.nic.nl/info/monkey1.html')
------------------------

Info: the arguments are
fixUrl2('a new link found', 'the page it was found in')

I hope someone knows of proper, working code for this.
Thanks a lot,
GC-Martijn

Jul 18 '05 #1
I think you want urllib.basejoin().

>>> import urllib
>>> urllib.basejoin("http://www.example.com/test/page.html", "otherpage.html")
'http://www.example.com/test/otherpage.html'
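
For anyone reading this on Python 3: urllib.basejoin was a Python 2 alias for urlparse.urljoin, and the same function is now urllib.parse.urljoin. Below is a rough sketch of how the three test cases from the original post might look with it. The helper name resolve_link is made up for illustration, and prepending 'http://' is an assumption taken from the original code, which passes pages without a scheme. Note that the argument order is (page, link), the reverse of fixUrl2.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve a link found in a page against that page's URL.

    Hypothetical helper, not part of any library. It assumes that a page
    URL without a scheme (e.g. 'www.nic.nl/index.html') means http.
    """
    if "://" not in page_url:
        page_url = "http://" + page_url
    # urljoin handles '../' segments and leaves absolute links untouched.
    return urljoin(page_url, href)


# The three cases from the original post.
print(resolve_link("www.nic.nl/test/info/monkey1.html", "../monkey2.html"))
print(resolve_link("www.nic.nl/test/info/monkey1.html", "../../monkey2.html"))
print(resolve_link("www.nic.nl/info/monkey1.html", "../monkey2.html"))
```

The first call should give http://www.nic.nl/test/monkey2.html and the other two http://www.nic.nl/monkey2.html, so no hand-written path counting is needed.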


Jul 18 '05 #2

martijn> I thought I was done with my own spider... But then I found
martijn> a bug, or in other words a missing part in my code.

martijn> I forgot that people do this in website HTML:
martijn> <a href="http://www.nic.nl/monkey.html">is oke</a>
martijn> <a href="../monkey.html">error</a>
martijn> <a href="../../monkey.html">error</a>

pydoc urlparse.urljoin
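
In a spider, the join would typically be applied to every href as it is pulled out of the page. A minimal sketch of that, assuming Python 3 (where the function lives in urllib.parse) and the standard library's html.parser; the class name LinkCollector and the sample page URL are hypothetical and only serve the illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL.

    Hypothetical helper for illustration; page_url is the URL of the
    page whose HTML is being fed in.
    """

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links are resolved against the page URL;
                # absolute links come back unchanged.
                self.links.append(urljoin(self.page_url, value))


# The three links from the question, as if found in this (assumed) page.
html = """
<a href="http://www.nic.nl/monkey.html">is oke</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>
"""

collector = LinkCollector("http://www.nic.nl/test/info/monkey1.html")
collector.feed(html)
for link in collector.links:
    print(link)
```

A real spider would also want to de-duplicate the collected links and skip non-HTTP schemes such as mailto:, which this sketch leaves out.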

Skip
Jul 18 '05 #3
urllib.basejoin(), that's what I need :)

Haha, what silly code I wrote.

Thanks
GC-Martijn

Jul 18 '05 #4
