
Spider - path conflict [../test.htm,www.nic.nl/index.html]

Hi!

I thought I was done with my own spider...
But then I found a bug, or rather a missing piece in my code.

I forgot that people write links like this in website HTML:
<a href="http://www.nic.nl/monkey.html">is oke</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>

So now I'm trying to fix my spider, but it keeps failing.
I tried something like this.

------------------------
import urlparse
import string

def fixPath(urlpath,deep):
    path=''
    test = urlpath.split('/',deep)
    for this in test:
        if this<>'' and this.count('.')==0:
            path=path+'/'+this
    return path

def fixUrl2(src,url):
    url = urlparse.urlparse('http://'+url)
    src = urlparse.urlparse('http://'+src)

    if url[2]:
        thepath = fixPath(url[2],url[2].count('/')-(src[2].count('/')))

    if src[1] == '..':
        if url[1]<>'':
            theurl = url[1]+''+thepath+''+src[2].replace('../','')

    print theurl

fixUrl2('../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../monkey2.html','www.nic.nl/info/monkey1.html')
------------------------

Usage:
fixUrl2('a new link found','in this page')

I hope someone knows of robust, working code for this.
Thanks a lot,
GC-Martijn

Jul 18 '05 #1
I think you want urllib.basejoin():

>>> urllib.basejoin("http://www.example.com/test/page.html", "otherpage.html")
'http://www.example.com/test/otherpage.html'
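
For anyone reading this on Python 3: the same join is available as urllib.parse.urljoin (urllib.basejoin was a Python 2 alias for urlparse.urljoin). Below is a minimal sketch applying it to the three cases from the original post. The resolve() helper and the added "http://" scheme are assumptions for illustration, not part of the original code.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Return the absolute URL for an href found on page_url (hypothetical helper)."""
    return urljoin(page_url, href)


# Pages from the original post, with an explicit scheme added (assumption).
page_a = "http://www.nic.nl/test/info/monkey1.html"
page_b = "http://www.nic.nl/info/monkey1.html"

print(resolve(page_a, "../monkey2.html"))     # http://www.nic.nl/test/monkey2.html
print(resolve(page_a, "../../monkey2.html"))  # http://www.nic.nl/monkey2.html
print(resolve(page_b, "../monkey2.html"))     # http://www.nic.nl/monkey2.html

# An href that is already absolute is returned unchanged.
print(resolve(page_a, "http://www.nic.nl/monkey.html"))
```

Note that the base URL needs its scheme: without "http://", the host name is treated as part of the path and the join will not give the result you expect.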


Jul 18 '05 #2

martijn> I thought I was ready with my own spider... But then there was
martijn> a bug, or in other words a missing part in my code.

martijn> I forget that people do this in website html:
martijn> <a href="http://www.nic.nl/monkey.html">is oke</a>
martijn> <a href="../monkey.html">error</a>
martijn> <a href="../../monkey.html">error</a>

pydoc urlparse.urljoin
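
In a spider, the join would typically be applied to every href as it is extracted. Here is a rough sketch of that idea for Python 3, where the function is urllib.parse.urljoin. The LinkCollector class and the index.html page URL are hypothetical, chosen only to make the example self-contained; the anchor tags are the ones quoted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url  # URL of the page being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs such as ../monkey.html against the page URL.
                self.links.append(urljoin(self.page_url, value))


html = """
<a href="http://www.nic.nl/monkey.html">is oke</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>
"""

# The page URL is an assumption for this example.
collector = LinkCollector("http://www.nic.nl/test/info/index.html")
collector.feed(html)
for link in collector.links:
    print(link)
```

Every link comes out absolute, so the rest of the spider never has to deal with "../" itself.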

Skip
Jul 18 '05 #3
urllib.basejoin() is exactly what I need :)

Haha, what silly code I wrote.

Thanks
GC-Martijn

Jul 18 '05 #4
