Spider - path conflict [../test.htm,www.nic.nl/index.html]

martijn

Hi!

I thought I was done with my own spider...
But then I found a bug, or rather a missing piece in my code.

I forgot that people do this in a website's HTML:
<a href="http://www.nic.nl/monkey.html">is oke</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>

So now I'm trying to fix my spider, but it keeps failing.
I tried something like this:

------------------------
import urlparse
import string

def fixPath(urlpath,deep):
    path=''
    test = urlpath.split('/',deep)
    for this in test:
        if this<>'' and this.count('.')==0:
            path=path+'/'+this
    return path

def fixUrl2(src,url):
    url = urlparse.urlparse('http://'+url)
    src = urlparse.urlparse('http://'+src)

    if url[2]:
        thepath = fixPath(url[2],url[2].count('/')-(src[2].count('/')))

    if src[1] == '..':
        if url[1]<>'':
            theurl = url[1]+''+thepath+''+src[2].replace('../','')

    print theurl

fixUrl2('../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../monkey2.html','www.nic.nl/info/monkey1.html')
------------------------

Info, the arguments are:
fixUrl2('a new link found','the page it was found in')

I hope someone knows of solid, working code for this.
Thanks a lot,
GC-Martijn

Jul 18 '05 #1


Jeff Epler

I think you want urllib.basejoin().

>>> urllib.basejoin("http://www.example.com/test/page.html", "otherpage.html")
'http://www.example.com/test/otherpage.html'
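For readers on Python 3: urllib.basejoin() was the Python 2 alias for urlparse.urljoin(), and the same function now lives at urllib.parse.urljoin(). Below is a minimal sketch applying it to the three calls from the question. The helper name resolve_link and the choice to prepend "http://" when the page URL has no scheme are assumptions made for illustration, not part of the original posts.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Return the absolute URL for href as found on page_url.

    Hypothetical helper: the name and the scheme handling are assumptions.
    """
    # Assumption: page URLs given without a scheme are plain HTTP,
    # mirroring the 'http://' + url step in the question's code.
    if "://" not in page_url:
        page_url = "http://" + page_url
    # urljoin resolves '../' segments relative to the page's directory.
    return urljoin(page_url, href)


print(resolve_link("www.nic.nl/test/info/monkey1.html", "../monkey2.html"))
# http://www.nic.nl/test/monkey2.html
print(resolve_link("www.nic.nl/test/info/monkey1.html", "../../monkey2.html"))
# http://www.nic.nl/monkey2.html
print(resolve_link("www.nic.nl/info/monkey1.html", "../monkey2.html"))
# http://www.nic.nl/monkey2.html
```

Note the argument order: the page URL comes first and the link second, the reverse of fixUrl2(src, url) in the question.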


Jul 18 '05 #2

Skip Montanaro

martijn> I thought I was ready with my own spider... But then there was
martijn> a bug, or in other words a missing part in my code.

martijn> I forget that people do this in website html:
martijn> <a href="http://www.nic.nl/monkey.html">is oke</a>
martijn> <a href="../monkey.html">error</a>
martijn> <a href="../../monkey.html">error</a>

pydoc urlparse.urljoin
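As a rough sketch of how urljoin could slot into a spider, the snippet below collects the href values from a page and normalizes each one against the page's own URL. It is written for Python 3, where the function lives in urllib.parse. The LinkCollector class, the extract_links function, and the sample page URL are hypothetical names chosen for illustration. Only the three anchor tags come from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href=...> tags (hypothetical class)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Absolute hrefs pass through unchanged; relative ones
                # are resolved against the URL of the page being parsed.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link in html as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# The three anchors from the question, on an assumed page URL.
page = """<a href="http://www.nic.nl/monkey.html">is oke</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>"""

for link in extract_links("http://www.nic.nl/test/info/page.html", page):
    print(link)
```

With the assumed page URL this prints http://www.nic.nl/monkey.html, http://www.nic.nl/test/monkey.html and http://www.nic.nl/monkey.html, so the spider sees one canonical form for every link.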

Skip

Jul 18 '05 #3

martijn

urllib.basejoin(), that's what I need :)

Haha, what stupid code I wrote.

Thanks
GC-Martijn

Jul 18 '05 #4
