urllib interpretation of URL with ".."

Here's a URL, found in a link, which gives us trouble
when we try to follow the link:

http://sportsbra.co.uk/../acatalog/shop.html

Browsers immediately turn this into

http://sportsbra.co.uk/acatalog/shop.html

and go from there, but urllib tries to open it explicitly, which
results in an HTTP error 400.

Is "urllib" wrong?

John Nagle
Jun 23 '07 #1
John Nagle wrote:
> Here's a URL, found in a link, which gives us trouble
> when we try to follow the link:
>
> http://sportsbra.co.uk/../acatalog/shop.html
>
> Browsers immediately turn this into
>
> http://sportsbra.co.uk/acatalog/shop.html
>
> and go from there, but urllib tries to open it explicitly, which
> results in an HTTP error 400.
>
> Is "urllib" wrong?

I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.
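
To check what path would end up in the request line, one can split the URL and look at the path component. A small sketch, assuming Python 3's urllib.parse (the successor to the urlparse module; that choice is mine, the thread itself uses Python 2):

```python
from urllib.parse import urlsplit

url = "http://sportsbra.co.uk/../acatalog/shop.html"
parts = urlsplit(url)

# urlsplit only splits the string; it does not resolve "." or ".." segments,
# so the ".." is still present in the path component.
request_path = parts.path
print(request_path)  # /../acatalog/shop.html
```

The path still starts with "/", so it is syntactically an abs_path, with ".." as its first segment.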

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": the browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.

Regards,
Martin
Jun 23 '07 #2
Martin v. Löwis wrote:
> John Nagle wrote:
>> Here's a URL, found in a link, which gives us trouble
>> when we try to follow the link:
>>
>> http://sportsbra.co.uk/../acatalog/shop.html
>>
>> Browsers immediately turn this into
>>
>> http://sportsbra.co.uk/acatalog/shop.html
>>
>> and go from there, but urllib tries to open it explicitly, which
>> results in an HTTP error 400.
>>
>> Is "urllib" wrong?
>
> I can't see how. HTTP 1.1 says that the parameter to the GET
> request should be an abs_path; RFC 2396 says that
> /../acatalog/shop.html is indeed an abs_path, as .. is a valid
> segment. That RFC also has a section on relative identifiers
> and normalization; it defines what .. means *in a relative path*.
>
> Section 4 is explicit about .. in absolute URIs:
> # The syntax for relative URI is a shortened form of that for absolute
> # URI, where some prefix of the URI is missing and certain path
> # components ("." and "..") have a special meaning when, and only when,
> # interpreting a relative path.
>
> Notice the "and only when": the browsers that modify the above
> URL before sending it seem to be in clear violation of
> RFC 2396.
>
> Regards,
> Martin

I think you're right. The problem is that there is apparently a de-facto
standard in browsers that any number of "../" sequences at the beginning of
the path part of a URL have no effect. Even Google seems to use that
interpretation; not only does it follow that link, it lists it in Google
without the "..".

John Nagle
Jun 23 '07 #3
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
>> Is "urllib" wrong?
>
> I can't see how. HTTP 1.1 says that the parameter to the GET
> request should be an abs_path; RFC 2396 says that
> /../acatalog/shop.html is indeed an abs_path, as .. is a valid
> segment. That RFC also has a section on relative identifiers
> and normalization; it defines what .. means *in a relative path*.
>
> Section 4 is explicit about .. in absolute URIs:
> # The syntax for relative URI is a shortened form of that for absolute
> # URI, where some prefix of the URI is missing and certain path
> # components ("." and "..") have a special meaning when, and only when,
> # interpreting a relative path.
>
> Notice the "and only when": the browsers that modify the above
> URL before sending it seem to be in clear violation of
> RFC 2396.

Section 5.2 is also relevant here. In particular:

    g) If the resulting buffer string still begins with one or more
       complete path segments of "..", then the reference is
       considered to be in error. Implementations may handle this
       error by retaining these components in the resolved path (i.e.,
       treating them as part of the final URI), by removing them from
       the resolved path (i.e., discarding relative levels above the
       root), or by avoiding traversal of the reference.

The common practice seems to be for client-side implementations to handle
this using option 2 (removing them) and servers to use option 3 (avoiding
traversal of the reference). urllib uses option 1 which is also correct but
not as useful as it might be.
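
For anyone who wants the client-side behaviour (option 2), here is a minimal sketch, assuming Python 3's urllib.parse. The helper name strip_leading_dotdot is made up for illustration; nothing like it exists in urllib. It only touches the path, so a "/../" inside a query string is left alone:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_leading_dotdot(url):
    """Drop ".." segments at the start of an absolute path (option 2).

    Hypothetical helper for illustration; not part of urllib.
    """
    parts = urlsplit(url)
    if not parts.path.startswith("/"):
        # Nothing to do for empty or relative paths.
        return url
    # Skip the empty string that precedes the leading "/".
    segments = parts.path.split("/")[1:]
    while segments and segments[0] == "..":
        segments.pop(0)
    new_path = "/" + "/".join(segments)
    return urlunsplit(parts._replace(path=new_path))


print(strip_leading_dotdot("http://sportsbra.co.uk/../acatalog/shop.html"))
# http://sportsbra.co.uk/acatalog/shop.html
```

This deliberately handles only ".." at the very start of the path; segments further along are left as they are.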

Jun 25 '07 #4
Duncan Booth wrote:
> "Martin v. Löwis" <ma****@v.loewis.de> wrote:
>
>>> Is "urllib" wrong?
>
> Section 5.2 is also relevant here. In particular:
>
>     g) If the resulting buffer string still begins with one or more
>        complete path segments of "..", then the reference is
>        considered to be in error. Implementations may handle this
>        error by retaining these components in the resolved path (i.e.,
>        treating them as part of the final URI), by removing them from
>        the resolved path (i.e., discarding relative levels above the
>        root), or by avoiding traversal of the reference.
>
> The common practice seems to be for client-side implementations to handle
> this using option 2 (removing them) and servers to use option 3 (avoiding
> traversal of the reference). urllib uses option 1 which is also correct but
> not as useful as it might be.

That's helpful. Thanks.

In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's being
used by a client or a server, so it, reasonably enough, takes option 1.

(Yet another hassle in processing real-world HTML.)

John Nagle
Jun 25 '07 #5
John Nagle <na***@animats.com> writes:
> Duncan Booth wrote:
>> "Martin v. Löwis" <ma****@v.loewis.de> wrote:
>>
>>>> Is "urllib" wrong?
>>
>> Section 5.2 is also relevant here. In particular:
>>
>>     g) If the resulting buffer string still begins with one or more
>>        complete path segments of "..", then the reference is
>>        considered to be in error. Implementations may handle this
>>        error by retaining these components in the resolved path (i.e.,
>>        treating them as part of the final URI), by removing them from
>>        the resolved path (i.e., discarding relative levels above the
>>        root), or by avoiding traversal of the reference.
>>
>> The common practice seems to be for client-side implementations to
>> handle this using option 2 (removing them) and servers to use option
>> 3 (avoiding traversal of the reference). urllib uses option 1 which
>> is also correct but not as useful as it might be.
>
> That's helpful. Thanks.
>
> In Python, of course, "urlparse.urlparse", which is
> the main function used to disassemble a URL, has no idea whether it's being
> used by a client or a server, so it, reasonably enough, takes option 1.
>
> (Yet another hassle in processing real-world HTML.)

Note that RFC 3986 obsoletes RFC 2396, and attempts to codify current
good practice re generic URL syntax (URI and relative reference
syntax, to use the precise terminology of the RFC). It discusses
normalisation at length, quite sensibly and pragmatically. And very
readable and useful it is too.

Somebody submitted a module implementing the URL splitting / joining
algorithms specified in RFC 3986 for inclusion in Python 2.6 -- I
haven't looked at that recently...

See also RFC 3987.
John
Jun 25 '07 #6
John Nagle wrote:
> In Python, of course, "urlparse.urlparse", which is
> the main function used to disassemble a URL, has no idea whether it's
> being used by a client or a server, so it, reasonably enough, takes
> option 1.

>>> import urlparse
>>> base = "http://somesite.com/level1/"
>>> path = "../page.html"
>>> urlparse.urljoin(base, path)
'http://somesite.com/page.html'
>>> base = "http://somesite.com/"
>>> urlparse.urljoin(base, path)
'http://somesite.com/../page.html'

For me this is a bug and is very annoying because I can't simply strip ../
from path because base could have a level.
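
One way around this is a thin wrapper that joins first and then resolves any dot segments left in the path, never climbing above the root. This is only a sketch, assuming Python 3's urllib.parse; urljoin_clamped and remove_dot_segments are names invented here, and the segment handling is my own reading of the "remove dot segments" idea in RFC 3986, not code taken from the standard library:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def remove_dot_segments(path):
    """Resolve "." and ".." in an absolute path without going above "/"."""
    segments = path.split("/")
    output = []
    for seg in segments[1:]:  # segments[0] is the empty string before "/"
        if seg == "..":
            if output:
                output.pop()  # go up one level; extra ".." at the root are dropped
        elif seg != ".":
            output.append(seg)
    # A path ending in "." or ".." names a directory, so keep the trailing "/".
    if segments[-1] in (".", ".."):
        output.append("")
    return "/" + "/".join(output)


def urljoin_clamped(base, ref):
    """Hypothetical urljoin variant that never leaves ".." in the result."""
    joined = urljoin(base, ref)
    parts = urlsplit(joined)
    if not parts.path.startswith("/"):
        return joined
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


print(urljoin_clamped("http://somesite.com/level1/", "../page.html"))
# http://somesite.com/page.html
print(urljoin_clamped("http://somesite.com/", "../page.html"))
# http://somesite.com/page.html
```

Note that the documentation for later Python 3 releases describes urljoin as following RFC 3986 semantics, so on those versions the wrapper may simply be a no-op.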
--
Best regards,
--
Sérgio M. B.
Jun 26 '07 #7
On Tue, 26 Jun 2007 17:26:06 -0300, sergio <se****@sergiomb.no-ip.org>
wrote:
> John Nagle wrote:
>> In Python, of course, "urlparse.urlparse", which is
>> the main function used to disassemble a URL, has no idea whether it's
>> being used by a client or a server, so it, reasonably enough, takes
>> option 1.
>
> >>> import urlparse
> >>> base = "http://somesite.com/level1/"
> >>> path = "../page.html"
> >>> urlparse.urljoin(base, path)
> 'http://somesite.com/page.html'
> >>> base = "http://somesite.com/"
> >>> urlparse.urljoin(base, path)
> 'http://somesite.com/../page.html'
>
> For me this is a bug and is very annoying because I can't simply strip ../
> from path because base could have a level.

I'd say it's an annoyance, not a bug. Write your own urljoin function with
your exact desired behavior - since all "meaningful" .. and . should have
been already processed by urljoin, a simple url =
url.replace("/../","/").replace("/./","/") may be enough.
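
One caveat, shown as a small sketch with a made-up URL: a single str.replace pass does not handle two ".." segments in a row, because the slash that would start the second "/../" has already been consumed by the first match. Repeating the replacement until nothing changes gets around that:

```python
url = "http://somesite.com/../../page.html"

# One pass: the first "/../" is replaced, leaving "../page.html" behind,
# which no longer has the leading slash needed for a second match.
once = url.replace("/../", "/")
print(once)  # http://somesite.com/../page.html

# Repeat until no "/../" is left.
cleaned = url
while "/../" in cleaned:
    cleaned = cleaned.replace("/../", "/")
print(cleaned)  # http://somesite.com/page.html
```

Since this works on the whole string, it would also rewrite a "/../" that happens to appear in a query string, so it is only safe where that cannot occur.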

--
Gabriel Genellina
Jun 27 '07 #8
Gabriel Genellina wrote:
> On Tue, 26 Jun 2007 17:26:06 -0300, sergio <se****@sergiomb.no-ip.org>
> wrote:
>> John Nagle wrote:
>>> In Python, of course, "urlparse.urlparse", which is
>>> the main function used to disassemble a URL, has no idea whether it's
>>> being used by a client or a server, so it, reasonably enough, takes
>>> option 1.
>>
>> >>> import urlparse
>> >>> base = "http://somesite.com/level1/"
>> >>> path = "../page.html"
>> >>> urlparse.urljoin(base, path)
>> 'http://somesite.com/page.html'
>> >>> base = "http://somesite.com/"
>> >>> urlparse.urljoin(base, path)
>> 'http://somesite.com/../page.html'
>>
>> For me this is a bug and is very annoying because I can't simply strip ../
>> from path because base could have a level.
>
> I'd say it's an annoyance, not a bug. Write your own urljoin function with
> your exact desired behavior - since all "meaningful" .. and . should have
> been already processed by urljoin, a simple url =
> url.replace("/../","/").replace("/./","/") may be enough.

I had exactly the same thought; the solution is simply this:

urlparse.urljoin(base,path).replace("/../","/")
Many thanks,
--
Sérgio M. B.
Jun 27 '07 #9
