urllib interpretation of URL with ".."

Here's a URL, found in a link, which gives us trouble
when we try to follow the link:

http://sportsbra.co.uk/../acatalog/shop.html

Browsers immediately turn this into

http://sportsbra.co.uk/acatalog/shop.html

and go from there, but urllib tries to open it explicitly, which
results in an HTTP error 400.

Is "urllib" wrong?

John Nagle
Jun 23 '07 #1
John Nagle wrote:
> Is "urllib" wrong?

I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.
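
A quick way to check what a client would request is to split the URL and look at the path. This is a minimal sketch for Python 3, where the `urlparse` module discussed in this thread lives at `urllib.parse`; the `request_line` string only illustrates what would go on the wire and is not urllib's actual internals.

```python
from urllib.parse import urlsplit

url = "http://sportsbra.co.uk/../acatalog/shop.html"
parts = urlsplit(url)

# The parser does no normalisation: ".." stays in the path as an ordinary segment.
request_target = parts.path
print(request_target)

# A client that sends the URL unmodified would put this on the request line.
request_line = "GET {} HTTP/1.1".format(request_target)
print(request_line)
```

The path keeps its leading "/..", which is consistent with reading it as a plain abs_path.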

Regards,
Martin
Jun 23 '07 #2
Martin v. Löwis wrote:
> Notice the "and only when": browsers that modify the above
> URL before sending it seem to be in clear violation of
> RFC 2396.

I think you're right. The problem is that there is apparently a de facto
standard in browsers that any number of "../" sequences at the beginning of
the path part of a URL have no effect. Even Google seems to use that
interpretation; not only does it follow that link, it lists it in its
search results without the "..".

John Nagle
Jun 23 '07 #3
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
> Notice the "and only when": browsers that modify the above
> URL before sending it seem to be in clear violation of
> RFC 2396.

Section 5.2 is also relevant here. In particular:

    g) If the resulting buffer string still begins with one or more
       complete path segments of "..", then the reference is
       considered to be in error. Implementations may handle this
       error by retaining these components in the resolved path (i.e.,
       treating them as part of the final URI), by removing them from
       the resolved path (i.e., discarding relative levels above the
       root), or by avoiding traversal of the reference.

The common practice seems to be for client-side implementations to handle
this using option 2 (removing them) and servers to use option 3 (avoiding
traversal of the reference). urllib uses option 1 which is also correct but
not as useful as it might be.
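
The three choices can be sketched as one small function. This is only an illustration for Python 3 (`urllib.parse`); the name `resolve_leading_dotdot` and the policy strings "retain", "remove" and "refuse" are hypothetical, not part of urllib.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_leading_dotdot(url, policy="retain"):
    """Apply one of the three choices to '..' segments at the start of the path.

    The function name and policy strings are made up for this sketch.
    """
    parts = urlsplit(url)
    if not parts.path.startswith("/"):
        return url  # empty or relative path: nothing to decide here

    segments = parts.path.split("/")[1:]  # skip the '' before the first '/'
    leading = 0
    while leading < len(segments) and segments[leading] == "..":
        leading += 1

    if leading == 0 or policy == "retain":
        return url  # option 1: keep '..' as ordinary path segments
    if policy == "remove":
        # option 2: discard the levels above the root
        new_path = "/" + "/".join(segments[leading:])
        return urlunsplit(parts._replace(path=new_path))
    if policy == "refuse":
        # option 3: do not traverse the reference at all
        raise ValueError("reference climbs above the root: %r" % url)
    raise ValueError("unknown policy: %r" % policy)


url = "http://sportsbra.co.uk/../acatalog/shop.html"
print(resolve_leading_dotdot(url, "retain"))
print(resolve_leading_dotdot(url, "remove"))
```

With "remove" the example URL comes out as the form the browsers use; with "refuse" the caller gets an exception and can skip the link.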

Jun 25 '07 #4
Duncan Booth wrote:
> The common practice seems to be for client-side implementations to handle
> this using option 2 (removing them) and servers to use option 3 (avoiding
> traversal of the reference). urllib uses option 1 which is also correct but
> not as useful as it might be.

That's helpful. Thanks.

In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's being
used by a client or a server, so it, reasonably enough, takes option 1.

(Yet another hassle in processing real-world HTML.)

John Nagle
Jun 25 '07 #5
John Nagle <na***@animats.com> writes:
> In Python, of course, "urlparse.urlparse", which is
> the main function used to disassemble a URL, has no idea whether it's being
> used by a client or a server, so it, reasonably enough, takes option 1.
>
> (Yet another hassle in processing real-world HTML.)

Note that RFC 3986 obsoletes RFC 2396, and attempts to codify current
good practice re generic URL syntax (URI and relative reference
syntax, to use the precise terminology of the RFC). It discusses
normalisation at length, quite sensibly and pragmatically. And very
readable and useful it is too.
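
RFC 3986 section 5.2.4 describes a "remove_dot_segments" procedure that drops ".." segments which would climb above the root instead of retaining them. The following is one reading of that procedure as a Python sketch; it is not taken from any library, and the function name simply mirrors the RFC's wording.

```python
def remove_dot_segments(path):
    """Sketch of the RFC 3986 section 5.2.4 procedure for a path string."""
    output = []
    while path:
        # Step A: strip a leading "../" or "./"
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: replace a leading "/./" (or a final "/.") with "/"
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: replace a leading "/../" (or a final "/..") with "/"
        # and drop the last segment already written, if there is one
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a path that is only "." or ".." disappears
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment, with its leading "/", to the output
        else:
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/../acatalog/shop.html"))  # /acatalog/shop.html
print(remove_dot_segments("/a/b/c/./../../g"))        # /a/g
```

Applied to the path from the original question, this gives the same result the browsers produce.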

Somebody submitted a module implementing the URL splitting / joining
algorithms specified in RFC 3986 for inclusion in Python 2.6 -- I
haven't looked at that recently...

See also RFC 3987.
John
Jun 25 '07 #6
John Nagle wrote:
> In Python, of course, "urlparse.urlparse", which is
> the main function used to disassemble a URL, has no idea whether it's
> being used by a client or a server, so it, reasonably enough, takes
> option 1.

>>> import urlparse
>>> base = "http://somesite.com/level1/"
>>> path = "../page.html"
>>> urlparse.urljoin(base, path)
'http://somesite.com/page.html'
>>> base = "http://somesite.com/"
>>> urlparse.urljoin(base, path)
'http://somesite.com/../page.html'

For me this is a bug and is very annoying because I can't simply strip "../"
from path because base could have a level.
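
For readers on Python 3: the `urlparse` module became `urllib.parse`, and the `urljoin` documentation notes that its behaviour was updated in version 3.5 to match RFC 3986. A short check of the same two cases, assuming Python 3.5 or later (the variable names are only for illustration):

```python
from urllib.parse import urljoin

path = "../page.html"

# Base with one directory level: ".." climbs out of level1/
with_level = urljoin("http://somesite.com/level1/", path)

# Base already at the root: ".." has nowhere to go
at_root = urljoin("http://somesite.com/", path)

print(with_level)
print(at_root)
```

On those versions both calls should give the same URL, with the ".." dropped at the root rather than kept in the path.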
--
Best regards,
--
Sérgio M. B.
Jun 26 '07 #7
On Tue, 26 Jun 2007 17:26:06 -0300, sergio <se****@sergiomb.no-ip.org>
wrote:
> For me this is a bug and is very annoying because I can't simply strip "../"
> from path because base could have a level.

I'd say it's an annoyance, not a bug. Write your own urljoin function with
your exact desired behavior - since all "meaningful" .. and . should have
been already processed by urljoin, a simple url =
url.replace("/../","/").replace("/./","/") may be enough.
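
One caveat with chained replace() calls: str.replace does not rescan overlapping matches, so "/../../" still leaves one ".." behind, and it also rewrites any "/../" that happens to sit in the query string. A sketch that works on the path only, under the same assumption that the meaningful dot segments were already resolved by urljoin; `drop_dot_segments` is a hypothetical helper name, and the imports are the Python 3 spelling of `urlparse`.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_dot_segments(url):
    """Discard every '.' and '..' segment left in the path (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [seg for seg in parts.path.split("/") if seg not in (".", "..")]
    # keep a bare "/" if only dot segments followed the leading slash
    new_path = "/".join(kept) or parts.path[:1]
    return urlunsplit(parts._replace(path=new_path))


url = "http://somesite.com/../../page.html?next=/../x"

# The one-liner: one ".." survives and the query string gets modified.
naive = url.replace("/../", "/").replace("/./", "/")
print(naive)

# The path-only version: both ".." segments go, the query is untouched.
print(drop_dot_segments(url))
```

This is deliberately blunt: it throws dot segments away rather than resolving them, so it is only appropriate after a join step has run.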

--
Gabriel Genellina
Jun 27 '07 #8
Gabriel Genellina wrote:
> I'd say it's an annoyance, not a bug. Write your own urljoin function with
> your exact desired behavior - since all "meaningful" .. and . should have
> been already processed by urljoin, a simple url =
> url.replace("/../","/").replace("/./","/") may be enough.

I had exactly the same thought; the solution is simply this:

urlparse.urljoin(base, path).replace("/../", "/")

Many thanks,
--
Sérgio M. B.
Jun 27 '07 #9
