Bytes | Software Development & Data Engineering Community
urllib.unquote and unicode

The following snippet results in different outcome for (at least) the
last three major releases:
>>> import urllib
>>> urllib.unquote(u'%94')
# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?

George

Dec 19 '06 #1

George Sakkis wrote:
The following snippet results in different outcome for (at least) the
last three major releases:
>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?
IMHO, none of the results is right. Either the unicode string should be
rejected by raising ValueError, or it should be encoded with the ascii
codec so that the result is the same as
urllib.unquote(u'%94'.encode('ascii')), that is, '\x94'. You can consider
the current behaviour as undefined: just as when you pass a random object
into some function, you can get a different outcome in different Python
versions.
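For what it's worth, the octet-level behaviour Leo describes is what Python 3's urllib.parse.unquote_to_bytes later provided; a minimal sketch (Python 3 spelling, so the names differ from the 2.x urllib discussed here):

```python
from urllib.parse import unquote_to_bytes

# Decode the percent-escape at the octet level: '%94' unquotes to the
# raw byte 0x94, with no character decoding applied at all.
octets = unquote_to_bytes('%94')
print(octets)  # b'\x94'
```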

-- Leo

Dec 19 '06 #2
George Sakkis wrote:
The following snippet results in different outcome for (at least) the
last three major releases:
>>> import urllib
>>> urllib.unquote(u'%94')
# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)
Python 2.4.3 (#3, Aug 23 2006, 09:40:15)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.unquote(u"%94")
u'\x94'
>>>
From the above I infer that the 2.4.2 behaviour was considered a bug.

Peter

Dec 19 '06 #3
George Sakkis wrote:
The following snippet results in different outcome for (at least) the
last three major releases:
>>>import urllib
urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?
why are you passing non-ASCII Unicode strings to a function designed for
fixing up 8-bit strings in the first place? if you do proper encoding
before you quote things, it'll work the same way in all Python releases.
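Fredrik's advice - encode explicitly, then quote the resulting bytes - can be sketched as follows (shown with Python 3's urllib.parse so the example is runnable today; in 2.x the spelling would be urllib.quote(text.encode('utf-8'))):

```python
from urllib.parse import quote, unquote_to_bytes

text = u'\xc0'                 # LATIN CAPITAL LETTER A WITH GRAVE
octets = text.encode('utf-8')  # encode explicitly before quoting
quoted = quote(octets)         # quote operates on octets only
print(quoted)                  # %C3%80

# Reversing is symmetric: unquote to octets, then decode explicitly.
roundtrip = unquote_to_bytes(quoted).decode('utf-8')
```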

</F>

Dec 19 '06 #4
"Leo Kislov" <Le********@gmail.com> wrote:
George Sakkis wrote:
>The following snippet results in different outcome for (at least) the
last three major releases:
>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position
0: ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed
to change every other week ?

IMHO, none of the results is right. Either the unicode string should be
rejected by raising ValueError, or it should be encoded with the ascii
codec so that the result is the same as
urllib.unquote(u'%94'.encode('ascii')), that is, '\x94'. You can
consider the current behaviour as undefined: just as when you pass a
random object into some function, you can get a different outcome in
different Python versions.
I agree with you that none of the results is right, but not that the
behaviour should be undefined.

The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

That means that the string u'\x94' should be encoded as %C2%94. The
string %94 should produce a unicode decode error, but it should be the
utf-8 codec raising the error, not the ascii codec.
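The rule described above can be sketched directly (a hypothetical helper, not part of urllib; it assumes the RFC 3986 unreserved character set):

```python
# Encode to UTF-8, then percent-escape every octet outside the
# unreserved set (RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~").
UNRESERVED = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              b'abcdefghijklmnopqrstuvwxyz'
              b'0123456789-._~')

def percent_encode(text):
    octets = text.encode('utf-8')
    return ''.join(chr(b) if b in UNRESERVED else '%%%02X' % b
                   for b in octets)

print(percent_encode(u'\x94'))  # %C2%94, as claimed above
```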

Unfortunately RFC3986 isn't entirely clear-cut on this issue:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
I think it leaves open the possibility that existing URI schemes which do
not support unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string I think that utf-8
should definitely be assumed in this case.

Also, urllib.quote() should encode into utf-8 instead of throwing KeyError
for a unicode string.

Dec 19 '06 #5
Fredrik Lundh wrote:
George Sakkis wrote:
The following snippet results in different outcome for (at least) the
last three major releases:
>>> import urllib
>>> urllib.unquote(u'%94')
# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?

why are you passing non-ASCII Unicode strings to a function designed for
fixing up 8-bit strings in the first place? if you do proper encoding
before you quote things, it'll work the same way in all Python releases.
I'm using BeautifulSoup, which from version 3 returns Unicode only, and
I stumbled on a page with such bogus char encodings; I have the
impression that whatever generated it used ord() to encode reserved
characters instead of the proper hex representation in latin-1. If
that's the case, unquote() won't do anyway and I'd have to go with
chr() on the number part.
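George's chr()-on-the-number-part workaround might look like this (a hypothetical helper for such broken pages, not anything in urllib):

```python
import re

def unquote_as_codepoints(text):
    # Replace each %XX escape with the character whose ordinal is 0xXX,
    # mirroring a page generator that percent-encoded raw ord() values.
    return re.sub(r'%([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(ascii(unquote_as_codepoints(u'%94')))  # '\x94'
```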

George

Dec 19 '06 #6
Duncan Booth wrote:
The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.
Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding is of non-ASCII
characters, and whether % escapes denote characters in the first place.
Unfortunately RFC3986 isn't entirely clear-cut on this issue:
> When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
This is irrelevant, it talks about new URI schemes only.
I think it leaves open the possibility that existing URI schemes which do
not support unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string I think that utf-8
should definitely be assumed in this case.
No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints at an interpretation in 3.2.3:

# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says that it could).

The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.

Regards,
Martin
Dec 19 '06 #7
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
Duncan Booth schrieb:
>The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not
in the permitted range for characters is encoded as % followed by two
hex characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?
I'm not sure I have time to read the various RFC's in depth right now,
so I may have to come back on this thread later. The one thing I'm
convinced of is that the current implementations of urllib.quote and
urllib.unquote are broken with respect to their handling of unicode. In
particular, % encoding is defined in terms of octets, so when given a
unicode string urllib.quote should either encode it or throw a suitable
exception (not the KeyError it seems to throw now).

My objection to urllib.unquote is that urllib.unquote(u'%a3') returns
u'\xa3' which is a character not an octet. I think it should always return
a byte string, or it should calculate a byte string and then decode it
according to some suitable encoding, or it should throw an exception
[choose any of the above].

Adding an optional encoding parameter to quote/unquote would be one
option, although since you can encode/decode the parameter yourself it
doesn't add much.
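As it happens, Python 3's urllib.parse later took exactly this route: unquote grew an encoding parameter (defaulting to utf-8), and unquote_to_bytes returns the raw octets. A sketch:

```python
from urllib.parse import unquote, unquote_to_bytes

# Octet-level result, with no guess about the character encoding:
octets = unquote_to_bytes('%a3')
print(octets)  # b'\xa3'

# Or decode the octets with an explicitly chosen encoding:
pound = unquote('%a3', encoding='latin-1')
print(ascii(pound))  # '\xa3'
```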
No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints an interpretation in 3.2.3:
The applicable RFC is 3986. See RFC 2616 section 3.2.1:
For definitive information on URL syntax and semantics, see "Uniform
Resource Identifiers (URI): Generic Syntax and Semantics," RFC 2396 [42]
(which replaces RFCs 1738 [4] and RFC 1808 [11]).
and RFC 2396:
Obsoleted by: 3986
Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says that it could).
and RFC 2277, section 3.1, says that it MUST identify which charset is
used (although that's just a best-practices document, not a standard).
(The block capitals are the RFC's, not mine.)
The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.
Yes, I know that in practice some systems use other character sets.
Dec 20 '06 #8
Martin v. Löwis wrote:
Duncan Booth schrieb:
>The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding is of non-ASCII
characters, and whether % escapes denote characters in the first place.
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Servus,
Walter
Dec 21 '06 #9
>>The way that uri encoding is supposed to work is that first the input
>>string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.
Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
Thanks. Unfortunately, this isn't normative, but a recommendation ("we
recommend"). In addition, it talks only about URIs found in HTML. If
somebody writes a user agent in Python, they are certainly free to follow
this recommendation - but I think this is a case where Python should
refuse the temptation to guess.

If somebody implemented IRIs, that would be an entirely different
matter.

Regards,
Martin
Dec 21 '06 #10
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
>>>The way that uri encoding is supposed to work is that first the
input string in unicode is encoded to UTF-8 and then each byte
which is not in the permitted range for characters is encoded as %
followed by two hex characters.
Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Thanks.
and thanks from me too.
Unfortunately, this isn't normative, but a recommendation ("we
recommend"). In addition, it talks only about URIs found in HTML. If
somebody writes a user agent in Python, they are certainly free to follow
this recommendation - but I think this is a case where Python should
refuse the temptation to guess.
So you believe that because something is only recommended by a standard,
Python should refuse to implement it? This is the kind of thinking that in
the 1980s gave us a version of gcc where any attempt to use #pragma (which
according to the standard invokes undefined behaviour) would spawn a copy
of nethack or rogue.

You don't seem to have realised yet, but my objection to the behaviour of
urllib.unquote is precisely that it does guess, and it guesses wrongly. In
fact it guesses latin1 instead of utf8. If it threw an exception for non-
ascii values, then it would match the standard (in the sense of not
following a recommendation because it doesn't have to) and it would be
purely a quality of implementation issue.

If you don't believe me that it guesses latin1, try it. For all valid URIs
(i.e. ignoring those with non-ascii characters already in them) in the
current implementation where u is a unicode object:

unquote(u) == unquote(u.encode('ascii')).decode('latin1')
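The right-hand side of that identity can be checked directly (Python 3 spelling for the unquote-then-decode step; the left-hand side assumes the 2.x unquote under discussion):

```python
from urllib.parse import unquote_to_bytes

u = u'%94%a3'
# Unquote the ASCII bytes, then decode the resulting octets as latin1:
rhs = unquote_to_bytes(u.encode('ascii')).decode('latin1')
print(ascii(rhs))  # '\x94\xa3' - the same u'\x94'-style result 2.5 gives
```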

I generally agree that Python should avoid guessing, so I wouldn't really
object if it threw an exception or always returned a byte string even
though the html standard recommends using utf8 and the uri rfc requires it
for all new uri schemes. However, in this case I think it would be useful
behaviour: e.g. a decent xml parser is going to give me back the attributes
including encoded uris in unicode. To handle those correctly you must
encode to ascii before unquoting. This is an avoidable pitfall in the
standard library.

On second thoughts, perhaps the current behaviour is actually closer to:

unquote(u) == unquote(u.encode('latin1')).decode('latin1')

as that also matches the current behaviour for uris which contain non-ascii
characters when the characters have a latin1 encoding. To fully conform
with the html standard's recommendation it should actually be equivalent
to:

unquote(u) == unquote(u.encode('utf8')).decode('utf8')

The catch with the current behaviour is that it doesn't exactly mimic any
sensible behaviour at all. It decodes the escaped octets as though they
were latin1 encoded, but it mixes them into a unicode string so there is no
way to correct its bad guess. In other words the current behaviour is
actively harmful.
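For comparison, the utf8 interpretation recommended by the HTML standard is what Python 3's urllib.parse.unquote eventually adopted as its default; a runnable check of the utf8 variant described above:

```python
from urllib.parse import unquote, unquote_to_bytes

# %C3%80 is the UTF-8 percent-encoding of LATIN CAPITAL LETTER A WITH
# GRAVE, so unquoting to octets and decoding as utf8 recovers u'\xc0'.
utf8_result = unquote_to_bytes('%C3%80').decode('utf8')
print(ascii(utf8_result))  # '\xc0'
```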
Dec 22 '06 #11
Duncan Booth wrote:
So you believe that because something is only recommended by a standard
Python should refuse to implement it?
Yes. In the face of ambiguity, refuse the temptation to guess.

This is *deeply* ambiguous; people have been using all kinds of
encodings in http URLs.
You don't seem to have realised yet, but my objection to the behaviour of
urllib.unquote is precisely that it does guess, and it guesses wrongly.
Yes, it seems that this was a bad move.

Regards,
Martin
Dec 22 '06 #12


