urllib.unquote and unicode

The following snippet results in different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?

George

Dec 19 '06 #1


Leo Kislov

George Sakkis wrote:

The following snippet results in different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?

IMHO, none of the results is right. Either the unicode string should be
rejected by raising ValueError, or it should be encoded with the ascii
encoding and the result should be the same as
urllib.unquote(u'%94'.encode('ascii')), that is '\x94'. You can consider
the current behaviour undefined: just as when you pass a random object
into some function, you can get a different outcome in different Python
versions.
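
Leo's second option (encode as ASCII, then unquote to a byte string) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3's urllib.parse.unquote_to_bytes rather than the Python 2 urllib.unquote discussed in this thread, and the helper name unquote_ascii_only is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def unquote_ascii_only(text):
    """Percent-decode text, refusing non-ASCII input, and return raw bytes.

    Hypothetical helper, not part of urllib.
    """
    # str.encode('ascii') raises UnicodeEncodeError, which is a subclass
    # of ValueError, when the input holds anything outside ASCII.
    raw = text.encode("ascii")
    return unquote_to_bytes(raw)


print(unquote_ascii_only("%94"))  # b'\x94'
```

Because UnicodeEncodeError derives from ValueError, the same helper also covers the "reject with ValueError" alternative for input that is not pure ASCII.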

-- Leo

Dec 19 '06 #2

Peter Otten

George Sakkis wrote:

The following snippet results in different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

Python 2.4.3 (#3, Aug 23 2006, 09:40:15)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import urllib
>>> urllib.unquote(u"%94")
u'\x94'
>>>

From the above I infer that the 2.4.2 behaviour was considered a bug.

Peter

Dec 19 '06 #3

Fredrik Lundh

George Sakkis wrote:

The following snippet results in different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?

why are you passing non-ASCII Unicode strings to a function designed for
fixing up 8-bit strings in the first place? if you do proper encoding
before you quote things, it'll work the same way in all Python releases.

</F>

Dec 19 '06 #4

Duncan Booth

"Leo Kislov" <Le********@gmail.com> wrote:

George Sakkis wrote:
The following snippet results in different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position
0: ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed
to change every other week ?

IMHO, none of the results is right. Either unicode string should be
rejected by raising ValueError or it should be encoded with ascii
encoding and result should be the same as
urllib.unquote(u'%94'.encode('ascii')) that is '\x94'. You can
consider current behaviour as undefined just like if you pass a random
object into some function you can get different outcome in different
python versions.

I agree with you that none of the results is right, but not that the
behaviour should be undefined.

The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

That means that the string u'\x94' should be encoded as %c2%94. The
string %94 should generate a unicode decode error, but it should be the
utf-8 codec raising the error not the ascii codec.
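
The two steps described above (encode to UTF-8, then escape each octet outside the permitted range) can be written out by hand. This is a minimal sketch, not the urllib implementation: percent_encode and percent_decode are hypothetical names, the permitted range is assumed to be the unreserved set from RFC 3986 section 2.3, and the code is Python 3 rather than the Python 2 discussed in this thread.

```python
from urllib.parse import unquote_to_bytes

# Unreserved characters per RFC 3986 section 2.3; every other octet is escaped.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(text):
    """Encode text to UTF-8, then escape every octet that is not unreserved."""
    out = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            out.append(chr(octet))
        else:
            out.append("%%%02X" % octet)
    return "".join(out)


def percent_decode(text):
    """Reverse percent_encode; invalid UTF-8 raises UnicodeDecodeError."""
    return unquote_to_bytes(text).decode("utf-8")


print(percent_encode("\x94"))  # %C2%94
try:
    percent_decode("%94")
except UnicodeDecodeError as exc:
    print(exc.encoding)  # utf-8
```

The hex digits come out in upper case here; %c2%94 and %C2%94 denote the same two octets.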

Unfortunately RFC3986 isn't entirely clear-cut on this issue:

When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
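
The three examples in the RFC text can be checked against a library that follows the UTF-8 convention. The sketch below assumes Python 3, where urllib.parse.quote encodes str input as UTF-8 by default; the Python 2 urllib.quote discussed in this thread does not behave this way.

```python
from urllib.parse import quote

# Character -> percent-encoded form, as listed in the RFC 3986 excerpt.
rfc_examples = {
    "A": "A",                  # LATIN CAPITAL LETTER A
    "\u00c0": "%C3%80",        # LATIN CAPITAL LETTER A WITH GRAVE
    "\u30a2": "%E3%82%A2",     # KATAKANA LETTER A
}

results = {char: quote(char, safe="") for char in rfc_examples}
print(results == rfc_examples)  # True
```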

I think it leaves open the possibility that existing URI schemes which do
not support unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string I think that utf-8
should definitely be assumed in this case.

Also, urllib.quote() should encode into utf-8 instead of throwing KeyError
for a unicode string.

Dec 19 '06 #5

George Sakkis

Fredrik Lundh wrote:

George Sakkis wrote:

The following snippet results in different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one or is this function supposed to
change every other week ?

why are you passing non-ASCII Unicode strings to a function designed for
fixing up 8-bit strings in the first place? if you do proper encoding
before you quote things, it'll work the same way in all Python releases.

I'm using BeautifulSoup, which from version 3 returns Unicode only, and
I stumbled on a page with such bogus char encodings; I have the
impression that whatever generated it used ord() to encode reserved
characters instead of the proper hex representation in latin-1. If
that's the case, unquote() won't do anyway and I'd have to go with
chr() on the number part.
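
If the escapes really were produced from decimal ord() values, the chr() approach might look like the sketch below. Everything here is an assumption made for illustration: the page in question is not shown in the thread, the helper name unquote_decimal is hypothetical, and the pattern guesses that an escape is a percent sign followed by two or three decimal digits. Escapes containing hex letters are left untouched, and a digits-only hex escape such as %20 cannot be told apart from a decimal one.

```python
import re

# Assumed escape shape: '%' followed by two or three decimal digits.
_DECIMAL_ESCAPE = re.compile(r"%([0-9]{2,3})")


def unquote_decimal(text):
    """Replace %NN escapes, read as decimal ordinals, with chr(NN)."""
    return _DECIMAL_ESCAPE.sub(lambda m: chr(int(m.group(1))), text)


print(unquote_decimal("a%94b"))  # a^b
```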

George

Dec 19 '06 #6

Martin v. Löwis

Duncan Booth schrieb:

The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding is of non-ASCII
characters, and whether % escapes denote characters in the first place.

Unfortunately RFC3986 isn't entirely clear-cut on this issue:

When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".

This is irrelevant, it talks about new URI schemes only.

I think it leaves open the possibility that existing URI schemes which do
not support unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string I think that utf-8
should definitely be assumed in this case.

No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints an interpretation in 3.2.3:

# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says that it could).

The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.

Regards,
Martin

Dec 19 '06 #7

Duncan Booth

"Martin v. Löwis" <ma****@v.loewis.de> wrote:

Duncan Booth schrieb:
The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not
in the permitted range for characters is encoded as % followed by two
hex characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

I'm not sure I have time to read the various RFC's in depth right now,
so I may have to come back on this thread later. The one thing I'm
convinced of is that the current implementations of urllib.quote and
urllib.unquote are broken in respect to their handling of unicode. In
particular % encoding is defined in terms of octets, so when given a
unicode string urllib.quote should either encode it or throw a suitable
exception (not KeyError, which is what it seems to throw now).

My objection to urllib.unquote is that urllib.unquote(u'%a3') returns
u'\xa3' which is a character not an octet. I think it should always return
a byte string, or it should calculate a byte string and then decode it
according to some suitable encoding, or it should throw an exception
[choose any of the above].
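
The three alternatives listed above can be tried side by side. This sketch assumes Python 3's urllib.parse, which offers unquote_to_bytes plus encoding and errors parameters on unquote; none of that exists in the Python 2 urllib discussed in this thread.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "%a3"

# Option 1: always return a byte string.
as_bytes = unquote_to_bytes(escaped)

# Option 2: compute the bytes, then decode them with an encoding the
# caller names explicitly (Latin-1 is chosen here only as an example).
as_latin1 = unquote(escaped, encoding="latin-1")

# Option 3: raise when the bytes are not valid in the chosen encoding.
try:
    unquote(escaped, encoding="utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(as_bytes, as_latin1 == "\xa3", strict_failed)  # b'\xa3' True True
```

Note that the Python 3 default is errors="replace", so unquote("%a3") with no extra arguments returns U+FFFD instead of raising.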

Adding an optional encoding parameter to quote/unquote would be one option,
although since you can encode/decode the parameter yourself it doesn't add much.

No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints an interpretation in 3.2.3:

The applicable RFC is 3986. See RFC2616 section 3.2.1:

For definitive information on URL syntax and semantics, see "Uniform
Resource Identifiers (URI): Generic Syntax and Semantics," RFC 2396 [42]
(which replaces RFCs 1738 [4] and RFC 1808 [11]).

and RFC 2396:

Obsoleted by: 3986

Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says that it could).

and RFC2277, 3.1 says that it MUST identify which charset is used (although
that's just a best practice document not a standard). (The block capitals
are the RFC's not mine.)

The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.

Yes, I know that in practice some systems use other character sets.

Dec 20 '06 #8

Walter Dörwald

Martin v. Löwis wrote:

Duncan Booth schrieb:
The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding is of non-ASCII
characters, and whether % escapes denote characters in the first place.

http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Servus,
Walter

Dec 21 '06 #9

Martin v. Löwis

The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Thanks. Unfortunately, this isn't normative; it only says "we recommend".
In addition, it talks about URIs found in HTML only. If somebody writes
a user agent in Python, they are certainly free to follow
this recommendation - but I think this is a case where Python should
refuse the temptation to guess.

If somebody implemented IRIs, that would be an entirely different
matter.

Regards,
Martin

Dec 21 '06 #10
