The following snippet results in different outcomes for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')
# Python 2.3.4
u'%94'
# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)
# Python 2.5
u'\x94'

Is the current version the "right" one, or is this function supposed to
change every other week?

George
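For context: percent-decoding is defined on octets, not characters, which is why feeding `unquote` a unicode string forces an encoding decision somewhere. A minimal sketch of the byte-level operation (written with Python 3 names, since the 2.x snippets above no longer run):

```python
import re

def percent_decode(data: bytes) -> bytes:
    """Replace each %XX escape with the single byte it names."""
    return re.sub(b'%([0-9A-Fa-f]{2})',
                  lambda m: bytes([int(m.group(1), 16)]),
                  data)

print(percent_decode(b'%94'))  # b'\x94' -- an octet, not a character
```

The disagreement in the thread below is essentially about what should happen between this byte-level step and the unicode string the caller handed in.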
George Sakkis wrote:
> The following snippet results in different outcomes for (at least) the
> last three major releases:
>
> >>> import urllib
> >>> urllib.unquote(u'%94')
> # Python 2.3.4
> u'%94'
> # Python 2.4.2
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> ordinal not in range(128)
> # Python 2.5
> u'\x94'
>
> Is the current version the "right" one, or is this function supposed to
> change every other week?

IMHO, none of the results is right. Either the unicode string should be
rejected by raising ValueError, or it should be encoded with the ascii
encoding and the result should be the same as
urllib.unquote(u'%94'.encode('ascii')), that is, '\x94'. You can consider
the current behaviour as undefined: just as when you pass a random object
into some function, you can get a different outcome in different Python
versions.

-- Leo
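Leo's suggested equivalence — encode the unicode string to ascii first, then unquote — can be sketched with Python 3's urllib.parse (the names differ from the 2.x urllib discussed here, but the idea carries over):

```python
from urllib.parse import unquote_to_bytes

# The escape sequence '%94' itself is pure ASCII, so encoding the
# unicode string first is always safe; unquoting the result then
# yields a plain byte string, with no charset guessing involved.
s = u'%94'
result = unquote_to_bytes(s.encode('ascii'))
print(result)  # b'\x94'
```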
George Sakkis wrote:
> The following snippet results in different outcomes for (at least) the
> last three major releases:
>
> >>> import urllib
> >>> urllib.unquote(u'%94')
> # Python 2.4.2
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> ordinal not in range(128)

Python 2.4.3 (#3, Aug 23 2006, 09:40:15)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.unquote(u"%94")
u'\x94'
>>>

From the above I infer that the 2.4.2 behaviour was considered a bug.

Peter
George Sakkis wrote:
> The following snippet results in different outcomes for (at least) the
> last three major releases:
>
> >>> import urllib
> >>> urllib.unquote(u'%94')
> # Python 2.3.4
> u'%94'
> # Python 2.4.2
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> ordinal not in range(128)
> # Python 2.5
> u'\x94'
>
> Is the current version the "right" one, or is this function supposed to
> change every other week?

Why are you passing non-ASCII Unicode strings to a function designed for
fixing up 8-bit strings in the first place? If you do proper encoding
before you quote things, it'll work the same way in all Python releases.

</F>
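Fredrik's round-trip — encode to a byte string before quoting, decode again after unquoting — can be sketched like this (Python 3 names; in 2.x the same pattern applies with urllib.quote/unquote on the encoded byte string):

```python
from urllib.parse import quote, unquote_to_bytes

text = u'caf\xe9'                  # a non-ASCII unicode string
octets = text.encode('utf-8')      # choose the encoding explicitly
quoted = quote(octets)
print(quoted)                      # 'caf%C3%A9'

# Reverse: unquote back to octets, then decode with the same encoding.
roundtrip = unquote_to_bytes(quoted).decode('utf-8')
assert roundtrip == text
```

Because the encoding is chosen explicitly at both ends, the result is the same in every release, which is exactly the point being made.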
"Leo Kislov" <Le********@gma il.comwrote:
George Sakkis wrote:
>The following snippet results in different outcome for (at least) the last three major releases:
>>import urllib urllib.unquot e(u'%94')
# Python 2.3.4 u'%94'
# Python 2.4.2 UnicodeDecodeE rror: 'ascii' codec can't decode byte 0x94 in position 0: ordinal not in range(128)
# Python 2.5 u'\x94'
Is the current version the "right" one or is this function supposed to change every other week ?
IMHO, none of the results is right. Either unicode string should be
rejected by raising ValueError or it should be encoded with ascii
encoding and result should be the same as
urllib.unquote( u'%94'.encode(' ascii')) that is '\x94'. You can
consider current behaviour as undefined just like if you pass a random
object into some function you can get different outcome in different
python versions.
I agree with you that none of the results is right, but not that the
behaviour should be undefined.
The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.
That means that the string u'\x94' should be encoded as %c2%94. The
string %94 should generate a unicode decode error, but it should be the
utf-8 codec raising the error not the ascii codec.
Unfortunately RFC3986 isn't entirely clear-cut on this issue:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2" .
I think it leaves open the possibility that existing URI schemes which do
not support unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string I think that utf-8
should definitely be assumed in this case.
Also, urllib.quote() should encode into utf-8 instead of throwing KeyError
for a unicode string.
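The RFC's examples quoted above, and the %c2%94 claim, can be checked with Python 3's urllib.parse.quote, which percent-encodes UTF-8 by default (so it implements exactly the behaviour Duncan is arguing for):

```python
from urllib.parse import quote

print(quote('A'))        # 'A'      -- unreserved, passed through
print(quote('\xc0'))     # '%C3%80' -- LATIN CAPITAL LETTER A WITH GRAVE
print(quote('\u30a2'))   # '%E3%82%A2' -- KATAKANA LETTER A
print(quote('\x94'))     # '%C2%94'
```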
Fredrik Lundh wrote:
> George Sakkis wrote:
> > The following snippet results in different outcomes for (at least)
> > the last three major releases:
> >
> > >>> import urllib
> > >>> urllib.unquote(u'%94')
> > # Python 2.3.4
> > u'%94'
> > # Python 2.4.2
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in
> > position 0: ordinal not in range(128)
> > # Python 2.5
> > u'\x94'
> >
> > Is the current version the "right" one, or is this function supposed
> > to change every other week?
>
> Why are you passing non-ASCII Unicode strings to a function designed
> for fixing up 8-bit strings in the first place? If you do proper
> encoding before you quote things, it'll work the same way in all
> Python releases.

I'm using BeautifulSoup, which from version 3 returns Unicode only, and
I stumbled on a page with such bogus char encodings; I have the
impression that whatever generated it used ord() to encode reserved
characters instead of the proper hex representation in latin-1. If
that's the case, unquote() won't do anyway and I'd have to go with
chr() on the number part.

George
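George's fallback — treating the digits after % as an ord() value in the page's own 8-bit charset — can be sketched like this. Note that 0x94 is a printable character in cp1252 (a curly closing quote), which fits the "bogus Windows-generated page" theory; the choice of cp1252 here is an assumption about that page, not something established in the thread:

```python
import re

def sloppy_unquote(s, encoding='cp1252'):
    """Interpret each %XX escape as a byte in the page's guessed charset."""
    return re.sub('%([0-9A-Fa-f]{2})',
                  lambda m: bytes([int(m.group(1), 16)]).decode(encoding),
                  s)

print(sloppy_unquote(u'%94'))  # '\u201d' -- RIGHT DOUBLE QUOTATION MARK
```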
Duncan Booth schrieb:
> The way that URI encoding is supposed to work is that first the input
> string in unicode is encoded to UTF-8, and then each byte which is not
> in the permitted range for characters is encoded as % followed by two
> hex characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding is of non-ASCII
characters, and whether % escapes denote characters in the first place.

> Unfortunately RFC 3986 isn't entirely clear-cut on this issue:
>
>    When a new URI scheme defines a component that represents textual
>    data consisting of characters from the Universal Character Set
>    [UCS], the data should first be encoded as octets according to the
>    UTF-8 character encoding [STD63]; then only those octets that do
>    not correspond to characters in the unreserved set should be
>    percent-encoded.  For example, the character A would be represented
>    as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
>    represented as "%C3%80", and the character KATAKANA LETTER A would
>    be represented as "%E3%82%A2".

This is irrelevant; it talks about new URI schemes only.

> I think it leaves open the possibility that existing URI schemes which
> do not support unicode characters can use other encodings, but given
> that the original posting started by decoding a unicode string, I
> think that utf-8 should definitely be assumed in this case.

No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints at an interpretation in 3.2.3:

# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says that it could).

The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.

Regards,
Martin
"Martin v. Löwis" <ma****@v.loewi s.dewrote:
Duncan Booth schrieb:
>The way that uri encoding is supposed to work is that first the input string in unicode is encoded to UTF-8 and then each byte which is not in the permitted range for characters is encoded as % followed by two hex characters.
Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?
I'm not sure I have time to read the various RFC's in depth right now,
so I may have to come back on this thread later. The one thing I'm
convinced of is that the current implementations of urllib.quote and
urllib.unquote are broken in respect to their handling of unicode. In
particular % encoding is defined in terms of octets, so when given a
unicode string urllib.quote should either encoded it, or throw a suitable
exception (not KeyError which is what it seems to throw now).
My objection to urllib.unquote is that urllib.unquote( u'%a3') returns
u'\xa3' which is a character not an octet. I think it should always return
a byte string, or it should calculate a byte string and then decode it
according to some suitable encoding, or it should throw an exception
[choose any of the above].
Adding an optional encoding parameter to quote/unquote be one option,
although since you can encode/decode the parameter it doesn't add much.
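For what it's worth, this is roughly the design Python 3 eventually adopted in urllib.parse: quote() encodes unicode input (UTF-8 by default) before percent-encoding, unquote() takes an explicit encoding parameter, and unquote_to_bytes() returns raw octets with no decoding at all. A sketch of all three:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Octet-level unquoting: always returns bytes, no charset guessing.
print(unquote_to_bytes('%C2%94'))            # b'\xc2\x94'

# Character-level unquoting with an explicit encoding parameter.
print(unquote('%C2%94', encoding='utf-8'))   # '\x94'
print(unquote('%94', encoding='latin-1'))    # '\x94'

# quote() encodes unicode input to UTF-8 before percent-encoding.
print(quote('\x94'))                         # '%C2%94'
```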
> No, the http scheme is defined by RFC 2616 instead. It doesn't really
> talk about encodings, but hints at an interpretation in 3.2.3:

The applicable RFC is 3986. See RFC 2616 section 3.2.1:

   For definitive information on URL syntax and semantics, see "Uniform
   Resource Identifiers (URI): Generic Syntax and Semantics," RFC 2396
   [42] (which replaces RFCs 1738 [4] and RFC 1808 [11]).

and RFC 2396:

   Obsoleted by: 3986

> Now, RFC 2396 already says that URIs are sequences of characters,
> not sequences of octets, yet RFC 2616 fails to recognize that issue
> and refuses to specify a character set for its scheme (which
> RFC 2396 says that it could).

and RFC 2277, 3.1 says that it MUST identify which charset is used
(although that's just a best practice document, not a standard). (The
block capitals are the RFC's, not mine.)

> The conventional wisdom is that the choice of URI encoding for HTTP
> is a server-side decision; for that reason, IRIs were introduced.

Yes, I know that in practice some systems use other character sets.
Martin v. Löwis wrote:
> Duncan Booth schrieb:
> > The way that URI encoding is supposed to work is that first the
> > input string in unicode is encoded to UTF-8, and then each byte
> > which is not in the permitted range for characters is encoded as %
> > followed by two hex characters.
>
> Can you back up this claim ("is supposed to work") by reference to
> a specification (ideally, chapter and verse)?
>
> In URIs, it is entirely unspecified what the encoding is of non-ASCII
> characters, and whether % escapes denote characters in the first place.

http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Servus,
Walter
> > > The way that URI encoding is supposed to work is that first the
> > > input string in unicode is encoded to UTF-8, and then each byte
> > > which is not in the permitted range for characters is encoded as
> > > % followed by two hex characters.
> >
> > Can you back up this claim ("is supposed to work") by reference to
> > a specification (ideally, chapter and verse)?
>
> http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Thanks. Unfortunately, this isn't normative, but "we recommend". In
addition, it talks about URIs found in HTML only. If somebody writes
a user agent in Python, they are certainly free to follow this
recommendation - but I think this is a case where Python should
refuse the temptation to guess.

If somebody implemented IRIs, that would be an entirely different
matter.

Regards,
Martin