
urlencode with high characters

Jim
Hello,

I'm trying to do urllib.urlencode() with unicode correctly, and I
wonder if some kind person could set me straight?

My understanding is that I am supposed to be able to urlencode anything
up to the top half of latin-1 -- decimal 128-255.

I can't just send urlencode a unicode character:

Python 2.3.5 (#2, May 4 2005, 08:51:39)
[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> s=u'abc'+unichr(246)+u'def'
>>> dct={'x':s}
>>> urllib.urlencode(dct)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.3/urllib.py", line 1206, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)

Is it instead Right that I should send a unicode string to urlencode by
first encoding it to 'latin-1' ?
>>> import urllib
>>> s=u'abc'+unichr(246)+u'def'
>>> dct={'x':s.encode('latin-1')}
>>> urllib.urlencode(dct)
'x=abc%F6def'

If it is Right, I'm puzzled as to why urlencode doesn't do it. Or am I
missing something? urllib.urlencode() contains the lines:

elif _is_unicode(v):
    # is there a reasonable way to convert to ASCII?
    # encode generates a string, but "replace" or "ignore"
    # lose information and "strict" can raise UnicodeError
    v = quote_plus(v.encode("ASCII","replace"))
    l.append(k + '=' + v)

so I think that it is *not* liking latin-1.
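
To make the effect of that "replace" branch concrete, here is a minimal sketch. It is written for Python 3's urllib.parse rather than the Python 2.3 urllib quoted above (an assumption made so the snippet runs on a current interpreter); the variable names are illustrative only.

```python
from urllib.parse import quote_plus

s = "abc" + chr(246) + "def"  # same test string: 'abc', o-umlaut, 'def'

# Encoding to ASCII with "replace" swaps the non-ASCII character for "?",
# so the original character cannot be recovered from the result.
lossy = s.encode("ascii", "replace")
print(lossy)              # b'abc?def'
print(quote_plus(lossy))  # abc%3Fdef

# Encoding explicitly to latin-1 keeps the byte 0xF6, which is then quoted.
explicit = s.encode("latin-1")
print(quote_plus(explicit))  # abc%F6def
```

The first result silently carries a literal question mark, which is why choosing the encoding yourself before quoting looks like the safer route.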

Thank you,
Jim

Nov 2 '05 #1


Jim wrote:
My understanding is that I am supposed to be able to urlencode anything
up to the top half of latin-1 -- decimal 128-255.


I believe your understanding is incorrect. Without being able to quote
RFCs precisely, I think your understanding should be this:

- the URL literal syntax only allows for ASCII characters
- bytes with no meaning in ASCII can be quoted through %hh in URLs
- the precise meaning of such bytes in the URL is defined in the
  URL scheme, and may vary from URL scheme to URL scheme
- the http scheme does not specify any interpretation of the bytes,
  but apparently assumes that they denote characters, and follow
  some encoding - which encoding is something that the web server
  defines, when mapping URLs to resources.

If you get the impression that this is underspecified: your impression
is correct; it is underspecified indeed.
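
A small sketch of what that underspecification means in practice, assuming Python 3's urllib.parse (not the Python 2.3 module from the question): the same quoted byte yields different text depending on which encoding the receiving side picks.

```python
from urllib.parse import unquote_to_bytes

# The URL only carries bytes; %F6 is simply the byte 0xF6.
raw = unquote_to_bytes("abc%F6def")
print(raw)  # b'abc\xf6def'

# A server that assumes latin-1 sees the o-umlaut.
as_latin1 = raw.decode("latin-1")

# A server that assumes UTF-8 sees an invalid byte; with errors="replace"
# it becomes U+FFFD (the replacement character) instead of raising.
as_utf8 = raw.decode("utf-8", errors="replace")

print(repr(as_latin1))  # 'abcödef'
print(repr(as_utf8))    # 'abc\ufffddef'
```

Nothing in the URL itself says which of the two readings is intended.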

There is a recent attempt to tighten the specification through IRIs.
The IRI RFC defines a mapping between IRIs and URIs, and it uses
UTF-8 as the encoding, not latin-1.
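
For comparison, a sketch of the two encodings applied to the string from the question, again assuming Python 3's urllib.parse.urlencode (whose `encoding` parameter does not exist in Python 2.3). This only illustrates the percent-encoding step, not the full IRI-to-URI mapping.

```python
from urllib.parse import urlencode

s = "abc" + chr(246) + "def"

# Python 3's urlencode encodes str values as UTF-8 by default,
# so the o-umlaut becomes the two bytes C3 B6.
utf8_query = urlencode({"x": s})

# Passing encoding="latin-1" gives the single byte F6 instead.
latin1_query = urlencode({"x": s}, encoding="latin-1")

print(utf8_query)    # x=abc%C3%B6def
print(latin1_query)  # x=abc%F6def
```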

Regards,
Martin
Nov 2 '05 #2
