
urllib.urlencode wrongly encoding character

P: n/a
Hi, I'm trying to make a GUI for a web service. The site uses the ±
character in the values of some fields, but I can't encode this character
properly.

>>> data = {'key':'±'}
>>> urllib.urlencode(data)
'key=%C2%B1'

But it should be only %B1, not %C2%B1. Where is this %C2 coming from?

Apr 6 '06 #1
12 Replies


P: n/a
sl****@gmail.com wrote:
Hi, I'm trying to make a GUI for a web service. The site uses the ±
character in the values of some fields, but I can't encode this character
properly.

>>> data = {'key':'±'}
>>> urllib.urlencode(data)
'key=%C2%B1'

But it should be only %B1, not %C2%B1.

It should be %C2%B1, because URLs are de facto encoded as UTF-8. I've
just tried entering ± into four input fields: the Firefox 1.5 search
toolbar, the www.google.com search box in Firefox 1.5, the Google toolbar
in IE 6, and the www.google.com search box in IE 6. Everywhere, ± is
encoded as %C2%B1. In older browsers YMMV.

Where is this %C2 coming from?


Your console must be using UTF-8:

>>> u'±'.encode('utf-8')
'\xc2\xb1'
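
For readers on Python 3, the same check might look like the sketch below. It is only an illustration: the variable names are made up, and the escape \u00b1 is used instead of a typed ± so the result does not depend on how the file or console is encoded.

```python
# U+00B1 is the plus-minus sign discussed in this thread.
plus_minus = "\u00b1"

utf8_bytes = plus_minus.encode("utf-8")          # two bytes in UTF-8
latin1_bytes = plus_minus.encode("iso-8859-1")   # one byte in ISO-8859-1

print(utf8_bytes)    # b'\xc2\xb1'
print(latin1_bytes)  # b'\xb1'
```

The extra 0xC2 is simply the first byte of the two-byte UTF-8 form of the character.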

Apr 6 '06 #2

P: n/a
You are right. But when I capture traffic in Firefox via the
LiveHTTPHeaders extension, it shows me that ± is encoded as %B1.
In addition, I found lots of pages about URL encoding that have
conversion tables or scripts. All of them define ± as %B1.

I'm really confused. I can copy and use the urlencoded values from
Firefox, but I really want to do things the right way.

Apr 6 '06 #3

P: n/a
sl****@gmail.com wrote:
You are right. But when I capture traffic in Firefox via the
LiveHTTPHeaders extension, it shows me that ± is encoded as %B1.

It depends on whether the user entered the URL into the address bar or
clicked a submit button on a page. In the first case, there was no
standard for handling non-ASCII characters for a long time; only RFC 3986
in 2005 said to use UTF-8. In the second case, browsers submit forms in
the encoding of the page where the form is defined. Most likely that is
what you see when you capture traffic.
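
A rough illustration of the two cases, written for Python 3, where urllib.parse.urlencode accepts an encoding argument (Python 2's urllib.urlencode, used elsewhere in this thread, does not). The field name "key" comes from the original post; treating the form's page encoding as ISO-8859-1 is an assumption made for the example.

```python
from urllib.parse import urlencode

data = {"key": "\u00b1"}  # U+00B1, the plus-minus sign

# Default behaviour: text is percent-encoded from its UTF-8 bytes.
as_utf8 = urlencode(data)

# Assumed page encoding of the form: ISO-8859-1.
as_latin1 = urlencode(data, encoding="iso-8859-1")

print(as_utf8)    # key=%C2%B1
print(as_latin1)  # key=%B1
```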

In addition, I found lots of pages about URL encoding that have
conversion tables or scripts. All of them define ± as %B1.

I guess it is because web pages usually serve fairly closed language
communities. Some people just encode URLs as Latin-1, and it works for
99.9999% of their users. They just don't care that they don't handle
Chinese characters, since they have no Chinese users.

I'm really confused. I can copy and use the urlencoded values from
Firefox, but I really want to do things the right way.

It is not clear what you are doing. Are you interacting with an
independent third-party web service, or do you control both the server
and the client?

Apr 6 '06 #4

P: n/a
I have no control over the server side.

I'm using Ubuntu Breezy at home and Ubuntu Dapper at work. Now I'm at
work and the same code works properly here (it returns %B1)! I haven't
checked yet, but the locale settings and/or the installed Python version
may differ between the two computers.

I think there should be a way to encode to %B1 on any platform/locale
combination. While searching for a real solution, I'm going to add a
search-and-destroy filter for %C2 on the urlencoded dictionary as a
workaround, because my queries are constant and %C2 is the only problem
for now.
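
A note on that workaround: stripping %C2 only happens to give the right answer for characters in the range U+0080 to U+00BF. A character such as é is %C3%A9 in UTF-8 but %E9 in ISO-8859-1, so no %C2 is involved at all. A safer variant, sketched here in Python 3 with a hypothetical helper name, decodes the value and re-encodes it in the target charset:

```python
from urllib.parse import quote, unquote

def reencode_value(value, src="utf-8", dst="iso-8859-1"):
    """Hypothetical helper: transcode one percent-encoded value.

    Works on a single value, not a whole query string, because
    quote() with safe="" would also escape '=' and '&'.
    """
    text = unquote(value, encoding=src)        # percent-escapes -> text
    return quote(text, safe="", encoding=dst)  # text -> percent-escapes

print(reencode_value("%C2%B1"))  # %B1
print(reencode_value("%C3%A9"))  # %E9
```

Encoding the value correctly in the first place, as discussed later in the thread, avoids the need for any such filter.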

Apr 6 '06 #5

P: n/a
sl****@gmail.com wrote:
I think there should be a way to encode to %B1 on any platform/locale
combination. While searching for a real solution, I'm going to add a
search-and-destroy filter for %C2 on the urlencoded dictionary as a
workaround, because my queries are constant and %C2 is the only problem
for now.


I'm obviously missing some context here, but "encoding to %B1 on any
platform" is exactly what urlencode does:

>>> import urllib
>>> urllib.urlencode([("key", chr(0xb1))])
'key=%B1'

(however, if you pass in unicode values with non-ascii characters,
urlencode will give you an error).

are you sure the conversion to UTF-8 isn't happening *before* you pass
your data to urlencode ? what does

print "1", repr(data)
print "2", repr(urllib.urlencode(data))

print for the kind of data you're encoding ?
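
The same diagnostic can be sketched in Python 3, where byte strings are a separate type and urllib.parse.urlencode percent-encodes them byte by byte. The two sample values are assumptions standing in for what an ISO-8859-1 and a UTF-8 source file would hand to urlencode.

```python
from urllib.parse import urlencode

one_byte = b"\xb1"       # the character as stored in ISO-8859-1
two_bytes = b"\xc2\xb1"  # the same character as stored in UTF-8

results = {}
for label, value in (("iso-8859-1", one_byte), ("utf-8", two_bytes)):
    results[label] = urlencode([("key", value)])
    # Show the input bytes next to the encoded output.
    print(label, repr(value), repr(results[label]))
```

If the repr of the input already shows \xc2\xb1, the UTF-8 conversion happened before urlencode was called.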

</F>

Apr 6 '06 #6

P: n/a

"Fredrik Lundh" <fr*****@pythonware.com> wrote in message
news:ma***************************************@pyt hon.org...
I'm obviously missing some context here, but "encoding to %B1 on any
platform" is exactly what urlencode does:
>>> import urllib
>>> urllib.urlencode([("key", chr(0xb1))])

'key=%B1'


Yeah, but you're cheating by using the platform-independent chr(0xb1)
instead of a literal '±' in an unspecified encoding.
Apr 6 '06 #7

P: n/a
When I removed the "# -*- coding: utf-8 -*-" line from the start of the
script, it worked properly. Then I moved the variable declaration to
another file and imported it, and that worked too.

Now it's working, but I don't understand what I was doing wrong. I'm new
to Python and Unicode encoding. I tried
encode/decode (ascii, utf-8, latin-1, iso-8859-9) on this string. None of
them worked; they gave the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 5.
I think I must read more docs about Python and Unicode strings :)
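
A likely reading of that error (an editorial interpretation, not something confirmed in the thread): in Python 2, calling .encode() on a byte string first decodes it implicitly with the default ASCII codec, and that step fails on the first non-ASCII byte. The Python 3 sketch below reproduces the same failure explicitly, using the value posted later in the thread and assuming it was stored as UTF-8.

```python
# The value from the script posted later in the thread, as UTF-8 bytes.
raw = "00090\u00b1NO:\u00b1H\u00b1H\u00b1H\u00b1H\u00b1".encode("utf-8")

failure = None
try:
    raw.decode("ascii")  # what Python 2 did implicitly before encoding
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)

# Five ASCII digits come first, so the first bad byte is at index 5.
print(failure.start, hex(raw[failure.start]))  # 5 0xc2
```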

Apr 6 '06 #8

P: n/a
I just discovered that I don't have to remove that line; changing utf-8
to iso-8859-9 makes it work again. But I want to use utf-8.
Please advise...

Apr 6 '06 #9

P: n/a
Evren Esat Ozkan wrote:
When I removed the "# -*- coding: utf-8 -*-" line from the start of the
script, it worked properly. Then I moved the variable declaration to
another file and imported it, and that worked too.


the coding directive controls how *unicode* literals in the *source code*
are parsed into unicode string objects. it has absolutely nothing to do with
how urlencode works.
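
One way to picture this: the bytes on disk stay the same, and the declared coding only selects the codec used to turn them into a unicode object. The Python 3 sketch below mimics that decoding step by hand; the variable names are made up for illustration.

```python
# How an editor saving as UTF-8 stores the plus-minus sign.
on_disk = b"\xc2\xb1"

declared_utf8 = on_disk.decode("utf-8")         # one character: U+00B1
declared_latin1 = on_disk.decode("iso-8859-1")  # two characters: U+00C2, U+00B1

print(len(declared_utf8), len(declared_latin1))  # 1 2
```

A declaration that does not match what the editor wrote therefore changes the text that a unicode literal ends up holding, while a plain byte string keeps whatever bytes the editor saved.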

it would help if you posted a short self-contained code snippet, so we
don't have to keep guessing.

</F>

Apr 6 '06 #10

P: n/a
OK, I think this code snippet is enough to show what I meant:

===================================

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Change utf-8 to latin-1,
# or move the variable declaration to another file, then import it

val='00090±NO:±H±H±H±H±'

from urllib import urlencode

data={'key':val}

print urlencode(data)

===================================

Apr 7 '06 #11

P: n/a
Evren Esat Ozkan wrote:
OK, I think this code snippet is enough to show what I meant:

===================================

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Change utf-8 to latin-1,
# or move the variable declaration to another file, then import it

val='00090±NO:±H±H±H±H±'

from urllib import urlencode

data={'key':val}

print urlencode(data)

===================================


did you cut and paste this into your mail program? because the file
I got was ISO-8859-1 encoded:

Content-Type: text/plain; charset="iso-8859-1"

and uses a single byte to store each "±", and produces

key=00090%B1NO%3A%B1H%B1H%B1H%B1H%B1

when I run it, which is the expected result.

I think you're still not getting what's going on here, so let's try again:

- the urlencode function doesn't care about encodings; it translates
the bytes it gets one by one. if you pass in chr(0xB1), you get %B1
in the output.

- it's your editor that decides how the "±" characters you typed in the
original script are stored on disk; it may use one ISO-8859-1 byte, two
UTF-8 bytes, or something else.

- the coding directive doesn't affect non-Unicode string literals in
Python. in an 8-bit string, Python only sees a number of bytes.

- the urlencode function only cares about the bytes.

since you know that you want to use ISO-8859-1 encoding for your
URL, and you seem to insist on typing the "±" characters in your code,
the most portable (and editor-independent) way to write your code is
to use Unicode literals when building the string, and explicitly convert
to ISO-8859-1 on the way out.

# build the URL as a Unicode string
val = u'00090±NO:±H±H±H±H±'

# encode as 8859-1 (latin-1)
val = val.encode("iso-8859-1")

from urllib import urlencode
data={'key':val}
print urlencode(data)

key=00090%B1NO%3A%B1H%B1H%B1H%B1H%B1

this will work the same way no matter what character set you use to
store the Python source file, as long as the coding directive matches
what your editor is actually doing.

if you want to make your code 100% robust, forget the idea of putting
non-ascii characters in string literals, and use \xB1 instead:

val = '00090\xb1NO:\xb1H\xb1H\xb1H\xb1H\xb1'

# no need to encode, since the byte string is already iso-8859-1

from urllib import urlencode
data={'key':val}
print urlencode(data)

key=00090%B1NO%3A%B1H%B1H%B1H%B1H%B1
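
For anyone reading this on Python 3, a rough equivalent is sketched below. It assumes urllib.parse.urlencode and urllib.parse.parse_qs with their encoding arguments, which do not exist in the Python 2 urllib used above, and it writes the character as the escape \u00b1 so the source file's encoding does not matter.

```python
from urllib.parse import parse_qs, urlencode

# In Python 3 string literals are text; \u00b1 is the plus-minus sign.
val = "00090\u00b1NO:\u00b1H\u00b1H\u00b1H\u00b1H\u00b1"

# Percent-encode using the charset the server expects (ISO-8859-1 here).
query = urlencode({"key": val}, encoding="iso-8859-1")
print(query)  # key=00090%B1NO%3A%B1H%B1H%B1H%B1H%B1

# Round trip: decoding with the same charset recovers the original text.
decoded = parse_qs(query, encoding="iso-8859-1")
```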

hope this helps!

</F>

Apr 7 '06 #12

P: n/a
I copied and pasted my code into a new file and saved it with UTF-8
encoding. It produced 00090%C2%B1NO%3A%C2%B1H%C2%B1H%C2%B1H%C2%B1H%C2%B1.
Then I added "u" to the declaration and encoded it with iso-8859-1 as you
wrote, and finally it produced the proper result.

Your reply helped a lot and clarified some things about Unicode string
usage in Python.
Thank you very much!

Apr 7 '06 #13
