By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
424,846 Members | 1,646 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 424,846 IT Pros & Developers. It's quick & easy.

sys.stdout, urllib and unicode... I don't understand.

P: n/a
Hello fellow pythonists,

I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)

I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normalize(), as urllib.quote() cannot work on unicode
>>srcText=unicodedata.normalize('NFKD',srcText).en code('latin1','ignore')
After that, an urllib request is sent with this encoded string to the
web service
>>con=urllib2.Request(self.url, headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
>>req=urllib2.urlopen(con)
First problem, how to determine the encoding of the return ?
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:
>>ret=U''
for line in req:
ret=ret+string.replace(line.strip(),'\n',chr(10))
I end up with an UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.

Second problem, if I try an
>>print line
into the loop, I too get the same error. I though that unicode() would
force python to consider the given text as unicode, not to try to
convert it to unicode.
Here again, trying several normalize/decode combination did not helped
at all.

Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/pyt...er/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
>>self.out=OutStreamEncoder(sys.stdout, 'utf-8')
But what is strange, is that since I did that, even without using this
self.out writer, the unicode translation are working as I was
expecting them to. Except on the for loop, where a concatenation still
triggers the UnicodeDecodeErro exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.

Does the fact that defining that writer object does a initialization
of the standard sys.stdout object ?
Does it is related to an internal usage of it, maybe in urllib ?
I tried to find more on the subject, but felt short.
Can someone explain to me what is happening ?
The full script source can be found at http://www.webalis.com/translator/translator.pyw
Nov 11 '08 #1
Share this Question
Share on Google+
5 Replies


P: n/a
Thierry wrote:
Hello fellow pythonists,

I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)

I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normalize(), as urllib.quote() cannot work on unicode
>>srcText=unicodedata.normalize('NFKD',srcText).en code('latin1','ignore')
urllib.quote() operates on byte streams. If your web service is UTF-8
it would make sense to use UTF-8 as input encoding not latin1,
wouldn't it? unicodeinput.encode("utf-8")
After that, an urllib request is sent with this encoded string to the
web service
>>con=urllib2.Request(self.url, headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
>>req=urllib2.urlopen(con)

First problem, how to determine the encoding of the return ?
It is sent as part of the headers. e.g. content-type: text/html;
charset=utf-8
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:
>>ret=U''
for line in req:
ret=ret+string.replace(line.strip(),'\n',chr(10))
I end up with an UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.
web server answer is encoded byte stream too (usually utf-8 but you
can check the headers) so

line.decoce("utf-8") should give you unicode to operate on (always
do string operations on canonized form)
Second problem, if I try an
>>print line
into the loop, I too get the same error. I though that unicode() would
force python to consider the given text as unicode, not to try to
convert it to unicode.
But it is what it does. Basically unicode() is a constructor for
unicode objects.
Here again, trying several normalize/decode combination did not helped
at all.
Its not too complicated, you just need to keep unicode and byte strings
separate and draw a clean line between the two. (the line is decode()
and encode() )
Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/pyt...er/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
>>self.out=OutStreamEncoder(sys.stdout, 'utf-8')
This is fancy but not needed if you take care like above.

HTH
Tino

Nov 11 '08 #2

P: n/a
On Tue, 11 Nov 2008 12:18:26 -0800, Thierry wrote:
I have realized an wxPython simple application, that takes the input of
a user, send it to a web service, and get back translations in several
languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normalize(), as urllib.quote() cannot work on unicode
>>>srcText=unicodedata.normalize('NFKD',srcText).e ncode('latin1','ignore')
If the service uses UTF-8 why don't you just encode the data you send as
UTF-8 but Latin-1 with potentially throwing away data because of the
'ignore' argument!? Make that ``src_text = unicodedata.encode('utf-8')``
>>>req=urllib2.urlopen(con)

First problem, how to determine the encoding of the return ? If I
inspect a request from firefox, I see that the server return header
specify UTF-8
But if I use this code:
>>>ret=U''
for line in req:
ret=ret+string.replace(line.strip(),'\n',chr(10))
I end up with an UnicodeDecodeError.
Because `line` contains bytes and `ret` is a `unicode` object. If you
add a `unicode` object and a `str` object, Python tries to convert the
`str` to `unicode` using the default == ASCII encoding. And this fails
if there are byte value >127. *You* have to decode `line` from a bunch
of bytes to a bunch of (unicode)characters before you concatenate the
strings.

BTW: ``line.strip()`` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore. And functions in the
`string` module that are also implemented as method on `str` or `unicode`
are deprecated.

Ciao,
Marc 'BlackJack' Rintsch
Nov 12 '08 #3

P: n/a
Thank you to both of you (Marc and Tino).

I feel a bit stupid right now, because as both of you said, encoding
my source string to utf-8 do not produce an exception when I pass it
to urllib.quote() and is what it should be.
I was certain that this created an error sooner, and id not tried it
again.
The result of 2 days making random changes and hoping it works. I
know, reflection should have primed. My bad...

The same goes for my treatment in the iteration over the request
result.
I now have an
>line=line.encode('utf-8')
and no errors (as long as I don't try to print this to stdout, which I
understand).
So, I'm now really getting back an unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is now correspond to what I have in mind originally.
>>BTW: ``line.strip()`` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore.
Not exactly...
It's that I receive a string, with 2 literal characters in it: "\" and
"n".
What I (want to) do here is that I replace those 2 characters with 1
chr(10).
>>And functions in the
`string` module that are also implemented as method on `str` or `unicode`
are deprecated.
I actually had read that, but not modified my code.
Thank to point it out

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.
Nov 12 '08 #4

P: n/a
Thierry wrote:
Thank you to both of you (Marc and Tino).

I feel a bit stupid right now, because as both of you said, encoding
my source string to utf-8 do not produce an exception when I pass it
to urllib.quote() and is what it should be.
I was certain that this created an error sooner, and id not tried it
again.
The result of 2 days making random changes and hoping it works. I
know, reflection should have primed. My bad...

The same goes for my treatment in the iteration over the request
result.
I now have an
>>line=line.encode('utf-8')
and no errors (as long as I don't try to print this to stdout, which I
understand).
So, I'm now really getting back an unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is now correspond to what I have in mind originally.
>>BTW: ``line.strip()`` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore.
Not exactly...
It's that I receive a string, with 2 literal characters in it: "\" and
"n".
What I (want to) do here is that I replace those 2 characters with 1
chr(10).
In that case you would need the following code:

ret=U''
for line in req:
ret=ret+string.replace(line.strip(),'\\n', '\n')

Otherwise you just replace chr(10)'s with chr(10)'s, which won't help you.

Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?
>>And functions in the
`string` module that are also implemented as method on `str` or `unicode`
are deprecated.
I actually had read that, but not modified my code.
Thank to point it out

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Nov 12 '08 #5

P: n/a
Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?
Yes, I am sure. Because I dumped the ord() values to check them.
But again, I'm stumped on how complicated I have made this.
I should not try to code anymore at 2am.
Nov 12 '08 #6

This discussion thread is closed

Replies have been disabled for this discussion.