sys.stdout, urllib and unicode... I don't understand.

Hello fellow pythonists,

I'm a relatively new Python developer, and I am trying to adjust my
understanding of "how things work" to Python, but I have hit a
block that I cannot understand.
I needed to output Unicode data coming back from a web service, and could
not get Unicode/multibyte text back before applying a hack that I
don't understand (thank you, Google).

I have written a simple wxPython application that takes input
from a user, sends it to a web service, and gets back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a pass through
unicodedata.normalize(), as urllib.quote() cannot work on unicode:
>>> srcText = unicodedata.normalize('NFKD', srcText).encode('latin1', 'ignore')
After that, a urllib2 request is sent with this encoded string to the
web service:
>>> con = urllib2.Request(self.url, headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
>>> req = urllib2.urlopen(con)
First problem: how do I determine the encoding of the response?
If I inspect a request from Firefox, I see that the server's response
header specifies UTF-8.
But if I use this code:
>>> ret = u''
>>> for line in req:
...     ret = ret + string.replace(line.strip(), '\n', chr(10))
I end up with a UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disappear.
Until now I have avoided that problem, as the service always seems to
return one line, but I am wondering.

Second problem: if I try a
>>> print line
inside the loop, I get the same error too. I thought that unicode() would
force Python to consider the given text as unicode, not try to
convert it to unicode.
Here again, trying several normalize/decode combinations did not help
at all.

Then, looking for help through Google, I found this post:
http://mail.python.org/pipermail/pyt...er/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
>>> self.out = OutStreamEncoder(sys.stdout, 'utf-8')
But what is strange is that since I did that, even without using this
self.out writer, the unicode translations are working as I was
expecting them to. Except in the for loop, where a concatenation still
triggers the UnicodeDecodeError exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.

Does defining that writer object trigger an initialization
of the standard sys.stdout object?
Is it related to an internal usage of it, maybe in urllib?
I tried to find more on the subject, but fell short.
Can someone explain to me what is happening?
The full script source can be found at http://www.webalis.com/translator/translator.pyw
Nov 11 '08 #1
Thierry wrote:
> Hello fellow pythonists,
>
> I'm a relatively new Python developer, and I am trying to adjust my
> understanding of "how things work" to Python, but I have hit a
> block that I cannot understand.
> I needed to output Unicode data coming back from a web service, and could
> not get Unicode/multibyte text back before applying a hack that I
> don't understand (thank you, Google).
>
> I have written a simple wxPython application that takes input
> from a user, sends it to a web service, and gets back translations in
> several languages.
> The service itself is fully UTF-8.
>
> The "source" string is first encoded to "latin1" after a pass through
> unicodedata.normalize(), as urllib.quote() cannot work on unicode:
> >>> srcText = unicodedata.normalize('NFKD', srcText).encode('latin1', 'ignore')

urllib.quote() operates on byte strings. If your web service is UTF-8,
it would make sense to use UTF-8 as the input encoding, not latin1,
wouldn't it? For example: unicodeinput.encode("utf-8")
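
To make the suggestion above concrete, here is a minimal sketch. It is written for Python 3, where `urllib.quote()` lives at `urllib.parse.quote()`, so it is an adaptation rather than the poster's Python 2 code. The sample word and the variable names are assumptions chosen for illustration. It compares percent-encoding the UTF-8 bytes with the lossy NFKD + latin1 + 'ignore' route from the original post:

```python
import unicodedata
from urllib.parse import quote

# Hypothetical user input containing one accented character (e with acute).
src_text = "caf\u00e9"

# Route suggested in the reply: encode to UTF-8, then percent-encode the bytes.
utf8_quoted = quote(src_text.encode("utf-8"))

# Route from the original post: NFKD splits the accent into a separate
# combining character, which latin1 cannot represent, so 'ignore' drops it.
lossy_bytes = unicodedata.normalize("NFKD", src_text).encode("latin1", "ignore")
lossy_quoted = quote(lossy_bytes)

print(utf8_quoted)
print(lossy_quoted)
```

The UTF-8 route keeps the accent as a percent-escaped byte pair, while the latin1 route silently sends plain "cafe" to the service.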

> After that, a urllib2 request is sent with this encoded string to the
> web service:
> >>> con = urllib2.Request(self.url, headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
> >>> req = urllib2.urlopen(con)
>
> First problem: how do I determine the encoding of the response?

It is sent as part of the headers, e.g. Content-Type: text/html;
charset=utf-8
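
As a rough illustration of reading that header, the sketch below parses a Content-Type value with the standard library's `email.message.Message`. It is Python 3 code, the helper name `charset_from_content_type` and the fallback default are assumptions, and the header strings are made-up samples rather than output from the actual service:

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption, not something the server
    promises) when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/plain", default="latin-1"))
```

With a live response the header value would come from the response object (in Python 2's urllib2, the headers are reachable through `req.info()`).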

> If I inspect a request from Firefox, I see that the server's response
> header specifies UTF-8.
> But if I use this code:
> >>> ret = u''
> >>> for line in req:
> ...     ret = ret + string.replace(line.strip(), '\n', chr(10))
> I end up with a UnicodeDecodeError. I tried various line.decode(),
> line.normalize and such, but could not make this error disappear.
> Until now I have avoided that problem, as the service always seems to
> return one line, but I am wondering.

The web server's answer is an encoded byte stream too (usually UTF-8, but you
can check the headers), so

line.decode("utf-8") should give you unicode to operate on (always
do string operations on the canonical form).

> Second problem: if I try a
> >>> print line
> inside the loop, I get the same error too. I thought that unicode() would
> force Python to consider the given text as unicode, not try to
> convert it to unicode.

But that is what it does. Basically, unicode() is a constructor for
unicode objects.

> Here again, trying several normalize/decode combinations did not help
> at all.

It's not too complicated; you just need to keep unicode and byte strings
separate and draw a clean line between the two (the line is decode()
and encode()).
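
The "clean line" idea can be sketched as follows. This is Python 3 code, and the `io.BytesIO` object is only a stand-in (an assumption for illustration) for the response returned by `urlopen()`; the sample text is invented:

```python
import io

# Stand-in for the HTTP response body: an iterable of UTF-8 encoded byte lines.
fake_response = io.BytesIO("Gr\u00fc\u00dfe\n\u00e7a va\n".encode("utf-8"))

pieces = []
for raw_line in fake_response:
    # decode() is the boundary on the way in: bytes become text here.
    pieces.append(raw_line.decode("utf-8").strip())

# All string manipulation happens on decoded text.
text = "\n".join(pieces)

# encode() is the boundary on the way out (file, socket, terminal).
payload = text.encode("utf-8")

print(repr(text))
print(repr(payload))
```

Everything between the two boundaries works on text only, so no implicit conversion is ever attempted.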

> Then, looking for help through Google, I found this post:
> http://mail.python.org/pipermail/pyt...er/462977.html
> and I gave it a try. What I did, though, was not to override
> sys.stdout, but to declare a new writer stream as a property of my
> main class:
> >>> self.out = OutStreamEncoder(sys.stdout, 'utf-8')

This is fancy but not needed if you take care like above.
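
For readers curious about what such an encoding wrapper does: `OutStreamEncoder` comes from the linked mailing-list post and its implementation is not shown in this thread, so the sketch below does not reproduce it. It uses the standard library's `codecs.getwriter()` as a comparable technique, in Python 3, with an `io.BytesIO` buffer standing in for a byte-oriented output stream (both choices are assumptions for illustration):

```python
import codecs
import io

# Byte sink standing in for a terminal or file opened in binary mode.
sink = io.BytesIO()

# getwriter() returns a StreamWriter class; instantiating it wraps the sink
# so that text written to `out` is encoded to UTF-8 before reaching the sink.
out = codecs.getwriter("utf-8")(sink)
out.write("\u00e9t\u00e9\n")

written = sink.getvalue()
print(repr(written))
```

Creating such a wrapper only affects writes made through the wrapper itself; it does not change how other code encodes or decodes strings.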

HTH
Tino

Nov 11 '08 #2
On Tue, 11 Nov 2008 12:18:26 -0800, Thierry wrote:
> I have written a simple wxPython application that takes input from
> a user, sends it to a web service, and gets back translations in several
> languages.
> The service itself is fully UTF-8.
>
> The "source" string is first encoded to "latin1" after a pass through
> unicodedata.normalize(), as urllib.quote() cannot work on unicode:
> >>> srcText = unicodedata.normalize('NFKD', srcText).encode('latin1', 'ignore')

If the service uses UTF-8, why don't you just encode the data you send as
UTF-8 instead of Latin-1, potentially throwing away data because of the
'ignore' argument!? Make that ``src_text = src_text.encode('utf-8')``.

> >>> req = urllib2.urlopen(con)
>
> First problem: how do I determine the encoding of the response? If I
> inspect a request from Firefox, I see that the server's response header
> specifies UTF-8.
> But if I use this code:
> >>> ret = u''
> >>> for line in req:
> ...     ret = ret + string.replace(line.strip(), '\n', chr(10))
> I end up with a UnicodeDecodeError.

Because `line` contains bytes and `ret` is a `unicode` object. If you
add a `unicode` object and a `str` object, Python tries to convert the
`str` to `unicode` using the default (ASCII) encoding. And this fails
if there are byte values > 127. *You* have to decode `line` from a bunch
of bytes to a bunch of (unicode) characters before you concatenate the
strings.
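
The failure described above can be reproduced in a small sketch. Note the assumption: this is Python 3, which refuses to mix bytes and text at all instead of attempting an implicit ASCII decode, so the ASCII step that Python 2 performs behind the scenes is spelled out explicitly here. The sample bytes are invented:

```python
# A response line as it arrives from the network: UTF-8 encoded bytes.
line = "\u00e9t\u00e9".encode("utf-8")

# What Python 2 attempts implicitly when adding str to unicode:
# decoding the bytes as ASCII, which fails for byte values above 127.
try:
    line.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# The explicit fix: decode with the real encoding before concatenating.
ret = ""
ret = ret + line.decode("utf-8")

print(implicit_ok)
print(repr(ret))
```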

BTW: ``line.strip()`` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore. And functions in the
`string` module that are also implemented as methods on `str` or `unicode`
are deprecated.
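
Both remarks can be checked with a short sketch (Python 3, with an invented sample line): `strip()` already removes the trailing newline, and the method form `line.replace(...)` does the job of the old `string.replace(line, ...)` function.

```python
# A decoded response line with surrounding whitespace and a line ending.
line = "  translated text\r\n"

# strip() removes whitespace at both ends, including '\r' and '\n'.
stripped = line.strip()

# Method form of replace; there is no newline left, so nothing changes.
replaced = stripped.replace("\n", chr(10))

print(repr(stripped))
print(replaced == stripped)
```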

Ciao,
Marc 'BlackJack' Rintsch
Nov 12 '08 #3
Thank you to both of you (Marc and Tino).

I feel a bit stupid right now because, as both of you said, encoding
my source string to UTF-8 does not produce an exception when I pass it
to urllib.quote(), and is what it should be.
I was certain that this had raised an error earlier, and I did not try it
again.
That is the result of 2 days of making random changes and hoping it works. I
know, thinking should have come first. My bad...

The same goes for my handling of the iteration over the request
result.
I now have a
>>> line = line.decode('utf-8')
and no errors (as long as I don't try to print this to stdout, which I
understand).
So, I'm now really getting back a unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is in now corresponds to what I had in mind originally.

> BTW: ``line.strip()`` removes all whitespace at both ends *including
> newlines*, so there are no '\n' to replace anymore.

Not exactly...
I receive a string with 2 literal characters in it: "\" and
"n".
What I (want to) do here is replace those 2 characters with 1
chr(10).

> And functions in the
> `string` module that are also implemented as methods on `str` or `unicode`
> are deprecated.

I had actually read that, but had not modified my code.
Thanks for pointing it out.

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.
Nov 12 '08 #4
Thierry wrote:
> Thank you to both of you (Marc and Tino).
>
> I feel a bit stupid right now because, as both of you said, encoding
> my source string to UTF-8 does not produce an exception when I pass it
> to urllib.quote(), and is what it should be.
> I was certain that this had raised an error earlier, and I did not try it
> again.
> That is the result of 2 days of making random changes and hoping it works. I
> know, thinking should have come first. My bad...
>
> The same goes for my handling of the iteration over the request
> result.
> I now have a
> >>> line = line.decode('utf-8')
> and no errors (as long as I don't try to print this to stdout, which I
> understand).
> So, I'm now really getting back a unicode string that I can handle as
> such.
>
> I really am confused about what I was trying to do...
> I cannot understand what I did that caused those errors, because the
> state the script is in now corresponds to what I had in mind originally.
>
>> BTW: ``line.strip()`` removes all whitespace at both ends *including
>> newlines*, so there are no '\n' to replace anymore.
>
> Not exactly...
> I receive a string with 2 literal characters in it: "\" and
> "n".
> What I (want to) do here is replace those 2 characters with 1
> chr(10).

In that case you would need the following code:

ret = u''
for line in req:
    ret = ret + string.replace(line.strip(), '\\n', '\n')

Otherwise you just replace chr(10)'s with chr(10)'s, which won't help you.
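
The difference between the two replacements can be seen in a small sketch (Python 3, method form of `replace`, invented sample string):

```python
# A string containing a literal backslash followed by 'n' (two characters),
# as the service is described as returning.
raw = "first line\\nsecond line"

# Replacing the two-character sequence with one real newline character.
fixed = raw.replace("\\n", "\n")

# The original attempt: '\n' and chr(10) are the same character,
# and the string contains no real newline, so nothing changes.
unchanged = raw.replace("\n", chr(10))

print(fixed)
print(unchanged == raw)
```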

Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?

>> And functions in the
>> `string` module that are also implemented as methods on `str` or `unicode`
>> are deprecated.
>
> I had actually read that, but had not modified my code.
> Thanks for pointing it out.
>
> Anyway, thanks again to both of you.
> I'm quite happy to see it working the way I intended.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Nov 12 '08 #5
> Are you sure that Python wasn't just printing out "\n" because you'd
> asked it to show you the repr() of a string containing newlines?

Yes, I am sure, because I dumped the ord() values to check them.
But again, I'm stumped at how complicated I have made this.
I should not try to code at 2am anymore.
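
The ord() check mentioned above can be sketched like this (Python 3, with two invented sample strings). Dumping the code points is unambiguous, whereas repr() shows a real newline as the two-character escape `\n`:

```python
# Two literal characters: a backslash followed by 'n'.
literal = "a\\nb"
# One real newline character.
real = "a\nb"

literal_codes = [ord(c) for c in literal]
real_codes = [ord(c) for c in real]

print(literal_codes)  # contains 92 (backslash) and 110 ('n')
print(real_codes)     # contains 10 (newline)
print(repr(real))     # repr() displays the newline as an escape sequence
```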
Nov 12 '08 #6
