473,889 Members | 1,428 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

sys.stdout, urllib and unicode... I don't understand.

Hello fellow pythonists,

I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)

I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normali ze(), as urllib.quote() cannot work on unicode
>>srcText=unico dedata.normaliz e('NFKD',srcTex t).encode('lati n1','ignore')
After that, an urllib request is sent with this encoded string to the
web service
>>con=urllib2.R equest(self.url , headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host ='http://translate.googl e.com')
>>req=urllib2.u rlopen(con)
First problem, how to determine the encoding of the return ?
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:
>>ret=U''
for line in req:
ret=ret+string. replace(line.st rip(),'\n',chr( 10))
I end up with an UnicodeDecodeEr ror. I tried various line.decode(),
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.

Second problem, if I try an
>>print line
into the loop, I too get the same error. I though that unicode() would
force python to consider the given text as unicode, not to try to
convert it to unicode.
Here again, trying several normalize/decode combination did not helped
at all.

Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/pyt...er/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
>>self.out=OutS treamEncoder(sy s.stdout, 'utf-8')
But what is strange, is that since I did that, even without using this
self.out writer, the unicode translation are working as I was
expecting them to. Except on the for loop, where a concatenation still
triggers the UnicodeDecodeEr ro exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.

Does the fact that defining that writer object does a initialization
of the standard sys.stdout object ?
Does it is related to an internal usage of it, maybe in urllib ?
I tried to find more on the subject, but felt short.
Can someone explain to me what is happening ?
The full script source can be found at http://www.webalis.com/translator/translator.pyw
Nov 11 '08 #1
5 2608
Thierry wrote:
Hello fellow pythonists,

I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)

I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normali ze(), as urllib.quote() cannot work on unicode
>>srcText=unico dedata.normaliz e('NFKD',srcTex t).encode('lati n1','ignore')
urllib.quote() operates on byte streams. If your web service is UTF-8
it would make sense to use UTF-8 as input encoding not latin1,
wouldn't it? unicodeinput.en code("utf-8")
After that, an urllib request is sent with this encoded string to the
web service
>>con=urllib2.R equest(self.url , headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host ='http://translate.googl e.com')
>>req=urllib2.u rlopen(con)

First problem, how to determine the encoding of the return ?
It is sent as part of the headers. e.g. content-type: text/html;
charset=utf-8
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:
>>ret=U''
for line in req:
ret=ret+string. replace(line.st rip(),'\n',chr( 10))
I end up with an UnicodeDecodeEr ror. I tried various line.decode(),
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.
web server answer is encoded byte stream too (usually utf-8 but you
can check the headers) so

line.decoce("ut f-8") should give you unicode to operate on (always
do string operations on canonized form)
Second problem, if I try an
>>print line
into the loop, I too get the same error. I though that unicode() would
force python to consider the given text as unicode, not to try to
convert it to unicode.
But it is what it does. Basically unicode() is a constructor for
unicode objects.
Here again, trying several normalize/decode combination did not helped
at all.
Its not too complicated, you just need to keep unicode and byte strings
separate and draw a clean line between the two. (the line is decode()
and encode() )
Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/pyt...er/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
>>self.out=OutS treamEncoder(sy s.stdout, 'utf-8')
This is fancy but not needed if you take care like above.

HTH
Tino

Nov 11 '08 #2
On Tue, 11 Nov 2008 12:18:26 -0800, Thierry wrote:
I have realized an wxPython simple application, that takes the input of
a user, send it to a web service, and get back translations in several
languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a passage into
unicode.normali ze(), as urllib.quote() cannot work on unicode
>>>srcText=unic odedata.normali ze('NFKD',srcTe xt).encode('lat in1','ignore')
If the service uses UTF-8 why don't you just encode the data you send as
UTF-8 but Latin-1 with potentially throwing away data because of the
'ignore' argument!? Make that ``src_text = unicodedata.enc ode('utf-8')``
>>>req=urllib2. urlopen(con)

First problem, how to determine the encoding of the return ? If I
inspect a request from firefox, I see that the server return header
specify UTF-8
But if I use this code:
>>>ret=U''
for line in req:
ret=ret+string. replace(line.st rip(),'\n',chr( 10))
I end up with an UnicodeDecodeEr ror.
Because `line` contains bytes and `ret` is a `unicode` object. If you
add a `unicode` object and a `str` object, Python tries to convert the
`str` to `unicode` using the default == ASCII encoding. And this fails
if there are byte value >127. *You* have to decode `line` from a bunch
of bytes to a bunch of (unicode)charac ters before you concatenate the
strings.

BTW: ``line.strip()` ` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore. And functions in the
`string` module that are also implemented as method on `str` or `unicode`
are deprecated.

Ciao,
Marc 'BlackJack' Rintsch
Nov 12 '08 #3
Thank you to both of you (Marc and Tino).

I feel a bit stupid right now, because as both of you said, encoding
my source string to utf-8 do not produce an exception when I pass it
to urllib.quote() and is what it should be.
I was certain that this created an error sooner, and id not tried it
again.
The result of 2 days making random changes and hoping it works. I
know, reflection should have primed. My bad...

The same goes for my treatment in the iteration over the request
result.
I now have an
>line=line.enco de('utf-8')
and no errors (as long as I don't try to print this to stdout, which I
understand).
So, I'm now really getting back an unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is now correspond to what I have in mind originally.
>>BTW: ``line.strip()` ` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore.
Not exactly...
It's that I receive a string, with 2 literal characters in it: "\" and
"n".
What I (want to) do here is that I replace those 2 characters with 1
chr(10).
>>And functions in the
`string` module that are also implemented as method on `str` or `unicode`
are deprecated.
I actually had read that, but not modified my code.
Thank to point it out

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.
Nov 12 '08 #4
Thierry wrote:
Thank you to both of you (Marc and Tino).

I feel a bit stupid right now, because as both of you said, encoding
my source string to utf-8 do not produce an exception when I pass it
to urllib.quote() and is what it should be.
I was certain that this created an error sooner, and id not tried it
again.
The result of 2 days making random changes and hoping it works. I
know, reflection should have primed. My bad...

The same goes for my treatment in the iteration over the request
result.
I now have an
>>line=line.enc ode('utf-8')
and no errors (as long as I don't try to print this to stdout, which I
understand).
So, I'm now really getting back an unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is now correspond to what I have in mind originally.
>>BTW: ``line.strip()` ` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore.
Not exactly...
It's that I receive a string, with 2 literal characters in it: "\" and
"n".
What I (want to) do here is that I replace those 2 characters with 1
chr(10).
In that case you would need the following code:

ret=U''
for line in req:
ret=ret+string. replace(line.st rip(),'\\n', '\n')

Otherwise you just replace chr(10)'s with chr(10)'s, which won't help you.

Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?
>>And functions in the
`string` module that are also implemented as method on `str` or `unicode`
are deprecated.
I actually had read that, but not modified my code.
Thank to point it out

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Nov 12 '08 #5
Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?
Yes, I am sure. Because I dumped the ord() values to check them.
But again, I'm stumped on how complicated I have made this.
I should not try to code anymore at 2am.
Nov 12 '08 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
3999
by: Richard Shea | last post by:
Hi - I'm new to Python. I've been trying to use URLLIB and the 'tidy' function (part of the mx.tidy package). There's one thing I'm having real difficulties understanding. When I did this ... finA= urllib.urlopen('http://www.python.org/') foutA=open('C:\\testout.html','w') tidy(finA,foutA,None) I get ...
7
2341
by: Stuart McGraw | last post by:
I just spent a $*#@!*&^&% hour registering at ^$#@#%^ Sourceforce and trying to submit a Python bug report but it still won't let me. I give up. Maybe someone who cares will see this post, or maybe it will save time for someone else who runs into this problem... ================================================ Environment: - Microsoft Windows 2000 Pro
2
5078
by: velle | last post by:
My headache is growing while playing arround with unicode in Python, please help this novice. I have chosen to divide my problem into a few questions. Python 2.3.4 (#1, Feb 2 2005, 12:11:53) on linux2 1) Does " >>>print 'hello' " simply write to sys.stdout?
12
4963
by: sleytr | last post by:
Hi, I'm trying to make a gui for a web service. Site using ± character in value of some fields. But I can't encode this character properly. >>> data = {'key':'±'} >>> urllib.urlencode(data) 'key=%C2%B1' but it should be only %B1 not %C2%B1. where is this %C2 coming from?
11
9532
by: George Sakkis | last post by:
The following snippet results in different outcome for (at least) the last three major releases: # Python 2.3.4 u'%94' # Python 2.4.2 UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0: ordinal not in range(128)
5
7727
by: John Nagle | last post by:
I thought I had all the timeout problems with urllib worked around, but no. socket.setdefaulttimeout is useful, but not always effective. I'm setting that to 15 seconds. If the host end won't open the connection within 15 seconds, urllib times out. But if the host end opens the connection, then never sends anything, urllib waits for many minutes before timing out. Any idea how to deal with this? And don't just say "use urllib2"...
1
3121
by: John Nagle | last post by:
The code in urllib.quote fails on Unicode input, when called by robotparser. That bit of code needs some attention. - It still assumes ASCII goes up to 255, which hasn't been true in Python for a while now. - The initialization may not be thread-safe; a table is being initialized on first use. The code is too clever and uncommented. "robotparser" was trying to check if a URL,
2
3143
by: Brent Lievers | last post by:
Greetings, I have observed the following (python 2.5.1): UTF-8 é Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
0
1332
by: Jerry Hill | last post by:
On Fri, Oct 3, 2008 at 5:38 PM, Valery Khamenya <khamenya@gmail.comwrote: Do you know what, exactly, you'd like the result to be? The encoding of unicode characters into URIs is not well defined. My understanding is that the most common case is to percent-encode UTF-8, like this: 'M%C3%BCller' If you need to, you can encode your unicode string differently, like this:
0
9962
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9810
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
11198
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10889
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10442
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
7993
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
7150
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5829
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
3
3256
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.