sys.stdout, urllib and unicode... I don't understand.

Thierry

Hello fellow pythonists,

I'm a relatively new Python developer, and I'm trying to adjust my
understanding of "how things work" to Python, but I have hit a
block that I cannot understand.
I needed to output unicode data coming back from a web service, and could
not get unicode/multibyte text back before applying a hack that I
don't understand (thank you, Google).

I have written a simple wxPython application that takes the user's
input, sends it to a web service, and gets back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a pass through
unicodedata.normalize(), as urllib.quote() cannot work on unicode

>>> srcText = unicodedata.normalize('NFKD', srcText).encode('latin1', 'ignore')

After that, an urllib request is sent with this encoded string to the
web service

>>> con = urllib2.Request(self.url, headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')

>>> req = urllib2.urlopen(con)

First problem: how do I determine the encoding of the response?
If I inspect a request from Firefox, I see that the server's response
header specifies UTF-8.
But if I use this code:

>>> ret = u''
... for line in req:
...     ret = ret + string.replace(line.strip(), '\n', chr(10))

I end up with a UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disappear.
Until now I have avoided that problem, as the service always seems to
return one line, but I am wondering.

Second problem: if I put a

>>> print line

inside the loop, I get the same error. I thought that unicode() would
force Python to consider the given text as unicode, not try to
convert it to unicode.
Here again, trying several normalize/decode combinations did not help
at all.

Then, looking for help through Google, I found this post:
http://mail.python.org/pipermail/pyt...er/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:

>>> self.out = OutStreamEncoder(sys.stdout, 'utf-8')

But what is strange is that since I did that, even without using this
self.out writer, the unicode translations are working as I was
expecting them to. Except in the for loop, where a concatenation still
triggers the UnicodeDecodeError exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.

Does defining that writer object somehow initialize
the standard sys.stdout object?
Is it related to some internal use of it, maybe in urllib?
I tried to find more on the subject, but fell short.
Can someone explain to me what is happening?
The full script source can be found at http://www.webalis.com/translator/translator.pyw

Nov 11 '08 #1

Tino Wildenhain

Thierry wrote:

> The "source" string is first encoded to "latin1" after a pass through
> unicodedata.normalize(), as urllib.quote() cannot work on unicode
>
> >>> srcText = unicodedata.normalize('NFKD', srcText).encode('latin1', 'ignore')

urllib.quote() operates on byte strings. If your web service is UTF-8,
it would make sense to use UTF-8 as the input encoding, not latin1,
wouldn't it? unicodeinput.encode("utf-8")
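
A minimal sketch of that suggestion. It is written for Python 3 as an assumption (the thread itself uses Python 2, where the corresponding call is urllib.quote() on a UTF-8 encoded str), and the sample strings are made up for illustration. It also shows why latin1 with 'ignore' can silently drop input.

```python
from urllib.parse import quote

# Hypothetical user input; any non-ASCII text would do.
src_text = "Müller"

# Encode to UTF-8 bytes first, then percent-encode those bytes.
quoted = quote(src_text.encode("utf-8"))
print(quoted)  # M%C3%BCller

# For comparison: characters outside latin1 are simply discarded by 'ignore'.
lossy = "日本語".encode("latin1", "ignore")
print(lossy)  # b''
```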

> After that, an urllib request is sent with this encoded string to the
> web service
>
> >>> con = urllib2.Request(self.url, headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
>
> >>> req = urllib2.urlopen(con)
>
> First problem, how to determine the encoding of the return ?

It is sent as part of the response headers, e.g.
Content-Type: text/html; charset=utf-8
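
As a rough illustration of reading the charset out of such a header, here is a Python 3 sketch (an assumption; the thread uses Python 2). It parses a hard-coded, hypothetical header value with the standard email.message.Message class instead of making a real request. In Python 3, the headers object of a urlopen() response derives from that class, so the same get_content_charset() call is available there.

```python
from email.message import Message

# Hypothetical header value, as a server might send it.
header_value = "text/html; charset=utf-8"

msg = Message()
msg["Content-Type"] = header_value

# Returns the charset parameter in lower case, or None if it is absent.
charset = msg.get_content_charset()

# Falling back to UTF-8 is a choice made for this sketch, not a rule.
encoding = charset or "utf-8"
print(encoding)
```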

> If I inspect a request from Firefox, I see that the server's response
> header specifies UTF-8.
> But if I use this code:
>
> >>> ret = u''
> ... for line in req:
> ...     ret = ret + string.replace(line.strip(), '\n', chr(10))
>
> I end up with a UnicodeDecodeError. I tried various line.decode(),
> line.normalize and such, but could not make this error disappear.
> Until now I have avoided that problem, as the service always seems to
> return one line, but I am wondering.

The web server's answer is an encoded byte stream too (usually UTF-8, but you
can check the headers), so

line.decode("utf-8") should give you unicode to operate on (always
do string operations on the canonical form)

> Second problem: if I put a
>
> >>> print line
>
> inside the loop, I get the same error. I thought that unicode() would
> force Python to consider the given text as unicode, not try to
> convert it to unicode.

But it is what it does. Basically unicode() is a constructor for
unicode objects.

> Here again, trying several normalize/decode combinations did not help
> at all.

It's not too complicated, you just need to keep unicode and byte strings
separate and draw a clean line between the two. (The line is decode()
and encode().)
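
One way to picture that line between the two, sketched in Python 3 (an assumption; in the thread's Python 2 the types are str and unicode rather than bytes and str). The sample bytes are invented for the example.

```python
# Bytes as they might arrive from the network: UTF-8 for "café" plus a newline.
raw_line = b"caf\xc3\xa9\n"

# Boundary in: bytes -> text, decoded exactly once.
text = raw_line.decode("utf-8")

# All string work happens on text only.
cleaned = text.strip().upper()

# Boundary out: text -> bytes, encoded exactly once.
out_bytes = cleaned.encode("utf-8")

print(cleaned)
print(out_bytes)
```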

> Then, looking for help through Google, I found this post:
> http://mail.python.org/pipermail/pyt...er/462977.html
> and I gave it a try. What I did, though, was not to override
> sys.stdout, but to declare a new writer stream as a property of my
> main class:
>
> >>> self.out = OutStreamEncoder(sys.stdout, 'utf-8')

This is fancy but not needed if you take care like above.
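
For readers wondering what such an encoding writer does: OutStreamEncoder comes from the linked post and its code is not shown in this thread, so the sketch below uses the standard library's codecs.getwriter() as a stand-in, on the assumption that it behaves comparably. It is Python 3, and an in-memory io.BytesIO replaces a real byte-oriented stdout so the result can be inspected.

```python
import codecs
import io

# Stand-in for a byte-oriented output stream.
byte_stream = io.BytesIO()

# Wrap it so that text written to the wrapper is encoded as UTF-8.
writer = codecs.getwriter("utf-8")(byte_stream)
writer.write("été\n")

result = byte_stream.getvalue()
print(result)  # b'\xc3\xa9t\xc3\xa9\n'
```

Creating the wrapper only builds a new object around the stream; nothing in this sketch changes the wrapped stream itself.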

HTH
Tino

Nov 11 '08 #2

Marc 'BlackJack' Rintsch

On Tue, 11 Nov 2008 12:18:26 -0800, Thierry wrote:

> I have written a simple wxPython application that takes the user's
> input, sends it to a web service, and gets back translations in several
> languages.
> The service itself is fully UTF-8.
>
> The "source" string is first encoded to "latin1" after a pass through
> unicodedata.normalize(), as urllib.quote() cannot work on unicode
>
> >>> srcText = unicodedata.normalize('NFKD', srcText).encode('latin1', 'ignore')

If the service uses UTF-8, why don't you just encode the data you send as
UTF-8 instead of Latin-1, which potentially throws away data because of the
'ignore' argument!? Make that ``src_text = srcText.encode('utf-8')``

> >>> req = urllib2.urlopen(con)
>
> First problem: how do I determine the encoding of the response? If I
> inspect a request from Firefox, I see that the server's response header
> specifies UTF-8.
> But if I use this code:
>
> >>> ret = u''
> ... for line in req:
> ...     ret = ret + string.replace(line.strip(), '\n', chr(10))
>
> I end up with a UnicodeDecodeError.

Because `line` contains bytes and `ret` is a `unicode` object. If you
add a `unicode` object and a `str` object, Python tries to convert the
`str` to `unicode` using the default (ASCII) encoding. And this fails
if there are byte values > 127. *You* have to decode `line` from a bunch
of bytes to a bunch of (unicode) characters before you concatenate the
strings.
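
The failure and the fix can be reproduced in a few lines. This is a Python 3 sketch (an assumption; Python 3 refuses to mix bytes and str at all, so the implicit ASCII decode that Python 2 attempts is written out explicitly here). The sample bytes are invented.

```python
# UTF-8 bytes for "été": every "é" is two bytes, both above 127.
raw_line = b"\xc3\xa9t\xc3\xa9"

# What Python 2 tries behind the scenes when adding str to unicode.
try:
    raw_line.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# The explicit version: decode with the right codec, then concatenate.
ret = ""
ret = ret + raw_line.decode("utf-8")

print(ascii_failed)
print(ret)
```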

BTW: ``line.strip()`` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore. And functions in the
`string` module that are also implemented as methods on `str` or `unicode`
are deprecated.

Ciao,
Marc 'BlackJack' Rintsch

Nov 12 '08 #3

Thierry

Thank you to both of you (Marc and Tino).

I feel a bit stupid right now, because as both of you said, encoding
my source string to UTF-8 does not produce an exception when I pass it
to urllib.quote(), and is what it should be.
I was certain that this had raised an error earlier, and did not try it
again.
That's the result of two days of making random changes and hoping it works. I
know, I should have thought it through first. My bad...

The same goes for my handling of the iteration over the request
result.
I now have a

>>> line = line.encode('utf-8')

and no errors (as long as I don't try to print this to stdout, which I
understand).
So I'm now really getting back a unicode string that I can handle as
such.

I really am confused about what I was trying to do...
I cannot understand what I did that caused those errors, because the
state the script is in now corresponds to what I had in mind originally.

> BTW: ``line.strip()`` removes all whitespace at both ends *including
> newlines*, so there are no '\n' to replace anymore.

Not exactly...
It's that I receive a string with 2 literal characters in it: "\" and
"n".
What I (want to) do here is replace those 2 characters with 1
chr(10).

> And functions in the
> `string` module that are also implemented as methods on `str` or `unicode`
> are deprecated.

I had actually read that, but had not modified my code.
Thanks for pointing it out.

Anyway, thanks again to both of you.
I'm quite happy to see it working the way I intended.

Nov 12 '08 #4

Steve Holden

Thierry wrote:

> > BTW: ``line.strip()`` removes all whitespace at both ends *including
> > newlines*, so there are no '\n' to replace anymore.
>
> Not exactly...
> It's that I receive a string, with 2 literal characters in it: "\" and
> "n".
> What I (want to) do here is that I replace those 2 characters with 1
> chr(10).

In that case you would need the following code:

    ret = u''
    for line in req:
        ret = ret + string.replace(line.strip(), '\\n', '\n')

Otherwise you just replace chr(10)'s with chr(10)'s, which won't help you.

Are you sure that Python wasn't just printing out "\n" because you'd
asked it to show you the repr() of a string containing newlines?
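
The difference between the two cases is easy to check by looking at character codes. A small Python 3 sketch (an assumption; the thread uses Python 2, and the sample strings are invented):

```python
# Four characters: a, backslash, n, b.
literal = "a\\nb"

# Replace the two-character sequence backslash + n with one real newline.
fixed = literal.replace("\\n", "\n")

codes_before = [ord(c) for c in literal]
codes_after = [ord(c) for c in fixed]
print(codes_before)  # [97, 92, 110, 98]
print(codes_after)   # [97, 10, 98]

# Replacing '\n' with chr(10) changes nothing: they are the same character.
noop = "a\nb".replace("\n", chr(10))

# repr() shows a real newline as backslash + n, which is why the two
# cases can look identical in an interactive session.
shown = repr(fixed)
print(shown)
```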

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Nov 12 '08 #5

Thierry

> Are you sure that Python wasn't just printing out "\n" because you'd
> asked it to show you the repr() of a string containing newlines?

Yes, I am sure. Because I dumped the ord() values to check them.
But again, I'm stumped on how complicated I have made this.
I should not try to code anymore at 2am.

Nov 12 '08 #6
