UTF-8 and stdin/stdout?

Hi,

I have problems getting my Python code to work with UTF-8 encoding
when reading from stdin / writing to stdout.

Say I have a file, utf8_input, that contains a single character, é,
coded as UTF-8:

$ hexdump -C utf8_input
00000000 c3 a9
00000002

If I read this file by opening it in this Python script:

$ cat utf8_from_file.py
import codecs
file = codecs.open('utf8_input', encoding='utf-8')
data = file.read()
print "length of data =", len(data)

everything goes well:

$ python utf8_from_file.py
length of data = 1

utf8_input contains one character encoded as two bytes, so
UTF-8 decoding is working here.

Now, I would like to do the same with standard input. Of course, this:

$ cat utf8_from_stdin.py
import sys
data = sys.stdin.read()
print "length of data =", len(data)

does not work:

$ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
length of data = 2

Here, the contents of utf8_input are not interpreted as UTF-8, so
Python treats the two bytes as two separate characters.

The question, then:
How could one get utf8_from_stdin.py to work properly with UTF-8?
(And same question for stdout.)

I googled around and found some rather complex material (see, for example,
http://blog.ianbicking.org/illusive-...encoding.html), but even
that didn't work: I still get "length of data = 2" even after
successfully calling sys.setdefaultencoding('utf-8').

-- dave
Jun 27 '08 #1
da*********@hotmail.com writes:
> [...]
> Now, I would like to do the same with standard input. Of course, this:
>
> $ cat utf8_from_stdin.py
> import sys
> data = sys.stdin.read()
> print "length of data =", len(data)

Shouldn't you do data = data.decode('utf8') ?

> does not work:
>
> $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
> length of data = 2

--
Arnaud
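
A minimal sketch of this suggestion, written for Python 3 rather than the Python 2 used in the thread (an assumption made for illustration). In Python 3 the raw bytes would come from `sys.stdin.buffer`; here an in-memory `io.BytesIO` stands in for it so the snippet runs without a pipe. The helper name `read_and_decode` is hypothetical.

```python
import io


def read_and_decode(byte_stream, encoding="utf-8"):
    """Read every byte from a binary stream, then decode the result to text."""
    raw = byte_stream.read()
    return raw, raw.decode(encoding)


# Stand-in for sys.stdin.buffer when the script is run with "< utf8_input".
fake_stdin = io.BytesIO(b"\xc3\xa9")
raw, text = read_and_decode(fake_stdin)

print("length of raw bytes =", len(raw))
print("length of decoded text =", len(text))
```

In a real Python 3 script the call would be `read_and_decode(sys.stdin.buffer)`; under Python 2, `sys.stdin.read().decode('utf8')` as suggested above does the same job.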

Jun 27 '08 #2
On May 28, 11:08 am, dave_140...@hotmail.com wrote:
> Say I have a file, utf8_input, that contains a single character, é,
> coded as UTF-8:
>
> $ hexdump -C utf8_input
> 00000000 c3 a9
> 00000002
> [...]
weird thing is 'c3 a9' is é on my side... and copy/pasting the é
gives me 'e9' with the first script giving a result of zero and second
script gives me 1
Jun 27 '08 #3
> Shouldn't you do data = data.decode('utf8') ?

Yes, that's it! Thanks.

-- dave
Jun 27 '08 #4
Chris wrote:
> On May 28, 11:08 am, dave_140...@hotmail.com wrote:
>> Say I have a file, utf8_input, that contains a single character, é,
>> coded as UTF-8:
>>
>> $ hexdump -C utf8_input
>> 00000000 c3 a9
>> 00000002
>> [...]
> weird thing is 'c3 a9' is é on my side... and copy/pasting the é
> gives me 'e9' with the first script giving a result of zero and second
> script gives me 1

Don't worry, those can be equivalent. The point is that some characters
can be represented in more than one way: as a single precomposed code point
(e with accent) or as a base letter followed by a combining accent.

Looking at http://unicode.org/charts I see that the letter above should have
code point 0xe9 (precomposed character) or 0x65 (e) followed by 0x301
(combining acute accent).
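
The two spellings can be compared directly. This is a sketch in Python 3 (an assumption; the thread itself uses Python 2) using the standard `unicodedata` module, where U+0065 is the letter e and U+0301 is the combining acute accent:

```python
import unicodedata

composed = "\u00e9"      # single precomposed code point
decomposed = "e\u0301"   # base letter followed by combining acute accent

# The strings render alike but differ in length and do not compare equal.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalizing to a common form (NFC or NFD) makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# Their UTF-8 encodings also differ.
print(composed.encode("utf-8"), decomposed.encode("utf-8"))
```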

0xe9 = 1110 1001 (codepoint)
0xc3 0xa9 = 1100 0011 1010 1001 (UTF-8)
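
The bit patterns above follow from the two-byte UTF-8 layout `110xxxxx 10xxxxxx`. As a rough sketch in Python 3 (the function name `encode_two_byte_utf8` is hypothetical, and only the two-byte range is handled):

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: upper 5 bits
    continuation = 0b10000000 | (code_point & 0x3F)  # 10xxxxxx: lower 6 bits
    return bytes([lead, continuation])


encoded = encode_two_byte_utf8(0xE9)
print(" ".join(format(b, "08b") for b in encoded))
print(encoded.hex())
```

In practice `"\u00e9".encode("utf-8")` gives the same two bytes; the manual version is only there to show where c3 and a9 come from.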

Anyhow, further looking at this shows that your editor simply doesn't
interpret the two bytes as UTF-8 but as Latin-1 or a similar encoding, where
0xc3 is the capital A with tilde (Ã) and 0xa9 is the copyright sign (©).
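
The misreading is easy to reproduce. This is a small sketch in Python 3 (an assumption; the thread uses Python 2) that decodes the same two bytes both ways:

```python
utf8_bytes = b"\xc3\xa9"

as_utf8 = utf8_bytes.decode("utf-8")      # intended reading: one character
as_latin1 = utf8_bytes.decode("latin-1")  # misreading: one character per byte

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))

# In Latin-1 the accented e is the single byte e9, which matches the
# 'e9' reported after copy/pasting the character.
print("\u00e9".encode("latin-1").hex())
```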

Uli

--
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

Jun 27 '08 #5
> $ cat utf8_from_stdin.py
> import sys
> data = sys.stdin.read()
> print "length of data =", len(data)

sys.stdin is a byte stream in Python 2, not a character stream.
To make it a character stream, do

import codecs
sys.stdin = codecs.getreader("utf-8")(sys.stdin)

HTH,
Martin
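
A hedged sketch of the wrapper approach, written so that it runs under Python 3 with in-memory `io.BytesIO` objects standing in for the byte-oriented stdin and stdout (both stand-ins are assumptions for illustration). The `codecs.getwriter` half is the symmetric counterpart for the stdout part of the question; it is not spelled out in the reply above.

```python
import codecs
import io

# Stand-ins for the byte-oriented stdin and stdout of a real process.
byte_input = io.BytesIO(b"\xc3\xa9")
byte_output = io.BytesIO()

reader = codecs.getreader("utf-8")(byte_input)   # decodes while reading
writer = codecs.getwriter("utf-8")(byte_output)  # encodes while writing

data = reader.read()
print("length of data =", len(data))

writer.write(data)
print("bytes written =", byte_output.getvalue().hex())
```

Under Python 2 the equivalent for output would be `sys.stdout = codecs.getwriter("utf-8")(sys.stdout)`, mirroring the stdin line above.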
Jun 27 '08 #6
