
UTF-8 and stdin/stdout?

Hi,

I have problems getting my Python code to work with UTF-8 encoding
when reading from stdin / writing to stdout.

Say I have a file, utf8_input, that contains a single character, é,
coded as UTF-8:

$ hexdump -C utf8_input
00000000 c3 a9
00000002

If I read this file by opening it in this Python script:

$ cat utf8_from_file.py
import codecs
file = codecs.open('utf8_input', encoding='utf-8')
data = file.read()
print "length of data =", len(data)

everything goes well:

$ python utf8_from_file.py
length of data = 1

The contents of utf8_input is one character coded as two bytes, so
UTF-8 decoding is working here.

Now, I would like to do the same with standard input. Of course, this:

$ cat utf8_from_stdin.py
import sys
data = sys.stdin.read()
print "length of data =", len(data)

does not work:

$ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
length of data = 2

Here, the contents of utf8_input is not interpreted as UTF-8, so
Python believes there are two separate characters.

The question, then:
How could one get utf8_from_stdin.py to work properly with UTF-8?
(And same question for stdout.)

I googled around and found rather complex stuff (see, for example,
http://blog.ianbicking.org/illusive-...encoding.html), but even
that didn't work: I still get "length of data = 2" even after
successfully calling sys.setdefaultencoding('utf-8').

-- dave
Jun 27 '08 #1
5 Replies


da*********@hotmail.com writes:
> [...]
> $ cat utf8_from_stdin.py
> import sys
> data = sys.stdin.read()
> print "length of data =", len(data)
>
> does not work:
>
> $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
> length of data = 2

Shouldn't you do data = data.decode('utf8')?
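
Something like this (just a sketch, untested here) should then report length 1:

# utf8_from_stdin.py, with the decode added
import sys

data = sys.stdin.read()              # read the raw bytes from stdin
data = data.decode('utf-8')          # decode the bytes into a unicode string
print "length of data =", len(data)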
--
Arnaud

Jun 27 '08 #2

On May 28, 11:08 am, dave_140...@hotmail.com wrote:
> Say I have a file, utf8_input, that contains a single character, é,
> coded as UTF-8:
>
> $ hexdump -C utf8_input
> 00000000 c3 a9
> 00000002
> [...]

Weird thing is, 'c3 a9' is é on my side... and copy/pasting the é
gives me 'e9', with the first script giving a result of zero and the
second script giving me 1.
Jun 27 '08 #3

> Shouldn't you do data = data.decode('utf8')?

Yes, that's it! Thanks.

-- dave
Jun 27 '08 #4

Chris wrote:
> On May 28, 11:08 am, dave_140...@hotmail.com wrote:
>> Say I have a file, utf8_input, that contains a single character, é,
>> coded as UTF-8:
>>
>> $ hexdump -C utf8_input
>> 00000000 c3 a9
>> 00000002
> [...]
> weird thing is 'c3 a9' is é on my side... and copy/pasting the é
> gives me 'e9' with the first script giving a result of zero and second
> script gives me 1
Don't worry, the two may well be equivalent. The point is that some
characters exist in more than one form: a precomposed form (é as a single
code point) and a decomposed form (e followed by a combining accent).

Looking at http://unicode.org/charts I see that the letter above should have
codepoint 0xe9 (precomposed character) or 0x65 (e) plus 0x301 (combining
accent).

0xe9      = 1110 1001           (codepoint)
0xc3 0xa9 = 1100 0011 1010 1001 (UTF-8)

Anyhow, looking further at this shows that your editor simply doesn't
interpret the two bytes as UTF-8 but as Latin-1 or a similar encoding, where
they represent the capital A with tilde (Ã) and the copyright sign (©).
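
To see both points concretely, here is a minimal Python 2 sketch using the
same two bytes as utf8_input:

# -*- coding: utf-8 -*-
import unicodedata

raw = '\xc3\xa9'                       # the two bytes in utf8_input

# The same bytes decode to different text under different codecs:
print repr(raw.decode('utf-8'))        # u'\xe9'      -> one character, é
print repr(raw.decode('latin-1'))      # u'\xc3\xa9'  -> two characters, Ã©

# Precomposed vs. decomposed forms of é:
nfc = u'\xe9'                          # U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = u'e\u0301'                       # U+0065 'e' + U+0301 COMBINING ACUTE ACCENT
print nfc == nfd                                 # False: different code points
print unicodedata.normalize('NFC', nfd) == nfc   # True: canonically equivalent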

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

Jun 27 '08 #5

> $ cat utf8_from_stdin.py
> import sys
> data = sys.stdin.read()
> print "length of data =", len(data)

sys.stdin is a byte stream in Python 2, not a character stream.
To make it a character stream, do

import codecs
sys.stdin = codecs.getreader("utf-8")(sys.stdin)
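
For the stdout side of the question, the same module provides
codecs.getwriter. A minimal sketch combining both wrappers (the script name
is just for illustration):

# utf8_stdio.py (hypothetical): wrap both standard streams
import codecs
import sys

# Wrap the byte streams so reads return unicode and writes accept unicode.
sys.stdin = codecs.getreader("utf-8")(sys.stdin)
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)

data = sys.stdin.read()
print "length of data =", len(data)   # 1 for the single character in utf8_input
sys.stdout.write(data)                # encoded back to UTF-8 on the way out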

HTH,
Martin
Jun 27 '08 #6
