Greetings,
I'm trying to read (Japanese) characters from a file. While doing so,
I find that f.read(1) sometimes returns a string of length 2. Is this to be
expected, or is something wrong?
Basically, this is what I'm doing:
import codecs
f = codecs.open("ident.in", 'rb', 'Shift-JIS')  ## Japanese codecs installed
c = f.read(1)
while c:
    if len(c) == 1:
        print hex(ord(c)),
    else:
        # read(1) returned more than one character; print them grouped in braces
        print "{",
        for x in c: print hex(ord(x)),
        print "}",
    c = f.read(1)
This is my input (file is also attached):
$ od -tx1 ident.in
0000000 8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
0000013
This is what I'm getting:
$ python ident.py ## Python 2.3.4 on Windows
0x5408 0x8a08 0x6642 0x9593 { 0x3b 0xd } 0xa
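Python seems to think there are 6 chars on the stream, while there are
actually 7: the ';' (0x3b) and the carriage return (0xd) come back together
from a single f.read(1) call.
My naive assumption was that f.read(1) always returns a string of length 1
(or zero, at end of file).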
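To make the expectation concrete, here is a minimal sketch that feeds the same 11 bytes to a decoder one byte at a time and collects whatever characters come out. It assumes a current Python 3 interpreter with the built-in shift_jis codec, not the Python 2.3.4 plus JapaneseCodecs setup described above; the helper name decode_incrementally is made up for illustration.

```python
import codecs

# The 11 bytes shown in the od dump of ident.in
RAW = bytes.fromhex("8d878c768e9e8ad43b0d0a")

def decode_incrementally(raw, encoding="shift_jis"):
    """Feed raw to an incremental decoder byte by byte (hypothetical helper).

    A lead byte of a two-byte sequence yields an empty string; the
    character appears once its trail byte has been fed.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    chars = []
    for i in range(len(raw)):
        piece = decoder.decode(raw[i:i + 1])
        chars.extend(piece)
    # Flush; raises an error if the input ended in the middle of a character
    chars.extend(decoder.decode(b"", final=True))
    return chars

chars = decode_incrementally(RAW)
print(" ".join(hex(ord(ch)) for ch in chars))
print(len(chars), "char(s)")
```

Under those assumptions the stream decodes to 7 separate characters, each of length 1, which is what I expected from f.read(1) as well.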
Remark:
The input is believed to be "SJIS", but I haven't found a Python codec by
that name, so I'm using Shift-JIS. Of course, this could be the problem.
Note that when I feed Java the same input using SJIS, the "correct" chars
are printed:
c=21512 c=35336 c=26178 c=38291 c=59 c=13 c=10 : 7 char(s)
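On the codec-name question, a small sketch of how to check which codec a given name resolves to. It assumes a recent CPython, which ships its own CJK codecs; whether the JapaneseCodecs 1.4.10 package under Python 2.3.4 registers the same aliases is not something I have verified. The function name canonical_codec_name is made up for illustration.

```python
import codecs

def canonical_codec_name(alias):
    """Return the name of the codec that an encoding alias resolves to.

    codecs.lookup() raises LookupError if no codec is registered
    under the given name.
    """
    return codecs.lookup(alias).name

for alias in ("sjis", "shift-jis", "Shift-JIS"):
    print(alias, "->", canonical_codec_name(alias))
```

If all three names resolve to the same codec, the choice between "SJIS" and "Shift-JIS" is unlikely to explain the difference from the Java output.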
References:
I downloaded Japanese codecs from here (version: 1.4.10)
http://www.asahi-net.or.jp/~rd6t-kjym/python/
Thanks for any hints,
Wolfgang.