read(1) returns string of length 2

Greetings,

I'm trying to read (Japanese) chars from a file. While doing so,
I find that a char of length 2 is returned. Is this to be
expected, or is something wrong?

Basically, this is what I'm doing:

import codecs
f = codecs.open("ident.in",'rb','Shift-JIS') ## Japanese codecs installed

c = f.read(1)
while c:
    if len(c)==1:
        print hex(ord(c)),
    else:
        print "{",
        for x in c: print hex(ord(x)),
        print "}",
    c = f.read(1)

This is my input (file is also attached):

$ od -tx1 ident.in
0000000 8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
0000013

This is what I'm getting:

$ python ident.py ## python 2.3.4 on Windows
0x5408 0x8a08 0x6642 0x9593 { 0x3b 0xd } 0xa

"Python" believes that there are 6 chars on the stream while there are
actually 7 chars.

My naive assumption was that f.read(1) always returns a char of length 1 (or
zero).

Remark:
The input is believed to be "SJIS", but I haven't found a Python codec for
this, so I'm using Shift-JIS instead. Of course this could be the problem.
Note that when feeding Java my input using SJIS, the "correct" chars are
spit out:

c=21512 c=35336 c=26178 c=38291 c=59 c=13 c=10 : 7 char(s)

References:
I downloaded Japanese codecs from here (version: 1.4.10)
http://www.asahi-net.or.jp/~rd6t-kjym/python/

Thanks for any hints,
Wolfgang.
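
For comparison, the expected result can be reproduced with a minimal sketch in Python 3 (an assumption; the thread itself uses Python 2.3 and 2.4), relying on the `shift_jis` codec that ships with CPython. It decodes the eleven bytes from the `od` dump in one go and prints the code points in the same form as the Java output. The variable names are illustrative only.

```python
# Python 3 sketch; the thread itself uses Python 2.3/2.4.
# The bytes are copied from the `od -tx1 ident.in` dump in the post.
raw = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")

# Decode the whole buffer at once with the bundled Shift-JIS codec.
text = raw.decode("shift_jis")

# One integer code point per decoded character.
code_points = [ord(ch) for ch in text]

print(len(raw), "bytes ->", len(text), "chars")
print(" ".join("c=%d" % cp for cp in code_points))
```

Eleven bytes become seven characters because the first four characters take two bytes each, while `;`, CR and LF take one byte each.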

Jul 18 '05 #1

wolfgang> I'm trying to read (japanese) chars from a file. While doing
wolfgang> so I encounter that a char with length 2 is returned. Is this
wolfgang> to be expected or is there something wrong?

I believe it's to be expected. You opened the file with codecs.open(), so
your basic unit of operation will be a Unicode character, not a byte.

wolfgang> My naive assumption was that f.read(1) returns always a char
wolfgang> of length 1 (or zero).

If you simply used the builtin open() to open the file that would be true.

Skip
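
The byte-versus-character distinction can be illustrated with a small Python 3 sketch (an assumption; in Python 2 the builtin open() returns byte strings, whereas in Python 3 the closest equivalent is binary mode). The scratch file is a hypothetical stand-in for ident.in.

```python
import os
import tempfile

raw = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")

# Write the sample bytes to a scratch file (stand-in for ident.in).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(raw)

try:
    # Binary mode: read(1) returns exactly one byte.
    with open(path, "rb") as f:
        first_byte = f.read(1)

    # Text mode with a decoder: read(1) returns one decoded character,
    # which here consumes two bytes of the file.
    # newline="" keeps "\r\n" untranslated.
    with open(path, "r", encoding="shift_jis", newline="") as f:
        first_char = f.read(1)
finally:
    os.remove(path)

print(repr(first_byte), len(first_byte))
print(repr(first_char), len(first_char))
```

Both reads return an object of length 1, but the unit differs: one byte in the first case, one Unicode character in the second.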

Jul 18 '05 #2
Hey Skip,

Skip> your basic unit of operation will be a Unicode character

That's exactly the point. What I'm expecting to be returned is
a unicode string of length 1, i.e. something I'm calling a
unicode character.

Note that I do not count the number of bytes at all.

Btw, you can see that the first unicode string returned
by f.read(1) is

0x5408 (21512)

The length of this unicode string is 1, i.e. we got a char (but
we need 2 bytes to represent it).

Actually, everything is fine until the codecs reader is about
to read '3b'. Instead of delivering this as next unicode char,
I'm getting '3b' and '0d' as string of length 2.

Anyway, my question can also be written like this:

f = codecs.open(...)
c = f.read(1)
if c:
    assert len(c)==1

I was thinking that this assertion should hold in
general.

Cheers,
Wolfgang.
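
One way to see what a stream reader has to do to honour that assertion is to feed a decoder one byte at a time. The following is a minimal Python 3 sketch (an assumption; it uses the standard `codecs.getincrementaldecoder` API, not the JapaneseCodecs 1.4.10 package from the thread). An incomplete lead byte produces an empty string and is buffered until the trail byte arrives, so no piece is ever longer than one character.

```python
import codecs

raw = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")
decoder = codecs.getincrementaldecoder("shift_jis")()

pieces = []
for i in range(len(raw)):
    # Feed a single byte; a lone lead byte yields "" and stays buffered.
    pieces.append(decoder.decode(raw[i:i + 1]))
# Flush the decoder; this returns "" when nothing is pending.
pieces.append(decoder.decode(b"", final=True))

# Keep only the non-empty results, i.e. the completed characters.
chars = [p for p in pieces if p]

print([len(p) for p in pieces])
print([hex(ord(ch)) for ch in chars])
```

This does not show where the 2.3-era behaviour came from; it only illustrates the buffering a correct reader needs for mixed one-byte and two-byte sequences.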

Jul 18 '05 #3
wolfgang haefelinger wrote:
Actually, everything is fine until the codecs reader is about
to read '3b'. Instead of delivering this as next unicode char,
I'm getting '3b' and '0d' as string of length 2.


I tried this out with Python 2.3 and 2.4 and noticed that they
handle input streams differently.
With 2.4 I get the same result as Java:

0x5408 0x8a08 0x6642 0x9593 0x3b 0xd 0xa

(There are no {} marks.)

This makes me wonder where the difference comes from.
Is this a bug in 2.3 or a new feature in 2.4?

-- george
Jul 18 '05 #4
On Wed, 24 Nov 2004 12:03:41 GMT, "wolfgang haefelinger" <wh****@web.de> wrote:
wolfgang> My naive assumption was that f.read(1) returns always a char of
wolfgang> length 1 (or zero).

On my 2.4b1 it does, see below.


I added a print line and dropped the ending commas on your print chunks,
but otherwise didn't (I think ;-) change your code:

Python 2.4b1 (#56, Nov 3 2004, 01:47:27)
[GCC 3.2.3 (mingw special 20030504-1)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> f = codecs.open("ident.in",'rb','Shift-JIS') ## Japanese codecs installed
>>> c = f.read(1)
>>> while c:
...     print repr(c), len(c), '=>',
...     if len(c)==1:
...         print hex(ord(c))
...     else:
...         print "{",
...         for x in c: print hex(ord(x)),
...         print "}"
...     c = f.read(1)
...
u'\u5408' 1 => 0x5408
u'\u8a08' 1 => 0x8a08
u'\u6642' 1 => 0x6642
u'\u9593' 1 => 0x9593
u';' 1 => 0x3b
u'\r' 1 => 0xd
u'\n' 1 => 0xa

I reproduced your binary file:

>>> for c in open('ident.in','rb').read(): print ('%02x'% ord(c)),
...
8d 87 8c 76 8e 9e 8a d4 3b 0d 0a

What version/platform are you using? Perhaps you can upgrade?

Regards,
Bengt Richter
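
The reproduction step can be sketched in Python 3 as follows (an assumption; the session above is Python 2.4b1, and the scratch file is a hypothetical stand-in for ident.in). It writes the eleven bytes, reads them back in binary mode, and prints the same hex dump.

```python
import os
import tempfile

raw = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")

# Recreate the input file from the hex dump (stand-in for ident.in).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(raw)

try:
    # Read the raw bytes back without any decoding.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

# Iterating over bytes yields ints in Python 3, so no ord() is needed.
dump = " ".join("%02x" % b for b in data)
print(dump)
```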
Jul 18 '05 #5
Hi,

It works fine for me with 2.4c1! I don't even need to install the
Japanese codecs now, as they are already included. Shame that this
is not mentioned.

I believe it's a bug, but perhaps in the installed
Japanese Codecs.

Thanks to all who provided feedback,
Wolfgang.

Jul 18 '05 #6
