read(1) returns string of length 2

wolfgang haefelinger

Greetings,

I'm trying to read (Japanese) chars from a file. While doing so,
I find that a string of length 2 is returned. Is this to be
expected, or is there something wrong?

Basically, this is what I'm doing:

import codecs
f = codecs.open("ident.in", 'rb', 'Shift-JIS')  ## Japanese codecs installed

c = f.read(1)
while c:
    if len(c)==1:
        print hex(ord(c)),
    else:
        print "{",
        for x in c: print hex(ord(x)),
        print "}",
    c = f.read(1)

This is my input (file is also attached):

$ od -tx1 ident.in
0000000 8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
0000013

This is what I'm getting:

$ python ident.py ## python 2.3.4 on Windows
0x5408 0x8a08 0x6642 0x9593 { 0x3b 0xd } 0xa

"Python" believes that there are 6 chars on the stream while there are
actually 7 chars.

My naive assumption was that f.read(1) returns always a char of length 1 (or
zero).

Remark:
The input is believed to be "SJIS", but I haven't found a Python codec with
that name, so I'm using Shift-JIS. Of course this could be the problem. Note
that when I feed Java my input using SJIS, the "correct" chars are spit out:

c=21512 c=35336 c=26178 c=38291 c=59 c=13 c=10 : 7 char(s)
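
On the codec name: in a current Python 3 (an assumption; the thread itself uses Python 2.3 with an add-on codec package), the standard library treats "SJIS" as an alias of the shift_jis codec, which can be checked with codecs.lookup(). A minimal sketch, where the names list is only illustrative:

```python
import codecs

# Spellings taken from the thread; lookup ignores case and treats
# hyphens like underscores.
names = ["Shift-JIS", "SJIS", "shift_jis"]

# codecs.lookup() returns a CodecInfo whose .name is the canonical codec name.
canonical = {name: codecs.lookup(name).name for name in names}

for name, codec_name in canonical.items():
    print(name, "->", codec_name)
```

If all spellings resolve to the same codec, the choice between "SJIS" and "Shift-JIS" is unlikely to explain the length-2 result on such an interpreter. This says nothing about how the 2.3-era add-on package named its codecs.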

References:
I downloaded Japanese codecs from here (version: 1.4.10)
http://www.asahi-net.or.jp/~rd6t-kjym/python/

Thanks for any hints,
Wolfgang.

Jul 18 '05 #1


Skip Montanaro

wolfgang> I'm trying to read (japanese) chars from a file. While doing
wolfgang> so I encounter that a char with length 2 is returned. Is this
wolfgang> to be expected or is there something wrong?

I believe it's to be expected. You opened the file with codecs.open(), so
your basic unit of operation will be a Unicode character, not a byte.

wolfgang> My naive assumption was that f.read(1) returns always a char
wolfgang> of length 1 (or zero).

If you simply used the builtin open() to open the file that would be true.
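
The byte-versus-character distinction can be sketched as follows. This assumes Python 3 (not the 2.3 of the thread), where the byte-oriented case is a binary stream and the decoding case is shown with io.TextIOWrapper. The bytes are the ones from the od dump above.

```python
import io

# The 11 bytes from the od dump in the original post.
DATA = bytes.fromhex("8d878c768e9e8ad43b0d0a")

# Byte-oriented stream: read(1) hands back exactly one byte.
raw = io.BytesIO(DATA)
first_byte = raw.read(1)

# Decoding stream: read(1) hands back one character, which here
# consumes two bytes of Shift-JIS input. newline="" disables newline
# translation so "\r" and "\n" stay separate characters.
text = io.TextIOWrapper(io.BytesIO(DATA), encoding="shift_jis", newline="")
first_char = text.read(1)

print(first_byte.hex(), len(first_byte))
print(hex(ord(first_char)), len(first_char))
```

Both results have length 1, but the units differ: the first is the single byte 8d, the second is the character 0x5408 built from the bytes 8d 87.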

Skip

Jul 18 '05 #2

wolfgang haefelinger

Hey Skip,

Skip> your basic unit of operation will be a Unicode character

That's exactly the point. What I'm expecting to be returned is
a unicode string of length 1, i.e. something I'm calling a
Unicode character.

Note that I do not count the number of bytes at all.

Btw, you can see that the first unicode string returned
by f.read(1) is

0x5408 (21512)

The length of this unicode string is 1, i.e. we got one char (but
we need 2 bytes to represent it).

Actually, everything is fine until the codecs reader is about
to read '3b'. Instead of delivering this as next unicode char,
I'm getting '3b' and '0d' as string of length 2.

Anyway, my question can also be written like this:

f = codecs.open(...)
c = f.read(1)
if c:
    assert len(c)==1

I was thinking that this assertion should hold in
general.
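
The expectation can be written as a small self-checking loop. This is a sketch under assumptions: it targets Python 3 with the built-in shift_jis codec, uses codecs.getreader() over an in-memory copy of the bytes instead of codecs.open() on a file, and read_chars is a hypothetical helper name.

```python
import codecs
import io

# The 11 bytes from the od dump in the original post.
DATA = bytes.fromhex("8d878c768e9e8ad43b0d0a")


def read_chars(stream):
    """Call read(1) until exhaustion, checking each result has length 1."""
    chars = []
    c = stream.read(1)
    while c:
        assert len(c) == 1, "read(1) returned %d characters" % len(c)
        chars.append(c)
        c = stream.read(1)
    return chars


# A codecs StreamReader wrapped around a byte stream, similar in spirit
# to what codecs.open() hands back for reading.
reader = codecs.getreader("shift_jis")(io.BytesIO(DATA))
chars = read_chars(reader)
print(" ".join(hex(ord(c)) for c in chars), ":", len(chars), "char(s)")
```

On an interpreter where the assertion never fires, the loop reports seven characters, the same count as the Java run quoted in the first post.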

Cheers,
Wolfgang.


Jul 18 '05 #3

George Yoshida

wolfgang haefelinger wrote:

Actually, everything is fine until the codecs reader is about
to read '3b'. Instead of delivering this as next unicode char,
I'm getting '3b' and '0d' as string of length 2.

I tried this out with Python 2.3 and 2.4 and noticed that they
handle input streams differently.
With 2.4 I get the same result as Java:

0x5408 0x8a08 0x6642 0x9593 0x3b 0xd 0xa

(There's no {} marks.)

This makes me wonder where the difference comes from?
Is this a bug in 2.3 or a new feature in 2.4?
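
The thread does not establish where the difference comes from. The sketch below only illustrates one relevant mechanism, assuming Python 3: an incremental decoder holds back a lone Shift-JIS lead byte and emits the character once the trail byte arrives, so a reader asked for one character has to keep fetching bytes until something decodes.

```python
import codecs

# The 11 bytes from the od dump in the original post.
DATA = bytes.fromhex("8d878c768e9e8ad43b0d0a")

decoder = codecs.getincrementaldecoder("shift_jis")()

# Feed one byte at a time and record how many characters each call yields.
yields = []
for i in range(len(DATA)):
    out = decoder.decode(DATA[i:i + 1])
    yields.append(len(out))

print(yields)
```

Each two-byte character shows up as a 0 followed by a 1, and the three ASCII bytes each yield one character straight away.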

-- george

Jul 18 '05 #4

Bengt Richter

On Wed, 24 Nov 2004 12:03:41 GMT, "wolfgang haefelinger" <wh****@web.de> wrote:

wolfgang> My naive assumption was that f.read(1) returns always a char
wolfgang> of length 1 (or zero).

On my 2.4b1 it does, see below.


I added a print line and dropped the ending commas on your print chunks,
but otherwise didn't (I think ;-) change your code:

Python 2.4b1 (#56, Nov 3 2004, 01:47:27)
[GCC 3.2.3 (mingw special 20030504-1)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> f = codecs.open("ident.in", 'rb', 'Shift-JIS')  ## Japanese codecs installed
>>> c = f.read(1)
>>> while c:
...     print repr(c), len(c), '=>',
...     if len(c)==1:
...         print hex(ord(c))
...     else:
...         print "{",
...         for x in c: print hex(ord(x)),
...         print "}"
...     c = f.read(1)
...
u'\u5408' 1 => 0x5408
u'\u8a08' 1 => 0x8a08
u'\u6642' 1 => 0x6642
u'\u9593' 1 => 0x9593
u';' 1 => 0x3b
u'\r' 1 => 0xd
u'\n' 1 => 0xa

I reproduced your binary file:

>>> for c in open('ident.in','rb').read(): print ('%02x'% ord(c)),
...
8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
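
For anyone wanting to recreate the test file from the hex dump, a minimal sketch in Python 3 (an assumption; the session above is Python 2.4) could look like this. The temporary directory is only there so the example does not overwrite an existing ident.in.

```python
import os
import tempfile

# The bytes exactly as listed in the od dump.
DATA = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")

# Write the file in binary mode so no newline translation happens.
path = os.path.join(tempfile.mkdtemp(), "ident.in")
with open(path, "wb") as f:
    f.write(DATA)

# Read it back and format each byte as two hex digits.
with open(path, "rb") as f:
    dump = " ".join("%02x" % b for b in f.read())

print(dump)
```

In Python 3, iterating over a bytes object yields integers, so no ord() call is needed.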

What version/platform are you using? Perhaps you can upgrade?

Regards,
Bengt Richter

Jul 18 '05 #5

wolfgang haefelinger

Hi,

Works fine for me with 2.4c1! I don't even need to install the
Japanese codecs now, as that's already done. Shame that this is
not mentioned.

I believe it's a bug, but perhaps one in the installed
Japanese codecs.

Thanks to all who provided feedback,
Wolfgang.

Jul 18 '05 #6
