Unicode string handling problem

The following program fragment works correctly with an ASCII input
file.

But the file I actually want to process is Unicode (UTF-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.

When I run the program below I get an attribute_count of zero, which
is incorrect for the input file; it should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", character pairs in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

Any suggestions?

Richard Schulman
(For email reply, delete the 'xx' characters)
Sep 6 '06 #1
Richard Schulman wrote:
> The following program fragment works correctly with an ASCII input
> file.
>
> But the file I actually want to process is Unicode (UTF-16 encoding).
> The file must be Unicode rather than ASCII or Latin-1 because it
> contains mixed Chinese and English characters.
>
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file; it should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", character pairs in the line being read. Here's the program:
>
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = readline()

You mean in_line = in_file.readline(), I hope. Do please copy/paste
actual code, not what you think you ran.

>     attribute_count = in_line.count('",')
>     print attribute_count

Insert

    print type(in_line)
    print repr(in_line)

here [also make the appropriate changes to get the same info from the
first line], run it again, copy/paste what you get, show us what you
see.

If you're coy about that, then you'll have to find out yourself if it
has a BOM at the front, and if not whether it's little/big/endian.

> finally:
>     in_file.close()
>
> Any suggestions?

1. Read the Unicode HOWTO.
2. Read the docs on the codecs module ...

You'll need to use

in_file = codecs.open(filepath, mode, encoding="utf16???????")

It would also be a good idea to get into the habit of using unicode
constants like u'",'

HTH,
John

Sep 6 '06 #2

Richard Schulman wrote:
> But the file I actually want to process is Unicode (utf-16 encoding).
> ...
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> ...
> Any suggestions?

You're not detecting the file encoding and then
using it in the open statement. If you know this is
utf-16le or utf-16be, you need to say so in the
open. If you don't, then you should read it into
a string, go through some autodetect logic, and
then decode it with the <string>.decode(encoding)
method.

A clue: a properly formatted utf-16 or utf-32
file MUST have a BOM as the first character.
That's mandated in the unicode standard. If
it doesn't have a BOM, then try ascii and
utf-8 in that order. The first
one that succeeds is correct. If neither succeeds,
you're on your own in guessing the file encoding.
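
A rough sketch of that autodetect logic (my own illustration, not a
library routine; the BOM constants come from the codecs module):

import codecs

def detect_and_decode(raw):
    # A UTF-16 BOM up front settles it; the plain utf_16 codec
    # consumes the BOM and picks the right byte order by itself.
    if raw.startswith(codecs.BOM_UTF16_LE) or \
       raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode('utf_16')
    # No BOM: try ascii, then utf-8; the first that succeeds wins.
    for encoding in ('ascii', 'utf_8'):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    raise ValueError("can't guess the file encoding")

raw = open('in-graf1.my', 'rb').read()
text = detect_and_decode(raw)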

John Roth

Sep 6 '06 #3
Thanks for your excellent debugging suggestions, John. See below for
my follow-up:

Richard Schulman:
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file; it should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", character pairs in the line being read. Here's the program:
> ...

John Machin:
> Insert
>     print type(in_line)
>     print repr(in_line)
> here [also make the appropriate changes to get the same info from the
> first line], run it again, copy/paste what you get, show us what you
> see.
Here's the revised program, per your suggestion:

=====================================================

# This program processes a UTF-16 input file that is
# to be loaded later into a mySQL table. The input file
# is not yet ready for prime time. The purpose of this
# program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line)  # For debugging
    print repr(in_line)  # For debugging

    # The second line read is the first data row.
    in_line = in_file.readline()
    print type(in_line)  # For debugging
    print repr(in_line)  # For debugging

    # For this and subsequent rows, we must count all
    # the ", character-pairs in a given line/row.
    # This will provide an n-1 measure of the attributes
    # for a SQL insert of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

=====================================================

The output of this program, which I ran at the command line, had to
be copied by hand and abridged, but I think I have included the
relevant information:

C:\pythonapps>python graf_correction.py
<type 'str'>
'\xff\xfeI\x00N\x00S... [the beginning of a SQL INSERT statement]
....\x00U\x00E\x00S\x00\n' [the VALUES keyword at the end of the row,
followed by an end-of-line]
<type 'str'>
'\x00\n' [oh-oh! For the second row, all we're seeing
is an end-of-line character. Is that from
the first row? Wasn't the "rU" mode
supposed to handle that?]
0 [the counter value. It's hardly surprising
it's only zero, given that most of the row
never got loaded, just an eol mark]

J.M.:
> If you're coy about that, then you'll have to find out yourself if it
> has a BOM at the front, and if not whether it's little/big/endian.

The BOM is little-endian, I believe.

R.S.:
> Any suggestions?

J.M.:
> 1. Read the Unicode HOWTO.
> 2. Read the docs on the codecs module ...
>
> You'll need to use
>
> in_file = codecs.open(filepath, mode, encoding="utf16???????")

Right you are. Here is the output produced by so doing:

<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
<type 'unicode'>
u'\n'
0 [The counter value]
> It would also be a good idea to get into the habit of using unicode
> constants like u'",'

Right.

> HTH,
> John
Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows. That represents two surprises: first, I
thought that Microsoft files ended as \n\r; second, I thought that
Python mode "rU" was supposed to be the universal eol handler and
would handle the \n\r as one mark.

Richard Schulman
Sep 6 '06 #4
On 5 Sep 2006 19:50:27 -0700, "John Roth" <Jo*******@jhrothjr.com>
wrote:
> [T]he file I actually want to process is Unicode (utf-16 encoding).
> ...
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> ...

John Roth:
> You're not detecting the file encoding and then using it in the
> open statement. ... If it doesn't have a BOM, then try ascii and
> utf-8 in that order. The first one that succeeds is correct. If
> neither succeeds, you're on your own in guessing the file encoding.

Thanks for this further information. I'm now using the codec with
improved results, but am still puzzled as to how to handle the row
termination of \n\n, which is being interpreted as two rows instead of
one.
Sep 6 '06 #5
On Wed, 06 Sep 2006 03:55:18 GMT, Richard Schulman
<ra**********@verizon.net> wrote:
> ...I'm now using the codec with improved results, but am still
> puzzled as to how to handle the row termination of \n\n, which is
> being interpreted as two rows instead of one.
Of course, I could do a double read on each row and ignore the second
read, which merely fetches the final of the two u'\n' characters. But
that's not very elegant, and I'm sure there's a better way to do it
(hint, hint someone).
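
Perhaps something along these lines (just a sketch; it assumes
in_file is the codecs-opened reader and that a genuine data row is
never completely blank):

for in_line in in_file:
    if in_line.strip() == u'':
        continue                     # skip the phantom blank "rows"
    attribute_count = in_line.count(u'",')
    print attribute_count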

Richard Schulman (for email, drop the 'xx' in the reply-to)
Sep 6 '06 #6
Richard Schulman wrote:
[big snip]
> The BOM is little-endian, I believe.

Correct.

> > in_file = codecs.open(filepath, mode, encoding="utf16???????")
>
> Right you are. Here is the output produced by so doing:

You don't say which encoding you used, but I guess that you used
utf_16_le.

> <type 'unicode'>
> u'\ufeffINSERT INTO [...] VALUES\n'

Use utf_16 -- it will strip off the BOM for you.

> <type 'unicode'>
> u'\n'
> 0 [The counter value]
[snip]
> Yes, it did. Many thanks! Now I've got to figure out the best way to
> handle that \n\n at the end of each row, which the program is
> interpreting as two rows.

Well, we don't know yet exactly what you have there. We need a byte
dump of the first few bytes of your file. Get into the interactive
interpreter and do this:

    open('yourfile', 'rb').read(200)

(the 'b' is for binary, in case you are on Windows.) That will show us
exactly what's there, without *any* EOL interpretation at all.

> That represents two surprises: first, I
> thought that Microsoft files ended as \n\r;
Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
(not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
from CP/M.

Ummmm ... are you saying the file has \n\r at the end of each row?? How
did you know that if you didn't know what if any BOM it had??? Who
created the file????
> second, I thought that
> Python mode "rU" was supposed to be the universal eol handler and
> would handle the \n\r as one mark.
Nah again. It contemplates only \n, \r, and \r\n as end of line. See
the docs. Thus \n\r becomes *two* newlines when read with "rU".

Having "\n\r" at the end of each row does fit with your symptoms:

| >>> bom = u"\ufeff"
| >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
| >>> guffu = unicode(guff)
| >>> import codecs
| >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
| >>> f.write(bom+guffu)
| >>> f.close()
|
| >>> open('guff.utf16le', 'rb').read()  #### see exactly what we've got
| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'
|
| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
| u'abc\n\rdef\n\rghi'  ######### Look, Mom, no BOM!
|
| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
| u'abc\n\ndef\n\nghi'  #### U means \r -> \n
|
| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
| u'\ufeffabc\n\ndef\n\nghi'  ######### reproduces your second experience
|
| >>> open('guff.utf16le', 'rU').readlines()
| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n',
|  '\x00\n', '\x00g\x00h\x00i\x00']
| >>> f = open('guff.utf16le', 'rU')
| >>> f.readline()
| '\xff\xfea\x00b\x00c\x00\n'
| >>> f.readline()
| '\x00\n'  ######### reproduces your first experience
| >>> f.readline()
| '\x00d\x00e\x00f\x00\n'
| >>>

If that file is a one-off, you can obviously fix it by
throwing away every second line. Otherwise, if it's an ongoing
exercise, you need to talk sternly to the file's creator :-)
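
For instance (a sketch that assumes the phantom blank lines alternate
strictly with the real rows):

import codecs

f = codecs.open('guff.utf16le', 'rU', encoding='utf_16')
rows = f.readlines()[0::2]   # keep lines 0, 2, 4, ...; drop the phantoms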

HTH,
John

Sep 6 '06 #7
Many thanks for your help, John, in giving me the tools to work
successfully in Python with Unicode from here on out.

It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was

in_file = codecs.open("c:\\pythonapps\\in-graf2.my", "rU", encoding="utf-16LE")

This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.

Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my", encoding="utf-16LE")

all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.

This behavior of "rU" was not at all what I had expected from the
brief discussion of it in _Python Cookbook_, which all goes to show
how difficult it is to cook challenging dishes with sketchy recipes
alone. There is no substitute for the helpful advice of an
experienced chef.

-Richard Schulman
(remove "xx" for email reply)

Sep 7 '06 #8

Richard Schulman wrote:
> It turns out that the Unicode input files I was working with (from MS
> Word and MS Notepad) were indeed creating eol sequences of \r\n, not
> \n\n as I had originally thought.
> ...
> Once the mode string "rU" was dropped, as in
>
> in_file = codecs.open("c:\\pythonapps\\in-graf2.my", encoding="utf-16LE")
>
> all suddenly became well: no more doubled readlines, and one could see
> the \r\n termination of each line.

You are on Windows. I would *not* describe as "well" lines read in (the
default) text mode ending in u"\r\n". I would expect it to convert the
line endings to u"\n". At best, this should be documented. Perhaps
someone with some knowledge of the intended treatment of line endings
by codecs.open() in text mode could comment? The two problems are
succinctly described below:

File created in Windows Notepad and saved with "Unicode" encoding.
Results in UTF-16LE encoding, line terminator is CR LF, has BOM (LE) at
front -- as shown below.

| Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> open('notepad_uc.txt', 'rb').read()
| '\xff\xfea\x00b\x00c\x00\r\x00\n\x00d\x00e\x00f\x00\r\x00\n\x00g\x00h\x00i\x00\r\x00\n\x00'
| >>> import codecs
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16_le').readlines()
| [u'\ufeffabc\r\n', u'def\r\n', u'ghi\r\n']
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16').readlines()
| [u'abc\r\n', u'def\r\n', u'ghi\r\n']
### presence of u'\r' was *not* expected
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16_le').readlines()
| [u'\ufeffabc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16').readlines()
| [u'abc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
### 'U' flag does change the behaviour, but *not* as expected.
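
Until that is pinned down, one conservative workaround (a sketch, not
a statement of how codecs.open() is meant to behave) is to leave the
'U' flag off and strip the line endings yourself after decoding:

import codecs

f = codecs.open('notepad_uc.txt', 'r', encoding='utf_16')
for line in f:
    line = line.rstrip(u'\r\n')      # drop CR/LF however they arrive
    # ... process the cleaned-up line ...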

Cheers,
John

Sep 7 '06 #9
