The following program fragment works correctly with an ascii input
file.
But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.
When I run the program below I get an attribute_count of zero, which
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", characters in the line being read. Here's the program:
in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()
Any suggestions?
Richard Schulman
(For email reply, delete the 'xx' characters)
Richard Schulman wrote:
The following program fragment works correctly with an ascii input
file.
But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.
When I run the program below I get an attribute_count of zero, which
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", characters in the line being read. Here's the program:
in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = readline()
You mean in_line = in_file.readline(), I hope. Do please copy/paste
actual code, not what you think you ran.
    attribute_count = in_line.count('",')
    print attribute_count
Insert
print type(in_line)
print repr(in_line)
here [also make the appropriate changes to get the same info from the
first line], run it again, copy/paste what you get, show us what you
see.
If you're coy about that, then you'll have to find out yourself if it
has a BOM at the front, and if not whether it's little/big/endian.
finally:
    in_file.close()
Any suggestions?
1. Read the Unicode HOWTO.
2. Read the docs on the codecs module ...
You'll need to use
in_file = codecs.open(filepath, mode, encoding="utf16???????")
It would also be a good idea to get into the habit of using unicode
constants like u'",'
HTH,
John
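Concretely, that advice adds up to something like the following sketch (the path is the one from the original post; utf_16, which reads the byte order from the BOM, is an assumption until the file is inspected):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf1.my", "r", encoding="utf_16")
try:
    # Skip the INSERT line; read the first data row as a unicode string.
    in_file.readline()
    in_line = in_file.readline()
    # Count with a unicode constant, per the advice above.
    attribute_count = in_line.count(u'",')
    print attribute_count
finally:
    in_file.close()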
Richard Schulman wrote:
The following program fragment works correctly with an ascii input
file.
But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.
When I run the program below I get an attribute_count of zero, which
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", characters in the line being read. Here's the program:
in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()
Any suggestions?
Richard Schulman
(For email reply, delete the 'xx' characters)
You're not detecting the file encoding and then
using it in the open statement. If you know this is
utf-16le or utf-16be, you need to say so in the
open. If you don't, then you should read it into
a string, go through some autodetect logic, and
then decode it with the <string>.decode(encoding)
method.
A clue: a properly formatted utf-16 or utf-32
file MUST have a BOM as the first character.
That's mandated in the unicode standard. If
it doesn't have a BOM, then try ascii and
utf-8 in that order. The first
one that succeeds is correct. If neither succeeds,
you're on your own in guessing the file encoding.
John Roth
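A rough sketch of that autodetect logic (the function name and shape are illustrative, not from the thread; it reads the whole file into memory, which is fine for small inputs, and ignores UTF-32 for brevity):

import codecs

def guess_encoding(path):
    raw = open(path, 'rb').read()
    # A properly formed UTF-16 file should lead with a BOM, as noted above.
    if raw.startswith(codecs.BOM_UTF16_LE):
        return 'utf_16_le'
    if raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf_16_be'
    # No BOM: try ascii, then utf-8; the first clean decode wins.
    for enc in ('ascii', 'utf_8'):
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    raise ValueError("cannot guess the encoding of %r" % path)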
Thanks for your excellent debugging suggestions, John. See below for
my follow-up:
Richard Schulman:
>The following program fragment works correctly with an ascii input file.
But the file I actually want to process is Unicode (utf-16 encoding). The file must be Unicode rather than ASCII or Latin-1 because it contains mixed Chinese and English characters.
When I run the program below I get an attribute_count of zero, which is incorrect for the input file, which should give a value of fifteen or sixteen. In other words, the count function isn't recognizing the ", characters in the line being read. Here's the program: ...
John Machin:
>Insert
print type(in_line)
print repr(in_line)
here [also make the appropriate changes to get the same info from the first line], run it again, copy/paste what you get, show us what you see.
Here's the revised program, per your suggestion:
=====================================================
# This program processes a UTF-16 input file that is
# to be loaded later into a mySQL table. The input file
# is not yet ready for prime time. The purpose of this
# program is to ready it.
in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line)   #For debugging
    print repr(in_line)   #For debugging
    # The second line read is the first data row.
    in_line = in_file.readline()
    print type(in_line)   #For debugging
    print repr(in_line)   #For debugging
    # For this and subsequent rows, we must count all
    # the ", character-pairs in a given line/row.
    # This will provide an n-1 measure of the attributes
    # for a SQL insert of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()
=====================================================
The output of this program, which I ran at the command line, had to
be copied over by hand and abridged, but I think I have included the
relevant information:
C:\pythonapps>python graf_correction.py
<type 'str'>
'\xff\xfeI\x00N\x00S... [the beginning of a SQL INSERT statement]
....\x00U\x00E\x00S\x00\n' [the VALUES keyword at the end of the row,
followed by an end-of-line]
<type 'str'>
'\x00\n' [oh-oh! For the second row, all we're seeing
is an end-of-line character. Is that from
the first row? Wasn't the "rU" mode
supposed to handle that?]
0 [the counter value. It's hardly surprising
it's only zero, given that most of the row
never got loaded, just an eol mark]
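(A side note on why the count comes back zero even when a whole line does arrive as a byte string: in UTF-16LE every ASCII character is followed by a NUL byte, so the two-byte pattern '",' never occurs contiguously in the raw bytes. A quick illustration:)

>>> u'abc",def'.encode('utf_16_le')
'a\x00b\x00c\x00"\x00,\x00d\x00e\x00f\x00'
>>> u'abc",def'.encode('utf_16_le').count('",')
0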
J.M.:
>If you're coy about that, then you'll have to find out yourself if it has a BOM at the front, and if not whether it's little/big/endian.
The BOM is little-endian, I believe.
R.S.:
>Any suggestions?
J.M.
>1. Read the Unicode HOWTO. 2. Read the docs on the codecs module ...
You'll need to use
in_file = codecs.open(filepath, mode, encoding="utf16???????")
Right you are. Here is the output produced by so doing:
<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
<type 'unicode'>
u'\n'
0 [The counter value]
>It would also be a good idea to get into the habit of using unicode constants like u'",'
Right.
>HTH, John
Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows. That represents two surprises: first, I
thought that Microsoft files ended as \n\r ; second, I thought that
Python mode "rU" was supposed to be the universal eol handler and
would handle the \n\r as one mark.
Richard Schulman
On 5 Sep 2006 19:50:27 -0700, "John Roth" <Jo*******@jhrothjr.com>
wrote:
>[T]he file I actually want to process is Unicode (utf-16 encoding). ... in_file = open("c:\\pythonapps\\in-graf1.my","rU") ...
John Roth:
>You're not detecting the file encoding and then using it in the open statement. If you know this is utf-16le or utf-16be, you need to say so in the open. If you don't, then you should read it into a string, go through some autodetect logic, and then decode it with the <string>.decode(encoding) method.
A clue: a properly formatted utf-16 or utf-32 file MUST have a BOM as the first character. That's mandated in the unicode standard. If it doesn't have a BOM, then try ascii and utf-8 in that order. The first one that succeeds is correct. If neither succeeds, you're on your own in guessing the file encoding.
Thanks for this further information. I'm now using the codec with
improved results, but am still puzzled as to how to handle the row
termination of \n\n, which is being interpreted as two rows instead of
one.
On Wed, 06 Sep 2006 03:55:18 GMT, Richard Schulman
<ra**********@verizon.net> wrote:
>...I'm now using the codec with improved results, but am still puzzled as to how to handle the row termination of \n\n, which is being interpreted as two rows instead of one.
Of course, I could do a double read on each row and ignore the second
read, which merely fetches the final of the two u'\n' characters. But
that's not very elegant, and I'm sure there's a better way to do it
(hint, hint someone).
Richard Schulman (for email, drop the 'xx' in the reply-to)
Richard Schulman wrote:
[big snip]
>
The BOM is little-endian, I believe.
Correct.
in_file = codecs.open(filepath, mode, encoding="utf16???????")
Right you are. Here is the output produced by so doing:
You don't say which encoding you used, but I guess that you used
utf_16_le.
>
<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
Use utf_16 -- it will strip off the BOM for you.
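For example (path as in the original post):

import codecs
# utf_16 reads the BOM, infers the byte order from it, and consumes it,
# so it never shows up in the first line.
in_file = codecs.open("c:\\pythonapps\\in-graf1.my", "r", encoding="utf_16")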
<type 'unicode'>
u'\n'
0 [The counter value]
[snip]
Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows.
Well we don't know yet exactly what you have there. We need a byte dump
of the first few bytes of your file. Get into the interactive
interpreter and do this:
open('yourfile', 'rb').read(200)
(the 'b' is for binary, in case you are on Windows)
That will show us exactly what's there, without *any* EOL
interpretation at all.
That represents two surprises: first, I
thought that Microsoft files ended as \n\r ;
Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
(not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
from CP/M.
Ummmm ... are you saying the file has \n\r at the end of each row?? How
did you know that if you didn't know what if any BOM it had??? Who
created the file????
second, I thought that
Python mode "rU" was supposed to be the universal eol handler and
would handle the \n\r as one mark.
Nah again. It contemplates only \n, \r, and \r\n as end of line. See
the docs. Thus \n\r becomes *two* newlines when read with "rU".
Having "\n\r" at the end of each row does fit with your symptoms:
>>> bom = u"\ufeff"
>>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
>>> guffu = unicode(guff)
>>> import codecs
>>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
>>> f.write(bom+guffu)
>>> f.close()
>>> open('guff.utf16le', 'rb').read() #### see exactly what we've got
'\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'
>>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!
>>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
u'abc\n\ndef\n\nghi' #### U turns \r into \n
>>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
u'\ufeffabc\n\ndef\n\nghi' ######### reproduces your second experience
>>> open('guff.utf16le', 'rU').readlines()
['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n', '\x00\n', '\x00g\x00h\x00i\x00']
>>> f = open('guff.utf16le', 'rU')
>>> f.readline()
'\xff\xfea\x00b\x00c\x00\n'
>>> f.readline()
'\x00\n' ######### reproduces your first experience
>>> f.readline()
'\x00d\x00e\x00f\x00\n'
>>>
If that file is a one-off, you can obviously fix it by
throwing away every second line. Otherwise, if it's an ongoing
exercise, you need to talk sternly to the file's creator :-)
HTH,
John
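If the doubled lines do keep coming, one way to drop them without the double-read is to filter as you go; a sketch, assuming the junk lines are exactly u'\n' as in the session above:

import codecs

in_file = codecs.open('guff.utf16le', 'rU', encoding='utf_16')
try:
    for in_line in in_file:
        if in_line == u'\n':
            continue                  # skip the phantom half of each \n\r pair
        print in_line.count(u'",')    # process the real row here
finally:
    in_file.close()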
Many thanks for your help, John, in giving me the tools to work
successfully in Python with Unicode from here on out.
It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was
#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")
This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.
Once the mode string "rU" was dropped, as in
in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")
all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.
This behavior of "rU" was not at all what I had expected from the
brief discussion of it in _Python Cookbook_. Which all goes to point
out how difficult it is to cook challenging dishes with sketchy
recipes alone. There is no substitute for the helpful advice of an
experienced chef.
-Richard Schulman
(remove "xx" for email reply)
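Since the lines now arrive with their u'\r\n' intact, one further touch worth sketching is to strip the terminator before any per-row processing (utf_16 rather than utf-16LE so the BOM is consumed, per John's note earlier; untested against the real file):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf2.my", "r", encoding="utf_16")
try:
    in_file.readline()                            # skip the INSERT line
    in_line = in_file.readline().rstrip(u'\r\n')  # first data row, no terminator
    print in_line.count(u'",')
finally:
    in_file.close()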
Richard Schulman wrote:
It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was
#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")
This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.
Once the mode string "rU" was dropped, as in
in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")
all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.
You are on Windows. I would *not* describe as "well" lines read in (the
default) text mode ending in u"\r\n". I would expect it to convert the
line endings to u"\n". At the very least, this behaviour should be
documented. Perhaps someone with some knowledge of the intended
treatment of line endings by codecs.open() in text mode could comment?
The two problems are succinctly described below:
File created in Windows Notepad and saved with "Unicode" encoding.
Results in UTF-16LE encoding, line terminator is CR LF, has BOM (LE) at
front -- as shown below.
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open('notepad_uc.txt', 'rb').read()
'\xff\xfea\x00b\x00c\x00\r\x00\n\x00d\x00e\x00f\x00\r\x00\n\x00g\x00h\x00i\x00\r\x00\n\x00'
>>> import codecs
>>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16_le').readlines()
[u'\ufeffabc\r\n', u'def\r\n', u'ghi\r\n']
>>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16').readlines()
[u'abc\r\n', u'def\r\n', u'ghi\r\n']
### presence of u'\r' was *not* expected
>>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16_le').readlines()
[u'\ufeffabc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
>>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16').readlines()
[u'abc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
### 'U' flag does change the behaviour, but *not* as expected.
Cheers,
John
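Given that neither mode does the expected translation, a practical workaround is to normalise line endings yourself after decoding; a sketch reusing the test file from the session above (unicode.splitlines() treats \r\n, \r and \n alike):

import codecs

text = codecs.open('notepad_uc.txt', 'r', encoding='utf_16').read()
for line in text.splitlines():
    print repr(line)    # u'abc', u'def', u'ghi' -- terminators gone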