472,353 Members | 1,388 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,353 software developers and data experts.

split() can help to read UTF-16 encoded file without codecs support,why?

Hi Guys,

I was processing a UTF-16 coded file with BOM and was not aware of the
codecs package at first. I wrote the following code:
===== Code 1============================
for i in open("d:\python24\lzjtest.xml", 'r').readlines():
i = i.decode("utf-16")
print i
=======================================
Output was:
Traceback (most recent call last):
File "D:\Python24\testutf-16.py", line 4, in -toplevel-
i = i.decode("utf-16")
File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position
84: truncated data

I searched google and found an article on the similar problem saying to use
split(). I had not quite caught the meaning of the article and recode as:
==== Code 2==============================
for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
i = i.decode("utf-16")
print i
=======================================
Then it worked (echo the file).

Later I get to know codecs and write the following code:

==== Code 3 =============================
import codecs
for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
print i
=======================================
It worked and echo the file.

I am wondering what is the problem with the first code and why the bug
is fixed in
the second.

Thanks in advance.

-Zhongjian
Mar 17 '06 #1
1 6664

Zhongjian Lu wrote:
Hi Guys,

I was processing a UTF-16 coded file with BOM and was not aware of the
codecs package at first. I wrote the following code:
===== Code 1============================
for i in open("d:\python24\lzjtest.xml", 'r').readlines():
i = i.decode("utf-16")
print i
=======================================
Output was:
Traceback (most recent call last):
File "D:\Python24\testutf-16.py", line 4, in -toplevel-
i = i.decode("utf-16")
File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position
84: truncated data

UTF16 is a 'two-byte encoding'. This means that '\r\n' is represented
using :

'\r\x00\n\x00'

When you use readlines to split this up it splits on byte boundaries.
This probably returns something like :

'\r', '\x00\n', '\x00'

You can see how the last bit is 'truncated' (single byte only) because
the data has been split on bytes instead of characters.

I searched google and found an article on the similar problem saying to use
split(). I had not quite caught the meaning of the article and recode as:
==== Code 2==============================
for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
i = i.decode("utf-16")
print i
=======================================
Then it worked (echo the file).

You will probably find that '\r\n' never occurs in the byte-string, so
this does it *all* in one line, but the decode succeeds.

HTH

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Later I get to know codecs and write the following code:

==== Code 3 =============================
import codecs
for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
print i
=======================================
It worked and echo the file.

I am wondering what is the problem with the first code and why the bug
is fixed in
the second.

Thanks in advance.

-Zhongjian


Mar 17 '06 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
by: Blue Ocean | last post by:
In short, it's not working right for me. In long: The program is designed to read numbers from an accumulator and speak them out loud. ...
13
by: Larry L | last post by:
Access is noted for bloating a database when you add and delete records frequently. I have always had mine set to compact on close, and that works...
6
by: Ajit | last post by:
I have a tab delimited string and i want to split it to read all the values from the string I tried multiple way but doesn't work with ASP.Net ...
2
by: Jay | last post by:
Well this is my function to split the data into its own lines Sub preSift(ByVal input As String) Dim spl() As String spl = New String() {} ...
8
by: Zephyre | last post by:
I have some UTF-8 text files written in Chinese to be read. Now the only method that I know to read text from it is to use fopen() function. Thus,...
2
by: needin4mation | last post by:
Hi, thanks for any help here: SqlCommand cmd = new SqlCommand("SELECT categories FROM catalog" conn); rdr = cmd.ExecuteReader(); String temp;...
22
by: SmokeWilliams | last post by:
Hi, I am working on a Spell checker for my richtext editor. I cannot use any open source, and must develop everything myself. I need a RegExp...
20
by: weheh | last post by:
Dear web gods: After much, much, much struggle with unicode, many an hour reading all the examples online, coding them, testing them, ripping...
10
by: teddyber | last post by:
Hello, first i'm a newbie to python (but i searched the Internet i swear). i'm looking for some way to split up a string into a list of pairs...
0
by: Hvid Hat | last post by:
Hi At first, I thought I could only solve my problem with a C# method inside my XSLT but I'm beginning to think it might be possible with XSLT...
0
by: Naresh1 | last post by:
What is WebLogic Admin Training? WebLogic Admin Training is a specialized program designed to equip individuals with the skills and knowledge...
0
jalbright99669
by: jalbright99669 | last post by:
Am having a bit of a time with URL Rewrite. I need to incorporate http to https redirect with a reverse proxy. I have the URL Rewrite rules made...
0
by: antdb | last post by:
Ⅰ. Advantage of AntDB: hyper-convergence + streaming processing engine In the overall architecture, a new "hyper-convergence" concept was...
0
by: Matthew3360 | last post by:
Hi there. I have been struggling to find out how to use a variable as my location in my header redirect function. Here is my code. ...
0
by: AndyPSV | last post by:
HOW CAN I CREATE AN AI with an .executable file that would suck all files in the folder and on my computerHOW CAN I CREATE AN AI with an .executable...
0
by: Arjunsri | last post by:
I have a Redshift database that I need to use as an import data source. I have configured the DSN connection using the server, port, database, and...
0
hi
by: WisdomUfot | last post by:
It's an interesting question you've got about how Gmail hides the HTTP referrer when a link in an email is clicked. While I don't have the specific...
0
Oralloy
by: Oralloy | last post by:
Hello Folks, I am trying to hook up a CPU which I designed using SystemC to I/O pins on an FPGA. My problem (spelled failure) is with the...
0
by: Carina712 | last post by:
Setting background colors for Excel documents can help to improve the visual appeal of the document and make it easier to read and understand....

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.