Bytes IT Community

Using codecs.EncodedFile() with Python 2.5

I used this function successfully with Python 2.4 to alter the encoding
of a set of database records from latin-1 to utf-8, but the same
program raises an exception using Python 2.5. This small example shows
the problem:

import codecs
fo = open('test.dat', 'w')
fo.write('G\xe2teaux')
fo.close()

fi = open("test.dat",'r')
fx = codecs.EncodedFile(fi, 'utf-8', 'latin-1')
astring = fx.readline()
print astring
ustring = unicode(astring, 'utf-8' )
print repr(ustring)
print ustring.encode('latin-1')
print ustring.encode('utf-8')

Python 2.4 gives:

Gâteaux
u'G\xe2teaux'
Gâteaux
Gâteaux

which I believe is correct, while 2.5 produces

Traceback (most recent call last):
  File "test_codec.py", line 8, in <module>
    astring = fx.readline()
  File "C:\Python25\lib\codecs.py", line 709, in readline
    data = self.reader.readline()
  File "C:\Python25\lib\codecs.py", line 471, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python25\lib\codecs.py", line 418, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data

Is there a genuine problem here, or have I been misusing this function?
--
Regards
David Hughes

Jan 3 '07 #1
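
For reference, the call in the question asks codecs.EncodedFile() to read latin-1 bytes from the file and hand back utf-8 bytes: the second argument is the encoding the caller sees, the third is the encoding of the underlying file. A minimal sketch of that intended behaviour follows. It assumes Python 3 and uses an in-memory io.BytesIO as a stand-in for test.dat; both are assumptions made for illustration and are not part of the original program.

```python
import codecs
import io

# Stand-in for test.dat: the latin-1 bytes of "Gateaux" with a-circumflex (0xE2).
raw = io.BytesIO(b'G\xe2teaux')

# data_encoding='utf-8' is what read() returns to the caller;
# file_encoding='latin-1' is how the wrapped file is actually encoded.
fx = codecs.EncodedFile(raw, 'utf-8', 'latin-1')
recoded = fx.read()

# The single latin-1 byte 0xE2 should come back as the two-byte utf-8 sequence.
print(repr(recoded))
print(repr(recoded.decode('utf-8')))
```

If the wrapper works as intended, the recoded bytes decode cleanly as utf-8, which is what the Python 2.4 output in the question shows.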


David Hughes wrote:
> [...]
> Is there a genuine problem here, or have I been misusing this function?

This is indeed a bug in Python 2.5. Fixed in subversion.

http://svn.python.org/view/python/tr...52517&view=log

Peter

Jan 3 '07 #2
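
Until a fixed release is installed, one possible way around the problem is to skip codecs.EncodedFile() and recode the bytes explicitly. The sketch below is only an illustration under stated assumptions: it is written for Python 3, recode_file is a hypothetical helper name, and the temporary file names are made up for the example. None of this comes from the thread itself.

```python
import os
import tempfile


def recode_file(src_path, dst_path, src_encoding='latin-1', dst_encoding='utf-8'):
    """Copy src_path to dst_path, converting the text encoding line by line.

    Iterating a binary file splits on b'\\n', so this assumes an
    ASCII-compatible source encoding such as latin-1 (not UTF-16).
    """
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:
            dst.write(line.decode(src_encoding).encode(dst_encoding))


# Usage: write a latin-1 record, recode it, and inspect the result.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'test.dat')
dst_path = os.path.join(workdir, 'test_utf8.dat')

with open(src_path, 'wb') as fo:
    fo.write(b'G\xe2teaux\n')

recode_file(src_path, dst_path)

with open(dst_path, 'rb') as fi:
    result = fi.read()

print(repr(result))
```

Decoding and encoding by hand keeps both encodings visible at the call site, so the conversion does not depend on how the stream wrapper chooses its reader and writer.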
