Bytes IT Community

Reading in a UTF-8 file but causing a UnicodeDecodeError exception

P: 92
I have a CSV file created by Visual Basic in UTF-8. If I open the file in vi or emacs I see the byte-order mark (BOM), <feff>


So now when I read the file:

import codecs
f = open('myfile')
test = f.readline()
print test.decode('utf-8')
It prints a control character (u'\ufeff', the decoded BOM bytes \xef\xbb\xbf) as its first character. Shouldn't the decode strip this? I also tried the following to see what would happen and to try to auto-detect the format:

import codecs
for encoding in ['utf-8', 'utf-16']:
    try:
        f = codecs.open('myfile', encoding=encoding)
        test = f.readline()
        print repr(test)
    except Exception, exc:
        f = None
        print exc
For UTF-16 this is weird because it raises "UTF-16 stream does not start with BOM" even though the first character is the BOM. For UTF-8 there are no errors, but it prints the control character (u'\ufeff').

Any ideas what is going on with this? Possibly a badly encoded file?
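The symptom can be reproduced without a file. In Python 3 terms: the plain 'utf-8' codec decodes the three BOM bytes to the character U+FEFF instead of dropping them, while the 'utf-8-sig' codec strips them. A minimal sketch (the sample bytes are illustrative):

```python
# Illustrative bytes standing in for the start of the CSV: BOM + data.
raw = b'\xef\xbb\xbfcol1,col2'

plain = raw.decode('utf-8')      # BOM bytes decode to the character U+FEFF
sig = raw.decode('utf-8-sig')    # BOM is consumed by the codec

print(repr(plain))  # -> '\ufeffcol1,col2'
print(repr(sig))    # -> 'col1,col2'
```

This is why vi shows <feff> (it displays the decoded code point) while hex mode shows ef bb bf (the raw bytes).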
Aug 11 '09 #1
3 Replies


P: 92
More context: I loaded the file in emacs in hex mode and I see ef bb bf, which should indicate UTF-8. So the questions are:

1. Why does vi show '<feff>' as the first character?
2. Why does the first code snippet I show not strip the control character?
3. Why, when using codecs.open and forcing UTF-8, does it replace the control character with u'\ufeff'?
4. Does Python have a way to read in a file with encoding auto-detection?
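On question 4: Python has no built-in general encoding auto-detection, but when a BOM is present you can sniff it with the constants in the codecs module. A sketch in Python 3 syntax; sniff_encoding is a hypothetical helper name, not a stdlib function:

```python
import codecs

# Longer BOMs must be checked first: the UTF-32-LE BOM (ff fe 00 00)
# begins with the UTF-16-LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_encoding(raw, default='utf-8'):
    """Return (encoding, bom_length) guessed from a leading BOM.

    The caller should skip bom_length bytes before decoding.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return default, 0

enc, skip = sniff_encoding(b'\xef\xbb\xbfabc')
print(enc, skip)  # -> utf-8 3
```

A file without any BOM falls through to the default, so this only "detects" encodings that announce themselves.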
Aug 11 '09 #2

bvdet
Expert Mod 2.5K+
P: 2,851
Evan Jones has a good explanation here. I did a little test on my system reading a UTF-8 file with Python 2.3.
import codecs

# UTF-8
s1 = open('unicode_example1.txt', 'r').read()
print repr(s1.decode("UTF-8"))
if s1.startswith(codecs.BOM_UTF8):
    s1 = s1.lstrip(codecs.BOM_UTF8)
print repr(s1)
And the output:
>>> u'abcdef'
'abcdef'
Apparently object s1 is now a simple string.

>>> s1
'abcdef'
>>> unicode(s1, 'UTF-8')
u'abcdef'
>>>
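One caveat on the lstrip() call above: lstrip() treats its argument as a *set* of bytes to remove, not as a prefix, so it would also eat any run of 0xEF/0xBB/0xBF bytes that happened to follow the BOM. Slicing off the exact prefix avoids that; a sketch in Python 3 (where the raw file contents are bytes):

```python
import codecs

# Illustrative data: a UTF-8 BOM followed by the real payload.
data = codecs.BOM_UTF8 + 'abcdef'.encode('utf-8')

# Slice off the exact BOM prefix rather than lstrip()-ing a byte set.
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]

print(data.decode('utf-8'))  # -> abcdef
```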
Aug 12 '09 #3

P: 92
Thanks! This helps a lot. I agree with Evan that this looks like a bug in the Python codecs, in that they strip the BOM for UTF-16 but not for UTF-8. But adding:

lstrip(unicode(codecs.BOM_UTF8, "utf8"))

works wonders. I had to use the unicode conversion to avoid an exception due to non-ASCII characters, but it now works nicely.
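For readers on current Python 3, this stripping is built in: pass encoding='utf-8-sig' to open() and a leading BOM is consumed automatically (and tolerated if absent). A minimal sketch, with an illustrative temp file standing in for the VB-generated CSV:

```python
import os
import tempfile

# Illustrative stand-in for the VB-generated CSV: UTF-8 with a BOM.
path = os.path.join(tempfile.mkdtemp(), 'myfile.csv')
with open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbfcol1,col2\n')

# 'utf-8-sig' consumes a leading BOM if present.
with open(path, encoding='utf-8-sig') as f:
    line = f.readline()

print(repr(line))  # -> 'col1,col2\n'
```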
Aug 12 '09 #4
