
"UnicodeError: UTF-16 stream does not start with BOM"

Newbie
 
Join Date: Sep 2008
Posts: 10
#1: Jan 6 '09
I have a text file that contains Unicode data (say inp.txt).
I read the file using the following code:

    import codecs
    infile = codecs.open('C:\\tdata\\inp.txt','r','utf-16',errors='ignore')
    data = infile.readlines()
If I run the above code, it throws the following error:
    Traceback (most recent call last):
      File "C:\script\hypen\hyp.py", line 34, in ?
        data = infile.readlines()
      File "C:\Python24\lib\codecs.py", line 489, in readlines
        return self.reader.readlines(sizehint)
      File "C:\Python24\lib\codecs.py", line 404, in readlines
        data = self.read()
      File "C:\Python24\lib\codecs.py", line 293, in read
        newchars, decodedbytes = self.decode(data, self.errors)
      File "C:\Python24\lib\encodings\utf_16.py", line 49, in decode
        raise UnicodeError,"UTF-16 stream does not start with BOM"
    UnicodeError: UTF-16 stream does not start with BOM
But if I create a new file (I did this in Notepad on Win XP), copy and paste the contents of 'inp.txt' into it, and save it as a text file (choosing the Unicode encoding, the same as inp.txt), the same code reads the new file absolutely fine. This seems weird... does the Notepad-created file have some magic chars of its own added? :)

Can anyone help me with this? What could be the issue here? Why does creating a new file and saving the contents in it work fine while the original file still throws the error? (I have 15 such localized files from clients on which some processing has to be done, and I want to avoid manual copy/paste rework.) Any help appreciated...


Thanks,
anil

bvdet
Moderator
 
Join Date: Oct 2006
Location: Nashville, TN
Posts: 1,566
#2: Jan 6 '09

re: "UnicodeError: UTF-16 stream does not start with BOM"


I found the information at this link helpful. Since you know your encoding is "UTF-16", you may be able to read the raw bytes and use the string method decode() to decode your data. Notepad adds the BOM based on the encoding selected, which would explain why the re-saved file reads without the error.
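
Here is a minimal sketch of that idea, not a definitive fix. It assumes the client files are little-endian UTF-16 saved without a BOM, which is common on Windows but worth confirming by looking at the first two bytes of inp.txt. The helper name read_utf16_lines and its fallback parameter are made up for illustration. It is written for current Python; on Python 2.4 the "with" statement is not available, so you would open and close the file by hand.

```python
import codecs


def read_utf16_lines(path, fallback="utf-16-le"):
    """Return the lines of a UTF-16 text file, with or without a BOM.

    `fallback` is an assumption: the byte order to use when the file
    carries no BOM. Change it to "utf-16-be" for big-endian files.
    """
    # Read raw bytes so we can inspect the start of the file ourselves.
    with open(path, "rb") as f:
        raw = f.read()

    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # BOM present: the generic codec picks the byte order from it
        # and drops the BOM from the decoded text.
        text = raw.decode("utf-16")
    else:
        # No BOM: decode with the explicitly assumed byte order.
        text = raw.decode(fallback)

    # Keep line endings, like readlines() does.
    return text.splitlines(True)
```

Calling read_utf16_lines('C:\\tdata\\inp.txt') in place of the codecs.open(...).readlines() pair should then handle both the original files and the Notepad copies, provided the byte-order assumption holds.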