I have a text file that contains Unicode data (say, inp.txt).
I read the file using the following code:

import codecs
infile = codecs.open('C:\\tdata\\inp.txt','r','utf-16',errors='ignore')
data = infile.readlines()

When I run the above code, it raises the following error:

Traceback (most recent call last):
  File "C:\script\hypen\hyp.py", line 34, in ?
    data = infile.readlines()
  File "C:\Python24\lib\codecs.py", line 489, in readlines
    return self.reader.readlines(sizehint)
  File "C:\Python24\lib\codecs.py", line 404, in readlines
    data = self.read()
  File "C:\Python24\lib\codecs.py", line 293, in read
    newchars, decodedbytes = self.decode(data, self.errors)
  File "C:\Python24\lib\encodings\utf_16.py", line 49, in decode
    raise UnicodeError,"UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM

However, if I create a new file in Notepad (on Windows XP), paste the contents of 'inp.txt' into it, and save it as a text file with the Unicode encoding (the same encoding as inp.txt), the same code reads the new file without any problem. This seems weird. Does Notepad add some magic characters of its own to the files it creates? :)
Can anyone help me understand what the issue is here? Why does the newly created file with the same contents work fine while the original file still raises the error? (I have received 15 such localized files from clients that need some processing, and I want to avoid copying and pasting each one by hand.) Any help is appreciated.
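
One way to test the "magic characters" guess is to look at the first two bytes of each file. Notepad's "Unicode" option writes UTF-16 little-endian and starts the file with a byte order mark (BOM, bytes FF FE), and the error message suggests that the original file has none. The following is only a minimal sketch under that assumption, not a confirmed diagnosis: the helper names detect_utf16_bom and read_lines are hypothetical, and treating a BOM-less file as little-endian is a guess that would need checking against the real client files.

```python
import codecs

def detect_utf16_bom(path):
    """Return 'utf-16-le' or 'utf-16-be' if the file starts with that BOM, else None."""
    f = open(path, 'rb')
    try:
        head = f.read(2)  # a UTF-16 BOM is exactly two bytes
    finally:
        f.close()
    if head == codecs.BOM_UTF16_LE:
        return 'utf-16-le'
    if head == codecs.BOM_UTF16_BE:
        return 'utf-16-be'
    return None

def read_lines(path, fallback='utf-16-le'):
    """Read all lines, using 'utf-16' when a BOM is present, else the fallback codec."""
    if detect_utf16_bom(path) is None:
        # Assumption: files without a BOM are little-endian (hypothetical default).
        encoding = fallback
    else:
        # The plain 'utf-16' codec reads and strips the BOM itself.
        encoding = 'utf-16'
    infile = codecs.open(path, 'r', encoding, errors='ignore')
    try:
        return infile.readlines()
    finally:
        infile.close()
```

If detect_utf16_bom returns None for inp.txt but 'utf-16-le' for the copy saved from Notepad, the BOM is the difference between the two files. Calling read_lines on each of the 15 files in a loop would then avoid the manual copy and paste, provided the little-endian assumption holds for them.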
Thanks,
anil