remove BOM from string read from utf-8 file

Hi,

I read some text from a utf-8 encoded text file like this:

text = codecs.open('example.txt','r','utf8').read()

If I pass this text to a COM object, I can see that there is still the BOM
in the file, which marks the file as utf-8. Simply removing the first
character in the string is not ok, because the BOM is optional. So I tried
something like this:

if text.startswith(codecs.BOM_UTF8):
    print "found BOM"

but then I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)

What's the right way to remove the BOM from the string?

regards,
Achim
Jul 18 '05 #1
>>>>> "Achim Domma" <do***@procoders.net> (AD) wrote:

AD> Hi,
AD> I read some text from a utf-8 encoded text file like this:

AD> text = codecs.open('example.txt','r','utf8').read()

AD> If I pass this text to a COM object, I can see that there is still the BOM
AD> in the file, which marks the file as utf-8. Simply removing the first
AD> character in the string is not ok, because the BOM is optional. So I tried
AD> something like this:

The BOM is in the file, but not in the string 'text'. text is a unicode
string, which consists of Unicode characters, and the BOM is not a Unicode
character.

Check text[0] and len(text) to verify.

Moreover, BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string; that
is the reason for the complaint.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.***********@hccnet.nl
Jul 18 '05 #2
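
The byte-string point above can be checked directly. The following is a minimal sketch, not taken from any poster's code: it builds the file contents in memory (the sample text 'abc' is an assumption for illustration) and shows that codecs.BOM_UTF8 is three raw bytes, while the decoded text holds a single character, U+FEFF. Comparing the decoded text against the raw bytes is what triggers the error in the original post (on Python 3 the same comparison raises a TypeError instead).

```python
import codecs

# codecs.BOM_UTF8 is a byte string: the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Decoding those three bytes as UTF-8 yields one code point, U+FEFF.
bom_text = bom_bytes.decode('utf-8')

# Simulate the contents of a UTF-8 file that starts with a BOM.
raw = codecs.BOM_UTF8 + u'abc'.encode('utf-8')

# The plain 'utf-8' codec keeps the BOM as the first decoded character.
text = raw.decode('utf-8')

# Compare text with text, never text with bytes.
has_bom = text.startswith(bom_text)
```

The comparison only works when both sides are the same kind of string, so the BOM constant has to be decoded before it is used against decoded text.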
"Piet van Oostrum" <pi**@cs.uu.nl> wrote in message
news:wz************@Ordesa.local...
> Check text[0] and len(text) to verify.


That's what I did. The file contains 24 Chinese characters and len(text) is
25. And 0xef is the hex code for the BOM if I'm not completely wrong.

Achim
Jul 18 '05 #3
I often need to read text files that might be UTF-8, Unicode (UTF-16), or
ANSI, without knowing beforehand which, so I wrote a single function to do
it. I don't know if this is the correct way to handle this situation, but I
couldn't find any function that would simply open a file with the
appropriate codec automatically, so I use this (it doesn't handle all cases,
just the ones I've needed so far):

import os, codecs

#----------------------------------------------------------------------------
# OpenTextFile()
#
# Opens a file correctly whether it is unicode or ansi. If the file
# doesn't exist or has no recognized BOM, the encoding argument is used
# unchanged (None gives a plain, undecoded file).
#
# Python documentation of the codecs module is pretty weak; for instance
# there are all these:
#   BOM
#   BOM_BE
#   BOM_LE
#   BOM_UTF8
#   BOM_UTF16
#   BOM_UTF16_BE
#   BOM_UTF16_LE
#   BOM_UTF32
#   BOM_UTF32_BE
#   BOM_UTF32_LE
# but no explanation of how they map to the encodings like 'utf-16'. Some
# can be inferred, but some are not so clear.
#----------------------------------------------------------------------------
def OpenTextFile(filename, mode='r', encoding=None):
    if os.path.isfile(filename):
        f = open(filename, 'rb')
        header = f.read(4)  # Read just the first four bytes.
        f.close()
        # Don't change this to a map, because it is ordered!!!
        # (The little-endian UTF-32 BOM begins with the UTF-16 one.)
        # Note: the 'utf-8' codec leaves the BOM in the decoded text.
        encodings = [(codecs.BOM_UTF32, 'utf-32'),
                     (codecs.BOM_UTF16, 'utf-16'),
                     (codecs.BOM_UTF8, 'utf-8')]
        for h, e in encodings:
            if header.startswith(h):
                encoding = e
                break
    return codecs.open(filename, mode, encoding)
Jul 18 '05 #4
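
On the question of how the BOM constants map to encodings, here is a hedged sketch rather than a replacement for the function above. BOM, BOM_LE and BOM_BE are older aliases for the UTF-16 constants, and BOM_UTF16 and BOM_UTF32 are whichever byte order the running machine uses natively, so a check against them misses files written in the other byte order. The names BOM_TABLE and detect_encoding are hypothetical, and the sketch assumes a Python version that provides the 'utf-8-sig' and 'utf-32' codecs.

```python
import codecs
import sys

# BOM_UTF16 is an alias for the native byte order of this machine.
native_utf16 = (codecs.BOM_UTF16_LE if sys.byteorder == 'little'
                else codecs.BOM_UTF16_BE)

# Explicit byte orders, with UTF-32 first: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
# The 'utf-16' and 'utf-32' codecs read the BOM to pick the byte order
# and drop it; 'utf-8-sig' drops a leading UTF-8 BOM.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def detect_encoding(header, default=None):
    """Return a codec name for the BOM at the start of header, else default."""
    for bom, name in BOM_TABLE:
        if header.startswith(bom):
            return name
    return default
```

The header would be the first four bytes of the file, read in binary mode as in OpenTextFile(). A file without a BOM falls through to the default, so the caller still has to decide what "ansi" means on their system.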
>>>>> "Achim Domma" <do***@procoders.net> (AD) wrote:

AD> "Piet van Oostrum" <pi**@cs.uu.nl> wrote in message
AD> news:wz************@Ordesa.local...
AD> > Check text[0] and len(text) to verify.


AD> That's what I did. The file contains 24 Chinese characters and len(text) is
AD> 25. And 0xef is the hex code for the BOM if I'm not completely wrong.

Sorry, I was wrong.
You have to check for text.startswith(u'\ufeff')

--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.***********@hccnet.nl
Jul 18 '05 #5
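
Building on the u'\ufeff' check, removing the BOM could look like the sketch below. It is one possible approach under stated assumptions, not the poster's code: strip_bom and read_utf8_without_bom are hypothetical helper names, and the second helper relies on the 'utf-8-sig' codec, which skips a leading BOM while decoding and is only available in Python versions that ship it.

```python
import io

BOM = u'\ufeff'

def strip_bom(text):
    """Return text without a single leading U+FEFF, if one is present."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text

def read_utf8_without_bom(path):
    """Read a UTF-8 file; 'utf-8-sig' drops a leading BOM if there is one."""
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()
```

Because the BOM is optional, both helpers leave text that has no BOM unchanged, which is the behaviour the original question asked for.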
