473,406 Members | 2,208 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,406 software developers and data experts.

print UTF-8 file with BOM

Hi Friends:

fileObj = codecs.open( filename, "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in
the file
print u

It says error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in
position 0:
illegal multibyte sequence

I want to know how read from UTF-8 file, and convert to specified
locale (default is current system locale) and print out string. I hope
put away BOM header automatically.

Rgds, David

Dec 23 '05 #1
5 5971
FYI. I had just receive something from a friend, he give me following
nice example!

I have one more question on this: How to write if I want to specify
locale other than current locale? For example, program runn on Korea
locale system, and try reading a UTF-8 file that save chinese
characters.

-------------- The code is here --------------------
import codecs
def read_utf8_txt_file (filename):
fileObj = codecs.open( filename, "r", "utf-8" )
content = fileObj.read()
content = content[1:] #exclude BOM
print content
fileObj.close()

Dec 23 '05 #2
> 2005/12/23, David Xiao <da******@gmail.com>:
Hi Kuan:

Thanks a lot! One more question here: How to write if I want
to
specify locale other than current locale?

For example, running on Korea locale system, and try read a
UTF-8 file
that save chinese.


Use the encode method to translate the unicode object into whatever
encoding you want.

unicodeStr = ...
print unicodeStr.encode('big5')

Hope this helps,

Carsten.
Dec 23 '05 #3
UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.
Dec 23 '05 #4
John Bauman wrote:
UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.


However there's a pending patch (http://bugs.python.org/1177307) for a
new encoding named utf-8-sig, that would output a leading BOM on writing
and skip it on reading.

Bye,
Walter Dörwald
Dec 23 '05 #5
John Bauman wrote:
UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.


Yes and no. Yes, UTF-8 does not need a BOM to identify endianness. No,
usage of the BOM with UTF-8 is explicitly allowed in the Unicode specs
(so output of the BOM doesn't *have* to be restricted to UTF-16 and
greater), and the BOM has a well-defined meaning for UTF-8 (namely,
as the UTF-8 signature).

Regards,
Martin
Dec 23 '05 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
by: Andrew Chalk | last post by:
In a Python 2.2 app. running under CGI the statements print "Hello\n" print "World" print both words on the same line in IE6. How do I print the second one on a new line (i.e. respect the \n...
4
by: Joerg Lehmann | last post by:
I am using Python 2.2.3 (Fedora Core 1). The problem is, that strings containing umlauts do not work as I would expect. Here is my example: >>> a = 'äöü' >>> b = '123' >>> print "%-5s...
12
by: Michael Foord | last post by:
Here's a little oddity with 'print' being a reserved word... >>> class thing: pass >>> something = thing() >>> something.print = 3 SyntaxError: invalid syntax >>> print something.__dict__...
3
by: DD | last post by:
I have frmMain and fsub you choose a month from the ChooseMonth combo and all the records for that month are nopw visible in the fsub I know print with the following Where and only recieve the...
21
by: g.kanaka.raju | last post by:
Hi, I'm looking for a way to print ® in a C program. Could any of you help me out? Regards, Raju
8
by: =?gb2312?B?yMvR1MLkyNXKx8zs0cSjrM37vKvM7NHEsru8+7z | last post by:
I lookup the utf-8 form of delta from the link. http://www.fileformat.info/info/unicode/char/0394/index.htm and then I want to print it in the python ( I work under windows) #!/usr/bin/python...
0
by: jegathees | last post by:
Hi Friends, I am working in SAP XI Platform. In this, we use XSLT Mapping. I am facing one problem to print in field 'RecordID' starts from 1,2,3 in the target structure for those records...
44
by: Ioannis Vranos | last post by:
Has anyone actually managed to print non-English text by using wcout or wprintf and the rest of standard, wide character functions?
13
by: damonwischik | last post by:
I'd like to print out a unicode string. I'm running Python inside Emacs, which understands utf-8, so I want to force Python to send utf-8 to sys.stdout. From what I've googled, I think I need...
5
by: sniipe | last post by:
Hi, I have a problem with unicode string in Pylons templates(Mako). I will print first char from my string encoded in UTF-8 and urllib.quote(), for example string '£ukasz': ...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.