The same strings, different UTF-8 repr values?

slowness.chen

I have two files:

test.py:
--------------------------------------------------
# -*- coding: utf-8 -*-
print 'in this file', repr('中文')

# tt.txt is saved as utf8 encoding
f = file('tt.txt')
line1 = f.readline().strip()
print 'another file', repr(line1)
-------------------------------------------------------

tt.txt:
----------------------------------------------------
中文
test
-------------------------------------------------------
Running test.py gives the following output:
in this file '\xe4\xb8\xad\xe6\x96\x87'
another file '\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

and I can't convert line1 to GBK like this:
line1.decode('utf8').encode('gbk')
It fails with this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

Why did I get different repr values?

Sep 7 '06 #1

John Machin

sl***********@gmail.com wrote:

why did I get the different repr values?

Because whatever you used to "save as" that file has retained or
inserted a BOM (byte order mark, U+FEFF) at the start of the file
before encoding as UTF-8. It's the '\xef\xbb\xbf' at the start of the
file, and also the u'\ufeff' that is giving the gbk codec indigestion.
You can remove it in your script.
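
As a minimal sketch of that removal (not part of the original reply, and written for Python 3 rather than the Python 2 used in the question): the helper name strip_utf8_bom is made up for illustration, and the bytes in line1 are copied from the output shown in the question. codecs.BOM_UTF8 and the 'utf-8-sig' codec are both in the standard library.

```python
import codecs


def strip_utf8_bom(raw):
    """Return raw bytes without a leading UTF-8 BOM, if one is present."""
    # strip_utf8_bom is a hypothetical helper name, not a library function.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Bytes as read from tt.txt in the question: BOM, then the UTF-8 text.
line1 = b'\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

# Option 1: drop the BOM bytes by hand, then decode as plain UTF-8.
text_manual = strip_utf8_bom(line1).decode('utf-8')

# Option 2: let the 'utf-8-sig' codec skip a leading BOM while decoding.
text_sig = line1.decode('utf-8-sig')

# With the BOM gone, the conversion to GBK from the question goes through.
gbk_bytes = text_manual.encode('gbk')
print(repr(gbk_bytes))
```

When reading the file directly, opening it with the 'utf-8-sig' encoding should have the same effect: open('tt.txt', encoding='utf-8-sig') in Python 3, or codecs.open('tt.txt', encoding='utf-8-sig') in Python 2.5 and later.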

HTH
John

Sep 7 '06 #2

Slowness Chen

Got it, thanks.

Sep 8 '06 #3
