Encode exception for Chinese text

Vinayakc

Hi all,

I am new to Python.

I have written a small application which reads data from an XML file and
tries to encode the data using the appropriate charset.
I am facing a problem while encoding one Chinese paragraph with the charset
"gb2312".

The code is:

encoded_str = str_data.encode("gb2312")

The type of str_data is <type 'unicode'>

The exception is:

"UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
position 0: illegal multibyte sequence"

Can anyone please give me some direction to solve this issue?

Regards,
Vinayakc

May 19 '06 #1


swordsp

Are you sure all the characters in the original text are in the "gb2312"
charset?

Encoding with "utf8" seems to work for this character (u'\xa0'), but I
don't know if the result is correct.

Could you give a subset of str_data in unicode?

May 19 '06 #2

Serge Orlov

Vinayakc wrote:

> Hi all,
>
> I am new to Python.
>
> I have written a small application which reads data from an XML file and
> tries to encode the data using the appropriate charset.
> I am facing a problem while encoding one Chinese paragraph with the charset
> "gb2312".
>
> The code is:
>
> encoded_str = str_data.encode("gb2312")
>
> The type of str_data is <type 'unicode'>
>
> The exception is:
>
> "UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
> position 0: illegal multibyte sequence"

Hmm, this is 'no-break space' at the very beginning of the text. It
looks suspiciously like a plain text utf-8 signature, which is 'zero
width no-break space'. If you strip the first character, do you still
have encoding errors?
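
One way to check this before stripping anything is to list every character the codec rejects, together with its Unicode name. The sketch below is only an illustration: `find_unencodable` and `sample` are hypothetical names, and `sample` is a stand-in for the poster's `str_data`, which is not shown in the thread. It is written for Python 3, where `str` is the Unicode type; the `u""` literals are kept to match the thread. Note that a decoded utf-8 signature would show up as u'\ufeff' (ZERO WIDTH NO-BREAK SPACE), not as u'\xa0' (NO-BREAK SPACE), so the names reported here help tell the two apart.

```python
import unicodedata


def find_unencodable(text, encoding):
    """Return (index, char, Unicode name) for each character the codec rejects."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char, unicodedata.name(char, "UNKNOWN")))
    return problems


# Hypothetical stand-in for str_data: two no-break spaces around Chinese text.
sample = u"\xa0\u4e2d\u6587\xa0text"

for index, char, name in find_unencodable(sample, "gb2312"):
    print(index, repr(char), name)
```

If the list contains more than one entry, removing only the first character will not be enough to make the encode call succeed.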

May 19 '06 #3

Vinayakc

Yes Serge, I have removed the first character but it is still giving
an encoding exception.

May 19 '06 #4

John Machin

1. *By definition*, you can encode *any* Unicode string into utf-8.
Proves nothing.
2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
later gbk alias cp936. It does have an equivalent in the latest Chinese
encoding, gb18030.
3. gb2312 is outdated. It is not really an "appropriate" charset for
anything much these days. You need to check out what your requirements
really are. The unknowing will cheerfully use "gb" to mean one or more
of those, or to mean "anything that's not big5" :-)
4. The slab of text you supplied is genuine unicode and encodes happily
into all those gb* charsets. It does *not* contain \u00a0.
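
Points 1 and 2 can be checked directly against Python's bundled codecs. This is a minimal sketch in Python 3 syntax; `can_encode` is a hypothetical helper name, not something from the thread.

```python
NBSP = u"\xa0"  # NO-BREAK SPACE, the character named in the exception


def can_encode(text, encoding):
    """Return True if the codec can represent every character in text."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Compare the codecs mentioned above; cp936 is an alias of gbk in Python.
for codec in ("gb2312", "gbk", "cp936", "gb18030", "utf-8"):
    print(codec, can_encode(NBSP, codec))
```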

I do hope some of this helps ....

Cheers,
John

May 19 '06 #5

Serge Orlov

Vinayakc wrote:

> Yes Serge, I have removed the first character but it is still giving
> an encoding exception.

Then I guess this character was used as a poor man's indentation tool, at
least at the beginning of your text. It's up to you to decide what to
do with that character; you have several choices:

* edit source xml file to get rid of it
* remove it while you process your data
* replace it with ordinary space
* consider utf-8

Note, there are legitimate use cases for no-break space. For example,
one million can be written as 1 000 000, where the spaces are
non-breaking. This prevents the number from being broken at the right
margin like this: 1 000
000

Keep that in mind when you remove or replace no-break space.
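
The choices that can be made while processing the data might look roughly like this. It is a sketch only, in Python 3 syntax: `replace_nbsp`, `remove_nbsp`, and `sample` are hypothetical names, and `sample` stands in for the real `str_data`. The last two lines show two further options: the codec's own "replace" error handler, which is lossy, and utf-8, which needs no cleanup at all.

```python
NBSP = u"\xa0"  # NO-BREAK SPACE


def replace_nbsp(text):
    """Swap each no-break space for an ordinary space."""
    return text.replace(NBSP, u" ")


def remove_nbsp(text):
    """Drop no-break spaces entirely."""
    return text.replace(NBSP, u"")


# Hypothetical stand-in for str_data: a leading no-break space, then Chinese text.
sample = u"\xa0\u4e2d\u6587"

encoded_replaced = replace_nbsp(sample).encode("gb2312")
encoded_removed = remove_nbsp(sample).encode("gb2312")

# Lossy alternative: let the codec substitute "?" for anything it cannot encode.
encoded_lossy = sample.encode("gb2312", "replace")

# utf-8 can encode the original text unchanged.
encoded_utf8 = sample.encode("utf-8")
```

Replacing keeps word boundaries intact, while removing can glue neighbouring words or digit groups together, which is the concern raised above.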

May 19 '06 #6

Vinayakc

Hey Serge, John,

Thank you very much. I was really not aware of these facts. Anyway,
this is happening only for one record in millions, so I can ignore it for
now.

Thanks again,

Vinayakc

May 19 '06 #7

Martin v. Löwis

John Machin wrote:

> 1. *By definition*, you can encode *any* Unicode string into utf-8.
> Proves nothing.
> 2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
> later gbk alias cp936. It does have an equivalent in the latest Chinese
> encoding, gb18030.

Also, *by definition*, though :-) For those who have not followed
encodings too closely: gb18030 is to gb2312 what UTF-8 is to ASCII.
Both encode the entire Unicode in an algorithmic way, and provide
byte-for-byte identical encodings for their respective
subset.
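
The analogy can be illustrated on a small sample. This is a sketch in Python 3 syntax that checks only the hypothetical strings below, not the whole gb2312 repertoire.

```python
# Text that gb2312 can represent: two Chinese characters plus ASCII.
gb_text = u"\u4e2d\u6587 abc"
same_in_gb18030 = gb_text.encode("gb18030") == gb_text.encode("gb2312")

# Pure ASCII text encodes to the same bytes under ASCII and UTF-8.
ascii_text = u"abc"
same_in_utf8 = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Outside the subset, only the larger encoding works: gb18030 accepts the
# no-break space that gb2312 rejected, and the result decodes back unchanged.
nbsp_bytes = u"\xa0".encode("gb18030")
round_trips = nbsp_bytes.decode("gb18030") == u"\xa0"

print(same_in_gb18030, same_in_utf8, round_trips, len(nbsp_bytes))
```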

Regards,
Martin

May 19 '06 #8

John Machin

MvL wrote:

> Also, *by definition*, though :-)

Ah yes, indeed; and thanks for reminding me. Aside: Similar definition,
but not similar design: IMHO utf-8 sits on top of ASCII like a rose on
a stalk, whereas gb18030 sits on top of gb2312 like a rhinoceros on a
unicycle :-)
Cheers,
John

May 19 '06 #9
