Encode exception for Chinese text

Hi all,

I am new to python.

I have written a small application that reads data from an XML file and
tries to encode the data using the appropriate charset.
I am facing a problem while encoding one Chinese paragraph with the
charset "gb2312".

code is:

encoded_str = str_data.encode("gb2312")

The type of str_data is <type 'unicode'>

The exception is:

"UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
position 0: illegal multibyte sequence"

Can anyone please give me some direction to solve this issue?

Regards,
Vinayakc

May 19 '06 #1
Are you sure all the characters in the original text are in the "gb2312"
charset?

Encoding with "utf8" seems to work for this character (u'\xa0'), but I
don't know if the result is correct.

Could you give a subset of str_data in unicode?

May 19 '06 #2
Vinayakc wrote:
Hi all,

I am new to python.

I have written a small application that reads data from an XML file and
tries to encode the data using the appropriate charset.
I am facing a problem while encoding one Chinese paragraph with the
charset "gb2312".

code is:

encoded_str = str_data.encode("gb2312")

The type of str_data is <type 'unicode'>

The exception is:

"UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
position 0: illegal multibyte sequence"


Hmm, this is a 'no-break space' at the very beginning of the text. It
looks suspiciously like a plain-text utf-8 signature, which is a 'zero
width no-break space'. If you strip the first character, do you still
have encoding errors?
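
The two characters mentioned here can be told apart with the standard library. The following is only an illustrative sketch, written for Python 3 (the thread itself uses Python 2, where the `u` prefix is required); `sample` is a hypothetical stand-in for the poster's `str_data`, not the real text.

```python
import codecs
import unicodedata

NBSP = u"\u00a0"        # the character reported in the traceback
SIGNATURE = u"\ufeff"   # what a utf-8 signature (BOM) decodes to

# The two code points have different names and different utf-8 byte forms.
print(unicodedata.name(NBSP))        # NO-BREAK SPACE
print(unicodedata.name(SIGNATURE))   # ZERO WIDTH NO-BREAK SPACE
print(NBSP.encode("utf-8"))          # b'\xc2\xa0'
print(SIGNATURE.encode("utf-8") == codecs.BOM_UTF8)   # True

# Hypothetical sample: a no-break space followed by two Chinese characters.
sample = u"\u00a0\u4e2d\u6587"
starts_with_signature = sample.startswith(SIGNATURE)
starts_with_nbsp = sample.startswith(NBSP)
print(starts_with_signature, starts_with_nbsp)   # False True
```

If the first check is False and the second is True, the leading character is a genuine no-break space in the data rather than a leftover signature.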

May 19 '06 #3
Yes Serge, I have removed the first character, but it still raises an
encoding exception.

May 19 '06 #4
1. *By definition*, you can encode *any* Unicode string into utf-8.
Proves nothing.
2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
later gbk alias cp936. It does have an equivalent in the latest Chinese
encoding, gb18030.
3. gb2312 is outdated. It is not really an "appropriate" charset for
anything much these days. You need to check out what your requirements
really are. The unknowing will cheerfully use "gb" to mean one or more
of those, or to mean "anything that's not big5" :-)
4. The slab of text you supplied is genuine unicode and encodes happily
into all those gb* charsets. It does *not* contain \u00a0.
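
Points 1 and 2 can be checked directly against Python's built-in codecs. This is a minimal sketch for Python 3; the helper name `can_encode` is made up for illustration and is not part of any library.

```python
NBSP = u"\u00a0"   # NO-BREAK SPACE, the character named in the traceback


def can_encode(text, codec):
    """Return True if `text` encodes under `codec` without an error."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


codecs_to_try = ("gb2312", "gbk", "gb18030", "utf-8")
results = {codec: can_encode(NBSP, codec) for codec in codecs_to_try}

for codec in codecs_to_try:
    print(codec, results[codec])
# gb2312 False
# gbk False
# gb18030 True
# utf-8 True
```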

I do hope some of this helps ....

Cheers,
John

May 19 '06 #5
Vinayakc wrote:
Yes Serge, I have removed the first character, but it still raises an
encoding exception.


Then I guess this character was used as a poor man's indentation tool,
at least at the beginning of your text. It's up to you to decide what
to do with that character. You have several choices:

* edit source xml file to get rid of it
* remove it while you process your data
* replace it with ordinary space
* consider utf-8

Note that there are legitimate use cases for the no-break space. For
example, one million can be written as 1 000 000, where the spaces are
non-breakable. This prevents the number from being broken at the right
margin like this: 1 000
000

Keep that in mind when you remove or replace no-break space.
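
The choices listed above that can be done in code might look like the sketch below (Python 3). The value of `str_data` here is a hypothetical sample, not the poster's real paragraph, and blanket replacement is only safe if the no-break spaces carry no meaning in the data, per the caveat above.

```python
NBSP = u"\u00a0"

# Hypothetical sample: a no-break space followed by two Chinese characters.
str_data = u"\u00a0\u4e2d\u6587"

# Choice: replace the no-break space with an ordinary space, then encode.
replaced = str_data.replace(NBSP, u" ").encode("gb2312")

# Choice: remove the no-break space entirely, then encode.
removed = str_data.replace(NBSP, u"").encode("gb2312")

# Choice: leave the text untouched and use utf-8 instead of gb2312.
as_utf8 = str_data.encode("utf-8")

print(replaced)   # b' \xd6\xd0\xce\xc4'
print(removed)    # b'\xd6\xd0\xce\xc4'
print(as_utf8.decode("utf-8") == str_data)   # True
```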

May 19 '06 #6
Hey Serge, John,

Thank you very much. I was really not aware of these facts. Anyway,
this happens for only one in millions, so I can ignore it for
now.

Thanks again,

Vinayakc

May 19 '06 #7
John Machin wrote:
1. *By definition*, you can encode *any* Unicode string into utf-8.
Proves nothing.
2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
later gbk alias cp936. It does have an equivalent in the latest Chinese
encoding, gb18030.


Also, *by definition*, though :-) For those that have not followed
encodings too closely: gb18030 is to gb2312 what UTF-8 is to ASCII.
Both encode the entire Unicode in an algorithmic way, and provide
byte-for-byte identical encodings for their respective
subsets.
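
The subset relationship can be illustrated with Python's built-in codecs. This is a small sketch for Python 3 that uses two sample characters only; it does not check the whole gb2312 repertoire.

```python
chinese = u"\u4e2d\u6587"   # two characters that gb2312 covers
ascii_text = u"abc"

# Text inside the older charset yields the same bytes in the newer one.
same_gb = chinese.encode("gb2312") == chinese.encode("gb18030")
same_ascii = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A character outside gb2312, such as the no-break space, still encodes
# in gb18030; it takes a four-byte sequence there.
nbsp_gb18030 = u"\u00a0".encode("gb18030")

print(same_gb, same_ascii, len(nbsp_gb18030))   # True True 4
```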

Regards,
Martin
May 19 '06 #8
MvL wrote:
Also, *by definition*, though :-)


Ah yes, indeed; and thanks for reminding me. Aside: Similar definition,
but not similar design: IMHO utf-8 sits on top of ASCII like a rose on
a stalk, whereas gb18030 sits on top of gb2312 like a rhinoceros on a
unicycle :-)
Cheers,
John

May 19 '06 #9
