Problem processing Chinese

Anthony Liu

I believe topics related to Chinese processing have been
discussed before, but I could not dig the info I want
out of the mailing list archive.

My Python script reads some Chinese text and then
splits each line on whitespace. I get lists
like

['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
'\xa1\xa2']

I had

#-*- coding: gbk -*-

on top of the script.

My Windows 2000 system's default language is Chinese
(GB2312) and displays Chinese perfectly.

I don't know how to configure Python, or what else I
need, to properly process such two-byte-character text.

Thanks.


Oct 14 '05 #1


Peter Otten

Suppose you have a file with the following contents:

>>> file("chinese.txt").read()
'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

Then it's best to open it via codecs -- of course you have to know the
encoding:

>>> import codecs
>>> codecs.open("chinese.txt", "r", "gbk").read()
u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'
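
To try this without an existing chinese.txt, here is a minimal sketch. It writes the GBK bytes from the question to a temporary file (the temporary path and the variable names are only for illustration), then reads them back once as raw bytes and once through codecs.open. It is written to run on Python 2.6+ as well as Python 3, which is an assumption beyond the Python 2 session shown in this thread.

```python
import codecs
import os
import tempfile

# The GBK-encoded bytes from the question: three words separated by spaces.
raw = b'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

# Write them to a throwaway file so the example does not need chinese.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Binary mode returns the undecoded bytes, as in the original problem.
with open(path, "rb") as f:
    as_bytes = f.read()

# codecs.open decodes while reading and returns a unicode string.
with codecs.open(path, "r", "gbk") as f:
    as_text = f.read()

os.remove(path)

# repr() is printed so the output stays ASCII whatever the console encoding is.
print(repr(as_bytes))
print(repr(as_text))
```

Naming the wrong encoding would either raise UnicodeDecodeError or silently produce the wrong characters, which is why the encoding has to be known up front.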

This may still look strange to you but it's the unicode string's repr().
If sys.stdout.encoding is properly set on your system you can just print it:

>>> u = codecs.open("chinese.txt", "r", "gbk").read()
>>> print u
记者 谢金虎 、

If that fails, provide the encoding explicitly:

>>> print u.encode("utf-8") # probably "gbk" instead of "utf-8" on your system
记者 谢金虎 、
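
As a rough sketch of the explicit-encoding fallback: the snippet below picks the console encoding when one is reported and otherwise falls back to UTF-8. The fallback choice and the use of errors="replace" are assumptions for illustration, not something the thread prescribes. It also shows that encoding to GBK gives back exactly the bytes from the question.

```python
import sys

# The decoded text from the example above.
u = u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'

# sys.stdout.encoding can be None (for example when output is piped),
# so fall back to UTF-8 in that case (an assumption, adjust as needed).
target = getattr(sys.stdout, "encoding", None) or "utf-8"

# errors="replace" substitutes '?' for characters the target encoding
# cannot represent instead of raising UnicodeEncodeError.
encoded = u.encode(target, "replace")

# Encoding to GBK reproduces the original file contents byte for byte.
gbk_bytes = u.encode("gbk")
utf8_bytes = u.encode("utf-8")

print(repr(gbk_bytes))
print(len(gbk_bytes), len(utf8_bytes))
```

The same six characters take two bytes each in GBK and three bytes each in UTF-8, so the byte strings differ even though the text is identical.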

Because you are now working with unicode, all further operations are
performed on characters rather than bytes. Processing Chinese is no more
difficult than processing a language that confines itself to plain ASCII.
But if you split your text into a list

>>> u.split()
[u'\u8bb0\u8005', u'\u8c22\u91d1\u864e', u'\u3001']

you probably think you are back to square one. That is because Python prints
the repr() of the list items (otherwise a comma would give the impression
that the list contains more items than it actually does). To get the actual
characters, choose an item explicitly

>>> items = u.split()
>>> print items[0]
记者

or convert the entire list to a string of your liking, e.g.:

>>> print u"[%s]" % u", ".join(items)
[记者, 谢金虎, 、]
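
The difference between working on bytes and working on characters can be checked with a small sketch like the one below. The variable names are illustrative, and the code is written to run on Python 2.6+ and Python 3, which goes beyond the Python 2 session above. It splits the same data once as a byte string and once as unicode and compares the item lengths.

```python
# The GBK-encoded bytes from the question and their decoded form.
raw = b'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'
u = raw.decode("gbk")

# Splitting works on both, but the items are measured differently.
byte_items = raw.split()
text_items = u.split()

byte_lengths = [len(item) for item in byte_items]  # counts bytes
char_lengths = [len(item) for item in text_items]  # counts characters

# Indexing a unicode item yields a whole character, never half of one.
first_char = text_items[1][0]

# Build one display string from the list instead of relying on the list repr.
joined = u"[%s]" % u", ".join(text_items)

print(byte_lengths)
print(char_lengths)
print(repr(joined))
```

Each GBK item is twice as long in bytes as it is in characters, so slicing or indexing the byte strings could cut a character in half, while the unicode items are safe to slice.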

Peter

Oct 14 '05 #2
