Problem processing Chinese

I believe Chinese text processing has been discussed
here before, but I could not find the information I need
in the mailing list archive.

My Python script reads some Chinese text and then
splits each line on whitespace. I get lists
like

['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
'\xa1\xa2']

I had

#-*- coding: gbk -*-

on top of the script.

My Windows 2000 system's default language is Chinese
(GB2312) and displays Chinese perfectly.

I don't know how to configure Python, or what else I
need, to properly process such two-byte-character text.

Thanks.



Oct 14 '05 #1
Anthony Liu wrote:
> I believe Chinese text processing has been discussed
> here before, but I could not find the information I need
> in the mailing list archive.
>
> My Python script reads some Chinese text and then
> splits each line on whitespace. I get lists
> like
>
> ['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
> '\xa1\xa2']
>
> I had
>
> #-*- coding: gbk -*-
>
> on top of the script.
>
> My Windows 2000 system's default language is Chinese
> (GB2312) and displays Chinese perfectly.
>
> I don't know how to configure Python, or what else I
> need, to properly process such two-byte-character text.
>
> Thanks.


Suppose you have a file with the following contents:

>>> file("chinese.txt").read()
'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

Then it's best to open it via codecs -- of course you have to know the
encoding:

>>> import codecs
>>> codecs.open("chinese.txt", "r", "gbk").read()
u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'

This may still look strange to you, but it's the unicode string's repr().
If sys.stdout.encoding is properly set on your system you can just print it:

>>> u = codecs.open("chinese.txt", "r", "gbk").read()
>>> print u
记者 谢金虎 、

If that fails, provide the encoding explicitly:

>>> print u.encode("utf-8") # probably "gbk" instead of "utf-8" on your system
记者 谢金虎 、
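
For readers on Python 3, where str is already a unicode type, the two steps above (decode on input with a known encoding, encode explicitly on output) might be sketched as follows. This is an illustration under assumptions, not part of the original reply: a temporary file stands in for chinese.txt, and the helper names read_text and encode_for_output are made up for the example.

```python
import os
import tempfile

# GBK bytes from the question; a temporary file stands in for chinese.txt.
RAW = b"\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2"


def read_text(path, encoding="gbk"):
    """Decode the file on input; the caller has to know the encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


def encode_for_output(text, encoding="utf-8"):
    """Encode explicitly, for when the console encoding cannot be trusted."""
    return text.encode(encoding)


fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(RAW)
    text = read_text(path)
finally:
    os.remove(path)

# ascii() gives the escaped form, comparable to the repr() shown above.
print(ascii(text))
# Encoding back to GBK reproduces the original bytes.
print(encode_for_output(text, "gbk") == RAW)
```

In Python 3 the built-in open() accepts an encoding argument, so codecs.open is not needed for this.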

Because you are now working with unicode, all further operations are
performed on characters rather than bytes. Processing Chinese is no longer
more difficult than processing a language that confines itself to plain ASCII.
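
To make the difference between bytes and characters concrete, here is a small Python 3 sketch using the GBK bytes from the question. The variable names are illustrative only.

```python
raw = b"\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2"
text = raw.decode("gbk")

byte_count = len(raw)    # counts bytes: two per Chinese character here
char_count = len(text)   # counts characters

first_char = text[:1]    # slicing text yields a whole character

# Slicing the bytes instead cuts a character in half, which cannot be decoded.
try:
    raw[:1].decode("gbk")
    half_is_decodable = True
except UnicodeDecodeError:
    half_is_decodable = False

print(byte_count, char_count)   # 14 8
print(ascii(first_char))
print(half_is_decodable)        # False
```

Slicing, counting, and splitting on the decoded text never land in the middle of a character, which is the point of decoding first.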
But if you split your text into a list

>>> u.split()
[u'\u8bb0\u8005', u'\u8c22\u91d1\u864e', u'\u3001']

you probably think you are back to square one. That is because Python prints
the repr() of the list items (otherwise a comma inside an item would give the
impression that the list contains more items than it actually does). To get
the actual characters, choose an item explicitly

>>> items = u.split()
>>> print items[0]
记者

or convert the entire list to a string of your liking, e.g.:

>>> print u"[%s]" % u", ".join(items)
[记者, 谢金虎, 、]
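
A Python 3 sketch of the same contrast, offered as an illustration rather than as part of the original reply. In Python 3 the repr() of a list of strings usually shows printable characters directly, so ascii() is used here to reproduce the escaped view; format_items is a hypothetical helper name.

```python
text = "\u8bb0\u8005 \u8c22\u91d1\u864e \u3001"
items = text.split()


def format_items(values):
    """Join the items into one bracketed string made of the characters themselves."""
    return "[%s]" % ", ".join(values)


escaped = ascii(items)         # escaped view, like the Python 2 list repr
joined = format_items(items)   # one string, ready to print or encode

print(escaped)
print(len(items), len(joined))
```

Printing joined shows the characters themselves whenever the console encoding can represent them.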

Peter

Oct 14 '05 #2
