Bytes IT Community

Problem processing Chinese

I believe topics related to Chinese processing have been
discussed before, but I could not dig the information I want
out of the mailing list archive.

My Python script reads some Chinese text and then
splits each line on whitespace. I get lists
like

['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
'\xa1\xa2']

I have

# -*- coding: gbk -*-

at the top of the script.

My Windows 2000 system's default language is Chinese
(GB2312) and displays Chinese perfectly.

I don't know how to configure Python or what else I
need to properly process such double-byte text.

Thanks.



Oct 14 '05 #1


Anthony Liu wrote:
I don't know how to configure Python or what else I
need to properly process such double-byte text.

Thanks.


Suppose you have a file with the following contents:
>>> file("chinese.txt").read()
'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

Then it's best to open it via codecs -- of course you have to know the
encoding:
>>> import codecs
>>> codecs.open("chinese.txt", "r", "gbk").read()
u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'

This may still look strange to you, but it is just the repr() of the unicode
string. If sys.stdout.encoding is properly set on your system, you can just
print it:
>>> u = codecs.open("chinese.txt", "r", "gbk").read()
>>> print u
记者 谢金虎 、

If that fails, provide the encoding explicitly:
>>> print u.encode("utf-8")  # probably "gbk" instead of "utf-8" on your system
记者 谢金虎 、
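
For readers on Python 3, where file() and the print statement no longer exist, the same steps might look roughly like the sketch below. It is an illustration, not part of the original reply: the helper name read_gbk and the temporary file are made up for the example, and the sample bytes are the ones from the question.

```python
import os
import tempfile

# GBK-encoded sample bytes, copied from the question.
RAW = b"\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2"


def read_gbk(path):
    """Hypothetical helper: read a file as text, decoding its bytes as GBK."""
    with open(path, "r", encoding="gbk") as handle:
        return handle.read()


# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(RAW)
    sample_path = tmp.name

try:
    text = read_gbk(sample_path)
finally:
    os.remove(sample_path)

# ascii() shows the code points without depending on the console encoding.
print(ascii(text))

# Encode explicitly when the output must be in a particular encoding.
encoded = text.encode("utf-8")
print(encoded)
```

Printing ascii(text) or the encoded bytes keeps the output pure ASCII, so the sketch does not depend on the console being able to display Chinese.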

Because you are now working with unicode, all further operations are
performed on characters rather than bytes. Processing Chinese is no longer
more difficult than processing a language that confines itself to plain ASCII.
But if you split your text into a list
>>> u.split()
[u'\u8bb0\u8005', u'\u8c22\u91d1\u864e', u'\u3001']

you probably think you are back to square one. That is because Python prints
the repr() of the list items (otherwise a comma inside an item would give the
impression that the list contains more items than it actually does). To get
the actual characters, choose an item explicitly
>>> items = u.split()
>>> print items[0]
记者

or convert the entire list to a string of your liking, e.g.:
>>> print u"[%s]" % u", ".join(items)
[记者, 谢金虎, 、]
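
The difference between working on bytes and working on characters can be checked directly. The following is a minimal Python 3 sketch, assuming the sample bytes from the question; the variable names are chosen only for illustration. Note that in Python 3 the repr() of a string keeps printable non-ASCII characters, so ascii() is used here to reproduce the escaped form shown above.

```python
# GBK-encoded sample bytes, copied from the question.
RAW = b"\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2"

# Decode once at the boundary, then work on text.
text = RAW.decode("gbk")

# Splitting works on both, but the items mean different things.
byte_items = RAW.split()   # chunks of raw bytes
char_items = text.split()  # words made of characters

# Each Chinese character in this sample occupies two bytes in GBK.
byte_lengths = [len(item) for item in byte_items]
char_lengths = [len(item) for item in char_items]
print(byte_lengths, char_lengths)

# ascii() gives the escaped view of the list items.
escaped = ascii(char_items)
print(escaped)

# Joining the items yields one string without the per-item quoting.
joined = "[%s]" % ", ".join(char_items)
print(ascii(joined))
```

The lengths differ because len() counts bytes for a bytes object and characters for a string, which is why decoding early makes the later processing simpler.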

Peter

Oct 14 '05 #2
