Unicode is driving me nuts!

I am trying to parse a huge Chinese corpus using
Python, and I am having a hard time handling the
Chinese characters.

I need to get some particular Chinese characters that
meet a certain standard one by one from the corpus.

Before I parse a sentence and try to locate the
character, I decode the whole string I read in to
Unicode, like so:

str = unicode(raw_str, myencoding)

For example, I tried the 'gbk' and 'cp936' encodings.
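
As a side note on the two encoding names, here is a small sketch (the sample text is made up for illustration, not taken from the corpus) suggesting that Python treats 'cp936' as an alias of 'gbk', so switching between them should not change the result:

```python
import codecs

# Look up both names in Python's codec registry.
gbk_name = codecs.lookup("gbk").name
cp936_name = codecs.lookup("cp936").name

# Hypothetical sample: two Chinese characters encoded as GBK bytes.
raw_str = u"\u4e2d\u6587".encode("gbk")

# Equivalent to unicode(raw_str, "gbk") on Python 2.
text = raw_str.decode("gbk")
```

If the corpus contains characters outside GBK, the 'gb18030' codec, which is a superset of GBK, might be worth trying, though that is only a guess without seeing the data.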

This works just fine with a small sample Chinese
document.

But when I ran the script on the entire corpus, I got
either the typical "incomplete multibyte sequence"
error or "UnicodeEncodeError: 'ascii' codec can't
encode characters in position 0-23: ordinal not in
range(128)".

I am at my wit's end, so frustrated at handling
non-ascii texts.

Any hint would be highly appreciated.


Jul 18 '05 #1