I am trying to parse a huge Chinese corpus using
Python, and I am having a hard time handling the
Chinese characters.
I need to extract, one by one, the particular Chinese
characters in the corpus that meet a certain criterion.
Before I parse a sentence and try to locate the
character, I decode the whole string I read in to
Unicode, like so:
text = unicode(raw_str, myencoding)  # "text" avoids shadowing the built-in str
For myencoding I have tried 'gbk' and 'cp936', for example.
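
To see which bytes are failing, a minimal sketch along these lines
might help. The helper name decode_report is hypothetical (it is not
part of my script), and it assumes the raw data is a byte string in a
GBK-family encoding. It uses the .decode() method rather than
unicode() so that it runs on both Python 2 and Python 3:

```python
def decode_report(raw_bytes, encoding="gbk"):
    """Decode raw_bytes; return (text, offset_of_first_bad_byte_or_None).

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding: raises UnicodeDecodeError on malformed input.
        return raw_bytes.decode(encoding), None
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset where decoding failed.
        # Decode again, substituting U+FFFD for anything undecodable,
        # so the rest of the data can still be inspected.
        return raw_bytes.decode(encoding, "replace"), exc.start
```

The returned offset can be used to look at the surrounding bytes in
the corpus file and check whether the data is truncated or simply in
a different encoding.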
This works just fine with a small sample Chinese
document.
But when I run the script on the entire corpus, I get
either the typical "incomplete multibyte sequence"
error or "UnicodeEncodeError: 'ascii' codec can't
encode characters in position 0-23: ordinal not in
range(128)".
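
One situation that can produce an "incomplete multibyte sequence"
error is reading the file in fixed-size chunks, so that a two-byte
character gets split across a chunk boundary. I have not confirmed
that this is what happens here, but a sketch of chunk-safe decoding
with the standard codecs module would look roughly like this. The
function name iter_decoded and the chunk size are assumptions made
for illustration:

```python
import codecs


def iter_decoded(stream, encoding="gbk", chunk_size=4096):
    """Yield Unicode text from a binary stream, chunk by chunk.

    Hypothetical helper: the incremental decoder buffers a trailing
    partial character until the next chunk supplies its second byte.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush: a dangling lead byte at end of input becomes U+FFFD
    # because errors="replace" was requested above.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

The stream must be opened in binary mode, e.g. open(path, "rb").
The UnicodeEncodeError usually comes from the opposite direction: a
Unicode string being implicitly encoded as ASCII when it is printed
or written out. Encoding explicitly before output, for example
text.encode("utf-8"), is one way to avoid that.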
I am at my wit's end and quite frustrated with
handling non-ASCII text.
Any hint would be highly appreciated.