Thomas Moore wrote:
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> u.split()
[u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']
I think u should get split.
why? with no arguments, split splits on whitespace (basically unicode
category Zs, plus a few control characters such as tab and newline), and
there are no whitespace characters in there:
>>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> [c.isspace() for c in u]
[False, False, False, False, False, False]
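
for comparison, here is a minimal sketch in Python 3 syntax (where str is already Unicode, so the u'' prefix is not needed; the variable names are only for illustration). It shows that split() does break the string once a real whitespace character is present, in this case U+3000 IDEOGRAPHIC SPACE, which is in category Zs:

```python
import unicodedata

# The same six characters as above, written as a Python 3 str.
text = '\u9019\u662f\u4e2d\u6587\u5b57\u4e32'

# No whitespace anywhere, so split() returns the whole string as one item.
no_space = text.split()

# Same characters with U+3000 IDEOGRAPHIC SPACE inserted after the second one.
spaced = '\u9019\u662f\u3000\u4e2d\u6587\u5b57\u4e32'

# U+3000 is category Zs and counts as whitespace, so split() breaks there.
space_category = unicodedata.category('\u3000')
with_space = spaced.split()
```

so split() is doing what it is documented to do; the original string simply contains nothing to split on.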
there's no universal "split on words in all languages" function in the
standard Python library. You may be able to roll your own using the
information in http://www.unicode.org/reports/tr29/ plus functions in
the unicodedata module (which currently doesn't include the BreakTest
tables; patches are welcome). Or maybe Google can help you find an
existing implementation.
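
as a rough idea of what "roll your own" could look like, here is a deliberately naive sketch in Python 3 syntax. It is not an implementation of the TR29 rules: the helper name naive_words is made up for this example, and the policy (every CJK unified ideograph becomes its own token, runs of other letters, marks and digits are kept together, everything else is a separator) is an assumption chosen for simplicity. Single ideographs are not the same thing as Chinese words; grouping them properly generally needs a dictionary or a statistical segmenter.

```python
import unicodedata

def naive_words(text):
    """Hypothetical helper: a crude tokenizer, NOT the TR29 algorithm."""
    words = []
    current = []  # characters of the non-CJK word being collected
    for ch in text:
        # Assumption: identify ideographs by their Unicode character name.
        is_ideograph = unicodedata.name(ch, '').startswith('CJK UNIFIED IDEOGRAPH')
        # Letters (L*), marks (M*) and numbers (N*) count as word characters.
        is_word_char = unicodedata.category(ch)[0] in 'LMN'
        if is_word_char and not is_ideograph:
            current.append(ch)
            continue
        # Anything else ends the word collected so far.
        if current:
            words.append(''.join(current))
            current = []
        # Each ideograph is emitted as a token of its own.
        if is_ideograph:
            words.append(ch)
    if current:
        words.append(''.join(current))
    return words

tokens = naive_words('\u9019\u662f\u4e2d\u6587\u5b57\u4e32')
mixed = naive_words('abc\u4e2d\u6587 def')
```

with this policy the six-character string from the original post comes back as six one-character tokens, which may or may not be what you want.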
</F>