
trying to understand unicode

Python has very good support for Unicode, UTF-8 and encodings, but I
have some difficulty with the concepts and the vocabulary. The
documentation is not bad, but when reading, for example,
http://docs.python.org/lib/module-unicodedata.html
it took me a long time to figure out what unicodedata.digit(unichr)
means; a simple example is badly lacking.
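
As a minimal sketch of the kind of example asked for here, the following compares unicodedata.decimal, unicodedata.digit and unicodedata.numeric on a few characters. It is written for Python 3, where str is already Unicode and chr() replaces unichr(); the helper name describe and the choice of sample characters are only illustrative.

```python
import unicodedata

# Sample characters: DIGIT SEVEN, SUPERSCRIPT TWO,
# VULGAR FRACTION ONE HALF, LATIN CAPITAL LETTER A.
samples = ["7", "\u00b2", "\u00bd", "A"]


def describe(ch):
    """Return (name, decimal, digit, numeric); None where a value is undefined."""
    return (
        unicodedata.name(ch),
        unicodedata.decimal(ch, None),  # only true decimal digits
        unicodedata.digit(ch, None),    # also digit-like characters such as superscripts
        unicodedata.numeric(ch, None),  # any character with a numeric value, as a float
    )


for ch in samples:
    print(describe(ch))
```

The three functions accept progressively more characters: every character with a decimal value also has a digit value, and every character with a digit value also has a numeric value, but not the other way round.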

So I wrote the following script:

#!/usr/bin/env python

"""Example of use of the unicodedata module
http://docs.python.org/lib/module-unicodedata.html
"""

import unicodedata
import sys

# outcodec = 'latin_1'
outcodec = 'iso8859_15'
if len(sys.argv) > 1:
    # an output codec may be given as the first command-line argument
    outcodec = sys.argv[1]

# Python 2 syntax (unichr, print statement)
for c in range(256):
    uc = unichr(c)
    # control characters have no name; skip them
    uname = unicodedata.name(uc, None)
    if uname:
        unfd = unicodedata.normalize('NFD', uc).encode(outcodec, 'replace')
        unfc = unicodedata.normalize('NFC', uc).encode(outcodec, 'replace')
        print str(c).ljust(3), uname.ljust(42), unfd.ljust(2), unfc.ljust(2), \
              unicodedata.category(uc), unicodedata.numeric(uc, None)
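
For readers on Python 3, here is a rough port of the script above, offered as a sketch and not as the author's code. It assumes chr() in place of unichr() and collects the rows in a hypothetical helper table_rows() so that they can be inspected as well as printed. Encoded values are bytes in Python 3, so they are shown with repr().

```python
import unicodedata


def table_rows(outcodec="iso8859_15", limit=256):
    """Collect (code, name, NFD bytes, NFC bytes, category, numeric) for named characters."""
    rows = []
    for c in range(limit):
        uc = chr(c)
        uname = unicodedata.name(uc, None)
        if not uname:
            continue  # unnamed code points (control characters) are skipped
        nfd = unicodedata.normalize("NFD", uc).encode(outcodec, "replace")
        nfc = unicodedata.normalize("NFC", uc).encode(outcodec, "replace")
        rows.append((c, uname, nfd, nfc,
                     unicodedata.category(uc), unicodedata.numeric(uc, None)))
    return rows


# Print only the first few rows to keep the output short.
for c, uname, nfd, nfc, cat, num in table_rows()[:5]:
    print(str(c).ljust(3), uname.ljust(42), repr(nfd).ljust(8),
          repr(nfc).ljust(8), cat, num)
```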

And here are some samples of the output:

44  COMMA                                      ,  ,  Po None
45  HYPHEN-MINUS                               -  -  Pd None
46  FULL STOP                                  .  .  Po None
47  SOLIDUS                                    /  /  Po None
48  DIGIT ZERO                                 0  0  Nd 0.0
49  DIGIT ONE                                  1  1  Nd 1.0
50  DIGIT TWO                                  2  2  Nd 2.0

It seems that the 'Nd' category means 'Number, decimal digit'. Doh!

64  COMMERCIAL AT                              @  @  Po None
65  LATIN CAPITAL LETTER A                     A  A  Lu None
66  LATIN CAPITAL LETTER B                     B  B  Lu None

'Lu' should read 'Letter, uppercase'?

94  CIRCUMFLEX ACCENT                          ^  ^  Sk None
95  LOW LINE                                   _  _  Pc None
96  GRAVE ACCENT                               `  `  Sk None
97  LATIN SMALL LETTER A                       a  a  Ll None
98  LATIN SMALL LETTER B                       b  b  Ll None

'Ll' == 'Letter, lowercase'

124 VERTICAL LINE                              |  |  Sm None
125 RIGHT CURLY BRACKET                        }  }  Pe None
126 TILDE                                      ~  ~  Sm None
160 NO-BREAK SPACE                             *  *  Zs None
161 INVERTED EXCLAMATION MARK                  ¡  ¡  Po None

What a gap!
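
The gap between 126 and 160 comes from the `if uname:` test in the script: code points 127 to 159 are control characters, which have no name, so unicodedata.name() returns the default and the row is skipped. A short Python 3 sketch that checks this (the variable names are only illustrative):

```python
import unicodedata

# Code points below 256 for which unicodedata.name() falls back to the default.
unnamed = [c for c in range(256) if unicodedata.name(chr(c), None) is None]

# General categories of those unnamed code points.
categories = {unicodedata.category(chr(c)) for c in unnamed}

print(unnamed[:3], unnamed[-3:], categories)
```

The same applies to code points 0 to 31, which is why the script's output starts at 32 (SPACE).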

245 LATIN SMALL LETTER O WITH TILDE            o? õ  Ll None
246 LATIN SMALL LETTER O WITH DIAERESIS        o? ö  Ll None
247 DIVISION SIGN                              ÷  ÷  Sm None
248 LATIN SMALL LETTER O WITH STROKE           ø  ø  Ll None
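
The "o?" in the NFD column can be explained as follows. NFD splits an accented letter into a base letter plus a combining mark, and the combining mark has no equivalent in ISO 8859-15, so the 'replace' error handler turns it into "?". NFC keeps the single precomposed character, which the codec can encode. A Python 3 sketch, with encode_forms as a hypothetical helper name:

```python
import unicodedata


def encode_forms(ch, codec="iso8859_15"):
    """Return the (NFD, NFC) forms of ch encoded with codec, replacing unencodable parts."""
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", ch)
    return nfd.encode(codec, "replace"), nfc.encode(codec, "replace")


o_tilde = "\u00f5"  # LATIN SMALL LETTER O WITH TILDE

# NFD yields 'o' followed by U+0303 COMBINING TILDE.
print([unicodedata.name(part) for part in unicodedata.normalize("NFD", o_tilde)])
print(encode_forms(o_tilde))
print(encode_forms("\u00f8"))  # o with stroke has no decomposition
```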

'Sm' should read 'Symbol, math'?
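
The two-letter codes are Unicode general categories: the first letter is the major class (L letter, N number, P punctuation, S symbol, Z separator, C other) and the second refines it. The unicodedata module returns only the abbreviation, so the sketch below uses a small hand-written table covering just the codes seen in the output above. CATEGORY_NAMES and describe_category are hypothetical names, not part of the module.

```python
import unicodedata

# Partial table: only the general categories that appear in the sample output.
CATEGORY_NAMES = {
    "Lu": "Letter, uppercase",
    "Ll": "Letter, lowercase",
    "Nd": "Number, decimal digit",
    "Pc": "Punctuation, connector",
    "Pd": "Punctuation, dash",
    "Pe": "Punctuation, close",
    "Po": "Punctuation, other",
    "Sm": "Symbol, math",
    "Sk": "Symbol, modifier",
    "Zs": "Separator, space",
    "Cc": "Other, control",
}


def describe_category(ch):
    """Return 'code: long name' for ch, or the bare code if it is not in the table."""
    code = unicodedata.category(ch)
    long_name = CATEGORY_NAMES.get(code)
    return "%s: %s" % (code, long_name) if long_name else code


for ch in "A", "a", "0", "|", "^", "_":
    print(repr(ch), describe_category(ch))
```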

I think that such code snippets should be included in the documentation
or in a Wiki.

Regards

Sorry for my bad English; I'm not a native speaker.
Jul 19 '05 #1


On 20 Apr 2005 10:58:35 GMT, "F. Petitjean"
<li***********@news.free.fr> wrote:
> Python has a very good support of unicode, utf8, encodings ... But I
> have some difficulties with the concepts and the vocabulary.

You're not alone there. But I don't expect the docs for the Python
implementation of Unicode to explain the concepts and vocabulary of
Unicode. That's the job of the Unicode Consortium, and they do a
not-unreasonable job of it; see www.unicode.org. In particular,

http://www.unicode.org/Public/UNIDATA/UCD.html

explains all the things that the Python unicodedata module
implements.

> The
> documentation is not bad, but for example in reading
> http://docs.python.org/lib/module-unicodedata.html
> I had a long time to figure out what unicodedata.digit(unichr) would
> mean, a simple example is badly lacking.
>
> So I wrote the following script :
[snip]
> I think that such code snippets should be included in the documentation
> or in a Wiki.


Any effort should be directed (IMESHO) towards (a) keeping the URL in
the Python documentation up-to-date [it's not] and (b) using the *LATEST*
version of the UCD file when each version of Python is released [still
stuck on 3.2.0 when the current version available from Unicode.org is
4.1.0]
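
To see which version of the Unicode database a given interpreter was built with, the module exposes unicodedata.unidata_version. The sketch below assumes Python 3, where the module also keeps a frozen 3.2.0 database as unicodedata.ucd_3_2_0; the version printed on the first line depends on the Python release in use.

```python
import unicodedata

# Version of the Unicode Character Database compiled into this interpreter.
current = unicodedata.unidata_version
print("unicodedata uses UCD", current)

# A frozen copy of the 3.2.0 database, with the same lookup functions.
legacy = unicodedata.ucd_3_2_0
print("legacy database is UCD", legacy.unidata_version)
print(legacy.name("A"), legacy.category("A"))
```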

[Exit, pursued by a bear.]
[Noises off.]

OK OK don't hit me, Martin, how about instructions on how to DIY,
then?

Cheers,
John

Jul 19 '05 #2
