
[perl-python] unicode study with unicodedata module

Python has a nice unicodedata module for working with Unicode character data.

#-*- coding: utf-8 -*-
# python

from unicodedata import *

# each unicode char has a unique name.
# one can use the “lookup” func to find it

mychar=lookup('greek cApital letter sIgma')
# note letter case doesn't matter
print mychar.encode('utf-8')

m=lookup('CJK UNIFIED IDEOGRAPH-5929')
# for some reason, case must be right here.
print m.encode('utf-8')

# to find a char's name, use the “name” function
print name(u'天')
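
For readers on Python 3 (an assumption; the snippet above targets Python 2), a minimal sketch of the same lookups might look like the following. Strings are already Unicode there, so the explicit encode calls are dropped.

```python
import unicodedata

# lookup() maps a character name to the character; ordinary names are
# matched without regard to letter case.
sigma = unicodedata.lookup('greek cApital letter sIgma')
print(sigma)

# CJK ideograph names are derived from the code point, so this example
# keeps the canonical upper-case spelling of the name.
tian = unicodedata.lookup('CJK UNIFIED IDEOGRAPH-5929')
print(tian)

# name() goes the other way, from character to name.
print(unicodedata.name(tian))
```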

Basically, in Unicode, each char has a number of attributes (called
properties) besides its name. These attributes provide the info needed
to form letters and words, or to do processing such as sorting and
capitalization, for the various human scripts. For example, the Latin
alphabet has two forms, upper case and lower case. Korean letters are
stacked together into syllable blocks. Many symbols correspond to
numbers, and there are also combining forms, used for example to put a
bar over any letter or character. Also, some writing systems are
directional. In order to form these symbols for display or to process
them for computing, this info on each char is necessary.

The rest of the functions in unicodedata return these attributes.
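
As a rough illustration of the properties described above, here is a small sketch, assuming Python 3. The sample characters are picked only for illustration; the functions used (category, numeric, combining, bidirectional, normalize, name) are all part of unicodedata.

```python
import unicodedata

# General category: 'Lu' is an upper-case letter, 'Ll' a lower-case one.
upper_cat = unicodedata.category('A')
lower_cat = unicodedata.category('a')

# Numeric value of a symbol that stands for a number.
half = unicodedata.numeric('\u00bd')              # VULGAR FRACTION ONE HALF

# Canonical combining class: non-zero for combining marks.
macron_class = unicodedata.combining('\u0304')    # COMBINING MACRON

# Bidirectional class: 'L' is left-to-right, 'R' is right-to-left.
latin_dir = unicodedata.bidirectional('A')
hebrew_dir = unicodedata.bidirectional('\u05d0')  # HEBREW LETTER ALEF

# A Korean syllable block decomposes into the letters stacked inside it.
jamo = [unicodedata.name(c) for c in unicodedata.normalize('NFD', '\uac00')]

print(upper_cat, lower_cat, half, macron_class, latin_dir, hebrew_dir)
print(jamo)
```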

see unicodedata doc:
http://python.org/doc/2.4/lib/module-unicodedata.html

Official word on unicode character properties:
http://www.unicode.org/uni2book/ch04.pdf

--
i don't know what's the state of Perl's unicode. Is there something
similar?

--
this post is archived at
http://xahlee.org/perl-python/unicodedata_module.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Jul 18 '05 #1
5 Replies


how do i get a unicode's number?

e.g. 03ba for greek lowercase kappa? (or in decimal form)

Xah

Jul 18 '05 #2

On 15 Mar 2005 04:55:17 -0800, rumours say that "Xah Lee" <xa*@xahlee.org> might
have written:
how do i get a unicode's number?

e.g. 03ba for greek lowercase kappa? (or in decimal form)


you get the character with:

>>> uc = u"\N{GREEK SMALL LETTER KAPPA}"

or with

>>> import unicodedata
>>> uc = unicodedata.lookup("GREEK SMALL LETTER KAPPA")

and you get the ordinal with:

>>> ord(uc)
954

ord works for both str and unicode objects.
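
To get the number in both of the forms asked about (hex and decimal), a minimal sketch, assuming Python 3, could be:

```python
import unicodedata

kappa = unicodedata.lookup('GREEK SMALL LETTER KAPPA')

code_point = ord(kappa)               # decimal code point
hex_form = format(code_point, '04x')  # zero-padded lower-case hex

print(code_point, hex_form)

# chr() is the inverse of ord() in Python 3.
assert chr(code_point) == kappa
```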
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #3

Xah Lee wrote:
i don't know what's the state of Perl's unicode.


perldoc perlunicode

Jul 18 '05 #4

Here's a snippet of code that prints a range of Unicode chars, along
with their ordinal in hex, and their name.

Chars without a name are skipped. (Some of these are unassigned code
points.)

On Microsoft Windows the encoding might need to be changed to utf-16.

Change the range to see different Unicode chars.

# -*- coding: utf-8 -*-
# python 2

from unicodedata import name

for i in range(0x0000, 0x0fff):
    # unichr builds the char directly, so no eval is needed
    x = unichr(i)
    # name() returns the default '-' for chars without a name; skip those
    if name(x, '-') != '-':
        print x.encode('utf-8'), '|', "%04x" % i, '|', name(x, '-')
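
For Python 3 (an assumption; the snippet above is Python 2), the same table could be sketched roughly as below. The helper name named_chars is hypothetical and introduced only for this example.

```python
import unicodedata

def named_chars(start, stop):
    """Yield (char, hex ordinal, name) for each named char in [start, stop)."""
    for i in range(start, stop):
        c = chr(i)
        # name() returns the supplied default instead of raising ValueError
        # when the char has no name.
        n = unicodedata.name(c, None)
        if n is not None:
            yield c, '%04x' % i, n

# Print a small slice of the Greek block.
for c, hex_ord, n in named_chars(0x03b1, 0x03b4):
    print(c, '|', hex_ord, '|', n)
```

Change the two arguments to named_chars to see a different range.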
--
http://xahlee.org/perl-python/unicodedata_module.html

Anyone want to supply a Perl version?

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Brian McCauley wrote:
Xah Lee wrote:
i don't know what's the state of Perl's unicode.


perldoc perlunicode


Jul 18 '05 #5

Fuck google incorporated for editing my subject name without
permission.

and fuck google incorporated for editing my message content without
permission.

http://xahlee.org/UnixResource_dir/w...e_license.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Jul 18 '05 #6
