unicodedata implementation - categories


Hi,

I'm trying to understand how CPython implements unicodedata, with a view to
providing an implementation for Jython. This is a background, low priority
thing for me, since I last posted to this list about it in February!

Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> c = unichr(0x10FFFF)
>>> unicodedata.name(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(unichr(0x10FFFF))
'Cn'

0x10FFFF is not an assigned character in Unicode 4.1, which is the version of
the Unicode standard that Python 2.5 supports.

So I have a couple of questions:

1) Why doesn't the category method raise an Exception, like the name method
does?
2) Given that the category method doesn't currently raise an Exception,
please could someone explain how the category is calculated? I have tried to
figure it out based on the CPython code, but I have thus far failed, and I
would also prefer to have it explicitly defined, rather than mandating that
a Jython (.NET, etc) implementation uses the same (possibly non-optimal for
Java) data structures and algorithms.

My background is Mathematics rather than pure Computer Science, so doubtless
I still have some gaps in my education to be filled when it comes to data
structures and algorithms and I would welcome the opportunity to fill some
of those in. References to Knuth or some on-line reading would be much
appreciated, to help me understand the CPython part.

Cheers,

James
--
View this message in context: http://www.nabble.com/unicodedata-im...html#a13193027
Sent from the Python - python-list mailing list archive at Nabble.com.

Oct 13 '07 #1
Cn is the "Other, Not Assigned" general category in Unicode. No assigned
character has this category. I'm not sure why it doesn't raise an
exception, but if category() returns Cn, then you know the code point is
not an assigned character.
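
A minimal sketch of that check, written for Python 3 (where chr() replaces the unichr() used in the session above). The helper names describe() and is_assigned() are made up for illustration and are not part of unicodedata. It also shows that name() raising is not by itself a sign of an unassigned code point, because control characters have a category but no name:

```python
import unicodedata

def describe(code_point):
    """Return (category, name or None) for a code point without raising."""
    ch = chr(code_point)  # Python 3 spelling of the thread's unichr()
    # name() accepts a default that is returned instead of raising ValueError.
    return unicodedata.category(ch), unicodedata.name(ch, None)

def is_assigned(code_point):
    """True unless the general category is Cn (Other, Not Assigned)."""
    return unicodedata.category(chr(code_point)) != "Cn"

print(describe(0x10FFFF))  # category 'Cn', no name
print(describe(0x41))      # an ordinary letter has both a category and a name
print(describe(0x00))      # a control character: category 'Cc', but no name
```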

Oct 13 '07 #2
> 1) Why doesn't the category method raise an Exception, like the name method
> does?
As Chris explains, the result category means "Other, Not Assigned".
Python returns this category because it's the truth: for those
characters, the value of the "category" property really *is* Cn;
it means that they are not assigned.

If you are wondering how unicodedata.c comes up with the result:
the unassigned characters get a record index of 0, and that has a
category value of 0, which is "Cn".
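
To make the "record index 0" idea concrete, here is a toy sketch of a two-level table lookup. It is not the CPython source: the table contents, the SHIFT value and every name in it are assumptions chosen for illustration. The point is only that every code point without data falls through to record 0, whose category index 0 names "Cn":

```python
SHIFT = 8                 # 256 code points per block; arbitrary for this sketch
BLOCK = 1 << SHIFT
MAX_CODE_POINT = 0x10FFFF

CATEGORY_NAMES = ["Cn", "Lu", "Ll", "Nd"]   # category index 0 is "Cn"
# Hypothetical records; real ones would carry more properties than a category.
RECORDS = [(0,), (1,), (2,), (3,)]          # record 0 has category index 0

def build_tables(record_index_by_code_point):
    """Compress {code_point: record_index} into two index tables."""
    seen = {}     # block contents -> block number, so identical blocks are shared
    index1 = []   # one entry per block of code points: which stored block to use
    index2 = []   # the stored blocks, concatenated
    for start in range(0, MAX_CODE_POINT + 1, BLOCK):
        block = tuple(record_index_by_code_point.get(cp, 0)  # default: record 0
                      for cp in range(start, start + BLOCK))
        if block not in seen:
            seen[block] = len(seen)
            index2.extend(block)
        index1.append(seen[block])
    return index1, index2

def category(code_point, index1, index2):
    """Look up the category name through both index tables."""
    block = index1[code_point >> SHIFT]
    record_index = index2[(block << SHIFT) + (code_point & (BLOCK - 1))]
    return CATEGORY_NAMES[RECORDS[record_index][0]]

# Three sample code points with data; everything else is left unassigned.
INDEX1, INDEX2 = build_tables({0x41: 1, 0x61: 2, 0x31: 3})

print(category(0x41, INDEX1, INDEX2))
print(category(0x10FFFF, INDEX1, INDEX2))
```

With only three sample code points, all blocks except the first are identical and share one stored all-zero block, which is why a layout like this stays small even though most of the code space is unassigned.
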
> 2) Given that the category method doesn't currently raise an Exception,
> please could someone explain how the category is calculated? I have tried to
> figure it out based on the CPython code, but I have thus far failed, and I
> would also prefer to have it explicitly defined, rather than mandating that
> a Jython (.NET, etc) implementation uses the same (possibly non-optimal for
> Java) data structures and algorithms.
You definitely should *not* follow the Python implementation. Instead,
the Unicode database is defined by the Unicode consortium, so the
Unicode standard is the ultimate specification.
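
As a rough sketch of what "follow the standard" can look like, the category can be derived straight from the semicolon-separated UnicodeData.txt format: field 0 is the code point in hex, field 1 the name, field 2 the general category, and a code point that is not listed defaults to Cn. The SAMPLE lines are only an illustrative excerpt in that format (the end of the CJK range differs between Unicode versions), and parse_ranges() and category() are hypothetical names:

```python
import bisect

# Illustrative excerpt in UnicodeData.txt format, not a complete database.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FBB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

def parse_ranges(text):
    """Return sorted (first, last, category) ranges from UnicodeData.txt-style lines."""
    ranges = []
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code_point          # a range is given as a First/Last pair
        elif name.endswith(", Last>"):
            ranges.append((range_start, code_point, category))
            range_start = None
        else:
            ranges.append((code_point, code_point, category))
    ranges.sort()
    return ranges

def category(code_point, ranges):
    """General category of a code point; unlisted code points default to 'Cn'."""
    # Find the last range whose first code point is <= code_point.
    i = bisect.bisect_right(ranges, (code_point, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= code_point <= ranges[i][1]:
        return ranges[i][2]
    return "Cn"

RANGES = parse_ranges(SAMPLE)
print(category(0x41, RANGES), category(0x5000, RANGES), category(0x10FFFF, RANGES))
```

A sorted range list with binary search is just one possible layout; nothing in the standard requires it, which is the freedom a Jython port can use.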

To implement it in Java, I recommend using java.lang.Character.getType.
If that returns java.lang.Character.UNASSIGNED, return "Cn".

Regards
Martin
Oct 14 '07 #3
