unicodedata implementation

Hi,

[Originally posted this to the dev list, but the moderator advised
posting here first]

I'm looking into implementing this module for Jython, and I'm trying
to understand the contracts promised by the various methods. Please
bear in mind that means I'm probably targeting the CPython
implementation as of 2.3, although I would obviously be quite happy if
my implementation doesn't need too much extra to fit the 2.5
functionality!

As someone has previously posted [1], the documentation is a little
thin and they were pointed at the Unicode specification [2]. I've done
a little reading there, and have a little knowledge now, which is
always dangerous. There are still gaps, and I was hoping someone here
might be able to point out what I'm missing.

My problem is described here [3], but I'll summarise and add a little to it.

2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;

(Entry for code point 0x2468 in UnicodeData.txt [4] for Unicode 3.2.0 [5])

verify(unicodedata.decimal(u'\u2468',None) is None)
verify(unicodedata.digit(u'\u2468') == 9)
verify(unicodedata.numeric(u'\u2468') == 9.0)

That works fine, and I can see why in the UnicodeData.txt entry. The
mirrored property (N) towards the end is a handy marker: the three
fields immediately before it are, in order, decimal, digit and numeric.
Here the decimal property isn't defined, the digit property is 9 and
the numeric property is also 9.
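
The field counting described above can be sketched in a few lines. This is only an illustration, not how CPython reads the database: the helper name numeric_fields and the index constants are made up for the example, and the numeric field is parsed with Fraction on the assumption that it may hold a value such as 1/2.

```python
from fractions import Fraction

# 0-based positions of the fields in a semicolon-separated
# UnicodeData.txt record; the mirrored flag (index 9) follows them.
DECIMAL, DIGIT, NUMERIC, MIRRORED = 6, 7, 8, 9


def numeric_fields(record):
    """Return (decimal, digit, numeric) for one record; unset fields are None."""
    fields = record.strip().split(";")
    decimal = int(fields[DECIMAL]) if fields[DECIMAL] else None
    digit = int(fields[DIGIT]) if fields[DIGIT] else None
    # Fraction accepts both "9" and "1/2" style values.
    numeric = Fraction(fields[NUMERIC]) if fields[NUMERIC] else None
    return decimal, digit, numeric


# A record shaped like the U+2468 entry; the decomposition field
# (index 5) plays no part in the lookup.
record = "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;"
print(numeric_fields(record))
```

Returning None for an empty field mirrors the way the verify() calls pass None as the default argument.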

However, this next bit is what confuses me:

325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;

(Entry for code point 0x325F in UnicodeData.txt for Unicode 3.2.0)

verify(unicodedata.decimal(u'\u325F',None) is None)
verify(unicodedata.digit(u'\u325F', None) is None)
verify(unicodedata.numeric(u'\u325F') == 35.0)

The last one fails - ValueError: not a numeric character.

Now, again looking at the UnicodeData.txt entry and the mirrored N
property, working back three fields and going forward from there shows
that the decimal property isn't set, the digit property isn't set and
the numeric property appears to be 35.

So from my understanding of the Unicode (3.2.0) spec, the code point
0x325F has a numeric property with a value of 35, but the Python (2.3
and 2.4 - I haven't put 2.5 onto my box yet) implementation of
unicodedata disagrees, presumably for good reason.

I can't see where I'm going wrong.

Cheers,

James

[1] http://groups.google.com/group/comp....bdda27be118836
[2] http://www.unicode.org/
[3] http://eternusuk.blogspot.com/2007/0...-overview.html
[4] http://www.unicode.org/Public/3.2-Up...Data-3.2.0.txt
[5] http://www.unicode.org/Public/3.2-Up...ata-3.2.0.html
Feb 18 '07 #1
James Abley wrote:
> So from my understanding of the Unicode (3.2.0) spec, the code point
> 0x325F has a numeric property with a value of 35, but the Python (2.3
> and 2.4 - I haven't put 2.5 onto my box yet) implementation of
> unicodedata disagrees, presumably for good reason.
>
> I can't see where I'm going wrong.

You might not be wrong at all. CPython has a hard-coded list for the
numeric mapping (see Objects/unicodectype.c), and that list hadn't been
updated even when the rest of the character database was updated.
Patch #1494554 corrected this and updated the numeric properties to
Unicode 4.1, for Python 2.5.
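
To see which behaviour a given interpreter has, something along these lines can be run against it. It is a rough probe rather than a test suite: the helper name numeric_properties is invented for the example, it is written for current Python 3 (on 2.x the literals would need the u'' prefix), and it passes None as the default so that a missing property shows up as None instead of a ValueError.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None for unset properties."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# Report the bundled Unicode database version next to each result,
# since the answer depends on it.
for ch in ("\u2468", "\u325f"):
    print("U+%04X" % ord(ch), unicodedata.unidata_version, numeric_properties(ch))
```

On an interpreter with the stale table, the numeric entry for U+325F would come back as None here.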

A patch that would generate this function automatically, instead of
maintaining it by hand, is still pending.

HTH,
Martin
Feb 22 '07 #2
