unicodedata.normalize (NFD -> NFC) inconsistency

I found at least one case where decombining and recombining a unicode
character does not result in the same character (see at end).

I have no extensive knowledge about Unicode, yet I believe that this
must be a problem of the Unicode 3.2 specification and not Python's.
However, I haven't found out how the decomp_data (in unicodedata_db.h)
is built, and neither did I find much more info about the specifics of
Unicode 3.2. I thought I'd post here so that someone more knowledgeable
could take a look.

If we find out that it's a problem with Python, I'll open a bug report
(and volunteer work).

*** Example ***
>>> import unicodedata as ud
>>> def report(utext):
...     for uchar in utext:
...         print ord(uchar), ud.name(uchar)
...
>>> u1=u'\N{greek small letter alpha with oxia}'
>>> report(u1)
8049 GREEK SMALL LETTER ALPHA WITH OXIA
>>> u2=ud.normalize('NFD', u1)
>>> report(u2)
945 GREEK SMALL LETTER ALPHA
769 COMBINING ACUTE ACCENT
>>> u3=ud.normalize('NFC', u2)
>>> report(u3)
940 GREEK SMALL LETTER ALPHA WITH TONOS
*** End of Example ***

I can understand this confusion: since, as far as I can find, there is no
COMBINING GREEK TONOS or COMBINING TONOS ACCENT in the Unicode table,
the decomposition has to use the 'oxeia' (acute) accent...
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
Jul 18 '05 #1
Christos TZOTZIOY Georgiou wrote:
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.


Without checking the details: very well possible. Could this be
an instance of python.org/sf/1054943 ?

Regards,
Martin
Jul 18 '05 #2
Christos TZOTZIOY Georgiou wrote:
> I found at least one case where decombining and recombining a unicode
> character does not result in the same character (see at end).
>
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.


I've been spending some time lately writing a normalizer (in PHP of all
things -- yeesh!), and yes Unicode is a scary world. :) Although it may
seem counterintuitive, it is in fact perfectly legitimate for a
character not to be its own canonical composition.
> >>> u1=u'\N{greek small letter alpha with oxia}'
> >>> report(u1)
> 8049 GREEK SMALL LETTER ALPHA WITH OXIA


This character is a "singleton decomposition". It decomposes into GREEK
SMALL LETTER ALPHA WITH TONOS, which further decomposes into GREEK SMALL
LETTER ALPHA and a COMBINING ACUTE ACCENT.
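
That chain can be inspected with unicodedata.decomposition, which returns a character's decomposition mapping as space-separated hex code points (or an empty string when there is none). The following is only a minimal sketch, assuming Python 3 (where str is already Unicode) rather than the Python 2 used elsewhere in this thread. canonical_steps is a hypothetical helper name, and it only follows a mapping while it is a single code point, so it is not a full NFD implementation.

```python
import unicodedata as ud

def canonical_steps(char):
    """Return the successive canonical decomposition mappings, starting at char.

    Hypothetical helper: it stops as soon as a mapping yields more than one
    code point, so it illustrates singleton chains and nothing more.
    """
    steps = [char]
    while len(steps[-1]) == 1:
        mapping = ud.decomposition(steps[-1])
        # An empty string means "no decomposition"; a leading '<tag>' marks a
        # compatibility mapping, which canonical normalization ignores.
        if not mapping or mapping.startswith('<'):
            break
        steps.append(''.join(chr(int(cp, 16)) for cp in mapping.split()))
    return steps

for step in canonical_steps('\N{GREEK SMALL LETTER ALPHA WITH OXIA}'):
    print(' + '.join(ud.name(c) for c in step))
# GREEK SMALL LETTER ALPHA WITH OXIA
# GREEK SMALL LETTER ALPHA WITH TONOS
# GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT
```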

It is by definition not normalized, so when you normalize it to form C
it will turn into GREEK SMALL LETTER ALPHA WITH TONOS; there is no way
to get "back" to the original character in a normalized string. For some
more info see:
http://www.unicode.org/unicode/repor...ion_List_Table
> >>> u2=ud.normalize('NFD', u1)
> >>> report(u2)
> 945 GREEK SMALL LETTER ALPHA
> 769 COMBINING ACUTE ACCENT
> >>> u3=ud.normalize('NFC', u2)
> >>> report(u3)
> 940 GREEK SMALL LETTER ALPHA WITH TONOS


You should get this same result directly for ud.normalize('NFC', u1).
Converting directly to NFC should always give the same result as
converting to NFD and then NFC. Either will give you back the string you
started with if and only if it's already normalized to form C.
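
Both claims can be checked for the character in question. This is a sketch only, again assuming Python 3 syntax; is_nfc is a hypothetical helper that just compares a string with its own NFC form.

```python
import unicodedata as ud

def is_nfc(text):
    """Hypothetical helper: True if text is unchanged by NFC normalization."""
    return ud.normalize('NFC', text) == text

u1 = '\N{GREEK SMALL LETTER ALPHA WITH OXIA}'

# Normalize straight to NFC, and via NFD first; the two should agree.
direct = ud.normalize('NFC', u1)
via_nfd = ud.normalize('NFC', ud.normalize('NFD', u1))

print(direct == via_nfd)            # True
print(ud.name(direct))              # GREEK SMALL LETTER ALPHA WITH TONOS
print(is_nfc(u1), is_nfc(direct))   # False True
```

The last line shows why the round trip does not return the original: the OXIA character is not itself in form C, while the TONOS character is.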

-- brion vibber (brion @ pobox.com)
Jul 18 '05 #3
On Mon, 08 Nov 2004 17:40:47 -0800, rumours say that Brion Vibber
<br***@pobox.com> might have written:
> I've been spending some time lately writing a normalizer (in PHP of all
> things -- yeesh!), and yes Unicode is a scary world. :)
> ....
> http://www.unicode.org/unicode/repor...ion_List_Table


Thanks for the pointer; it is very informative and explains why the
observed behaviour is well within what Unicode defines. Thanks go to
Martin also for taking a look at this.
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
Jul 18 '05 #4