unicodedata.normalize (NFD - NFC) inconsistency

Christos TZOTZIOY Georgiou

I found at least one case where decombining and recombining a unicode
character does not result in the same character (see at end).

I have no extensive knowledge about Unicode, yet I believe that this
must be a problem of the Unicode 3.2 specification and not Python's.
However, I haven't found out how the decomp_data (in unicodedata_db.h)
is built, and neither did I find much more info about the specifics of
Unicode 3.2. I thought about posting here; anyone more knowing could
give it a look.

If we find out that it's a problem with Python, I'll open a bug report
(and volunteer work).

*** Example ***

import unicodedata as ud
def report(utext):
    for uchar in utext:
        print ord(uchar), ud.name(uchar)

u1=u'\N{greek small letter alpha with oxia}'
report(u1)
8049 GREEK SMALL LETTER ALPHA WITH OXIA
u2=ud.normalize('NFD', u1)
report(u2)
945 GREEK SMALL LETTER ALPHA
769 COMBINING ACUTE ACCENT
u3=ud.normalize('NFC', u2)
report(u3)
940 GREEK SMALL LETTER ALPHA WITH TONOS

*** End of Example ***

I can understand the confusion: since, as far as I have found, there is
no COMBINING GREEK TONOS or COMBINING TONOS ACCENT in the Unicode table,
decomposition has to use the 'oxeia' (acute) accent...
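
The claim about missing names can be checked with unicodedata.lookup. This is only a sketch, written for Python 3 (the session above is Python 2); the helper name_exists is a made-up name for illustration, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata as ud

def name_exists(name):
    """Return True if `name` is a character name known to unicodedata."""
    try:
        ud.lookup(name)  # raises KeyError for an unknown character name
        return True
    except KeyError:
        return False

# The two names guessed at above, plus the accent NFD actually produces.
for candidate in ("COMBINING GREEK TONOS",
                  "COMBINING TONOS ACCENT",
                  "COMBINING ACUTE ACCENT"):
    print(candidate, name_exists(candidate))
```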
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek

Jul 18 '05 #1

Martin v. Löwis

Christos TZOTZIOY Georgiou wrote:

> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.

Without checking the details: very well possible. Could this be
an instance of python.org/sf/1054943 ?

Regards,
Martin

Jul 18 '05 #2

Brion Vibber

Christos TZOTZIOY Georgiou wrote:

> I found at least one case where decombining and recombining a unicode
> character does not result in the same character (see at end).
>
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.

I've been spending some time lately writing a normalizer (in PHP of all
things -- yeesh!), and yes Unicode is a scary world. :) Although it may
seem counterintuitive, it is in fact perfectly legitimate for a
character not to be its own canonical composition.

> u1=u'\N{greek small letter alpha with oxia}'
> report(u1)
> 8049 GREEK SMALL LETTER ALPHA WITH OXIA

This character is a "singleton decomposition". It decomposes into GREEK
SMALL LETTER ALPHA WITH TONOS, which further decomposes into GREEK SMALL
LETTER ALPHA and a COMBINING ACUTE ACCENT.

It is by definition not normalized, so when you normalize it to form C
it will turn into GREEK SMALL LETTER ALPHA WITH TONOS; there is no way
to get "back" to the original character in a normalized string. For some
more info see:
http://www.unicode.org/unicode/repor...ion_List_Table
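
A rough way to see the two-step decomposition from Python, assuming Python 3's unicodedata module (the thread itself uses Python 2). The helper mapping is a hypothetical name, and it assumes the decomposition field carries no compatibility tag such as <compat>, which holds for these two characters.

```python
import unicodedata as ud

oxia = "\N{GREEK SMALL LETTER ALPHA WITH OXIA}"    # U+1F71
tonos = "\N{GREEK SMALL LETTER ALPHA WITH TONOS}"  # U+03AC

def mapping(char):
    """Return the names of the code points in char's decomposition mapping.

    Assumes a canonical mapping (space-separated hex code points only).
    """
    fields = ud.decomposition(char).split()
    return [ud.name(chr(int(code, 16))) for code in fields]

print(mapping(oxia))   # a single code point: the "singleton" case
print(mapping(tonos))  # base letter followed by the combining accent
```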

> u2=ud.normalize('NFD', u1)
> report(u2)
> 945 GREEK SMALL LETTER ALPHA
> 769 COMBINING ACUTE ACCENT
> u3=ud.normalize('NFC', u2)
> report(u3)
> 940 GREEK SMALL LETTER ALPHA WITH TONOS

You should get this same result directly for ud.normalize('NFC', u1).
Converting directly to NFC should always give the same result as
converting to NFD and then NFC. Either will give you back the string you
started with if and only if it's already normalized to form C.
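
The two statements above can be spot-checked with a short sketch, again assuming Python 3. The function names nfc_direct and nfc_via_nfd are made up for illustration, and two sample characters are a demonstration rather than a proof.

```python
import unicodedata as ud

def nfc_direct(text):
    """Normalize straight to form C."""
    return ud.normalize("NFC", text)

def nfc_via_nfd(text):
    """Normalize to form D first, then to form C."""
    return ud.normalize("NFC", ud.normalize("NFD", text))

oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA, not in form C
tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS, already in form C

for s in (oxia, tonos):
    same_route = nfc_direct(s) == nfc_via_nfd(s)  # both routes agree
    unchanged = nfc_direct(s) == s                # True only if s was in NFC
    print(ud.name(s), same_route, unchanged)
```

Only the TONOS character survives unchanged; the OXIA character comes back as TONOS by either route.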

-- brion vibber (brion @ pobox.com)

Jul 18 '05 #3

Christos TZOTZIOY Georgiou

On Mon, 08 Nov 2004 17:40:47 -0800, rumours say that Brion Vibber
<br***@pobox.com> might have written:

> I've been spending some time lately writing a normalizer (in PHP of all
> things -- yeesh!), and yes Unicode is a scary world. :)
> ....
> http://www.unicode.org/unicode/repor...ion_List_Table

Thanks for the pointer, very informative, explaining why the observed
behaviour is well inside the definition of Unicode. Thanks go to Martin
also for taking a look at this.
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek

Jul 18 '05 #4
