Unicode regular expressions -- buggy?

Christopher Subich

I don't think the python regular expression module correctly handles
combining marks; it gives inconsistent results between equivalent forms
of some regular expressions:

>>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
>>> re.match('\w',unicodedata.normalize('NFD',u'\xf1'),re.UNICODE).group(0)
u'n'
>>> re.match('\w',unicodedata.normalize('NFC',u'\xf1'),re.UNICODE).group(0)
u'\xf1'

In the above example, u'\xf1' is n-with-tilde (ñ). NFC happens to be a
no-op, and NFD decomposes it into u'n\u0303', which splits out the tilde
as a combining mark.
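To make the two forms concrete, here is a minimal sketch that inspects the code points each normalization produces. It assumes Python 3 (where every str is Unicode) rather than the 2.4 interpreter used above, and the variable names are only for illustration.

```python
import unicodedata

s = "\xf1"  # n-with-tilde as a single precomposed code point

nfc = unicodedata.normalize("NFC", s)  # composed form, unchanged here
nfd = unicodedata.normalize("NFD", s)  # decomposed form: base letter + mark

# List the code points in each form.
print([hex(ord(c)) for c in nfc])
print([hex(ord(c)) for c in nfd])

# The second code point of the NFD form is a combining mark (category Mn),
# not a letter, which is why \w stops after the base letter.
print(unicodedata.category(nfd[1]), unicodedata.combining(nfd[1]))
```

The two strings render identically but compare unequal, so any code point based matcher sees them as different input.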

Is this a limitation-by-design, or a bug? If the latter, is it already
known/to-be-fixed?

Aug 11 '05 #1


Fredrik Lundh

Christopher Subich wrote:

I don't think the python regular expression module correctly handles
combining marks; it gives inconsistent results between equivalent forms
of some regular expressions:
[...]
Is this a limitation-by-design, or a bug?

limitation by design. if you want correct results, make sure to use
early normalization everywhere.

cf. http://www.w3.org/TR/charmod-norm/
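One way to read "early normalization" is to normalize text to a single form at the point where it enters the program, before any matching happens. The sketch below assumes Python 3 and picks NFC as that form; `nfc_match` is a hypothetical helper, not part of the re module, and it assumes the pattern itself is already written in NFC.

```python
import re
import unicodedata

def nfc_match(pattern, text, flags=0):
    """Hypothetical helper: normalize the text to NFC, then call re.match.

    Assumption: the pattern is already in NFC; only the input is normalized.
    """
    return re.match(pattern, unicodedata.normalize("NFC", text), flags)

decomposed = "n\u0303"  # 'n' followed by a combining tilde (NFD form)

# Matching the raw decomposed string: \w consumes only the base letter.
raw = re.match(r"\w", decomposed).group(0)

# Matching after normalization: \w consumes the precomposed letter.
normalized = nfc_match(r"\w", decomposed).group(0)

print(ascii(raw), ascii(normalized))
```

This only helps where a precomposed form exists; a base letter with a mark that has no precomposed equivalent stays as two code points even under NFC.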

</F>

Aug 11 '05 #2
