Unicode regular expressions -- buggy?

I don't think the Python regular expression module correctly handles
combining marks; it gives inconsistent results between equivalent forms
of some regular expressions:
>>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
>>> re.match('\w',unicodedata.normalize('NFD',u'\xf1'),re.UNICODE).group(0)
u'n'
>>> re.match('\w',unicodedata.normalize('NFC',u'\xf1'),re.UNICODE).group(0)
u'\xf1'

In the above example, u'\xf1' is n-with-tilde (ñ). NFC happens to be a
no-op, and NFD decomposes it into u'n\u0303', which splits out the tilde
as a combining mark.
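
The decomposition described above can be inspected directly. This is a minimal sketch written for Python 3 (the session above ran on 2.4.1, where the u'' prefix was required); the variable names are only illustrative.

```python
import unicodedata

composed = "\u00f1"  # n-with-tilde as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# NFD splits the letter into a base 'n' followed by U+0303 COMBINING TILDE.
print([hex(ord(c)) for c in decomposed])

# The base letter is category Ll; the tilde is a nonspacing mark (Mn).
print([unicodedata.category(c) for c in decomposed])

# NFC leaves the already-composed form unchanged.
print(unicodedata.normalize("NFC", composed) == composed)
```

Because the combining tilde is a mark rather than a letter, a single \w consumes only the base 'n' of the decomposed form.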

Is this a limitation-by-design, or a bug? If the latter, is it already
known/to-be-fixed?
Aug 11 '05 #1
Christopher Subich wrote:
> I don't think the Python regular expression module correctly handles
> combining marks; it gives inconsistent results between equivalent forms
> of some regular expressions:
>
> Is this a limitation-by-design, or a bug?


limitation by design. if you want correct results, make sure to use
early normalization everywhere.

cf. http://www.w3.org/TR/charmod-norm/

</F>
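
One way to read the "early normalization" advice above, as a rough sketch only: normalize every incoming string to a single form (NFC is assumed here) before it reaches the regular expression. The helper name first_word_char is hypothetical, and the code targets Python 3, where str patterns match Unicode by default and re.UNICODE is not needed.

```python
import re
import unicodedata

def first_word_char(text):
    # Hypothetical helper: normalize to NFC first, then match one \w.
    normalized = unicodedata.normalize("NFC", text)
    match = re.match(r"\w", normalized)
    return match.group(0) if match else None

nfd_input = unicodedata.normalize("NFD", "\u00f1")  # 'n' + combining tilde
nfc_input = "\u00f1"                                # single code point

# Without normalization, the two equivalent inputs match differently.
raw_nfd = re.match(r"\w", nfd_input).group(0)
raw_nfc = re.match(r"\w", nfc_input).group(0)
print(raw_nfd == raw_nfc)

# With normalization applied up front, both inputs give the same match.
print(first_word_char(nfd_input) == first_word_char(nfc_input))
```

This only removes the inconsistency for characters that have a precomposed form; a base letter followed by a mark with no composed equivalent would still be split by \w.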

Aug 11 '05 #2
