regexps with unicode-aware characterclasses?

Stefan Rank

Hi all,

in a python re pattern, how do I match all unicode uppercase characters
(in a unicode string/in a utf-8 string)?

I know that there is string.uppercase/.lowercase which are
'locale-aware', but I don't think there is a "all locales" locale.

I know that there is a re.U switch that makes \w match all unicode word
characters, but there are no subclasses of that ([[:upper:]] or
preferably \u).
Or is there a module/extension to get that?

There is the module unicodedata, but it has no unicodedata.uppercase
that would correspond to string.uppercase.

<wishful thinking>

re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

or::

re.compile('(?u)[[:upper:]]')

or::

re.compile('(?u)\u')

for the latter two, to work on utf-8 strings, would I have to set the
defaultencoding to utf-8?

</wishful thinking>

Aug 30 '05 #1

Subscribe Post Reply

1156

Martin v. LÃ¶wis

Stefan Rank wrote:

<wishful thinking>

re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))
This would (almost) work, but it would be terribly inefficient (time
linear to the number of alternatives). You can realistically do

uppers = [u'[']
for i in range(sys.maxunicode):
c = unichr(i)
if c.isupper(): uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.

(syntax note: this only works because none of the characters special
to a RE class (]-^\) is an uppercase letter; otherwise, escaping might
be needed)
for the latter two, to work on utf-8 strings, would I have to set the
defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.

Regards,
Martin

Sep 14 '05 #2

Similar topics

Can somebody help out with RegExps?

by: R. Tarazi | last post by:

Hello together, I'm having extreme difficulties using RegExps for a specific problem and would really appreciate any help and hope somebody will read through my "long" posting... 1. <?php...

PHP

Expanding regexps

by: Klaus Alexander Seistrup | last post by:

Hi, Is there a way to "expand" simple regexps? Something along the lines of: #v+ >>> rx = '(a|b)c?(d|f)' >>> expand_regexp(rx)

Python

Writing UTF-8 string to UNICODE file

by: Michael Weir | last post by:

I'm sure this is a very simple thing to do, once you know how to do it, but I am having no fun at all trying to write utf-8 strings to a unicode file. Does anyone have a couple of lines of code...

Python

Unicode from Web to MySQL

by: Bill Eldridge | last post by:

I'm trying to grab a document off the Web and toss it into a MySQL database, but I keep running into the various encoding problems with Unicode (that aren't a problem for me with GB2312, BIG 5,...

Python

Recursive regexps?

by: Magnus Lie Hetland | last post by:

Hi! I've been looking at ways of dealing with nested structures in regexps (becuase I figured that would be faster than the Python parsing code I've currently got) and came across a few...

Python

minidom xml & non ascii / unicode & files

by: webdev | last post by:

lo all, some of the questions i'll ask below have most certainly been discussed already, i just hope someone's kind enough to answer them again to help me out.. so i started a python 2.3...

Python

Revised PEP 349: Allow str() to return unicode strings

by: Neil Schemenauer | last post by:

python-dev@python.org.] The PEP has been rewritten based on a suggestion by Guido to change str() rather than adding a new built-in function. Based on my testing, I believe the idea is...

Python

Convertion of Unicode to ASCII NIGHTMARE

by: ChaosKCW | last post by:

Hi I am reading from an oracle database using cx_Oracle. I am writing to a SQLite database using apsw. The oracle database is returning utf-8 characters for euopean item names, ie special...

Python

Finding Upper-case characters in regexps, unicode friendly.

by: possibilitybox | last post by:

I'm trying to make a unicode friendly regexp to grab sentences reasonably reliably for as many unicode languages as possible, focusing on european languages first, hence it'd be useful to be able...

Python

regexps: dollar sign, lookaheads/behinds and speedquestions

by: Yorian | last post by:

I just started to try regexps in php and I didn't have too many problems, however I found a few when trying to build a templte engine. The first one is found is the dollar sign. In my template I...

PHP

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General