Bytes IT Community

Regexps with Unicode-aware character classes?

Hi all,

In a Python re pattern, how do I match all Unicode uppercase characters
(in a unicode string or in a UTF-8 encoded byte string)?

I know that there is string.uppercase/.lowercase, which are
'locale-aware', but I don't think there is an "all locales" locale.

I know that there is a re.U switch that makes \w match all Unicode word
characters, but there are no finer-grained classes within that (such as
[[:upper:]] or, preferably, \u).
Or is there a module/extension that provides that?

There is the module unicodedata, but it has no unicodedata.uppercase
that would correspond to string.uppercase.

<wishful thinking>

    re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

or::

    re.compile('(?u)[[:upper:]]')

or::

    re.compile('(?u)\u')

For the latter two to work on UTF-8 strings, would I have to set the
default encoding to UTF-8?

</wishful thinking>

Aug 30 '05 #1

Stefan Rank wrote:
> <wishful thinking>
>
> re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

This would (almost) work, but it would be terribly inefficient (matching
time is linear in the number of alternatives). Realistically, you can do
this instead:

import re
import sys

# Collect every uppercase code point into one character class.
uppers = [u'[']
for i in range(sys.maxunicode + 1):
    c = unichr(i)
    if c.isupper():
        uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.
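
A note on the caching idea: as far as I know, CPython pickles a compiled pattern as its source string plus flags and compiles it again on load, so what a cache mostly saves is the scan over all code points. Below is a minimal sketch under that assumption, written for Python 3 (where chr replaces unichr and str is already Unicode). The function names and the cache file are hypothetical, not part of any library:

```python
import os
import re
import sys

def upper_class_source():
    """Build the source of a character class holding every uppercase code point."""
    chars = [chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isupper()]
    return '[' + ''.join(chars) + ']'

def load_upper_re(cache_path):
    """Compile the class, reading its source from cache_path when that file exists."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding='utf-8') as fh:
            source = fh.read()
    else:
        # First run: do the expensive scan once and store the result.
        source = upper_class_source()
        with open(cache_path, 'w', encoding='utf-8') as fh:
            fh.write(source)
    return re.compile(source)
```

The compile step still runs on every start; only the loop over all code points is skipped on later runs.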

(Syntax note: this only works because none of the characters that are
special inside an RE character class (]-^\) is an uppercase letter;
otherwise, escaping would be needed.)
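
If the same approach is reused for a class that might contain one of those special characters, escaping each member avoids the problem. This is a rough sketch in Python 3, not a definitive implementation; char_class and its limit parameter are made-up names for illustration:

```python
import re
import sys

def char_class(predicate, limit=sys.maxunicode + 1):
    """Return a regex character class of all code points below limit that satisfy predicate."""
    members = (chr(cp) for cp in range(limit))
    # re.escape protects characters such as ], -, ^ and \ inside the class.
    return '[' + ''.join(re.escape(c) for c in members if predicate(c)) + ']'

# Same class as above, built with escaping as a precaution.
upper_re = re.compile(char_class(str.isupper))

# A class where escaping actually matters (ASCII only, to keep it quick).
special_re = re.compile(char_class(lambda c: c in ']-^\\', limit=128))
```

Note that the sketch assumes at least one character satisfies the predicate; an empty class ("[]") is not a valid pattern.
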
> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?


For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.
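
To illustrate the difference, here is a small sketch in Python 3 syntax (the sample text is made up). It matches the same data once as UTF-8 bytes and once after decoding:

```python
import re

raw = 'Äpfel und Öl'.encode('utf-8')  # sample UTF-8 bytes (made-up input)

# Byte pattern: \w only knows ASCII, so the multi-byte letters are cut off.
byte_words = re.findall(rb'\w+', raw)

# Decode first, then match on the Unicode string.
text = raw.decode('utf-8')
text_words = re.findall(r'\w+', text)

# Per-character tests such as isupper() also work on the decoded text.
uppers = [c for c in text if c.isupper()]
```

With the byte pattern, the umlauts fall outside \w and the words come back truncated; after decoding, whole words and their uppercase letters are found as expected.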

Regards,
Martin
Sep 14 '05 #2
