By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
435,454 Members | 3,133 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 435,454 IT Pros & Developers. It's quick & easy.

[regex] case-splitting strings in unicode

P: n/a
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to denote
"uppercase"?

Thanks

John
Oct 9 '05 #1
Share this Question
Share on Google+
2 Replies


P: n/a
On Oct 09, John Perks and Sarah Mount wrote:
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to
denote "uppercase"?


Not sure what your output should look like but something like this could
work:
import re
re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest')

'the First Test the Second Test'

This can be adapted for multiline, etc, but maybe '[A-Z]' is
sufficiently general. The regex module does have an understanding of
unicode (but I don't, sorry); you could add (?u) make it unicode aware.
For programming language identifiers I wouldn't think that unicode
should be an issue. Sorry I'm no help with unicode specifics.

Some useful links:

http://www.python.org/doc/2.4.2/lib/module-re.html
http://www.amk.ca/python/howto/regex/regex.html

--
Micah Elliott
<md*@micah.elliott.name>
Oct 9 '05 #2

P: n/a
John Perks and Sarah Mount wrote:
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to denote
"uppercase"?


In this form, it is currently not implemented, although it should be
(written as [[:upper:]], I believe); contributions are welcome (make
sure you read the Unicode consortium's guidelines on regular expressions
before attempting to implement it).

Until then, the "best" way is to use a regular character class,
precomputed or computed at runtime.

uni_upper = [unichr(i) for i in range(sys.maxunicode) if
unichr(i).isupper()]
uni_re = u"["+u"".join(uni_upper)+u"]"

On my machine, this takes approximately one second to compute,
which may or may not be too much as a startup cost. To speed
this up, you could dump the resulting uni_re into a Python
source file.

Regards,
Martin
Oct 9 '05 #3

This discussion thread is closed

Replies have been disabled for this discussion.