John Perks and Sarah Mount wrote:
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to denote
"uppercase"?
In this form, it is currently not implemented, although it should be
(written as [[:upper:]], I believe); contributions are welcome (make
sure you read the Unicode consortium's guidelines on regular expressions
before attempting to implement it).
Until then, the "best" way is to use a regular character class,
precomputed or computed at runtime.
uni_upper = [unichr(i) for i in range(sys.maxunicode) if
unichr(i).isupper()]
uni_re = u"["+u"".join(uni_upper)+u"]"
On my machine, this takes approximately one second to compute,
which may or may not be too much as a startup cost. To speed
this up, you could dump the resulting uni_re into a Python
source file.
Regards,
Martin