Finding Upper-case characters in regexps, unicode friendly.

I'm trying to make a unicode friendly regexp to grab sentences
reasonably reliably for as many unicode languages as possible, focusing
on european languages first, hence it'd be useful to be able to refer
to any uppercase unicode character instead of just the typical [A-Z],
which doesn't include, for example . Is there a way to do this, or
do I have to stick with using the isupper method of the string class?

May 24 '06 #1
4 Replies


> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as
> possible, focusing on european languages first, hence it'd be
> useful to be able to refer to any uppercase unicode character
> instead of just the typical [A-Z], which doesn't include, for
> example . Is there a way to do this, or do I have to stick
> with using the isupper method of the string class?


Well, assuming you pass in the UNICODE or LOCALE specifier, the
following portion of a regexp *should* find what you're describing:
###############################################
import re

# (input, whether the regexp is expected to match)
tests = [("1", False),
         ("a", True),
         ("Hello", True),
         ("2bad", False),
         ("bad1", False),
         ("a c", False),
         ]

# Zero or more "word" characters that are neither digits nor "_"
r = re.compile(r'^(?:(?=\w)[^\d_])*$')

for test, expected_result in tests:
    if r.match(test):
        passed = expected_result
    else:
        passed = not expected_result
    print("[%s] expected [%s] passed [%s]" % (
        test, expected_result, passed))
###############################################

That looks for a "word" character ("\w") but doesn't swallow it
("(?=...)"), and then asserts that the character is not ("^") a
digit ("\d") or an underscore. It looks for any number of "these
things" ("(?:...)*"), which you can tweak to your own taste.

For Unicode-ification, just pass the re.UNICODE parameter to
compile().
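
To see what the flag changes, here is a small sketch. It assumes Python 3, where str patterns are already Unicode-aware by default and re.ASCII (which does not exist in Python 2) switches that off; in Python 2 you would pass re.UNICODE and match against unicode strings instead. The names PATTERN, unicode_aware and ascii_only are made up for the illustration.

```python
import re

PATTERN = r'^(?:(?=\w)[^\d_])*$'

# Assumption: Python 3. re.UNICODE is redundant there for str patterns,
# but harmless; re.ASCII restricts \w and \d to ASCII.
unicode_aware = re.compile(PATTERN, re.UNICODE)
ascii_only = re.compile(PATTERN, re.ASCII)

for word in ['Hello', 'Ärger', 'naïve', 'bad1']:
    print("[%s] unicode [%s] ascii [%s]" % (
        word, bool(unicode_aware.match(word)), bool(ascii_only.match(word))))
```

The accented words should only get through the Unicode-aware version, while "bad1" is rejected by both because of the digit.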

Hope this makes sense and helps,

-tkc


May 24 '06 #2

Sorry...I somehow missed the key *uppercase* bit of that, and
somehow got it in my head that you just wanted Unicode letters,
not numbers. Please pardon the brain-blink. I can't find
anything in Python's regexp docs that does what you want. Vim's
regexp engine has "uppercase character" and "lowercase
character" atoms, but it seems there's no counterpart to them in
Python. Thus, you may have to mount a combined attack of
regexps+isupper().

Using isupper() has some peculiar side-effects in that it only
checks uppercase-able characters, so

>>> "1A".isupper()
True

which may or may not be what you wanted. The shot-from-the-hip
regexp from my previous post will help you filter out any
non-alphabetic Unicode characters; what remains can then be
passed in turn to isupper().
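
A rough sketch of that combined attack, assuming Python 3. The names LETTER, uppercase_letters and starts_with_uppercase are invented for the example, and the character class is only approximately "a letter" (see the comment):

```python
import re

# A word character that is neither a decimal digit nor "_". This is
# mostly, but not strictly, "a letter": \w also admits a few numeric
# characters (such as superscript digits) that \d does not cover.
LETTER = re.compile(r'[^\W\d_]', re.UNICODE)

def uppercase_letters(text):
    """Return the letter-like characters in text for which isupper() is true."""
    return [ch for ch in LETTER.findall(text) if ch.isupper()]

def starts_with_uppercase(text):
    """True only if the first character is letter-like and uppercase."""
    m = LETTER.match(text)
    return bool(m) and m.group().isupper()

print(uppercase_letters('1A Été à Paris'))
print(starts_with_uppercase('1A'))   # unlike "1A".isupper(), this is False
```

Checking one character at a time sidesteps the "1A" surprise, since the digit never reaches isupper().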

-tkc


May 24 '06 #3

On 25/05/2006 5:43 AM, po************@gmail.com wrote:
> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as possible, focusing
> on european languages first, hence it'd be useful to be able to refer
> to any uppercase unicode character instead of just the typical [A-Z],
> which doesn't include, for example . Is there a way to do this, or
> do I have to stick with using the isupper method of the string class?


You have set yourself a rather daunting task.

:-)
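je suis ici a vous dire grandpere que maintenant nous ecrivons sans
accents sans majuscules sans ponctuation sans tout vive le sms vive la
revolution les professeurs a la lanterne ah m**** pas des lanternes
[i am here to tell you grandfather that now we write without
accents without capitals without punctuation without anything long live
sms long live the revolution teachers to the lamppost ah m**** not
lampposts]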
(-:

I would have thought that a full-on NLP parser might be required, even
for more-or-less-conventionally-expressed utterances. How will you
handle "It's not elementary, Dr. Watson."?

However if you persist: there appears to be no way of specifying "an
uppercase character" in Python's re module. You are stuck with isupper().
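
That said, one workaround is to let isupper() generate the character class for you. What follows is only a sketch under some assumptions: Python 3 (chr rather than unichr), Basic Multilingual Plane only, and the names UPPER and capitalized_word are made up for the example.

```python
import re

# Assumption: Python 3, code points below 0x10000 only.
# None of the character-class metacharacters (], \, ^, -) is uppercase,
# so the characters can be pasted into the class without escaping.
upper_chars = ''.join(chr(i) for i in range(0x10000) if chr(i).isupper())
UPPER = '[' + upper_chars + ']'

# Hypothetical use: words that begin with an uppercase character.
capitalized_word = re.compile(UPPER + r'\w*')

print(capitalized_word.findall('Ärger in Ōsaka und Écully, sagte er.'))
```

The class is large and is frozen at whatever Unicode version your Python build ships with, so treat it as a stopgap rather than a real "uppercase" atom.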

Light entertainment for the speed-freaks:
>>> ucucase = set(unichr(i) for i in range(65536) if unichr(i).isupper())
>>> len(ucucase)
704

Is foo in ucucase faster than foo.isupper()?
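
For anyone who wants to try it, a timing sketch with timeit. It assumes Python 3, so chr stands in for unichr; the sample character and the repeat count are arbitrary choices, and the numbers will differ between machines and Python versions.

```python
import timeit

# Assumption: Python 3, code points below 0x10000 only.
ucucase = frozenset(chr(i) for i in range(0x10000) if chr(i).isupper())

sample = 'É'  # arbitrary test character

# Time 100000 lookups each way.
in_set = timeit.timeit(lambda: sample in ucucase, number=100000)
method = timeit.timeit(lambda: sample.isupper(), number=100000)

print("set membership: %.4fs  isupper(): %.4fs" % (in_set, method))
```

Note that the size of the set depends on the Unicode tables in your Python build, so it need not match the count above.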

Cheers,
John

May 24 '06 #4

po************@gmail.com wrote:
> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as possible, focusing
> on european languages first, hence it'd be useful to be able to refer
> to any uppercase unicode character instead of just the typical [A-Z],
> which doesn't include, for example . Is there a way to do this, or
> do I have to stick with using the isupper method of the string class?


See http://tinyurl.com/7jqgt

Kent
May 25 '06 #5
