Finding Upper-case characters in regexps, unicode friendly.

possibilitybox

I'm trying to make a unicode friendly regexp to grab sentences
reasonably reliably for as many unicode languages as possible, focusing
on european languages first, hence it'd be useful to be able to refer
to any uppercase unicode character instead of just the typical [A-Z],
which doesn't include, for example É. Is there a way to do this, or
do I have to stick with using the isupper method of the string class?

May 24 '06 #1


Tim Chase

> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as
> possible, focusing on european languages first, hence it'd be
> useful to be able to refer to any uppercase unicode character
> instead of just the typical [A-Z], which doesn't include, for
> example É. Is there a way to do this, or do I have to stick
> with using the isupper method of the string class?

Well, assuming you pass in the UNICODE or LOCALE specifier, the
following portion of a regexp *should* find what you're describing:
###############################################
import re

# (input string, whether the pattern is expected to match)
tests = [("1", False),
         ("a", True),
         ("Hello", True),
         ("2bad", False),
         ("bad1", False),
         ("a c", False)
         ]
# Any number of "word" characters that are neither digits nor underscores
r = re.compile(r'^(?:(?=\w)[^\d_])*$')
for test, expected_result in tests:
    if r.match(test):
        passed = expected_result
    else:
        passed = not expected_result
    # A single parenthesized argument prints the same under Python 2 and 3
    print("[%s] expected [%s] passed [%s]" % (
        test, expected_result, passed))
###############################################

That looks for a "word" character ("\w") but doesn't swallow it
("(?=...)"), and then asserts that the character is not ("^") a
digit ("\d") or an underscore. It looks for any number of "these
things" ("(?:...)*"), which you can tweak to your own taste.

For Unicode-ification, just pass the re.UNICODE parameter to
compile().
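
As a minimal sketch of that advice, assuming Python 3 (where `str` patterns are Unicode-aware by default, so the flag is redundant but harmless), the pattern can be compiled with `re.UNICODE` and wrapped in a helper. The names `letters_only` and `is_all_letters` are made up for illustration.

```python
import re

# Same pattern as above, compiled with re.UNICODE. In Python 3 this flag is
# already the default for str patterns; it is spelled out to mirror the advice.
letters_only = re.compile(r'^(?:(?=\w)[^\d_])*$', re.UNICODE)


def is_all_letters(text):
    """True if text is empty or holds only word characters that are
    neither decimal digits nor underscores."""
    return letters_only.match(text) is not None
```

With this, `is_all_letters("\u00c9cole")` is true while `is_all_letters("2bad")` and `is_all_letters("a c")` are false. Note that the `*` quantifier also lets the empty string through.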

Hope this makes sense and helps,

-tkc

May 24 '06 #2

Tim Chase

Sorry...I somehow missed the key *uppercase* bit of that, and
somehow got it in my head that you just wanted unicode letters,
not numbers. Please pardon the brain-blink. I can't find
anything in Python's regexp docs that does what you want. Vim's
regexp engine has "uppercase character" and "lowercase
character" atoms, but it seems there's no counterpart to them in
Python. Thus, you may have to take a combined attack of
regexps+isupper().

Using isupper() has some peculiar side-effects in that it only
checks uppercase-able characters, so

>>> "1A".isupper()
True

which may or may not be what you wanted. The previously
shot-from-the-hip regexp stuff will help you filter out any
non-alphabetic unicode characters, which can then be passed in
turn to isupper()
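
One way the combined attack might look, as a rough sketch in Python 3: the regexp isolates runs of letters, and a per-character isupper() check decides which runs to keep. The names `letter_run` and `uppercase_runs` are invented for this example. Checking each character separately sidesteps the "1A" quirk, and also rejects uncased letters that a whole-string isupper() would ignore.

```python
import re

# One or more word characters that are neither digits nor underscores.
letter_run = re.compile(r'(?:(?=\w)[^\d_])+')


def uppercase_runs(text):
    """Return the runs of letters in which every character is uppercase."""
    return [m.group() for m in letter_run.finditer(text)
            if all(ch.isupper() for ch in m.group())]
```

For the input `"1A \u00c9COLE \u00e9cole NATO2"` this returns the runs "A", "ÉCOLE" and "NATO": the digits are dropped by the regexp, and the lowercase run is dropped by isupper().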

-tkc

May 24 '06 #3

John Machin

On 25/05/2006 5:43 AM, po************@gmail.com wrote:

> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as possible, focusing
> on european languages first, hence it'd be useful to be able to refer
> to any uppercase unicode character instead of just the typical [A-Z],
> which doesn't include, for example É. Is there a way to do this, or
> do I have to stick with using the isupper method of the string class?

You have set yourself a rather daunting task.

:-)
je suis ici a vous dire grandpere que maintenant nous ecrivons sans
accents sans majuscules sans ponctuation sans tout vive le sms vive la
revolution les professeurs a la lanterne ah m**** pas des lanternes
(-:

I would have thought that a full-on NLP parser might be required, even
for more-or-less-conventionally-expressed utterances. How will you
handle "It's not elementary, Dr. Watson."?
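
To make the difficulty concrete, here is a deliberately naive splitter, sketched in Python 3 under the assumption that a sentence ends at ".", "!" or "?" followed by whitespace and an uppercase character. The names `candidate_break` and `naive_sentences` are hypothetical; this is not a recommended approach, only an illustration of where it goes wrong.

```python
import re

# Whitespace that directly follows sentence-ending punctuation.
candidate_break = re.compile(r'(?<=[.!?])\s+')


def naive_sentences(text):
    """Split at candidate breaks whose next character is uppercase."""
    sentences, start = [], 0
    for m in candidate_break.finditer(text):
        next_char = text[m.end():m.end() + 1]
        if next_char.isupper():  # works for É, Ç, etc., not just A-Z
            sentences.append(text[start:m.start()])
            start = m.end()
    sentences.append(text[start:])
    return sentences
```

It copes with `"Il pleut. \u00c7a va."`, but on the Dr. Watson example it yields two pieces, "It's not elementary, Dr." and "Watson.", because the abbreviation looks exactly like a sentence end.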

However if you persist: there appears to be no way of specifying "an
uppercase character" in Python's re module. You are stuck with isupper().
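
A possible workaround, not something proposed in this thread, is to let isupper() generate the character class once and hand the result to re. The sketch below assumes Python 3 and limits itself to code points below 65536; `UPPER`, `capitalized_word` and `capitalized_words` are illustrative names only.

```python
import re

# Every code point below 0x10000 that str.isupper() accepts. Widen the range
# to sys.maxunicode + 1 if characters outside the BMP matter.
uppercase_chars = sorted(chr(i) for i in range(0x10000) if chr(i).isupper())

# re.escape keeps the class safe even if a special character slipped in.
UPPER = "[" + "".join(re.escape(ch) for ch in uppercase_chars) + "]"

# Hypothetical use: a word that starts with an uppercase character.
capitalized_word = re.compile(UPPER + r"\w*")


def capitalized_words(text):
    """Return every word in text that begins with an uppercase character."""
    return capitalized_word.findall(text)
```

The resulting class is large and depends on the Unicode tables shipped with the interpreter, so it is best built once at import time and reused.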

Light entertainment for the speed-freaks:

>>> ucucase = set(unichr(i) for i in range(65536) if unichr(i).isupper())
>>> len(ucucase)
704

Is foo in ucucase faster than foo.isupper()?
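
The question can be tested with timeit. This is a sketch for Python 3, where chr() replaces unichr(); the helper name `time_both` is made up, and no result is claimed here since timings, like the size of the set, vary with the interpreter and its Unicode tables.

```python
import timeit

# Python 3 counterpart of the set built above.
uppercase_chars = frozenset(chr(i) for i in range(0x10000) if chr(i).isupper())


def time_both(ch, number=100000):
    """Return (seconds for set membership, seconds for ch.isupper())."""
    in_set = timeit.timeit(lambda: ch in uppercase_chars, number=number)
    method = timeit.timeit(lambda: ch.isupper(), number=number)
    return in_set, method
```

Calling `time_both("\u00c9")` returns the two timings as a tuple. Both measurements include the overhead of calling the lambda, so only the difference between them is meaningful.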

Cheers,
John

May 24 '06 #4

Kent Johnson

po************@gmail.com wrote:

> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as possible, focusing
> on european languages first, hence it'd be useful to be able to refer
> to any uppercase unicode character instead of just the typical [A-Z],
> which doesn't include, for example É. Is there a way to do this, or
> do I have to stick with using the isupper method of the string class?

See http://tinyurl.com/7jqgt

Kent

May 25 '06 #5
