Unicode: matching a

Hi list,

(Please Cc: me when replying, as I'm not subscribed to this list.)

I'm working with Unicode strings to handle accented characters, but I'm
experiencing a few problems.

The first one is with regular expressions. Suppose I want to match a word
composed of letters only. One can easily use '[a-zA-Z]+' when
working in ASCII, but unfortunately there is no equivalent when working
with Unicode strings: that pattern doesn't match accented characters. The
only means the re package provides is '\w' along with the re.UNICODE
flag, but unfortunately it also matches digits and the underscore. It
appears there is no suitable solution for this currently. Am I right?
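
One workaround that is often suggested for this, sketched here under assumptions and not taken from the post itself, is to start from '\w' and subtract digits and the underscore with a negated character class. The sketch uses Python 3, where str patterns are Unicode-aware by default (on Python 2 it would need a unicode pattern and the re.UNICODE flag); the names LETTERS_RE and find_words are hypothetical.

```python
import re

# Word characters minus digits and underscore: roughly "letters only".
# [^\W\d_] reads as "not a non-word character, not a digit, not '_'".
LETTERS_RE = re.compile(r'[^\W\d_]+')

def find_words(text):
    """Return the runs of letter characters found in text."""
    return LETTERS_RE.findall(text)

print(find_words('café 123 naïve_x'))  # ['café', 'naïve', 'x']
```

This is only an approximation of "letters": numeric characters that '\d' does not cover (superscript digits, for instance) may still slip through, and a letter followed by a separate combining accent will probably be split unless the text is normalized to a composed form first.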

Secondly, I need to translate accented characters to their unaccented
form. I've written this function (sorry if the code isn't as efficient
as possible; I'm not a long-time Python programmer, so feel free to correct
me, I'd be glad to learn anything):

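% import re
% import types
% import unicodedata
%
% def unaccent(s):
%     """Replace each character of a unicode string with the lone
%     letter found in its Unicode name, lowercased. Anything that is
%     not a unicode string is returned unchanged."""
%     if not isinstance(s, types.UnicodeType):
%         return s
%     # A name such as "LATIN SMALL LETTER E WITH ACUTE" contains the
%     # base letter as a standalone capital.
%     singleletter_re = re.compile(r'(?:^|\s)([A-Z])(?:$|\s)')
%     result = ''
%     for l in s:
%         # Default of '' avoids a ValueError for unnamed characters.
%         desc = unicodedata.name(l, '')
%         m = singleletter_re.search(desc)
%         if m is None:
%             # Note: str() fails here for non-ASCII characters.
%             result += str(l)
%             continue
%         result += m.group(1).lower()
%     return result
%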

But I don't feel comfortable with it. It depends strongly on the UCD
file format, and characters whose names don't contain a single letter
obviously cannot all be converted to ASCII. How would you implement this
function?
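
A possible alternative, offered as a minimal sketch and not as a definitive answer, is to let unicodedata decompose each character into a base letter plus combining marks and then drop the marks, which avoids parsing character names altogether. The sketch assumes Python 3 (where str is the Unicode type) and reuses the name unaccent from the function above; unlike that function, it preserves case.

```python
import unicodedata

def unaccent(s):
    """Strip combining accents from s; non-string input is returned as-is."""
    if not isinstance(s, str):
        return s
    # NFD splits e.g. U+00E9 into 'e' followed by U+0301 (combining acute).
    decomposed = unicodedata.normalize('NFD', s)
    # combining() returns 0 for base characters, non-zero for combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(unaccent('Jérémie'))  # Jeremie
```

Characters that have no decomposition, such as 'ø' or 'ß', pass through unchanged, so a small explicit mapping would still be needed if pure ASCII output is required.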

Thank you for your help.
Regards,
--
Jeremie Le Hen
< jlehen at clesys dot fr >
Nov 15 '07 #1