Unicode: matching a word and unaccenting characters

(Mail resent with the proper subject.)

Hi list,

(Please Cc: me when replying, as I'm not subscribed to this list.)

I'm working with Unicode strings to handle accented characters but I'm
experiencing a few problems.

The first one is with regular expressions. I want to match a word
composed of letters only. One can easily use '[a-zA-Z]+' when
working in ASCII, but unfortunately there is no equivalent when working
with Unicode strings: that pattern doesn't match accented characters. The
only means the re package provides is '\w' along with the re.UNICODE
flag, but unfortunately it also matches digits and the underscore. It
appears there is currently no suitable solution for this. Am I right?

Secondly, I need to translate accented characters to their unaccented
form. I've written this function (sorry if the code isn't as efficient
as possible, I'm not a long-time Python programmer; feel free to correct
me, I'd be glad to learn anything):

% import re
% import types
% import unicodedata
%
% def unaccent(s):
%     """
%     """
%
%     if not isinstance(s, types.UnicodeType):
%         return s
%     singleletter_re = re.compile(r'(?:^|\s)([A-Z])(?:$|\s)')
%     result = ''
%     for l in s:
%         desc = unicodedata.name(l)
%         m = singleletter_re.search(desc)
%         if m is None:
%             result += str(l)
%             continue
%         result += m.group(1).lower()
%     return result
%

But I don't feel comfortable with it. It strongly depends on the UCD
file format, and characters whose names don't contain a single letter
obviously cannot all be converted to ASCII. How would you implement this
function?

Thank you for your help.
Regards,
--
Jeremie Le Hen
< jlehen at clesys dot fr >

Nov 15 '07 #1
2 Replies


On Nov 15, 1:21 am, Jeremie Le Hen <jere...@le-hen.org> wrote:
> (Mail resent with the proper subject.)
>
> Hi list,
>
> (Please Cc: me when replying, as I'm not subscribed to this list.)

I don't know your e-mail address; I hope you will come back to look at the
list archive...

> I'm working with Unicode strings to handle accented characters but I'm
> experiencing a few problems.

[skipped first question]

> Secondly, I need to translate accented characters to their unaccented
> form. I've written this function (sorry if the code isn't as efficient
> as possible, I'm not a long-time Python programmer; feel free to correct
> me, I'd be glad to learn anything):
>
> % import re
> % import types
> % import unicodedata
> %
> % def unaccent(s):
> %     """
> %     """
> %
> %     if not isinstance(s, types.UnicodeType):
> %         return s
> %     singleletter_re = re.compile(r'(?:^|\s)([A-Z])(?:$|\s)')
> %     result = ''
> %     for l in s:
> %         desc = unicodedata.name(l)
> %         m = singleletter_re.search(desc)
> %         if m is None:
> %             result += str(l)
> %             continue
> %         result += m.group(1).lower()
> %     return result
> %
>
> But I don't feel comfortable with it. It strongly depends on the UCD
> file format, and characters whose names don't contain a single letter
> obviously cannot all be converted to ASCII. How would you implement this
> function?

My 2 cents:

<unaccent.py>
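# -*- coding: utf-8 -*-
# Python 2 code: relies on the unicode type and on byte-wise str iteration.
import unicodedata

def unaccent(s):
    u"""
    >>> unaccent(u"Ça crée déjà l'évènement")
    "Ca cree deja l'evenement"
    """

    # Decompose each accented letter into base letter + combining mark.
    s = unicodedata.normalize('NFD', unicode(s.encode("utf-8"),
                                             encoding="utf-8"))
    # Keep only the bytes in the ASCII range of the UTF-8 encoding.
    return "".join(b for b in s.encode("utf-8") if ord(b) < 128)

def _test():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    import sys
    sys.exit(_test())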
</unaccent.py>
> Thank you for your help.

You are welcome.

(left to the reader:
- why does it work?
- why does doctest work?)
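
A hint on the first point, for readers on Python 3 rather than the Python 2 used in this thread: NFD normalization splits an accented letter into a plain base letter followed by a combining mark, so the accents can be dropped on their own. The sketch below is a variation, not the function above: it filters on `unicodedata.combining()` instead of discarding every non-ASCII byte, so a character with no decomposition (such as 'ø') is kept rather than silently removed. On the second point, Python 2 shows a byte string containing an apostrophe in double quotes, which is what the docstring expects.

```python
import unicodedata

def unaccent(text):
    """Strip accents by removing combining marks after NFD decomposition."""
    # 'é' (U+00E9) becomes 'e' + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters, non-zero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(unaccent("Ça crée déjà l'évènement"))  # Ca cree deja l'evenement
```

If a strictly ASCII result is needed, the output can still be passed through `.encode("ascii", "ignore")`, at the cost of losing characters that have no decomposition.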

renaud
Nov 15 '07 #2

On Nov 15, 12:21 am, Jeremie Le Hen <jere...@le-hen.org> wrote:
> (Mail resent with the proper subject.)
>
> Hi list,
>
> (Please Cc: me when replying, as I'm not subscribed to this list.)
>
> I'm working with Unicode strings to handle accented characters but I'm
> experiencing a few problems.
>
> The first one is with regular expressions. I want to match a word
> composed of letters only. One can easily use '[a-zA-Z]+' when
> working in ASCII, but unfortunately there is no equivalent when working
> with Unicode strings: that pattern doesn't match accented characters. The
> only means the re package provides is '\w' along with the re.UNICODE
> flag, but unfortunately it also matches digits and the underscore. It
> appears there is currently no suitable solution for this. Am I right?

[snip]

You can match a single character with '\w' and then ensure that it
isn't a digit or underscore with a negative lookbehind '(?<![\d_])',
so to match only words consisting of letters (in the sense you
mean), use '(?:\w(?<![\d_]))+' together with the re.UNICODE flag.
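
A minimal sketch of that idea, written for Python 3 (where str patterns are Unicode-aware by default, so the re.UNICODE flag is implied). The helper name `find_words` and the sample text are made up for illustration. The second pattern, '[^\W\d_]+', is a common alternative that expresses the same set of characters (word characters minus digits and underscore) as a negated class, without a lookbehind.

```python
import re
import unicodedata

# \w minus digits and underscore, written two equivalent ways.
LOOKBEHIND = re.compile(r"(?:\w(?<![\d_]))+")
NEGATED_CLASS = re.compile(r"[^\W\d_]+")

def find_words(text, pattern=LOOKBEHIND):
    """Return the runs of letters in text (hypothetical helper)."""
    # Compose first: a decomposed accent is a combining mark, which \w rejects.
    return pattern.findall(unicodedata.normalize("NFC", text))

sample = "déjà vu_42 évènement"
print(find_words(sample))                 # ['déjà', 'vu', 'évènement']
print(find_words(sample, NEGATED_CLASS))  # same result
```

Note that both patterns split "vu_42" at the underscore and return "vu"; if such tokens should be rejected entirely, word-boundary anchors would have to be added.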
Nov 15 '07 #3
