
Replace accented chars with unaccented ones

Hi

I would like to replace accented chars (like "é", "è" or "à") with
non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...

Can you help me please?
Thanks.
Jul 18 '05 #1
Thank you both for your answers. They both work very well.

At first I thought it didn't work, but the mistake I had made was
forgetting the "u" prefix on the string: u"é". Since my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
necessary... I was wrong.

Bye.
Jul 18 '05 #2
You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]

def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

remove_accents(u'\xe9')
u'e'

Second, if you are using a single-byte encoding (iso8859-1, for
instance), then work with byte strings:

import string

replacement_map = string.maketrans('\xe9...', 'e...')

def remove_accents(s):
    return s.translate(replacement_map)

remove_accents('\xe9')
'e'
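
Spelled out as a runnable sketch (Python 2; only the three characters from the
question are mapped here, and the two maketrans arguments must line up
position for position):

import string

# Latin-1 bytes for "é", "è" and "à", mapped to their unaccented counterparts.
replacement_map = string.maketrans('\xe9\xe8\xe0', 'eea')

def remove_accents(s):
    # str.translate() pushes every byte through the 256-entry table.
    return s.translate(replacement_map)

print remove_accents('caf\xe9')   # -> 'cafe'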

If you want to have strings like u'é' in your programs, you have to
include a line at the top of the source file that tells Python the
encoding, like the following line does:
# -*- coding: utf-8 -*-
(except you have to name the encoding your editor uses, if it's not
utf-8). See http://python.org/peps/pep-0263.html

Once you've done that, you can write
replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.
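
For example, the three replacements from the original question could then be
written directly (assuming the file really is saved as UTF-8):

# -*- coding: utf-8 -*-
replacements = [(u'é', 'e'), (u'è', 'e'), (u'à', 'a')]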

Jeff

Jul 18 '05 #3
Jeff Epler wrote:
You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]

def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

remove_accents(u'\xe9')
u'e'

Second, if you are using a single-byte encoding (iso8859-1, for
instance), then work with byte strings:

import string

replacement_map = string.maketrans('\xe9...', 'e...')

def remove_accents(s):
    return s.translate(replacement_map)

remove_accents('\xe9')
'e'

If you want to have strings like u'é' in your programs, you have to
include a line at the top of the source file that tells Python the
encoding, like the following line does:
# -*- coding: utf-8 -*-
(except you have to name the encoding your editor uses, if it's not
utf-8). See http://python.org/peps/pep-0263.html

Once you've done that, you can write
replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.


Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

One pass through the file gives an O(len(inp)) algorithm, much better
(running-time wise) than the string.replace method that runs in
O(len(inp) * len(replacement_pairs)) time as given.
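
A quick usage sketch (the definition is repeated so the example stands alone;
the replacement pairs are assumed for illustration):

replacement_pairs = [(u'\xe9', u'e'), (u'\xe8', u'e'), (u'\xe0', u'a')]
mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    # One dictionary lookup per character; unmapped characters pass through.
    return u''.join([mapping.get(i, i) for i in inp])

print multi_replace(u'caf\xe9 cr\xe8me')   # -> u'cafe creme'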

- Josiah
Jul 18 '05 #4
Nicolas Bouillon <bo***@bouil.org.invalid> wrote in message news:<EW*******************@nntpserver.swip.net>...
Thank you both for your answers. They both work very well.

At first I thought it didn't work, but the mistake I had made was
forgetting the "u" prefix on the string: u"é". Since my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
necessary... I was wrong.

Bye.


The 'utils1' package includes a file called charmap, which is a
function to map to ASCII. It originally comes from a 'python
snippet' on SourceForge, I believe:

http://www.voidspace.org.uk/atlantib...thonutils.html

Regards,
Fuzzy
Jul 18 '05 #5
Jeff Epler <je****@unpythonic.net> writes:
You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]

def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u


There must be some higher-powered way of doing this... something
like:

import unicodedata

def remove_accent1(c):
    return unicodedata.normalize('NFD', c)[0]

def remove_accents(s):
    return u''.join(map(remove_accent1, s))

?
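
A slightly fuller sketch of the same idea (Python 2; instead of keeping only
the first codepoint of the decomposition, it drops all combining marks, and
characters that don't decompose pass through unchanged):

import unicodedata

def remove_accents(s):
    # Decompose accented characters into base letter + combining marks,
    # then throw the marks (Unicode category 'Mn') away.
    nfd = unicodedata.normalize('NFD', s)
    return u''.join([c for c in nfd if unicodedata.category(c) != 'Mn'])

print remove_accents(u'\xe9\xe8\xe0')   # -> u'eea'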

Cheers,
mwh

--
We've had a lot of problems going from glibc 2.0 to glibc 2.1.
People claim binary compatibility. Except for functions they
don't like. -- Peter Van Eynde, comp.lang.lisp
Jul 18 '05 #6
On Mon, Mar 15, 2004 at 06:19:00PM -0800, Josiah Carlson wrote:
Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

One pass through the file gives an O(len(inp)) algorithm, much better
(running-time wise) than the string.replace method that runs in
O(len(inp) * len(replacement_pairs)) time as given.


Thanks for posting this. My other code was pretty hopeless, but for
some reason .get(i, i) didn't come to mind as a solution.

Jeff

Jul 18 '05 #7
On Tue, Mar 16, 2004 at 08:26:08AM +0100, Nicolas Bouillon wrote:
Thank you both for your answers. They both work very well.

At first I thought it didn't work, but the mistake I had made was
forgetting the "u" prefix on the string: u"é". Since my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
necessary... I was wrong.


When there are non-unicode string literals in a file, they are simply
byte sequences. Take this program, for instance:

# -*- coding: utf-8 -*-
s = "é"
print len(s), repr(s)

$ python bytestr.py
2 '\xc3\xa9'
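
For contrast, the same program with the u prefix (same coding declaration; the
file name unistr.py is just for illustration) decodes the literal into a
single character:

# -*- coding: utf-8 -*-
s = u"é"
print len(s), repr(s)

$ python unistr.py
1 u'\xe9'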

Jeff

Jul 18 '05 #8
Nicolas Bouillon <bo***@bouil.org.invalid> wrote in message news:<Ta*******************@nntpserver.swip.net>...
Hi

I would like to replace accented chars (like "é", "è" or "à") with
non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...


The following is the code that I use. This looks like what you are asking for.

In case this gets corrupted you can also find it here:
http://sourceforge.net/snippet/detai...ppet&id=101229
This has some improvements to readability and speed, but it is basically
the same:
http://aspn.activestate.com/ASPN/Coo.../Recipe/251871

Yours,
Noah

#!/usr/bin/env python
"""
UNICODE Hammer -- The Stupid American

I needed something that would take a UNICODE string and
smack it into ASCII. This function doesn't just strip out the characters.
It tries to convert Latin-1 characters into ASCII equivalents where possible.

We get customer mailing address data from Europe, but most of our systems
cannot handle the Latin-1 characters. All I needed was to prepare addresses
for a few different shipping systems that we use.
None of these systems support anything but ASCII.
After getting headaches trying to deal with this problem using Python's
built-in UNICODE support I gave up and decided to write something that
would solve the problem the American way -- with brute force.
I convert all european accented letters to their unaccented equivalents.
I realize this isn't perfect, but for my purposes the packages get delivered.

Noah Spurrier no**@noah.org
License free and public domain
"""

def latin1_to_ascii (unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# It creates a string with all the characters in the latin-1 character set,
# then converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('','latin-1')
    for c in range(32,256):
        if c != 0x7f:
            s = s + unicode(chr(c),'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
Jul 18 '05 #9
Josiah Carlson wrote:
Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])


Using the .translate() method on unicode strings should be
even more performant:

# prepare mapping table to match .translate interface
table = {}
for k,v in replacement_pairs: table[ord(k)]=v

def multi_replace(inp):
    return inp.translate(table)
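
A quick hedged check of this (note that for unicode.translate the replacement
values themselves need to be Unicode strings or ordinals; the pairs are
assumed for illustration):

replacement_pairs = [(u'\xe9', u'e'), (u'\xe8', u'e'), (u'\xe0', u'a')]

# prepare the mapping table as described above
table = {}
for k, v in replacement_pairs:
    table[ord(k)] = v

print u'caf\xe9 cr\xe8me'.translate(table)   # -> u'cafe creme'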

Regards,
Martin

Jul 18 '05 #10
> r += xlate[ord(i)]
r += i


Perhaps I'm going to have to create a signature and drop information
about this in every post to c.l.py, but repeated string additions are
slow as hell for any reasonably long string. It is much faster to
place characters into a list and ''.join() them.
>>> import time
>>> def test_s(l):
...     t = time.time()
...     for i in xrange(100):
...         a = ''
...         for j in xrange(l):
...             a += '0'
...     return time.time()-t
...
>>> def test_l(l):
...     t = time.time()
...     for i in xrange(100):
...         a = ''.join(['0' for j in xrange(l)])
...     return time.time()-t
...
>>> i = 128
>>> while i < 4097:
...     print test_s(i), test_l(i)
...     i *= 2
...
0.0150001049042 0.0309998989105
0.0469999313354 0.047000169754
0.140999794006 0.109000205994
0.343999862671 0.203000068665
0.905999898911 0.40700006485
2.56200003624 0.828000068665

At 256 characters long, it looks about even. Anything longer and
''.join(lst) is significantly faster.

When we do something like the below, the overhead of creating short
lists is significant, but it is still faster when l is greater than
roughly 2048:
a = []
for i in xrange(l):
    a += ['0']
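
The same comparison can be packaged a bit more compactly with the standard
timeit module (a rough sketch; the 100 repetitions above are kept, and the
absolute numbers will of course vary by machine and interpreter):

import timeit

# Build a string of length n by repeated += versus ''.join() of a list
# comprehension, timing 100 repetitions of each.
for n in (128, 256, 512, 1024, 2048, 4096):
    t_add = timeit.Timer(
        "a = ''\nfor j in xrange(%d): a += '0'" % n).timeit(100)
    t_join = timeit.Timer(
        "a = ''.join(['0' for j in xrange(%d)])" % n).timeit(100)
    print n, t_add, t_join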
- Josiah
Jul 18 '05 #11
> Using the .translate() method on unicode strings should be
even more performant:

# prepare mapping table to match .translate interface
table = {}
for k,v in replacement_pairs: table[ord(k)]=v

def multi_replace(inp):
    return inp.translate(table)


Even better *smile*.

- Josiah
Jul 18 '05 #12
Josiah Carlson <jc******@nospam.uci.edu> wrote in message news:<c3**********@news.service.uci.edu>...
r += xlate[ord(i)]
r += i


Perhaps I'm going to have to create a signature and drop information
about this in every post to c.l.py, but repeated string additions are
slow as hell for any reasonably long string. It is much faster to
place characters into a list and ''.join() them.


True. Is this better?

... body of latin1_to_ascii() ...
r = []
for i in unicrap:
    if xlate.has_key(ord(i)):
        r.append(xlate[ord(i)])
    elif ord(i) >= 0x80:
        pass
    else:
        r.append(i)
return ''.join(r)
Yours,
Noah
Jul 18 '05 #13
Noah wrote:
Josiah Carlson <jc******@nospam.uci.edu> wrote in message news:<c3**********@news.service.uci.edu>...
r += xlate[ord(i)]
r += i


Perhaps I'm going to have to create a signature and drop information
about this in every post to c.l.py, but repeated string additions are
slow as hell for any reasonably long string. It is much faster to
place characters into a list and ''.join() them.

True. Is this better?

... body of latin1_to_ascii() ...
r = []
for i in unicrap:
    if xlate.has_key(ord(i)):
        r.append(xlate[ord(i)])
    elif ord(i) >= 0x80:
        pass
    else:
        r.append(i)
return ''.join(r)


I'd use:
''.join([xlate.get(ord(i), i) for i in unicrap
         if ord(i) in xlate or ord(i) < 0x80])

Using r.append(), in general, while being faster than string addition,
is significantly slower than using list comprehensions.

- Josiah
Jul 18 '05 #14
Nicolas Bouillon <bo***@bouil.org.invalid> wrote:
Hi

I would like to replace accented chars (like "é", "è" or "à") with
non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...

Can you help me please ?
Thanks.


You could try experimenting with the 'unicodedata' module:

>>> import unicodedata
>>> [unicodedata.name(x) for x in u'123 abc @#$ \u00ff']
['DIGIT ONE', 'DIGIT TWO', 'DIGIT THREE', 'SPACE', 'LATIN SMALL LETTER A',
 'LATIN SMALL LETTER B', 'LATIN SMALL LETTER C', 'SPACE', 'COMMERCIAL AT',
 'NUMBER SIGN', 'DOLLAR SIGN', 'SPACE', 'LATIN SMALL LETTER Y WITH DIAERESIS']
>>> unicodedata.lookup('latin capital letter a with grave')
u'\xc0'

You could strip the ' WITH...' part when applicable and convert names
back to string. You would only need to process characters with ord >=
160.
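
A rough sketch of that idea (the helper names are made up; characters without
a ' WITH ...' part in their name, or whose stripped name doesn't exist, are
simply left alone):

import unicodedata

def strip_accent(c):
    # e.g. 'LATIN SMALL LETTER E WITH ACUTE' -> 'LATIN SMALL LETTER E' -> u'e'
    try:
        name = unicodedata.name(c)
    except ValueError:
        return c
    if ' WITH ' not in name:
        return c
    try:
        return unicodedata.lookup(name.split(' WITH ')[0])
    except KeyError:
        return c

def remove_accents(s):
    return u''.join([strip_accent(c) for c in s])

print remove_accents(u'\xe9\xe8\xe0')   # -> u'eea'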

HTH,

AdSR
Jul 18 '05 #15
