
Replace accented chars with unaccented ones

Hi

I would like to replace accented chars (like "é", "è" or "à") with
non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...

Can you help me please?
Thanks.
Jul 18 '05 #1
Thank you both for your answers. They both work very well.

At first I thought it didn't work, but the mistake I had made was
forgetting the "u" prefix on the string: u"é". Since my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
necessary... I was wrong.

Bye.
Jul 18 '05 #2
You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]

def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

remove_accents(u'\xe9')
u'e'

Second, if you are using a single-byte encoding (iso8859-1, for
instance), then work with byte strings:

import string

replacement_map = string.maketrans('\xe9...', 'e...')

def remove_accents(s):
    return s.translate(replacement_map)

remove_accents('\xe9')
'e'
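
Spelled out as a runnable sketch (Python 2; only the three characters from the
question are mapped here, and the two maketrans arguments must line up
position for position):

import string

# Latin-1 bytes for "é", "è" and "à", mapped to their unaccented counterparts.
replacement_map = string.maketrans('\xe9\xe8\xe0', 'eea')

def remove_accents(s):
    # str.translate() pushes every byte through the 256-entry table.
    return s.translate(replacement_map)

print remove_accents('caf\xe9')   # -> 'cafe'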

If you want to have strings like u'é' in your programs, you have to
include a line at the top of the source file that tells Python the
encoding, like the following line does:
# -*- coding: utf-8 -*-
(except you have to name the encoding your editor uses, if it's not
utf-8). See http://python.org/peps/pep-0263.html

Once you've done that, you can write
replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.
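
For example, the three replacements from the original question could then be
written directly (assuming the file really is saved as UTF-8):

# -*- coding: utf-8 -*-
replacements = [(u'é', 'e'), (u'è', 'e'), (u'à', 'a')]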

Jeff

Jul 18 '05 #3
Jeff Epler wrote:
You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]

def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

remove_accents(u'\xe9')
u'e'

Second, if you are using a single-byte encoding (iso8859-1, for
instance), then work with byte strings:

import string

replacement_map = string.maketrans('\xe9...', 'e...')

def remove_accents(s):
    return s.translate(replacement_map)

remove_accents('\xe9')
'e'

If you want to have strings like u'é' in your programs, you have to
include a line at the top of the source file that tells Python the
encoding, like the following line does:
# -*- coding: utf-8 -*-
(except you have to name the encoding your editor uses, if it's not
utf-8). See http://python.org/peps/pep-0263.html

Once you've done that, you can write
replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.


Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

One pass through the file gives an O(len(inp)) algorithm, much better
(running-time wise) than the string.replace method that runs in
O(len(inp) * len(replacement_pairs)) time as given.
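
A quick usage sketch (the definition is repeated so the example stands alone;
the replacement pairs are assumed for illustration):

replacement_pairs = [(u'\xe9', u'e'), (u'\xe8', u'e'), (u'\xe0', u'a')]
mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    # One dictionary lookup per character; unmapped characters pass through.
    return u''.join([mapping.get(i, i) for i in inp])

print multi_replace(u'caf\xe9 cr\xe8me')   # -> u'cafe creme'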

- Josiah
Jul 18 '05 #4
Nicolas Bouillon <bo***@bouil.org.invalid> wrote in message news:<EW*******************@nntpserver.swip.net>...
Thank you both for your answers. They both work very well.

At first I thought it didn't work, but the mistake I had made was
forgetting the "u" prefix on the string: u"é". Since my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
necessary... I was wrong.

Bye.


The 'utils1' package includes a file called charmap, which is a
function to map to ASCII. It originally comes from a 'python
snippet' on SourceForge, I believe:

http://www.voidspace.org.uk/atlantib...thonutils.html

Regards,
Fuzzy
Jul 18 '05 #5
Jeff Epler <je****@unpythonic.net> writes:
You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]

def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u


There must be some higher-powered way of doing this... something
like:

import unicodedata

def remove_accent1(c):
    return unicodedata.normalize('NFD', c)[0]

def remove_accents(s):
    return u''.join(map(remove_accent1, s))

?
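
A slightly fuller sketch of the same idea (Python 2; instead of keeping only
the first codepoint of the decomposition, it drops all combining marks, and
characters that don't decompose pass through unchanged):

import unicodedata

def remove_accents(s):
    # Decompose accented characters into base letter + combining marks,
    # then throw the marks (Unicode category 'Mn') away.
    nfd = unicodedata.normalize('NFD', s)
    return u''.join([c for c in nfd if unicodedata.category(c) != 'Mn'])

print remove_accents(u'\xe9\xe8\xe0')   # -> u'eea'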

Cheers,
mwh

--
We've had a lot of problems going from glibc 2.0 to glibc 2.1.
People claim binary compatibility. Except for functions they
don't like. -- Peter Van Eynde, comp.lang.lisp
Jul 18 '05 #6
On Mon, Mar 15, 2004 at 06:19:00PM -0800, Josiah Carlson wrote:
Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

One pass through the file gives an O(len(inp)) algorithm, much better
(running-time wise) than the string.replace method that runs in
O(len(inp) * len(replacement_pairs)) time as given.


Thanks for posting this. My other code was pretty hopeless, but for
some reason .get(i, i) didn't come to mind as a solution.

Jeff

Jul 18 '05 #7
On Tue, Mar 16, 2004 at 08:26:08AM +0100, Nicolas Bouillon wrote:
Thank you both for your answers. They both work very well.

At first I thought it didn't work, but the mistake I had made was
forgetting the "u" prefix on the string: u"é". Since my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
necessary... I was wrong.


When there are non-unicode string literals in a file, they are simply
byte sequences. Take this program, for instance:

# -*- coding: utf-8 -*-
s = "é"
print len(s), repr(s)

$ python bytestr.py
2 '\xc3\xa9'
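
For contrast, the same program with the u prefix (same coding declaration; the
file name unistr.py is just for illustration) decodes the literal into a
single character:

# -*- coding: utf-8 -*-
s = u"é"
print len(s), repr(s)

$ python unistr.py
1 u'\xe9'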

Jeff

Jul 18 '05 #8
Nicolas Bouillon <bo***@bouil.org.invalid> wrote in message news:<Ta*******************@nntpserver.swip.net>...
Hi

I would like to replace accented chars (like "é", "è" or "à") with
non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...


The following is the code that I use. This looks like what you are asking for.

In case this gets corrupted you can also find it here:
http://sourceforge.net/snippet/detai...ppet&id=101229
This has some improvements to readability and speed, but it is basically
the same:
http://aspn.activestate.com/ASPN/Coo.../Recipe/251871

Yours,
Noah

#!/usr/bin/env python
"""
UNICODE Hammer -- The Stupid American

I needed something that would take a UNICODE string and
smack it into ASCII. This function doesn't just strip out the characters.
It tries to convert Latin-1 characters into ASCII equivalents where possible.

We get customer mailing address data from Europe, but most of our systems
cannot handle the Latin-1 characters. All I needed was to prepare addresses
for a few different shipping systems that we use.
None of these systems support anything but ASCII.
After getting headaches trying to deal with this problem using Python's
built-in UNICODE support I gave up and decided to write something that
would solve the problem the American way -- with brute force.
I convert all european accented letters to their unaccented equivalents.
I realize this isn't perfect, but for my purposes the packages get delivered.

Noah Spurrier no**@noah.org
License free and public domain
"""

def latin1_to_ascii (unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# It creates a string with all the characters in the latin-1 character set,
# then converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('','latin-1')
    for c in range(32,256):
        if c != 0x7f:
            s = s + unicode(chr(c),'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
Jul 18 '05 #9
Josiah Carlson wrote:
Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])


Using the .translate() method on unicode strings should be
even more performant:

# prepare mapping table to match .translate interface
table = {}
for k,v in replacement_pairs: table[ord(k)]=v

def multi_replace(inp):
    return inp.translate(table)
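
A quick hedged check of this (note that for unicode.translate the replacement
values themselves need to be Unicode strings or ordinals; the pairs are
assumed for illustration):

replacement_pairs = [(u'\xe9', u'e'), (u'\xe8', u'e'), (u'\xe0', u'a')]

# prepare the mapping table as described above
table = {}
for k, v in replacement_pairs:
    table[ord(k)] = v

print u'caf\xe9 cr\xe8me'.translate(table)   # -> u'cafe creme'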

Regards,
Martin

Jul 18 '05 #10
> r += xlate[ord(i)]
r += i


Perhaps I'm going to have to create a signature and drop information
about this in every post to c.l.py, but repeated string additions are
slow as hell for any reasonably long string. It is much faster to
place characters into a list and ''.join() them.
>>> import time
>>> def test_s(l):
...     t = time.time()
...     for i in xrange(100):
...         a = ''
...         for j in xrange(l):
...             a += '0'
...     return time.time()-t
...
>>> def test_l(l):
...     t = time.time()
...     for i in xrange(100):
...         a = ''.join(['0' for j in xrange(l)])
...     return time.time()-t
...
>>> i = 128
>>> while i < 4097:
...     print test_s(i), test_l(i)
...     i *= 2
...
0.0150001049042 0.0309998989105
0.0469999313354 0.047000169754
0.140999794006 0.109000205994
0.343999862671 0.203000068665
0.905999898911 0.40700006485
2.56200003624 0.828000068665

At 256 characters long, it looks about even. Anything longer and
''.join(lst) is significantly faster.

When we do something like the below, the overhead of creating short
lists is significant, but it is still faster when l is greater than
roughly 2048:
a = []
for i in xrange(l):
    a += ['0']
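
The same comparison can be packaged a bit more compactly with the standard
timeit module (a rough sketch; the 100 repetitions above are kept, and the
absolute numbers will of course vary by machine and interpreter):

import timeit

# Build a string of length n by repeated += versus ''.join() of a list
# comprehension, timing 100 repetitions of each.
for n in (128, 256, 512, 1024, 2048, 4096):
    t_add = timeit.Timer(
        "a = ''\nfor j in xrange(%d): a += '0'" % n).timeit(100)
    t_join = timeit.Timer(
        "a = ''.join(['0' for j in xrange(%d)])" % n).timeit(100)
    print n, t_add, t_join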
- Josiah
Jul 18 '05 #11
> Using the .translate() method on unicode strings should be
even more performant:

# prepare mapping table to match .translate interface
table = {}
for k,v in replacement_pairs: table[ord(k)]=v

def multi_replace(inp):
    return inp.translate(table)


Even better *smile*.

- Josiah
Jul 18 '05 #12
Josiah Carlson <jc******@nospam.uci.edu> wrote in message news:<c3**********@news.service.uci.edu>...
r += xlate[ord(i)]
r += i


Perhaps I'm going to have to create a signature and drop information
about this in every post to c.l.py, but repeated string additions are
slow as hell for any reasonably long string. It is much faster to
place characters into a list and ''.join() them.


True. Is this better?

... body of latin1_to_ascii() ...
r = []
for i in unicrap:
    if xlate.has_key(ord(i)):
        r.append(xlate[ord(i)])
    elif ord(i) >= 0x80:
        pass
    else:
        r.append(i)
return ''.join(r)
Yours,
Noah
Jul 18 '05 #13
Noah wrote:
Josiah Carlson <jc******@nospam.uci.edu> wrote in message news:<c3**********@news.service.uci.edu>...
r += xlate[ord(i)]
r += i


Perhaps I'm going to have to create a signature and drop information
about this in every post to c.l.py, but repeated string additions are
slow as hell for any reasonably long string. It is much faster to
place characters into a list and ''.join() them.

True. Is this better?

... body of latin1_to_ascii() ...
r = []
for i in unicrap:
    if xlate.has_key(ord(i)):
        r.append(xlate[ord(i)])
    elif ord(i) >= 0x80:
        pass
    else:
        r.append(i)
return ''.join(r)


I'd use:
''.join([xlate.get(ord(i), i) for i in unicrap
         if ord(i) in xlate or ord(i) < 0x80])

Using r.append(), in general, while being faster than string addition,
is significantly slower than using list comprehensions.

- Josiah
Jul 18 '05 #14
Nicolas Bouillon <bo***@bouil.org.invalid> wrote:
Hi

I would like to replace accented chars (like "é", "è" or "à") with
non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...

Can you help me please ?
Thanks.


You could try experimenting with the 'unicodedata' module:

>>> import unicodedata
>>> [unicodedata.name(x) for x in u'123 abc @#$ \u00ff']
['DIGIT ONE', 'DIGIT TWO', 'DIGIT THREE', 'SPACE', 'LATIN SMALL LETTER A',
 'LATIN SMALL LETTER B', 'LATIN SMALL LETTER C', 'SPACE', 'COMMERCIAL AT',
 'NUMBER SIGN', 'DOLLAR SIGN', 'SPACE', 'LATIN SMALL LETTER Y WITH DIAERESIS']
>>> unicodedata.lookup('latin capital letter a with grave')
u'\xc0'

You could strip the ' WITH...' part when applicable and convert names
back to string. You would only need to process characters with ord >=
160.
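
A rough sketch of that idea (the helper names are made up; characters without
a ' WITH ...' part in their name, or whose stripped name doesn't exist, are
simply left alone):

import unicodedata

def strip_accent(c):
    # e.g. 'LATIN SMALL LETTER E WITH ACUTE' -> 'LATIN SMALL LETTER E' -> u'e'
    try:
        name = unicodedata.name(c)
    except ValueError:
        return c
    if ' WITH ' not in name:
        return c
    try:
        return unicodedata.lookup(name.split(' WITH ')[0])
    except KeyError:
        return c

def remove_accents(s):
    return u''.join([strip_accent(c) for c in s])

print remove_accents(u'\xe9\xe8\xe0')   # -> u'eea'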

HTH,

AdSR
Jul 18 '05 #15
