Replace accented chars with unaccented ones

Nicolas Bouillon

Hi

I would like to replace accented chars (like "é", "è" or "à") with
unaccented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...

Can you help me, please?
Thanks.

Jul 18 '05 #1

Nicolas Bouillon

Thank you both for your answers. They both work very well.

At first I believed it didn't work, because the mistake I had made was
forgetting the "u" prefix on the string: u"é". Because my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not necessary...
I was wrong.

Bye.

Jul 18 '05 #2

Jeff Epler

You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]
def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

>>> remove_accents(u'\xe9')
u'e'

Second, if you are using a single-byte encoding (iso8859-1, for
instance), then work with byte string:
replacement_map = string.maketrans('\xe9...', 'e...')
def remove_accents(s):
    return s.translate(replacement_map)

>>> remove_accents('\xe9')
'e'

If you want to have strings like u'é' in your programs, you have to
include a line at the top of the source file that tells Python the
encoding, like the following line does:
# -*- coding: utf-8 -*-
(except you have to name the encoding your editor uses, if it's not
utf-8) See http://python.org/peps/pep-0263.html

Once you've done that, you can write
replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.

Jeff

Jul 18 '05 #3
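
For readers on Python 3, the first option above can be sketched roughly as follows. This is only an illustration under stated assumptions: it targets Python 3, where every str is already Unicode and no u prefix is needed, and the REPLACEMENTS list is a made-up three-entry sample rather than a complete table.

```python
# Python 3 sketch of the pair-list approach; str is Unicode by default,
# so the literals need no u'' prefix. REPLACEMENTS is an illustrative
# sample, not an exhaustive list.
REPLACEMENTS = [("é", "e"), ("è", "e"), ("à", "a")]


def remove_accents(text):
    """Apply each (accented, plain) pair in turn with str.replace."""
    for accented, plain in REPLACEMENTS:
        text = text.replace(accented, plain)
    return text


print(remove_accents("déjà là, très près"))  # prints: deja la, tres pres
```

Each pair costs one full pass over the string, which is fine for a handful of replacements.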

Josiah Carlson

Jeff Epler wrote:

You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]
def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

>>> remove_accents(u'\xe9')
u'e'

Second, if you are using a single-byte encoding (iso8859-1, for
instance), then work with byte string:
replacement_map = string.maketrans('\xe9...', 'e...')
def remove_accents(s):
    return s.translate(replacement_map)

>>> remove_accents('\xe9')
'e'

If you want to have strings like u'é' in your programs, you have to
include a line at the top of the source file that tells Python the
encoding, like the following line does:
# -*- coding: utf-8 -*-
(except you have to name the encoding your editor uses, if it's not
utf-8) See http://python.org/peps/pep-0263.html

Once you've done that, you can write
replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.

Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

One pass through the file gives an O(len(inp)) algorithm, much better
(running-time wise) than the string.replace method that runs in
O(len(inp) * len(replacement_pairs)) time as given.

- Josiah

Jul 18 '05 #4
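
A small Python 3 sketch of the comparison made above, for anyone who wants to check that the two approaches agree. The names REPLACEMENT_PAIRS, MAPPING and replace_loop are illustrative assumptions, and the pair list is a three-entry sample. Note that the dictionary version looks up one character at a time, so it only works when every key is a single character.

```python
# Python 3 sketch; REPLACEMENT_PAIRS is a small illustrative sample.
REPLACEMENT_PAIRS = [("é", "e"), ("è", "e"), ("à", "a")]
MAPPING = dict(REPLACEMENT_PAIRS)


def replace_loop(text, pairs=REPLACEMENT_PAIRS):
    """One str.replace pass per pair: O(len(text) * len(pairs))."""
    for accented, plain in pairs:
        text = text.replace(accented, plain)
    return text


def multi_replace(text, mapping=MAPPING):
    """Single pass with one dict lookup per character: O(len(text))."""
    return "".join(mapping.get(ch, ch) for ch in text)


sample = "élève à l'école"
assert replace_loop(sample) == multi_replace(sample)
print(multi_replace(sample))  # prints: eleve a l'ecole
```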

Fuzzyman

Nicolas Bouillon <bo***@bouil.org.invalid> wrote in message news:<EW*******************@nntpserver.swip.net>...

Thank you both for your answers. They both work very well.

At first I believed it didn't work, because the mistake I had made was
forgetting the "u" prefix on the string: u"é". Because my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not necessary...
I was wrong.

Bye.

The 'utils1' package includes a file called charmap, which contains a
function to map text to ASCII. It originally comes from a 'Python
snippet' on SourceForge, I believe.

http://www.voidspace.org.uk/atlantib...thonutils.html

Regards,
Fuzzy

Jul 18 '05 #5

Michael Hudson

Jeff Epler <je****@unpythonic.net> writes:

You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]
def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

There must be some more high powered way of doing this... something
like:

import unicodedata

def remove_accent1(c):
    # first code point of the decomposed form is the base letter
    return unicodedata.normalize('NFD', c)[0]
def remove_accents(s):
    return u''.join(map(remove_accent1, s))

?

Cheers,
mwh

--
We've had a lot of problems going from glibc 2.0 to glibc 2.1.
People claim binary compatibility. Except for functions they
don't like. -- Peter Van Eynde, comp.lang.lisp

Jul 18 '05 #6
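
A Python 3 sketch along the lines of the normalization idea above. It is a variant rather than the code as posted: instead of taking the first code point of each character's decomposition, it decomposes the whole string once and drops the combining marks with unicodedata.combining. Letters that have no canonical decomposition (for example "ø" or "ß") pass through unchanged, so this alone does not guarantee ASCII output.

```python
import unicodedata


def remove_accents(text):
    """Decompose to NFD, then drop the combining marks (the accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents("é è à ç ñ"))  # prints: e e a c n
```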

Jeff Epler

On Mon, Mar 15, 2004 at 06:19:00PM -0800, Josiah Carlson wrote:

Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

One pass through the file gives an O(len(inp)) algorithm, much better
(running-time wise) than the string.replace method that runs in
O(len(inp) * len(replacement_pairs)) time as given.

Thanks for posting this. My other code was pretty hopeless, but for
some reason .get(i, i) didn't come to mind as a solution.

Jeff

Jul 18 '05 #7

Jeff Epler

On Tue, Mar 16, 2004 at 08:26:08AM +0100, Nicolas Bouillon wrote:

Thank you both for your answers. They both work very well.

At first I believed it didn't work, because the mistake I had made was
forgetting the "u" prefix on the string: u"é". Because my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not necessary...
I was wrong.

When there are non-unicode string literals in a file, they are simply
byte sequences. Take this program, for instance:

# -*- coding: utf-8 -*-
s = "é"
print len(s), repr(s)

$ python bytestr.py
2 '\xc3\xa9'

Jeff

Jul 18 '05 #8
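
The same distinction can be seen in Python 3, where the two kinds of string are separate types. This is a minimal sketch assuming Python 3 and a UTF-8 source file; the variable names text and data are only for illustration.

```python
# Python 3 sketch: str holds code points, bytes holds the encoded form.
text = "é"
data = text.encode("utf-8")

print(len(text), repr(text))  # prints: 1 'é'
print(len(data), repr(data))  # prints: 2 b'\xc3\xa9'

# Replacing an accented character only makes sense on the decoded text.
assert data.decode("utf-8") == text
```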

Noah

Nicolas Bouillon <bo***@bouil.org.invalid> wrote in message news:<Ta*******************@nntpserver.swip.net>...

Hi

I would like to replace accented chars (like "é", "è" or "à") with
unaccented ones ("é" -> "e", "è" -> "e", "à" -> "a").

I have tried the string.replace method, but it seems to dislike non-ASCII chars...

The following is the code that I use. This looks like what you are asking for.

In case this gets corrupted you can also find it here:
http://sourceforge.net/snippet/detai...ppet&id=101229
This has some improvements to readability and speed, but it is basically
the same:
http://aspn.activestate.com/ASPN/Coo.../Recipe/251871

Yours,
Noah

#!/usr/bin/env python
"""
UNICODE Hammer -- The Stupid American

I needed something that would take a UNICODE string and
smack it into ASCII. This function doesn't just strip out the characters.
It tries to convert Latin-1 characters into ASCII equivalents where possible.

We get customer mailing address data from Europe, but most of our systems
cannot handle the Latin-1 characters. All I needed was to prepare addresses
for a few different shipping systems that we use.
None of these systems support anything but ASCII.
After getting headaches trying to deal with this problem using Python's
built-in UNICODE support I gave up and decided to write something that
would solve the problem the American way -- with brute force.
I convert all european accented letters to their unaccented equivalents.
I realize this isn't perfect, but for my purposes the packages get delivered.

Noah Spurrier no**@noah.org
License free and public domain
"""

def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# This creates a string with all the characters in the latin-1 character set
# then it converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('','latin-1')
    for c in range(32,256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)

Jul 18 '05 #9
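
For comparison, a much shorter Python 3 sketch of the same "smack it into ASCII" idea. It is not a port of the code above: it swaps the hand-written accent table for Unicode NFD decomposition and keeps only a tiny fallback table for letters that do not decompose. The FALLBACK entries and the name to_ascii are hypothetical; the entries copy a few choices from the table above and are far from complete. As in the original, anything that cannot be converted is deleted.

```python
import unicodedata

# Hypothetical fallback table for letters with no canonical decomposition.
# The values mirror a few entries of the xlate table above; not exhaustive.
FALLBACK = {"ß": "ss", "æ": "ae", "Æ": "Ae", "ø": "o", "Ø": "O"}


def to_ascii(text):
    """Map text to 7-bit ASCII, deleting whatever cannot be converted."""
    out = []
    for ch in text:
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Keep only the ASCII code points: this drops combining marks
        # and any character that has no ASCII base letter.
        out.append("".join(c for c in decomposed if ord(c) < 0x80))
    return "".join(out)


print(to_ascii("Straße, Ærø, café"))  # prints: Strasse, Aero, cafe
```

The trade-off is that the symbol substitutions of the original table (such as '{cent}' or '{1/2}') are lost unless they are added to the fallback table by hand.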

Martin v. Löwis

Josiah Carlson wrote:

Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

Using the .translate() method on unicode strings should be
even more performant:

# prepare mapping table to match .translate interface
table = {}
for k,v in replacement_pairs: table[ord(k)]=v

def multi_replace(inp):
    return inp.translate(table)

Regards,
Martin

Jul 18 '05 #10
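
In Python 3 the .translate() approach carries over almost unchanged, since str.translate accepts a dict keyed by code point. The sketch below assumes Python 3; REPLACEMENT_PAIRS is an illustrative three-entry sample and TABLE is a made-up name.

```python
# Python 3 sketch; REPLACEMENT_PAIRS is a small illustrative sample.
REPLACEMENT_PAIRS = [("é", "e"), ("è", "e"), ("à", "a")]

# str.translate expects a mapping from code point (int) to replacement.
TABLE = {ord(k): v for k, v in REPLACEMENT_PAIRS}


def multi_replace(text):
    return text.translate(TABLE)


print(multi_replace("très élevé, voilà"))  # prints: tres eleve, voila

# str.maketrans builds the same table from a dict of one-character keys.
assert str.maketrans(dict(REPLACEMENT_PAIRS)) == TABLE
```

The values may be longer than one character (for example "ß" to "ss"), but each key must be a single character.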
