Nicolas Bouillon <bo***@bouil.org.invalid> wrote in message news:<Ta*******************@nntpserver.swip.net>...
Hi
I would like to replace accented chars (like "é", "è" or "à") with
unaccented ones ("é" -> "e", "è" -> "e", "à" -> "a").
I have tried the string.replace method, but it seems to dislike non-ASCII chars...
The following is the code that I use. This looks like what you are asking for.
In case this gets corrupted you can also find it here:
http://sourceforge.net/snippet/detai...ppet&id=101229
This has some improvements to readability and speed, but it is basically
the same:
http://aspn.activestate.com/ASPN/Coo.../Recipe/251871
Yours,
Noah
#!/usr/bin/env python
"""
UNICODE Hammer -- The Stupid American
I needed something that would take a UNICODE string and
smack it into ASCII. This function doesn't just strip out the characters.
It tries to convert Latin-1 characters into ASCII equivalents where possible.
We get customer mailing address data from Europe, but most of our systems
cannot handle the Latin-1 characters. All I needed was to prepare addresses
for a few different shipping systems that we use.
None of these systems support anything but ASCII.
After getting headaches trying to deal with this problem using Python's
built-in UNICODE support I gave up and decided to write something that
would solve the problem the American way -- with brute force.
I convert all European accented letters to their unaccented equivalents.
I realize this isn't perfect, but for my purposes the packages get delivered.
Noah Spurrier
no**@noah.org
License free and public domain
"""
def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    # Maps Latin-1 code points (0xa1-0xff) to ASCII replacement strings.
    xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }
    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            # Non-ASCII character with no entry in the table: drop it.
            pass
        else:
            r += i
    return r
# This gives an example of how to use latin1_to_ascii().
# This creates a string with all the characters in the Latin-1 character set,
# then it converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('', 'latin-1')
    for c in range(32, 256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
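
The script above is written for Python 2. For anyone on Python 3, here is a minimal sketch of a different technique that gets a similar result for the accented letters: decompose each character with the standard unicodedata module and throw away the combining marks. This is not the code from the post. The name strip_accents and the FALLBACK table are made up for illustration, and the fallback entries are simply copied from the xlate table above for the letters that have no decomposition.

```python
import unicodedata

# Latin-1 letters that NFKD does not decompose, so they need explicit
# replacements. Assumption: values copied from the xlate table in the post.
FALLBACK = str.maketrans({
    0xc6: 'Ae', 0xe6: 'ae',   # AE ligature
    0xd8: 'O',  0xf8: 'o',    # O with stroke
    0xd0: 'Th', 0xf0: 'th',   # eth
    0xde: 'th', 0xfe: 'th',   # thorn
    0xdf: 'ss',               # sharp s
})

def strip_accents(text):
    """Return text with accents removed and leftover non-ASCII dropped."""
    # Handle the letters that have no decomposition first.
    text = text.translate(FALLBACK)
    # Split each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks, which are the accents themselves.
    no_marks = ''.join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Anything still outside 7-bit ASCII is deleted.
    return no_marks.encode('ascii', 'ignore').decode('ascii')
```

With this sketch, strip_accents("é è à") returns "e e a". Unlike the table in the post, symbols such as the copyright sign are dropped rather than spelled out as {C}, so the lookup-table approach is still the better fit if you want readable replacements for symbols.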