encoding problems (é and è)

Hi, I'm making a program for formatting strings.
I've added:
#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

str = str.replace('Ç', 'C')
str = str.replace('é', 'E')
str = str.replace('É', 'E')
str = str.replace('è', 'E')
str = str.replace('È', 'E')
str = str.replace('ê', 'E')
doesn't work: it gives me " and , instead of replacing é with E.
If someone has an idea, that would be great.

regards
Bussiere
PS: I've added the whole script below:


__________________________________________________ ________________________


#!/usr/bin/python
# -*- coding: utf-8 -*-
import fileinput, glob, string, sys, os, re

fichA=raw_input("Entrez le nom du fichier d'entree : ")
print ("\n")
fichC=raw_input("Entrez le nom du fichier de sortie : ")
print ("\n")
normalisation1 = raw_input("Normaliser les adresses 1 (ex : Avenue-> AV) (O/N) ou A pour tout normaliser \n")
normalisation1 = normalisation1.upper()

if normalisation1 != "A":
    print ("\n")
    normalisation2 = raw_input("Normaliser les civilités (ex : Docteur-> DR) (O/N) \n")
    normalisation2 = normalisation2.upper()
    print ("\n")
    normalisation3 = raw_input("Normaliser les Adresses 2 (ex : Place-> PL) (O/N) \n")
    normalisation3 = normalisation3.upper()
    normalisation4 = raw_input("Normaliser les caracteres / et - (ex : / -> ) (O/N) \n" )
    normalisation4 = normalisation4.upper()

if normalisation1 == "A":
    normalisation1 = "O"
    normalisation2 = "O"
    normalisation3 = "O"
    normalisation4 = "O"
fiA=open(fichA,"r")
fiC=open(fichC,"w")
compteur = 0

while 1:

    ligneA=fiA.readline()

    if ligneA == "":
        break

    if ligneA != "":
        str = ligneA
        str = str.replace('a', 'A')
        str = str.replace('b', 'B')
        str = str.replace('c', 'C')
        str = str.replace('d', 'D')
        str = str.replace('e', 'E')
        str = str.replace('f', 'F')
        str = str.replace('g', 'G')
        str = str.replace('h', 'H')
        str = str.replace('i', 'I')
        str = str.replace('j', 'J')
        str = str.replace('k', 'K')
        str = str.replace('l', 'L')
        str = str.replace('m', 'M')
        str = str.replace('n', 'N')
        str = str.replace('o', 'O')
        str = str.replace('p', 'P')
        str = str.replace('q', 'Q')
        str = str.replace('r', 'R')
        str = str.replace('s', 'S')
        str = str.replace('t', 'T')
        str = str.replace('u', 'U')
        str = str.replace('v', 'V')
        str = str.replace('w', 'W')
        str = str.replace('x', 'X')
        str = str.replace('y', 'Y')
        str = str.replace('z', 'Z')

        str = str.replace('ç', 'C')
        str = str.replace('Ç', 'C')
        str = str.replace('é', 'E')
        str = str.replace('É', 'E')
        str = str.replace('è', 'E')
        str = str.replace('È', 'E')
        str = str.replace('ê', 'E')
        str = str.replace('Ê', 'E')
        str = str.replace('ë', 'E')
        str = str.replace('Ë', 'E')
        str = str.replace('ä', 'A')
        str = str.replace('Ä', 'A')
        str = str.replace('à', 'A')
        str = str.replace('À', 'A')
        str = str.replace('Á', 'A')
        str = str.replace('Â', 'A')
        str = str.replace('Ä', 'A')
        str = str.replace('Ã', 'A')
        str = str.replace('â', 'A')
        str = str.replace('Ä', 'A')
        str = str.replace('ï', 'I')
        str = str.replace('Ï', 'I')
        str = str.replace('î', 'I')
        str = str.replace('Î', 'I')
        str = str.replace('ô', 'O')
        str = str.replace('Ô', 'O')
        str = str.replace('ö', 'O')
        str = str.replace('Ö', 'O')
        str = str.replace('Ú','U')
        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')

        if normalisation1 == "O":
            str = str.replace('AVENUE', 'AV')
            str = str.replace('BOULEVARD', 'BD')
            str = str.replace('FAUBOURG', 'FBG')
            str = str.replace('GENERAL', 'GAL')
            str = str.replace('COMMANDANT', 'CMDT')
            str = str.replace('MARECHAL', 'MAL')
            str = str.replace('PRESIDENT', 'PRDT')
            str = str.replace('SAINT', 'ST')
            str = str.replace('SAINTE', 'STE')
            str = str.replace('LOTISSEMENT', 'LOT')
            str = str.replace('RESIDENCE', 'RES')
            str = str.replace('IMMEUBLE', 'IMM')
            str = str.replace('IMEUBLE', 'IMM')
            str = str.replace('BATIMENT', 'BAT')

        if normalisation2 == "O":
            str = str.replace('MONSIEUR', 'M')
            str = str.replace('MR', 'M')
            str = str.replace('MADAME', 'MME')
            str = str.replace('MADEMOISELLE', 'MLLE')
            str = str.replace('DOCTEUR', 'DR')
            str = str.replace('PROFESSEUR', 'PR')
            str = str.replace('MONSEIGNEUR', 'MGR')
            str = str.replace('M ME','MME')
        if normalisation3 == "O":
            str = str.replace('PLACE', 'PL')
            str = str.replace('IMPASSE', 'IMP')
            str = str.replace('ESPLANADE', 'ESP')
            str = str.replace('ROND POINT', 'RPT')
            str = str.replace('ROUTE', 'RTE')
            str = str.replace('PASSAGE', 'PAS')
            str = str.replace('SQUARE', 'SQ')
            str = str.replace('ALLEE', 'ALL')
            str = str.replace('ESCALIER', 'ESC')
            str = str.replace('ETAGE', 'ETG')
            str = str.replace('PORTE', 'PTE')
            str = str.replace('APPARTEMENT', 'APT')
            str = str.replace('APARTEMENT', 'APT')
            str = str.replace('AVENUE', 'AV')
            str = str.replace('BOULEVARD', 'BD')
            str = str.replace('ZONE D ACTIVITE', 'ZA')
            str = str.replace('ZONE D ACTIVITEE', 'ZA')
            str = str.replace('ZONE D AMENAGEMENT CONCERTE', 'ZAC')
            str = str.replace('ZONE D AMENAGEMENT CONCERTEE', 'ZAC')
            str = str.replace('ZONE INDUSTRELLE', 'ZI')
            str = str.replace('CENTRE COMMERCIAL', 'CCAL')
            str = str.replace('CENTRE', 'CTRE')
            str = str.replace('C.CIAL','CCAL')
            str = str.replace('CTRE CIAL','CCAL')
            str = str.replace('CTRE CCAL','CCAL')
            str = str.replace('GALERIE','GAL')
            str = str.replace('MARTYR', 'M')
            str = str.replace('ANCIENS', 'AC')
            str = str.replace('ANCIEN', 'AC')
            str = str.replace('REVEREND PERE','R P')

        if normalisation4 == "O":
            str = str.replace(';\"', ' ')
            str = str.replace('\"', ' ')
            str = str.replace('\'', ' ')
            str = str.replace('-', ' ')
            str = str.replace(',', ' ')
            str = str.replace('\\', ' ')
            str = str.replace('\/', ' ')
            str = str.replace('&', ' ')
            str = str.replace('%', ' ')
            str = str.replace('*', ' ')
            str = str.replace(' ', ' ')
            str = str.replace('.', ' ')
            str = str.replace('_', ' ')
            str = str.replace(' ', ' ')
            str = str.replace(' ', ' ')
            str = str.replace('?', ' ')
            str = str.replace('%', ' ')
            str = str.replace('|', ' ')

        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')
        fiC.write(str)
        compteur += 1
        print compteur, "\n"
print "FINIT"
fiA.close()
fiC.close()
Mar 23 '06 #1
bussiere bussiere wrote:
Hi, I'm making a program for formatting strings.
I've added:
#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

str = str.replace('Ç', 'C')
...
doesn't work: it gives me " and , instead of replacing é with E.


Are you sure your script and your input file *are* actually encoded with
utf-8? If it does not work as expected, they are probably latin-1, just
like your posting. Try changing the coding to latin-1. Does it work now?
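
To illustrate why a mismatch between the script's encoding and the file's encoding makes replace() miss, here is a minimal sketch. It is written in Python 3 syntax (the thread itself uses Python 2), and the sample word "café" is only an assumption for demonstration, not data from the original post.

```python
# Sketch (Python 3): the same character has different byte
# representations in different encodings.
text = "é"

utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # one byte: 0xE9

# A line read from a latin-1 file, kept as raw bytes.
line_latin1 = "café".encode("latin-1")

# Searching for the UTF-8 form of é in latin-1 data finds nothing ...
unchanged = line_latin1.replace(utf8_bytes, b"E")
# ... while searching for the latin-1 form works.
changed = line_latin1.replace(latin1_bytes, b"E")

# Reading UTF-8 bytes as if they were latin-1 yields two wrong characters.
mojibake = utf8_bytes.decode("latin-1")

print(unchanged, changed, mojibake)
```

If the pattern and the data disagree about the encoding, the replacement silently does nothing, which would be consistent with the symptom described above.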

-- Christoph
Mar 23 '06 #2
Seems to work fine for me.
>>> x="éÇ"
>>> x=x.replace('é','E')
>>> x
'E\xc7'
>>> x=x.replace('Ç','C')
>>> x
'EC'

You should also be able to use .upper() method to
uppercase everything in the string in a single statement:

tstr=ligneA.upper()
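
One caveat with .upper(), sketched here in Python 3 syntax (an assumption; the thread uses Python 2): it uppercases accented letters only when it operates on decoded text. On raw encoded bytes, only the ASCII letters change. The sample string is made up for illustration.

```python
# Sketch (Python 3): upper() on text versus upper() on encoded bytes.
text = "élève à côté"

# On text (str), accented letters are uppercased too.
upper_text = text.upper()

# On UTF-8 bytes, only ASCII letters a-z are affected; the bytes that
# encode the accented letters are left untouched.
raw = text.encode("utf-8")
upper_raw = raw.upper().decode("utf-8")

print(upper_text)
print(upper_raw)
```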

Note: you should never use 'str' as a variable name, as
it will shadow the built-in str type.

-Larry Bates

Mar 23 '06 #3
On 23/03/2006 10:07 PM, bussiere bussiere wrote:
Hi, I'm making a program for formatting strings.
I've added:
#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

str = str.replace('Ç', 'C')
str = str.replace('é', 'E')
str = str.replace('É', 'E')
str = str.replace('è', 'E')
str = str.replace('È', 'E')
str = str.replace('ê', 'E')
doesn't work: it gives me " and , instead of replacing é with E.
If someone has an idea, that would be great.
Hi, I've added some comments below ... I hope they help.
Cheers,
John

regards
Bussiere
PS: I've added the whole script below:
__________________________________________________ ________________________
[snip]
if ligneA != "":
str = ligneA
str = str.replace('a', 'A')
[snip]
str = str.replace('z', 'Z')

str = str.replace('ç', 'C')
str = str.replace('Ç', 'C')
str = str.replace('é', 'E')
str = str.replace('É', 'E')
str = str.replace('è', 'E')
[snip]
str = str.replace('Ú','U')
You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.
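
A minimal sketch of the translate() idea, in Python 3 syntax (an assumption; in Python 2 the input would first have to be decoded to unicode). The names GROUPS, TABLE and normalise are hypothetical, and the character groups are simply taken from the replace() calls in the posted script.

```python
# Sketch (Python 3): one translation table instead of dozens of replace() calls.
# Keys are the plain replacement letters, values the accented characters
# listed in the original script.
GROUPS = {
    "C": "çÇ",
    "E": "éÉèÈêÊëË",
    "A": "äÄàÀÁÂÃâ",
    "I": "ïÏîÎ",
    "O": "ôÔöÖ",
    "U": "Ú",
}

# str.translate() accepts a dict mapping code points to replacement strings.
TABLE = {ord(ch): plain for plain, chars in GROUPS.items() for ch in chars}


def normalise(text):
    """Strip the listed accents, then uppercase the remaining letters."""
    return text.translate(TABLE).upper()


print(normalise("Élève, garçon"))
print(normalise("Noël à côté"))
```

Characters that are not in the table (for example ù) pass through translate() unchanged and are then uppercased with their accent, so the table would need extending for full coverage.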
str = str.replace(' ', ' ')
str = str.replace(' ', ' ')
str = str.replace(' ', ' ')
The standard Python idiom for normalising whitespace is
strg = ' '.join(strg.split())
>>> strg = ' ALLO BUSSIERE\tCA VA? '
>>> strg.split()
['ALLO', 'BUSSIERE', 'CA', 'VA?']
>>> ' '.join(strg.split())
'ALLO BUSSIERE CA VA?'
[snip]
if normalisation2 == "O":
str = str.replace('MONSIEUR', 'M')
str = str.replace('MR', 'M')
You need to be very careful with this approach. You are changing EVERY
occurrence of "MR" in the string, not just where it is a whole "word"
meaning "Monsieur".
Constructed example of what can go wrong:
>>> strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
>>> strg.replace('MR', 'M')
'M IME NAGY, 123 PRIMOSE STREET, SHAMOCK VALLEY'


A real, non-constructed history lesson: A certain database indicated
duplicate records by having the annotation "DUP" in the surname field
e.g. "SMITH DUP". Fortunately it was detected in testing that the
so-called clean-up was causing DUPLESSIS to become PLESSIS and DUPRAT to
become RAT!

Two points here: (1) Split up your strings into "words" or "tokens".
Using strg.split() is a start but you may need something more
sophisticated e.g. "-" as an additional token separator. (2) Instead of
writing out all those lines of code, consider putting those
substitutions in a dictionary:

title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
    # etc
    }
Next level of improvement is to read that stuff from a file.
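
The two points above can be sketched together as follows, in Python 3 syntax (an assumption). The function name substitute_words and the exact table contents are hypothetical; the tokenisation here is deliberately simple and treats every run of uppercase ASCII letters as one word, so "-" and "," act as separators.

```python
import re

# Hypothetical substitution table, built from entries in the posted script.
TITLE_SUBSTITUTION = {
    "MONSIEUR": "M",
    "MR": "M",
    "MADAME": "MME",
    "MADEMOISELLE": "MLLE",
    "DOCTEUR": "DR",
}

# A "word" is a maximal run of uppercase ASCII letters (input is assumed
# to be already upshifted and accent-free).
WORD = re.compile(r"[A-Z]+")


def substitute_words(line, table):
    """Replace whole words found in the table; leave everything else alone."""
    return WORD.sub(lambda m: table.get(m.group(0), m.group(0)), line)


strg = "MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY"
print(substitute_words(strg, TITLE_SUBSTITUTION))
```

Because only complete words are looked up, the "MR" inside IMRE, PRIMROSE and SHAMROCK is left untouched.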
[snip]
if normalisation4 == "O":
str = str.replace(';\"', ' ')
str = str.replace('\"', ' ')
str = str.replace('\'', ' ')
str = str.replace('-', ' ')
str = str.replace(',', ' ')
str = str.replace('\\', ' ')
str = str.replace('\/', ' ')
str = str.replace('&', ' ')

[snip]
Again, consider the string translate() method.
Also, consider that some of those characters may have some meaning that
you perhaps shouldn't blow away e.g. compare 'SMITH & WESSON' with
'SMITH ET WESSON' :-)
Mar 23 '06 #4
John Machin wrote:
You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.


Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8.
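
A small sketch of this restriction, in Python 3 syntax (an assumption): a byte-level table maps single bytes, so it can handle latin-1 data but not UTF-8 data, where é occupies two bytes. The sample word "été" is made up for illustration.

```python
# Sketch (Python 3): byte-level translate() only works when 1 byte == 1 character.
byte_table = bytes.maketrans(b"\xe9", b"E")  # latin-1 byte for é -> E

latin1 = "été".encode("latin-1")
utf8 = "été".encode("utf-8")

from_latin1 = latin1.translate(byte_table)  # both é bytes are replaced
from_utf8 = utf8.translate(byte_table)      # no 0xE9 byte present: unchanged

# Decoding first and translating the text works for any input encoding.
from_text = utf8.decode("utf-8").translate({ord("é"): "E"})

print(from_latin1, from_utf8, from_text)
```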

Peter

Mar 23 '06 #5
On 24/03/2006 8:36 AM, Peter Otten wrote:
John Machin wrote:
You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.


Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8.


I'm sorry, I forgot that there were people who are unaware that
variable-length gizmos like UTF-8 and various legacy CJK encodings are
for storage & transmission, and are better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted.

:-)
Unicode? I'm just a benighted Anglo from the a**-end of the globe; who
am I to be preaching Unicode to a European?
(-:
Mar 23 '06 #6
Peter Otten wrote:
You can replace ALL of this upshifting and accent removal in one blow
by using the string translate() method with a suitable table.


Only if you convert to unicode first or if your data maintains 1 byte
== 1 character, in particular it is not UTF-8.


There's a nice little codec from Skip Montanaro for removing accents from
latin-1 encoded strings. It also has an error handler so you can convert
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E


So Bussiere could replace a large chunk of his code with:

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(

Mar 24 '06 #7
Duncan Booth wrote:
There's a nice little codec from Skip Montanaro for removing accents from
latin-1 encoded strings. It also has an error handler so you can convert
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E

So Bussiere could replace a large chunk of his code with:

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(


You made me look into it -- and I found that reusing the decoding map as the
encoding map lets you write
>>> u"Élève ééé".encode("latscii")
'Eleve eee'

without relying on the faulty error handler. I tried to fix the handler,
too:
>>> u"Élève ééé".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"möglich ähnlich üblich ááá" + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'

No real testing was performed.

Peter

--- latscii_old.py 2006-03-24 11:45:22.580588520 +0100
+++ latscii.py 2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@

 ### Encoding Map

-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map
 ### From Martin Blais
@@ -166,9 +166,9 @@
 ## ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-    key = ord(uerr.object[uerr.start:uerr.end])
+    key = ord(uerr.object[uerr.start])
     try:
-        return unichr(decoding_map[key]), uerr.end
+        return unichr(decoding_map[key]), uerr.start + 1
     except KeyError:
         handler = codecs.lookup_error('replace')
         return handler(uerr)
Mar 24 '06 #8
On 24/03/2006 8:11 PM, Duncan Booth wrote:
Peter Otten wrote:

You can replace ALL of this upshifting and accent removal in one blow
by using the string translate() method with a suitable table.
Only if you convert to unicode first or if your data maintains 1 byte
== 1 character, in particular it is not UTF-8.

There's a nice little codec from Skip Montanaro for removing accents from


For the benefit of those who may read only this far, it is NOT nice.
latin-1 encoded strings. It also has an error handler so you can convert
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

import latscii
import htmlentitydefs
print u'\u00c9'.encode('ascii','replacelatscii')

E
So Bussiere could replace a large chunk of his code with:


Could, but definitely shouldn't.

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(

Some of the transformations are a little unfortunate :-(
0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
The Icelandic thorn letters become P & p (based on physical appearance),
when they should become Th and th.
The German letter Eszett (00DF) becomes B (appearance) when it should be ss.
Creating alphabetics out of punctuation is scarcely something that
bussiere should be interested in:
0x00a2: ord('c'), # ¢
0x00a4: ord('o'), # ¤
0x00a5: ord('Y'), # ¥
0x00a7: ord('S'), # §
0x00a9: ord('c'), # ©
0x00ae: ord('R'), # ®
0x00b6: ord('P'), # ¶
Mar 24 '06 #9
John Machin wrote:
0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
I see information flow from Iceland is a bit better than from Armenia :-)
Some of the transformations are a little unfortunate :-(


The OP, as you pointed out in your first post in this thread, has more
pressing problems with his normalization approach.

Lastly, even if all went well, turning a list of French addresses into an
ascii-uppercase graveyard would be a sad thing to do...

Peter
Mar 24 '06 #10
Duncan Booth wrote:
[...]
Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(


Replace the error handler with this (untested) and it should work with
consecutive accented characters:

def latscii_error( uerr ):
    v = []
    for c in uerr.object[uerr.start:uerr.end]:
        key = ord(c)
        try:
            v.append(unichr(decoding_map[key]))
        except KeyError:
            v.append(u"?")
    return (u"".join(v), uerr.end)
codecs.register_error('replacelatscii', latscii_error)
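
For comparison, here is a self-contained sketch of the same kind of error handler in Python 3 syntax (an assumption; the code above is Python 2). It does not use latscii's decoding_map: as a stand-in, it takes the base letter from the Unicode canonical decomposition via unicodedata. The handler name 'strip_accents_sketch' is hypothetical.

```python
import codecs
import unicodedata


def strip_accents_error(err):
    """Replace every unencodable character in the reported span by its
    base letter, or by '?' when no ASCII base letter exists."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    pieces = []
    # The span may cover several consecutive characters, so loop over it.
    for ch in err.object[err.start:err.end]:
        base = unicodedata.normalize("NFD", ch)[0]
        pieces.append(base if ord(base) < 128 else "?")
    # Return the replacement text and the position where encoding resumes.
    return "".join(pieces), err.end


codecs.register_error("strip_accents_sketch", strip_accents_error)

result = "Élève ééé".encode("ascii", "strip_accents_sketch")
gamma = "\N{GREEK CAPITAL LETTER GAMMA}"
mixed = ("ááá" + gamma * 3).encode("ascii", "strip_accents_sketch")
print(result, mixed)
```

Looping over the whole span from start to end is what makes consecutive accented characters safe, which is the same idea as in the handler above.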

Bye,
Walter Dörwald
Mar 24 '06 #11
On 24/03/2006 11:44 PM, Peter Otten wrote:
John Machin wrote:

0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!

I see information flow from Iceland is a bit better than from Armenia :-)


No information flow needed. Capital letter BLAH -> D and small letter
BLAH -> o should trigger one's palpable nonsense detector for *any* BLAH.

Some of the transformations are a little unfortunate :-(

The OP, as you pointed out in your first post in this thread, has more
pressing problems with his normalization approach.

Lastly, even if all went well, turning a list of French addresses into an
ascii-uppercase graveyard would be a sad thing to do...


Oh indeed. Not only sad, but incredibly stupid. I fervently hope and
trust that such a normalisation is intended only for fuzzy matching
purposes. I can't imagine that anyone would contemplate writing the
output to storage for any reason other than logging or for regression
testing. Update it back to the database? Do you know anyone who would do
that??

Mar 24 '06 #12
John Machin wrote:
Some of the transformations are a little unfortunate :-(


here's a slightly silly way to map a unicode string to its "unaccented"
version:

###

import unicodedata, sys

CHAR_REPLACEMENT = {
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D", # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d", # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
    }

class unaccented_map(dict):

    def mapchar(self, key):
        ch = self.get(key)
        if ch is not None:
            return ch
        ch = unichr(key)
        try:
            ch = unichr(int(unicodedata.decomposition(ch).split()[0], 16))
        except (IndexError, ValueError):
            ch = CHAR_REPLACEMENT.get(key, ch)
        # uncomment the following line if you want to remove remaining
        # non-ascii characters
        # if ch >= u"\x80": return None
        self[key] = ch
        return ch

    if sys.version >= "2.5":
        __missing__ = mapchar
    else:
        __getitem__ = mapchar

assert isinstance(mystring, unicode)

print mystring.translate(unaccented_map())

###

if the source string is not unicode, you can use something like

s = mystring.decode("iso-8859-1")
s = s.translate(unaccented_map())
s = s.encode("ascii", "ignore")

(this works well for characters in the latin-1 range, at least. no
guarantees for other character ranges)
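
A rough Python 3 counterpart of the approach above, as a sketch only (Python 3 is an assumption; the posted code is Python 2). Instead of a lazily filled dict subclass it decomposes each character with unicodedata.normalize and drops the combining marks. The function name unaccent and the SPECIAL table name are hypothetical; the replacements themselves are copied from CHAR_REPLACEMENT.

```python
import unicodedata

# Letters without a canonical decomposition, with the replacements
# used in the post above.
SPECIAL = {
    "Æ": "AE", "Ð": "D", "Ø": "OE", "Þ": "Th", "ß": "ss",
    "æ": "ae", "ð": "d", "ø": "oe", "þ": "th",
}


def unaccent(text):
    """Return text with accents removed and special letters spelled out."""
    out = []
    for ch in text:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
            continue
        # NFD splits 'é' into 'e' plus a combining acute accent;
        # combining() is non-zero for such marks, so they are dropped.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(unaccent("Élève ééé"))
print(unaccent("Straße"))
```

As with the original, this is only expected to behave sensibly for characters in the latin-1 range.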

</F>

Mar 24 '06 #13
Jean-Paul Calderone wrote:
On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sj******@lexicon.net> wrote:
On 24/03/2006 8:36 AM, Peter Otten wrote:
John Machin wrote:

You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.

Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8.


I'm sorry, I forgot that there were people who are unaware that
variable-length gizmos like UTF-8 and various legacy CJK encodings are
for storage & transmission, and are better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted.


Unfortunately, unicode only appears to solve this problem in a sane manner.


What problem do you mean? Loose matching is solved by unicode in a sane
manner, it is described in the unicode collation algorithm.

Serge.

Mar 25 '06 #14
