encoding problems ( and )

Hi, I'm making a program for formatting strings.
I've added:
#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

str = str.replace('', 'C')
str = str.replace('', 'E')
str = str.replace('', 'E')
str = str.replace('', 'E')
str = str.replace('', 'E')
str = str.replace('', 'E')
doesn't work: it gives me " and , instead of replacing the accented character with E.
If someone has an idea, that would be great.

regards
Bussiere
PS: I've added the whole script below:


__________________________________________________ ________________________


#!/usr/bin/python
# -*- coding: utf-8 -*-
import fileinput, glob, string, sys, os, re

fichA=raw_input("Entrez le nom du fichier d'entree : ")
print ("\n")
fichC=raw_input("Entrez le nom du fichier de sortie : ")
print ("\n")
normalisation1 = raw_input("Normaliser les adresses 1 (ex : Avenue-> AV) (O/N) ou A pour tout normaliser \n")
normalisation1 = normalisation1.upper()

if normalisation1 != "A":
    print ("\n")
    normalisation2 = raw_input("Normaliser les civilits (ex : Docteur-> DR) (O/N) \n")
    normalisation2 = normalisation2.upper()
    print ("\n")
    normalisation3 = raw_input("Normaliser les Adresses 2 (ex : Place-> PL) (O/N) \n")
    normalisation3 = normalisation3.upper()
    normalisation4 = raw_input("Normaliser les caracteres / et - (ex : / -> ) (O/N) \n" )
    normalisation4 = normalisation4.upper()

if normalisation1 == "A":
    normalisation1 = "O"
    normalisation2 = "O"
    normalisation3 = "O"
    normalisation4 = "O"
fiA=open(fichA,"r")
fiC=open(fichC,"w")
compteur = 0

while 1:

    ligneA=fiA.readline()

    if ligneA == "":

        break

    if ligneA != "":
        str = ligneA
        str = str.replace('a', 'A')
        str = str.replace('b', 'B')
        str = str.replace('c', 'C')
        str = str.replace('d', 'D')
        str = str.replace('e', 'E')
        str = str.replace('f', 'F')
        str = str.replace('g', 'G')
        str = str.replace('h', 'H')
        str = str.replace('i', 'I')
        str = str.replace('j', 'J')
        str = str.replace('k', 'K')
        str = str.replace('l', 'L')
        str = str.replace('m', 'M')
        str = str.replace('n', 'N')
        str = str.replace('o', 'O')
        str = str.replace('p', 'P')
        str = str.replace('q', 'Q')
        str = str.replace('r', 'R')
        str = str.replace('s', 'S')
        str = str.replace('t', 'T')
        str = str.replace('u', 'U')
        str = str.replace('v', 'V')
        str = str.replace('w', 'W')
        str = str.replace('x', 'X')
        str = str.replace('y', 'Y')
        str = str.replace('z', 'Z')

        str = str.replace('', 'C')
        str = str.replace('', 'C')
        str = str.replace('', 'E')
        str = str.replace('', 'E')
        str = str.replace('', 'E')
        str = str.replace('', 'E')
        str = str.replace('', 'E')
        str = str.replace('', 'E')
        str = str.replace('', 'E')
        str = str.replace('', 'E')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'A')
        str = str.replace('', 'I')
        str = str.replace('', 'I')
        str = str.replace('', 'I')
        str = str.replace('', 'I')
        str = str.replace('', 'O')
        str = str.replace('', 'O')
        str = str.replace('', 'O')
        str = str.replace('', 'O')
        str = str.replace('','U')
        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')

        if normalisation1 == "O":
            str = str.replace('AVENUE', 'AV')
            str = str.replace('BOULEVARD', 'BD')
            str = str.replace('FAUBOURG', 'FBG')
            str = str.replace('GENERAL', 'GAL')
            str = str.replace('COMMANDANT', 'CMDT')
            str = str.replace('MARECHAL', 'MAL')
            str = str.replace('PRESIDENT', 'PRDT')
            str = str.replace('SAINT', 'ST')
            str = str.replace('SAINTE', 'STE')
            str = str.replace('LOTISSEMENT', 'LOT')
            str = str.replace('RESIDENCE', 'RES')
            str = str.replace('IMMEUBLE', 'IMM')
            str = str.replace('IMEUBLE', 'IMM')
            str = str.replace('BATIMENT', 'BAT')

        if normalisation2 == "O":
            str = str.replace('MONSIEUR', 'M')
            str = str.replace('MR', 'M')
            str = str.replace('MADAME', 'MME')
            str = str.replace('MADEMOISELLE', 'MLLE')
            str = str.replace('DOCTEUR', 'DR')
            str = str.replace('PROFESSEUR', 'PR')
            str = str.replace('MONSEIGNEUR', 'MGR')
            str = str.replace('M ME','MME')
        if normalisation3 == "O":
            str = str.replace('PLACE', 'PL')
            str = str.replace('IMPASSE', 'IMP')
            str = str.replace('ESPLANADE', 'ESP')
            str = str.replace('ROND POINT', 'RPT')
            str = str.replace('ROUTE', 'RTE')
            str = str.replace('PASSAGE', 'PAS')
            str = str.replace('SQUARE', 'SQ')
            str = str.replace('ALLEE', 'ALL')
            str = str.replace('ESCALIER', 'ESC')
            str = str.replace('ETAGE', 'ETG')
            str = str.replace('PORTE', 'PTE')
            str = str.replace('APPARTEMENT', 'APT')
            str = str.replace('APARTEMENT', 'APT')
            str = str.replace('AVENUE', 'AV')
            str = str.replace('BOULEVARD', 'BD')
            str = str.replace('ZONE D ACTIVITE', 'ZA')
            str = str.replace('ZONE D ACTIVITEE', 'ZA')
            str = str.replace('ZONE D AMENAGEMENT CONCERTE', 'ZAC')
            str = str.replace('ZONE D AMENAGEMENT CONCERTEE', 'ZAC')
            str = str.replace('ZONE INDUSTRELLE', 'ZI')
            str = str.replace('CENTRE COMMERCIAL', 'CCAL')
            str = str.replace('CENTRE', 'CTRE')
            str = str.replace('C.CIAL','CCAL')
            str = str.replace('CTRE CIAL','CCAL')
            str = str.replace('CTRE CCAL','CCAL')
            str = str.replace('GALERIE','GAL')
            str = str.replace('MARTYR', 'M')
            str = str.replace('ANCIENS', 'AC')
            str = str.replace('ANCIEN', 'AC')
            str = str.replace('REVEREND PERE','R P')

        if normalisation4 == "O":
            str = str.replace(';\"', ' ')
            str = str.replace('\"', ' ')
            str = str.replace('\'', ' ')
            str = str.replace('-', ' ')
            str = str.replace(',', ' ')
            str = str.replace('\\', ' ')
            str = str.replace('\/', ' ')
            str = str.replace('&', ' ')
            str = str.replace('%', ' ')
            str = str.replace('*', ' ')
            str = str.replace(' ', ' ')
            str = str.replace('.', ' ')
            str = str.replace('_', ' ')
            str = str.replace(' ', ' ')
            str = str.replace(' ', ' ')
            str = str.replace('?', ' ')
            str = str.replace('%', ' ')
            str = str.replace('|', ' ')

        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')
        str = str.replace(' ', ' ')
        fiC.write(str)
        compteur += 1
        print compteur, "\n"
print "FINIT"
fiA.close()
fiC.close()
Mar 23 '06 #1
13 Replies

bussiere bussiere wrote:
Hi, I'm making a program for formatting strings.
I've added:
#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

str = str.replace('', 'C')
...
doesn't work: it gives me " and , instead of replacing the accented character with E.


Are you sure your script and your input file *are* actually encoded in
UTF-8? If it does not work as expected, the data is probably Latin-1, just
like your posting. Try changing the coding to latin-1. Does it work now?
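
A minimal sketch of why the two encodings cannot be mixed, written for Python 3 (the thread itself uses Python 2). The variable names are only illustrative. It shows that the same accented letter is one byte in Latin-1 and two bytes in UTF-8, so a search pattern taken from a UTF-8 source file never matches Latin-1 data:

```python
# Python 3 sketch: one character, two different byte sequences.
E_ACUTE = "\u00e9"                        # e with acute accent

latin1_bytes = E_ACUTE.encode("latin-1")  # a single byte
utf8_bytes = E_ACUTE.encode("utf-8")      # two bytes

# A line as it would sit in a Latin-1 encoded input file (the word "eleve"
# with its accents).
line_latin1 = "\u00e9l\u00e8ve".encode("latin-1")

# Searching with the UTF-8 pattern finds nothing in Latin-1 data...
unchanged = line_latin1.replace(utf8_bytes, b"E")
# ...while the Latin-1 pattern matches as expected.
changed = line_latin1.replace(latin1_bytes, b"E")

print(latin1_bytes, utf8_bytes)
print(unchanged == line_latin1, changed)
```

The reverse mismatch (Latin-1 pattern, UTF-8 data) fails in the same way, which is why the declared coding of the script and the real encoding of the input file have to agree.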

-- Christoph
Mar 23 '06 #2

Seems to work fine for me.
>>> x = "éÇ"
>>> x = x.replace('é', 'E')
>>> x
'E\xc7'
>>> x = x.replace('Ç', 'C')
>>> x
'EC'

You should also be able to use .upper() method to
uppercase everything in the string in a single statement:

tstr=ligneA.upper()

Note: you should never use 'str' as a variable as
it will mask the built-in str function.
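
A small sketch of both points, assuming Python 3 (where upper() works on text rather than bytes). The names ligne_a and shadowing_demo are made up for the illustration:

```python
ligne_a = "12 avenue du general leclerc\n"

# One call replaces the 26 single-letter replace() calls.
upper_line = ligne_a.upper()

def shadowing_demo():
    # Inside this function the name 'str' is rebound to a string value,
    # so the built-in type is no longer reachable under that name.
    str = "some text"
    try:
        return str(42)
    except TypeError as exc:
        return "TypeError: %s" % exc

print(upper_line, end="")
print(shadowing_demo())
```

The second print reports a TypeError, because the local string is not callable.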

-Larry Bates

Mar 23 '06 #3

On 23/03/2006 10:07 PM, bussiere bussiere wrote:
Hi, I'm making a program for formatting strings.
I've added:
#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

str = str.replace('', 'C')
str = str.replace('', 'E')
str = str.replace('', 'E')
str = str.replace('', 'E')
str = str.replace('', 'E')
str = str.replace('', 'E')
doesn't work: it gives me " and , instead of replacing the accented character with E.
If someone has an idea, that would be great.
Hi, I've added some comments below ... I hope they help.
Cheers,
John

regards
Bussiere
ps : i've added the whole script under :
__________________________________________________ ________________________
[snip]
if ligneA != "":
str = ligneA
str = str.replace('a', 'A')
[snip]
str = str.replace('z', 'Z')

str = str.replace('', 'C')
str = str.replace('', 'C')
str = str.replace('', 'E')
str = str.replace('', 'E')
str = str.replace('', 'E')
[snip]
str = str.replace('','U')
You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.
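
As a rough illustration of such a table, here is a sketch for Python 3, where str.maketrans() and str.translate() work on characters. The list of accented letters is an assumption (it is not the poster's list, which did not survive in the thread), and the names ACCENTED, PLAIN, TABLE and normalise are hypothetical:

```python
# Lowercase accented letters as escapes: a-grave, a-circumflex, a-diaeresis,
# c-cedilla, e-acute, e-grave, e-circumflex, e-diaeresis, i-circumflex,
# i-diaeresis, o-circumflex, o-diaeresis, u-grave, u-circumflex, u-diaeresis.
ACCENTED = ("\u00e0\u00e2\u00e4\u00e7\u00e9\u00e8\u00ea\u00eb"
            "\u00ee\u00ef\u00f4\u00f6\u00f9\u00fb\u00fc")
PLAIN = "AAACEEEEIIOOUUU"

# One table covers both cases: each accented letter and its uppercase form
# map to the same unaccented capital.
TABLE = str.maketrans(ACCENTED + ACCENTED.upper(), PLAIN + PLAIN)

def normalise(text):
    """Strip the listed accents, then upshift the remaining letters."""
    return text.translate(TABLE).upper()

print(normalise("Ch\u00e2teau d'\u00c9vry"))
```

Letters that are not in the table pass through translate() untouched and are only upshifted.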
str = str.replace(' ', ' ')
str = str.replace(' ', ' ')
str = str.replace(' ', ' ')
The standard Python idiom for normalising whitespace is
strg = ' '.join(strg.split())

>>> strg = ' ALLO BUSSIERE\tCA VA? '
>>> strg.split()
['ALLO', 'BUSSIERE', 'CA', 'VA?']
>>> ' '.join(strg.split())
'ALLO BUSSIERE CA VA?'
[snip]
if normalisation2 == "O":
str = str.replace('MONSIEUR', 'M')
str = str.replace('MR', 'M')
You need to be very careful with this approach. You are changing EVERY
occurrence of "MR" in the string, not just where it is a whole "word"
meaning "Monsieur".
Constructed example of what can go wrong:
>>> strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
>>> strg.replace('MR', 'M')
'M IME NAGY, 123 PRIMOSE STREET, SHAMOCK VALLEY'


A real, non-constructed history lesson: A certain database indicated
duplicate records by having the annotation "DUP" in the surname field
e.g. "SMITH DUP". Fortunately it was detected in testing that the
so-called clean-up was causing DUPLESSIS to become PLESSIS and DUPRAT to
become RAT!

Two points here: (1) Split up your strings into "words" or "tokens".
Using strg.split() is a start but you may need something more
sophisticated e.g. "-" as an additional token separator. (2) Instead of
writing out all those lines of code, consider putting those
substitutions in a dictionary:

title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
    # etc
    }
Next level of improvement is to read that stuff from a file.
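
The two points can be sketched together as follows (Python 3). The 'LONG=SHORT' file format and the function names load_table and abbreviate are assumptions made for the example, not something proposed in the thread:

```python
TITLE_SUBSTITUTION = {
    "MONSIEUR": "M",
    "MR": "M",
    "MADAME": "MME",
}

def load_table(lines):
    """Parse 'LONG=SHORT' lines (a hypothetical file format) into a dict."""
    table = {}
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            long_form, short_form = line.split("=", 1)
            table[long_form.strip()] = short_form.strip()
    return table

def abbreviate(text, table):
    """Replace whole whitespace-separated tokens only, never substrings."""
    return " ".join(table.get(token, token) for token in text.split())

print(abbreviate("MR IMRE NAGY, 123 PRIMROSE STREET", TITLE_SUBSTITUTION))
```

Because only whole tokens are looked up, IMRE and PRIMROSE survive intact. A token with punctuation attached, such as "MR," would not match, which is where a more careful tokenizer comes in.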
[snip]
if normalisation4 == "O":
str = str.replace(';\"', ' ')
str = str.replace('\"', ' ')
str = str.replace('\'', ' ')
str = str.replace('-', ' ')
str = str.replace(',', ' ')
str = str.replace('\\', ' ')
str = str.replace('\/', ' ')
str = str.replace('&', ' ')

[snip]
Again, consider the string translate() method.
Also, consider that some of those characters may have some meaning that
you perhaps shouldn't blow away e.g. compare 'SMITH & WESSON' with
'SMITH ET WESSON' :-)
Mar 23 '06 #4

John Machin wrote:
You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.


Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8.
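
A sketch of the difference, assuming Python 3, where byte strings and text are separate types (in the thread's Python 2 the roles are played by str and unicode). A byte-level table can only map single bytes, so it works on Latin-1 data and silently does nothing on UTF-8 data:

```python
E_ACUTE = "\u00e9"                                 # e with acute accent

latin1_data = "\u00e9t\u00e9".encode("latin-1")    # 3 bytes, 1 byte per char
utf8_data = "\u00e9t\u00e9".encode("utf-8")        # 5 bytes

# Byte-level table: maps the single Latin-1 byte for e-acute to b"E".
byte_table = bytes.maketrans(E_ACUTE.encode("latin-1"), b"E")
latin1_result = latin1_data.translate(byte_table)  # accents replaced
utf8_result = utf8_data.translate(byte_table)      # nothing matches

# Character-level table applied after decoding works for any encoding.
char_table = str.maketrans(E_ACUTE, "E")
decoded_result = utf8_data.decode("utf-8").translate(char_table)

print(latin1_result, utf8_result, decoded_result)
```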

Peter

Mar 23 '06 #5

On 24/03/2006 8:36 AM, Peter Otten wrote:
John Machin wrote:
You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.


Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8.


I'm sorry, I forgot that there were people who are unaware that
variable-length gizmos like UTF-8 and various legacy CJK encodings are
for storage & transmission, and are better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted.

:-)
Unicode? I'm just a benighted Anglo from the a**-end of the globe; who
am I to be preaching Unicode to a European?
(-:
Mar 23 '06 #6

Peter Otten wrote:
You can replace ALL of this upshifting and accent removal in one blow
by using the string translate() method with a suitable table.


Only if you convert to unicode first or if your data maintains 1 byte
== 1 character, in particular it is not UTF-8.


There's a nice little codec from Skip Montanaro for removing accents from
latin-1 encoded strings. It also has an error handler so you can convert
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py
>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E


So Bussiere could replace a large chunk of his code with:

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
his files are actually in some different encoding.
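
When the input encoding is not known for certain, one common heuristic is to try UTF-8 first and fall back to Latin-1. A sketch for Python 3 follows; the function name decode_line is hypothetical, and the fallback is only a guess, since a Latin-1 decode never fails and therefore cannot detect other encodings:

```python
def decode_line(raw, preferred="utf-8", fallback="latin-1"):
    """Decode bytes as UTF-8 if they are valid UTF-8, else as Latin-1."""
    try:
        return raw.decode(preferred)
    except UnicodeDecodeError:
        return raw.decode(fallback)

word = "\u00e9l\u00e8ve"
print(decode_line(word.encode("utf-8")) == word)
print(decode_line(word.encode("latin-1")) == word)
```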

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(

Mar 24 '06 #7

Duncan Booth wrote:
There's a nice little codec from Skip Montanaro for removing accents from
latin-1 encoded strings. It also has an error handler so you can convert
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py
>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E

So Bussiere could replace a large chunk of his code with:

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(


You made me look into it -- and I found that reusing the decoding map as the
encoding map lets you write
>>> u"lve ".encode("latscii")
'Eleve eee'

without relying on the faulty error handler. I tried to fix the handler,
too:
>>> u"lve ".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"mglich hnlich blich " + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'

No real testing was performed.

Peter

--- latscii_old.py 2006-03-24 11:45:22.580588520 +0100
+++ latscii.py 2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@

 ### Encoding Map

-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map
 ### From Martin Blais
@@ -166,9 +166,9 @@
 ## ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-    key = ord(uerr.object[uerr.start:uerr.end])
+    key = ord(uerr.object[uerr.start])
     try:
-        return unichr(decoding_map[key]), uerr.end
+        return unichr(decoding_map[key]), uerr.start + 1
     except KeyError:
         handler = codecs.lookup_error('replace')
         return handler(uerr)
Mar 24 '06 #8

On 24/03/2006 8:11 PM, Duncan Booth wrote:
Peter Otten wrote:

You can replace ALL of this upshifting and accent removal in one blow
by using the string translate() method with a suitable table.
Only if you convert to unicode first or if your data maintains 1 byte
== 1 character, in particular it is not UTF-8.

There's a nice little codec from Skip Montanaro for removing accents from


For the benefit of those who may read only this far, it is NOT nice.
latin-1 encoded strings. It also has an error handler so you can convert
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

import latscii
import htmlentitydefs
print u'\u00c9'.encode('ascii','replacelatscii')

E
So Bussiere could replace a large chunk of his code with:


Could, but definitely shouldn't.

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(

Some of the transformations are a little unfortunate :-(
0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
The Icelandic thorn letters become P & p (based on physical appearance),
when they should become Th and th.
The German letter Eszett (00DF) becomes B (appearance) when it should be ss.
Creating alphabetics out of punctuation is scarcely something that
bussiere should be interested in:
0x00a2: ord('c'), # ¢
0x00a4: ord('o'), # ¤
0x00a5: ord('Y'), # ¥
0x00a7: ord('S'), # §
0x00a9: ord('c'), # ©
0x00ae: ord('R'), # ®
0x00b6: ord('P'), # ¶
Mar 24 '06 #9

John Machin wrote:
0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
I see information flow from Iceland is a bit better than from Armenia :-)
Some of the transformations are a little unfortunate :-(


The OP, as you pointed out in your first post in this thread, has more
pressing problems with his normalization approach.

Lastly, even if all went well, turning a list of French addresses into an
ascii-uppercase graveyard would be a sad thing to do...

Peter
Mar 24 '06 #10

Duncan Booth wrote:
[...]
Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(


Replace the error handler with this (untested) and it should work with
consecutive accented characters:

def latscii_error( uerr ):
    v = []
    for c in uerr.object[uerr.start:uerr.end]:
        key = ord(c)
        try:
            v.append(unichr(decoding_map[key]))
        except KeyError:
            v.append(u"?")
    return (u"".join(v), uerr.end)
codecs.register_error('replacelatscii', latscii_error)
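
For readers on Python 3, the same idea can be sketched without the latscii module. The table ASCII_FALLBACK and the handler name "asciifallback" are hypothetical stand-ins for latscii's decoding_map and 'replacelatscii'; only codecs.register_error() and the UnicodeEncodeError attributes are standard:

```python
import codecs

# Tiny stand-in for a real replacement table (code point -> ASCII text).
ASCII_FALLBACK = {
    0xC9: "E",   # capital E with acute
    0xE8: "e",   # e with grave
    0xE9: "e",   # e with acute
    0xEA: "e",   # e with circumflex
}

def fallback_error(err):
    """Replace every character of the failing run, one by one."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    run = err.object[err.start:err.end]
    replacement = "".join(ASCII_FALLBACK.get(ord(ch), "?") for ch in run)
    # Resume encoding right after the run that was just handled.
    return replacement, err.end

codecs.register_error("asciifallback", fallback_error)

encoded = "\u00c9l\u00e8ve \u00e9\u00ea\u00e8".encode("ascii", "asciifallback")
print(encoded)
```

Looping over the whole slice is what makes consecutive accented characters safe: the codec may report a run of several unencodable characters in a single error.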

Bye,
Walter Dörwald
Mar 24 '06 #11

On 24/03/2006 11:44 PM, Peter Otten wrote:
John Machin wrote:

0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!

I see information flow from Iceland is a bit better than from Armenia :-)


No information flow needed. Capital letter BLAH -> D and small letter
BLAH -> o should trigger one's palpable nonsense detector for *any* BLAH.

Some of the transformations are a little unfortunate :-(

The OP, as you pointed out in your first post in this thread, has more
pressing problems with his normalization approach.

Lastly, even if all went well, turning a list of French addresses into an
ascii-uppercase graveyard would be a sad thing to do...


Oh indeed. Not only sad, but incredibly stupid. I fervently hope and
trust that such a normalisation is intended only for fuzzy matching
purposes. I can't imagine that anyone would contemplate writing the
output to storage for any reason other than logging or for regression
testing. Update it back to the database? Do you know anyone who would do
that??

Mar 24 '06 #12

John Machin wrote:
Some of the transformations are a little unfortunate :-(


here's a slightly silly way to map a unicode string to its "unaccented"
version:

###

import unicodedata, sys

CHAR_REPLACEMENT = {
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D", # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d", # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
    }

class unaccented_map(dict):

    def mapchar(self, key):
        ch = self.get(key)
        if ch is not None:
            return ch
        ch = unichr(key)
        try:
            ch = unichr(int(unicodedata.decomposition(ch).split()[0], 16))
        except (IndexError, ValueError):
            ch = CHAR_REPLACEMENT.get(key, ch)
        # uncomment the following line if you want to remove remaining
        # non-ascii characters
        # if ch >= u"\x80": return None
        self[key] = ch
        return ch

    if sys.version >= "2.5":
        __missing__ = mapchar
    else:
        __getitem__ = mapchar

assert isinstance(mystring, unicode)

print mystring.translate(unaccented_map())

###

if the source string is not unicode, you can use something like

s = mystring.decode("iso-8859-1")
s = s.translate(unaccented_map())
s = s.encode("ascii", "ignore")

(this works well for characters in the latin-1 range, at least. no
guarantees for other character ranges)
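
A comparable sketch for Python 3, using a different technique from the class above: unicodedata.normalize("NFD", ...) splits an accented letter into its base letter plus combining marks, and the marks are then dropped. The SPECIAL table repeats a few of the CHAR_REPLACEMENT entries for letters that have no decomposition; the names SPECIAL and unaccent are made up for the example:

```python
import unicodedata

# Letters without a canonical decomposition need explicit replacements.
SPECIAL = {
    "\u00c6": "AE",  # LATIN CAPITAL LETTER AE
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u00de": "Th",  # LATIN CAPITAL LETTER THORN
    "\u00fe": "th",  # LATIN SMALL LETTER THORN
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
}

def unaccent(text):
    """Return text with combining accents removed, character by character."""
    pieces = []
    for ch in text:
        if ch in SPECIAL:
            pieces.append(SPECIAL[ch])
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            pieces.append("".join(c for c in decomposed
                                  if not unicodedata.combining(c)))
    return "".join(pieces)

print(unaccent("\u00c9l\u00e8ve \u00e0 Besan\u00e7on"))
print(unaccent("Stra\u00dfe"))
```

As with the original, this is only known to behave sensibly for Latin letters; characters with no decomposition and no SPECIAL entry pass through unchanged.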

</F>

Mar 24 '06 #13

Jean-Paul Calderone wrote:
On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sj******@lexicon.net> wrote:
On 24/03/2006 8:36 AM, Peter Otten wrote:
John Machin wrote:

You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.

Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8.


I'm sorry, I forgot that there were people who are unaware that
variable-length gizmos like UTF-8 and various legacy CJK encodings are
for storage & transmission, and are better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted.


Unfortunately, unicode only appears to solve this problem in a sane manner.


What problem do you mean? Loose matching is solved by unicode in a sane
manner, it is described in the unicode collation algorithm.

Serge.

Mar 25 '06 #14
