Bytes IT Community

utf - string translation

hg
Hi,

I'm bringing over a thread that's going on on f.c.l.python.

The point was to get rid of french accents from words.

We noticed that len('é') != len('a') and I found the hack below to fix
the "problem" ... yet I do not understand - especially since 'é' is
included in the extended ASCII table, and thus can be stored in one byte.

Any clue ?

hg

# -*- coding: utf-8 -*-
import string

def convert(mot):
    print len(mot)
    print mot[0]
    print '%x' % ord(mot[1])
    table = string.maketrans('àâä éèêë îï ôö ùû ü',
        '\x00a\x00a\x00a \x00e\x00e\x00e\x00e\x00i\x00i\x00o\x00o\x00u\x00u \x00u')
    return mot.translate(table).replace('\x00', '')

c = 'b a '
print convert(c)
Nov 22 '06 #1
22 Replies


hg wrote:
We noticed that len('é') != len('a')
sounds odd.
>>> len('é') == len('a')
True

are you perhaps using an UTF-8 editor?

to keep your sanity, no matter what editor you're using, I recommend
adding a coding directive to the source file, and using *only* Unicode
string literals for non-ASCII text.

or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>
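A minimal sketch of that byte/character distinction in modern Python syntax (where str is always Unicode and encoded data is a separate bytes type):

```python
s = 'é'                   # one character
b = s.encode('utf-8')     # its UTF-8 encoding: two bytes
print(len(s))             # 1
print(len(b))             # 2
```

This is exactly why a UTF-8 editor makes a bare 'é' literal two bytes long in Python 2.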

Nov 22 '06 #2

hg
Fredrik Lundh wrote:
hg wrote:
>We noticed that len('é') != len('a')

sounds odd.
>>> len('é') == len('a')
True

are you perhaps using an UTF-8 editor?

to keep your sanity, no matter what editor you're using, I recommend
adding a coding directive to the source file, and using *only* Unicode
string literals for non-ASCII text.

or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>
Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('é')

returns 1 then 2

and string.maketrans(str1, str2) requires that len(str1) == len(str2)

hg

Nov 22 '06 #3

hg
hg wrote:
Fredrik Lundh wrote:
>hg wrote:
>>We noticed that len('é') != len('a')
sounds odd.
>>>> len('é') == len('a')
True

are you perhaps using an UTF-8 editor?

to keep your sanity, no matter what editor you're using, I recommend
adding a coding directive to the source file, and using *only* Unicode
string literals for non-ASCII text.

or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>

Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('é')

returns 1 then 2

and string.maketrans(str1, str2) requires that len(str1) == len(str2)

hg


PS: I'm running this under IDLE
Nov 22 '06 #4

hg <hg@nospam.com> wrote:
>or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>

Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('é')

returns 1 then 2
And if you do what was suggested and write:

# -*- coding: utf-8 -*-
import string
print len(u'a')
print len(u'é')

then you get:

1
1
Nov 22 '06 #5

hg
Duncan Booth wrote:
hg <hg@nospam.com> wrote:
>>or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>
Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('é')

returns 1 then 2

And if you do what was suggested and write:

# -*- coding: utf-8 -*-
import string
print len(u'a')
print len(u'é')

then you get:

1
1
OK,

How would you handle the string.maketrans then ?

hg

Nov 22 '06 #6

hg wrote:
How would you handle the string.maketrans then ?
maketrans works on bytes, not characters. what makes you think that you
can use maketrans if you haven't gotten the slightest idea what's in the
string?

if you want to get rid of accents in a Unicode string, you can do the
approaches described here

http://www.peterbe.com/plog/unicode-to-ascii

or here

http://effbot.org/zone/unicode-convert.htm

which both work on any Unicode string.

</F>
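For reference, the core of the approach described on those pages is Unicode normalization; a minimal sketch (function name mine):

```python
import unicodedata

def strip_accents(s):
    # NFKD splits each accented letter into base letter + combining mark;
    # dropping the combining marks leaves the unaccented text.
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('Vigenère'))  # Vigenere
```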

Nov 22 '06 #7

hg
Fredrik Lundh wrote:
hg wrote:
>How would you handle the string.maketrans then ?

maketrans works on bytes, not characters. what makes you think that you
can use maketrans if you haven't gotten the slightest idea what's in the
string?

if you want to get rid of accents in a Unicode string, you can do the
approaches described here

http://www.peterbe.com/plog/unicode-to-ascii

or here

http://effbot.org/zone/unicode-convert.htm

which both work on any Unicode string.

</F>
Thanks
Nov 22 '06 #8

hg wrote:
Duncan Booth wrote:
hg <hg@nospam.com> wrote:
>or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>

Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('é')

returns 1 then 2
And if you do what was suggested and write:

# -*- coding: utf-8 -*-
import string
print len(u'a')
print len(u'é')

then you get:

1
1
Some general comments:

1. There has been at least one thread on the subject of ripping accents
off Latin1 characters in the last 3 or 4 months. Try Google.

2. About your earlier problem, when len(thing1) != len(thing2):
In that and similar situations, it can be *very* useful to use this
technique:
print repr(thing1), type(thing1)
print repr(thing2), type(thing2)
Go back now and try it out!
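In today's Python the same diagnostic makes the str/bytes split visible at a glance (sample data mine):

```python
text = 'café'                # decoded text
data = text.encode('utf-8')  # encoded bytes
print(repr(text), type(text))   # 'café' <class 'str'>
print(repr(data), type(data))   # b'caf\xc3\xa9' <class 'bytes'>
```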
OK,

How would you handle the string.maketrans then ?
I suggest that you first read the documentation on the str and unicode
"translate" methods.
You can obtain this quickly at the interactive prompt by doing
help(''.translate)
and
help(u''.translate)
respectively.

Next steps:

Is your *real* data (not the examples you were hard-coding earlier)
encoded (latin1, utf8) in str objects or is it in unicode objects?
After reading previous posts my head is spinning & I'm not going to
guess; you determine it yourself.

[pseudocode -- blend of Pythonic & Knuthian styles]
if latin1: (A) you can use string.maketrans and str.translate
immediately.

elif unicode: (B) either (1) encode to latin1; goto (A) or (2) use
unicode.translate with do-it-yourself mapping

elif utf8: decode to unicode; goto (B)

else: ???
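That decision tree can be sketched as code; in modern Python the str/bytes split makes the branches explicit (helper name and fallback choice are mine):

```python
def to_text(data, encoding='utf-8'):
    # Decoded text passes straight through.
    if isinstance(data, str):
        return data
    # Encoded bytes: try the likely encoding first, then fall back
    # to latin1, which maps every byte to a character and never fails.
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode('latin1')

print(to_text(b'caf\xc3\xa9'))  # café  (UTF-8 bytes)
print(to_text(b'caf\xe9'))      # café  (latin1 bytes)
```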

HTH,
John

Nov 22 '06 #9

Dan
Thank you for your answers.

In fact, I'm just getting started with Python.

I was looking to transform a text through elementary cryptographic
processes (Vigenère).
The initial text is in a file, and my system is under UTF-8 by default
(Ubuntu).

Nov 22 '06 #10

Dan wrote:
Thank you for your answers.

In fact, I'm just getting started with Python.
That was a good decision. Welcome!
>
I was looking to transform a text through elementary cryptographic
processes (Vigenère).
So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "à chacun
d'eux million francs" or "à chacun deux million francs" with the
remainder to a 3rd party.

The initial text is in a file, and my system is under UTF-8 by default
(Ubuntu)
Your system being "under UTF-8" does give you some clue, I suppose. Do
find the time to locate some data with accents and do print(repr(data))
as I suggested, to *verify* what you've got.

Don't guess. Different underlying representations can look the same
when rendered on your screen. Don't rely on what sysadmins tell you.
Peculiar things can happen, e.g.

me: How is your data encoded?
them: XYZese [a language]
me: I'll try again; Are you using encoding A or encoding B?
them: We've heard A mentioned; what's an encoding anyway?
[snip long explanation plus investigation of what locales [plural] had
been used when configuring their workstations and servers]
them: OK, so there's more than one way of representing XYZese on a
computer. That might explain why the government regulatory authority
for our industry is very sad [to put it mildly] about not being able to
read our monthly filings!!!

Cheers,
John

Nov 22 '06 #11

In article <11**********************@k70g2000cwa.googlegroups.com>,
John Machin <sj******@lexicon.net> wrote:
So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "à chacun
d'eux million francs" or "à chacun deux million francs" with the
remainder to a 3rd party.
The difference there, though, is a punctuation character, not an accent.

--
David Wild using RISC OS on broadband
Nov 22 '06 #12


David H Wild wrote:
In article <11**********************@k70g2000cwa.googlegroups.com>,
John Machin <sj******@lexicon.net> wrote:
So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "à chacun
d'eux million francs" or "à chacun deux million francs" with the
remainder to a 3rd party.

The difference there, though, is a punctuation character, not an accent.
I did say "differences in punctuation or accents". Yes, the only
example I could recall OTTOMH was a difference in punctuation --
according to legend, a fly-spot IIRC :-)

Nov 22 '06 #13

David H Wild wrote:
In article <11**********************@k70g2000cwa.googlegroups.com>,
John Machin <sj******@lexicon.net> wrote:
So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "à chacun
d'eux million francs" or "à chacun deux million francs" with the
remainder to a 3rd party.

The difference there, though, is a punctuation character, not an accent.
It's not too hard to imagine an accentual difference, eg:

Le soldat protège avec le fusil --the soldier protects with the gun
Le soldat protégé avec le fusil --the soldier who is protected by
the gun (perhaps a cannon)

Contrived example, I realize, but there are scads of such instances.
(Caveat: my French is also very rusty.)

-Mike

Nov 22 '06 #14

Klaas wrote:
It's not too hard to imagine an accentual difference, eg:
especially in languages where certain combinations really are distinct
letters, not just letters with accents or silly marks.

I have a Swedish children's book somewhere, in which some characters are
harassed by a big ugly monster who carries a sign around his neck that
says "Monster".

the protagonist ends up adding two dots to that sign, turning it into
"Mnster" (meaning "model", in the "model citizen" sense), and all ends
well.

just imagine that story in reverse.

</F>

Nov 23 '06 #15

On Wed, 22 Nov 2006 22:59:01 +0100, John Machin <sj******@lexicon.net>
wrote:
[snip]
So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "à chacun
d'eux million francs" or "à chacun deux million francs" with the
remainder to a 3rd party.
It may not be to store or even use the actual text. I stumbled on a
problem like this some time ago: I had some code building an index for a
document and wanted the entries starting with "e", "é", "è" or "ê" to be
in the same section...
--
python -c "print ''.join([chr(154 - ord(c)) for c in
'U(17zX(%,5.zmz5(17l8(%,5.Z*(93-965$l7+-'])"
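An index-grouping key along those lines might look like this (sample entries mine), using NFKD normalization to fold the accents:

```python
import unicodedata

def index_key(word):
    # Fold case and accents so 'e', 'é', 'è' and 'ê' sort together.
    decomposed = unicodedata.normalize('NFKD', word.lower())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

entries = ['été', 'eau', 'exemple', 'école']
print(sorted(entries, key=index_key))  # ['eau', 'école', 'été', 'exemple']
```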
Nov 23 '06 #16

Dan
On 22 nov, 22:59, "John Machin" <sjmac...@lexicon.net> wrote:
processes (Vigenère)
So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "a chacun
d'eux million francs" or "a chacun deux million francs" with the
remainder to a 3rd party.
Of course.
My purpose is not to do something realistic from a cryptographic point
of view; it's for learning the rudiments of programming.
In fact, character coding is a kind of cryptography: sometimes friends
can't read an email because of the characters used...

I wanted to strip off accents because I use the frequencies of the
characters. With only 26 characters it's easier to analyse (the
text can be shorter, for example).
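A letter-frequency count over such a folded 26-letter alphabet might be sketched as (sample text mine):

```python
from collections import Counter

# Character frequencies over a 26-letter alphabet, as used in
# classical cryptanalysis of a Vigenère-style cipher.
text = 'LE CHIFFRE DE VIGENERE RESISTE A UNE ANALYSE SIMPLE'
letters = [c for c in text if c.isalpha()]
freq = Counter(letters)
print(freq.most_common(1))  # E is by far the most frequent letter
```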

Nov 26 '06 #17

Dan wrote:
On 22 nov, 22:59, "John Machin" <sjmac...@lexicon.net> wrote:

>>processes (Vigenère)
So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "a chacun
d'eux million francs" or "a chacun deux million francs" with the
remainder to a 3rd party.

Of course.
My purpose is not to do something realistic from a cryptographic point
of view; it's for learning the rudiments of programming.
In fact, character coding is a kind of cryptography: sometimes friends
can't read an email because of the characters used...

I wanted to strip off accents because I use the frequencies of the
characters. With only 26 characters it's easier to analyse (the
text can be shorter, for example).

Try this:

from_characters = '\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff\xe7\xe8\xe9\xea\xeb'
to_characters = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
translation_table = string.maketrans(from_characters, to_characters)
translated_string = string.translate(original_string, translation_table)
Frederic
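For comparison, in modern Python str.maketrans maps characters rather than bytes, so the accented letters can be written literally (table contents mine, a French-oriented subset):

```python
# str.maketrans in Python 3 takes the strings directly and counts
# lengths in characters (code points), not bytes.
table = str.maketrans('àâäéèêëîïôöùûüç', 'aaaeeeeiioouuuc')
print('Vigenère chiffré'.translate(table))  # Vigenere chiffre
```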
Nov 29 '06 #18


Frederic Rentsch wrote:
Try this:

from_characters = '\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff\xe7\xe8\xe9\xea\xeb'
to_characters = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
translation_table = string.maketrans(from_characters, to_characters)
translated_string = string.translate(original_string, translation_table)
A few observations on the above:

1. This assumes that "original_string" is a str object, and the text is
encoded in latin1 or similar (e.g. cp1252).

2. Presentation of the map could be improved greatly, along the lines
of:

import pprint
import unicodedata
fromc = \
[snip]
toc = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
assert len(fromc) == len(toc)
tups = list(zip(unicode(fromc, 'latin1'), toc))
tups.sort()
tupsu = [(x[1], x[0], unicodedata.name(x[0], '** no name **')) for x in tups]
pprint.pprint(tupsu)

which produces:

[('A', u'\xc0', 'LATIN CAPITAL LETTER A WITH GRAVE'),
('A', u'\xc1', 'LATIN CAPITAL LETTER A WITH ACUTE'),
[snip]
('D', u'\xd0', 'LATIN CAPITAL LETTER ETH'),
[snip]
('Y', u'\xdd', 'LATIN CAPITAL LETTER Y WITH ACUTE'),
('a', u'\xe0', 'LATIN SMALL LETTER A WITH GRAVE'),
[snip]
('o', u'\xf0', 'LATIN SMALL LETTER ETH'),
[snip]
('y', u'\xfd', 'LATIN SMALL LETTER Y WITH ACUTE'),
('y', u'\xff', 'LATIN SMALL LETTER Y WITH DIAERESIS')]

This makes it a lot easier to see what is going on, and check for
weirdness, like the inconsistent treatment of \xd0 and \xf0.

3. ... and to check for missing maps. The OP may be working only with
French text, and may not care about Icelandic and German letters, but
other readers who stumble on this (and miss past thread(s) on this
topic) may like something done with \xde (capital thorn), \xfe (small
thorn) and \xdf (sharp s aka Eszett).

Cheers,
John

Nov 29 '06 #19

John Machin wrote:
3. ... and to check for missing maps. The OP may be working only with
French text, and may not care about Icelandic and German letters, but
other readers who stumble on this (and miss past thread(s) on this
topic) may like something done with \xde (capital thorn), \xfe (small
thorn) and \xdf (sharp s aka Eszett).
I did post links to code that does this to this thread, several days ago...

</F>

Nov 29 '06 #20


Fredrik Lundh wrote:
John Machin wrote:
3. ... and to check for missing maps. The OP may be working only with
French text, and may not care about Icelandic and German letters, but
other readers who stumble on this (and miss past thread(s) on this
topic) may like something done with \xde (capital thorn), \xfe (small
thorn) and \xdf (sharp s aka Eszett).

I did post links to code that does this to this thread, several days ago...
Ah yes, I missed that -- and your posting doesn't advertise that the
code fixed the "one character should be mapped to two" cases :-)

This code
(http://effbot.python-hosting.com/fil...xt/unaccent.py)
looks generally very good, but I'm left wondering why "AE" and "OE" in
the table, not "Ae" and "Oe":
[snip]
0xc6: u"AE", # LATIN CAPITAL LETTER AE <<<=== ??
0xd0: u"D", # LATIN CAPITAL LETTER ETH
0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE <<<=== ??
0xde: u"Th", # LATIN CAPITAL LETTER THORN
[snip]

Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:
u"\u0141ukasziewicz".translate(unaccented_map())
doesn't work unless an entry is added to the no-decomposition table:
0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE

It looks like generating extra entries like that could be done, with
the aid of unicodedata.name():

LATIN CAPITAL LETTER X WITH blahblah -> "X"
LATIN SMALL LETTER X WITH blahblah -> "X".lower()

This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.
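A sketch of that name-based generation (helper name mine; as the post says, special cases would still need eyeballing by regional experts):

```python
import unicodedata

def base_letter(ch):
    # 'LATIN CAPITAL LETTER L WITH STROKE' -> 'L', and so on;
    # characters whose names don't match the pattern pass through.
    name = unicodedata.name(ch, '')
    if name.startswith('LATIN ') and ' WITH ' in name:
        base = name.split(' WITH ')[0].split()[-1]
        return base if 'CAPITAL' in name else base.lower()
    return ch

print(''.join(base_letter(c) for c in '\u0141uka\u015b'))  # Lukas
```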

Cheers,
John

Nov 29 '06 #21

John Machin wrote:
Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:
u"\u0141ukasziewicz".translate(unaccented_map())
doesn't work unless an entry is added to the no-decomposition table:
0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE

It looks like generating extra entries like that could be done, with
the aid of unicodedata.name():

LATIN CAPITAL LETTER X WITH blahblah -> "X"
LATIN SMALL LETTER X WITH blahblah -> "X".lower()

This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.
see the comments over at

http://effbot.org/zone/unicode-convert.htm

for an extended table, eyeballed by a regional expert (and since he
makes the same point about OE vs Oe as you do, I'll probably have to
change the code ;-)

</F>

Nov 29 '06 #22


Fredrik Lundh wrote:
John Machin wrote:
Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:
u"\u0141ukasziewicz".translate(unaccented_map())
doesn't work unless an entry is added to the no-decomposition table:
0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE

It looks like generating extra entries like that could be done, with
the aid of unicodedata.name():

LATIN CAPITAL LETTER X WITH blahblah -> "X"
LATIN SMALL LETTER X WITH blahblah -> "X".lower()

This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.

see the comments over at

http://effbot.org/zone/unicode-convert.htm
Don't rush me, I was getting to that next :-)
>
for an extended table, eyeballed by a regional expert (and since he
makes the same point about OE vs Oe as you do, I'll probably have to
change the code ;-)
Slightly extended. My point is that there is a large number of LATIN
(CAPITAL|SMALL) LETTER X WITH twiddly-bits that don't have a
decomposition; the table entries could be generated automatically.

As well as regional experts, Google can be handy: googling for Thord,
Thordh, Thordsson and Thordhsson and noting the number of hits for each
tends to indicate that you and I are right about the treatment of
"eth"; Marcin's "dh" might better indicate how it's pronounced, but "d"
is AFAICT the standard transcription.

Cheers,
John

Nov 29 '06 #23
