Fredrik Lundh wrote:
> John Machin wrote:
>
>> 3. ... and to check for missing maps. The OP may be working only with
>> French text, and may not care about Icelandic and German letters, but
>> other readers who stumble on this (and miss past thread(s) on this
>> topic) may like something done with \xde (capital thorn), \xfe (small
>> thorn) and \xdf (sharp s aka Eszett).
>
> I did post links to code that does this to this thread, several days ago...

Ah yes, I missed that -- and your posting doesn't advertise that the
code fixed the "one character should be mapped to two" cases :-)
This code (http://effbot.python-hosting.com/fil...xt/unaccent.py)
looks generally very good, but I'm left wondering why "AE" and "OE" are
in the table, not "Ae" and "Oe":
[snip]
0xc6: u"AE", # LATIN CAPITAL LETTER AE <<<=== ??
0xd0: u"D", # LATIN CAPITAL LETTER ETH
0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE <<<=== ??
0xde: u"Th", # LATIN CAPITAL LETTER THORN
[snip]
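
For readers wondering how the "one character should be mapped to two" entries take effect: translate() accepts a replacement string of any length for each code point. Here is a minimal sketch written for Python 3. The table is illustrative only and is not the one from the linked unaccent.py; the name `one_to_many`, the helper `expand`, and the "th" and "ss" replacements are my assumptions.

```python
# Illustrative table only; NOT the table from the linked unaccent.py.
one_to_many = {
    0xde: u"Th",  # LATIN CAPITAL LETTER THORN (as in the excerpt above)
    0xfe: u"th",  # LATIN SMALL LETTER THORN (assumed by analogy)
    0xdf: u"ss",  # LATIN SMALL LETTER SHARP S (assumed, common convention)
}


def expand(text):
    """Replace each mapped character; a replacement may be several characters."""
    # translate() looks up each character's code point in the dict and
    # leaves unmapped characters untouched.
    return text.translate(one_to_many)


print(expand(u"Stra\xdfe"))  # Strasse
print(expand(u"\xdeor"))     # Thor
```

Whether a capital should expand to "Th" or "TH" (and likewise "Ae" or "AE") is exactly the kind of choice the table has to make once per entry, since translate() cannot see the case of the neighbouring letters.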
Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:
u"\u0141ukasziewicz".translate(unaccented_map())
doesn't work unless an entry is added to the no-decomposition table:
0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
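
The reason an explicit entry is needed: U+0141 has no decomposition in the Unicode database, so any approach that relies on decomposing a character and dropping the combining marks leaves it alone. A rough sketch of that behaviour for Python 3 follows; `NO_DECOMPOSITION` and `strip_marks` are hypothetical names of mine, and this is not how the linked unaccent.py is actually written.

```python
import unicodedata

# Hypothetical fallback table for characters that do not decompose.
NO_DECOMPOSITION = {
    0x0141: u"L",  # LATIN CAPITAL LETTER L WITH STROKE
}


def strip_marks(text):
    """Drop combining marks, using the fallback table where nothing decomposes."""
    out = []
    for ch in text:
        if ord(ch) in NO_DECOMPOSITION:
            out.append(NO_DECOMPOSITION[ord(ch)])
            continue
        # NFKD splits e.g. e-acute into "e" plus a combining acute accent;
        # combining() is non-zero for the accent, so it gets dropped.
        for part in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return u"".join(out)


print(repr(unicodedata.decomposition(u"\u0141")))  # '' (nothing to strip)
print(strip_marks(u"\u0141ukasziewicz"))           # Lukasziewicz
print(strip_marks(u"caf\xe9"))                     # cafe
```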
It looks like generating extra entries like that could be done, with
the aid of unicodedata.name():
LATIN CAPITAL LETTER X WITH blahblah -> "X"
LATIN SMALL LETTER X WITH blahblah -> "X".lower()
This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.
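
To make the idea concrete, here is a rough sketch for Python 3 of how such candidate entries might be generated. The function name `candidate_entries`, the regular expression, and the default code point range (Latin-1 Supplement through Latin Extended-B) are all my own assumptions, and the output is only a list of candidates for review, not a finished table.

```python
import re
import unicodedata

# Matches names such as "LATIN CAPITAL LETTER L WITH STROKE".
_NAME_RE = re.compile(r"^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH ")


def candidate_entries(first=0x00c0, last=0x024f):
    """Propose ASCII replacements for letters that have no decomposition."""
    entries = {}
    for code in range(first, last + 1):
        ch = chr(code)
        if unicodedata.decomposition(ch):
            continue  # decomposable characters are handled another way
        # name() returns the default "" for unassigned code points
        match = _NAME_RE.match(unicodedata.name(ch, ""))
        if match:
            case, letter = match.groups()
            entries[code] = letter if case == "CAPITAL" else letter.lower()
    return entries


entries = candidate_entries()
print(entries[0x0141])  # L
print(entries[0x00d8])  # O
```

Note that the rule proposes plain "O" for 0xd8, where the existing table has "OE". That is the special-case problem in miniature, so the generated entries would need checking before being merged into the table.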
Cheers,
John