Kenneth McDonald <kenneth.m.mcdonald@sbcglobal.net> wrote:[color=blue]
> I am going to demonstrate my complete lack of understanding as to
> going back and forth between
> character encodings, so I hope someone out there can shed some light
> on this. I have always
> depended on the kindness of strangers... :-)
>
> I'm playing around with some very simplistic french to english
> translation. As some text to
> work with, I copied the following from a french news site:
>
> Dans les années 1960, plus d'une voiture sur deux vendues aux
> Etats-Unis était fabriquée par GM.
> Pendant que les ventes s'effondrent, les pertes se creusent :
> sur les neuf premiers mois de l'année 2005,
> elles s'élèvent à 3,8 milliards de dollars (3,18 milliards
> d'euros), et le dernier trimestre s'annonce difficile.
> Quant à la dette, elle est hors normes : 285 milliards de
> dollars, soit une fois et demie le chiffre d'affaires.
> GM est désormais considéré par les agences de notation
> financière comme un investissement spéculatif.
> Un comble pour un leader mondial !
>
> Of course, it has lots of accented, non-ascii characters. However, it
> posted just fine into both
> this email program (hopefully it displays equally well at the other
> end),[/color]
It has a correct charset header indicating ISO-8859-1 encoding, so yes,
it displayed correctly.
[color=blue]
> and into my Python
> editing program (jEdit).
>
> To start with, I'm not at all cognizant of how either the editor or
> the mail program could even
> know what encodings to use to display this text properly...[/color]
You did not tell us which OS you are using. On Unix, it all depends on
the locale: you can pass text data around transparently as long as the
characters are in the repertoire of your locale and the applications are
locale-aware (many older ones are not). It is best to use the UTF-8
encoding, so that all the more or less obscure characters can be
represented.
On Windows, it depends on whether the programs use the old 8-bit ANSI
API or the new Unicode API. Programs that use the Unicode API can pass
data around without problems; programs that use the 8-bit API are
restricted to the characters from your system codepage.
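
If you want to see what Python itself believes about your environment, something like the following may help. It is only a sketch and assumes a current Python 3 interpreter; the helper name describe_text_environment is made up for this post, while the locale and sys calls are standard library.

```python
import locale
import sys


def describe_text_environment():
    """Collect the encodings Python reports for this system (hypothetical helper)."""
    return {
        # Encoding derived from the locale (Unix) or the ANSI codepage (Windows).
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of standard output; may be None if stdout was replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in describe_text_environment().items():
    print(f"{name}: {value}")
```

The values differ from machine to machine, so no expected output is shown. If they are not UTF-8, characters outside the reported encoding are likely to cause trouble somewhere.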
[color=blue]
>
> Next, having got the text into the Python file, I presumably have to
> encode it as a Unicode
> string, but trying something like text = u"""désormais considéré"""
> complains to the effect
> that :
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x8e'
> in position 13: ordinal not in range(128)
>
> This occurs even with the first line in the file of
>
> # -*- coding: latin-1 -*-
>
> which I'd hoped would include what I think of as the latin characters
> including all those ones with
> graves, agues, circonflexes, umlauts, cedilles, and so forth.[/color]
latin-1 is not enough for proper French (it lacks œ). It is not even
enough for English: it lacks proper typographic quotes and so on.
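
A quick way to check this yourself is to try encoding a string and see whether it fails. This is a minimal sketch assuming Python 3; the function name fits is invented for the example, and the strings are written with escapes so the snippet does not depend on how the file is saved.

```python
def fits(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "d\u00e9sormais consid\u00e9r\u00e9"   # the phrase from the question
ligature = "\u0153uvre"                          # starts with the oe ligature
quoted = "\u201cquoted\u201d"                    # typographic double quotes

print(fits(sample, "latin-1"))     # True
print(fits(ligature, "latin-1"))   # False
print(fits(quoted, "latin-1"))     # False
print(fits(ligature, "utf-8"))     # True
print(fits(quoted, "utf-8"))       # True
```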
[color=blue]
> Apparently it does not :-)[/color]
Well, it would be enough for your example: "désormais considéré"
does indeed fit into latin-1. But Python complains about character \x8e,
which is not a printable character in latin-1. Without knowing your OS
and your locale (or ANSI codepage), we cannot tell how it got there.
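
One way to narrow it down is to look at what the byte 0x8E means in a few candidate encodings. The sketch below assumes Python 3, and the list of candidates is my guess, not something the original post tells us.

```python
raw = b"\x8e"

# Candidate encodings are an assumption; extend the list to match your system.
candidates = ["latin-1", "cp1252", "mac_roman"]

decoded = {name: raw.decode(name) for name in candidates}

for name, char in decoded.items():
    # Print the code point rather than the character itself, so the output
    # stays plain ASCII on any terminal.
    print(f"{name:10} -> U+{ord(char):04X}")
```

With these three, latin-1 gives U+008E (a control character), cp1252 gives U+017D (Z with caron) and mac_roman gives U+00E9, which is e with acute accent. So if the file was saved by a Mac editor in its native encoding but declared as latin-1, that would produce exactly this kind of error. That is only a hypothesis until we know the OS.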
[color=blue]
>
> So I really have two questions:
>
> 1) How the heck did jEdit understand the text with all the accents
> I pasted into it? More
> specifically, how did it know the proper encoding to use?[/color]
jEdit is written in Java, right? Java has good internal Unicode
support, so if your OS allowed it, pasting from a web browser worked
because the browser had to know the encoding (in order to display the
text properly).
[color=blue]
>
> 2) How do I get Python to understand this text? Is there some sort
> of coding that will
> work in almost every circumstance?[/color]
utf-8, obviously. Unless you have a strong reason not to do so, use
utf-8 exclusively: you never know what strange character can appear
(even in plain English), and your working and tested application will
start crashing when it gets to the real world.
So, use # -*- coding: utf-8 -*-, but MAKE SURE jEdit is configured to
save the file in utf-8 encoding (not knowing jEdit, I cannot tell you
how to achieve this, but jEdit's www page claims that jEdit does support
utf-8).
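
To illustrate why the declared encoding and the encoding actually used to save the file must agree, here is a small round trip through a temporary file. It is a sketch assuming Python 3; the function name roundtrip is made up for this example.

```python
import os
import tempfile

text = "d\u00e9sormais consid\u00e9r\u00e9"


def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file, then return (raw bytes, decoded text)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as handle:
            handle.write(text)
        with open(path, "rb") as handle:
            raw = handle.read()
        with open(path, "r", encoding=encoding) as handle:
            return raw, handle.read()
    finally:
        os.remove(path)


raw, restored = roundtrip(text)

# Reading the same bytes with the wrong encoding does not fail here,
# it silently produces garbage instead.
mojibake = raw.decode("latin-1")

print(len(text), len(raw))   # 19 22
print(restored == text)      # True
print(mojibake == text)      # False
```

The 19 characters take 22 bytes because each accented e needs two bytes in utf-8. An editor that saves in one encoding while the coding line names another gives you the same kind of mismatch as the last line.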
Then there is a little problem with Python's stdout trying to convert
unicode strings into the system default encoding and failing if it
cannot be done, but let's leave this for the moment :-)
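
For the impatient, one possible workaround for the stdout problem is to replace whatever the output encoding cannot represent before printing. This is a sketch assuming Python 3; the helper name printable is hypothetical, and backslashreplace is just one of the standard error handlers you could pick.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters that encoding cannot represent escaped."""
    # Fall back to stdout's encoding, then to ASCII, if none is given.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


print(printable("\u0153uvre", "ascii"))    # \u0153uvre
print(printable("caf\u00e9", "ascii"))     # caf\xe9
```

Text that already fits the target encoding passes through unchanged, so this is safe to apply to everything you print.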
--
-----------------------------------------------------------
| Radovan Garabík
http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!