Questions about working with character encodings

I am going to demonstrate my complete lack of understanding as to
going back and forth between character encodings, so I hope someone
out there can shed some light on this. I have always depended on the
kindness of strangers... :-)

I'm playing around with some very simplistic French to English
translation. As some text to work with, I copied the following from a
French news site:

Dans les années 1960, plus d'une voiture sur deux vendues aux
Etats-Unis était fabriquée par GM.
Pendant que les ventes s'effondrent, les pertes se creusent :
sur les neuf premiers mois de l'année 2005,
elles s'élèvent à 3,8 milliards de dollars (3,18 milliards
d'euros), et le dernier trimestre s'annonce difficile.
Quant à la dette, elle est hors normes : 285 milliards de
dollars, soit une fois et demie le chiffre d'affaires.
GM est désormais considéré par les agences de notation
financière comme un investissement spéculatif.
Un comble pour un leader mondial !

Of course, it has lots of accented, non-ascii characters. However, it
posted just fine into both this email program (hopefully it displays
equally well at the other end), and into my Python editing program
(jEdit).

To start with, I'm not at all cognizant of how either the editor or
the mail program could even know what encodings to use to display
this text properly...

Next, having got the text into the Python file, I presumably have to
encode it as a Unicode string, but trying something like
text = u"""désormais considéré""" complains to the effect that:

UnicodeEncodeError: 'ascii' codec can't encode character u'\x8e'
in position 13: ordinal not in range(128)

This occurs even with the first line in the file of

# -*- coding: latin-1 -*-

which I'd hoped would include what I think of as the latin characters
including all those ones with graves, aigus, circonflexes, umlauts,
cedilles, and so forth. Apparently it does not :-)

So I really have two questions:

1) How the heck did jEdit understand the text with all the accents I
pasted into it? More specifically, how did it know the proper encoding
to use?

2) How do I get Python to understand this text? Is there some sort of
coding that will work in almost every circumstance?

Many thanks for your patience with someone completely new to this
aspect of text handling,

Ken
Dec 15 '05 #1
Kenneth McDonald <ke************ ****@sbcglobal. net> wrote:
> I am going to demonstrate my complete lack of understanding as to
> going back and forth between character encodings, so I hope someone
> out there can shed some light on this. I have always depended on the
> kindness of strangers... :-)
>
> I'm playing around with some very simplistic French to English
> translation. As some text to work with, I copied the following from a
> French news site:

> Dans les années 1960, plus d'une voiture sur deux vendues aux
> Etats-Unis était fabriquée par GM.
> Pendant que les ventes s'effondrent, les pertes se creusent :
> sur les neuf premiers mois de l'année 2005,
> elles s'élèvent à 3,8 milliards de dollars (3,18 milliards
> d'euros), et le dernier trimestre s'annonce difficile.
> Quant à la dette, elle est hors normes : 285 milliards de
> dollars, soit une fois et demie le chiffre d'affaires.
> GM est désormais considéré par les agences de notation
> financière comme un investissement spéculatif.
> Un comble pour un leader mondial !

> Of course, it has lots of accented, non-ascii characters. However, it
> posted just fine into both this email program (hopefully it displays
> equally well at the other end),
It has a correct charset header indicating ISO-8859-1 encoding, so yes,
it displayed correctly.
> and into my Python editing program (jEdit).
>
> To start with, I'm not at all cognizant of how either the editor or
> the mail program could even know what encodings to use to display
> this text properly...
You did not tell us which OS you are using. On Unix, it all depends on
the locale: you can transparently pass text data around as long as the
characters are in the repertoire of your locale and the applications
are locale-aware (many older ones are not). It is best to use the UTF-8
encoding, so that all the more or less obscure characters can be
represented.

On Windows, it depends on whether the programs use the old 8-bit ANSI
API or the newer Unicode API. Programs that use the Unicode API can
pass data around without problems; programs that use the 8-bit API are
restricted to the characters from your system codepage.
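
To see which encodings your own system would assume, you can ask Python
directly. This is only a minimal sketch using the standard library; the
helper name describe_encodings is made up for illustration, and the
values it reports depend entirely on your OS and locale settings.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings this Python process assumes by default."""
    return {
        # Normally derived from the locale (Unix) or ANSI codepage (Windows).
        "locale": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding Python uses for implicit text/bytes conversions.
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, encoding in describe_encodings().items():
        print(name, "->", encoding)
```

If the locale entry is not a UTF-8 variant, characters outside that
encoding's repertoire are the ones likely to cause trouble.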

> Next, having got the text into the Python file, I presumably have to
> encode it as a Unicode string, but trying something like
> text = u"""désormais considéré""" complains to the effect that:
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x8e'
> in position 13: ordinal not in range(128)
>
> This occurs even with the first line in the file of
>
> # -*- coding: latin-1 -*-

> which I'd hoped would include what I think of as the latin characters
> including all those ones with graves, aigus, circonflexes, umlauts,
> cedilles, and so forth.
latin-1 is not enough for proper French (it lacks œ). It is not even
enough for English: it lacks proper typographic quotes and so on.
> Apparently it does not :-)
Well, it would be enough for your example: "désormais considéré" does
indeed fit into latin-1. But Python complains about character \x8e,
which is not a printable latin-1 character (0x8E falls in the C1
control range). Without knowing your OS and your locale (or ANSI
codepage), we cannot tell how it got there.
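
One possible explanation, offered only as a guess since the OS is
unknown: byte 0x8E is "é" in the Mac OS Roman encoding, so a file saved
as Mac OS Roman but declared as latin-1 would turn every "é" into
u'\x8e'. The sketch below (Python 3 syntax) reproduces that mismatch.
That the file was saved as Mac OS Roman is an assumption, not something
stated in the question.

```python
# Assumption: the editor saved the source file as Mac OS Roman.
raw = "désormais considéré".encode("mac_roman")

# Decoding with the declared (but wrong) encoding silently succeeds,
# because Python's latin-1 codec maps every byte to some code point;
# 0x8E becomes the control character U+008E rather than "é".
wrong = raw.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
right = raw.decode("mac_roman")

# ascii() escapes non-ASCII characters, so printing cannot itself fail.
print(ascii(wrong))
print(ascii(right))

# The example phrase fits into latin-1, but "œ" does not.
"désormais considéré".encode("latin-1")
try:
    "œ".encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 cannot encode:", ascii(exc.object[exc.start:exc.end]))
```

If this is what happened, telling the editor which encoding to save in
(and declaring the same one in the coding line) would make the two agree.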

> So I really have two questions:
>
> 1) How the heck did jEdit understand the text with all the accents I
> pasted into it? More specifically, how did it know the proper
> encoding to use?
jEdit is written in Java, right? Java has good internal Unicode
support, so if your OS allowed it, pasting from the web browser worked,
since the browser had to know the encoding (in order to display the
text properly).

> 2) How do I get Python to understand this text? Is there some sort of
> coding that will work in almost every circumstance?


utf-8, obviously. Unless you have a strong reason not to, use utf-8
exclusively: you never know what strange character may appear (even in
plain English), and your working, tested application will start
crashing when it gets out into the real world.

So, use # -*- coding: utf-8 -*-, but MAKE SURE jEdit is configured to
save the file in utf-8 encoding (not knowing jEdit, I cannot tell you
how to achieve this, but jEdit's www page claims that jEdit does support
utf-8).
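
The same "say utf-8 explicitly" habit applies to data files as well as
source files. As a rough illustration (Python 3 syntax; the file name
sample.txt and the helper names write_text and read_text are made up
for this example), the sketch below writes text with an explicit
encoding and reads it back the same way instead of relying on the
platform default:

```python
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write text to path using an explicit encoding."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def read_text(path, encoding="utf-8"):
    """Read text from path using an explicit encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


if __name__ == "__main__":
    sample = "GM est désormais considéré comme un investissement spéculatif."
    path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    write_text(path, sample)
    # The round trip only works because both sides name the same encoding.
    assert read_text(path) == sample
```

Reading the same file back with a different encoding would not
necessarily raise an error; it could simply produce the wrong
characters, which is why both sides should name the encoding.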

Then there is a little problem with Python's stdout trying to convert
Unicode strings into the system default encoding and failing if that
cannot be done, but let's leave this for the moment :-)
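
For the curious, one way around that stdout problem is to encode
explicitly before output, so that nothing is converted implicitly. A
minimal sketch in Python 3 syntax; encode_for_output is a hypothetical
helper written for this post, not part of any library:

```python
import sys


def encode_for_output(text, encoding=None):
    """Encode text for a stream, escaping characters the encoding lacks."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # "backslashreplace" writes unencodable characters as escapes such as
    # \xe9 instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace")
```

With an ASCII-only target, encode_for_output("désormais", "ascii")
returns the bytes d\xe9sormais instead of raising an exception; with a
utf-8 target the text passes through unchanged.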

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls .savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Dec 15 '05 #2
