Trouble saving unicode text to file

I'm working on a program that is supposed to save
different information to text files.

Because the program is in Swedish I have to use
Unicode text for letters.

When I run the following test script I get an error message.

# -*- coding: cp1252 -*-

titel = ""
titel = unicode(titel)

print "Titel type", type(titel)

fil = open("testfil.txt", "w")
fil.write(titel)
fil.close()

Traceback (most recent call last):
  File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
    titel = unicode(titel)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

I need the titel variable to be Unicode because when I type into an
entry box in Tkinter, the value automatically becomes a Unicode string.

Does anyone know an easy way to save this Unicode text
to a file?

Jul 19 '05 #1

Svennglenn> Traceback (most recent call last):
Svennglenn>   File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
Svennglenn>     titel = unicode(titel)
Svennglenn> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

Try:

import codecs

titel = ""
titel = unicode(titel, "iso-8859-1")
fil = codecs.open("testfil.txt", "w", "iso-8859-1")
fil.write(titel)
fil.close()

Skip
Jul 19 '05 #2
On 7 May 2005 14:22:56 -0700, "Svennglenn" <Da**********@yahoo.se>
wrote:
> I'm working on a program that is supposed to save
> different information to text files.
>
> Because the program is in Swedish I have to use
> Unicode text for letters.

"program is in Swedish": to the extent that this means "names of
variables are in Swedish", this is quite irrelevant. The variable
names could be in some other language, like Slovak, Slovenian, Swahili
or Strine. Your problem(s) (PLURAL) arise from the fact that your text
data is in Swedish, the representation of which uses a few non-ASCII
characters. Problem 1 is the representation of Swedish in text
constants in your program; this is causing the exception you show
below but curiously didn't ask for help with.
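
To make Problem 1 concrete, here is a minimal sketch of the failure. It assumes only that the OP's string starts with the byte 0xe5, which is what the traceback reports; `raw` is a hypothetical stand-in for the real literal. In Python 2, unicode(titel) with no encoding argument falls back on the default encoding, normally ASCII, so it behaves like the explicit .decode("ascii") below. The sketch is written to run on Python 2.6+ and on Python 3.

```python
# Hypothetical stand-in for the OP's byte string: 0xe5 is the byte
# named in the traceback (a-with-ring in cp1252).
raw = b"\xe5"

# Decoding with the ASCII codec fails, because 0xe5 is outside range(128).
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the actual encoding works and gives a one-character Unicode string.
text = raw.decode("cp1252")

print(ascii_ok)
print(len(text))
```

The same reasoning explains why a u"" literal under a PEP 263 declaration avoids the problem: the declared source encoding is used to decode the literal, so the ASCII default never gets involved.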

> When I run the following test script I get an error message.
>
> # -*- coding: cp1252 -*-
>
> titel = ""
> titel = unicode(titel)

You should use titel = u""
Works, and saves wear & tear on your typing fingers.

> print "Titel type", type(titel)
>
> fil = open("testfil.txt", "w")
> fil.write(titel)
> fil.close()
>
> Traceback (most recent call last):
>   File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
>     titel = unicode(titel)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>
> I need the titel variable to be Unicode because when I type into an
> entry box in Tkinter, the value automatically becomes a Unicode string.

The general rule in working with Unicode can be expressed something
like "work in Unicode all the time i.e. decode legacy text as early as
possible; encode into legacy text (if absolutely required) as late as
possible (corollary: if forced to communicate with another
Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
cp666)"

Applying this to Problem 1 is, as you've seen, trivial: To the extent
that you have text constants at all in your program, they should be in
Unicode.
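
A minimal sketch of the "decode early, encode late" rule, under stated assumptions: `legacy_bytes` is made-up cp1252 input, the file name out.txt and the temporary directory are for illustration only, and utf-8 is picked for output purely because the corollary above recommends it. It should run on Python 2.6+ and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical legacy input: bytes in cp1252, as if read from an old file.
legacy_bytes = b"sm\xf6rg\xe5s"

# Decode as early as possible ...
text = legacy_bytes.decode("cp1252")

# ... do all the real work on Unicode text ...
text = text.upper()

# ... and encode as late as possible, at the point of output.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
out = codecs.open(path, "w", "utf-8")
out.write(text)
out.close()

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as fh:
    written = fh.read()
```

Everything between the decode and the write deals only with Unicode strings, so there is exactly one place where each encoding is named.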

Now after all that, Problem 2: how to save Unicode text to a file?

Which raises a question: who or what is going to read your file? If a
Unicode-aware application, and never a human, you might like to
consider encoding the text as utf-16. If Unicode-aware app plus
(occasional human developer or not CJK and you want to save space),
try utf-8. For general use on Windows boxes in the Latin1 subset of
the universe, you'll no doubt want to encode as cp1252.
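
To get a feel for the space trade-off between those three encodings, a small sketch can count the bytes each one produces. The sample string is made up (it is not the OP's data), and the escapes are used only so the snippet does not depend on a source encoding declaration.

```python
# Hypothetical sample text: a-ring, a-umlaut, o-umlaut, then " abc".
titel = u"\xe5\xe4\xf6 abc"

# Number of bytes produced by each of the encodings discussed above.
sizes = {}
for name in ("utf-16", "utf-8", "cp1252"):
    sizes[name] = len(titel.encode(name))

for name in sorted(sizes):
    print(name + ": " + str(sizes[name]))
```

For this seven-character string, utf-16 costs two bytes per character plus a two-byte byte order mark, utf-8 costs an extra byte only for the three non-ASCII letters, and cp1252 costs one byte per character.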

> Does anyone know an easy way to save this Unicode text
> to a file?


Read the docs of the codecs module -- skipping over how to register
codecs, just concentrate on using them.

Try this:

# -*- coding: cp1252 -*-
import codecs
titel = u""
print "Titel type", type(titel)
f1 = codecs.open('titel.u16', 'wb', 'utf_16')
f2 = codecs.open('titel.u8', 'w', 'utf_8')
f3 = codecs.open('titel.txt', 'w', 'cp1252')
# much later, maybe in a different function
# maybe even in a different module
f1.write(titel)
f2.write(titel)
f3.write(titel)
# much later
f1.close()
f2.close()
f3.close()

Note: doing it this way follows the "encode as late as possible" rule
and documents the encoding for the whole file, in one place. Other
approaches which might use the .encode() method of Unicode strings and
then write the 8-bit-string results at different times and in
different functions/modules are somewhat less clean and more prone to
mistakes.
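
For comparison, here is a rough sketch of the two approaches side by side. The sample text, the file names and the temporary directory are assumptions for illustration. Both produce the same bytes; the difference is where the encoding gets named.

```python
import codecs
import os
import tempfile

titel = u"\xe5\xe4\xf6"  # hypothetical sample text
tmpdir = tempfile.mkdtemp()

# Approach 1: the encoding is stated once, when the file is opened.
path1 = os.path.join(tmpdir, "titel1.txt")
f1 = codecs.open(path1, "w", "cp1252")
f1.write(titel)
f1.close()

# Approach 2: every write has to remember to encode by hand.
path2 = os.path.join(tmpdir, "titel2.txt")
f2 = open(path2, "wb")
f2.write(titel.encode("cp1252"))
f2.close()

# Compare the raw bytes that each approach wrote.
with open(path1, "rb") as fh:
    bytes1 = fh.read()
with open(path2, "rb") as fh:
    bytes2 = fh.read()

print(bytes1 == bytes2)
```

With approach 2, a single write that forgets the .encode() call, or uses a different codec name, is enough to corrupt the file, which is the risk the note above is pointing at.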

HTH,
John
Jul 19 '05 #3
On Sat, 7 May 2005 17:25:28 -0500, Skip Montanaro <sk**@pobox.com>
wrote:

> Svennglenn> Traceback (most recent call last):
> Svennglenn>   File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
> Svennglenn>     titel = unicode(titel)
> Svennglenn> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>
> Try:
>
> import codecs
>
> titel = ""
> titel = unicode(titel, "iso-8859-1")
> fil = codecs.open("testfil.txt", "w", "iso-8859-1")
> fil.write(titel)
> fil.close()


I tried that, with this result:

C:\junk>python skip.py
sys:1: DeprecationWarning: Non-ASCII character '\xe5' in file skip.py
on line 3, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details

1. An explicit PEP 263 declaration (which the OP already had!) should
be used, rather than relying on the default, which doesn't work in
general if you substituted say Polish or Russian for Swedish.

2. My bet is that 'cp1252' is more likely to be appropriate for the OP
than 'iso-8859-1'. The encodings are quite different in range(0x80,
0xA0). They coincidentally give the same result for the OP's limited
sample. However if for example the OP needs to use the euro character
which is 0x80 in cp1252, it wouldn't show up as a problem in the
limited scripts we've been playing with so far, but 0x80 in the script
is sure not going to look like a euro in Tkinter if it's being decoded
via iso-8859-1. Your rationale for using iso-8859-1 when the OP had
already mentioned cp1252 was ... what?
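
Both points can be checked with a short sketch. The byte values are the ones discussed above; cp1251 is my own pick as an example of a Russian code page and does not come from the thread.

```python
# Point 1: the same byte means different letters under different code
# pages, so the source encoding has to be declared rather than assumed.
byte_e5 = b"\xe5"
as_cp1252 = byte_e5.decode("cp1252")   # U+00E5, Latin small a with ring
as_cp1251 = byte_e5.decode("cp1251")   # U+0435, a Cyrillic letter

# Point 2: cp1252 and iso-8859-1 agree on the Swedish letters ...
swedish = b"\xe5\xe4\xf6"
agree = swedish.decode("cp1252") == swedish.decode("iso-8859-1")

# ... but not on 0x80, which is the euro sign only in cp1252.
euro_cp1252 = b"\x80".decode("cp1252")      # U+20AC EURO SIGN
euro_latin1 = b"\x80".decode("iso-8859-1")  # U+0080, a control character

print(agree)
print(euro_cp1252 == euro_latin1)
```

So a test limited to the Swedish letters cannot tell the two encodings apart; the difference only shows up once a character from range(0x80, 0xA0) is involved.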

Jul 19 '05 #4
Hi All--

John Machin wrote:
> The general rule in working with Unicode can be expressed something
> like "work in Unicode all the time i.e. decode legacy text as early as
> possible; encode into legacy text (if absolutely required) as late as
> possible (corollary: if forced to communicate with another
> Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
> cp666)"


+1 QOTW

And true, too.

<i-especially-like-the-cp666-part>-ly y'rs,
Ivan
----------------------------------------------
Ivan Van Laningham
God N Locomotive Works
http://www.andi-holmes.com/
http://www.foretec.com/python/worksh...oceedings.html
Army Signal Corps: Cu Chi, Class of '70
Author: Teach Yourself Python in 24 Hours
Jul 19 '05 #5
Svennglenn wrote:
> # -*- coding: cp1252 -*-
>
> titel = ""
> titel = unicode(titel)

Instead of this, just write

# -*- coding: cp1252 -*-

titel = u""

> fil = open("testfil.txt", "w")
> fil.write(titel)
> fil.close()

Instead of this, write

import codecs
fil = codecs.open("testfil.txt", "w", "cp1252")
fil.write(titel)
fil.close()

Instead of cp1252, consider using ISO-8859-1.

Regards,
Martin
Jul 19 '05 #6
On Sun, 08 May 2005 11:23:49 +0200, "Martin v. Löwis"
<ma****@v.loewis.de> wrote:
> Svennglenn wrote:
> > # -*- coding: cp1252 -*-
> >
> > titel = ""
> > titel = unicode(titel)
>
> Instead of this, just write
>
> # -*- coding: cp1252 -*-
>
> titel = u""
>
> > fil = open("testfil.txt", "w")
> > fil.write(titel)
> > fil.close()
>
> Instead of this, write
>
> import codecs
> fil = codecs.open("testfil.txt", "w", "cp1252")
> fil.write(titel)
> fil.close()
>
> Instead of cp1252, consider using ISO-8859-1.


Martin, I can't guess the reason for this last suggestion; why should
a Windows system use iso-8859-1 instead of cp1252?

Regards,
John
Jul 19 '05 #7
John Machin wrote:
> Martin, I can't guess the reason for this last suggestion; why should
> a Windows system use iso-8859-1 instead of cp1252?


Windows users often think that windows-1252 is the same thing as
iso-8859-1, and then exchange data in windows-1252, but declare them
as iso-8859-1 (in particular, this is common for HTML files).
iso-8859-1 is more portable than windows-1252, so it should be
preferred when the data need to be exchanged across systems.

Regards,
Martin
Jul 19 '05 #8
On Sun, 08 May 2005 19:49:42 +0200, "Martin v. Löwis"
<ma****@v.loewis.de> wrote:
> John Machin wrote:
> > Martin, I can't guess the reason for this last suggestion; why should
> > a Windows system use iso-8859-1 instead of cp1252?
>
> Windows users often think that windows-1252 is the same thing as
> iso-8859-1, and then exchange data in windows-1252, but declare them
> as iso-8859-1 (in particular, this is common for HTML files).
> iso-8859-1 is more portable than windows-1252, so it should be
> preferred when the data need to be exchanged across systems.


Martin, it seems I'm still a long way short of enlightenment; please
bear with me:

Terminology disambiguation: what I call "users" wouldn't know what
'cp1252' and 'iso-8859-1' were. They're not expected to know. They
just type in whatever characters they can see on their keyboard or
find in the charmap utility. It's what I'd call 'admins' and
'developers' who should know better, but often don't.

1. When exchanging data across systems, should not utf-8 be
preferred???

2. If the Windows *users* have been using characters that are in
cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
will cause an exception.

>>> import unicodedata
>>> euro_win = chr(128)
>>> euro_uc = euro_win.decode('cp1252')
>>> euro_uc
u'\u20ac'
>>> unicodedata.name(euro_uc)
'EURO SIGN'
>>> euro_iso = euro_uc.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)


I find it a bit hard to imagine that the euro sign wouldn't get a fair
bit of usage in Swedish data processing even if it's not their own
currency.

3. How portable is a character set that doesn't include the euro sign?

Regards,
John
Jul 19 '05 #9
On Mon, 09 May 2005 08:39:40 +1000, John Machin wrote:
> On Sun, 08 May 2005 19:49:42 +0200, "Martin v. Löwis"
> <ma****@v.loewis.de> wrote:
> > John Machin wrote:
> > > Martin, I can't guess the reason for this last suggestion; why should
> > > a Windows system use iso-8859-1 instead of cp1252?
> >
> > Windows users often think that windows-1252 is the same thing as
> > iso-8859-1, and then exchange data in windows-1252, but declare them
> > as iso-8859-1 (in particular, this is common for HTML files).
> > iso-8859-1 is more portable than windows-1252, so it should be
> > preferred when the data need to be exchanged across systems.
>
> 1. When exchanging data across systems, should not utf-8 be
> preferred???
>
> 2. If the Windows *users* have been using characters that are in
> cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
> will cause an exception.
>
> >>> import unicodedata
> >>> euro_win = chr(128)
> >>> euro_uc = euro_win.decode('cp1252')
> >>> euro_uc
> u'\u20ac'
> >>> unicodedata.name(euro_uc)
> 'EURO SIGN'
> >>> euro_iso = euro_uc.encode('iso-8859-1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)
>
> I find it a bit hard to imagine that the euro sign wouldn't get a fair
> bit of usage in Swedish data processing even if it's not their own
> currency.

For Western European countries, another codec exists which includes the
'EURO SIGN'. It is spelled 'iso8859_15' (with an alias 'iso-8859-15'
according to the 4.9.2 Standard Encodings page of the Python library
reference).

>>> euro_iso = euro_uc.encode('iso8859_15')
>>> euro_iso
'\xa4'
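
One caveat worth sketching, as an aside rather than something stated in the thread: the byte 0xa4 means the euro sign only if the reader also decodes with iso8859_15. Read back as iso-8859-1 it comes out as a different character. The variable names below are made up for illustration.

```python
import unicodedata

euro_uc = u"\u20ac"

# Encoding the euro sign with iso8859_15 gives the single byte 0xa4.
euro_iso15 = euro_uc.encode("iso8859_15")

# Decoded with the same codec, the euro sign comes back ...
roundtrip = euro_iso15.decode("iso8859_15")

# ... but decoded as iso-8859-1, the byte is the generic currency sign.
misread = euro_iso15.decode("iso-8859-1")
misread_name = unicodedata.name(misread)

print(misread_name)
```

So switching to iso8859_15 still requires both ends to agree on the encoding, the same kind of labelling problem raised earlier for cp1252 and iso-8859-1.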

> 3. How portable is a character set that doesn't include the euro sign?

I think it is due to historical constraints: ISO Latin-1 existed before
the euro sign appeared.

> Regards,
> John

Jul 19 '05 #10
