Encoding and Norwegian (non-ASCII) characters

joakim.hove

Hello,

I am having great problems writing the Norwegian characters æøå to a file
from a Python application. My (simplified) scenario is as follows:

1. I have a web form where the user can enter his name.

2. I use the cgi module to get the input from the user:
....
name = form["name"].value

3. The name is stored in a file:

fileH = open(namefile, "a")
fileH.write("name:%s \n" % name)
fileH.close()

Now, this works very well indeed as long as the users have 'ascii' names;
however, when someone enters a name with one of the Norwegian characters
æøå, it breaks at the write() statement:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position
.....

Now, I understand that the ascii codec can't be used to decode these
particular characters; however, my attempts at specifying an alternative
encoding have all failed.

I have tried variants along these lines:

fileH = codecs.open(namefile, "a", "latin-1") / fileH = open(namefile, "a")
fileH.write(name) / fileH.write(name.encode("latin-1"))

It seems that *whatever* I do, the Python interpreter fails to see my plea
for an alternative encoding, and fails with the dreaded
UnicodeDecodeError.

Any tips on this would be *highly* appreciated.
Joakim

Oct 7 '06 #1


Peter Otten

jo*********@gmail.com wrote:

> Hello,
>
> I am having great problems writing the Norwegian characters æøå to a file
> from a Python application. My (simplified) scenario is as follows:
>
> 1. I have a web form where the user can enter his name.
>
> 2. I use the cgi module to get the input from the user:
> ....
> name = form["name"].value
>
> 3. The name is stored in a file:
>
> fileH = open(namefile, "a")
> fileH.write("name:%s \n" % name)
> fileH.close()
>
> Now, this works very well indeed as long as the users have 'ascii' names;
> however, when someone enters a name with one of the Norwegian characters
> æøå, it breaks at the write() statement:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position
> ....
>
> Now, I understand that the ascii codec can't be used to decode these
> particular characters; however, my attempts at specifying an alternative
> encoding have all failed.
>
> I have tried variants along these lines:
>
> fileH = codecs.open(namefile, "a", "latin-1") / fileH = open(namefile, "a")
> fileH.write(name) / fileH.write(name.encode("latin-1"))
>
> It seems that *whatever* I do, the Python interpreter fails to see my plea
> for an alternative encoding, and fails with the dreaded
> UnicodeDecodeError.
>
> Any tips on this would be *highly* appreciated.

The approach with codecs.open() should succeed

>>> import codecs
>>> out = codecs.open("tmp.txt", "a", "latin1")
>>> out.write(u"æøå")
>>> out.write("abc")
>>> out.write("æøå")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/codecs.py", line 501, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.4/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

provided that you write only unicode strings with characters in the range
unichr(0)...unichr(255) and normal strs in the range chr(0)...chr(127).

You have to decode non-ascii strs with the appropriate encoding (that only
you know) before feeding them to write():

>>> out.write(unicode("\xe6\xf8\xe5", "latin1"))

If there are unicode code points beyond unichr(255) you have to change the
encoding in codecs.open(), typically to UTF-8.

# raises UnicodeEncodeError
codecs.open("tmp.txt", "a", "latin1").write(u"\u1234")

# works
codecs.open("tmp.txt", "a", "utf8").write(u"\u1234")

Peter

Oct 7 '06 #2

Paul Boddie

jo*********@gmail.com wrote:

> I am having great problems writing the Norwegian characters æøå to a file
> from a Python application. My (simplified) scenario is as follows:
>
> 1. I have a web form where the user can enter his name.
>
> 2. I use the cgi module to get the input from the user:
> ....
> name = form["name"].value

The cgi module should produce plain strings, not Unicode objects, which
makes some of the later behaviour quite "interesting".

> 3. The name is stored in a file:
>
> fileH = open(namefile, "a")
> fileH.write("name:%s \n" % name)
> fileH.close()
>
> Now, this works very well indeed as long as the users have 'ascii' names;
> however, when someone enters a name with one of the Norwegian characters
> æøå, it breaks at the write() statement:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position

This is odd, since writing plain strings to files shouldn't involve any
Unicode conversions. If you received a plain string from the cgi
module, the text you write to the file should still be a plain string.
This is like obtaining a sequence of bytes and just passing them
around. Perhaps your Python configuration is different in some
non-standard way, although I wouldn't want to point the finger at
anything in particular (the value of sys.getdefaultencoding() might
suggest something).
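The "sequence of bytes just passed around" idea can be sketched as follows. This is only an illustration, written for Python 3, where the bytes type plays the role of the Python 2 plain string; the helper name append_raw, the temporary file and the sample value (æøå as latin-1 bytes) are assumptions, not part of the original application.

```python
import os
import sys
import tempfile


def append_raw(path, raw_name):
    """Append the name as bytes, with no decoding or encoding step."""
    # Binary mode: the bytes reach the file exactly as they were received.
    with open(path, "ab") as handle:
        handle.write(b"name:" + raw_name + b" \n")


# Hypothetical form value: "æøå" as latin-1 bytes (assumed encoding).
raw_name = b"\xe6\xf8\xe5"

path = os.path.join(tempfile.mkdtemp(), "names.txt")
append_raw(path, raw_name)

with open(path, "rb") as handle:
    stored = handle.read()

# The interpreter-wide default encoding
# ("ascii" on a stock Python 2, "utf-8" on Python 3).
default_encoding = sys.getdefaultencoding()
print(stored, default_encoding)
```

No codec is involved anywhere in append_raw, so no Unicode error can arise there; the file simply ends up in whatever encoding the input happened to use.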

> Now, I understand that the ascii codec can't be used to decode these
> particular characters; however, my attempts at specifying an alternative
> encoding have all failed.
>
> I have tried variants along these lines:
>
> fileH = codecs.open(namefile, "a", "latin-1") / fileH = open(namefile, "a")
> fileH.write(name) / fileH.write(name.encode("latin-1"))
>
> It seems that *whatever* I do, the Python interpreter fails to see my plea
> for an alternative encoding, and fails with the dreaded
> UnicodeDecodeError.

To use a file opened through codecs.open, you really should present
Unicode objects to the write method. Otherwise, I imagine that the
method will try and automatically convert to Unicode the plain string
that the name object supposedly is, and this conversion will assume
that the string only contains ASCII characters (as is Python's default
behaviour) and thus cause the error you are seeing. Only after getting
the text as a Unicode object will the method then try to encode the
text in the specified encoding in order to write it to the file.

In other words, you'll see this behaviour:

name (plain string) -> Unicode object -> encoded text (written to
file)

Or rather, in the failure case:

name (plain string) -> error! (couldn't produce the Unicode object)

As Peter Otten suggests, you could first make the Unicode object
yourself, stating explicitly that the name object contains "latin-1"
characters. In other words:

name (plain string) -> Unicode object

Then, the write method has an easier time:

Unicode object -> encoded text (written to file)
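The failure case and the two explicit steps can be tried out with a short sketch. It is written for Python 3 (bytes stands in for the Python 2 plain string, str for the Unicode object), it spells out the ASCII decode that Python 2 would attempt implicitly, and it assumes the form value arrived as latin-1; the sample value and the temporary file are made up for illustration.

```python
import os
import tempfile

# Hypothetical form value: "æøå" as latin-1 bytes (assumed encoding).
raw_name = b"\xe6\xf8\xe5"

# Failure case: decoding as ASCII cannot handle bytes above 0x7f.
try:
    raw_name.decode("ascii")
    implicit_failed = False
except UnicodeDecodeError:
    implicit_failed = True

# Step 1: plain string -> Unicode object, naming the encoding explicitly.
name = raw_name.decode("latin-1")

# Step 2: Unicode object -> encoded text, done by the file object on write.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "a", encoding="latin-1", newline="\n") as handle:
    handle.write("name:%s \n" % name)

with open(path, "rb") as handle:
    written = handle.read()
```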

All this seems unnecessary for your application, I suppose, since you
know (or believe) that the form values only contain "latin-1"
characters. However, as is the standard advice on such matters, you may
wish to embrace Unicode more eagerly, converting plain strings to
Unicode as soon as possible and only converting them to text in various
encodings when writing them out.
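A rough sketch of that advice, again in Python 3 and under stated assumptions: the helper names decode_form_value and append_name are hypothetical, the input is assumed to be latin-1, and UTF-8 is chosen for the output only as an example.

```python
import os
import tempfile


def decode_form_value(raw, encoding="latin-1"):
    """Input boundary: turn the received bytes into text straight away."""
    return raw.decode(encoding)


def append_name(path, name, encoding="utf-8"):
    """Output boundary: encode only when the text leaves the program."""
    with open(path, "a", encoding=encoding, newline="\n") as handle:
        handle.write("name:%s \n" % name)


# Hypothetical input: "  bjørn  " as latin-1 bytes.
raw = b"  bj\xf8rn  "

# Everything between the two boundaries works on characters, not bytes.
name = decode_form_value(raw).strip().title()

path = os.path.join(tempfile.mkdtemp(), "names.txt")
append_name(path, name)

with open(path, "rb") as handle:
    written = handle.read()
```

With this split, the choice of file encoding is confined to one argument of append_name, for example append_name(path, name, encoding="latin-1").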

In some Web framework APIs, the values of form fields are immediately
available as Unicode without any (or much) additional work. WebStack
returns Unicode objects for form fields, as does the Java Servlet API,
but I'm not particularly aware of many other Python frameworks which
enforce or promote such semantics.

Paul

Oct 8 '06 #3
