Python UTF-8 and codecs

Mike Currie

I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
them. Every configuration I try I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: oridinal not in range(128)

I've tried using the codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
and that doesn't work and I've also try wrapping the file in an utf8_writer
using codecs.lookup('utf8')

Any clues?

Thanks
Mike

Jun 27 '06 #1

Subscribe Post Reply

13460

Dennis Benzinger

Mike Currie wrote:

I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
them. Every configuration I try I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: oridinal not in range(128)

I've tried using the codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
and that doesn't work
[...]

You want to write to a file but you used the 'rU' mode. This should be
'wU'. Don't know if this is the only reason it doesn't work. Could you
show more of your code?
Bye,
Dennis

Jun 27 '06 #2

Serge Orlov

On 6/27/06, Mike Currie <de*@null.com> wrote:

I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
them. Every configuration I try I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: oridinal not in range(128)

I've tried using the codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
and that doesn't work and I've also try wrapping the file in an utf8_writer
using codecs.lookup('utf8')

Any clues?

Use unicode strings for non-ascii characters. The following program "works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control characters, are you sure you want it?
What is the encoding of your data?

Jun 27 '06 #3

Mike Currie

I did make a mistake, it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with new line and tab characters
inside quoted fields. The idea is to convert all the new line and
characters to 0x85 and 0x88 respectivly, then process the files. Finally
right before importing them into a database convert them back to new line
and tab's thus preserving the field values.

Will python not handle the control characters correctly?
"Serge Orlov" <se*********@gmail.com> wrote in message
news:ma***************************************@pyt hon.org...

On 6/27/06, Mike Currie <de*@null.com> wrote:
I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
them. Every configuration I try I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: oridinal not in range(128)

I've tried using the codecs.open('foo.txt', 'rU', 'utf-8',
errors='strict')
and that doesn't work and I've also try wrapping the file in an
utf8_writer
using codecs.lookup('utf8')

Any clues?

Use unicode strings for non-ascii characters. The following program
"works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control characters, are you sure you want it?
What is the encoding of your data?

Jun 27 '06 #4

Mike Currie

Okay,

Here is a sample of what I'm doing:
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.

filterMap = {}
for i in range(0,255): .... filterMap[chr(i)] = chr(i)
.... filterMap[chr(9)] = chr(136)
filterMap[chr(10)] = chr(133)
filterMap[chr(136)] = chr(9)
filterMap[chr(133)] = chr(10)
line = '''this has .... tabs and line
.... breaks''' filteredLine = ''.join([ filterMap[a] for a in line])
import codecs
f = codecs.open('foo.txt', 'wU', 'utf-8')
print filteredLine thisêhasêàtabsêandêlineàbreaks f.write(filteredLine) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "C:\Python24\lib\codecs.py", line 501, in write
return self.writer.write(data)
File "C:\Python24\lib\codecs.py", line 178, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
ordinal
not in range(128)

"Mike Currie" <de*@null.com> wrote in message
news:5Hgog.627$Gv.173@fed1read09...I did make a mistake, it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with new line and tab
characters inside quoted fields. The idea is to convert all the new line
and characters to 0x85 and 0x88 respectivly, then process the files.
Finally right before importing them into a database convert them back to
new line and tab's thus preserving the field values.

Will python not handle the control characters correctly?
"Serge Orlov" <se*********@gmail.com> wrote in message
news:ma***************************************@pyt hon.org...
On 6/27/06, Mike Currie <de*@null.com> wrote:
I'm trying to write out files that have utf-8 characters 0x85 and 0x08
in
them. Every configuration I try I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: oridinal not in range(128)

I've tried using the codecs.open('foo.txt', 'rU', 'utf-8',
errors='strict')
and that doesn't work and I've also try wrapping the file in an
utf8_writer
using codecs.lookup('utf8')

Any clues?

Use unicode strings for non-ascii characters. The following program
"works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control characters, are you sure you want it?
What is the encoding of your data?

Jun 27 '06 #5

Serge Orlov

On 6/27/06, Mike Currie <de*@null.com> wrote:

Okay,

Here is a sample of what I'm doing:
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
filterMap = {}
for i in range(0,255): ... filterMap[chr(i)] = chr(i)
... filterMap[chr(9)] = chr(136)
filterMap[chr(10)] = chr(133)
filterMap[chr(136)] = chr(9)
filterMap[chr(133)] = chr(10)

This part is incorrect, it should be:

filterMap = {}
for i in range(0,128):
filterMap[chr(i)] = chr(i)

filterMap[chr(9)] = unichr(136)
filterMap[chr(10)] = unichr(133)
filterMap[unichr(136)] = chr(9)
filterMap[unichr(133)] = chr(10)

Jun 27 '06 #6

Mike Currie

Well, not really. It doesn't affect the result. I still get the error
message. Did you get a different result?
"Serge Orlov" <se*********@gmail.com> wrote in message
news:ma***************************************@pyt hon.org...

On 6/27/06, Mike Currie <de*@null.com> wrote:
Okay,

Here is a sample of what I'm doing:
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):

... filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)

This part is incorrect, it should be:

filterMap = {}
for i in range(0,128):
filterMap[chr(i)] = chr(i)

filterMap[chr(9)] = unichr(136)
filterMap[chr(10)] = unichr(133)
filterMap[unichr(136)] = chr(9)
filterMap[unichr(133)] = chr(10)

Jun 27 '06 #7

Serge Orlov

On 6/27/06, Mike Currie <de*@null.com> wrote:

Well, not really. It doesn't affect the result. I still get the error
message. Did you get a different result?

Yes, the program succesfully wrote text file. Without magic abilities
to read the screen of your computer I guess you now get exception in
print statement. It is because you use legacy windows console (I use
unicode-capable console of lightning compiler
<http://www.python.org/pypi/Lightning%20Compiler> to run snippets of
code). You can either change console or comment out print statement or
change your program to print unicode representation: print
repr(filteredLine)

Jun 27 '06 #8

by: Nuff Said | last post by:

When I type the following code in the interactive python shell, I get 'UTF-8'; but if I put the code into a Python script and run the script - in the same terminal on my Linux box in which I...

Python

problem with python zsi

by: Rafal Zawadzki | last post by:

Hi. I tried earlier to write python zsi mail list, but nobody answered. I am using ZSI 1.7/2.0rc1 with TTPro Soap SDK. The wsdl file can be found here: http://demo.seapine.com/ttsoapcgi.wsdl ...

Python

122

"index" method only for mutable sequences??

by: C.L. | last post by:

I was looking for a function or method that would return the index to the first matching element in a list. Coming from a C++ STL background, I thought it might be called "find". My first stop was...

Python

Python's handling of unicode surrogates

by: Adam Olsen | last post by:

As was seen in another thread, there's a great deal of confusion with regard to surrogates. Most programmers assume Python's unicode type exposes only complete characters. Even CPython's own...

Python

UTF-8 Support of Curses in Python 2.5

by: shrek2099 | last post by:

Hi All, Recently I ran into a problem with UTF-8 surrport when using curses library in python 2.5 in Fedora 7. I found out that the program using curses cannot print out unicode characters...

Python

Python beginner, unicode encode/decode Q

by: anonymous | last post by:

1 Objective to write little programs to help me learn German. See code after numbered comments. //Thanks in advance for any direction or suggestions. tk 2 Want keyboard answer input, for...

Python

Python HTML parser chokes on UTF-8 input

by: Johannes Bauer | last post by:

Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as...

Python

inserting Unicode character in dictionary - Python

by: gita ziabari | last post by:

Hello All, The following code does not work for unicode characters: keyword = dict() kw = 'ÇÅÎÓËÉÈ' keyword.setdefault(key, ).append (kw) It works fine for inserting ASCII character. Any...

Python

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

Python UTF-8 and codecs

Similar topics