Python UTF-8 and codecs

I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
them. With every configuration I try, I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: ordinal not in range(128)

I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
and that doesn't work. I've also tried wrapping the file in a utf8_writer
using codecs.lookup('utf8').

Any clues?

Thanks
Mike
Jun 27 '06 #1
Mike Currie wrote:
I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
them. With every configuration I try, I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: ordinal not in range(128)

I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
and that doesn't work
[...]


You want to write to a file, but you used the 'rU' mode. This should be
'wU'. I don't know whether this is the only reason it doesn't work. Could
you show more of your code?
Bye,
Dennis
Jun 27 '06 #2
On 6/27/06, Mike Currie <de*@null.com> wrote:
I'm trying to write out files that have utf-8 characters 0x85 and 0x08 in
them. With every configuration I try, I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: ordinal not in range(128)

I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
and that doesn't work. I've also tried wrapping the file in a utf8_writer
using codecs.lookup('utf8').

Any clues?


Use unicode strings for non-ascii characters. The following program "works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control character; are you sure you want it?
What is the encoding of your data?
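
For readers on Python 3, where every str is already Unicode, a rough equivalent of the program above might look like the sketch below. The file name foo.txt comes from the thread; using the built-in open() with an encoding argument instead of codecs.open(), and writing into a temporary directory, are my assumptions and not something the posters did.

```python
# Python 3 sketch: write the control character U+0085 as UTF-8.
import os
import tempfile

c1 = chr(0x85)  # Python 3 counterpart of unichr(0x85)

# A temporary directory keeps the example from touching real files.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")

# newline="" turns off newline translation, so the text is written as-is.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(c1)

# U+0085 is outside ASCII, so UTF-8 stores it as more than one byte.
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

The point of reading the file back in binary mode is to show that the code point 0x85 and the bytes on disk are different things once an encoding is involved.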
Jun 27 '06 #3
I did make a mistake; it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with newline and tab characters
inside quoted fields. The idea is to convert all the newline and tab
characters to 0x85 and 0x88 respectively, then process the files. Finally,
right before importing them into a database, convert them back to newlines
and tabs, thus preserving the field values.

Will Python not handle the control characters correctly?
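
The swap-and-restore idea described above can be sketched in Python 3 with str.translate. This is only an illustration under my own assumptions: the names TAB_MARK, NEWLINE_MARK, to_marks and from_marks are hypothetical, and the sketch treats one field value at a time rather than locating quoted fields in a file.

```python
# Hypothetical placeholder characters; the code points follow the post above.
TAB_MARK = "\x88"      # stands in for a tab
NEWLINE_MARK = "\x85"  # stands in for a newline

# Translation tables for the forward and reverse substitutions.
to_marks = str.maketrans({"\t": TAB_MARK, "\n": NEWLINE_MARK})
from_marks = str.maketrans({TAB_MARK: "\t", NEWLINE_MARK: "\n"})

field = "this\thas\ntabs and line\nbreaks"

protected = field.translate(to_marks)       # no tabs or newlines left
restored = protected.translate(from_marks)  # original field value again

print(restored == field)
```

One thing to watch for in Python 3: str.splitlines() treats U+0085 as a line boundary, so protected text is safer to split with an explicit split("\n").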
"Serge Orlov" <se*********@gmail.com> wrote in message
news:ma***************************************@pyt hon.org...
Use unicode strings for non-ascii characters. The following program
"works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control character; are you sure you want it?
What is the encoding of your data?

Jun 27 '06 #4
Okay,

Here is a sample of what I'm doing:
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this has
... tabs and line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
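
The traceback reports a decode error even though the code is writing. A likely reading is that the Python 2 writer received a byte string and had to turn it into unicode with the default ASCII codec before it could encode to UTF-8. The Python 3 sketch below reproduces that decode step in isolation; the sample bytes and the choice of Latin-1 as a codec that accepts every byte are my assumptions for illustration.

```python
# Byte 0x88 sits at position 4, as in the traceback above.
filtered_bytes = b"this\x88has"

# Decoding as ASCII fails, because 0x88 is not in range(128).
try:
    filtered_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Latin-1 maps each byte to the code point with the same number,
# so 0x88 becomes U+0088, which UTF-8 can then encode.
text = filtered_bytes.decode("latin-1")
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```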
Jun 27 '06 #5
On 6/27/06, Mike Currie <de*@null.com> wrote:
Okay,

Here is a sample of what I'm doing:
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)


This part is incorrect, it should be:

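filterMap = {}
# The input is ASCII, so only the 128 ASCII characters map to themselves.
for i in range(0,128):
    filterMap[chr(i)] = chr(i)

# The placeholder characters are unicode objects, not byte strings.
filterMap[chr(9)] = unichr(136)
filterMap[chr(10)] = unichr(133)
filterMap[unichr(136)] = chr(9)
filterMap[unichr(133)] = chr(10)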
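
As a rough Python 3 rendering of the corrected mapping, the sketch below filters a sample string, writes it to a UTF-8 file, reads it back and restores it. The sample string is my reconstruction of the one in the session (tabs and newlines are assumed where the console output showed placeholder characters), and the temporary file path is an assumption too.

```python
import os
import tempfile

# Identity mapping for ASCII, then the substitutions from the post above.
filter_map = {chr(i): chr(i) for i in range(128)}
filter_map["\t"] = chr(136)
filter_map["\n"] = chr(133)
filter_map[chr(136)] = "\t"
filter_map[chr(133)] = "\n"

line = "this\thas\t\ntabs\tand\tline\nbreaks"
filtered_line = "".join(filter_map[ch] for ch in line)

# Write and read with an explicit encoding and no newline translation.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(filtered_line)
with open(path, "r", encoding="utf-8", newline="") as f:
    read_back = f.read()

# Applying the same map again undoes the substitution.
restored = "".join(filter_map[ch] for ch in read_back)
print(restored == line)
```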
Jun 27 '06 #6
Well, not really. It doesn't affect the result. I still get the error
message. Did you get a different result?
Jun 27 '06 #7
On 6/27/06, Mike Currie <de*@null.com> wrote:
Well, not really. It doesn't affect the result. I still get the error
message. Did you get a different result?


Yes, the program successfully wrote the text file. Without magic abilities
to read the screen of your computer, I guess you now get an exception in
the print statement. That is because you use the legacy Windows console (I use
the unicode-capable console of Lightning Compiler
<http://www.python.org/pypi/Lightning%20Compiler> to run snippets of
code). You can either change the console, comment out the print statement, or
change your program to print the unicode representation: print
repr(filteredLine)
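
In Python 3 the same idea of printing an escaped representation can be sketched with the built-in ascii(), which returns a string containing only ASCII characters, so it should be printable on any console. The sample value below is hypothetical.

```python
# Hypothetical filtered text containing the two placeholder characters.
filtered_line = "this\x88has\x85tabs"

# ascii() escapes every non-ASCII character instead of emitting it raw.
safe = ascii(filtered_line)
print(safe)
```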
Jun 27 '06 #8
