Printing Filenames with non-Ascii-Characters

Hi,

I am very new to Python and have run into the following problem. If I do
something like

dir = os.listdir(somepath)
for d in dir:
    print d

The program fails for filenames that contain non-ascii characters.

'ascii' codec can't encode characters in position 33-34:

I have noticed that this seems to be a very common problem. I have read a lot
of postings regarding it but not really found a solution. Is there a simple
one?

What I specifically do not understand is why Python wants to interpret the
string as ASCII at all. Where is this setting hidden?

I am running Python 2.3.4 on Windows XP and I want to run the program on
Debian sarge later.

Ciao, MM
--
Marian Aldenhövel, Rosenhain 23, 53123 Bonn. +49 228 624013.
http://www.marian-aldenhoevel.de
"There is a procedure to follow in these cases, and if followed it can
pretty well guarantee a generous measure of success, success here
defined as survival with major extremities remaining attached."
Jul 18 '05 #1
On Tue, 01 Feb 2005 20:28:11 +0100, Marian Aldenhövel
<ma****@mba-software.de> wrote:
> Hi,
>
> I am very new to Python and have run into the following problem. If I do
> something like
>
> dir = os.listdir(somepath)
> for d in dir:
>     print d
>
> The program fails for filenames that contain non-ascii characters.
>
> 'ascii' codec can't encode characters in position 33-34:
>
> I have noticed that this seems to be a very common problem. I have read
> a lot of postings regarding it but not really found a solution. Is there a
> simple one?

The English Windows command prompt uses the cp437 charset. To print the
name, use

print d.encode('cp437')

The issue is that a terminal only understands a certain character set. If
you have a unicode string, like d in your case, you have to encode it before
it can be printed. (We really need a native unicode terminal!!!) If you don't
encode, Python will do it for you. The default encoding is ASCII. Any
string that contains a non-ASCII character will give you trouble. In my
opinion Python is too conservative in using the 'strict' error handling,
which gives users unaware of unicode a lot of woes.
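
The difference between 'strict' and a more forgiving error handler can be sketched roughly as follows. This is only an illustration written for a current Python 3 interpreter: the helper name `encode_for_terminal` and the sample filename are made up, and cp437 is assumed as the terminal charset as in the post above.

```python
# Sketch: encode a unicode filename for a terminal without dying on
# characters the target charset lacks. All names here are illustrative.

def encode_for_terminal(text, encoding="cp437", errors="replace"):
    """Return text encoded to bytes; unmappable characters become '?'."""
    return text.encode(encoding, errors)

name = u"stra\xdfenschild.png"   # u'\xdf' is the German sharp s

# ASCII with the default 'strict' handler raises, as in the original report.
try:
    name.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ascii_bytes = encode_for_terminal(name, "ascii")   # sharp s replaced by '?'
cp437_bytes = encode_for_terminal(name, "cp437")   # sharp s exists in cp437
```

With 'replace' the output may be garbled for characters the charset lacks, but the program keeps running.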

So how did you get a unicode d to start with? If 'somepath' is unicode,
os.listdir returns a list of unicode strings. So why is somepath unicode?
Either you have entered a unicode literal or it comes from some other source.
One possible source is an XML parser, which returns strings in unicode.

Windows NT supports unicode filenames. I'm not sure about Linux. The result
may differ slightly.


> What I specifically do not understand is why Python wants to interpret
> the string as ASCII at all. Where is this setting hidden?
>
> I am running Python 2.3.4 on Windows XP and I want to run the program on
> Debian sarge later.
>
> Ciao, MM


Jul 18 '05 #2
Marian Aldenhövel wrote:
> Hi,
>
> I am very new to Python and have run into the following problem. If I do
> something like
>
> dir = os.listdir(somepath)
> for d in dir:
>     print d
>
> The program fails for filenames that contain non-ascii characters.
>
> 'ascii' codec can't encode characters in position 33-34:
>
> I have noticed that this seems to be a very common problem. I have read a
> lot of postings regarding it but not really found a solution. Is there a
> simple one?

No :) You're trying to deal with legacy terminals; you can't reliably
print unicode characters across various terminals. It's not really
Python's fault.

> What I specifically do not understand is why Python wants to interpret the
> string as ASCII at all. Where is this setting hidden?

http://www.python.org/moin/PrintFails Let me know if it's not clear. It
would be great if other people fixed/improved this page.

> I am running Python 2.3.4 on Windows XP and I want to run the program on
> Debian sarge later.

You need a cross-platform terminal that supports unicode output.

Sergey.

Jul 18 '05 #3
Marian Aldenhövel wrote:
> Hi,
>
> I am very new to Python and have run into the following problem. If I do
> something like
>
> dir = os.listdir(somepath)
> for d in dir:
>     print d
>
> The program fails for filenames that contain non-ascii characters.
>
> 'ascii' codec can't encode characters in position 33-34:

If you read this carefully, you'll notice that Python has tried and
failed to *encode* a decoded ( = unicode) string using the 'ascii'
codec. In other words, d seems to be bound to a unicode string, which is
unexpected unless the argument passed to os.listdir (somepath) is a
unicode string, too. (If given a unicode string as argument, os.listdir
will return the names as a list of unicode strings.)
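
The "type in, same type out" behaviour of os.listdir can be checked with a small experiment. The sketch below assumes a current Python 3 interpreter, where the two string types are `str` and `bytes` rather than the `unicode` and `str` of Python 2.3; the scratch directory and file name are invented for the demonstration.

```python
import os
import tempfile

# Create a scratch directory with one file so the listing is not empty.
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, "test.py"), "w").close()

# A text path yields text names; a byte path yields byte names.
text_names = os.listdir(scratch)
byte_names = os.listdir(os.fsencode(scratch))

print(type(text_names[0]).__name__)
print(type(byte_names[0]).__name__)
```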

If you're printing to the console, modern Pythons will try to guess the
console's encoding (e.g. cp850). I would expect a UnicodeEncodeError if
the print fails because the characters do not map to the console's
encoding, not the error you're seeing.

How *are* you running the program? In the console (cmd.exe)? Or from
some IDE?

> I have noticed that this seems to be a very common problem. I have read
> a lot of postings regarding it but not really found a solution. Is there a
> simple one?
>
> What I specifically do not understand is why Python wants to interpret the
> string as ASCII at all. Where is this setting hidden?

Don't be tempted to ever change sys.defaultencoding in site.py, this is
site specific, meaning that if you ever distribute them, programs
relying on this setting may fail on other people's Python installations.

--
Vincent Wehren

> I am running Python 2.3.4 on Windows XP and I want to run the program on
> Debian sarge later.
>
> Ciao, MM

Jul 18 '05 #4
Hi,

Thank you very much, you have collectively cleared up some of the confusion.
> English windows command prompt uses cp437 charset.

To be exact my Windows is German, but I am not outputting to the command prompt
window. I am using eclipse with the pydev plugin as development platform and
the output is redirected to the console view in the IDE. I am not sure how
this affects the problem and have since tried a vanilla console too. The
problem stays the same, though.

I wonder what surprises are waiting for me when I first move this to my
linux-box :-). I believe it uses UTF-8 throughout.
> print d.encode('cp437')

So I would have to specify the encoding on every call to print? I am sure to
forget and I don't like the program dying, in my case garbled output would be
much more acceptable.

Is there some global way of forcing an encoding instead of the default
'ascii'? I have found references to setencoding() but this seems to have gone
away.
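
One way to avoid calling encode() at every print is to wrap the output stream once so that everything written to it is encoded automatically. The following is a hedged sketch for a current Python 3 interpreter, not a statement of how the Python 2.3 installation in this thread should be configured: the helper name `make_encoding_writer` is invented, cp437 is only an assumed console encoding, and an in-memory `io.BytesIO` stands in for the real console.

```python
import codecs
import io

def make_encoding_writer(byte_stream, encoding, errors="replace"):
    """Wrap a byte stream so that writing text encodes it automatically."""
    return codecs.getwriter(encoding)(byte_stream, errors)

# A BytesIO stands in for the real console stream in this sketch.
fake_console = io.BytesIO()
out = make_encoding_writer(fake_console, "cp437")

out.write(u"\xdcbersetzung.rtf\n")   # U-umlaut exists in cp437
out.write(u"\u20ac.txt\n")           # the euro sign does not; it becomes '?'

written = fake_console.getvalue()
```

In a real program the wrapped stream would be the process's standard output (`sys.stdout` on Python 2, `sys.stdout.buffer` on Python 3). Because of the 'replace' handler, unmappable characters come out garbled instead of terminating the program.
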
> The issue is that a terminal only understands a certain character set.

I have experimented a bit now and I can make it work using encode(). The
eclipse console uses a different encoding than my windows command prompt, by
the way. I am sure this can be configured somewhere but I do not really care
at the moment.
> If you have a unicode string, like d in your case, you have to encode it
> before it can be printed.

I got that now.

So encode() is a method of a unicode string, right? I come from a background
of statically typed languages so I am a bit queasy when I am not allowed to
explicitly specify type.

How can I, maybe by print()-ing something, find out what type d actually
is? Just to make sure and get a better feeling for the system?

Should d at any time not be a unicode string but some other flavour of string,
will encode() still work? Or do I need to write a function myPrint() that
distinguishes them by type and calls encode() only for unicode strings?
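
A helper of the kind asked about here might look roughly like this. It is a sketch under assumptions: the name `to_output_bytes` is hypothetical, cp1252 is just an example target encoding, and the code is written so that it runs on a current Python 3 interpreter. On Python 2, calling encode() on a plain byte string first decodes it implicitly with the default ASCII codec, which is why the helper leaves byte strings alone.

```python
TEXT_TYPE = type(u"")   # 'unicode' on Python 2, 'str' on Python 3

def to_output_bytes(value, encoding="cp1252", errors="replace"):
    """Encode text; pass byte strings through untouched."""
    if isinstance(value, TEXT_TYPE):
        return value.encode(encoding, errors)
    return value

# type() answers the "what is d really?" question.
kind = type(u"test.py").__name__

encoded = to_output_bytes(u"\xdcbersetzung.rtf")   # text gets encoded
unchanged = to_output_bytes(b"test.py")            # bytes stay as they are
```
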
> So how did you get a unicoded d to start with?

I have asked myself this question before after reading the docs for
os.listdir(). But I have no way of finding out what type d really is (see
question above :-)). So I was dead-reckoning.

Can I force a string to be of a certain type? Like

nonunicode = unicode.encode("specialencoding")

How would I do it the other way round? From encoded representation to full
unicode?
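
Both directions can be sketched in a few lines: encode() goes from unicode text to an encoded byte string, and decode() goes back. The sample name and the choice of UTF-8 and cp1252 are assumptions made for illustration, and the snippet targets a current Python 3 interpreter.

```python
original = u"stra\xdfenschild.png"

as_bytes = original.encode("utf-8")    # unicode text -> encoded bytes
restored = as_bytes.decode("utf-8")    # encoded bytes -> unicode text

# Decoding with a codec other than the one used for encoding does not
# necessarily fail, but it yields different (garbled) text.
wrong = as_bytes.decode("cp1252")
```

The round trip only restores the original when the same codec is named on both sides.
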
> If 'somepath' is unicode, os.listdir returns a list of unicode.
> So why is somepath unicode? One possible source is XML parser, which
> returns string in unicode.

I get a root-directory from XML and I walk the filesystem from there. That
explains it.
> Windows NT support unicode filename. I'm not sure about Linux. The
> result maybe slightly differ.


I think I will worry about that later. I can create files using german umlauts
on the linux box. I am sure I will find a way to move those names into my
Python program.

I will not move data between the systems so there will not be much of
a problem.

Ciao, MM
Jul 18 '05 #5
Hi,
> Don't be tempted to ever change sys.defaultencoding in site.py, this is
> site specific, meaning that if you ever distribute them, programs
> relying on this setting may fail on other people's Python installations.

But wouldn't that be correct in my case?
> If you're printing to the console, modern Pythons will try to guess the
> console's encoding (e.g. cp850).


But it seems to have guessed wrong. I don't blame it, I would not know of
any way to reliably figure out this setting.

My console can print the filenames in question fine, I can verify that by
simply listing the directory, so it can display more than plain ascii.
The error message seems to indicate that ascii is used as target.

So if I were to fix this in site.py to configure whatever encoding is
actually used on my system, I could print() my filenames without explicitly
calling encode()?

If the program then fails on other people's installations that would mean
one of two things:

1) They have not configured their encoding correctly.
2) The data to be printed cannot be encoded. This is unlikely as it comes
from a local filename.

So wouldn't fixing site.py be the right thing to do? To enable Python to print
everything that can actually be printed and not barf at things it could print
but cannot because it defaults to plain ascii?

Ciao, MM
Jul 18 '05 #6
Marian Aldenhövel wrote:
> > If you're printing to the console, modern Pythons will try to guess the
> > console's encoding (e.g. cp850).
>
> But it seems to have guessed wrong. I don't blame it, I would not know of
> any way to reliably figure out this setting.


Have you set the coding cookie in your file?

Try adding this as the first or second line.

# -*- coding: cp850 -*-

Python will then know how your file is encoded

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
Jul 18 '05 #7
Hi,
> Have you set the coding cookie in your file?

Yes. I set it to Utf-8 as that's what I use for all my development.
> Try adding this as the first or second line.
>
> # -*- coding: cp850 -*-
>
> Python will then know how your file is encoded


That is relevant to the encoding of source files, right? How does it affect
printing to standard out?

If it did, I would expect UTF-8 data on my console. That would be fine, it
can encode everything and, as I have written in another posting, in my case
garbled data is better than termination of my program.

But it uses 'ascii', at least if I can believe the error message it gave.

Ciao, MM
Jul 18 '05 #8
Marian Aldenhövel wrote:

> But wouldn't that be correct in my case?


This is what I get inside Eclipse using pydev when I run:

<code>
import os
dirname = "c:/test"
print dirname
for fname in os.listdir(dirname):
    print fname
    if os.path.isfile(fname):
        print fname
</code>

c:/test
straßenschild.png
test.py
Übersetzung.rtf
This is what I get passing a unicode argument to os.listdir:

<code>
import os
dirname = u"c:/test"
print dirname # will print fine, all ascii subset compatible
for fname in os.listdir(dirname):
    print fname
    if os.path.isfile(fname):
        print fname
</code>

c:/test
Traceback (most recent call last):
  File "C:\Programme\eclipse\workspace\myFirstProject\pythonFile.py", line 5, in ?
    print fname
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 4: ordinal not in range(128)

which is probably what you are getting, right?

You are trying to write *Unicode* objects containing characters outside
the range 0-127 to a byte-oriented output without telling Python the
appropriate encoding to use. Inside Eclipse, Python will always use
ascii and never guess.

import os
dirname = u"c:/test"
print dirname
for fname in os.listdir(dirname):
    print type(fname)

c:/test
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>

so finally:
<code>
import os
dirname = u"c:/test"
print dirname
for fname in os.listdir(dirname):
    print fname.encode("mbcs")
</code>

gives:

c:/test
straßenschild.png
test.py
Übersetzung.rtf

Instead of "mbcs", which should be available on all Windows systems, you
could have used "cp1252" when working on a German locale; inside Eclipse
even "utf-16-le" would work, underscoring that the way the 'output
device' handles encodings is decisive. I know this all seems awkward at
first, but Python's drive towards uncompromising explicitness pays off
big time when you're dealing with multilingual data.
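
Since the output device decides which encoding is right, one option is to ask the output stream what it advertises and fall back to something safe when it advertises nothing. The sketch below is an assumption-laden illustration for a current Python 3 interpreter: `pick_output_encoding` and `PlainStream` are invented names, and the ASCII fallback with 'replace' is just one possible policy.

```python
def pick_output_encoding(stream, fallback="ascii"):
    """Use the encoding the stream advertises, else the fallback."""
    return getattr(stream, "encoding", None) or fallback

class PlainStream(object):
    """Stand-in for a redirected output that advertises no encoding."""
    encoding = None

name = u"\xdcbersetzung.rtf"

enc = pick_output_encoding(PlainStream())   # nothing advertised -> fallback
safe = name.encode(enc, "replace")          # never raises, may lose characters
```

Passing `sys.stdout` instead of the stand-in would pick up whatever the real console or IDE view reports, if it reports anything.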

--
Vincent Wehren


Jul 18 '05 #9
> > print d.encode('cp437')
>
> So I would have to specify the encoding on every call to print? I am
> sure to forget and I don't like the program dying, in my case garbled
> output would be much more acceptable.


Marian, I'm with you. You never know whether you have put enough encode
calls in all the right places, and there is no static type checking to help
you. So the short answer is to set a different default in sitecustomize.py.
I'm trying to write up something about unicode in Python, once I understand
what's going on inside...
Jul 18 '05 #10
