sax barfs on unicode filenames

Hi. Presumably this is an easy question, but anyone who understands the sax
docs thinks completely differently than I do :-)

Following the usual cookbook examples, my app parses an open file as
follows::

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges, 1)
    # Hopefully the content handler can figure out the encoding from the
    # <?xml> element.
    handler = saxContentHandler(c, inputFileName, silent)
    parser.setContentHandler(handler)
    parser.parse(theFile)

Here 'theFile' is an open file. Usually this works just fine, but when the
filename contains u'\u8116' I get the following exception:

Traceback (most recent call last):
  File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2159, in parse_leo_file
    parser.parse(theFile)
  File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)

Presumably the documentation at:

http://docs.python.org/lib/module-xm...xmlreader.html

would be sufficient for a sax-head, but I have absolutely no idea of how to
create an InputSource that can handle non-ascii filenames.

Any help would be appreciated. Thanks!

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@charter.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #1
Edward K. Ream wrote:
> Hi. Presumably this is an easy question, but anyone who understands the
> sax docs thinks completely differently than I do :-)
>
> Following the usual cookbook examples, my app parses an open file as
> follows::
>
>     parser = xml.sax.make_parser()
>     parser.setFeature(xml.sax.handler.feature_external_ges, 1)
>     # Hopefully the content handler can figure out the encoding from the
>     # <?xml> element.
>     handler = saxContentHandler(c, inputFileName, silent)
>     parser.setContentHandler(handler)
>     parser.parse(theFile)
>
> Here 'theFile' is an open file. Usually this works just fine, but when

Filenames are expected to be bytestrings. So what happens is that the
unicode string you pass as filename gets implicitly converted using the
default encoding.

You have to encode the unicode string according to your filesystem
beforehand.

Diez

Oct 4 '06 #2
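
The implicit conversion Diez describes can be illustrated with a small sketch. It assumes Python 3, where the ASCII step has to be spelled out explicitly; in the Python 2.5 setup discussed in this thread the coercion happened silently. The helper name `encode_filename` and the sample name are hypothetical.

```python
def encode_filename(name, encoding):
    """Encode a text file name explicitly instead of relying on a default codec."""
    return name.encode(encoding)


# Hypothetical name containing the character from the traceback.
name = "chinese\u8116folder"

try:
    # This is what an ASCII default encoding amounts to.
    encode_filename(name, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An encoding that can represent the character succeeds. On a real system,
# sys.getfilesystemencoding() reports the encoding the OS layer expects.
utf8_bytes = encode_filename(name, "utf-8")
```

Whether encoding beforehand is the right fix is exactly what the following replies dispute, so this only shows the mechanism, not a recommendation.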
Diez B. Roggisch wrote:
> Filenames are expected to be bytestrings. So what happens is that the
> unicode string you pass as filename gets implicitly converted using the
> default encoding.

it is ?

>>> f = open(u"\u8116", "w")
>>> f.write("hello")
>>> f.close()
>>> f = open(u"\u8116", "r")
>>> f.read()
'hello'

</F>

Oct 4 '06 #3
> Filenames are expected to be bytestrings.

The exception happens in a method to which no fileName is passed as an
argument.

parse_leo_file:
'C:\\prog\\tigris-cvs\\leo\\test\\unittest\\chinese?folder\\chinese?test.leo'
(trace of converted fileName)

Unexpected exception parsing
C:\prog\tigris-cvs\leo\test\unittest\chinese?folder\chinese?test.leo
Traceback (most recent call last):
  File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2162, in parse_leo_file
    parser.parse(theFile)
  File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)

To repeat, theFile is an open file. I believe the actual filename is passed
nowhere as an argument to sax in my code. Just to make sure, I converted
the filename to ascii in my code, and got (no surprise) exactly the same
crash. I suppose a workaround would be to pass a file-like object to sax
instead of an open file, so that SetBase(source.getSystemId()) won't crash.
But this looks like a bug to me.

BTW:

Python 2.5.0, Tk 8.4.12, Pmw 1.2
Windows 5, 1, 2600, 2, Service Pack 2

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@charter.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #4
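
Edward's idea of handing sax something other than the open file can also be expressed with an explicit InputSource, which is what the first post asked about. This is a minimal sketch, assuming Python 3 (the thread itself concerns Python 2.5); `TagCollector`, `parse_without_system_id` and the sample document are hypothetical names chosen for illustration. The source is given a byte stream but no system id, so the parser has no file name to pass on.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class TagCollector(ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def parse_without_system_id(stream, handler):
    """Parse a binary stream through an InputSource that has no system id."""
    source = InputSource()        # system id stays None
    source.setByteStream(stream)  # the parser reads its bytes from here
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    return source


collector = TagCollector()
source = parse_without_system_id(io.BytesIO(b"<leo><node/></leo>"), collector)
```

In real use the stream would be the already opened file rather than a BytesIO object. The trade-off is that relative references inside the document can no longer be resolved against the file's location.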

Diez B. Roggisch wrote:
> Edward K. Ream wrote:
> > Hi. Presumably this is an easy question, but anyone who understands the
> > sax docs thinks completely differently than I do :-)
> >
> > Following the usual cookbook examples, my app parses an open file as
> > follows::
> >
> >     parser = xml.sax.make_parser()
> >     parser.setFeature(xml.sax.handler.feature_external_ges, 1)
> >     # Hopefully the content handler can figure out the encoding from the
> >     # <?xml> element.
> >     handler = saxContentHandler(c, inputFileName, silent)
> >     parser.setContentHandler(handler)
> >     parser.parse(theFile)
> >
> > Here 'theFile' is an open file. Usually this works just fine, but when
>
> Filenames are expected to be bytestrings. So what happens is that the
> unicode string you pass as filename gets implicitly converted using the
> default encoding.
>
> You have to encode the unicode string according to your filesystem
> beforehand.

Not if your filesystem supports Unicode names, as Windows does.
Edward's point is that something is (whether by accident or "design")
trying to coerce it to str, and failing.

Oct 4 '06 #5
Happily, the workaround is easy. Replace theFile with:

    # Use cStringIO to avoid a crash in sax when inputFileName has unicode
    # characters.
    s = theFile.read()
    theFile = cStringIO.StringIO(s)

My first attempt at a workaround was to use:

    s = theFile.read()
    parser.parseString(s)

but the expat parser does not support parseString...

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@charter.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #6
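
For readers on a current interpreter, the same workaround can be sketched as follows. It assumes Python 3, where `io.BytesIO` takes the place of `cStringIO.StringIO`; `ElementCounter`, `parse_from_memory` and the sample document are hypothetical. The in-memory buffer carries no `name` attribute, so sax has no file name from which to derive a system id.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def parse_from_memory(the_file, handler):
    """Read an open binary file fully, then parse the in-memory copy."""
    data = the_file.read()
    buffer = io.BytesIO(data)  # in-memory copy without a .name attribute
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(buffer)


counter = ElementCounter()
# A BytesIO object stands in for the opened .leo file in this sketch.
parse_from_memory(io.BytesIO(b"<leo><node/><node/></leo>"), counter)
```

The cost is that the whole file is held in memory at once, which is fine for small documents but worth keeping in mind for large ones.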
Fredrik Lundh wrote:
> Diez B. Roggisch wrote:
> > Filenames are expected to be bytestrings. So what happens is that the
> > unicode string you pass as filename gets implicitly converted using the
> > default encoding.
>
> it is ?

Yes. While you can pass Unicode strings as file names to many Python
functions, you can't pass them to Expat, as Expat requires the file name
as a byte string. Hence the error.

Regards,
Martin

P.S. and just to anticipate nit-picking: yes, you can pass a Unicode
string to Expat, too, as long as the Unicode string only contains
ASCII characters. And yes, it doesn't have to be ASCII, if you change
the system default encoding.
Oct 4 '06 #7
Edward K. Ream wrote:
> Happily, the workaround is easy. Replace theFile with:
>
>     # Use cStringIO to avoid a crash in sax when inputFileName has unicode
>     # characters.
>     s = theFile.read()
>     theFile = cStringIO.StringIO(s)
>
> My first attempt at a workaround was to use:
>
>     s = theFile.read()
>     parser.parseString(s)
>
> but the expat parser does not support parseString...

Right - you would have to use xml.sax.parseString (which is a global
function, not a method).

Of course, parseString just does what you did: create a cStringIO
object and operate on that.

Regards,
Martin
Oct 4 '06 #8
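
The distinction Martin draws between the module-level function and a parser method can be sketched like this. The example assumes Python 3 and passes the document as bytes; `TextCollector` and the sample document are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


collector = TextCollector()
# Module-level call: xml.sax.parseString(data, handler).
# The parser object returned by make_parser() has no parseString method.
xml.sax.parseString(b"<leo>hello</leo>", collector)
text = "".join(collector.chunks)
```

The handler is passed directly to the function, so no separate make_parser() and setContentHandler() steps are needed for this case.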
Martin v. Löwis wrote:
> Yes. While you can pass Unicode strings as file names to many Python
> functions, you can't pass them to Expat, as Expat requires the file name
> as a byte string. Hence the error.

sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
doesn't seem to have any problems dealing with unicode filenames...)

</F>

Oct 4 '06 #9
Fredrik Lundh wrote:
> Martin v. Löwis wrote:
> > Yes. While you can pass Unicode strings as file names to many Python
> > functions, you can't pass them to Expat, as Expat requires the file name
> > as a byte string. Hence the error.
>
> sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
> doesn't seem to have any problems dealing with unicode filenames...)

That's because ET never invokes XML_SetBase. Without testing, this
suggests that there might be a problem in ET with relative URIs
in parsed external entities. XML_SetBase expects a char* for the
base URI.

Regards,
Martin
Oct 4 '06 #10
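
The C function XML_SetBase that Martin mentions is reachable from Python as the SetBase method of an Expat parser object. The sketch below assumes Python 3, where the binding accepts a text string and, as far as I can tell, takes care of converting it for the C API; in the Python 2.5 setup of this thread the equivalent call is the one that raised the UnicodeEncodeError. The base URI is a hypothetical value.

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()

# Hypothetical base URI containing the character from the traceback.
base = "file:///c:/prog/chinese\u8116folder/"

parser.SetBase(base)              # set the base used for relative system ids
round_tripped = parser.GetBase()  # read the stored base back as text
```

This only exercises the base-URI setter in isolation; it says nothing about how ET resolves relative URIs, which Martin explicitly left untested.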
