
sax barfs on unicode filenames

Hi. Presumably this is an easy question, but anyone who understands the sax
docs thinks completely differently than I do :-)

Following the usual cookbook examples, my app parses an open file as
follows::

    import xml.sax
    import xml.sax.handler

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges, 1)
    # Hopefully the content handler can figure out the encoding from the <?xml> element.
    handler = saxContentHandler(c, inputFileName, silent)
    parser.setContentHandler(handler)
    parser.parse(theFile)

Here 'theFile' is an open file. Usually this works just fine, but when the
filename contains u'\u8116' I get the following exception:

Traceback (most recent call last):
  File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2159, in parse_leo_file
    parser.parse(theFile)
  File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)

Presumably the documentation at:

http://docs.python.org/lib/module-xm...xmlreader.html

would be sufficient for a sax-head, but I have absolutely no idea of how to
create an InputSource that can handle non-ascii filenames.

Any help would be appreciated. Thanks!

Edward
--------------------------------------------------------------------
Edward K. Ream email: ed*******@charter.net
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Oct 4 '06 #1
9 Replies


Edward K. Ream wrote:
> Hi. Presumably this is an easy question, but anyone who understands the
> sax docs thinks completely differently than I do :-)
>
> Following the usual cookbook examples, my app parses an open file as
> follows::
>
>     parser = xml.sax.make_parser()
>     parser.setFeature(xml.sax.handler.feature_external_ges, 1)
>     # Hopefully the content handler can figure out the encoding from the <?xml> element.
>     handler = saxContentHandler(c, inputFileName, silent)
>     parser.setContentHandler(handler)
>     parser.parse(theFile)
>
> Here 'theFile' is an open file. Usually this works just fine, but when

Filenames are expected to be bytestrings. So what happens is that the
unicode string you pass as filename gets implicitly converted using the
default encoding.

You have to encode the unicode string according to your filesystem
beforehand.

Diez

Oct 4 '06 #2

Diez B. Roggisch wrote:
> Filenames are expected to be bytestrings. So what happens is that the
> unicode string you pass as filename gets implicitly converted using the
> default encoding.

it is?

>>> f = open(u"\u8116", "w")
>>> f.write("hello")
>>> f.close()
>>> f = open(u"\u8116", "r")
>>> f.read()
'hello'

</F>

Oct 4 '06 #3

> Filenames are expected to be bytestrings.

The exception happens in a method to which no fileName is passed as an
argument.

parse_leo_file:
'C:\\prog\\tigris-cvs\\leo\\test\\unittest\\chinese?folder\\chinese?test.leo'
(trace of converted fileName)

Unexpected exception parsing
C:\prog\tigris-cvs\leo\test\unittest\chinese?folder\chinese?test.leo
Traceback (most recent call last):
  File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2162, in parse_leo_file
    parser.parse(theFile)
  File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)

To repeat, theFile is an open file. I believe the actual filename is passed
nowhere as an argument to sax in my code. Just to make sure, I converted
the filename to ascii in my code, and got (no surprise) exactly the same
crash. I suppose a workaround would be to pass a file-like object to sax
instead of an open file, so that theFile.getSystemId won't crash. But this
looks like a bug to me.

BTW:

Python 2.5.0, Tk 8.4.12, Pmw 1.2
Windows 5, 1, 2600, 2, Service Pack 2

Edward
Oct 4 '06 #4
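
The idea raised in this post, keeping sax away from the file name altogether, can also be sketched with an explicit InputSource, which is what the original question asked about. This is only a minimal sketch, assuming a modern Python 3 interpreter rather than the Python 2.5 used in the thread. `TagCollector`, `parse_without_system_id`, and the sample XML are hypothetical names made up for illustration. The source carries a byte stream but no system ID, so, going by the traceback above, the expat reader should have nothing to pass to SetBase.

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.xmlreader import InputSource


class TagCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def parse_without_system_id(byte_stream, handler):
    """Parse byte_stream through an InputSource that has no system ID."""
    source = InputSource()             # the system ID defaults to None
    source.setByteStream(byte_stream)  # the parser reads from here, not from a path
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler


# Made-up sample document standing in for a real .leo file.
sample = io.BytesIO(b"<leo_file><vnodes/></leo_file>")
collector = parse_without_system_id(sample, TagCollector())
```

One trade-off to keep in mind: without a system ID the parser has no base against which to resolve relative external entities, which may or may not matter for a given document.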


Diez B. Roggisch wrote:
> Edward K. Ream wrote:
>> Hi. Presumably this is an easy question, but anyone who understands the
>> sax docs thinks completely differently than I do :-)
>>
>> Following the usual cookbook examples, my app parses an open file as
>> follows::
>>
>>     parser = xml.sax.make_parser()
>>     parser.setFeature(xml.sax.handler.feature_external_ges, 1)
>>     # Hopefully the content handler can figure out the encoding from the <?xml> element.
>>     handler = saxContentHandler(c, inputFileName, silent)
>>     parser.setContentHandler(handler)
>>     parser.parse(theFile)
>>
>> Here 'theFile' is an open file. Usually this works just fine, but when
>
> Filenames are expected to be bytestrings. So what happens is that the
> unicode string you pass as filename gets implicitly converted using the
> default encoding.
>
> You have to encode the unicode string according to your filesystem
> beforehand.

Not if your filesystem supports Unicode names, as Windows does.
Edward's point is that something is (whether by accident or "design")
trying to coerce it to str, and failing.

Oct 4 '06 #5

Happily, the workaround is easy. Replace theFile with:

import cStringIO

# Use cStringIO to avoid a crash in sax when inputFileName has unicode characters.
s = theFile.read()
theFile = cStringIO.StringIO(s)

My first attempt at a workaround was to use:

s = theFile.read()
parser.parseString(s)

but the expat parser does not support parseString...

Edward
Oct 4 '06 #6
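
For readers on a current interpreter, the same workaround might look roughly like the sketch below. It assumes Python 3, where cStringIO no longer exists and `io.BytesIO` plays the equivalent role; `TextCollector`, `parse_from_memory`, and the sample document are hypothetical names made up for illustration. The point is the same as in the post above: the in-memory buffer has no `name` attribute, so sax never sees the file name.

```python
import io
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_from_memory(the_file, handler):
    """Read the_file (opened in binary mode) and parse its contents from memory."""
    data = the_file.read()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # io.BytesIO has no .name attribute, so no file name reaches the parser.
    parser.parse(io.BytesIO(data))
    return handler


# Stand-in for an open file; a real call would pass open(path, "rb").
fake_file = io.BytesIO(b"<note>hello</note>")
text = "".join(parse_from_memory(fake_file, TextCollector()).chunks)
```

Reading the whole file into memory is fine for small documents; for very large ones the InputSource route may be preferable.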

Fredrik Lundh wrote:
> Diez B. Roggisch wrote:
>> Filenames are expected to be bytestrings. So what happens is that the
>> unicode string you pass as filename gets implicitly converted using the
>> default encoding.
>
> it is?

Yes. While you can pass Unicode strings as file names to many Python
functions, you can't pass them to Expat, as Expat requires the file name
as a byte string. Hence the error.

Regards,
Martin

P.S. and just to anticipate nit-picking: yes, you can pass a Unicode
string to Expat, too, as long as the Unicode string only contains
ASCII characters. And yes, it doesn't have to be ASCII, if you change
the system default encoding.
Oct 4 '06 #7
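
The implicit conversion described in this reply can be mimicked in isolation. The sketch below assumes a modern Python 3 interpreter and simply encodes a name with the ASCII codec, which is what an ASCII default encoding amounts to; `try_ascii` and both paths are hypothetical and are not the ones from the thread.

```python
def try_ascii(name):
    """Encode name as an ASCII default encoding would.

    Returns the encoded bytes, or the UnicodeEncodeError that was raised.
    """
    try:
        return name.encode("ascii")
    except UnicodeEncodeError as err:
        return err


# Hypothetical paths made up for illustration.
plain = try_ascii("test/plain.leo")    # ASCII-only name encodes cleanly
failed = try_ascii("test/\u8116.leo")  # non-ASCII name cannot be encoded
```

Here `failed.start` gives the index of the offending character, which corresponds to the "position 44" reported in the traceback earlier in the thread.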

Edward K. Ream wrote:
> Happily, the workaround is easy. Replace theFile with:
>
> # Use cStringIO to avoid a crash in sax when inputFileName has unicode characters.
> s = theFile.read()
> theFile = cStringIO.StringIO(s)
>
> My first attempt at a workaround was to use:
>
> s = theFile.read()
> parser.parseString(s)
>
> but the expat parser does not support parseString...

Right - you would have to use xml.sax.parseString (which is a global
function, not a method).

Of course, parseString just does what you did: create a cStringIO
object and operate on that.

Regards,
Martin
Oct 4 '06 #8
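
A short sketch of the module-level function mentioned in this reply, assuming a modern Python 3 interpreter, where `xml.sax.parseString` accepts a bytes object. `ElementCounter` and the sample document are hypothetical names made up for illustration.

```python
import xml.sax
import xml.sax.handler


class ElementCounter(xml.sax.handler.ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


counter = ElementCounter()
# parseString is a function of the xml.sax module, not a method of the parser.
xml.sax.parseString(b"<root><a/><a/></root>", counter)
```

Since no file object is involved, there is no file name for the parser to trip over.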

Martin v. Löwis wrote:
> Yes. While you can pass Unicode strings as file names to many Python
> functions, you can't pass them to Expat, as Expat requires the file name
> as a byte string. Hence the error.

sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
doesn't seem to have any problems dealing with unicode filenames...)

</F>

Oct 4 '06 #9

Fredrik Lundh wrote:
> Martin v. Löwis wrote:
>> Yes. While you can pass Unicode strings as file names to many Python
>> functions, you can't pass them to Expat, as Expat requires the file name
>> as a byte string. Hence the error.
>
> sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
> doesn't seem to have any problems dealing with unicode filenames...)

That's because ET never invokes XML_SetBase. Without testing, this
suggests that there might be a problem in ET with relative URIs
in parsed external entities. XML_SetBase expects a char* for the
base URI.

Regards,
Martin
Oct 4 '06 #10
