unicode and xml/xsl


I'm a Python (& XML, & unicode!) newbie working on an interface to a
bibliographic reference server (refdb); I'm running into some encoding
problems & am finding the plethora of tools a little confusing. Here
is the basic situation:

I connect to the server and receive an xml document whose content is a
bibliographic dataset. The document can be encoded in two ways:
ISO-8859-1 or unicode (which I decode as UTF-8). My program simply
takes the document and passes it to an xsl stylesheet using libxslt &
libxml2. Here's the relevant code:

# this is how I get the results & generate either a string or a
# unicode string
def getref (self, query = ':ID:>0', cmd = 'getref ',
            reftype = default_reftype):
    cmd += ' ' + query
    self.send(cmd + self.CS_TERM)
    results = self.tread()
    if self.encoding == 'UNICODE':
        print ' decoding unicode string: '
        results = results.decode('utf-8', 'replace')
    return results

# this is where I generate the html:
def risx_to_html (self, risxSet, xsl = xsl_ss,
                  css=css_url, use_css = 1):
    styledoc = libxml2.parseFile(xsl)
    style = libxslt.parseStylesheetDoc(styledoc)
    doc = libxml2.parseDoc(risxSet)
    result = style.applyStylesheet(doc, None)
    # style.saveResultToFilename("results.html", result, 0)
    htmlString = style.saveResultToString(result)
    return htmlString

The server's default encoding is iso-8859-1, and since I mostly use
English-language references, this usually works fine; but occasionally
the server will pass me an entity like '&mu;' (for the Greek letter mu).
This generates an error like this:

Entity: line 57: parser error : Entity 'mu' not defined
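
One possible workaround for the "Entity 'mu' not defined" error, as a minimal sketch only: rewrite named entities as numeric character references before the document reaches the parser, since numeric references need no DTD. This assumes the names refdb emits are the standard HTML ones, which I have not confirmed. The helper name expand_named_entities is made up for illustration, and the import is the Python 3 spelling (on Python 2.3 the same table should live in htmlentitydefs).

```python
import re
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint

# The five entities that every XML parser already understands; leave them alone.
_XML_PREDEFINED = frozenset(['amp', 'lt', 'gt', 'quot', 'apos'])

# A named entity reference such as &mu; (numeric ones like &#956; do not match).
_NAMED_ENTITY = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def expand_named_entities(xml_text):
    """Replace known HTML entity names with numeric character references.

    Hypothetical helper: unknown names are left untouched, so the parser
    will still report them.
    """
    def _replace(match):
        name = match.group(1)
        if name in _XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return '&#%d;' % name2codepoint[name]

    return _NAMED_ENTITY.sub(_replace, xml_text)
```

The idea would be to call this on the server's reply before handing it to libxml2.parseDoc. The alternative of declaring the entities in a DTD is probably cleaner if the server can be made to send one.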

This is not so bad, because the parsing continues nonetheless. With
unicode it's worse. In this case there are several errors depending
on how I set the system up:

with iso-8859-1 set as default encoding in sitecustomize.py:

  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)

with utf-8 set as default encoding:

  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
TypeError: xmlParseDoc() argument 1 must be string without null bytes or None, not unicode

So I guess I have two questions:

(1) Am I using the right Python tools for this job? My excellent
Python books unfortunately don't cover either unicode or xml in much
depth, so I'm a little uncertain as to whether I'm doing the right
thing.

(2) Is there a way to make libxml2 parse unicode documents? Do I need
to pass it more information alerting it that it's getting unicode?
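
On question (2), the TypeError above suggests that parseDoc wants a plain byte string, not a unicode object. So one thing that may be worth trying, as a sketch under that assumption, is to encode the unicode text back to UTF-8 just before parsing and make sure any XML declaration names the same encoding. The helper name to_parser_bytes is hypothetical, and I have not verified this against the libxml2 bindings themselves.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_DECL_ENCODING = re.compile(r'''^(<\?xml[^>]*?encoding=)(["'])[^"']*\2''')


def to_parser_bytes(text, encoding='utf-8'):
    """Encode unicode text to bytes for a byte-oriented XML parser.

    Hypothetical helper: if the text begins with an XML declaration that
    names an encoding, the name is rewritten so that it agrees with the
    bytes actually produced.
    """
    def _fix(match):
        quote = match.group(2)
        return match.group(1) + quote + encoding + quote

    text = _DECL_ENCODING.sub(_fix, text, count=1)
    return text.encode(encoding)


# Intended use in risx_to_html (untested assumption):
#     doc = libxml2.parseDoc(to_parser_bytes(risxSet))
```

If this is right, it should not matter what the default encoding in sitecustomize.py is, because the conversion is explicit and no implicit unicode-to-string coercion happens.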

Anyway, thanks very much for your help. Much appreciated,

Matt Price ma********@utoronto.ca
History Department, University of Toronto
(416) 978-2094
Jul 18 '05 #1