XML/HTML Encoding problem

A colleague has asked me this and I don't know the answer. Can anyone here
help with this? Thanks in advance.

Here is his email:

I am trying to parse an HTML document using the xml.dom.minidom parser and
then outputting a valid HTML document, all using the ISO-8859-1 charset.
For example:

My input:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Desired output:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Note that it doesn't matter if the '<?xml version="1.0"
encoding="ISO-8859-1"?>' header gets stripped. What does matter is that the
input document has the 'ISO-8859-1' charset and is an ANSI-encoded file.

The problem I get is that when I run, for example:

from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>

So it converts the entity reference into a literal € (Euro sign). I need it to
remain as an entity reference (&#8364;) so that the resulting HTML can render
properly in a browser. Is there a way to make the parser not convert the
entity references? Or is there a convenient post-processing function that
will do the conversion?

--
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk

May 22 '06 #1
Dale Strickland-Clark enlightened us with:
So it converts the entity reference into a literal € (Euro sign). I need it to
remain as an entity reference (&#8364;) so that the resulting HTML can render
properly in a browser.


If you want proper display, why not use UTF-8?

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
May 22 '06 #2
Dale Strickland-Clark wrote:
from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>

So it converts the entity reference into a literal € (Euro sign). I need it to
remain as an entity reference (&#8364;) so that the resulting HTML can render
properly in a browser. Is there a way to make the parser not convert the
entity references? Or is there a convenient post-processing function that
will do the conversion?


First up, when I repeat what you did I don't get the same output. toxml()
without an encoding argument produces a unicode string, with no encoding
attribute in the <?xml ...?> declaration.
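
For anyone trying this on Python 3, here is a minimal sketch of the same behaviour. SAMPLE is a made-up, cut-down document (an assumption, not the poster's file) whose body holds a numeric reference to the Euro sign. On Python 3 the "unicode string" is simply a str, and passing an encoding gives back bytes instead.

```python
from xml.dom.minidom import parseString

# Hypothetical cut-down input; the body holds a numeric reference to the Euro sign.
SAMPLE = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<html><body>&#8364;</body></html>'
)

doc = parseString(SAMPLE)

# No encoding argument: the result is text, and the XML declaration
# carries no encoding attribute.
as_text = doc.toxml()

# With an encoding argument: the result is bytes, and the declaration
# names that encoding.
as_bytes = doc.toxml(encoding="utf-8")

print(type(as_text).__name__, as_text)
print(type(as_bytes).__name__, as_bytes)
```

In both cases the parser has already resolved the reference, so the serialized document contains the Euro character itself rather than the reference.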

toxml() only takes a single encoding argument, so unfortunately there isn't
any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):
from xml.dom.minidom import parseString

strHTML = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>'''

# Encode to ASCII; characters outside ASCII become numeric character references.
print parseString(strHTML).toxml().encode('ascii', 'xmlcharrefreplace')

Output:

<?xml version="1.0" ?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
&#8364;
</body>
</html>


You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.
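
The same idea can be sketched for Python 3, where toxml() returns a str and encode() returns bytes, so a final decode() is needed to get text back. The helper name to_ascii_markup and the sample document are assumptions made up for illustration; they are not part of minidom.

```python
from xml.dom.minidom import parseString


def to_ascii_markup(source):
    """Parse markup and re-serialize it as ASCII-only text (hypothetical helper)."""
    text = parseString(source).toxml()  # str on Python 3
    # Any character outside ASCII is replaced by a numeric character reference.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Made-up sample input modelled on the document in the question.
html = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>'''

result = to_ascii_markup(html)
print(result)
```

One design choice to be aware of: encoding to ASCII also escapes accented Latin-1 letters. If those should stay as single ISO-8859-1 bytes, text.encode("iso-8859-1", "xmlcharrefreplace") escapes only the characters that ISO-8859-1 cannot represent, such as the Euro sign, and returns bytes ready to write to a file.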
May 22 '06 #3
Thanks, Duncan. That did the trick.

If you're EuroPythoning, I'll buy you a drink.

Cheers.
Duncan Booth wrote:
First up, when I repeat what you did I don't get the same output. toxml()
without an encoding argument produces a unicode string, and no encoding
attribute in the <?xml ...?>

toxml() only takes a single encoding argument, so unfortunately there
isn't any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):
You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.


--
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk

May 23 '06 #4
