XML/HTML Encoding problem

A colleague has asked me this and I don't know the answer. Can anyone here
help with this? Thanks in advance.

Here is his email:

I am trying to parse an HTML document using the xml.dom.minidom parser and
then output a valid HTML document, all using the ISO-8859-1 charset.
For example:

My input:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Desired output:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Note that it doesn't matter if the '<?xml version="1.0"
encoding="ISO-8859-1"?>' header gets stripped. What does matter is that the
input document has the 'ISO-8859-1' charset and is an ANSI encoded file.

The problem I get is that when I run, for example:

from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>

So it encodes the entity reference to € (Euro sign). I need it to remain as
&#8364; so that the resulting HTML can render properly in a browser. Is
there a way to make the parser not convert the entity references? Or is
there a convenient post processing function that will do the conversion?

--
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk

May 22 '06 #1
Dale Strickland-Clark enlightened us with:
So it encodes the entity reference to € (Euro sign). I need it to
remain as &#8364; so that the resulting HTML can render properly in
a browser.


If you want proper display, why not use UTF-8?
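
A minimal sketch of what switching to UTF-8 could look like, assuming Python 3 (where toxml() with an encoding argument returns bytes) and assuming the meta tag should be updated to match the new charset. The sample document and the helper name to_utf8_html are made up for illustration and are not from the original post.

```python
from xml.dom.minidom import parseString

# Hypothetical sample input: ISO-8859-1 bytes with the Euro sign written
# as a numeric character reference (Latin-1 has no Euro sign of its own).
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<html>\n<head>\n<title></title>\n'
    b'<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />\n'
    b'</head>\n<body>\n&#8364;\n</body>\n</html>'
)


def to_utf8_html(source: bytes) -> bytes:
    """Parse the document and serialise it as UTF-8 bytes."""
    doc = parseString(source)
    # Keep the declared charset in step with the bytes actually written.
    for meta in doc.getElementsByTagName("meta"):
        if meta.getAttribute("http-equiv").lower() == "content-type":
            meta.setAttribute("content", "text/html; charset=utf-8")
    return doc.toxml(encoding="utf-8")


utf8_output = to_utf8_html(SAMPLE)
```

Here the Euro sign is written as a literal UTF-8 character, so this only helps if whatever serves the page can also switch its charset away from ISO-8859-1.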

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
May 22 '06 #2
Dale Strickland-Clark wrote:
from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>

So it encodes the entity reference to € (Euro sign). I need it to
remain as &#8364; so that the resulting HTML can render properly in a
browser. Is there a way to make the parser not convert the entity
references? Or is there a convenient post processing function that
will do the conversion?


First up, when I repeat what you did I don't get the same output. toxml()
without an encoding argument produces a unicode string, and no encoding
attribute in the <?xml ...?> declaration.
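
As a rough illustration of that difference on Python 3 (an assumption; the thread itself predates it), the following sketch calls toxml() with and without an encoding argument on a tiny made-up document:

```python
from xml.dom.minidom import parseString

# Tiny illustrative document, not the one from the original post.
doc = parseString(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><body>&#8364;</body>'
)

# No encoding argument: a text string, and no encoding in the declaration.
as_text = doc.toxml()

# With an encoding argument: bytes, and the declaration names the encoding.
as_bytes = doc.toxml(encoding="utf-8")

print(type(as_text).__name__, type(as_bytes).__name__)
```

In both cases the character reference from the input has already been turned into a real Euro character by the parser.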

toxml() only takes a single encoding argument, so unfortunately there isn't
any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):
from xml.dom.minidom import parseString
strHTML = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>'''
print parseString(strHTML).toxml().encode('ascii', 'xmlcharrefreplace')

<?xml version="1.0" ?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
&#8364;
</body>
</html>


You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.
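
For readers on Python 3, a hedged sketch of the same idea: toxml() returns a str there, and str.encode() accepts the 'xmlcharrefreplace' error handler, so any character the target charset cannot represent is written as a numeric character reference. The function name serialise_with_charrefs and the sample input are illustrative assumptions, not part of the original thread.

```python
from xml.dom.minidom import parseString

# Hypothetical sample input modelled on the document in the question.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<html>\n<head>\n<title></title>\n'
    b'<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />\n'
    b'</head>\n<body>\n&#8364;\n</body>\n</html>'
)


def serialise_with_charrefs(source: bytes, charset: str = "ascii") -> bytes:
    """Serialise the parsed document, escaping what `charset` cannot hold."""
    text = parseString(source).toxml()  # str on Python 3
    # Unencodable characters become &#NNNN; instead of raising an error.
    return text.encode(charset, "xmlcharrefreplace")


ascii_output = serialise_with_charrefs(SAMPLE)
latin1_output = serialise_with_charrefs(SAMPLE, "iso-8859-1")
print(ascii_output.decode("ascii"))
```

Passing "iso-8859-1" instead of "ascii" keeps characters such as é as single Latin-1 bytes while the Euro sign still comes out as &#8364;. The XML declaration carries no encoding in that case, so the charset has to be declared by the meta tag or the HTTP header.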
May 22 '06 #3
Thanks, Duncan. That did the trick.

If you're EuroPythoning, I'll buy you a drink.

Cheers.
Duncan Booth wrote:
First up, when I repeat what you did I don't get the same output. toxml()
without an encoding argument produces a unicode string, and no encoding
attribute in the <?xml ...?> declaration.

toxml() only takes a single encoding argument, so unfortunately there
isn't any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):
You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.


--
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk

May 23 '06 #4
