XML/HTML Encoding problem

Dale Strickland-Clark

A colleague has asked me this and I don't know the answer. Can anyone here
help with this? Thanks in advance.

Here is his email:

I am trying to parse an HTML document using the xml.dom.minidom parser and
then output a valid HTML document, all using the ISO-8859-1 charset.
For example:

My input:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Desired output:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Note that it doesn't matter if the '<?xml version="1.0"
encoding="ISO-8859-1"?>' header gets stripped. What does matter is that the
input document has the 'ISO-8859-1' charset and is an ANSI encoded file.

The problem I get is that when I run, for example:

from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
â‚¬
</body>
</html>

So it encodes the entity reference to â‚¬ (Euro sign). I need it to remain as
&#8364; so that the resulting HTML can render properly in a browser. Is
there a way to make the parser not convert the entity references? Or is
there a convenient post processing function that will do the conversion?
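
A likely explanation for the â‚¬ sequence, offered as an assumption rather than something the email confirms: the parser turned the entity reference into a real euro character, the output was written as UTF-8, and the result was then viewed as Windows-1252 ("ANSI"). The short Python 3 sketch below reproduces that misreading; the variable names are illustrative only.

```python
# Assumption: the output bytes were UTF-8 but were displayed as Windows-1252.
euro = "\u20ac"                        # the euro sign as a single character

utf8_bytes = euro.encode("utf-8")      # three bytes: E2 82 AC
misread = utf8_bytes.decode("cp1252")  # each byte shown as its own character

print(utf8_bytes)
print(misread)
```

If this is what happened, the data itself is intact and only the declared and actual encodings of the output disagree.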

--
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk

May 22 '06 #1

Sybren Stuvel

Dale Strickland-Clark enlightened us with:

So it encodes the entity reference to â‚¬ (Euro sign). I need it to
remain as &#8364; so that the resulting HTML can render properly in
a browser.

If you want proper display, why not use UTF-8?
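
A minimal sketch of the UTF-8 route, assuming Python 3 and a cut-down version of the sample document (the `source` bytes here are illustrative, not the original file). Passing an encoding to toxml() returns bytes and writes that encoding into the XML declaration; note that the meta charset inside the document would also have to be changed to match, which this sketch does not do.

```python
from xml.dom.minidom import parseString

# Hypothetical, shortened input declared as ISO-8859-1 with a numeric euro reference.
source = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<html><body>&#8364;</body></html>'
)

doc = parseString(source)

# Serialize as UTF-8: the euro sign is written as its three UTF-8 bytes.
utf8_bytes = doc.toxml(encoding="utf-8")
print(utf8_bytes)
```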

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa

May 22 '06 #2

Duncan Booth

Dale Strickland-Clark wrote:

from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
â‚¬
</body>
</html>

So it encodes the entity reference to â‚¬ (Euro sign). I need it to
remain as &#8364; so that the resulting HTML can render properly in a
browser. Is there a way to make the parser not convert the entity
references? Or is there a convenient post processing function that
will do the conversion?

First up, when I repeat what you did I don't get the same output. toxml()
without an encoding argument produces a unicode string, and no encoding
attribute in the <?xml ...?> declaration.

toxml() only takes a single encoding argument, so unfortunately there isn't
any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):

from xml.dom.minidom import parseString
strHTML = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>'''
print parseString(strHTML).toxml().encode('ascii', 'xmlcharrefreplace')

Output:

<?xml version="1.0" ?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
&#8364;
</body>
</html>

You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.
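
For readers on Python 3, the same idea can be sketched as follows. This is an adaptation under stated assumptions, not the code posted above: the helper name serialize_with_charrefs and the shortened sample document are hypothetical, and the input is assumed to carry the euro sign as the numeric reference &#8364;. The 'xmlcharrefreplace' error handler is a standard codec option that replaces any character the target charset cannot represent with a numeric character reference.

```python
from xml.dom.minidom import parseString


def serialize_with_charrefs(xml_bytes, charset="ascii"):
    """Parse XML bytes and re-serialize them in `charset`, writing any
    character the charset cannot represent as a numeric reference."""
    doc = parseString(xml_bytes)
    text = doc.toxml()  # no encoding argument: returns str in Python 3
    return text.encode(charset, "xmlcharrefreplace")


# Hypothetical, shortened input: ISO-8859-1 bytes with one Latin-1
# character (e-acute, byte E9) and one numeric reference for the euro sign.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<html><body>&#8364; caf\xe9</body></html>'
)

ascii_out = serialize_with_charrefs(sample)
latin1_out = serialize_with_charrefs(sample, "iso-8859-1")

print(ascii_out.decode("ascii"))
```

Encoding to 'iso-8859-1' instead of 'ascii' keeps Latin-1 characters as single bytes and escapes only what ISO-8859-1 lacks, such as the euro sign. In that case the declaration written by toxml() carries no encoding attribute, so the charset would need to be stated some other way, for example by the meta tag in the HTML.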

May 22 '06 #3

Dale Strickland-Clark

Thanks, Duncan. That did the trick.

If you're EuroPythoning, I'll buy you a drink.

Cheers.

Duncan Booth wrote:

First up, when I repeat what you did I don't get the same output. toxml()
without an encoding argument produces a unicode string, and no encoding
attribute in the <?xml ...?>

toxml() only takes a single encoding argument, so unfortunately there
isn't any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):

You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.

--
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk

May 23 '06 #4
