Encoding during ElementTree serialization

Chris McDonough

ElementTree's XML serialization routine, tree._write(file, node,
encoding, namespaces), looks like this (elided):

    def _write(self, file, node, encoding, namespaces):
        # write XML to file
        tag = node.tag
        if tag is Comment:
            file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
        elif tag is ProcessingInstruction:
            file.write("<?%s?>" % _escape_cdata(node.text, encoding))
        else:
            ....
            file.write("<" + _encode(tag, encoding))
            if items or xmlns_items:
                items.sort() # lexical order

Note that "_escape_cdata" (which also performs encoding) and "_encode"
are called only for tag names, pcdata, and attribute values, not for
markup literals like "<" and "<?%s?>".

Some profiling I've done suggests that encoding during recursion makes
serialization slightly slower than it would be if we could get away with
not encoding any pcdata or attribute values during recursion.
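
A rough way to compare the two strategies outside ElementTree is sketched below. It is written for Python 3 (the code in this post is Python 2), and write_encoding_each and write_encoding_once are hypothetical stand-ins rather than ElementTree functions; they only mimic "encode every string as it is written" versus "encode once at the end" on a flat list of text chunks, so any timings are indicative at best.

```python
import io
import timeit

def write_encoding_each(chunks, encoding):
    # Encode every chunk as it is written (roughly the current approach).
    out = io.BytesIO()
    for chunk in chunks:
        out.write(chunk.encode(encoding))
    return out.getvalue()

def write_encoding_once(chunks, encoding):
    # Collect plain text, then encode the whole document in one call.
    out = io.StringIO()
    for chunk in chunks:
        out.write(chunk)
    return out.getvalue().encode(encoding)

# Made-up sample data: markup and text chunks of a flat document.
chunks = ["<item>", "caf\u00e9 &amp; bar", "</item>"] * 10000

# For utf-8 both strategies should produce identical bytes.
same = write_encoding_each(chunks, "utf-8") == write_encoding_once(chunks, "utf-8")

if __name__ == "__main__":
    for func in (write_encoding_each, write_encoding_once):
        seconds = timeit.timeit(lambda: func(chunks, "utf-8"), number=20)
        print(func.__name__, round(seconds, 3))
```

The numbers depend heavily on chunk size and interpreter version, so this is a starting point for measurement, not evidence either way.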

Instead, we might be able to get away with encoding everything just once
at the end. But I don't know if this is kosher. Is there any reason not
to also encode tag literals, and the quotation marks that delimit
attribute values, just once at the end of serialization?

Even if that's not acceptable in general because tag literals cannot be
encoded, would it be acceptable for "ascii-compatible" encodings like
utf-8, latin-1, and friends?

Something like:

    def _escape_cdata(text, encoding=None, replace=string.replace):
        # doesn't do any encoding
        text = replace(text, "&", "&amp;")
        text = replace(text, "<", "&lt;")
        text = replace(text, ">", "&gt;")
        return text

    class _ElementInterface:

        ...

        def write(self, file, encoding="us-ascii"):
            assert self._root is not None
            if not hasattr(file, "write"):
                file = open(file, "wb")
            if not encoding:
                encoding = "us-ascii"
            elif encoding != "utf-8" and encoding != "us-ascii":
                file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
            # serialize to an in-memory buffer, then encode once at the end
            # (assumes "from StringIO import StringIO" at module level)
            tmp = StringIO()
            self._write(tmp, self._root, encoding, {})
            file.write(tmp.getvalue().encode(encoding))

        def _write(self, file, node, encoding, namespaces):
            # write XML to file
            tag = node.tag
            if tag is Comment:
                file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
            elif tag is ProcessingInstruction:
                file.write("<?%s?>" % _escape_cdata(node.text, encoding))
            else:
                items = node.items()
                xmlns_items = [] # new namespaces in this scope
                try:
                    if isinstance(tag, QName) or tag[:1] == "{":
                        tag, xmlns = fixtag(tag, namespaces)
                        if xmlns: xmlns_items.append(xmlns)
                except TypeError:
                    _raise_serialization_error(tag)
                file.write("<" + tag)

I smell the mention of a Byte Order Mark coming on. ;-)
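
The byte order mark worry, and the "ascii-compatible" question, can be checked with a small sketch. It is Python 3 rather than the Python 2 used above, and encode_each and encode_once are hypothetical helpers, not part of ElementTree. The last check uses Python's built-in "xmlcharrefreplace" error handler as one possible fallback for characters the target encoding cannot represent; whether that matches what ElementTree's per-string encoding does is an assumption to verify against its source.

```python
def encode_each(pieces, encoding, errors="strict"):
    # Encode each piece separately, then join the resulting bytes.
    return b"".join(p.encode(encoding, errors) for p in pieces)

def encode_once(pieces, encoding, errors="strict"):
    # Join the text first, then encode a single time.
    return "".join(pieces).encode(encoding, errors)

# Made-up sample: two markup literals around one piece of text.
pieces = ["<root>", "caf\u00e9", "</root>"]

# ASCII-compatible encodings: piecewise and one-shot encoding agree.
utf8_same = encode_each(pieces, "utf-8") == encode_once(pieces, "utf-8")
latin1_same = encode_each(pieces, "latin-1") == encode_once(pieces, "latin-1")

# The "utf-16" codec writes a byte order mark on every encode() call,
# so piecewise encoding yields one mark per piece instead of one overall.
utf16_same = encode_each(pieces, "utf-16") == encode_once(pieces, "utf-16")

# Characters outside the target encoding need a fallback; this handler
# replaces them with numeric character references.
ascii_bytes = encode_once(pieces, "us-ascii", "xmlcharrefreplace")

if __name__ == "__main__":
    print(utf8_same, latin1_same, utf16_same)
    print(ascii_bytes)
```

Under these assumptions, encoding once at the end looks equivalent for utf-8 and latin-1, and for utf-16 it is the piecewise approach that misbehaves, which argues for a single final encode rather than against it.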

Feb 8 '06 #1
