
Splitting a DOM

Hello,

I would like to handle an XML file structured as follows:
<ROOT>
<STEP>
....
</STEP>
<STEP>
....
</STEP>
....
</ROOT>

From this file, I want to build an XML file for each STEP block.

Currently I'm doing something like:

from xml.dom.ext.reader import Sax2
from xml.dom.ext import PrettyPrint

reader = Sax2.Reader()
my_dom = reader.fromUri('steps.xml')
steps = my_dom.getElementsByTagName('STEP')

i = 0
for step in steps:
    tmp = file('step%s.xml' % i, 'w')
    tmp.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')
    PrettyPrint(step, tmp, encoding='ISO-8859-1')
    tmp.close()
    i += 1

But I'm pretty sure that there's a better way to split the DOM ?

Thanks for any suggestion provided.

Brice
Jul 18 '05 #1
4 Replies


[Brice Vissi?re]
But I'm pretty sure that there's a better way to split the DOM ?


There are *lots* of ways to solve this one. The "best" solution depends
on which criteria you choose.

SAX is probably the most efficient parser-based approach in both time
and memory. That said, the problem is so simple that a purely textual
solution might work well, and would definitely be faster.
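
As a rough illustration of the "simple textual solution" mentioned above, here is a minimal Python 3 sketch that uses a regular expression. The names `split_steps_textually`, `write_steps` and `name_pattern` are made up for this example. It assumes STEP elements are never nested, are never written as self-closing tags, and that the text `<STEP` never appears inside comments or CDATA sections; a regex cannot cope with those cases.

```python
import re

# Matches one <STEP ...>...</STEP> block. The lookahead keeps tags such as
# <STEPS> from matching. Assumes no nesting and no self-closing <STEP/>.
STEP_RE = re.compile(r"<STEP(?=[\s>]).*?</STEP\s*>", re.DOTALL)

def split_steps_textually(xml_text):
    """Return the raw text of each <STEP>...</STEP> block, in order."""
    return STEP_RE.findall(xml_text)

def write_steps(xml_text, name_pattern="step%04d.xml"):
    """Write each STEP block to its own file and return the file names."""
    names = []
    for i, block in enumerate(split_steps_textually(xml_text)):
        name = name_pattern % i
        # Characters outside ISO-8859-1 are written as numeric references.
        with open(name, "w", encoding="ISO-8859-1",
                  errors="xmlcharrefreplace") as out:
            out.write('<?xml version="1.0" encoding="ISO-8859-1"?>\n')
            out.write(block)
        names.append(name)
    return names
```

Because the blocks are copied verbatim, attributes, whitespace and entity references come through unchanged, which is part of why this route is fast.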

Here's a bit of SAX code adapted from another SAX example I posted
earlier today. Note that this will not work properly if you have
<STEP> elements nested inside one another. In that case, you'd have to
maintain a stack of the output files: push the outfile onto the stack
in "startElement()" and pop it off in "endElement()".

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
import xml.sax.handler
from xml.sax.saxutils import escape, quoteattr

split_on_elems = ['STEP']

class splitter(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outfile = None
        self.seq_no = self.seq_no_gen()

    def seq_no_gen(self, n=0):
        # Endless counter used to number the output files.
        while True:
            yield n
            n = n + 1

    def startElement(self, elemname, attrs):
        if elemname in split_on_elems:
            self.outfile = open('step%04d.xml' % next(self.seq_no), 'wt')
        if self.outfile:
            attrstr = ""
            for a in attrs.keys():
                attrstr = "%s%s" % (attrstr, " %s=%s" % (a, quoteattr(attrs[a])))
            self.outfile.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if self.outfile:
            self.outfile.write('</%s>' % elemname)
        if elemname in split_on_elems:
            self.outfile.close()
            self.outfile = None

    def characters(self, s):
        # Escape &, < and > so the output stays well-formed.
        if self.outfile:
            self.outfile.write(escape(s))

testdoc = """
<ROOT>
<STEP a="b" c="d">Step 0</STEP>
<STEP>Step 1</STEP>
<STEP>Step 2</STEP>
<STEP>Step 3</STEP>
<STEP>Step 4</STEP>
</ROOT>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    PFJ = splitter()
    parser.setContentHandler(PFJ)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    parser.close()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
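
For the nested case mentioned above, the stack idea could look roughly like the following Python 3 sketch. It is one possible reading of the suggestion, not the original author's code: the class name `NestedSplitter` and the `open_output` callable are invented for this example, and the sketch deliberately writes each event to every open output, so an outer STEP's file still contains the STEP elements nested inside it.

```python
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr

class NestedSplitter(ContentHandler):
    """Write each STEP element, including nested ones, to its own output."""

    def __init__(self, open_output, split_on=("STEP",)):
        ContentHandler.__init__(self)
        # open_output is a hypothetical callable: sequence number -> object
        # with write() and close(), e.g. a file opened for writing.
        self.open_output = open_output
        self.split_on = split_on
        self.stack = []   # outputs for the STEP elements currently open
        self.count = 0

    def _emit(self, text):
        # Every open STEP receives the markup, innermost and outermost alike.
        for out in self.stack:
            out.write(text)

    def startElement(self, name, attrs):
        if name in self.split_on:
            # Push a new output when a STEP starts.
            self.stack.append(self.open_output(self.count))
            self.count += 1
        attr_text = "".join(" %s=%s" % (key, quoteattr(value))
                            for key, value in attrs.items())
        self._emit("<%s%s>" % (name, attr_text))

    def endElement(self, name):
        self._emit("</%s>" % name)
        if name in self.split_on:
            # Pop and close the output that belongs to this STEP.
            self.stack.pop().close()

    def characters(self, content):
        self._emit(escape(content))
```

A call such as `xml.sax.parse('steps.xml', NestedSplitter(lambda n: open('step%04d.xml' % n, 'w')))` would then number the files in the order the STEP start tags are encountered.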

HTH,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #2

br************@costes-gestion.net (Brice Vissi?re) wrote in message news:<fa**************************@posting.google.com>...
But I'm pretty sure that there's a better way to split the DOM ?


Here's an Anobind recipe:

--- % ---

#Boilerplate set-up

import anobind
from Ft.Xml import InputSource
from Ft.Lib import Uri

#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
#Create a URI from a filename the right way
file_uri = Uri.OsPathToUri('steps.xml', attemptAbsolute=1)
isrc = isrc_factory.fromUri(file_uri)

#Now bind from the XML given in the input source
binder = anobind.binder()
binding = binder.read_xml(isrc)

#File splitting task
import tempfile

#The direct approach
i = 0
for folder in binding.xbel.folder:
    fout = open('step%s.xml' % i, 'w')
    folder.unbind(fout)
    fout.close()
    i += 1

--- % ---

To use XPath, replace the line

for folder in binding.xbel.folder:

with

for folder in binding.xpath_query(u'xbel/folder'):

Anobind: http://uche.ogbuji.net/tech/4Suite/anobind/

--Uche
http://uche.ogbuji.net
Jul 18 '05 #3

br************@costes-gestion.net (Brice Vissi?re) wrote in message news:<fa**************************@posting.google.com>...
But I'm pretty sure that there's a better way to split the DOM ?

Here's an Anobind recipe:

--- % ---

#Boilerplate set-up

import anobind
from Ft.Xml import InputSource
from Ft.Lib import Uri

#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
#Create a URI from a filename the right way
file_uri = Uri.OsPathToUri('steps.xml', attemptAbsolute=1)
isrc = isrc_factory.fromUri(file_uri)

#Now bind from the XML given in the input source
binder = anobind.binder()
binding = binder.read_xml(isrc)

#File splitting task
import tempfile

#The direct approach
i = 0
for folder in binding.ROOT.STEP:
    fout = open('step%s.xml' % i, 'w')
    folder.unbind(fout)
    fout.close()
    i += 1

--- % ---

To use XPath, replace the line

for folder in binding.ROOT.STEP:

with

for folder in binding.xpath_query(u'ROOT/STEP'):

Anobind: http://uche.ogbuji.net/tech/4Suite/anobind/

--Uche
http://uche.ogbuji.net
Jul 18 '05 #4

br************@costes-gestion.net (Brice Vissi?re) wrote in message news:<fa**************************@posting.google.com>...
But I'm pretty sure that there's a better way to split the DOM ?


I already gave an Anobind recipe for this one, but I also wanted to
post a few notes on your chosen approach:

1) "from xml.dom.ext.reader import Sax2" means you're using 4DOM.
4DOM is very slow. If you find this is a problem, use minidom. My
Anobind recipe used cDomlette, which is *very* fast, faster even
than minidom, but it requires installing 3rd party software.

2) "steps = my_dom.getElementsByTagName('STEP')". This could give
unexpected results if you have nested STEP elements, because it
returns every STEP descendant, not just the top-level ones.
You might want to use a list comprehension such as

steps = [ step for step in my_dom.documentElement.childNodes
          if step.nodeName == u"STEP" ]
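
Putting the two notes together, a minidom version might look something like this Python 3 sketch. The function names `direct_steps` and `split_with_minidom` and the `name_pattern` parameter are made up for illustration, and the ISO-8859-1 output encoding is simply carried over from the original question.

```python
from xml.dom import minidom

def direct_steps(document, tag="STEP"):
    """Return only the element children of the root that are named `tag`."""
    return [node for node in document.documentElement.childNodes
            if node.nodeType == node.ELEMENT_NODE and node.nodeName == tag]

def split_with_minidom(xml_text, name_pattern="step%04d.xml"):
    """Write each top-level STEP to its own file and return the file names."""
    document = minidom.parseString(xml_text)
    names = []
    for i, step in enumerate(direct_steps(document)):
        name = name_pattern % i
        # Characters outside ISO-8859-1 are written as numeric references.
        with open(name, "w", encoding="ISO-8859-1",
                  errors="xmlcharrefreplace") as out:
            out.write('<?xml version="1.0" encoding="ISO-8859-1"?>\n')
            out.write(step.toxml())
        names.append(name)
    return names
```

Unlike getElementsByTagName, this only looks at the root's immediate children, so a STEP nested inside another STEP stays inside its parent's file instead of getting a file of its own.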

Good luck.

--Uche
http://uche.ogbuji.net
Jul 18 '05 #5
