
aligning ElementTrees to text

I'm trying to align an XML file with the original text file from which
it was created. Unfortunately, the XML version of the file has added and
removed some of the whitespace. For example::
    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = ''' <s>Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING"> said </EVENT>shareholders
    ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT>its
    ... <EVENT eid="e8" class="OCCURRENCE"> acquis ition </EVENT>.
    ... </s>
    ... '''

I want to determine which offsets in the *original* text each element
from the XML text is supposed to cover. So I want something like::
    >>> xml_tree = etree.fromstring(xml_text)
    >>> align(xml_tree, plain_text)
    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text
along with their start and end indices in the original text.
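
To make the expected spans concrete, here is a small sketch that slices the original text with the offsets from the output above. The ``spans`` list is just those numbers copied by hand, not something ``align`` produced here.

```python
plain_text = (
    "\nPacific First Financial Corp. said shareholders approved its\n"
    "acquisition.\n"
)

# (start, end) pairs copied from the expected output; end is exclusive.
spans = [(31, 35), (49, 57), (62, 73), (1, 74)]

# Slice the original text to see what each element covers.
covered = [plain_text[start:end] for start, end in spans]
for piece in covered:
    print(repr(piece))
# 'said'
# 'approved'
# 'acquisition'
# 'Pacific First Financial Corp. said shareholders approved its\nacquisition.'
```

Note that the ``s`` span keeps the newline inside the sentence but not the whitespace at either edge.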

Note that I want to ignore whitespace as much as possible, so the
elements are aligned only to the non-whitespace text they include.
Below is my current implementation of the ``align`` function. It seems
pretty messy to me -- can anyone offer me some advice on how to clean it
up or write it differently?

def align(tree, text):

    def align_helper(elem, elem_start):
        # skip whitespace in the text before the element
        while text[elem_start:elem_start + 1].isspace():
            elem_start += 1

        # advance the element end past any element text
        elem_end = elem_start
        if elem.text is not None:
            for char in elem.text:
                if not char.isspace():
                    while text[elem_end:elem_end + 1].isspace():
                        elem_end += 1
                    assert text[elem_end] == char
                    elem_end += 1

        # advance the element end past any child elements
        for child_elem in elem:
            elem_end = align_helper(child_elem, elem_end)

        # advance the start for the next element past the tail text
        next_start = elem_end
        if elem.tail is not None:
            for char in elem.tail:
                if not char.isspace():
                    while text[next_start:next_start + 1].isspace():
                        next_start += 1
                    assert text[next_start] == char
                    next_start += 1

        # add the element and its start and end to the result list
        result.append((elem, elem_start, elem_end))

        # return the start of the next element
        return next_start

    result = []
    align_helper(tree, 0)
    return result
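
One possible way to write it differently, offered as a sketch rather than a definitive answer: record the offset of every non-whitespace character in the original text once, then walk the tree counting non-whitespace characters only. Each element's span then falls out of two list lookups. The name ``align_by_offsets`` and the up-front ``ValueError`` check are assumptions made for this example, and ``itertext`` needs a newer ElementTree than the one that was current when this was posted.

```python
import xml.etree.ElementTree as etree


def align_by_offsets(tree, text):
    # Offsets in `text` of every non-whitespace character, in order.
    offsets = [i for i, c in enumerate(text) if not c.isspace()]

    # Check once, up front, that both versions agree apart from whitespace.
    xml_chars = [c for c in ''.join(tree.itertext()) if not c.isspace()]
    if xml_chars != [text[i] for i in offsets]:
        raise ValueError('XML and text differ in more than whitespace')

    def count(s):
        # Number of non-whitespace characters in s (None counts as empty).
        return sum(1 for c in (s or '') if not c.isspace())

    result = []

    def walk(elem, n):
        # n is the number of non-whitespace characters seen before elem.
        first = n
        n += count(elem.text)
        for child in elem:
            n = walk(child, n)
        if n > first:
            start, end = offsets[first], offsets[n - 1] + 1
        else:
            # Element with no visible text: zero-width span at the next
            # non-whitespace character (or at the end of the text).
            start = end = offsets[first] if first < len(offsets) else len(text)
        result.append((elem, start, end))
        # The tail belongs to the parent, so it only advances the counter.
        return n + count(elem.tail)

    walk(tree, 0)
    return result
```

On the example above this should give the same spans as ``align``. The trade-off is that a mismatch is reported once for the whole document rather than at the exact character where the two texts diverge.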

Jan 17 '07 #1