aligning ElementTrees to text

I'm trying to align an XML file with the original text file from which
it was created. Unfortunately, the XML version of the file has added and
removed some of the whitespace. For example::
    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = ''' <s>Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING"> said </EVENT>shareholders
    ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT>its
    ... <EVENT eid="e8" class="OCCURRENCE"> acquis ition </EVENT>.
    ... </s>
    ... '''

I want to determine which offsets in the *original* text each element
from the XML text is supposed to cover. So I want something like::
    >>> import xml.etree.ElementTree as etree
    >>> xml_tree = etree.fromstring(xml_text)
    >>> align(xml_tree, plain_text)
    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text
along with their start and end indices in the original text::
    >>> plain_text[31:35]
    'said'
    >>> plain_text[49:57]
    'approved'
    >>> plain_text[62:73]
    'acquisition'
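
To make that goal concrete, here is a minimal sketch of a check that a reported span covers an element's text once whitespace is ignored. The helper names `squeeze` and `span_matches` are hypothetical, and `Element.itertext()` assumes a reasonably recent `xml.etree.ElementTree` (Python 2.7 or 3.x):

```python
import xml.etree.ElementTree as etree

def squeeze(s):
    """Return s with every whitespace character removed."""
    return "".join(ch for ch in s if not ch.isspace())

def span_matches(elem, text, start, end):
    """Check that text[start:end] equals elem's text content, ignoring whitespace."""
    elem_text = "".join(elem.itertext())  # text of elem and all its descendants
    return squeeze(text[start:end]) == squeeze(elem_text)

plain_text = "\nPacific First Financial Corp. said shareholders approved its\nacquisition.\n"
event = etree.fromstring('<EVENT eid="e8" class="OCCURRENCE"> acquis ition </EVENT>')
print(span_matches(event, plain_text, 62, 73))  # True
```

A check like this could be run over every tuple that ``align`` returns as a sanity test.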

Note that I want to ignore whitespace as much as possible, so the
elements are aligned only to the non-whitespace text they include.
Below is my current implementation of the ``align`` function. It seems
pretty messy to me -- can anyone offer me some advice on how to clean it
up or write it differently?

    def align(tree, text):

        def align_helper(elem, elem_start):
            # skip whitespace in the text before the element
            while text[elem_start:elem_start + 1].isspace():
                elem_start += 1

            # advance the element end past any element text
            elem_end = elem_start
            if elem.text is not None:
                for char in elem.text:
                    if not char.isspace():
                        while text[elem_end:elem_end + 1].isspace():
                            elem_end += 1
                        assert text[elem_end] == char
                        elem_end += 1

            # advance the element end past any child elements
            for child_elem in elem:
                elem_end = align_helper(child_elem, elem_end)

            # advance the start for the next element past the tail text
            next_start = elem_end
            if elem.tail is not None:
                for char in elem.tail:
                    if not char.isspace():
                        while text[next_start:next_start + 1].isspace():
                            next_start += 1
                        assert text[next_start] == char
                        next_start += 1

            # add the element and its start and end to the result list
            result.append((elem, elem_start, elem_end))

            # return the start of the next element
            return next_start

        result = []
        align_helper(tree, 0)
        return result

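
One possible simplification, offered as a sketch and not as a drop-in replacement: since only non-whitespace characters matter, the offset of every non-whitespace character in the text can be computed once, and the tree walk then only has to count non-whitespace characters. The name `align_by_offsets` and its inner helpers are hypothetical. Unlike the version above, it compares the characters in a single check up front (and so also requires that the text contain nothing beyond what the tree covers), and it relies on `Element.itertext()`, which assumes Python 2.7 or 3.x.

```python
import xml.etree.ElementTree as etree

def align_by_offsets(tree, text):
    # offsets[i] is the index in text of the i-th non-whitespace character
    offsets = [i for i, ch in enumerate(text) if not ch.isspace()]

    def squeeze(s):
        # drop all whitespace; None (missing text or tail) counts as empty
        return "".join(ch for ch in (s or "") if not ch.isspace())

    # sanity check: tree and text must agree once whitespace is ignored
    if squeeze("".join(tree.itertext())) != squeeze(text):
        raise ValueError("tree and text differ in non-whitespace characters")

    result = []

    def visit(elem, pos):
        # pos counts the non-whitespace characters consumed before elem
        first = pos
        pos += len(squeeze(elem.text))
        for child in elem:
            pos = visit(child, pos)
            pos += len(squeeze(child.tail))
        # map character counts back to offsets in the original text;
        # an element with no non-whitespace content gets an empty span
        start = offsets[first] if first < len(offsets) else len(text)
        end = offsets[pos - 1] + 1 if pos > first else start
        result.append((elem, start, end))
        return pos

    visit(tree, 0)
    return result
```

On the example above this produces the same four spans, in the same order, as the original ``align``.
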
Thanks,

STeVe
Jan 17 '07 #1