it was created. Unfortunately, the XML version of the file has added and
removed some of the whitespace. For example::

    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = ''' <s>Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING" > said </EVENT>shareholders
    ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT>its
    ... <EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
    ... </s>
    ... '''

I want to determine which offsets in the *original* text each element
from the XML text is supposed to cover. So I want something like::

    >>> xml_tree = etree.fromstring(xml_text)
    >>> align(xml_tree, plain_text)
    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text
along with their start and end indices in the original text::

    >>> plain_text[31:35]
    'said'
    >>> plain_text[49:57]
    'approved'
    >>> plain_text[62:73]
    'acquisition'

Note that I want to ignore whitespace as much as possible, so the
elements are aligned only to the non-whitespace text they include.
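
To make the whitespace rule concrete, here is a minimal sketch. It assumes ``etree`` is ``xml.etree.ElementTree``; the ``nonspace`` helper and the ``offsets`` table are illustrative names only and are not part of the ``align`` implementation. It checks that the two texts agree once whitespace is dropped, and shows how the k-th non-whitespace character maps back to an offset in the original text.

```python
import xml.etree.ElementTree as etree  # assumption: this is the ``etree`` module used above

plain_text = '''
Pacific First Financial Corp. said shareholders approved its
acquisition.
'''
xml_text = ''' <s>Pacific First Financial Corp.
<EVENT eid="e1" class="REPORTING" > said </EVENT>shareholders
<EVENT eid="e2" class="OCCURRENCE" >approved</EVENT>its
<EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
</s>
'''


def nonspace(s):
    """Return s with every whitespace character removed (hypothetical helper)."""
    return "".join(ch for ch in s if not ch.isspace())


xml_tree = etree.fromstring(xml_text)

# itertext() yields the text and tail strings inside the root, in document order.
xml_chars = nonspace("".join(xml_tree.itertext()))
plain_chars = nonspace(plain_text)

# Ignoring whitespace, the XML and the original text carry the same characters.
assert xml_chars == plain_chars

# offsets[k] is the index in plain_text of the k-th non-whitespace character.
offsets = [i for i, ch in enumerate(plain_text) if not ch.isspace()]

# "Pacific First Financial Corp." has 26 non-whitespace characters, so the
# first EVENT covers non-whitespace characters 26 through 29.
first_event = plain_text[offsets[26]:offsets[29] + 1]
print(first_event)
```

The alignment only works at all because of the equality checked by the ``assert``; if the XML had changed any non-whitespace characters, no set of offsets would line up.
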
Below is my current implementation of the ``align`` function. It seems
pretty messy to me -- can anyone offer me some advice on how to clean it
up or write it differently?

    def align(tree, text):

        def align_helper(elem, elem_start):
            # skip whitespace in the text before the element
            while text[elem_start:elem_start + 1].isspace():
                elem_start += 1

            # advance the element end past any element text
            elem_end = elem_start
            if elem.text is not None:
                for char in elem.text:
                    if not char.isspace():
                        while text[elem_end:elem_end + 1].isspace():
                            elem_end += 1
                        assert text[elem_end] == char
                        elem_end += 1

            # advance the element end past any child elements
            for child_elem in elem:
                elem_end = align_helper(child_elem, elem_end)

            # advance the start for the next element past the tail text
            next_start = elem_end
            if elem.tail is not None:
                for char in elem.tail:
                    if not char.isspace():
                        while text[next_start:next_start + 1].isspace():
                            next_start += 1
                        assert text[next_start] == char
                        next_start += 1

            # add the element and its start and end to the result list
            result.append((elem, elem_start, elem_end))

            # return the start of the next element
            return next_start

        result = []
        align_helper(tree, 0)
        return result

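
For comparison, one possible restructuring is sketched below. It is only a sketch under stated assumptions, not a drop-in replacement: ``align_by_offsets`` and ``count_nonspace`` are hypothetical names, ``etree`` is assumed to be ``xml.etree.ElementTree``, and unlike ``align`` it does not assert character by character that the XML text matches the original. The idea is to precompute the offset of every non-whitespace character once, so that the tree walk only has to count characters.

```python
import xml.etree.ElementTree as etree  # assumption: this is the ``etree`` module used above


def count_nonspace(s):
    """Number of non-whitespace characters in s; None counts as zero."""
    return sum(1 for ch in (s or "") if not ch.isspace())


def align_by_offsets(tree, text):
    """Sketch: return (elem, start, end) triples in the same order as align."""
    # offsets[k] is the index in text of the k-th non-whitespace character
    offsets = [i for i, ch in enumerate(text) if not ch.isspace()]
    result = []

    def visit(elem, pos):
        # pos is the number of non-whitespace characters consumed so far
        start = pos
        pos += count_nonspace(elem.text)
        for child in elem:
            pos = visit(child, pos)
        if pos > start:
            span = (offsets[start], offsets[pos - 1] + 1)
        else:
            # element with no non-whitespace content: empty span at the
            # next non-whitespace character (or at the end of the text)
            at = offsets[start] if start < len(offsets) else len(text)
            span = (at, at)
        result.append((elem, span[0], span[1]))
        # the tail belongs to the parent, so it only advances the counter
        return pos + count_nonspace(elem.tail)

    visit(tree, 0)
    return result


plain_text = '''
Pacific First Financial Corp. said shareholders approved its
acquisition.
'''
xml_text = ''' <s>Pacific First Financial Corp.
<EVENT eid="e1" class="REPORTING" > said </EVENT>shareholders
<EVENT eid="e2" class="OCCURRENCE" >approved</EVENT>its
<EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
</s>
'''

spans = [(elem.tag, start, end)
         for elem, start, end in align_by_offsets(etree.fromstring(xml_text), plain_text)]
print(spans)
```

The trade-off is that mismatched text goes unnoticed unless a separate check is added, for example comparing the two texts with all whitespace removed before aligning.
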
Thanks,
STeVe