aligning ElementTrees to text

I'm trying to align an XML file with the original text file from which
it was created. Unfortunately, the XML version of the file has added and
removed some of the whitespace. For example::

    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = ''' <s>Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING">said </EVENT>shareholders
    ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT>its
    ... <EVENT eid="e8" class="OCCURRENCE">acquis ition </EVENT>.
    ... </s>
    ... '''

I want to determine which offsets in the *original* text each element
from the XML text is supposed to cover. So I want something like::

    >>> import xml.etree.ElementTree as etree
    >>> xml_tree = etree.fromstring(xml_text)
    >>> align(xml_tree, plain_text)
    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text
along with their start and end indices in the original text::

    >>> plain_text[31:35]
    'said'
    >>> plain_text[49:57]
    'approved'
    >>> plain_text[62:73]
    'acquisition'
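
The expectation behind these slices can also be checked mechanically. Below is a minimal sketch, assuming Python 3 and the standard ``xml.etree.ElementTree`` module; ``squeeze`` and ``check_alignment`` are hypothetical helper names chosen for illustration. A span is accepted when it neither starts nor ends on whitespace and, once whitespace is removed, matches the element's full text content.

```python
import xml.etree.ElementTree as etree


def squeeze(chunk):
    """Return chunk with every whitespace character removed."""
    return ''.join(char for char in chunk if not char.isspace())


def check_alignment(triples, text):
    """Raise AssertionError if any (elem, start, end) triple is misaligned."""
    for elem, start, end in triples:
        covered = text[start:end]
        # the span itself must not begin or end with whitespace
        assert covered == covered.strip(), (elem.tag, start, end)
        # ignoring whitespace, the span must equal the element's text,
        # including the text of any nested elements
        content = ''.join(elem.itertext())
        assert squeeze(covered) == squeeze(content), (elem.tag, start, end)


# a single element whose XML text carries extra whitespace
elem = etree.fromstring('<EVENT>acquis ition </EVENT>')
check_alignment([(elem, 1, 12)], '\nacquisition.\n')  # passes silently
```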

Note that I want to ignore whitespace as much as possible, so the
elements are aligned only to the non-whitespace text they include.
Below is my current implementation of the ``align`` function. It seems
pretty messy to me -- can anyone offer me some advice on how to clean it
up or write it differently?

    def align(tree, text):

        def align_helper(elem, elem_start):
            # skip whitespace in the text before the element
            while text[elem_start:elem_start + 1].isspace():
                elem_start += 1

            # advance the element end past any element text
            elem_end = elem_start
            if elem.text is not None:
                for char in elem.text:
                    if not char.isspace():
                        while text[elem_end:elem_end + 1].isspace():
                            elem_end += 1
                        assert text[elem_end] == char
                        elem_end += 1

            # advance the element end past any child elements
            for child_elem in elem:
                elem_end = align_helper(child_elem, elem_end)

            # advance the start for the next element past the tail text
            next_start = elem_end
            if elem.tail is not None:
                for char in elem.tail:
                    if not char.isspace():
                        while text[next_start:next_start + 1].isspace():
                            next_start += 1
                        assert text[next_start] == char
                        next_start += 1

            # add the element and its start and end to the result list
            result.append((elem, elem_start, elem_end))

            # return the start of the next element
            return next_start

        result = []
        align_helper(tree, 0)
        return result

Thanks,

STeVe
Jan 17 '07 #1
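
One way the function might be written differently, offered as a sketch rather than a definitive answer: record the offset of every non-whitespace character in the plain text once, then walk the tree counting non-whitespace characters, so that each element's span becomes a pair of lookups in that table. The names ``align_by_offsets``, ``_squeeze`` and ``walk`` are made up for illustration, and the code assumes Python 3 with the standard ``xml.etree.ElementTree`` module. Unlike the version above, it compares the two texts once up front and raises ``ValueError`` on a mismatch instead of asserting character by character.

```python
import xml.etree.ElementTree as etree


def _squeeze(chunk):
    """Return chunk (which may be None) with all whitespace removed."""
    return ''.join(char for char in (chunk or '') if not char.isspace())


def align_by_offsets(tree, text):
    """Return (elem, start, end) triples, children before their parent."""
    # the two texts must agree once whitespace is ignored
    if _squeeze(''.join(tree.itertext())) != _squeeze(text):
        raise ValueError('XML and plain text differ in non-whitespace content')

    # offsets[i] is the index in text of the i-th non-whitespace character
    offsets = [i for i, char in enumerate(text) if not char.isspace()]
    result = []

    def walk(elem, pos):
        # pos is the number of non-whitespace characters seen before elem
        first = pos
        pos += len(_squeeze(elem.text))
        for child in elem:
            pos = walk(child, pos)
        if pos > first:
            start, end = offsets[first], offsets[pos - 1] + 1
        else:
            # no non-whitespace content: empty span at the next character
            start = end = offsets[first] if first < len(offsets) else len(text)
        result.append((elem, start, end))
        # the tail belongs to the parent, so it only advances the counter
        return pos + len(_squeeze(elem.tail))

    walk(tree, 0)
    return result
```

On the example from the post, this returns the spans (31, 35), (49, 57), (62, 73) and (1, 74), in the same order as the original ``align``. The two near-identical loops over ``elem.text`` and ``elem.tail`` collapse into a single counting helper, at the cost of building one list as long as the text.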