aligning ElementTrees to text

I'm trying to align an XML file with the original text file from which
it was created. Unfortunately, the XML version of the file has added and
removed some of the whitespace. For example::

    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = ''' <s>Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING">said</EVENT>shareholders
    ... <EVENT eid="e2" class="OCCURRENCE">approved</EVENT>its
    ... <EVENT eid="e8" class="OCCURRENCE">acquisition</EVENT>.
    ... </s>
    ... '''

I want to determine which offsets in the *original* text each element
from the XML text is supposed to cover. So I want something like::

    >>> xml_tree = etree.fromstring(xml_text)
    >>> align(xml_tree, plain_text)
    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text
along with their start and end indices in the original text::

    >>> plain_text[31:35]
    'said'
    >>> plain_text[49:57]
    'approved'
    >>> plain_text[62:73]
    'acquisition'

Note that I want to ignore whitespace as much as possible, so the
elements are aligned only to the non-whitespace text they include.
Below is my current implementation of the ``align`` function. It seems
pretty messy to me -- can anyone offer me some advice on how to clean it
up or write it differently?

def align(tree, text):

    def align_helper(elem, elem_start):
        # skip whitespace in the text before the element
        while text[elem_start:elem_start + 1].isspace():
            elem_start += 1

        # advance the element end past any element text
        elem_end = elem_start
        if elem.text is not None:
            for char in elem.text:
                if not char.isspace():
                    while text[elem_end:elem_end + 1].isspace():
                        elem_end += 1
                    assert text[elem_end] == char
                    elem_end += 1

        # advance the element end past any child elements
        for child_elem in elem:
            elem_end = align_helper(child_elem, elem_end)

        # advance the start for the next element past the tail text
        next_start = elem_end
        if elem.tail is not None:
            for char in elem.tail:
                if not char.isspace():
                    while text[next_start:next_start + 1].isspace():
                        next_start += 1
                    assert text[next_start] == char
                    next_start += 1

        # add the element and its start and end to the result list
        result.append((elem, elem_start, elem_end))

        # return the start of the next element
        return next_start

    result = []
    align_helper(tree, 0)
    return result
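
One possible way to restructure this, offered only as a sketch and not as a drop-in replacement: record the offsets of the non-whitespace characters of the plain text once, then walk the tree while counting non-whitespace characters. The name ``align_by_offsets`` is hypothetical, and the import assumes the standard library's ``xml.etree.ElementTree``. The per-character asserts are replaced by a single check up front, so a mismatch is reported before any element is aligned.

```python
import xml.etree.ElementTree as etree


def align_by_offsets(tree, text):
    """Sketch: return (element, start, end) triples, children before parents."""
    # Offsets in `text` of every non-whitespace character, in order.
    offsets = [i for i, char in enumerate(text) if not char.isspace()]

    # One check up front instead of an assert per character: the XML's
    # non-whitespace characters must match the start of the plain text's.
    xml_chars = [c for c in ''.join(tree.itertext()) if not c.isspace()]
    assert xml_chars == [text[i] for i in offsets[:len(xml_chars)]]

    def count(s):
        # Number of non-whitespace characters in a string that may be None.
        return sum(not char.isspace() for char in s or '')

    result = []

    def helper(elem, pos):
        # `pos` is an index into `offsets`: how many non-whitespace
        # characters have been consumed before this element.
        first = pos
        pos += count(elem.text)
        for child in elem:
            pos = helper(child, pos)
        # An element with no non-whitespace content gets an empty span
        # located at the next non-whitespace character (or the text's end).
        start = offsets[first] if first < len(offsets) else len(text)
        end = offsets[pos - 1] + 1 if pos > first else start
        result.append((elem, start, end))
        # Tail text belongs to the parent, so it only moves the position.
        return pos + count(elem.tail)

    helper(tree, 0)
    return result
```

Called as ``align_by_offsets(etree.fromstring(xml_text), plain_text)``, this should yield the same triples as ``align`` for the example above, since both versions skip whitespace and differ only in how they track position.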

Thanks,

STeVe
Jan 17 '07 #1
