I have some text and a list of Element objects and their offsets, e.g.::
>>text = 'aaa aaa aaabbb bbbaaa' spans = [
... (etree.Element( 'a'), 0, 21),
... (etree.Element( 'b'), 11, 18),
... (etree.Element( 'c'), 18, 18),
... ]
I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
>>tree = get_tree(text, spans) etree.tostrin g(tree)
'<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?
Thanks,
STeVe 9 1993
En Tue, 22 May 2007 03:02:34 -0300, Steven Bethard
<st************ @gmail.comescri bió:
I have some text and a list of Element objects and their offsets, e.g.::
>>text = 'aaa aaa aaabbb bbbaaa'
>>spans = [
... (etree.Element( 'a'), 0, 21),
... (etree.Element( 'b'), 11, 18),
... (etree.Element( 'c'), 18, 18),
... ]
I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
>>tree = get_tree(text, spans)
>>etree.tostrin g(tree)
'<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?
I need *some* sleep, but the idea would be as follows:
- For each span generate two tuples: (start_offset, 1, end_offset,
element) and (end_offset, 0, -start_offset, element). If start==end use
(start_offset, -1, start_offset, element).
- Collect all tuples in a list, and sort them. The tuple is made so when
at a given offset there is an element that ends and another that starts,
the ending element comes first (because it has a 0). For all the elements
that end at a given point, the shortest comes first.
- Initialize an empty list (will be used as a stack of containers), and
create a root Element as your "current container" (CC), the variable
"last_used" will keep the last position used on the text.
- For each tuple in the sorted list:
- if the second item is a 1, an element is starting. Insert it into the
CC element, push the CC onto the stack, and set the new element as the new
CC. The element data is text[last_used:start _offset], and move last_used
to start_offset.
- if the second item is a 0, an element is ending. Discard the CC, pop
an element from the stack as the new CC. The element data is
text[last_used:end_o ffset], move last_used up to end_offset.
- if the second item is a -1, it's an element with no content. Insert it
into the CC element.
You can play with the way the tuples are generated and sorted, to get
'<a>aaa aaa aaa<b>bbb bbb</b><c />aaa</a>' instead.
--
Gabriel Genellina
On May 21, 11:02 pm, Steven Bethard <steven.beth... @gmail.comwrote :
I have some text and a list of Element objects and their offsets, e.g.::
>>text = 'aaa aaa aaabbb bbbaaa'
>>spans = [
... (etree.Element( 'a'), 0, 21),
... (etree.Element( 'b'), 11, 18),
... (etree.Element( 'c'), 18, 18),
... ]
I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
>>tree = get_tree(text, spans)
>>etree.tostrin g(tree)
'<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?
It seems you're looking to construct an Interval Tree: http://en.wikipedia.org/wiki/Interval_tree
--
Hope this helps,
Steven at************* @gmail.com wrote:
On May 21, 11:02 pm, Steven Bethard <steven.beth... @gmail.comwrote :
>I have some text and a list of Element objects and their offsets, e.g.::
> >>text = 'aaa aaa aaabbb bbbaaa' >>spans = [
... (etree.Element( 'a'), 0, 21), ... (etree.Element( 'b'), 11, 18), ... (etree.Element( 'c'), 18, 18), ... ]
I'd like to produce the corresponding ElementTree. So I want to write a get_tree() function that works like::
> >>tree = get_tree(text, spans) >>etree.tostrin g(tree)
'<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
Perhaps I just need some more sleep, but I can't see an obvious way to do this. Any suggestions?
It seems you're looking to construct an Interval Tree:
http://en.wikipedia.org/wiki/Interval_tree
No, I'm looking to construct an ElementTree from intervals. ;-) Could
you elaborate on how an Interval Tree would help me?
STeVe
On May 21, 11:02 pm, Steven Bethard <steven.beth... @gmail.comwrote :
I have some text and a list of Element objects and their offsets, e.g.::
>>text = 'aaa aaa aaabbb bbbaaa'
>>spans = [
... (etree.Element( 'a'), 0, 21),
... (etree.Element( 'b'), 11, 18),
... (etree.Element( 'c'), 18, 18),
... ]
I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
>>tree = get_tree(text, spans)
>>etree.tostrin g(tree)
'<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
Perhaps I just need some more sleep, but I can't see an obvious way to
do this. Any suggestions?
It seems you're looking to construct an Interval Tree: http://en.wikipedia.org/wiki/Interval_tree
--
Hope this helps,
Steven
Gabriel Genellina wrote:
the idea would be as follows:
- For each span generate two tuples: (start_offset, 1, end_offset,
element) and (end_offset, 0, -start_offset, element). If start==end use
(start_offset, -1, start_offset, element).
- Collect all tuples in a list, and sort them. The tuple is made so when
at a given offset there is an element that ends and another that starts,
the ending element comes first (because it has a 0). For all the
elements that end at a given point, the shortest comes first.
- Initialize an empty list (will be used as a stack of containers), and
create a root Element as your "current container" (CC), the variable
"last_used" will keep the last position used on the text.
- For each tuple in the sorted list:
- if the second item is a 1, an element is starting. Insert it into
the CC element, push the CC onto the stack, and set the new element as
the new CC. The element data is text[last_used:start _offset], and move
last_used to start_offset.
- if the second item is a 0, an element is ending. Discard the CC, pop
an element from the stack as the new CC. The element data is
text[last_used:end_o ffset], move last_used up to end_offset.
- if the second item is a -1, it's an element with no content. Insert
it into the CC element.
Thanks a lot! This put me on the right track (though the devil's
definitely in the details). It's working now::
>>tree = xmltools.text_a nd_spans_to_etr ee('aaa aaa aaaccc cccaaa', [
.... (etree.Element( 'a'), 0, 21),
.... (etree.Element( 'b'), 11, 11),
.... (etree.Element( 'c'), 11, 18),
.... ])
>>etree.tostrin g(tree)
'<a>aaa aaa aaa<b /><c>ccc ccc</c>aaa</a>'
>>tree = xmltools.text_a nd_spans_to_etr ee('bbb\naaaccc \ncccaaa', [
.... (etree.Element( 'a'), 0, 17),
.... (etree.Element( 'b'), 0, 4),
.... (etree.Element( 'c'), 7, 14),
.... (etree.Element( 'b'), 14, 14),
.... ])
>>etree.tostrin g(tree)
'<a><b>bbb\n</b>aaa<c>ccc\ncc c</c><b />aaa</a>'
>>tree = xmltools.text_a nd_spans_to_etr ee('abc', [
.... (etree.Element( 'a'), 0, 3),
.... (etree.Element( 'b'), 0, 3),
.... (etree.Element( 'c'), 0, 3),
.... ])
>>etree.tostrin g(tree)
'<a><b><c>abc</c></b></a>'
And for the sake of any poor soul who runs into a similar problem,
here's the code I wrote using Gabriel's hints above::
def text_and_spans_ to_etree(text, spans):
# Create a list of element starts and ends, sorting them in the
# order they would appear in an XML version of the text. So for
# example, with XML text like:
# <aa <bb </ba </a>
# we will see:
# starting <a>
# starting <b>
# ending </b>
# ending </a>
empty = -1
starting = 0
ending = 1
tuples = []
root_elem = None
for i, (elem, start, end) in enumerate(spans ):
# validate spans
if start < 0 or end len(text):
msg = 'spans must be in the range 0-%i'
raise ValueError(msg % len(text))
# save the first element that spans the entire text as the root
elif root_elem is None and start == 0 and end == len(text):
root_elem = elem
# insert a single tuple for empty elements
elif start == end:
tuples.append(( start, empty, -end, i, elem))
# insert starts and ends for all other elements
else:
tuples.append(( start, starting, -end, i, elem))
tuples.append(( start, ending, -end, -i, elem))
tuples.sort()
# make sure we found a root element
if root_elem is None:
raise ValueError('one element must span the entire text')
# iterate through the element starts and ends,
# updating element texts, tails and children
last_offset = 0
last_elem = root_elem
last_type = starting
stack = [root_elem]
for start, offset_type, neg_end, _, elem in tuples:
# start of an element:
# add it as a child and add it to the stack
# next text chunk goes up to the start offset
if offset_type is starting:
stack[-1].append(elem)
stack.append(el em)
offset = start
# end of an element:
# pop if off the stack
# next text chunk goes up to the end offset
elif offset_type is ending:
if elem is not stack[-1]:
print elem, stack[-1]
assert False
assert elem is stack.pop()
offset = -neg_end
# empty element:
# add it as a child
# next text chunk goes up to the start offset
elif offset_type is empty:
stack[-1].append(elem)
offset = start
# should never get here
else:
assert False
# calculate the next text chunk, and then determine what to do
# with it based on what we did the last time around:
# * started an element -set its .text
# * ended an element (or inserted an empty) -set its .tail
last_text = text[last_offset:off set]
if last_type is starting:
last_elem.text = last_text
elif last_type is ending:
last_elem.tail = last_text
elif last_type is empty:
last_elem.tail = last_text
else:
assert False
# save what we did this time for inspection next time
last_offset = offset
last_type = offset_type
last_elem = elem
# add any final text before the close of the root element
last_elem.tail = text[last_offset:]
return root_elem
Thanks again,
STeVe
On 2007-05-22, Steven Bethard <st************ @gmail.comwrote :
Thanks a lot! This put me on the right track (though the
devil's definitely in the details). It's working now::
>tree = xmltools.text_a nd_spans_to_etr ee('aaa aaa aaaccc cccaaa', [
... (etree.Element( 'a'), 0, 21),
... (etree.Element( 'b'), 11, 11),
... (etree.Element( 'c'), 11, 18),
... ])
>etree.tostring (tree)
'<a>aaa aaa aaa<b /><c>ccc ccc</c>aaa</a>'
>tree = xmltools.text_a nd_spans_to_etr ee('bbb\naaaccc \ncccaaa', [
... (etree.Element( 'a'), 0, 17),
... (etree.Element( 'b'), 0, 4),
... (etree.Element( 'c'), 7, 14),
... (etree.Element( 'b'), 14, 14),
... ])
>etree.tostring (tree)
'<a><b>bbb\n</b>aaa<c>ccc\ncc c</c><b />aaa</a>'
>tree = xmltools.text_a nd_spans_to_etr ee('abc', [
... (etree.Element( 'a'), 0, 3),
... (etree.Element( 'b'), 0, 3),
... (etree.Element( 'c'), 0, 3),
... ])
>etree.tostring (tree)
'<a><b><c>abc</c></b></a>'
And for the sake of any poor soul who runs into a similar
problem, here's the code I wrote using Gabriel's hints above::
When I saw you manually keeping a stack, I called Captain
Recursion on my Red-Alert Recursion Phone.
(I'm sorry he left out the Element stuff, which he doesn't know
or use yet. The Captain's get_tree just returns the string)
def get_tree(text, spans):
"""
>>text = 'aaa aaa aaabbb bbbaaa' spans = [
... ('a', 0, 21),
... ('b', 11, 18),
... ('c', 18, 18),
... ]
I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
>>get_tree(text , spans)
'<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
"""
if not spans:
return ''
else:
head, tail = spans[0], spans[1:]
elem, start, end = head
if tail:
_, follow_start, follow_end = tail[0]
else:
follow_start, follow_end = (end, end)
if end start:
return ("<%s>%s%s%s </%s>" %
(elem,
text[start:follow_st art],
get_tree(text, tail),
text[follow_end:end],
elem))
else:
return "<%s />%s" % (elem, get_tree(text, tail))
--
Neil Cerutti
Neil Cerutti wrote:
On 2007-05-22, Steven Bethard <st************ @gmail.comwrote :
>Thanks a lot! This put me on the right track (though the devil's definitely in the details). It's working now::
>>>>tree = xmltools.text_a nd_spans_to_etr ee('aaa aaa aaaccc cccaaa', [
... (etree.Element( 'a'), 0, 21), ... (etree.Element( 'b'), 11, 11), ... (etree.Element( 'c'), 11, 18), ... ])
>>>>etree.tostr ing(tree)
'<a>aaa aaa aaa<b /><c>ccc ccc</c>aaa</a>'
>>>>tree = xmltools.text_a nd_spans_to_etr ee('bbb\naaaccc \ncccaaa', [
... (etree.Element( 'a'), 0, 17), ... (etree.Element( 'b'), 0, 4), ... (etree.Element( 'c'), 7, 14), ... (etree.Element( 'b'), 14, 14), ... ])
>>>>etree.tostr ing(tree)
'<a><b>bbb\n </b>aaa<c>ccc\ncc c</c><b />aaa</a>'
>>>>tree = xmltools.text_a nd_spans_to_etr ee('abc', [
... (etree.Element( 'a'), 0, 3), ... (etree.Element( 'b'), 0, 3), ... (etree.Element( 'c'), 0, 3), ... ])
>>>>etree.tostr ing(tree)
'<a><b><c>ab c</c></b></a>'
And for the sake of any poor soul who runs into a similar problem, here's the code I wrote using Gabriel's hints above::
When I saw you manually keeping a stack, I called Captain
Recursion on my Red-Alert Recursion Phone.
(I'm sorry he left out the Element stuff, which he doesn't know
or use yet. The Captain's get_tree just returns the string)
Heh heh.
I actually thought about writing it recursively, but note that you need
both recursive and non-recursive parts of this algorithm to do the
ElementTree part right:
* the recursive (or stack) part assigns children to parents
* the non-recursive part assigns text or tail to the previous element
(note that's previous in a sequential sense, not a recursive sense)
I'm sure I could implement this recursively, passing around annother
appropriate argument, but it wasn't obvious to me that the code would be
any cleaner.
STeVe
On 2007-05-23, Steven Bethard <st************ @gmail.comwrote :
Neil Cerutti wrote:
>On 2007-05-22, Steven Bethard <st************ @gmail.comwrote :
>>Thanks a lot! This put me on the right track (though the devil's definitely in the details). It's working now::
>tree = xmltools.text_a nd_spans_to_etr ee('aaa aaa aaaccc cccaaa', [ ... (etree.Element( 'a'), 0, 21), ... (etree.Element( 'b'), 11, 11), ... (etree.Element( 'c'), 11, 18), ... ]) >etree.tost ring(tree) '<a>aaa aaa aaa<b /><c>ccc ccc</c>aaa</a>' >tree = xmltools.text_a nd_spans_to_etr ee('bbb\naaaccc \ncccaaa', [ ... (etree.Element( 'a'), 0, 17), ... (etree.Element( 'b'), 0, 4), ... (etree.Element( 'c'), 7, 14), ... (etree.Element( 'b'), 14, 14), ... ]) >etree.tost ring(tree) '<a><b>bbb\ n</b>aaa<c>ccc\ncc c</c><b />aaa</a>' >tree = xmltools.text_a nd_spans_to_etr ee('abc', [ ... (etree.Element( 'a'), 0, 3), ... (etree.Element( 'b'), 0, 3), ... (etree.Element( 'c'), 0, 3), ... ]) >etree.tost ring(tree) '<a><b><c>abc </c></b></a>'
And for the sake of any poor soul who runs into a similar problem, here's the code I wrote using Gabriel's hints above::
When I saw you manually keeping a stack, I called Captain Recursion on my Red-Alert Recursion Phone.
(I'm sorry he left out the Element stuff, which he doesn't know or use yet. The Captain's get_tree just returns the string)
Heh heh.
I actually thought about writing it recursively, but note that
you need both recursive and non-recursive parts of this
algorithm to do the ElementTree part right:
You mean... I left out the hard part? Shucks. I had really hoped
it didn't matter.
* the recursive (or stack) part assigns children to parents
* the non-recursive part assigns text or tail to the previous element
(note that's previous in a sequential sense, not a recursive sense)
I'm sure I could implement this recursively, passing around
annother appropriate argument, but it wasn't obvious to me that
the code would be any cleaner.
Moreover, it looks like you have experience in writing that sort
of code. I'd have never even attempted it without recursion, but
that's merely exposing one of my limitations. ;)
--
Neil Cerutti
On 2007-05-24, Neil Cerutti <ho*****@yahoo. comwrote:
On 2007-05-23, Steven Bethard <st************ @gmail.comwrote :
You mean... I left out the hard part? Shucks. I had really
hoped it didn't matter.
>* the recursive (or stack) part assigns children to parents * the non-recursive part assigns text or tail to the previous element (note that's previous in a sequential sense, not a recursive sense)
I'm sure I could implement this recursively, passing around annother appropriate argument, but it wasn't obvious to me that the code would be any cleaner.
Moreover, it looks like you have experience in writing that
sort of code. I'd have never even attempted it without
recursion, but that's merely exposing one of my limitations. ;)
You'll be happy to know I found a way to salvage my simple
recursive solution, and make it generate an ElementTree!
def get_tree(text, spans):
"""
>>text = 'aaa aaa aaabbb bbbaaa' spans = [
... (etree.Element( 'a'), 0, 21),
... (etree.Element( 'b'), 11, 18),
... (etree.Element( 'c'), 18, 18),
... ]
I'd like to produce the corresponding ElementTree. So I want to write a
get_tree() function that works like::
>>etree.tostrin g(get_tree(text , spans))
'<a>aaa aaa aaa<b>bbb bbb<c /></b>aaa</a>'
"""
def helper(text, spans):
if not spans:
return ''
else:
head, tail = spans[0], spans[1:]
elem, start, end = head
if tail:
_, follow_start, follow_end = tail[0]
else:
follow_start, follow_end = (end, end)
if end start:
return ("<%s>%s%s%s </%s>" %
(elem.tag,
text[start:follow_st art],
helper(text, tail),
text[follow_end:end],
elem.tag))
else:
return "<%s />%s" % (elem.tag, helper(text, tail))
return etree.XML(helpe r(text, spans))
But at least I learned just a *little* about XML and Python
during this arduous process. ;)
--
Neil Cerutti
The concert held in Fellowship Hall was a great success. Special thanks are
due to the minister's daughter, who labored the whole evening at the piano,
which as usual fell upon her. --Church Bulletin Blooper This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: Robert Oschler |
last post by:
Has anybody seen a Python module that will take an XML document (not a
colossal one), and convert it to a Python nested class object? I'm
basically looking for something that would allow me to parse an XML document
(not tokenize it like SAX or make it into an XPath accessible DOM object
like others), directly into a nested Python object so I could access
everything through Python class attribute references.
Thanks.
--
Robert
|
by: Stewart Midwinter |
last post by:
I want to parse a file with ElementTree. My file has the following
format:
<!-- file population.xml -->
<?xml version='1.0' encoding='utf-8'?>
<population>
<person><name="joe" sex="male" age="49"></person>
<person><name="hilda" sex="female" age="33"></person>
<person><name="bartholomew" sex="male" age="17">
</person>
</population>
|
by: Greg Wilson |
last post by:
I'm trying to convert from minidom to ElementTree for handling XML,
and am having trouble with entities in DTDs. My Python script looks
like this:
----------------------------------------------------------------------
#!/usr/bin/env python
import sys, os
from elementtree import ElementTree
|
by: Max M |
last post by:
I am writing a "find-free-time" function for a calendar. There are a lot
of time spans with start end times, some overlapping, some not.
To find the free time spans, I first need to convert the events into a
list of non overlapping time spans "meta-spans".
This nice ascii graph should show what I mean.
1) ---
2) ---
|
by: svintuss |
last post by:
Hi. I'm trying to create a widget for Tiger, which should use a
web-dictionary. I want the widget to display translation within its
window, so it has to copy all the data within the tags <span
class="lingvo-article"> </span>.
Is there a method like document.getElementById to get the class
contents?
Or it's only possible to search for opening and ending tags, then copy
the text inside?
| |
by: mirandacascade |
last post by:
Situation is this:
1) I have inherited some python code that accepts a string object, the
contents of which is an XML document, and produces a data structure
that represents some of the content of the XML document
2) The inherited code is somewhat 'brittle' in that some well-formed
XML documents are not correctly processed by the code; the brittleness
is caused by how the parser portion of the code handles whitespace.
3) I would like to...
|
by: saif.shakeel |
last post by:
#!/usr/bin/env python
from elementtree import ElementTree as Element
tree = et.parse("testxml.xml")
for t in tree.getiterator("SERVICEPARAMETER"):
if t.get("Semantics") == "localId":
t.set("Semantics", "dataPackageID")
|
by: Jan Danielsson |
last post by:
Hello all,
I'm using mod_python+ElementTree to build XHTML pages. But I stumbled
across this problem:
--------------------
def foo(req, desc = None):
...
|
by: mndprasad |
last post by:
Hi friends,
Am new to AJAX coding, In my program am going to have two texbox which going to implent AJAX from same table. One box is going to retrieve the value of other and vice versa.
I have implemented successfully for one text box, but it's done for the other field, since am getting script errors.
<script>
var queryField;
var lookupURL;
var divName;
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look !
Part I. Meaning of...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it.
First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
| |
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth.
The Art of Business Website Design
Your website is...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one.
At the time of converting from word file to html my equations which are in the word document file was convert into image.
Globals.ThisAddIn.Application.ActiveDocument.Select();...
|
by: adsilva |
last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
|
by: 6302768590 |
last post by:
Hai team
i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
|
by: muto222 |
last post by:
How can i add a mobile payment intergratation into php mysql website.
| |
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...
| |