lxml/ElementTree and .tail

Chas Emerick

I looked around for an ElementTree-specific mailing list, but found
none -- my apologies if this is too broad a forum for this question.

I've been using the lxml variant of the ElementTree API, which I
understand works in much the same way (with some significant
additions). In particular, it shares the use of a .tail attribute.
I ran headlong into this aspect of the API while doing some DOM
manipulations, and it's got me pretty confused.

Example:

>>> from lxml import etree as ET
>>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
>>> b = frag.xpath('//b')[0]
>>> b
<Element b at 71cbe8>
>>> b.text
'inside'
>>> b.tail
'tail'
>>> frag.remove(b)
>>> ET.tostring(frag)
'<a>head</a>'

As you can see, the .tail text is removed as part of the <b> element
-- but it IS NOT part of the <b> element. I understand the use of
the .tail attribute given the desire to simplify the API by avoiding
pure text nodes, but it seems entirely inappropriate for the tail
text to disappear into the ether when what is technically a sibling
node is removed.

Performing the same operations with the Java DOM api (crimson, in
this case it turns out) yields what I would expect (here I'm using
JPype to access a v1.4.2 JVM through python -- which makes things
somewhat less painful):

>>> from jpype import *
>>> startJVM(getDefaultJVMPath())
>>> builder = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder()
>>> xml = java.io.ByteArrayInputStream(java.lang.String('<a>head<b>inside</b>tail</a>').getBytes())
>>> doc = builder.parse(xml)
>>> a = doc.documentElement
>>> a.toString()
u'<a>head<b>inside</b>tail</a>'
>>> b = a.getElementsByTagName('b').item(0)
>>> a.removeChild(b)
>>> a.toString()
u'<a>headtail</a>'

(Sorry for the Java comparison, but that's where I first cut my teeth
on XML, and that's where my expectations were formed.)

That's a pretty significant mismatch in functionality. I certainly
understand the motivations of Mr. Lundh to make the ET API as
pythonic as possible, but ET's behaviour in this specific context is
flatly wrong as far as I can see. I would have expected that a
removal operation would have appended <b>'s tail text to the text of
<a> (or perhaps to the tail text of <b>'s closest preceding sibling)
-- something that I think I'm going to have to do in order to
continue using lxml / ElementTree.
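
A minimal sketch of the workaround described above, for readers who land here later. It uses the standard library's xml.etree.ElementTree, which also drops the tail along with the removed element; `remove_keep_tail` is a hypothetical helper name, not part of ElementTree or lxml.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, child):
    """Remove child from parent, but leave child's tail text in the tree.

    Hypothetical helper: the tail is appended to the preceding sibling's
    tail, or to the parent's text when child is the first child.
    """
    tail = child.tail or ''
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or '') + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or '') + tail
    parent.remove(child)

frag = ET.XML('<a>head<b>inside</b>tail</a>')
remove_keep_tail(frag, frag.find('b'))
print(ET.tostring(frag, encoding='unicode'))  # <a>headtail</a>
```

This mirrors the DOM result shown above: the text that followed <b> stays inside <a>.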

I ran this issue past a few people I know who've worked with and
written about ElementTree, and their response to this apparent
divergence between the ET DOM API and "standard" DOM APIs was
roughly: "that's just the way it is".

Comments, thoughts?

Chas Emerick
Founder, Snowtide Informatics Systems
Enterprise-class PDF content extraction

ce******@snowtide.com
http://snowtide.com | +1 413.519.6365

Nov 15 '06 #1


Stefan Behnel

Hi,

Chas Emerick wrote:

I looked around for an ElementTree-specific mailing list, but found none
-- my apologies if this is too broad a forum for this question.

The lxml mailing list is always happy to receive feedback, but it's fine to
ask here if it's not lxml specific.

I've been using the lxml variant of the ElementTree API.
it shares the use of a .tail attribute. I
ran headlong into this aspect of the API while doing some DOM
manipulations, and it's got me pretty confused.

Example:

>>> from lxml import etree as ET
>>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
>>> b = frag.xpath('//b')[0]
>>> b
<Element b at 71cbe8>
>>> b.text
'inside'
>>> b.tail
'tail'
>>> frag.remove(b)
>>> ET.tostring(frag)
'<a>head</a>'

As you can see, the .tail text is removed as part of the <b> element --
but it IS NOT part of the <b> element.

Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
What other API do you know where removing an element from a data structure
leaves part of the element behind?

If you want to copy part of the removed element back into the tree, feel free
to do so.

Performing the same operations with the Java DOM api
(Sorry for the Java comparison, but that's where I first cut my teeth on
XML, and that's where my expectations were formed.)

That's a pretty significant mismatch in functionality.

IMHO, DOM has a pretty significant mismatch with Python.

I ran this issue past a few people I know who've worked with and written
about ElementTree, and their response to this apparent divergence
between the ET DOM API and "standard" DOM APIs was roughly: "that's just
the way it is".

It's just a matter of understanding (or getting used to) the API. You might
want to stop thinking in terms of '<' and '>' and rather embrace the API
itself as a way to work with the XML Infoset (rather than the XML DOM).

Stefan

Nov 16 '06 #2

Fredrik Lundh

Stefan Behnel wrote:

If you want to copy part of the removed element back into the tree, feel free
to do so.

and that can of course be done with a short helper function.

when removing elements from trees, I often set the tag for those
elements to some "garbage" value during processing, and then call
something like

http://effbot.org/zone/element-bits-...es.htm#cleanup

to clean things up before serializing the tree.
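
One way the mark-then-clean-up approach above might look. This is a rough sketch under my own assumptions, not the code behind the link: the `GARBAGE` sentinel and the `sweep` function are hypothetical names, and it uses the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

GARBAGE = '__garbage__'  # hypothetical sentinel tag, chosen for this sketch

def sweep(elem):
    """Drop every descendant tagged GARBAGE, keeping its tail text."""
    kept = []
    for child in list(elem):
        if child.tag == GARBAGE:
            tail = child.tail or ''
            if kept:
                # Attach the orphaned tail to the last surviving sibling.
                kept[-1].tail = (kept[-1].tail or '') + tail
            else:
                # No surviving sibling yet: the tail joins the parent's text.
                elem.text = (elem.text or '') + tail
        else:
            sweep(child)
            kept.append(child)
    elem[:] = kept

root = ET.XML('<a>head<b>inside</b>tail<c>keep</c>end</a>')
for b in list(root.iter('b')):
    b.tag = GARBAGE          # mark during processing
sweep(root)                  # clean up before serializing
print(ET.tostring(root, encoding='unicode'))  # <a>headtail<c>keep</c>end</a>
```

Marking first and sweeping once keeps the processing code free of tail bookkeeping.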

</F>

Nov 16 '06 #3

Paul Boddie

Stefan Behnel wrote:

[Remove an element, remove following nodes]

Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
What other API do you know where removing an element from a data structure
leaves part of the element behind?

I guess it depends on what you regard an element to be...

[...]

IMHO, DOM has a pretty significant mismatch with Python.

....in the DOM or otherwise:

http://www.w3.org/TR/2006/REC-xml-20...logical-struct

Paul

Nov 16 '06 #4

Fredrik Lundh

Paul Boddie wrote:

Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
What other API do you know where removing an element from a data structure
leaves part of the element behind?

I guess it depends on what you regard an element to be...

Stefan said "Element", not "element".

"Element" is a class in the "ElementTree" module, which can be used to
*represent* an XML element in an XML infoset, including all the data
*inside* the XML element, and any data *between* that XML element and
the next one (which is always character data, of course).

It's not very difficult, really; especially if you, as Stefan said,
think in infoset terms rather than "a sequence of little piggies" terms.

</F>

Nov 16 '06 #5

Paul Boddie

Fredrik Lundh wrote:

It's not very difficult, really; especially if you, as Stefan said,
think in infoset terms rather than "a sequence of little piggies" terms.

Are piggies part of the infoset too? Does the Piggie class represent a
piggie from the infoset plus a stretch of the road to the market? ;-)

Paul

Nov 16 '06 #6

Chas Emerick

Thanks for the comments and thoughts. I must admit that I have an
overwhelming feeling of having just stepped into the middle of a
complex, heated conversation without having heard the preamble.

(FYI, this reply is only an attempt to help those that come
afterwards -- I'm not looking to advocate much of anything here.)

Fredrik's invocation of the "infoset" term led me to a couple of
quick searches that clarified the state of play. Here he sets the
stage for the .tail behaviour that I originally posted about:

http://effbot.org/zone/element-infoset.htm

And it looks like there have been tussles over other mismatches in
expectations before, specifically around how namespaces are handled:

http://groups.google.com/group/comp....thread/thread/
31b2e9f4a8f7338c
http://nixforums.org/ntopic43901.html

From what I can see, there are more than a few people that have
stumbled with ElementTree's API because of their preexisting
expectations, which others have probably correctly bucketed as
"implementation details". This comes as quite a shock to those who
have stumbled (including myself) who have, lo these many years, come
to view those details as the only standard that matters (perhaps
simply because those details have been so consistent in our experience).

Which, in my view, is just fine -- different strokes for different
folks, and all that. When I originally started poking around the
python xml world, I was somewhat confused as to why 4suite/Domlette
existed, as it seemed pretty clear that ElementTree had crystallized
a lot of mindshare, and has a very attractive API to boot.
Thankfully, I can now see its appeal, and am very glad it's around,
as it seems to have all of those comfortable implementation details
that I've been looking for. :-)

As for the infoset vs. "sequence of piggies" nut: if ElementTree's
infoset approach is technically correct, then wouldn't it also be
correct to use a .head attribute instead of a .tail attribute? Example:

<a>first<b>middle</b>last</a>

might be represented as:

<Element a: head='', text='last'>
<Element b: head='first', text='middle'>
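
For comparison with the hypothetical .head layout above, this is how ElementTree's actual .text/.tail model holds the same fragment, assuming it is <a>first<b>middle</b>last</a> (the <b> child is inferred from the representation shown). A small sketch using the standard library:

```python
import xml.etree.ElementTree as ET

a = ET.XML('<a>first<b>middle</b>last</a>')
b = a[0]

# .text is the character data before the first child;
# .tail is the character data after the element's end tag.
print(repr(a.text))  # 'first'
print(repr(b.text))  # 'middle'
print(repr(b.tail))  # 'last'
print(repr(a.tail))  # None
```

Both layouts carry the same three strings; they differ only in which Element each string is attached to.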

If I'm wrong, just chalk it up to the fact that this is the first
time I've ever looked at the Infoset spec, and I'm simply confused.
If that IS a technically-valid way to represent the above xml
fragment . . . then I guess I'll make sure to tread more carefully in
the future around tools that work in infoset terms. For me, it turns
out that sequences of piggies really are important, at least in
contexts where XML is merely a means to an end (either because of the
attractiveness of the toolsets or because we must cope with what
we're provided as input) and where consistency with existing tools
(like those that adhere to DOM level 2/3) and expectations are
critical. I think this is what Paul was nodding towards with his
original response to Stefan's response.

Cheers,

- Chas

Nov 16 '06 #7

Fredrik Lundh

Chas Emerick wrote:

might be represented as:

<Element a: head='', text='last'>
<Element b: head='first', text='middle'>

sure, and you could use a text subtype instead that kept track of the
elements above it, and let the elements be sequences of their siblings
instead of their children, and perhaps stuff everything in a dictionary.
such a construct would also be able to hold the same data, and be very
hard to use in most normal situations.

If I'm wrong, just chalk it up to the fact that this is the first
time I've ever looked at the Infoset spec, and I'm simply confused.

the Infoset spec *is* the essence of XML; if you don't realize that an
XML document is just a serialization of a very simple data model, you're
bound to be fighting with XML all the time.

but ET doesn't implement the Infoset spec as it is, of course: it uses a
*simplified* model, carefully optimized for the large percentage of all
XML formats that simply doesn't use mixed content. if you're doing
document-style processing, you sometimes need to add an extra assignment
or two, but unless you're doing *only* document-style processing, ET's
API gives you a net win. (and even if you're doing only document-style
processing, ET's speed and memory footprint gives you a net win over
most competing technologies).

</F>

Nov 16 '06 #8

Chas Emerick

On Nov 16, 2006, at 7:25 AM, Fredrik Lundh wrote:

If I'm wrong, just chalk it up to the fact that this is the first
time I've ever looked at the Infoset spec, and I'm simply confused.

the Infoset spec *is* the essence of XML; if you don't realize that an
XML document is just a serialization of a very simple data model,
you're
bound to be fighting with XML all the time.

The principle and the practice diverge significantly in our neck of
the woods. The current project involves consuming and making sense
of extraordinarily (and typically unnecessarily) complex XHTML. Of
course, as you say, those documents are still serializations of a
simple data model, but the types of manipulations we do happen to
butt up very uncomfortably with the way ET does things.

but ET doesn't implement the Infoset spec as it is, of course: it
uses a
*simplified* model, carefully optimized for the large percentage of
all
XML formats that simply doesn't use mixed content. if you're doing
document-style processing, you sometimes need to add an extra
assignment
or two, but unless you're doing *only* document-style processing, ET's
API gives you a net win. (and even if you're doing only document-
style
processing, ET's speed and memory footprint gives you a net win over
most competing technologies).

Yeah, documents are all we do -- XML just happens to be a pleasant
intermediate format, and something we need to consume. The notion of
a nicely-formatted XML document is entirely foreign to the work that we do --
in fact, our current focus is (in part) dragging decidedly
unstructured data out of those XHTML documents (among other source
formats) and putting them into a reasonable, useful structure.

I took some time last night to bang out some functions that squeezed
ET's model (via lxml) into doing what we need, and it ended up
requiring a lot more B&D than I like. At that point, I swung over to
4suite, which dropped into place quite nicely.

*shrug* I guess we're just in the minority with regard to our API
requirements -- we happen to live in the corner cases. I'm certainly
glad to have made the detour on a different path for a bit though.

- Chas

Nov 16 '06 #9

Fredrik Lundh

Chas Emerick wrote:

The principle and the practice diverge significantly in our neck of
the woods. The current project involves consuming and making sense
of extraordinarily (and typically unnecessarily) complex XHTML.

wasn't your original complaint that ET didn't do the "right thing" when
you removed elements from a mixed-content tree? (something that can be
trivially handled with a 2-line helper function)

why mutate the tree if all you want is to extract information from it?
doesn't sound very efficient to me...

</F>

Nov 16 '06 #10

Chas Emerick

On Nov 16, 2006, at 8:12 AM, Fredrik Lundh wrote:

Chas Emerick wrote:

The principle and the practice diverge significantly in our neck of
the woods. The current project involves consuming and making sense
of extraordinarily (and typically unnecessarily) complex XHTML.

wasn't your original complaint that ET didn't do the "right thing"
when
you removed elements from a mixed-content tree? (something that can be
trivially handled with a 2-line helper function)

Yes, that was the initial issue, but the delta between Elements and
DOM-style elements leads to other issues. There's no doubt that the
needed helpers are simple, but all things being equal, not having to
carry them around anywhere we're doing DOM manipulations is a big plus.

why mutate the tree if all you want is to extract information from it?
doesn't sound very efficient to me...

Because we're far from doing anything that is regular or one-off in
nature. We're systematizing the extraction of data from functionally
unstructured content, and it's flatly necessary to normalize the
XHTML into something that can be easily consumed by the processes
we've built that can do that content->data extraction/conversion from
plain text, XML, PDF, and now XHTML.

Remember, corner cases. :-)

- Chas

Nov 16 '06 #11

Stefan Behnel

Fredrik Lundh wrote:

Stefan Behnel wrote:

If you want to copy part of the removed element back into the tree,
feel free to do so.

and that can of course be done with a short helper function.

Oh, and obviously with a custom Element class in lxml that does this
automatically for you behind the scenes.

http://codespeak.net/lxml/element_classes.html
http://codespeak.net/lxml/element_cl...t-class-lookup

Stefan

Nov 16 '06 #12

Stefan Behnel

Chas Emerick wrote:

the delta between Elements and DOM-style elements leads to other issues.
There's no doubt that the needed helpers are simple, but all things being
equal, not having to carry them around anywhere we're doing DOM
manipulations is a big plus.

Because we're far from doing anything that is regular or one-off in nature.
We're systematizing the extraction of data from functionally unstructured
content, and it's flatly necessary to normalize the XHTML into something
that can be easily consumed by the processes we've built that can do that
content->data extraction/conversion from plain text, XML, PDF, and now
XHTML.

Remember, corner cases. :-)

Hmm, then I really don't get why you didn't just write a customised XHTML API
on top of lxml's custom Element classes feature. Hiding XML language specific
behaviour directly in the Element classes really helps in getting your code
clean, especially in larger code bases.

Stefan

Nov 16 '06 #13

Fredrik Lundh

Paul Boddie wrote:

It's not very difficult, really; especially if you, as Stefan said,
think in infoset terms rather than "a sequence of little piggies" terms.

Are piggies part of the infoset too? Does the Piggie class represent a
piggie from the infoset plus a stretch of the road to the market? ;-)

no, they just appear in serialized XML. if you want concrete piggies, you have
to wrap ET's iterparse function, or perhaps the XMLParser class.
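
A hedged sketch of the XMLParser route mentioned above: a parser target receives start tags, end tags and character data in serialization order, which is about as close to "concrete piggies" as ElementTree gets. `EventRecorder` is a hypothetical class name; the start/end/data/close target methods are the ones documented for xml.etree.ElementTree.XMLParser.

```python
import xml.etree.ElementTree as ET

class EventRecorder:
    """Parser target that records events in document order."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(('start', tag))

    def end(self, tag):
        self.events.append(('end', tag))

    def data(self, text):
        # Character data may arrive in more than one chunk.
        self.events.append(('text', text))

    def close(self):
        return self.events

parser = ET.XMLParser(target=EventRecorder())
parser.feed('<a>head<b>inside</b>tail</a>')
events = parser.close()  # returns whatever the target's close() returns
for event in events:
    print(event)
```

Here the text after </b> shows up as its own event between the two end tags, rather than as an attribute of <b>.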

</F>

Nov 16 '06 #14

Uche Ogbuji

Fredrik Lundh wrote:

Chas Emerick wrote:
If I'm wrong, just chalk it up to the fact that this is the first
time I've ever looked at the Infoset spec, and I'm simply confused.

the Infoset spec *is* the essence of XML; if you don't realize that an
XML document is just a serialization of a very simple data model, you're
bound to be fighting with XML all the time.

I certainly have never liked the aspects of the ElementTree API under
present discussion. But that's not as important as the fact that I
think the above statement is misleading. There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization. Infoset is a secondary and optional spec. In fact, I
think it's clear that Infoset is not even the preeminent *data model*
of the XML world. That distinction goes to the XPath data model, which
is quite different from the Infoset.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

Nov 18 '06 #15

Fredrik Lundh

Uche Ogbuji wrote:

I certainly have never liked the aspects of the ElementTree API under
present discussion. But that's not as important as the fact that I
think the above statement is misleading. There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization.

sure, the computing world is and has always been full of people who want
the simplest thing to look a lot harder than it actually is. after all,
*they* spent lots of time reading all the specifications, they've bought
all the books, and went to all the seminars, so it's simply not fair
when others are cheating.

in reality, *all* interchange formats are easier to understand and use
if you focus on a (complete or intentionally simplified) data model of
the things being interchanged, and treat various artifacts of the
byte-stream used by the wire format as artifacts, historical accidents
based on what specification happened to be written before the other, or
what some guy did or did not do in the seventies, as accidents, and
esoteric arcana disseminated on limited-distribution mailing lists as
about as relevant for your customer as last week's episode of American Idol.

(XML is a bit unusual in this respect, but that's probably just some
variation of the bikeshed effect. it's just text, and everyone with
a keyboard knows what that is, so we don't need to use established
software engineering practices, or think about security *at all*
(Billion laughs? XXE?) or, for that matter, learn from people who've
been doing data interchange in other domains since the dawn of time.
and when they do appear anyway, and mess with our technology in ways
that we haven't authorized, without reading our books or going to our
seminars or subscribing to our mailing lists, we can write them off as
"clueless muppet teenage genius code-jockeys", and keep patting
ourselves on the back, while the rest of the world is busy routing around
us, switching to well-understood XML subsets or other serialization
formats, simpler and more flexible data models, simpler API:s, and
more robust code. and Python ;-)

</F>

Nov 18 '06 #16

Paul McGuire

"Fredrik Lundh" <fr*****@pythonware.com> wrote in message
news:ma**************************************@python.org...

(XML is a bit unusual in this respect, but that's probably just some
variation of the bikeshed effect. it's just text, and everyone with
a keyboard knows what that is, so we don't need to use established
software engineering practices, or think about security *at all* (Billion
laughs? XXE?) or, for that matter, learn from people who've
been doing data interchange in other domains since the dawn of time. and
when they do appear anyway, and mess with our technology in ways that we
haven't authorized, without reading our books or going to our seminars or
subscribing to our mailing lists, we can write them off as "clueless
muppet teenage genius code-jockeys", and keep patting ourselves on the
back, while the rest of the world is busy routing around us, switching to
well-understood XML subsets or other serialization formats, simpler and
more flexible data models, simpler API:s, and
more robust code. and Python ;-)

maybe time to switch to decaf... :)

Nov 18 '06 #17

Fredrik Lundh

Paul McGuire wrote:

maybe time to switch to decaf... :)

do you disagree with my characterization of the state of the XML universe?

</F>

Nov 18 '06 #18

Paul McGuire

"Fredrik Lundh" <fr*****@pythonware.com> wrote in message
news:ma**************************************@python.org...

Paul McGuire wrote:

maybe time to switch to decaf... :)

do you disagree with my characterization of the state of the XML universe?

</F>

Thankfully, I'm largely on the periphery of that universe (except for being
a sometimes victim). But it is certainly frustrating to see many of the OMG
concepts of the 90's reimplemented in Java services, and then again in
XML/SOAP, with no detectable awareness that these messaging and
serialization problems have been considered before, and much more
thoroughly.

I liked XML when I could read it and hack it out in Notepad. I like
attributes, which puts me on the outs with most XML zealots who forswear the
use of attributes on purely academic grounds (they defeat the future
possible expansion of an attribute's value into more complex substructure).
I dislike namespaces, especially the default xmlns kind, as they make me
take extra steps when retrieving nodes via Xpaths; and everyone seems to
think their application needs namespaces, when there is no threat that these
tags will ever get mixed up with anyone else's.

No, I was mostly amused (which I thought was your intent, given the trailing
smiley) at your breathless, quasi-rant against the XML milieu in general - I
think your one sentence went on for about 15 lines!

-- Paul

Nov 18 '06 #19

Chas Emerick

On Nov 18, 2006, at 5:09 AM, Fredrik Lundh wrote:

Uche Ogbuji wrote:

I certainly have never liked the aspects of the ElementTree API under
present discussion. But that's not as important as the fact that I
think the above statement is misleading. There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization.

sure, the computing world is and has always been full of people who
want
the simplest thing to look a lot harder than it actually is. after
all,
*they* spent lots of time reading all the specifications, they've
bought
all the books, and went to all the seminars, so it's simply not fair
when others are cheating.

[snip]

and keep patting ourselves on the back, while the rest of the world is
busy routing around
us, switching to well-understood XML subsets or other serialization
formats, simpler and more flexible data models, simpler API:s, and
more robust code. and Python ;-)

That's flatly unrealistic. If you'll remember, I'm not one of "those
people" that are specification-driven -- I hadn't even *heard* of
Infoset until earlier this week! However, I am driven to ensure that
the code I (and we) write works *as others expect* when confronted by
any of the billions of XML documents out there. Simpler is better,
and better is better (thus why I am in python-land), unless that
simplicity makes it difficult to play nicely with others. Shrugging
off the way everyone else does things reminds me of various CSS
fanatics I know of that simply won't use tables or IE CSS
compatibility hacks, even if that's what's needed to get things to work.

I've never been involved in any "XML battles", but to Uche's point, I
would speculate (only on the basis of personal interactions and
anecdotes) that some overwhelming majority of the developers out
there care for nothing but the serialization, simply because that's
how one plays nicely with others. I would count myself in that group
as well, although I do recognize that there is a worthy academic
exercise in exploring the data-model-centric XML worldview.

OT: Uche, 4suite XML is tops! Thank you very much for that.

- Chas

Nov 18 '06 #20

Fredrik Lundh

Chas Emerick wrote:

and keep patting ourselves on the back, while the rest of the world is
busy routing around
us, switching to well-understood XML subsets or other serialization
formats, simpler and more flexible data models, simpler API:s, and
more robust code. and Python ;-)

That's flatly unrealistic. If you'll remember, I'm not one of "those
people" that are specification-driven -- I hadn't even *heard* of
Infoset until earlier this week!

The rant wasn't directed at you or anyone special, but I don't really
think you got the point of it either. Which is a bit strange, because
it sounded like you *were* working on extracting information from messy
documents, so the "it's about the data, dammit" way of thinking
shouldn't be news to you.

And the routing around is not unrealistic, it is a *fact*; JSON and
POX are killing the full XML/Schema/SOAP stack for communication, XHTML
is pretty much dead as a wire format, people are apologizing in public
for their use of SOAP, AJAX is quickly turning into AJAJ, few people
care about the more obscure details of the XML 1.0 standard (when did
you last see a conditional section? or even a DTD?), dealing with huge
XML data sets is still extremely hard compared to just uploading the
darn thing to a database and doing the crunching in SQL, and nobody uses
XML 1.1 for anything.

Practicality beats purity, and the Internet routes around damage, every
single time.

overwhelming majority of the developers out there care for nothing
but the serialization, simply because that's how one plays nicely
with others.

The problem is if you only stare at the serialization, your code *won't*
play nicely with others. At the serialization level, it's easy to think
that CDATA sections are different from other text, that character
references are different from ordinary characters, that you should
somehow be able to distinguish between <tag></tag> and <tag/>, that
namespace prefixes are more important than the namespace URI, that an
&nbsp; in an XHTML-style stream is different from a U+00A0 character in
memory, and so on. In my experience, serialization-only thinking (at
the receiving end) is the single most common cause for interoperability
problems when it comes to general XML interchange.
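
To make the point above concrete, here is a small sketch (standard library only; the `shape` helper and the urn:example namespace are made up for illustration). Two documents that differ in CDATA use, character references, empty-element syntax and namespace prefix parse to the same data:

```python
import xml.etree.ElementTree as ET

one = ET.XML(
    '<doc xmlns:x="urn:example">'
    '<x:item><![CDATA[a < b]]></x:item><empty/></doc>'
)
two = ET.XML(
    '<doc xmlns:y="urn:example">'
    '<y:item>a &#60; b</y:item><empty></empty></doc>'
)

def shape(elem):
    """Reduce an element to (tag, text, children), treating None text as ''."""
    return (elem.tag, elem.text or '', [shape(child) for child in elem])

print(one[0].tag)                # {urn:example}item
print(shape(one) == shape(two))  # True
```

Code that compares the parsed trees never notices the differences; code that compares the raw strings trips over every one of them.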

But when you focus on the data model, and treat the serialization as an
implementation detail, to be addressed by a library written by someone
who's actually read the specifications a few more times than you have,
all those problems tend to just go away. Things just work.

And in practice, of course, most software engineers understand this, and
care about this. After all, good software engineering is about
abstractions and decoupling and designing things so you can focus on one
part of the problem at a time. And about making your customer happy,
and having fun while doing that. Not staying up all night to look for
an obscure interoperability problem that you finally discover is caused
by someone using a CDATA section where you expected a character
reference, in 0.1% of all production records, but in none of the files
in your test data set.

(By the way, did ET fail to *read* your XML documents? I thought your
complaint was that it didn't put the things it read in a place where you
expected them to be, and that you didn't have time to learn how to deal
with that because you had more important things to do, at the time?)

</F>

Nov 18 '06 #21

Chas Emerick

On Nov 18, 2006, at 11:29 AM, Fredrik Lundh wrote:

Chas Emerick wrote:

and keep patting ourselves on the back, while the rest of the world is busy routing
around
us, switching to well-understood XML subsets or other serialization
formats, simpler and more flexible data models, simpler API:s, and
more robust code. and Python ;-)

That's flatly unrealistic. If you'll remember, I'm not one of "those
people" that are specification-driven -- I hadn't even *heard* of
Infoset until earlier this week!

The rant wasn't directed at you or anyone special, but I don't really
think you got the point of it either. Which is a bit strange, because
it sounded like you *were* working on extracting information from
messy
documents, so the "it's about the data, dammit" way of thinking
shouldn't be news to you.

No, it's not any kind of news at all, and I'm very sympathetic to
your specific perspective (and have advocated it in other contexts
and circumstances, where appropriate). And yes, we are in fact
ensuring that we get from the HTML/XHTML/text/PDF/etc serialization
we have to consume to a uniform, normalized, and "clean" data model
in as few steps as possible. However, in those few steps, we have to
recognize the functional reality of how each data representation is
used out in the world in order to translate it into a uniform model
for our own purposes. In concrete terms, that means that an end tag
in an XHTML serialization means that that element is closed, done,
finit. Any other representation of that serialization doesn't
correspond properly with the intent of that HTML document's author.

And the routing around is not unrealistic, it is a *fact*; JSON and
POX are killing the full XML/Schema/SOAP stack for communication,
XHTML
is pretty much dead as a wire format, people are apologizing in public
for their use of SOAP, AJAX is quickly turning into AJAJ, few people
care about the more obscure details of the XML 1.0 standard (when did
you last see a conditional section? or even a DTD?), dealing with huge
XML data sets is still extremely hard compared to just uploading the
darn thing to a database and doing the crunching in SQL, and nobody
uses
XML 1.1 for anything.

Practicality beats purity, and the Internet routes around damage,
every
single time.

I agree 100% -- but I would have thought that that's a point I would
have made. The model that ET uses seems like a "purified"
representation of a mixed-content serialization, exactly because it
is geared to an ideal rather than the practical realities of mixed
content and expectations thereof.

For what it's worth, our current effort is directed towards providing
significant stores/feeds of XML/PDF/HTML/text/etc in something that
can be dropped into a RDBMS. Perhaps that's the source of the
impedance between us: you view Infoset as a functional replacement
for serialization-dependent XML, whereas we are focussed on what
could be broadly described as a translation from one to the other.

>overwhelming majority of the developers out there care for nothing
but the serialization, simply because that's how one plays nicely
with others.

The problem is if you only stare at the serialization, your code *won't*
play nicely with others. At the serialization level, it's easy to think
that CDATA sections are different from other text, that character
references are different from ordinary characters, that you should
somehow be able to distinguish between <tag></tag> and <tag/>, that
namespace prefixes are more important than the namespace URI, that an
&nbsp; in an XHTML-style stream is different from a U+00A0 character in
memory, and so on. In my experience, serialization-only thinking (at
the receiving end) is the single most common cause for interoperability
problems when it comes to general XML interchange.
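
A minimal sketch of the point above, assuming the standard library's
xml.etree.ElementTree (lxml should behave the same way for these cases,
but that is not checked here). The documents and the namespace URI
http://example.com/ns are made up for illustration. Each pair differs
only in how it is spelled on the wire, and parses to the same data:

```python
import xml.etree.ElementTree as ET

# A CDATA section, a predefined entity and a numeric character reference
# are three spellings of the same text content.
cdata = ET.fromstring("<doc><![CDATA[a < b]]></doc>")
entity = ET.fromstring("<doc>a &lt; b</doc>")
charref = ET.fromstring("<doc>a &#60; b</doc>")
same_text = cdata.text == entity.text == charref.text == "a < b"

# <tag></tag> and <tag/> both produce an element with no text and no children.
pair = ET.fromstring("<tag></tag>")
short = ET.fromstring("<tag/>")
same_empty = (pair.text is None and short.text is None
              and len(pair) == 0 and len(short) == 0)

# Different prefixes bound to the same (hypothetical) URI give the same name.
x = ET.fromstring('<x:item xmlns:x="http://example.com/ns"/>')
y = ET.fromstring('<y:item xmlns:y="http://example.com/ns"/>')
same_name = x.tag == y.tag == "{http://example.com/ns}item"

# A numeric reference to U+00A0 is just that character once parsed.
nbsp = ET.fromstring("<p>&#160;</p>")
same_char = nbsp.text == "\u00a0"
```

Code that compares the parsed values, rather than the raw bytes, cannot
tell these spellings apart, which is the interoperability argument being
made.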

I agree with all of that. I would again refer to the pervasive view
of what end tags mean -- that's what I was primarily referring to
with the term 'serialization'.

(By the way, did ET fail to *read* your XML documents? I thought your
complaint was that it didn't put the things it read in a place where you
expected them to be, and that you didn't have time to learn how to deal
with that because you had more important things to do, at the time?)

No, it doesn't put things in the right places, so I consider that a
failure of the model. I don't see why I should have spent time
learning how to deal with that when another very comprehensive
library is available that does meet expectations. *shrug*

Further, the fact that ET/lxml works the way that it does makes me
think that there may be some other landmines in the underlying model
that we might not have discovered until some days, weeks, etc., had
passed, so there's a much greater comfort level in working with a
library that explicitly supports the model that we expect (and was
assumed when the HTML [now XHTML] documents in question were authored).
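
For anyone following along, here is a minimal sketch of the behaviour
under discussion and of one possible workaround. It assumes the standard
library's xml.etree.ElementTree rather than lxml, and the helper name
remove_keep_tail is hypothetical, not part of either library:

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove child from parent, but keep the text that followed it.

    Hypothetical helper: the child's .tail is appended to the previous
    sibling's .tail, or to the parent's .text if there is no previous
    sibling, before the child is removed.
    """
    tail = child.tail or ""
    siblings = list(parent)
    index = siblings.index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


# Plain remove(): the tail text goes away together with the element.
plain = ET.fromstring("<a>head<b>inside</b>tail</a>")
plain.remove(plain.find("b"))
plain_result = ET.tostring(plain, encoding="unicode")

# With the helper, the tail text is reattached before the element is removed.
frag = ET.fromstring("<a>head<b>inside</b>tail</a>")
remove_keep_tail(frag, frag.find("b"))
kept_result = ET.tostring(frag, encoding="unicode")
```

Here plain_result is '<a>head</a>' and kept_result is '<a>headtail</a>',
the latter matching what the DOM removeChild example at the head of the
thread produced.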

- Chas

Nov 18 '06 #22

Fredrik Lundh

Chas Emerick wrote:

Further, the fact that ET/lxml works the way that it does makes me
think that there may be some other landmines in the underlying model
that we might not have discovered until some days, weeks, etc., had
passed

so the real reason you posted your original post was to spread some FUD,
not to get help? that's a bit disappointing.

</F>

Nov 18 '06 #23

Chas Emerick

On Nov 18, 2006, at 1:12 PM, Fredrik Lundh wrote:

Chas Emerick wrote:

>Further, the fact that ET/lxml works the way that it does makes me
think that there may be some other landmines in the underlying model
that we might not have discovered until some days, weeks, etc., had
passed

so the real reason you posted your original post was to spread some FUD,
not to get help? that's a bit disappointing.

<sarcasm>
Yeah, that's exactly it. In fact, if you look back at the head of
this thread, you'll see how I was looking to disparage ET. I
especially wanted to make sure ET's API doesn't get any traction in
the python community. It's especially important that ET not find
popular success and acclaim -- I'd have quite a bit to gain from it
remaining a niche library.
</sarcasm>

Fredrik, I wasn't attempting to spread anything. I was confused, I
posed some illustrative examples, and asked for people's thoughts.
Your reply gave me the right vocabulary to find more information
(i.e. about Infoset), and I replied with an overview of what I had
learned so as to benefit anyone with similar questions or confusion
in the future. A discussion ensued.

ET (and lxml) is obviously extremely successful, widely used, and for
good reason. It's just not right for us, but you incorrectly
surmised that I was simply lazy by not modifying/extending ET/lxml to
make it suitable for our purposes even when other libraries existed
that better meshed with our requirements. I tried to answer as
straightforwardly as possible, and (regrettably, it turns out)
included the fact that I had worried that our apparent conceptual
differences indicated that we might find other instances where
ET/lxml works differently than we would expect. I think that's very
rational, and doesn't speak poorly of ET in any way (especially given
its obvious success elsewhere).

- Chas

Nov 19 '06 #24

Uche Ogbuji

Fredrik Lundh wrote:

Uche Ogbuji wrote:

I certainly have never liked the aspects of the ElementTree API under
present discussion. But that's not as important as the fact that I
think the above statement is misleading. There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization.

sure, the computing world is and has always been full of people who want
the simplest thing to look a lot harder than it actually is. after all,
*they* spent lots of time reading all the specifications, they've bought
all the books, and went to all the seminars, so it's simply not fair
when others are cheating.

You sound bitter about something. Don't worry, it's really not all
that serious.

in reality, *all* interchange formats are easier to understand and use
if you focus on a (complete or intentionally simplified) data model of
the things being interchanged, and treat various artifacts of the
byte-stream used by the wire format as artifacts, historical accidents
based on what specification happened to be written before the other, or
what some guy did or did not do in the seventies, as accidents, and
esoteric arcana disseminated on limited-distribution mailing lists as
about as relevant for your customer as last week's episode of American Idol.

The fact that the XML Infoset is hardly used outside W3C XML Schema,
and that the XPath data model is far more common, and that focus on the
serialization is even more common than that is a matter of everyday
practicality.

And oh by the way, this thread is all about *your* customer's
complaining. And your response is to give them your philosophical take
on XML. Doesn't that contradict what you're saying above?

Oh never mind. You posted something misleading, and I posted another
point of view. I know you're incapable of any disagreement that
doesn't devolve into a full-scale flame-war. Sometimes I have time for
that sort of thing. This is not one of those times, so this is
probably where I get off.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

Nov 19 '06 #25

Uche Ogbuji

Paul McGuire wrote:

Thankfully, I'm largely on the periphery of that universe (except for being
a sometimes victim). But it is certainly frustrating to see many of the OMG
concepts of the 90's reimplemented in Java services, and then again in
XML/SOAP, with no detectable awareness that these messaging and
serialization problems have been considered before, and much more
thoroughly.

You'll be surprised at how many XMLers agree that Web services are a
pretty inept reinvention of CORBA. I was pretty much slain by this
take:

http://wanderingbarque.com/noninters...nds-for-simple

I think Duncan Grisby of OmniORB put it most succinctly when he pointed
out that SOAP and friends are more complex, more bloated, and less
interoperable than CORBA ever was. But they use XML so they get the
teacher's pet treatment.

I liked XML when I could read it and hack it out in Notepad.

You still can, and don't let anyone tell you otherwise. I've always
argued that XML doesn't work unless it's Notepad-hackable. I do
usually allow an exception for SVG.

I like
attributes, which puts me on the outs with most XML zealots who forswear the
use of attributes on purely academic grounds (they defeat the future
possible expansion of an attribute's value into more complex substructure).

Really? Do you have any references for this? I haven't seen much
criticism of attributes since the very early days, and almost all XML
technologies make heavy use of attributes. Here's my take:

http://www.ibm.com/developerworks/xm.../x-eleatt.html

As you can see, elements and attributes get equal billing.

I dislike namespaces, especially the default xmlns kind, as they make me
take extra steps when retrieving nodes via Xpaths; and everyone seems to
think their application needs namespaces, when there is no threat that these
tags will ever get mixed up with anyone else's.
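
A minimal sketch of the "extra steps" being described, assuming the
standard library's xml.etree.ElementTree; the feed/entry documents, the
URI http://example.com/ns and the prefix "f" are made up for
illustration:

```python
import xml.etree.ElementTree as ET

# The same structure, without and with a default namespace declaration.
plain = ET.fromstring("<feed><entry>one</entry></feed>")
namespaced = ET.fromstring(
    '<feed xmlns="http://example.com/ns"><entry>one</entry></feed>'
)

# Without a namespace, the obvious path works.
plain_hit = plain.find("entry")

# With a default namespace, the same path silently matches nothing,
# because the element's real name now includes the namespace URI.
naive_hit = namespaced.find("entry")

# The extra step: bind a prefix of our own choosing to the URI and use it
# in the path, even though the document itself never used a prefix.
ns = {"f": "http://example.com/ns"}
mapped_hit = namespaced.find("f:entry", ns)
```

In this sketch naive_hit is None while mapped_hit is the entry element,
whose tag is stored as '{http://example.com/ns}entry'.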

Namespaces are possibly the worst thing to have ever happened to XML.
Again, my take:

http://www.ibm.com/developerworks/xm.../x-namcar.html

And yes, default namespaces are about 50% of the problem with
namespaces. QNames in content (which are of course an abuse of
namespaces) are almost all of the other 50%. I call them "hidden
namespaces":

http://copia.ogbuji.net/blog/2006-08-14/Some_thoug

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

Nov 19 '06 #26

Diez B. Roggisch

You'll be surprised at how many XMLers agree that Web services are a
pretty inept reinvention of CORBA. I was pretty much slain by this
take:

http://wanderingbarque.com/noninters...nds-for-simple

Thanks for that! Sums up nicely my experiences, and gave me a good chuckle!

While I liked the idea of AXIS reflecting my java code in the first
place (as long as interoperability only meant "I can test my own code"),
it sucked soooo hard when trying to make it work with anything else
(including python of course).

And I don't know why I've complained about this style of inverse
interface generation on so many other occasions (e.g. COM interfaces in
VStudio, JBuilder GUI design and so on), but could never quite put my
finger on what disturbed me about SOAP.

Probably because looking at a WSDL it immediately made me shrink away
from that mess and hope that there must be _some_ merciful deity that
will produce that crap for me, so that I never asked myself the right
questions....

Diez

Nov 19 '06 #27

Fredrik Lundh

Uche Ogbuji wrote:

The fact that the XML Infoset is hardly used outside W3C XML Schema,
and that the XPath data model is far more common, and that focus on
the serialization is even more common than that is a matter of
everyday practicality.

everyday interoperability problems, that is. yesterday, someone
reported a bug in Python's xml.dom because he couldn't get it to
serialize the string " " as " ". earlier today, someone
asked how to work around an XML parser that didn't understand
namespace prefixes.

And oh by the way, this thread is all about *your* customer's
complaining.

from what I can tell, it was *your* customer posting FUD about a
different library, not my customer asking for help with a specific
problem. this is free software; people who use a piece of software
count a *lot* more than people who don't want to use it.

This is not one of those times, so this is probably where I get off.

I'll be looking forward to your next O'Reilly article.

</F>

Nov 19 '06 #28

Chas Emerick

On Nov 19, 2006, at 9:55 AM, Fredrik Lundh wrote:

>And oh by the way, this thread is all about *your* customer's
complaining.

from what I can tell, it was *your* customer posting FUD about a
different library, not my customer asking for help with a specific
problem. this is free software; people who use a piece of software
count a *lot* more than people who don't want to use it.

Holy hell Fredrik -- I hadn't even *downloaded* 4suite before I
posted my original question. I've tried to be nice, tried to be
complimentary, and tried to be diplomatic, so it would be nice if
*everyone* would stop casting aspersions or otherwise speculating
about my intentions. Flame amongst yourselves, but leave me out of it.

- Chas

Nov 19 '06 #29

Fredrik Lundh

Uche Ogbuji wrote:

The fact that the XML Infoset is hardly used outside W3C XML Schema,
and that the XPath data model is far more common,

and for the bystanders, it should be noted that the Infoset is pretty
much the same thing as the XPath data model; it's mostly just that the
specifications use different names for the same concept. if you cut
through the vocabulary, it's all about a tree of elements, plus text and
attributes and a few more (but usually less interesting) things. it's a
bit like arguing that

    class Person(object):
        __slots__ = ["name"]
        def __init__(self, name):
            self.name = name

and

    class Employee:
        def __init__(self, first_name, last_name):
            self.full_name = first_name + " " + last_name

and

employee_name = "..."

are entirely different things, and not just three more or less
convenient ways to store exactly the same information.

</F>

Nov 19 '06 #30

Damjan

sure, the computing world is and has always been full of people who want
the simplest thing to look a lot harder than it actually is. after all,
*they* spent lots of time reading all the specifications, they've bought
all the books, and went to all the seminars,

and have been sold all the expensive proprietary tools

so it's simply not fair when others are cheating.

--
damjan

Nov 19 '06 #31
