Why treat text nodes as nodes?

Xamle Eng

One of the things I find most unnatural about most XML APIs is that
they try to abstract both elements and text into some kind of "node"
object when they have virtually nothing in common. The reason these
APIs do it is to make it possible for both text and elements to be
children of elements.

But there is another way.

The XPath/XQuery data model does not allow two consecutive text nodes.
As far as I can tell, most XML processing software automatically merges
consecutive text nodes. This means that the number of text segments
directly under an element is bound by the number of sub-elements plus 1
(PIs and comments may be treated as "pseudo-elements" for this
purpose). As a result, it is always possible to associate each text
segment with the element immediately preceding it within the parent and
associate the first text element with the parent itself.
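
The counting argument can be checked with a small sketch. It assumes Python's standard-library xml.dom.minidom as the DOM-style API; the sample fragment and the helper name text_and_other_counts are made up for illustration.

```python
from xml.dom import minidom

def text_and_other_counts(xml_source):
    """Return (text children, non-text children) of the root element."""
    root = minidom.parseString(xml_source).documentElement
    root.normalize()  # merge adjacent text nodes, as the data model requires
    children = root.childNodes
    text = sum(1 for node in children if node.nodeType == node.TEXT_NODE)
    return text, len(children) - text

# The comment counts as a "pseudo-element" that separates two text segments.
text_count, other_count = text_and_other_counts("<p>a<b/>c<!-- x -->d<i/></p>")
print(text_count, other_count)
```

Here there are three text segments and three separators, so the text count stays within the "plus 1" bound.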

No more text nodes.

The only API I know that uses this trick is the ElementTree API for
Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
Each Element object has a text and tail property for the text
immediately inside the element and text following it within its parent
element. Elements always have a tag, attributes and zero or more
children - which are always other elements. No mixed types. The text
and tail attributes are always strings. This model should be very
convenient for statically-typed languages like Java or C++. I find it
ironic that this idea is probably used only in Python, a dynamically
typed language that is much more comfortable with mixed data types.
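
A minimal sketch of the text/tail model, assuming the version of ElementTree that now ships in the Python standard library as xml.etree.ElementTree. The sample fragment is invented for illustration.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>big</b> wide <i>world</i>!</p>")

# Text before the first child lives on the parent itself.
print(repr(p.text))

# Every other text segment is the tail of the child that precedes it.
for child in p:
    print(child.tag, repr(child.text), repr(child.tail))
```

The children of p are only the b and i elements. The strings "Hello ", " wide " and "!" are reachable as p.text, the tail of b and the tail of i.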

This form of API is very suitable for data-oriented XML applications
that don't use mixed elements: for leaf elements just use the .text
attribute and ignore everything else. Container elements use the
element's children which are always other elements. The text attribute
of an element can be ignored if it has children. No need to explicitly
skip it. Tails are always ignored, unless used to indent the output,
which can be done easily without disturbing the rest of the data.
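
A sketch of the data-oriented case, again assuming xml.etree.ElementTree from the standard library. The order document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<order id="17">
  <customer>Ann</customer>
  <item><name>bolt</name><qty>4</qty></item>
  <item><name>nut</name><qty>9</qty></item>
</order>""")

# Leaf elements: read the text and nothing else.
customer = doc.findtext("customer")

# Container elements: iterate over children, which are always elements.
items = [(item.findtext("name"), int(item.findtext("qty")))
         for item in doc.findall("item")]

print(customer, items)
```

The indentation whitespace ends up in doc.text and in the tails of the children, so the code above never has to skip over it.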

For document-oriented XML it may be slightly awkward to look at both
the text and tail but I don't think it should be any more difficult
than dealing with mixed data types.

The only real downside seems to be that this API is non-standard. But
the advantages can easily compensate for that.

Would you like to see an API like this in Java? Do you know of any
implementations of this idea in any language other than Python?

XE

Jul 20 '05 #1

Richard Tobin

In article <11**********************@g47g2000cwa.googlegroups .com>,
Xamle Eng <xa*******@gmail.com> wrote:

> For document-oriented XML it may be slightly awkward to look at both
> the text and tail but I don't think it should be any more difficult
> than dealing with mixed data types.

It seems very unnatural to me. If you have

See <a href="...">my page</a> for more details

why on earth would you want to associate the text " for more details"
with the <a> element preceding it? The usual way of handling it -
some text, followed by an <a> element, followed by some more text - is
exactly right.

There are some applications where whitespace can usefully be
associated with the preceding element, but a general-purpose API
should not assume even that.

-- Richard

Jul 20 '05 #2

Xamle Eng

Richard Tobin wrote:

> In article <11**********************@g47g2000cwa.googlegroups .com>,
> Xamle Eng <xa*******@gmail.com> wrote:
>> For document-oriented XML it may be slightly awkward to look at both
>> the text and tail but I don't think it should be any more difficult
>> than dealing with mixed data types.
> It seems very unnatural to me. If you have
>
> See <a href="...">my page</a> for more details
>
> why on earth would you want to associate the text " for more details"
> with the <a> element preceding it?

As I said, this model is probably more natural for data-oriented XML,
but I think it's perfectly usable for document-oriented XML, too. It
preserves the structural information and makes it accessible to your
code in a form where everything has exactly one type, known in advance
at compile time. The tail association is totally arbitrary but it works
very well in practice. Try it. Write some code. Don't always trust your
initial gut reaction. I find that code using the ElementTree API is far
shorter and easier to read than with DOM or DOM-like APIs.

> There are some applications where whitespace can usefully be
> associated with the preceding element, but a general-purpose API
> should not assume even that.

It doesn't assume that. And it isn't "usefully" associated - it's
just a place to put it that is consistent, easy to access when you need
it and easier to ignore when you don't.
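
A sketch of how the anchor example from earlier in the thread looks through this API, assuming xml.etree.ElementTree from the Python standard library. The enclosing p element and the helper name segments are hypothetical.

```python
import xml.etree.ElementTree as ET

def segments(parent):
    """Yield the direct content of parent in document order.

    Strings are text segments; Element objects are child elements.
    """
    if parent.text:
        yield parent.text
    for child in parent:
        yield child
        if child.tail:  # tail is None when no text follows the child
            yield child.tail

p = ET.fromstring('<p>See <a href="...">my page</a> for more details</p>')

# Mixed content in order: text, element, text.
parts = [s if isinstance(s, str) else "<%s>" % s.tag for s in segments(p)]
print(parts)

# When only the text matters, itertext() walks text and tails for you.
plain = "".join(p.itertext())
print(plain)
```

The loop never needs to remember a previous item, because each tail is read from the child currently in hand.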

XE

Jul 20 '05 #3

Richard Tobin

In article <11**********************@g47g2000cwa.googlegroups .com>,
Xamle Eng <xa*******@gmail.com> wrote:

> Try it. Write some code.

I don't think so. I have perfectly good interfaces already, I'm not going
to switch to an obviously silly interface because someone says "try it".

> It doesn't assume that. And it isn't "usefully" associated - it's
> just a place to put it that is consistent, easy to access when you need
> it and easier to ignore when you don't.

How is it "easy to access" when I have to keep hold of the previous item
to access it? And I have to do something different for the first text node
than all the others.

-- Richard

Jul 20 '05 #4

Soren Kuula

Xamle Eng wrote:

> One of the things I find most unnatural about most XML APIs is that
> they try to abstract both elements and text into some kind of "node"
> object when they have virtually nothing in common. The reason these
> APIs do it is to make it possible for both text and elements to be
> children of elements.

With seven node types (element, attribute, text, NS node, comment, PI
and document/root), it won't be that much of a cleanup to remove one?

> But there is another way.
>
> The XPath/XQuery data model does not allow two consecutive text nodes.
> As far as I can tell, most XML processing software automatically merges
> consecutive text nodes. This means that the number of text segments
> directly under an element is bound by the number of sub-elements plus 1
> (PIs and comments may be treated as "pseudo-elements" for this
> purpose). As a result, it is always possible to associate each text
> segment with the element immediately preceding it within the parent and
> associate the first text element with the parent itself.

...then the first text segment is sort of semantically different from
the rest? It will be found on the parent -- the rest on its children?

> This model should be very
> convenient for statically-typed languages like Java or C++. I find it
> ironic that this idea is probably used only in Python, a dynamically
> typed language that is much more comfortable with mixed data types.

Yes, the general Node type can make things look clumsy sometimes.
Polymorphism is for solving that ..., or generics:

Iterator<Element> children()
Iterator<Text> textNodes()

...etc are no problem to implement efficiently.

> For document-oriented XML it may be slightly awkward to look at both
> the text and tail but I don't think it should be any more difficult
> than dealing with mixed data types.

It could get confusing that the first text element under a parent is
handled differently from the rest -- you have to look it up on the parent.

> The only real downside seems to be that this API is non-standard. But
> the advantages can easily compensate for that.

Instead of mixed representation types in mixed contents, don't you just
get a pile of .tail references that you have to check for nullity as you
iterate over element contents? Not all that much better, I think :) (and
harder to describe).

> Would you like to see an API like this in Java? Do you know of any
> implementations of this idea in any language other than Python?

No, don't know. But the idea of replacing some parent to child
relationships in trees by sibling to sibling relationships is not at all
new :)

Soren

Jul 20 '05 #5

Andy Dingley

On 13 May 2005 11:33:10 -0700, "Xamle Eng" <xa*******@gmail.com> wrote:

> As a result, it is always possible to associate each text
> segment with the element immediately preceding it within the parent and
> associate the first text element with the parent itself.

I'll hold him down, someone else can break his fingers.

That's the most fuckwittedly stupid idea I've read on the whole of
usenet in the last week.

The web is a great thing. Even "internet time" is quite fun, when it's
all rolling along nicely. But can we _please_ do without the clueless
muppet teenage genius code-jockeys who don't have the first bloody clue
about what's a good design and what's blecherous. Back in the day you'd
have written maybe 100k+ lines of something before you even got near
writing anything as fun as DOM-walking code. You might not be an expert
yet, but you gained some sense of smell for stinking bad designs.

Now any bloody idiot thinks they can re-invent important back-end
components, IE can't work out how to render a simple rectangular box and
my credit card gets pwned by Ukrainians because some muppet thought that
raw PHP made for a k00l file include mechanism.
--
Cats have nine lives, which is why they rarely post to Usenet.

Jul 20 '05 #6

Peter Flynn

Xamle Eng wrote:

> One of the things I find most unnatural about most XML APIs is that
> they try to abstract both elements and text into some kind of "node"
> object when they have virtually nothing in common. The reason these
> APIs do it is to make it possible for both text and elements to be
> children of elements.

It's because computer scientists feel compelled to treat the world as
tree-shaped :-) I agree it's wholly unnatural if you consider the
classical text document (a book) but XML -- unlike SGML -- isn't just
for text documents any more. This has had the unfortunate effect that
many otherwise level-headed people find it fashionable now to pretend
that XML isn't used for text documents at all any more, so they need
not be taken into consideration. You will even find programmers being
shocked to discover XML can be used for text documents :-)

> But there is another way.
>
> The XPath/XQuery data model does not allow two consecutive text nodes.

Worse, the wholly extraordinary decision in XSLT to elide white-space
nodes between adjacent element nodes *in mixed content* as part of the
"strip-space" feature is very strongly to be deprecated, as it breaks
the model of almost any heavily-marked text document.

> [...] No more text nodes.
>
> The only API I know that uses this trick is the ElementTree API for
> Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
> Each Element object has a text and tail property for the text
> immediately inside the element and text following it within its parent
> element. Elements always have a tag, attributes and zero or more
> children - which are always other elements. No mixed types.

This has been tried many times and found wanting. The most notorious
was perhaps the EuroMath DTD, which was possibly the only project to
implement it successfully!

> [...] Would you like to see an API like this in Java? Do you know of any
> implementations of this idea in any language other than Python?

I think there are many other things I'd rather see first. YMMV.

///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"

Jul 20 '05 #7

Fredrik Lundh

> clueless muppet teenage genius code-jockeys

lovely ;-)

mind if I quote you on the elementtree page?

</F>

Jul 20 '05 #8

Fredrik Lundh

> How is it "easy to access" when I have to keep hold of the previous item
> to access it? And I have to do something different for the first text node
> than all the others.

if you don't understand how it works, how can you be so sure that it's
"obviously silly"?

</F>

Jul 20 '05 #9
