whitespace in element content

Hello,

It is often convenient to insert whitespace into an XML document in order to
format it nicely. For example, take this snippet of a notional DocBook XML
document:

<para>
This is a longer paragraph.
With <wordasword>longer</wordasword> I mean that it contains more than
one sentence.
</para>

I want the whitespace in this snippet of code to be handled as follows:

(1) The whitespace between "<para>" and "This" as well as the whitespace
between "sentence." and "</para>" shall be discarded.

(2) Each other sequence of adjacent whitespace characters shall be
transformed into a single space character.

But how do XML processors and applications deal with this issue?

In section 2.10 of "Extensible Markup Language (XML) 1.0 (Third Edition)",
one can read:

In editing XML documents, it is often convenient to use "white
space" (spaces, tabs, and blank lines) to set apart the markup for
greater readability. Such white space is typically not intended for
inclusion in the delivered version of the document.

But who decides which whitespace shall be considered as whitespace that is
just used to set apart the markup? And is whitespace just used to indent
lines of text also not intended for inclusion in the delivered version?
What is this "delivered version" of the document?

I'd be thankful for any clarification.

Best wishes,
Wolfgang
Jul 20 '05 #1
2 Replies


In article <2u*************@uni-berlin.de>,
Wolfgang Jeltsch <je*****@tu-cottbus.de> wrote:
> But who decides which whitespace shall be considered as whitespace that is
> just used to set apart the markup? And is whitespace just used to indent
> lines of text also not intended for inclusion in the delivered version?
> What is this "delivered version" of the document?

As far as the XML spec is concerned, deciding which whitespace is
significant or not is a job for the application, which really means
"everything except the parser". A conformant parser must give all the
whitespace to the application, which can then decide what to do with
it.
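
To see what "all the whitespace" means in practice, here is a minimal sketch using Python's standard xml.etree.ElementTree module (my choice for illustration, not something named in this thread). It parses the paragraph from the question and prints the character data exactly as the parser hands it over.

```python
import xml.etree.ElementTree as ET

# The <para> snippet from the original question, line breaks included.
SNIPPET = """<para>
This is a longer paragraph.
With <wordasword>longer</wordasword> I mean that it contains more than
one sentence.
</para>"""

para = ET.fromstring(SNIPPET)
word = para.find("wordasword")

# The parser reports every whitespace character in the element content,
# including the newline right after <para> and the one before </para>.
print(repr(para.text))  # text between <para> and <wordasword>
print(repr(word.tail))  # text between </wordasword> and </para>
```

Nothing is stripped at this stage; whether those newlines matter is left to the code that consumes the tree.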

Of course, there may be other standard programs or libraries layered
on top of the XML parser which you might not consider to be the
application. XSLT for example allows you to specify that some
whitespace is to be stripped from its input. From the point of view
of the parser, XSLT is the application, but you may regard it as just
a library that you're using.

> I want the whitespace in this snippet of code to be handled as follows:
>
> (1) The whitespace between "<para>" and "This" as well as the whitespace
> between "sentence." and "</para>" shall be discarded.
>
> (2) Each other sequence of adjacent whitespace characters shall be
> transformed into a single space character.


This is a fairly common form of whitespace normalization and often
goes under the name of "tokenization". For example, XML itself
normalizes attribute values of tokenized types (such as NMTOKENS or
IDREFS) this way. Among other things, you could use an XML Schema
processor to do this normalization, for instance through the xs:token
type, whose whiteSpace facet is "collapse".
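
If no schema processor is at hand, the two rules from the question can also be applied in application code. The following is only a sketch under my own assumptions: it uses Python's xml.etree.ElementTree, the helper names collapse and normalize_mixed are made up for this example, and whitespace directly inside a child element's tags is collapsed but not trimmed.

```python
import re
import xml.etree.ElementTree as ET

# The four characters XML treats as whitespace: space, tab, CR, LF.
XML_WS = re.compile(r"[ \t\r\n]+")

def collapse(text):
    """Replace each whitespace run with one space (no trimming)."""
    return XML_WS.sub(" ", text or "")

def normalize_mixed(elem):
    """Apply rules (1) and (2) to the mixed content of elem, in place."""
    # Rule (2): collapse runs in every text node below elem.
    for node in elem.iter():
        node.text = collapse(node.text)
        if node is not elem:
            node.tail = collapse(node.tail)
    # Rule (1): drop whitespace right after the start tag ...
    elem.text = elem.text.lstrip(" ")
    # ... and right before the end tag.
    children = list(elem)
    if children:
        children[-1].tail = children[-1].tail.rstrip(" ")
    else:
        elem.text = elem.text.rstrip(" ")
    return elem

para = normalize_mixed(ET.fromstring(
    "<para>\nThis is a longer paragraph.\n"
    "With <wordasword>longer</wordasword> I mean that it contains more than\n"
    "one sentence.\n</para>"
))
print(ET.tostring(para, encoding="unicode"))
```

The regular expression is used instead of str.split() because Python's notion of whitespace is wider than XML's.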

-- Richard
Jul 20 '05 #2

My parser uses what could be called the "newline whitespace assertion",
namely:

(1) Any initial whitespace is ignored.

(2) Any whitespace following a newline is eaten and replaced with a single
space (unless it is at the end of the text).

For example,

<foo>Hello World
Again</foo>

is parsed as:

<foo>Hello World Again</foo>
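
One possible reading of this rule, written as a small Python function over a single text node. This is a guess at the behaviour described above, not the poster's actual parser code; the name newline_rule is hypothetical, and whitespace that does not follow a newline is deliberately left untouched.

```python
import re

def newline_rule(text):
    """Apply the 'newline whitespace' rule to one text node."""
    # Any initial whitespace is ignored.
    text = text.lstrip(" \t\r\n")
    # A newline plus following whitespace at the very end is dropped ...
    text = re.sub(r"\r?\n[ \t\r\n]*\Z", "", text)
    # ... anywhere else it becomes a single space.
    return re.sub(r"\r?\n[ \t\r\n]*", " ", text)

print(newline_rule("Hello World\n    Again"))
```

Note that this rule keeps runs of spaces inside a line as they are, so it is weaker than the normalization asked for in the original question.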

Jul 20 '05 #3
