XML CDATA special characters

Hi.

I'm trying to develop a program that uses XML files to store data. I'm using
Windows XP, Apache 1.3.29 and PHP 4.3.4.

Right now the XML file is read using the xml_parser_create(),
xml_set_element_handler() etc. functions. I have difficulties with special
characters in the data.

I found information on "<![CDATA[ special chars here ]]>", UTF-8, XML DOM,
htmlentities(), and more, but I'm confused with all these terms and their
meaning.

I think I should use CDATA sections anyhow, right? Or is this UTF-8 a way to
use special characters without bothering the XML parser?

Long ago I used a DOM in Perl and liked it, is it hard to use the PHP XML
DOM and is it (part of a) solution to my problem?

Right now (with the xml_parser_ functions) my program outputs something like
<img alt="Data from XML file, sometimes with "quotes".">
to the browser, which isn't right because of the early end-quote. Where and
how should I avoid this? This is where htmlentities fits in, right? And I
once read something about PHP settings dealing with HTML characters.

It's not that I'm lazy, but there's a lot of information on a lot of
interrelated subjects. Who can help me out here please?

Regards,

- John van Terheijden, the Netherlands
Jul 17 '05 #1
6 Replies


For a start, if you are "creating" XML content, then you need to use the
DOM API and not the SAX API. As far as I am aware, the SAX API is just
for "reading" XML data and not for writing it. Someone please correct me
if I am wrong.

The DOM API will conveniently do all special character escaping for you
so you don't have to worry about using functions *like* htmlentities().
On that point, basic XML only has 5 pre-defined default entities. And
off the top of my head, I think they are:
> -- &gt;
< -- &lt;
" -- &quot;
& -- &amp;
[insert fifth one here]

The other one escapes me (no pun intended). If you try and use HTML
entities, then you will likely create invalid XML documents because HTML
has entities that are "undefined" in the default XML set.

When you use an XML parser (be it SAX, or DOM) to get the data back from
the XML storage files, everything (including entities) will be converted
back (un-escaped). So you really do not need to use CDATA sections.
CDATA sections do have their uses, but their absolute necessity is
limited to very few cases.
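
As a rough illustration of the round trip described above, here is a minimal sketch. It assumes the DOMDocument class from PHP 5 and later, not the PHP 4 DOM XML functions discussed in this thread, and the element name "item" is made up for the example. It writes a string containing markup characters, reads it back, and checks that a CDATA section carrying the same text parses to the same value.

```php
<?php
// Sketch only: DOMDocument (PHP 5+) is an assumption; PHP 4's DOM XML API differs.
$raw = 'Data with "quotes", <tags> & ampersands';

// Writing: text nodes are escaped when the document is serialised.
$doc  = new DOMDocument('1.0', 'UTF-8');
$item = $doc->createElement('item');            // "item" is a hypothetical element name
$item->appendChild($doc->createTextNode($raw)); // createTextNode takes the raw string
$doc->appendChild($item);
$xml = $doc->saveXML();

// Reading: the parser turns the entity references back into characters.
$back = new DOMDocument();
$back->loadXML($xml);
$roundTrip = $back->documentElement->textContent;

// The same text wrapped in a CDATA section parses to the same string.
$cdata = new DOMDocument();
$cdata->loadXML('<item><![CDATA[' . $raw . ']]></item>');
$fromCdata = $cdata->documentElement->textContent;

var_dump($roundTrip === $raw, $fromCdata === $raw);
```

Both comparisons should come out true, which is the sense in which CDATA is a convenience for the author of the file rather than something the reading code depends on.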

SPECIAL NOTE ON XSL STYLESHEETS:
If you are using XSL templates to extract HTML markup contained
(escaped) in the XML storage files, use the disable-output-escaping
attribute of the value-of directive to disable output escaping. This is
useful if you have done something like this...
$element->set_content($htmlSource);
and you wish the output tree to contain unescaped HTML.

As for character encoding (UTF8 etc), it depends on what sort of data
you are putting in there. Odds are you needn't concern yourself with
this unless you know that your source data is UTF-16 or something. Just
try using the DOM XML functions and see how you go.

Jul 17 '05 #2

Terence wrote:
> On that point, basic XML only has 5 pre-defined default entities. And
> off the top of my head, I think they are:
>
> > -- &gt;
> < -- &lt;
> " -- &quot;
> & -- &amp;
> [insert fifth one here]
>
> The other one escapes me (no pun intended).


The other one was introduced by XML 1.0, and doesn't exist in HTML 4 or
any earlier HTML version. It's U+0027 APOSTROPHE ("'"), with an entity
reference of &apos;, a decimal character reference of &#39;, and a
hexadecimal character reference of &#x27; (XML 1.0 sec. 4.6).
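
To make the three spellings concrete, a small sketch follows. It assumes PHP 5+'s DOMDocument, and the root element "r" is invented for the example. It parses all three references and checks that each one yields a plain apostrophe.

```php
<?php
// Three ways to write U+0027 in XML source; "r" is a hypothetical root element.
$xml = '<r>&apos;&#39;&#x27;</r>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$text = $doc->documentElement->textContent;

// Each reference is resolved by the parser to the same character.
var_dump($text === "'''");
```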
> If you try and use HTML entities, then you will likely create invalid
> XML documents because HTML has entities that are "undefined" in the
> default XML set.


OK.

On the other hand, htmlspecialchars converts, at most, just those
five characters to their respective entity references (or to a decimal
character reference in the case of the ASCII apostrophe, since HTML 4
defines no entity reference for it). The
ENT_QUOTES mode converts both single- and double-quotes; the default
mode, ENT_COMPAT, only converts double-quotes.

http://www.php.net/manual/en/functio...ecialchars.php
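
Applied to the alt attribute from the original question, that could look like the sketch below. The flags and charset are passed explicitly because the defaults have not been the same across PHP versions, and the 'UTF-8' charset is an assumption about the data.

```php
<?php
// The value as it comes back (un-escaped) from the XML parser.
$fromXml = 'Data from XML file, sometimes with "quotes".';

// Escape at the moment the value is written into HTML, not when it is stored.
// ENT_QUOTES covers both " and '; 'UTF-8' is an assumed charset.
$alt = htmlspecialchars($fromXml, ENT_QUOTES, 'UTF-8');
$tag = '<img alt="' . $alt . '">';

echo $tag, "\n";
// prints: <img alt="Data from XML file, sometimes with &quot;quotes&quot;.">

// The apostrophe becomes a decimal character reference.
$apos = htmlspecialchars("it's", ENT_QUOTES, 'UTF-8');
```

The point of the design is that the XML file keeps the plain text, and escaping for HTML happens only on output.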

--
Jock
Jul 17 '05 #3

I didn't mention SAX, is that the standard PHP parser I'm using now? I
thought it was Expat. Thanks for making this even more confusing ;)

Ok, I'll just dive into DOM now and see where this will all end up. I'll
probably come across all the terms again, in time. B.t.w. I don't understand
much of your XSL note, probably because I know very little about XSL. I'm
using XML to store data while avoiding databases.

Thanks!

Jul 17 '05 #4

John van Terheijden wrote:
> I didn't mention SAX, is that the standard PHP parser I'm using now? I
> thought it was Expat. Thanks for making this even more confusing ;)

Yeah, it's a bit like that. I didn't want to include too much
explanation, or else I'd be in danger of writing a huge article. Trust me,
restraint is a good thing for me. When you're on the newbie end of a
technology, it's best just to pretend you never read/heard the
stuff that confused you (initially, of course).

The Simple API for XML (SAX) style of parsing is indeed what PHP's
inadequately named "XML extension" gives you. And yes, it is based on
Expat (product name). Strictly speaking, SAX is an API specification,
and Expat is a parser product with its own event-driven, SAX-like API.
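
For reference, the event-driven style looks roughly like the sketch below, using the same xml_parser_* functions mentioned in the original question. The closures assume PHP 5.3 or later (on PHP 4 you would pass function names instead), and the sample document is invented. Note that the character data handler can be called several times for one piece of text, so the sketch appends rather than assigns.

```php
<?php
// Hypothetical sample input with escaped characters in an attribute and in text.
$xml = '<item title="a &quot;b&quot;">x &lt; y &amp; z</item>';

$text  = '';
$attrs = array();

$parser = xml_parser_create('UTF-8');
// Keep element and attribute names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $a) use (&$attrs) { $attrs = $a; }, // start tag
    function ($p, $name) {}                                   // end tag
);
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$text) { $text .= $data; }     // may fire in pieces
);

xml_parse($parser, $xml, true);

// Handlers receive un-escaped values: a "b" and x < y & z.
var_dump($attrs['title'], $text);
```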

DOM is a standard, PHP uses the libxml product which implements that
standard. PHP5 is slated to use libxml2 which is very exciting indeed :)

If you don't know anything about XSLT, then ignore the tip I gave to
XSLT users who might take my advice on the [no need to use] CDATA issue.
XSLT is a whole new kettle of fish, don't go there until you have a firm
grasp on XML.

I recommend familiarising yourself with the XML "infoset". You will find
the "infoset" standard on the w3c website. Do not panic, it is a
relatively short document that can be skimmed quite readily. Don't get
depressed if it all doesn't stick the first time. At least *familiarise*
yourself with the *concept* of the infoset. There should be an
introduction/primer type article there.

Jul 17 '05 #5

Thanks for the reply.

"Terence" <tk******@fastmail.fm> wrote in message
news:3fbaada8$1@herald...
> John van Terheijden wrote:
> > I didn't mention SAX, is that the standard PHP parser I'm using now? I
> > thought it was Expat. Thanks for making this even more confusing ;)
>
> Yeah, it's a bit like that. I didn't want to include too much
> explanations else I'd be in danger of writing a huge article. Trust me,
> restraint is a good thing for me. When you're on the newbie end of a
> technology, then it's best just to pretend you never read/heard the
> stuff that confused you (initially of course).

I agree. It's always hard to choose between learning by reading or by
practice. Most of the time, "the other one" would have been faster.

> Simple Api for Xml (SAX) is indeed what PHP's inadequately named the
> "XML extension". And yes, it is based on the Expat (product name)
> implementation of SAX. SAX is a standard, Expat is a product that
> implements that standard.
>
> DOM is a standard, PHP uses the libxml product which implements that
> standard. PHP5 is slated to use libxml2 which is very exciting indeed :)

Thanks for clearing that up! Btw, I think I like how DOM works better than
how SAX works. However, I believe that very much depends on the type of
XML data involved.

> If you don't know anything about XSLT, then ignore the tip I gave to
> XSLT users who might take my advice on the [no need to use] CDATA issue.
> XSLT is a whole new kettle of fish, don't go there until you have a firm
> grasp on XML.

ok :)

> I recommend familiarising yourself with the XML "infoset". You will find
> the "infoset" standard on the w3c website. Do not panic, it is a
> relatively short document that can be skimmed quite readily. Don't get
> depressed if it all doesn't stick the first time. At least *familiarise*
> yourself with the *concept* of the infoset. There should be an
> introduction/primer type article there.


I had a quick look and will read it.

Thanks!
Jul 17 '05 #6

"John van Terheijden" <john-foobar-nl> wrote in message news:<3f***********************@news.versatel.net> ...
> Hi.
>
> I'm trying to develop a program that uses XML files store data. I'm using
> Windows XP, Apache 1.3.29 and PHP 4.3.4.


I can't understand why people mess with XML when PHP with a
simple database (like MySQL, PostgreSQL or SQLite) can do the job
better.

XML can be used effectively to share data between two domains.
But there are people who plow through XML within their own domain; I have
also seen a number of people dump their data from the DB into XML and then
mess with the XML.

There are also some people who still fight against PHP's cool
short tags on behalf of messy XML.

---
"One who mix sports and patriotism is a barbarian"
Email: rrjanbiah-at-Y!com
Jul 17 '05 #7
