XML CDATA special characters

Hi.

I'm trying to develop a program that uses XML files to store data. I'm using
Windows XP, Apache 1.3.29 and PHP 4.3.4.

Right now the XML file is read using the xml_parser_create(),
xml_set_element_handler() etc. functions. I have difficulties with special
characters in the data.

I found information on "<![CDATA[ special chars here ]]>", UTF-8, XML DOM,
htmlentities(), and more, but I'm confused by all these terms and what
they mean.

I think I should use CDATA sections anyhow, right? Or is UTF-8 a way to
use special characters without bothering the XML parser?

Long ago I used a DOM in Perl and liked it, is it hard to use the PHP XML
DOM and is it (part of a) solution to my problem?

Right now (with the xml_parser_ functions) my program outputs something like
<img alt="Data from XML file, sometimes with "quotes".">
to the browser, which isn't right because of the early end-quote. Where and
how should I avoid this? This is where htmlentities fits in, right? And I
once read something about PHP settings dealing with HTML characters.

It's not that I'm lazy, but there's a lot of information on a lot of
interrelated subjects. Who can help me out here please?

Regards,

- John van Terheijden, the Netherlands
Jul 17 '05 #1
For a start, if you are "creating" XML content, then you need to use the
DOM API and not the SAX API. As far as I am aware, the SAX API is just
for "reading" XML data and not writing to it. Someone please correct me
if I am wrong.

The DOM API will conveniently do all special character escaping for you
so you don't have to worry about using functions *like* htmlentities().
On that point, basic XML only has 5 predefined entities. And
off the top of my head, I think they are:
> -- &gt;
< -- &lt;
" -- &quot;
& -- &amp;
[insert fifth one here]

The other one escapes me (no pun intended). If you try and use HTML
entities, then you will likely create invalid XML documents because HTML
has entities that are "undefined" in the default XML set.
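
To make the "the DOM escapes for you" point concrete, here is a minimal sketch. It uses Python's standard xml.dom.minidom purely as a stand-in for a DOM implementation (it is not the PHP DOM XML API), and the element name "image", the attribute "alt" and the sample strings are made up for the illustration.

```python
from xml.dom import minidom

# Build a tiny document through the DOM instead of gluing strings together.
doc = minidom.Document()
image = doc.createElement("image")        # hypothetical element name
doc.appendChild(image)

caption = "Fish & chips <cheap>"          # raw text, nothing escaped by hand
alt_text = 'sometimes with "quotes"'
image.setAttribute("alt", alt_text)
image.appendChild(doc.createTextNode(caption))

# The escaping happens when the tree is serialised.
xml_text = image.toxml()
print(xml_text)

# Parsing the result gives back exactly the strings that went in.
parsed = minidom.parseString(xml_text).documentElement
round_trip_alt = parsed.getAttribute("alt")
round_trip_caption = "".join(node.data for node in parsed.childNodes)
print(round_trip_alt == alt_text, round_trip_caption == caption)
```

The calling code never writes an entity itself: the serialiser escapes on the way out and the parser un-escapes on the way in.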

When you use an XML parser (be it SAX or DOM) to get the data back from
the XML storage files, everything (including entities) will be converted
back (un-escaped). So you really do not need to use CDATA sections.
CDATA sections do have their uses, but they are strictly necessary in
only a very few cases.
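
A small sketch of why CDATA is rarely needed, assuming Python's standard xml.etree.ElementTree as the parser (any conforming XML parser should behave the same way); the "note" element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same character data, once written with entity references
# and once wrapped in a CDATA section.
escaped = "<note>x &lt; y &amp; y &gt; z</note>"
cdata = "<note><![CDATA[x < y & y > z]]></note>"

text_from_escaped = ET.fromstring(escaped).text
text_from_cdata = ET.fromstring(cdata).text

print(text_from_escaped)
print(text_from_escaped == text_from_cdata)
```

Both spellings hand the application the same string, so CDATA is mainly a convenience for people typing the file by hand.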

SPECIAL NOTE ON XSL STYLESHEETS:
If you are using XSL templates to extract HTML markup contained
(escaped) in the XML storage files, set the disable-output-escaping
attribute of the xsl:value-of element to "yes" to disable output
escaping. This is useful if you have done something like this...
$element->set_content($htmlSource);
and you wish the output tree to contain unescaped HTML.

As for character encoding (UTF-8 etc.), it depends on what sort of data
you are putting in there. Odds are you needn't concern yourself with
this unless you know that your source data is UTF-16 or something. Just
try using the DOM XML functions and see how you go.

Jul 17 '05 #2
Terence wrote:
> On that point, basic XML only has 5 pre-defined default entities. And
> off the top of my head, I think they are:
> > -- &gt;
> < -- &lt;
> " -- &quot;
> & -- &amp;
> [insert fifth one here]
>
> The other one escapes me (no pun intended).

The other one was introduced by XML 1.0, and doesn't exist in any HTML
version. It's U+0027 APOSTROPHE ("'"), with an entity reference of
&apos;, a decimal character reference of &#39;, and a hexadecimal
character reference of &#x27; (XML 1.0 sec. 4.6).

> If you try and use HTML entities, then you will likely create invalid
> XML documents because HTML has entities that are "undefined" in the
> default XML set.

OK.

On the other hand, htmlspecialchars converts, at most, just those
five characters to their respective entity references (or decimal
character reference in the case of the ASCII apostrophe, since there
is no entity reference defined for it in any HTML version). The
ENT_QUOTES mode converts both single- and double-quotes; the default
mode, ENT_COMPAT, only converts double-quotes.

http://www.php.net/manual/en/functio...ecialchars.php
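
For the original <img alt="..."> problem, the fix is to escape the value at the moment it is written into the HTML. The sketch below uses Python's html.escape only as a rough analogue of htmlspecialchars in ENT_QUOTES mode (the two differ in details, such as how the apostrophe is encoded); the helper name img_tag is made up.

```python
import html

from_xml = 'Data from XML file, sometimes with "quotes".'

def img_tag(alt):
    # quote=True converts " and ' as well as &, < and >
    return '<img alt="' + html.escape(alt, quote=True) + '">'

broken = '<img alt="' + from_xml + '">'   # attribute ends at the first inner quote
safe = img_tag(from_xml)

print(broken)
print(safe)
```

The data in the XML file stays unescaped; only the copy sent to the browser is converted.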

--
Jock
Jul 17 '05 #3
I didn't mention SAX, is that the standard PHP parser I'm using now? I
thought it was Expat. Thanks for making this even more confusing ;)

Ok, I'll just dive into DOM now and see where this will all end up. I'll
probably come across all the terms again, in time. B.t.w. I don't understand
much of your XSL note, probably because I know very little about XSL. I'm
using XML to store data while avoiding databases.

Thanks!
Jul 17 '05 #4
John van Terheijden wrote:
> I didn't mention SAX, is that the standard PHP parser I'm using now? I
> thought it was Expat. Thanks for making this even more confusing ;)

Yeah, it's a bit like that. I didn't want to include too many
explanations, or else I'd be in danger of writing a huge article. Trust
me, restraint is a good thing for me. When you're on the newbie end of a
technology, it's best just to pretend you never read or heard the
stuff that confused you (initially, of course).

The Simple API for XML (SAX) is indeed the model behind what PHP
inadequately names the "XML extension". And yes, it is based on Expat
(a product name), an event-based parser with a SAX-style callback API.
SAX is a (de facto) standard; Expat is a product that implements that
style of parsing.
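
As a rough picture of what event-based parsing looks like, here is a sketch using Python's binding to Expat (xml.parsers.expat), where the handler attributes play the role of xml_set_element_handler() and friends. The function name parse_events and the sample "item" element are assumptions made for the example.

```python
import xml.parsers.expat

def parse_events(xml_text):
    """Collect start/text/end events the way a SAX-style handler sees them."""
    events = []
    text_parts = []

    def flush_text():
        # Expat may deliver one run of text in several chunks, so buffer it.
        text = "".join(text_parts).strip()
        if text:
            events.append(("text", text))
        text_parts.clear()

    def start(name, attrs):
        flush_text()
        events.append(("start", name, dict(attrs)))

    def end(name):
        flush_text()
        events.append(("end", name))

    def chars(data):
        text_parts.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return events

events = parse_events('<item alt="a &quot;b&quot;">x &lt; y</item>')
for event in events:
    print(event)
```

The handlers receive already un-escaped text and attribute values, and it is up to the calling code to build whatever data structure it needs from the stream of events.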

DOM is a standard, PHP uses the libxml product which implements that
standard. PHP5 is slated to use libxml2 which is very exciting indeed :)

If you don't know anything about XSLT, then ignore the tip I gave to
XSLT users who might take my advice on the [no need to use] CDATA issue.
XSLT is a whole new kettle of fish, don't go there until you have a firm
grasp on XML.

I recommend familiarising yourself with the XML "infoset". You will find
the "infoset" standard on the W3C website. Do not panic, it is a
relatively short document that can be skimmed quite readily. Don't get
depressed if it all doesn't stick the first time. At least *familiarise*
yourself with the *concept* of the infoset. There should be an
introduction/primer type article there.

Jul 17 '05 #5
Thanks for the reply.

"Terence" <tk******@fastmail.fm> wrote in message
news:3fbaada8$1@herald...
> John van Terheijden wrote:
> > I didn't mention SAX, is that the standard PHP parser I'm using now? I
> > thought it was Expat. Thanks for making this even more confusing ;)
>
> Yeah, it's a bit like that. I didn't want to include too much
> explanations else I'd be in danger of writing a huge article. Trust me,
> restraint is a good thing for me. When you're on the newbie end of a
> technology, then it's best just to pretend you never read/heard the
> stuff that confused you (initially of course).

I agree. It's always hard to choose between learning by reading or by
practice. Most of the time, "the other one" would have been faster.

> Simple Api for Xml (SAX) is indeed what PHP's inadequately named the
> "XML extension". And yes, it is based on the Expat (product name)
> implementation of SAX. SAX is a standard, Expat is a product that
> implements that standard.
>
> DOM is a standard, PHP uses the libxml product which implements that
> standard. PHP5 is slated to use libxml2 which is very exciting indeed :)

Thanks for clearing that up! By the way, I think I like how DOM works
better than how SAX works. However, I believe that depends very much on
the type of XML data involved.

> If you don't know anything about XSLT, then ignore the tip I gave to
> XSLT users who might take my advice on the [no need to use] CDATA issue.
> XSLT is a whole new kettle of fish, don't go there until you have a firm
> grasp on XML.

ok :)

> I recommend familiarising yourself with the XML "infoset". You will find
> the "infoset" standard on the w3c website. Do not panic, it is a
> relatively short document that can be skimmed quite readily. Don't get
> depressed if it all doesn't stick the first time. At least *familiarise*
> yourself with the *concept* of the infoset. There should be an
> introduction/primer type article there.

I had a quick look and will read it.

Thanks!
Jul 17 '05 #6
"John van Terheijden" <john-foobar-nl> wrote in message news:<3f***********************@news.versatel.net> ...
> Hi.
>
> I'm trying to develop a program that uses XML files store data. I'm using
> Windows XP, Apache 1.3.29 and PHP 4.3.4.

I can't understand why people mess with XML when PHP with a
simple database (like MySQL, PostgreSQL or SQLite) can do the job
better.

XML can be used effectively to share data between two domains.
But there are people who plow through XML within their own domain; I
have also seen a number of people who dump their data from the DB into
XML and then mess with the XML.

There are also some people who still fight against PHP's cool
short tags on behalf of messy XML.

---
"One who mix sports and patriotism is a barbarian"
Email: rrjanbiah-at-Y!com
Jul 17 '05 #7
