
problem parsing XML files with < and > in cdata-section

Hi,

I am using the following code (see below) from php.net
(http://www.php.net/manual/en/ref.xml.php, example 1) to parse an XML
file (encoded in UTF-8). I changed the code slightly so that the cdata
sections are echoed, and not the element names as in the original
example.

In the cdata sections of my XML file I have terms like this:

Cap&lt;Finanzinstrument&gt;

The parser echoes them as follows (echo $data . "<br>";):

Cap
<
Finanzinstrument


Can anyone explain this to me? Why does the parser split the
cdata-section with &lt; and &gt; in it? Is there any way to avoid
this?

Thanks very much in advance,

greetings, wenke

--------------------------------------------

<?php
$file = "ck_bsp.xml";
// Current element nesting depth. A plain integer is used here because
// the parser handle cannot serve as an array key on current PHP versions.
$depth = 0;

// Called for every opening tag: indent by the current depth, then go one level deeper.
function startElement($parser, $name, $attrs)
{
    global $depth;
    for ($i = 0; $i < $depth; $i++) {
        echo " ";
    }
    //echo "$name\n";
    $depth++;
}

// Called for every closing tag: go back up one level.
function endElement($parser, $name)
{
    global $depth;
    $depth--;
}

// Called for each piece of character data the parser reports.
function characterData($parser, $data)
{
    echo $data . "<br>";
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

// Feed the file to the parser in 4 KB blocks; the last call is flagged as final.
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
?>
--------------------------------------------
Jul 17 '05 #1
3 Replies


we*********@gmx.de (wenke) wrote in
news:a6**************************@posting.google.com:
> In the cdata sections of my XML file I have terms like this:
>
> Cap&lt;Finanzinstrument&gt;
>
> The parser echoes them as follows (echo $data . "<br>";):
>
> Cap
> <
> Finanzinstrument
>
> Can anyone explain this to me? Why does the parser split the
> cdata-section with &lt; and &gt; in it? Is there any way to avoid
> this?


"Stream-oriented" XML parsers (like expat, which is what PHP uses) are
almost never guaranteed to return maximum-length pieces of character data,
because doing so requires some rather complicated internal buffering that
slows them down. In particular, they usually stop a chunk at the
beginning of an entity reference. You simply have to be prepared for
consecutive calls to your character data handler.
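
To see the splitting for yourself, here is a minimal sketch that records every chunk the character data handler receives. The one-element document in `$xml` and the `$chunks` array are made up for illustration; only the text content comes from the question.

```php
<?php
// Every chunk handed to the character data handler, in arrival order.
$chunks = [];

$parser = xml_parser_create('UTF-8');
xml_set_character_data_handler($parser, function ($parser, $data) use (&$chunks) {
    $chunks[] = $data;
});

// Hypothetical test document (an assumption, not the poster's file).
$xml = '<term>Cap&lt;Finanzinstrument&gt;</term>';
if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

// Shows how the text was split up on this particular PHP build.
var_dump($chunks);

// Whatever the split, joining the chunks gives back the complete text.
echo implode('', $chunks), "\n"; // Cap<Finanzinstrument>
```

How many chunks arrive, and where they break, depends on the parser library and version behind PHP's XML extension, so the code should not rely on any particular split.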
Jul 17 '05 #2

Eric Bohlman <eb******@earthlink.net> wrote in message news:<Xn*******************************@130.133.1.4>...
> we*********@gmx.de (wenke) wrote in
> news:a6**************************@posting.google.com:
> > In the cdata sections of my XML file I have terms like this:
> >
> > Cap&lt;Finanzinstrument&gt;
> >
> > The parser echoes them as follows (echo $data . "<br>";):
> >
> > Cap
> > <
> > Finanzinstrument
> >
> > Can anyone explain this to me? Why does the parser split the
> > cdata-section with &lt; and &gt; in it? Is there any way to avoid
> > this?
>
> "Stream-oriented" XML parsers (like expat, which is what PHP uses) are
> almost never guaranteed to return maximum-length pieces of character data,
> because doing so requires some rather complicated internal buffering that
> slows them down. In particular, they usually stop a chunk at the
> beginning of an entity reference. You simply have to be prepared for
> consecutive calls to your character data handler.


Could you please explain this more precisely? How do I know whether the
output the parser is giving me still belongs to the previous cdata section
or to a new one (especially if the structure of the data might vary)?
Thanks!
Jul 17 '05 #3

we*********@gmx.de (wenke) wrote in
news:a6**************************@posting.google.com:
> Eric Bohlman <eb******@earthlink.net> wrote in message
> news:<Xn*******************************@130.133.1.4>...
> > "Stream-oriented" XML parsers (like expat, which is what PHP uses)
> > are almost never guaranteed to return maximum-length pieces of
> > character data, because doing so requires some rather complicated
> > internal buffering that slows them down. In particular, they usually
> > stop a chunk at the beginning of an entity reference. You simply
> > have to be prepared for consecutive calls to your character data
> > handler.
>
> Could you please explain this more precisely? How do I know whether the
> output the parser is giving me still belongs to the previous cdata section
> or to a new one (especially if the structure of the data might vary)?


If there were no intervening start-element or end-element events, then two
character-data events are referring to consecutive parts of the same text.
The usual trick is to clear out a text buffer at the end of the code for
each start-element or end-element event (the code would have made use of
anything that was previously in the buffer), and simply append the text to
it in character-data events.
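
A rough sketch of that buffering trick, under some assumptions: the `$buffer` and `$texts` variables, the `$flush` helper and the sample document are all invented for the example, and "making use of" the buffer is reduced to storing each non-blank text run in an array.

```php
<?php
// Text collected since the last start-element or end-element event.
$buffer = '';
// Completed text runs, in document order.
$texts = [];

// Hypothetical helper: use whatever is in the buffer, then clear it.
$flush = function () use (&$buffer, &$texts) {
    if (trim($buffer) !== '') {
        $texts[] = $buffer;
    }
    $buffer = '';
};

$parser = xml_parser_create('UTF-8');
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use ($flush) { $flush(); }, // start-element
    function ($parser, $name) use ($flush) { $flush(); }          // end-element
);
// Character data is only appended; it may arrive in several pieces.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$buffer) {
    $buffer .= $data;
});

// Hypothetical test document (an assumption, not the poster's file).
$xml = '<terms><term>Cap&lt;Finanzinstrument&gt;</term><term>Floor</term></terms>';
if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

// Each entry is now one complete text run: "Cap<Finanzinstrument>" and "Floor".
foreach ($texts as $text) {
    echo htmlspecialchars($text), "<br>\n";
}
```

Escaping with `htmlspecialchars` on output keeps the decoded `<` and `>` from being read as markup by the browser; skipping whitespace-only runs is a design choice for this sketch and may not suit every document.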
Jul 17 '05 #4
