
problem parsing XML files with &lt; and &gt; in cdata-section

wenke
#1: Jul 17 '05
Hi,

I am using the following code (see below) from php.net
(http://www.php.net/manual/en/ref.xml.php, example 1) to parse an XML
file (encoded in UTF-8). I changed the code slightly so that the cdata
sections are echoed, and not the element names as in the original
example.

In the cdata sections of my XML file I have terms like this:

Cap&lt;Finanzinstrument&gt;

The parser echoes them as follows (echo $data . "<br>";):

Cap
<
Finanzinstrument
>

Can anyone explain this to me? Why does the parser split the
cdata-section with &lt; and &gt; in it? Is there any way to avoid
this?

Thanks very much in advance,

greetings, wenke

--------------------------------------------

<?php
$file = "ck_bsp.xml";
$depth = array();

function startElement($parser, $name, $attrs)
{
    global $depth;
    for ($i = 0; $i < $depth[$parser]; $i++) {
        echo " ";
    }
    //echo "$name\n";
    $depth[$parser]++;
}

function endElement($parser, $name)
{
    global $depth;
    $depth[$parser]--;
}

function characterData($parser, $data)
{
    echo $data . "<br>";
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
?>
--------------------------------------------
Eric Bohlman
#2: Jul 17 '05

wenkeroeper@gmx.de (wenke) wrote in
news:a642a16e.0403030637.284b954e@posting.google.com:

> In the cdata sections of my XML file I have terms like this:
>
> Cap&lt;Finanzinstrument&gt;
>
> The parser echoes them as follows (echo $data . "<br>";):
>
> Cap
> <
> Finanzinstrument
> >
>
> Can anyone explain this to me? Why does the parser split the
> cdata-section with &lt; and &gt; in it? Is there any way to avoid
> this?

"Stream-oriented" XML parsers (like expat, which is what PHP uses) are
almost never guaranteed to return maximum-length pieces of character data,
because doing so requires some rather complicated internal buffering that
slows them down. In particular, they usually stop a chunk at the
beginning of an entity reference. You simply have to be prepared for
consecutive calls to your character data handler.
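
To see the consecutive calls in isolation, here is a minimal sketch. The function name collectChunks and the <term> element are made up for illustration, and it assumes PHP 7 or later with the xml extension. How many pieces the handler receives depends on the parser library underneath PHP, so the only thing worth relying on is that joining the pieces gives back the full text.

```php
<?php
// Hypothetical helper: returns every piece of character data the parser
// hands to the handler, in the order it arrives.
function collectChunks(string $xml): array
{
    $chunks = [];

    $parser = xml_parser_create();
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$chunks) {
            // May be called several times for one run of text.
            $chunks[] = $data;
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            (string) xml_error_string(xml_get_error_code($parser))
        );
    }

    return $chunks;
}

// Sample input modelled on the term from the question.
$chunks = collectChunks('<term>Cap&lt;Finanzinstrument&gt;</term>');

// The number of pieces is not guaranteed; the joined text is.
echo count($chunks) . " piece(s), joined: "
    . htmlspecialchars(implode('', $chunks)) . "\n";
```

Whatever the count turns out to be on a given installation, the joined string is the complete term.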
wenke
#3: Jul 17 '05

Eric Bohlman <ebohlman@earthlink.net> wrote in message news:<Xns94A1CB4338120ebohlmanomsdevcom@130.133.1.4>...
> wenkeroeper@gmx.de (wenke) wrote in
> news:a642a16e.0403030637.284b954e@posting.google.com:
>
> > In the cdata sections of my XML file I have terms like this:
> >
> > Cap&lt;Finanzinstrument&gt;
> >
> > The parser echoes them as follows (echo $data . "<br>";):
> >
> > Cap
> > <
> > Finanzinstrument
> > >
> >
> > Can anyone explain this to me? Why does the parser split the
> > cdata-section with &lt; and &gt; in it? Is there any way to avoid
> > this?
>
> "Stream-oriented" XML parsers (like expat, which is what PHP uses) are
> almost never guaranteed to return maximum-length pieces of character data,
> because doing so requires some rather complicated internal buffering that
> slows them down. In particular, they usually stop a chunk at the
> beginning of an entity reference. You simply have to be prepared for
> consecutive calls to your character data handler.

Could you please explain this more precisely? How do I know whether the
output the parser is giving me still belongs to the previous cdata section
or to a new one (especially if the structure of the data might vary)?
Thanks!
Eric Bohlman
#4: Jul 17 '05

wenkeroeper@gmx.de (wenke) wrote in
news:a642a16e.0403080153.6e023138@posting.google.com:

> Eric Bohlman <ebohlman@earthlink.net> wrote in message
> news:<Xns94A1CB4338120ebohlmanomsdevcom@130.133.1.4>...
>> "Stream-oriented" XML parsers (like expat, which is what PHP uses)
>> are almost never guaranteed to return maximum-length pieces of
>> character data, because doing so requires some rather complicated
>> internal buffering that slows them down. In particular, they usually
>> stop a chunk at the beginning of an entity reference. You simply
>> have to be prepared for consecutive calls to your character data
>> handler.
>
> Could you please explain this more precisely? How do I know whether the
> output the parser is giving me still belongs to the previous cdata section
> or to a new one (especially if the structure of the data might vary)?

If there were no intervening start-element or end-element events, then two
character-data events are referring to consecutive parts of the same text.
The usual trick is to clear out a text buffer at the end of the code for
each start-element or end-element event (the code would have made use of
anything that was previously in the buffer), and simply append the text to
it in character-data events.
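
A rough sketch of that buffer trick follows, not a drop-in replacement for the script in the first post. The names parseTexts and $flush and the sample <terms>/<term> markup are invented for illustration, and it assumes PHP 7 or later with the xml extension. Whitespace-only runs between elements are skipped here, which is a choice of this sketch and not something the parser requires.

```php
<?php
// Hypothetical helper: returns each complete run of text between two
// element events, no matter how many pieces the parser delivered it in.
function parseTexts(string $xml): array
{
    $buffer = '';   // text collected since the last element event
    $texts  = [];   // completed runs, in document order

    // Use whatever has accumulated, then clear the buffer.
    $flush = function () use (&$buffer, &$texts) {
        if (trim($buffer) !== '') {
            $texts[] = $buffer;
        }
        $buffer = '';
    };

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use ($flush) { $flush(); },
        function ($parser, $name) use ($flush) { $flush(); }
    );
    xml_set_character_data_handler(
        $parser,
        // Only append here; never assume a call carries the whole text.
        function ($parser, $data) use (&$buffer) { $buffer .= $data; }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            (string) xml_error_string(xml_get_error_code($parser))
        );
    }

    return $texts;
}

$xml = '<terms><term>Cap&lt;Finanzinstrument&gt;</term><term>Floor</term></terms>';
foreach (parseTexts($xml) as $text) {
    echo htmlspecialchars($text) . "<br>\n";
}
```

With this arrangement each term comes out as one string, because the pieces are only handed over when the next start-element or end-element event arrives.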