On 20 Feb 2005 06:32:22 -0800, fr**********@europe.com (Francesco Moi) wrote:
I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>
--------
That's not a well-formed XML document.
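A quick way to confirm this is to hand the document to any conforming parser. Here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree (my choice for illustration, not something from your post); the helper name is_well_formed is made up.

```python
import xml.etree.ElementTree as ET

# The document from the question, unchanged.
doc = """<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>"""

def is_well_formed(text):
    """Return True if a conforming XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The unclosed <br> surfaces as a mismatched-tag error.
        return False

print(is_well_formed(doc))                            # False
print(is_well_formed(doc.replace("<br>", "<br />")))  # True
```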
I assume that <message> is from your own schema, and that you want to
embed some HTML fragment within it. At this point I usually start
wondering if I can use RSS instead, and save myself a lot of effort.
The problem here is that the HTML fragment isn't a well-formed XML
fragment: the <br> is never closed. You have several choices:
- Use XHTML instead of HTML. This _might_ work, but you still need to
only include balanced and well-formed fragments. If it's generated
within your own system it might be workable, but it's not a general
solution to reading other people's content (which will always break
sometime).
- Write a parser that can handle tag soup. This is what you need to do
when reading other people's RSS feeds, because they're so often
malformed.
- Use HTML, but mangle it into well-formed XML (e.g. <br> becomes
<br />). This is ugly, worse than using XHTML, and has nothing to
commend it.
- Embed the HTML into the XML, either by encoding it, or by using a
CDATA section.
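The last option can be sketched as follows. This is only an illustration in Python, assuming the standard library's xml.sax.saxutils.escape and xml.etree.ElementTree; the variable names are mine. Either way, the parser hands back the original HTML as plain text, which you then treat as HTML yourself.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The HTML fragment from the question; not well-formed XML on its own.
html = "Hi<br>My name is Jerry"

# Option 1: encode the markup characters (&, <, >) as entity references.
encoded = "<message>" + escape(html) + "</message>"

# Option 2: wrap the fragment in a CDATA section.
# Assumption: the fragment does not itself contain "]]>".
cdata = "<message><![CDATA[" + html + "]]></message>"

# Both forms parse, and both yield the original fragment as text.
for xml_text in (encoded, cdata):
    assert ET.fromstring(xml_text).text == html

print(encoded)  # <message>Hi&lt;br&gt;My name is Jerry</message>
```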
Read the infamous RSS versions note
http://diveintomark.org/archives/200...compatible-rss
It gives some useful background on these issues.
As well as tag / element formation issues, watch out for HTML entity
references that aren't in core XML (like &eacute;) and for embedded
CDATA sections too (CDATA sections don't nest).
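Both pitfalls are easy to reproduce. A rough sketch in Python, again assuming xml.etree.ElementTree from the standard library; parses and wrap_cdata are hypothetical helper names, and splitting "]]>" across two CDATA sections is one common workaround, not the only one.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def wrap_cdata(text):
    """Wrap text in CDATA, splitting any "]]>" so it cannot end the section early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# &eacute; is an HTML entity; XML itself only predefines lt, gt, amp, apos, quot.
named = "<message>caf&eacute;</message>"
# A numeric character reference needs no declaration.
numeric = "<message>caf&#233;</message>"

# Content that already contains a CDATA section of its own.
tricky = "see <![CDATA[inner]]> here"
wrapped = "<message>" + wrap_cdata(tricky) + "</message>"

print(parses(named))                          # False
print(parses(numeric))                        # True
print(ET.fromstring(wrapped).text == tricky)  # True
```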
--
Smert' spamionam