How to parse a XML doc with HTML tags within the texts

Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.
Jul 20 '05 #1


Francesco Moi wrote:

> I must parse this XML document:
> --------------
> <doc>
> <item>
> <name>Jerry</name>
> <message>Hi<br>My name is Jerry</message>
> </item>
> </doc>

That is not XML because it is not well-formed: the <br> needs a closing
</br> tag.

> When I try to get the 'message' value by using:
> getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;
>
> I get only:
> Hi

That is odd. If you really parse this with an XML parser, you shouldn't
get a DOM at all; parsing should throw an error.
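
A minimal sketch of that behaviour, assuming Python's standard xml.dom.minidom as a stand-in for the Xerces/Perl binding (Python is not used anywhere in this thread, and the helper name try_parse is made up for illustration). It only shows that a conforming parser rejects the unclosed <br> and accepts the <br/> form:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# The document from the original post, with the unclosed <br>.
BROKEN = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br>My name is Jerry</message></item></doc>"
)


def try_parse(xml_text):
    """Return (True, None) if xml_text is well-formed, else (False, message)."""
    try:
        parseString(xml_text)
    except ExpatError as err:
        # The parser stops at the first well-formedness error; no DOM is built.
        return False, str(err)
    return True, None


broken_ok, broken_error = try_parse(BROKEN)
fixed_ok, fixed_error = try_parse(BROKEN.replace("<br>", "<br/>"))

print("with <br>: ", broken_ok, broken_error)
print("with <br/>:", fixed_ok, fixed_error)
```

Getting "Hi" back at all therefore suggests the input really contained <br/>, or that the parser in use was more lenient than a strict XML parser.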

--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #2
On 20 Feb 2005 06:32:22 -0800, fr**********@europe.com (Francesco Moi)
wrote:
> I must parse this XML document:
> --------------
> <doc>
> <item>
> <name>Jerry</name>
> <message>Hi<br>My name is Jerry</message>
> </item>
> </doc>
> --------


That's not a well-formed XML document.

I assume that <message> is from your own schema, and that you want to
embed some HTML fragment within it. At this point I usually start
wondering if I can use RSS instead, and save myself a lot of effort.

Your failure here is that the HTML fragment isn't a well-formed XML
fragment. You have several choices:

- Use XHTML instead of HTML. This _might_ work, but you still need to
only include balanced and well-formed fragments. If it's generated
within your own system it might be workable, but it's not a general
solution to reading other people's content (which will always break
sometime).

- Write a parser that can handle tag soup. This is what you need to do
when reading other people's RSS feeds, because they're so often
mis-formed.

- Use HTML, but mangle it into well-formed XML (i.e. <br> becomes
<br />). This is ugly, worse than using XHTML, and has nothing to
commend it.

- Embed the HTML into the XML, either by encoding it, or by using a
CDATA section.

Read the infamous RSS versions note
http://diveintomark.org/archives/200...compatible-rss
It gives some useful background on these issues.

As well as tag / element formation issues, watch out for HTML entity
references that aren't in core XML (like &eacute;) and for embedded
CDATA sections too.
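
The last choice in the list (embedding the HTML by encoding it or by wrapping it in a CDATA section) can be sketched as follows. This is only an illustration under assumptions: it uses Python's standard library rather than Xerces/Perl, and the names html_fragment, cdata_doc, escaped_doc and message_text are invented for the example.

```python
from xml.dom.minidom import parseString
from xml.sax.saxutils import escape

# Tag-soup HTML that is not well-formed XML on its own.
html_fragment = "Hi<br>My name is Jerry"

# Option A: wrap the fragment in a CDATA section.
# Caveat: a fragment that itself contains "]]>" would end the section early.
cdata_doc = (
    "<doc><item><message><![CDATA["
    + html_fragment
    + "]]></message></item></doc>"
)

# Option B: encode the markup characters (&, <, >) as entity references.
escaped_doc = (
    "<doc><item><message>"
    + escape(html_fragment)
    + "</message></item></doc>"
)


def message_text(xml_text):
    """Return the character data of the first <message> element."""
    message = parseString(xml_text).getElementsByTagName("message")[0]
    # Here every child is a text or CDATA node, so each one has .data.
    return "".join(child.data for child in message.childNodes)


print(message_text(cdata_doc))
print(message_text(escaped_doc))
```

In both cases the XML parser sees the HTML as plain character data, so the consumer gets the original fragment back as a string and still has to hand it to an HTML-aware parser if it needs the structure.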

--
Smert' spamionam
Jul 20 '05 #3
Francesco Moi wrote:
> Hi.
>
> I must parse this XML document:
> --------------
> <doc>
> <item>
> <name>Jerry</name>
> <message>Hi<br>My name is Jerry</message>
> </item>
> </doc>
> --------
>
> When I try to get the 'message' value by using:
> getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;
>
> I get only:
> Hi
>
> Any suggestion to get the whole text? I'm using Xerces+Perl.
> Thank you very much.


We have an application that outputs this kind of rubbish (rubbish being
!XHTML ;-)).
I had to take out all the unbalanced tags before I could parse the
results.
It's much easier if you can enforce XHTML, IMHO.
Jul 20 '05 #4
Sorry, it's a <br/> instead of <br>.
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br/>My name is Jerry</message>
</item>
</doc>
----------------------

Jul 20 '05 #5
fr**********@europe.com wrote:
> Sorry, it's a <br/> instead of <br>.
> -----------------------
> <doc>
> <item>
> <name>Jerry</name>
> <message>Hi<br/>My name is Jerry</message>
> </item>
> </doc>
> ----------------------


sed 's,<br/>,,g'

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because I can type.

Jul 20 '05 #6
On 20 Feb 2005 14:00:03 -0800, fr**********@europe.com wrote:
> Sorry, it's a <br/> instead of <br>.


It's not a parsing problem either, it's a DOM problem.

"Hi" is the first child of <message>, that's what you asked for,
that's what you got.

item(0) picks the first <message> element, and getFirstChild then picks
only the first child node inside it. So instead of getting the whole
content of the first <message>, you're getting the first member (one
text node) of this content.

To get "the whole text" is a common requirement, but not particularly
meaningful in a pure XML sense. So it's not part of the standard DOM.
You can usually use a .text property, or else you'll have to iterate /
collect all the text nodes yourself and concatenate them.
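
A minimal sketch of the "iterate and concatenate" approach, assuming Python's standard xml.dom.minidom in place of the Xerces/Perl DOM (the DOM method names differ slightly between bindings, and collect_text is a hypothetical helper, not part of any library):

```python
from xml.dom import Node
from xml.dom.minidom import parseString

XML = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br/>My name is Jerry</message></item></doc>"
)


def collect_text(node):
    """Concatenate all text and CDATA nodes below node, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Descend into child elements such as <br/> or <b>...</b>.
            parts.append(collect_text(child))
    return "".join(parts)


doc = parseString(XML)
message = doc.getElementsByTagName("message").item(0)

# What the original call chain returns: only the first child node.
first_child_value = message.firstChild.nodeValue

# What was actually wanted: every text node under <message>.
whole_text = collect_text(message)

print(repr(first_child_value))  # 'Hi'
print(repr(whole_text))         # 'HiMy name is Jerry'
```

Note that the <br/> element contributes no text, so the two pieces run together; if the line break matters, the loop would have to emit a newline or space when it meets a br element.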

--
Die Gotterspammerung - Junkmail of the Gods
Jul 20 '05 #7
Andy Dingley wrote:
> To get "the whole text" is a common requirement, but not particularly
> meaningful in a pure XML sense. So it's not part of the standard DOM.


In DOM3 Core (W3C Recommendation since 07 April 2004) there is
textContent
<http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
But I don't know about its implementation in XML parsers.

--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Jul 20 '05 #8


Johannes Koch wrote:

> In DOM3 Core (W3C Recommendation since 07 April 2004) there is
> textContent
> <http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
> But I don't know about its implementation in XML parsers.


The XML parser in Java 1.5 (alias Java 5) has support for that, and I
think it is based on Xerces Java from Apache.
Mozilla has no DOM Level 3 Core support in general but has textContent
support.
Not sure whether the Xerces C++ that the OP uses with Perl is also up to
DOM Level 3 Core.
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #9
