How to parse a XML doc with HTML tags within the texts

Francesco Moi

Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.

Jul 20 '05 #1

Martin Honnen

Francesco Moi wrote:

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>

That is not XML as it is not well-formed; there needs to be a closing tag.

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

That is odd, if you really parse with an XML parser then you shouldn't
get to a DOM at all, parsing should throw an error.
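
A minimal sketch of that behaviour, using Python's standard xml.dom.minidom instead of Xerces+Perl. The tag inside the message was stripped from the post as archived, so the unclosed <br> below is an assumption, and is_well_formed is a hypothetical helper name.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Assumption: the tag lost from the archived post was an unclosed <br>.
BROKEN = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br>My name is Jerry</message></item></doc>"
)


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        # A conforming XML parser stops here; no DOM is ever built.
        return False


print(is_well_formed(BROKEN))                           # False
print(is_well_formed(BROKEN.replace("<br>", "<br/>")))  # True
```

With the unclosed tag the parser raises an error before any DOM exists, so getting "Hi" back suggests the document actually parsed contained a closed element.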

--

Martin Honnen
http://JavaScript.FAQTs.com/

Jul 20 '05 #2

Andy Dingley

On 20 Feb 2005 06:32:22 -0800, fr**********@europe.com (Francesco Moi)
wrote:

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
--------

That's not a well-formed XML document.

I assume that <message> is from your own schema, and that you want to
embed some HTML fragment within it. At this point I usually start
wondering if I can use RSS instead, and save myself a lot of effort.

Your failure here is that the HTML fragment isn't a well-formed XML
fragment. You have several choices:

- Use XHTML instead of HTML. This _might_ work, but you still need to
only include balanced and well-formed fragments. If it's generated
within your own system it might be workable, but it's not a general
solution to reading other people's content (which will always break
sometime).

- Write a parser that can handle tag soup. This is what you need to do
when reading other people's RSS feeds, because they're so often
mis-formed.

- Use HTML, but mangle into well-formed XML (i.e. becomes
 ) This is ugly, worse than using XHTML and has nothing to
commend it.

- Embed the HTML into the XML, either by encoding it, or by using a
CDATA section.

Read the infamous RSS versions note
http://diveintomark.org/archives/200...compatible-rss
It gives some useful background on these issues.

As well as tag / element formation issues, watch out for HTML entity
references that aren't in core XML (like &eacute;) and for embedded
CDATA sections too.
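
The last option (encoding the HTML, or wrapping it in a CDATA section) can be sketched roughly as follows. This uses Python's standard library, not Xerces+Perl; the sample fragment and the helper names wrap_escaped, wrap_cdata and read_message are assumptions made for illustration.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

# Assumed example of an HTML fragment that is not well-formed XML.
HTML_FRAGMENT = "Hi<br>My name is Jerry"


def wrap_escaped(fragment):
    """Embed the fragment as character data by escaping &, < and >."""
    return "<message>" + escape(fragment) + "</message>"


def wrap_cdata(fragment):
    """Embed the fragment in a CDATA section."""
    # A literal "]]>" would close the section early, so it is split
    # across two adjacent CDATA sections.
    safe = fragment.replace("]]>", "]]]]><![CDATA[>")
    return "<message><![CDATA[" + safe + "]]></message>"


def read_message(xml_text):
    """Return the character data of the root element."""
    root = minidom.parseString(xml_text).documentElement
    # Text and CDATA nodes both expose their characters through .data
    return "".join(child.data for child in root.childNodes)


print(read_message(wrap_escaped(HTML_FRAGMENT)))  # Hi<br>My name is Jerry
print(read_message(wrap_cdata(HTML_FRAGMENT)))    # Hi<br>My name is Jerry
```

Either way the XML parser sees only character data, so the consumer gets the HTML back as a string and has to parse it separately if it needs the markup.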

--
Smert' spamionam

Jul 20 '05 #3

Malte

Francesco Moi wrote:

Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.

We have an application that outputs this kind of rubbish (rubbish being
!xhtml ;-)).
I had to take out all the unbalanced tags before being able to parse the
results.
Much easier, if you can enforce xhtml, IMHO.
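
A rough sketch of that kind of pre-cleaning, in Python rather than Perl. It assumes the only offending tag is <br> (the actual tag was lost from the archived post), and strip_br and first_message are hypothetical helper names. A regex like this is only workable when the set of stray tags is small and known in advance.

```python
import re
from xml.dom import minidom

# Assumption: the unbalanced tag in the source data is <br>.
RAW = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br>My name is Jerry</message></item></doc>"
)


def strip_br(raw):
    """Replace every <br>, <br/> or <BR> with a single space."""
    return re.sub(r"<br\s*/?>", " ", raw, flags=re.IGNORECASE)


def first_message(raw):
    """Clean the input, parse it, and return the first <message> text."""
    doc = minidom.parseString(strip_br(raw))
    return doc.getElementsByTagName("message").item(0).firstChild.nodeValue


print(first_message(RAW))  # Hi My name is Jerry
```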

Jul 20 '05 #4

francescomoi

Sorry, it's a instead of .
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
----------------------

Jul 20 '05 #5

William Park

fr**********@europe.com wrote:

Sorry, it's a instead of .
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
----------------------

sed 's, ,,g'

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because I can type.

Jul 20 '05 #6

Andy Dingley

On 20 Feb 2005 14:00:03 -0800, fr**********@europe.com wrote:

Sorry, it's a instead of .

It's not a parsing problem either, it's a DOM problem.

"Hi" is the first child of <message>, that's what you asked for,
that's what you got.

item(0) & getFirstChild are effectively duplicates here. So instead
of getting the content of the first <message>, you're getting the
first member (one text node) of this content.

To get "the whole text" is a common requirement, but not particularly
meaningful in a pure XML sense. So it's not part of the standard DOM.
You can usually use a .text property, or else you'll have to iterate /
collect all the text nodes yourself and concatenate them.
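
Collecting the text nodes by hand might look like the sketch below. It is written against Python's xml.dom.minidom, whose firstChild and nodeValue correspond to the getFirstChild and getNodeValue calls in the question; the <br/> in the sample document is an assumption, and whole_text is a hypothetical helper name.

```python
from xml.dom import minidom

# Assumption: the tag lost from the archived post was an empty <br/>.
DOC = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br/>My name is Jerry</message></item></doc>"
)


def whole_text(node):
    """Concatenate every text and CDATA node below node, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Descend into nested elements such as <b>...</b>
            parts.append(whole_text(child))
    return "".join(parts)


message = minidom.parseString(DOC).getElementsByTagName("message").item(0)
print(message.firstChild.nodeValue)  # Hi
print(whole_text(message))           # HiMy name is Jerry
```

The empty element contributes no text, so the two text nodes run together; a caller that wants a space or newline there has to add it when it meets the element.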

--
Die Gotterspammerung - Junkmail of the Gods

Jul 20 '05 #7

Johannes Koch

Andy Dingley wrote:

To get "the whole text" is a common requirement, but not particularly
meaningful in a pure XML sense. So it's not part of the standard DOM.

In DOM3 Core (W3C Recommendation since 07 April 2004) there is
textContent
<http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
But I don't know about its implementation in XML parsers.

--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)

Jul 20 '05 #8

Martin Honnen

Johannes Koch wrote:

In DOM3 Core (W3C Recommendation since 07 April 2004) there is
textContent
<http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
But I don't know about its implementation in XML parsers.

The XML parser in Java 1.5 (alias Java 5) has support for that, and I
think it is based on Xerces Java from Apache.
Mozilla has no DOM Level 3 Core support in general but has textContent
support.
Not sure whether the Xerces C++ that the OP uses with Perl is also up to
DOM Level 3 Core.
--

Martin Honnen
http://JavaScript.FAQTs.com/

Jul 20 '05 #9
