How do I parse this XML document most efficiently?

Russell Mangel

What would be the best way to parse this XML document?
I want to avoid using XmlDocument.
I don't know if I should use XmlTextReader or the XPath classes.

There is only one <MessageStore> element in the document, "always"
at the end of the document.
There will be thousands of <Messages> elements, "always" before the
<MessageStore> element.

1st Step
=====
I need to get the <PR_STORE_ENTRY_ID> data "first".
There will always only be one MessageStore element, with one child
containing data.

2nd Step
======
I then need all the data from <Messages> elements.

If possible, an example in C#, VB, or C++ .NET would be greatly appreciated.

// Begin XML document

<?xml version="1.0" standalone="yes" ?>
<MessagesToArchive>
<Messages>
<PR_ENTRYID>0000000001</PR_ENTRYID>
<MessageType>32</MessageType>
</Messages>
<Messages>
<PR_ENTRYID>0000000002</PR_ENTRYID>
<MessageType>512</MessageType>
</Messages>
<Messages>
<PR_ENTRYID>0000000003</PR_ENTRYID>
<MessageType>64</MessageType>
</Messages>
<MessageStore>
<PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
</MessageStore>
</MessagesToArchive>

// End XML document

Thanks

Russell Mangel
Las Vegas, NV

Nov 12 '05 #1


Kevin Yu [MSFT]

Hi Russell,

First of all, I would like to confirm my understanding of your issue. From
your description, I understand that you need to parse an XML document
without using the XmlDocument class. If I have misunderstood anything,
please feel free to let me know.

I think we can use XmlTextReader to achieve this. An XmlTextReader object
provides fast, forward-only, read-only access to an XML source. We can use
the XmlTextReader.Read method to go through each node of the document and check
the node type to decide what to do with it.
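
A minimal sketch of such a loop, for illustration only. The class and method names (ReadLoopSketch, ReadElements) are made up for this example, and it assumes element-only content like the sample document:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReadLoopSketch
{
    // Returns one "name=value" string for every element that directly contains text.
    public static List<string> ReadElements(TextReader input)
    {
        List<string> result = new List<string>();
        XmlTextReader reader = new XmlTextReader(input);
        string current = null;
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    current = reader.LocalName;   // remember which element we are inside
                    break;
                case XmlNodeType.Text:
                    result.Add(current + "=" + reader.Value);
                    break;
                // Whitespace, end tags and the XML declaration are ignored.
            }
        }
        reader.Close();
        return result;
    }
}
```

Calling ReadLoopSketch.ReadElements(new StreamReader(path)) would walk the whole file once without building a tree in memory.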

For more information about XmlTextReader, please check the following link
with an example.

http://msdn.microsoft.com/library/de...us/cpref/html/frlrfsystemxmlxmltextreaderclassreadtopic.asp

HTH. If anything is unclear, please feel free to reply to the post.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."

Nov 12 '05 #2

Russell Mangel

Correct, I do not want to use the XmlDocument class; there can easily be
50,000-80,000 <Messages> elements in the document.

If you look at the XML document, I need to get an element that is at the end
of the file first.
Then I can get the rest of the preceding elements. It is a shame that the
element I need first is at the end of the document, but I cannot change this.

I am concerned about how to do this, as I want to avoid parsing the document
twice.
How do I get the last element first, and then loop through the <Messages>
elements?

If we use XmlTextReader, we would have to parse the document twice, which
is not efficient.
Would the XPath classes be a better choice?

Nov 12 '05 #3

Daniel Cazzulino [MVP XML]

I don't think you have a choice other than parsing twice. But you can do it
pretty fast. For example:

// Requires: using System.Xml;
XmlTextReader tr = new XmlTextReader( stream );
string store = tr.NameTable.Add("MessageStore");
// Move to root element, then to its first child
tr.MoveToContent();
tr.Read();
while (!tr.EOF)
{
    if (tr.NodeType != XmlNodeType.Element)
        tr.Read(); // Step over whitespace and end tags.
    else if ((object)tr.LocalName != (object)store)
        tr.Skip(); // Avoid elements you don't care about. Skip() already moves to the next node.
    else
        // You're at the MessageStore. Process and exit while
        break;
}

Next you do the same but for messages. As this is streaming parsing, it's
not as bad as you may think... although it's really unfortunate that the
first element you need is the last ...
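
A rough sketch of what that second pass could look like, not a definitive implementation. SecondPassSketch and ReadMessages are made-up names, and the code assumes every <Messages> element holds <PR_ENTRYID> followed by <MessageType>, as in the posted sample:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SecondPassSketch
{
    // Returns one "entryId|messageType" string per <Messages> element.
    public static List<string> ReadMessages(TextReader input)
    {
        List<string> messages = new List<string>();
        XmlTextReader tr = new XmlTextReader(input);
        tr.WhitespaceHandling = WhitespaceHandling.None; // do not report layout whitespace
        tr.MoveToContent();   // root element
        tr.Read();            // first child of the root
        while (!tr.EOF)
        {
            if (tr.NodeType == XmlNodeType.Element && tr.LocalName == "Messages")
            {
                tr.ReadStartElement("Messages");
                string entryId = tr.ReadElementString("PR_ENTRYID");
                string type = tr.ReadElementString("MessageType");
                tr.ReadEndElement();   // </Messages>
                messages.Add(entryId + "|" + type);
            }
            else if (tr.NodeType == XmlNodeType.Element)
            {
                tr.Skip();             // e.g. <MessageStore>, handled in the first pass
            }
            else
            {
                tr.Read();             // closing tag of the root
            }
        }
        tr.Close();
        return messages;
    }
}
```

In real use the strings would be replaced by whatever you do with each message; the point is only that each <Messages> element is consumed and discarded as the reader moves forward.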

--
Daniel Cazzulino [MVP XML]
Clarius Consulting SA
http://weblogs.asp.net/cazzu
http://aspnet2.com

Nov 12 '05 #4

Derek Harmon

"Russell Mangel" <ru*****@tymer.net> wrote in message news:OG**************@TK2MSFTNGP09.phx.gbl...

: What would be the best way to parse this XML document?
: I want to avoid using XMLDocument.
: I don't know if I should use XMLTextReader, or Xpath classes.

Use XmlTextReader.

: There is only one <MessageStore> element in the document, "always"
: at the end of the document.
: There will be thousands of <Messages> elements, "always" before
: <MessageStore> element.

There are a couple more questions to contemplate in this design.

Are there any attributes? (Or, if there are attributes, do you care about parsing
them?) The sample you've posted appears element-oriented.

What's the longest the text content of a child text node can be? All of the text
nodes in the sample appear of small order, <16 characters or so.

: :
: I need to get the <PR_STORE_ENTRY_ID> data "first".
: :
: I then need all the data from <Messages> elements.
: :
: <?xml version="1.0" standalone="yes" ?>
: <MessagesToArchive>
: :
: <Messages>
: <PR_ENTRYID>0000000003</PR_ENTRYID>
: <MessageType>64</MessageType>
: </Messages>
: <MessageStore>
: <PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
: </MessageStore>
: </MessagesToArchive>

The next question is, does it matter to you, if the XML document you're
processing "looks like" the following

<MessagesToArchive>
<MessageStore>
<PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
</MessageStore>
<Messages>
<MessageType>64</MessageType>
<PR_ENTRYID>0000000003</PR_ENTRYID>
</Messages>

</MessagesToArchive>

Where is the XML document coming from? Is it coming from the file system?
Is it coming from a random-access stream (i.e., the stream supports seeking)?

A fourth question, although it's just an implementation detail, is whether there
are CDATA sections or entity references used by the document?

Given:
* The document's content is element-only and text.
* The document may be processed in reverse document-order.
* The source of the document is random-access.

Then the solution is to write a custom subclass of Stream or StreamReader
that wraps the existing Stream from which you're reading the XML document,
and reads it in reverse, replaces "</" sequences with "<" as they're encountered,
replaces "<" sequences with "</" as they are encountered, and reverses the
text of child nodes.

The presence of empty elements, attributes, entity references, and CDATA
sections complicate this implementation slightly. If the length of text nodes is
larger than the block reading size you use (4096 is usually a good size), then
the implementation is complicated slightly further because it may take more
than 2 reads to read a child node.

By intercepting the incoming XML at the stream level, you can make the XML
look like whatever you want for the XmlTextReader. In your situation, it's
definitely most efficient to make the document appear upside-down.

In the [idealized] case of the input source being a file in the file system, it's
easy to seek to the end of the file and then start reading blocks from the end
of the file, working your way up; then reading these buffers from the end
up, etc.

If you want to process the <Messages> elements from the top-down in
document order as they are now; then I'd just read the last cluster of the
file in directly and extract the <MessageStore> element using text processing
techniques. Directly accessing part of the file in this manner, when it's a
random-access Stream, is going to be much faster than making two passes.
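
A sketch of that tail-read idea, under stated assumptions: the names TailReadSketch and ReadStoreEntryId are invented for this example, the file is assumed to be UTF-8 or ASCII, and the whole <PR_STORE_ENTRY_ID> element is assumed to fit in the last tailBytes bytes of the file:

```csharp
using System;
using System.IO;
using System.Text;

public static class TailReadSketch
{
    // Reads at most tailBytes from the end of a seekable stream and returns the
    // text of <PR_STORE_ENTRY_ID>, or null if the element is not inside that tail.
    public static string ReadStoreEntryId(Stream stream, int tailBytes)
    {
        long start = Math.Max(0, stream.Length - tailBytes);
        stream.Seek(start, SeekOrigin.Begin);
        byte[] buffer = new byte[(int)(stream.Length - start)];
        int read = 0;
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) break;   // unexpected end of stream
            read += n;
        }
        // Assumption: UTF-8 (or ASCII) encoding, as the sample declaration suggests.
        string tail = Encoding.UTF8.GetString(buffer, 0, read);

        const string open = "<PR_STORE_ENTRY_ID>";
        const string close = "</PR_STORE_ENTRY_ID>";
        int from = tail.LastIndexOf(open, StringComparison.Ordinal);
        if (from < 0) return null;
        from += open.Length;
        int to = tail.IndexOf(close, from, StringComparison.Ordinal);
        return to < 0 ? null : tail.Substring(from, to - from);
    }
}
```

After this call the same stream can be rewound with stream.Seek(0, SeekOrigin.Begin) and handed to an XmlTextReader for a single forward pass over the <Messages> elements.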

The key point, when dealing with megabyte-plus XML documents efficiently,
is to never forget that they're coming to you via a Stream. The input source
may give you options.

Derek Harmon

Nov 12 '05 #5

Russell Mangel

There are no attributes in the document.

The "ENTRY_ID" elements contain strings of 250-400 characters.
MessageType is an Int32.

: The next question is, does it matter to you, if the XML document you're
: processing "looks like" the following

No, it does not matter. However, I cannot rearrange the document; I didn't
write the code that creates these XML documents.

: Where is the XML document coming from?

The document is a file on the file system.

: A fourth question, although it's just an implementation detail, is whether
: there are CDATA sections or entity references used by the document?

There are none. The sample XML I originally posted is "exactly" what
I am working with; there are just thousands of <Messages> elements.
Every <Messages> element always contains 2 child elements, and all of them
contain data, never null.

These XML documents are perfectly consistent in every way.
So reading the file backwards should be reasonably simple, and would
solve my problem of getting the <MessageStore> data first.
The documents are quite large; some of them are 25-60 MB on disk. So you can
see why I am trying to avoid parsing twice.

Thanks for your excellent thoughts, especially the reminder about using
Streams.

Russell Mangel
Las Vegas, NV

Nov 12 '05 #6

Russell Mangel

I noticed something in your example that I will have to look up in the
documentation, as I have not used it before.
I don't know what purpose the following line of code serves:

tr.NameTable.Add("MessageStore");

Thanks for your sample code and thoughts.
Russell Mangel
Las Vegas, NV


Nov 12 '05 #7

Daniel Cazzulino [MVP XML]

The purpose of that line is to get a string that can be used to perform fast
reference comparison against the current LocalName, instead of string-value
comparison.
Read the documentation for XmlNameTable for more.
You should read Oleg's post too:
http://www.tkachenko.com/blog/archives/000181.html
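
A small sketch of the difference, for illustration only (NameTableSketch and CountByReference are made-up names). Note that in C#, == and != on two string-typed operands compare values, so the operands have to be cast to object (or passed to Object.ReferenceEquals) to get the reference comparison:

```csharp
using System;
using System.IO;
using System.Xml;

public static class NameTableSketch
{
    // Counts start tags whose name is the same string object as the atomized target.
    public static int CountByReference(TextReader input, string elementName)
    {
        XmlTextReader tr = new XmlTextReader(input);
        // Add() returns the atomized instance; the reader returns that same
        // instance from LocalName for every element with this name.
        object target = tr.NameTable.Add(elementName);
        int count = 0;
        while (tr.Read())
        {
            // Comparing as object forces a reference comparison instead of a
            // character-by-character string comparison.
            if (tr.NodeType == XmlNodeType.Element && (object)tr.LocalName == target)
                count++;
        }
        tr.Close();
        return count;
    }
}
```

With tens of thousands of <Messages> elements the saving per comparison is small but it is paid on every node, which is why the name is added to the table once up front.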

--
Daniel Cazzulino [MVP XML]
Clarius Consulting SA
http://weblogs.asp.net/cazzu
http://aspnet2.com

Nov 12 '05 #8

Kevin Yu [MSFT]

Hi Russell,

I'd like to know if this issue has been resolved yet. Is there anything
I can help with? I'm still monitoring it. If you have any questions,
please feel free to post them in the community.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."

Nov 12 '05 #9

Russell Mangel

Thanks for following through.
Yes, I have this working well now.

Russell Mangel
Las Vegas, NV


Nov 12 '05 #10

Russell Mangel

I am glad you pointed this out; I was unaware of this fact.
I will modify my code to use this technique instead of comparing
strings.

Russell Mangel
Las Vegas, NV


Nov 12 '05 #11
