
How Do I parse this XML document, most efficiently?

What would be the best way to parse this XML document?
I want to avoid using XmlDocument.
I don't know if I should use XmlTextReader or the XPath classes.

There is only one <MessageStore> element in the document, "always"
at the end of the document.
There will be thousands of <Messages> elements, "always" before the
<MessageStore> element.

1st Step
=====
I need to get the <PR_STORE_ENTRY_ID> data "first".
There will always only be one MessageStore element, with one child
containing data.

2nd Step
======
I then need all the data from <Messages> elements.

If possible an example in C#/VB/C++.NET, would be greatly appreciated.

// Begin XML document

<?xml version="1.0" standalone="yes" ?>
<MessagesToArchive>
<Messages>
<PR_ENTRYID>0000000001</PR_ENTRYID>
<MessageType>32</MessageType>
</Messages>
<Messages>
<PR_ENTRYID>0000000002</PR_ENTRYID>
<MessageType>512</MessageType>
</Messages>
<Messages>
<PR_ENTRYID>0000000003</PR_ENTRYID>
<MessageType>64</MessageType>
</Messages>
<MessageStore>
<PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
</MessageStore>
</MessagesToArchive>

// End XML document

Thanks

Russell Mangel
Las Vegas, NV
Nov 12 '05 #1
10 Replies


Hi Russell,

First of all, I would like to confirm my understanding of your issue. From
your description, I understand that you need to parse an Xml document
without using the XmlDocument class. If there is any misunderstanding,
please feel free to let me know.

I think we can use XmlTextReader to achieve this. An XmlTextReader object
provides fast, forward-only, read-only access to an XML source. We can call
the XmlTextReader.Read method to step through each node of the document and
check the node type to decide what to do with it.
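
A minimal sketch of such a Read loop follows. It is not taken from the linked MSDN example; the names ReadLoopSketch and CollectElementText are hypothetical, and it assumes the elements of interest contain only text, as they do in the sample document.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReadLoopSketch
{
    // Walks every node once and returns the text of each element whose
    // local name equals elementName. Only the collected values are kept
    // in memory; the document itself is never loaded as a tree.
    public static List<string> CollectElementText(TextReader input, string elementName)
    {
        var values = new List<string>();
        using (var reader = new XmlTextReader(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == elementName)
                {
                    // ReadString() returns the element's text and leaves the
                    // reader on the end tag, so the next Read() moves on
                    // without stepping over a following element.
                    values.Add(reader.ReadString());
                }
            }
        }
        return values;
    }
}
```

Calling CollectElementText(new StreamReader(path), "PR_ENTRYID") would, under these assumptions, return every entry ID in document order.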

For more information about XmlTextReader, please check the following link
with an example.

http://msdn.microsoft.com/library/de...us/cpref/html/frlrfsystemxmlxmltextreaderclassreadtopic.asp

HTH. If anything is unclear, please feel free to reply to the post.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."

Nov 12 '05 #2

Correct, I do not want to use the XmlDocument class; there can easily be
50,000-80,000 <Messages> elements in the document.

If you look at the XML document, I need to get an element that is at the end
of the file first.
Then I can get the rest of the preceding elements. It is a shame that this
element is at the end of the document, but I cannot change that.

I am concerned about how to do this, as I want to avoid parsing the document
twice.
How do I get the last element first, and then loop through the <Messages>
elements?

If we use XmlTextReader, we would have to parse the document twice, which
is not efficient.
Would the XPath classes be a better choice?

Nov 12 '05 #3

I don't think you have a choice other than parsing twice. But you can do it
pretty fast. For example:

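XmlTextReader tr = new XmlTextReader( stream );
// Atomize the name; typed as object so == below compares references.
object store = tr.NameTable.Add("MessageStore");
// Move to root element
tr.MoveToContent();
// Move to first child of the root
tr.Read();
while (!tr.EOF)
{
    if (tr.NodeType == XmlNodeType.Element && (object)tr.LocalName == store)
        break; // You're at the MessageStore. Process and exit while
    // Skip() already advances past the current node and its children,
    // so no extra Read() is needed (it could step over an element).
    tr.Skip(); // Avoid elements you don't care about.
}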

Next you do the same but for messages. As this is streaming parsing, it's
not as bad as you may think... although it's really unfortunate that the
first element you need is the last ...
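
The second pass over the <Messages> elements might look like the sketch below. The Message class and the ReadMessages method are hypothetical names, the element names come from the sample document in the question, and the sketch assumes the input has been reopened or rewound after the first pass.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class MessagesPassSketch
{
    // One <Messages> record from the sample document (hypothetical type).
    public sealed class Message
    {
        public string EntryId;
        public int MessageType;
    }

    // Streams through the document and builds one Message per <Messages>
    // element, stopping as soon as <MessageStore> is reached.
    public static List<Message> ReadMessages(TextReader input)
    {
        var result = new List<Message>();
        using (var tr = new XmlTextReader(input))
        {
            Message current = null;
            while (tr.Read())
            {
                if (tr.NodeType != XmlNodeType.Element) continue;
                switch (tr.LocalName)
                {
                    case "Messages":
                        current = new Message();
                        result.Add(current);
                        break;
                    case "PR_ENTRYID":
                        if (current != null) current.EntryId = tr.ReadString();
                        break;
                    case "MessageType":
                        if (current != null)
                            current.MessageType = XmlConvert.ToInt32(tr.ReadString());
                        break;
                    case "MessageStore":
                        // Assumed to have been handled by the first pass.
                        return result;
                }
            }
        }
        return result;
    }
}
```

With 50,000-80,000 elements it may be preferable to hand each Message to a callback instead of collecting them in a list; the list is used here only to keep the sketch short.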

--
Daniel Cazzulino [MVP XML]
Clarius Consulting SA
http://weblogs.asp.net/cazzu
http://aspnet2.com
Nov 12 '05 #4

"Russell Mangel" <ru*****@tymer.net> wrote in message news:OG**************@TK2MSFTNGP09.phx.gbl...
> What would be the best way to parse this XML document?
> I want to avoid using XMLDocument.
> I don't know if I should use XMLTextReader, or Xpath classes.

Use XmlTextReader.

> There is only one element <MessageStore> element in the document, "always"
> at the end of the document.
> There will be thousands of <Messages> elements, "always" before
> <MessageStore> element.

There are a couple more questions to contemplate in this design.

Are there any attributes? (or, if there are attributes, do you care about parsing
them?) The sample you've posted appears element-oriented.

What's the longest the text content of a child text node can be? All of the text
nodes in the sample appear of small order, <16 characters or so.

> I need to get the <PR_STORE_ENTRY_ID> data "first".
> I then need all the data from <Messages> elements.
>
> <?xml version="1.0" standalone="yes" ?>
> <MessagesToArchive>
> <!-- 80,000 more messages like the one below. -->
> <Messages>
> <PR_ENTRYID>0000000003</PR_ENTRYID>
> <MessageType>64</MessageType>
> </Messages>
> <MessageStore>
> <PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
> </MessageStore>
> </MessagesToArchive>


The next question is, does it matter to you, if the XML document you're
processing "looks like" the following

<MessagesToArchive>
<MessageStore>
<PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
</MessageStore>
<Messages>
<MessageType>64</MessageType>
<PR_ENTRYID>0000000003</PR_ENTRYID>
</Messages>
<!-- 80,000 more messages like the one above. -->
</MessagesToArchive>

Where is the XML document coming from? Is it coming from the file system?
Is it coming from a random-access stream (i.e., the stream supports seeking)?

A fourth question, although it's just an implementation detail, is whether there
are CDATA sections or entity references used by the document?

Given:
* The document's content is element-only and text.
* The document may be processed in reverse document-order.
* The source of the document is random-access.

Then the solution is to write a custom subclass of Stream or StreamReader
that wraps the existing Stream from which you're reading the XML document,
and reads it in reverse, replaces "</" sequences with "<" as they're encountered,
replaces "<" sequences with "</" as they are encountered, and reverses the
text of child nodes.

The presence of empty elements, attributes, entity references, and CDATA
sections complicates this implementation slightly. If the length of text nodes is
larger than the block reading size you use (4096 is usually a good size), then
the implementation is complicated slightly further because it may take more
than 2 reads to read a child node.

By intercepting the incoming XML at the stream level, you can make the XML
look like whatever you want for the XmlTextReader. In your situation, it's
definitely most efficient to make the document appear upside-down.

In the [idealized] case of the input source being a file in the file system, it's
easy to seek to the end of the file and then read blocks starting from the end
of the file, working your way up, and processing each buffer from its end
upward as well.

If you want to process the <Messages> elements from the top down, in
document order as they are now, then I'd just read the last cluster of the
file in directly and extract the <MessageStore> element using text processing
techniques. Directly accessing part of the file in this manner, when it's a
random-access Stream, is going to be much faster than making two passes.
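
A rough sketch of that tail read is shown below. It assumes the file uses an ASCII-compatible encoding such as UTF-8 and that the whole <PR_STORE_ENTRY_ID> element lies within the last tailBytes bytes of the file. Both of those are assumptions, as are the names TailReadSketch and ReadStoreEntryId.

```csharp
using System;
using System.IO;
using System.Text;

public static class TailReadSketch
{
    // Reads at most tailBytes from the end of the file and extracts the
    // <PR_STORE_ENTRY_ID> text with plain string searching.
    // Returns null if the element is not fully inside the tail.
    public static string ReadStoreEntryId(string path, int tailBytes)
    {
        const string open = "<PR_STORE_ENTRY_ID>";
        const string close = "</PR_STORE_ENTRY_ID>";

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            // Seek directly to the start of the tail; nothing before it is read.
            long start = Math.Max(0, fs.Length - tailBytes);
            fs.Seek(start, SeekOrigin.Begin);

            var buffer = new byte[(int)(fs.Length - start)];
            int total = 0;
            while (total < buffer.Length)
            {
                int n = fs.Read(buffer, total, buffer.Length - total);
                if (n == 0) break; // unexpected end of file
                total += n;
            }

            // Assumption: ASCII-compatible encoding, so the tag names can be
            // found even if the tail starts in the middle of a character.
            string tail = Encoding.UTF8.GetString(buffer, 0, total);
            int i = tail.LastIndexOf(open, StringComparison.Ordinal);
            if (i < 0) return null;
            i += open.Length;
            int j = tail.IndexOf(close, i, StringComparison.Ordinal);
            return j < 0 ? null : tail.Substring(i, j - i);
        }
    }
}
```

Once the store entry ID is known, a single forward XmlTextReader pass over the <Messages> elements would be enough, so the bulk of the file is parsed only once.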

The key point, when dealing with megabyte-plus XML documents efficiently,
is to never forget that they're coming to you via a Stream. The input source
may give you options.
Derek Harmon
Nov 12 '05 #5

There are no attributes in the document.

The elements whose names contain "ENTRY_ID" hold 250-400 character strings.
MessageType is an Int32.

> The next question is, does it matter to you, if the XML document you're
> processing "looks like" the following

No, it does not matter. However, I cannot rearrange the document; I didn't
write the code which creates these XML documents.

> Where is the XML document coming from?

The document is a file on the file system.

> A fourth question, although it's just an implementation detail, is whether
> there are CDATA sections or entity references used by the document?

There are none. The sample of the XML I originally posted is "exactly" what
I am working with; there are just thousands of <Messages> elements.
Every <Messages> element always contains 2 child elements, and all of them
contain data, never null.

These XML documents are perfectly consistent in every way.
So reading the file in backwards should be reasonably simple, and would
solve my problem of getting the <MessageStore> data first.
The documents are quite large; some of them are 25-60 MB on disk. So you can
see why I am trying to avoid parsing twice.

Thanks for your excellent thoughts, especially the reminder about using
Streams.

Russell Mangel
Las Vegas, NV
Nov 12 '05 #6

I noticed something in your example, which I will have to look up in
documentation as I have not used this before:
I don't know what purpose the following line of code serves.

tr.NameTable.Add("MessageStore");

Thanks for your sample code, and thoughts.
Russell Mangel
Las Vegas, NV

Nov 12 '05 #7

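The purpose of that line is to get an atomized string that can be used for a fast
reference comparison against the current LocalName, instead of a string-value
comparison. Note that in C#, == and != on two strings compare values, so cast
one side to object (or use Object.ReferenceEquals) to get a reference comparison.
Read the documentation for XmlNameTable for more.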
You should read Oleg's post too:
http://www.tkachenko.com/blog/archives/000181.html
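
As a small sketch of the idea (AtomizedNameSketch and CountElements are hypothetical names), the loop below counts elements by comparing the reader's LocalName to the atomized string by reference. It relies on the reader returning names from its own NameTable.

```csharp
using System;
using System.IO;
using System.Xml;

public static class AtomizedNameSketch
{
    // Counts elements named elementName using reference comparison
    // against an atomized string from the reader's NameTable.
    public static int CountElements(TextReader input, string elementName)
    {
        using (var tr = new XmlTextReader(input))
        {
            // Add() returns the single atomized instance for this name.
            // Typing it as object makes == below a reference comparison.
            object atom = tr.NameTable.Add(elementName);
            int count = 0;
            while (tr.Read())
            {
                // With two string operands, C# would use String's
                // value-based ==; the cast to object avoids that.
                if (tr.NodeType == XmlNodeType.Element &&
                    (object)tr.LocalName == atom)
                {
                    count++;
                }
            }
            return count;
        }
    }
}
```

Whether this is measurably faster than a plain string comparison for short names like these would need to be profiled on the real 25-60 MB files.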

--
Daniel Cazzulino [MVP XML]
Clarius Consulting SA
http://weblogs.asp.net/cazzu
http://aspnet2.com
Nov 12 '05 #8

Hi Russell,

I'd like to know if this issue has been resolved yet. Is there anything
else that I can help with? I'm still monitoring this thread. If you have any
questions, please feel free to post them in the community.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."

Nov 12 '05 #9

Thanks for following through.
Yes, I have this working well now.

Russell Mangel
Las Vegas, NV

Nov 12 '05 #10

I am glad you pointed this out; I was unaware of this fact.
I will modify my code to use this technique instead of comparing
strings by value.

Russell Mangel
Las Vegas, NV

Nov 12 '05 #11
