"Russell Mangel" <ru*****@tymer.net> wrote in message news:OG**************@TK2MSFTNGP09.phx.gbl...
: What would be the best way to parse this XML document?
: I want to avoid using XmlDocument.
: I don't know if I should use XmlTextReader, or XPath classes.
Use XmlTextReader.
: There is only one <MessageStore> element in the document, "always"
: at the end of the document.
: There will be thousands of <Messages> elements, "always" before
: the <MessageStore> element.
There are a couple more questions to contemplate in this design.
Are there any attributes? (or, if there are attributes, do you care about parsing
them?) The sample you've posted appears element-oriented.
What's the longest the text content of a child text node can be? All of the text
nodes in the sample appear to be short, under 16 characters or so.
: I need to get the <PR_STORE_ENTRY_ID> data "first".
: I then need all the data from <Messages> elements.
:
: <?xml version="1.0" standalone="yes" ?>
: <MessagesToArchive>
: <!-- 80,000 more messages like the one below. -->
: <Messages>
: <PR_ENTRYID>0000000003</PR_ENTRYID>
: <MessageType>64</MessageType>
: </Messages>
: <MessageStore>
: <PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
: </MessageStore>
: </MessagesToArchive>
The next question is whether it matters to you if the XML document you're
processing "looks like" the following:
<MessagesToArchive>
<MessageStore>
<PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>
</MessageStore>
<Messages>
<MessageType>64</MessageType>
<PR_ENTRYID>0000000003</PR_ENTRYID>
</Messages>
<!-- 80,000 more messages like the one above. -->
</MessagesToArchive>
Where is the XML document coming from? Is it coming from the file system?
Is it coming from a random-access stream (i.e., the stream supports seeking)?
A fourth question, although it's just an implementation detail, is whether the
document uses CDATA sections or entity references.
Given:
* The document's content is element-only and text.
* The document may be processed in reverse document-order.
* The source of the document is random-access.
Then the solution is to write a custom subclass of Stream or StreamReader
that wraps the existing Stream from which you're reading the XML document
and reads it in reverse. As it goes, it turns each end tag into a start tag
("</" becomes "<") and each start tag into an end tag ("<" becomes "</"),
and it reverses the text of child nodes so that it reads forward again.
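To make the transformation concrete, here is a minimal sketch in Python (not .NET, and not a Stream subclass) of what the wrapper would have to produce. It works on a whole string in memory and on whole tokens rather than raw bytes, so the text never gets flipped in the first place. The names TOKEN and reverse_document are made up for illustration, and it assumes exactly the "Given" conditions above: no attributes, comments, CDATA sections, empty-element tags, or XML declaration.

```python
import re
import xml.etree.ElementTree as ET

# One end tag, one start tag, or one run of text between tags.
TOKEN = re.compile(r"</[^<>]+>|<[^<>/]+>|[^<>]+")

def reverse_document(xml_text):
    """Return xml_text with its elements in reverse document order.

    Assumes element-only content plus text (no attributes, comments,
    CDATA sections, empty-element tags, or XML declaration).
    """
    out = []
    for token in reversed(TOKEN.findall(xml_text)):
        if token.startswith("</"):
            out.append("<" + token[2:])    # end tag becomes a start tag
        elif token.startswith("<"):
            out.append("</" + token[1:])   # start tag becomes an end tag
        else:
            out.append(token)              # text is emitted unchanged
    return "".join(out)

sample = (
    "<MessagesToArchive>"
    "<Messages><PR_ENTRYID>0000000003</PR_ENTRYID>"
    "<MessageType>64</MessageType></Messages>"
    "<MessageStore><PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>"
    "</MessageStore>"
    "</MessagesToArchive>"
)

# The parser now sees <MessageStore> before any <Messages> element.
root = ET.fromstring(reverse_document(sample))
print([child.tag for child in root])   # ['MessageStore', 'Messages']
```

A real implementation would do the same thing incrementally, block by block, inside the wrapping stream, instead of loading the whole document.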
The presence of empty elements, attributes, entity references, and CDATA
sections complicates this implementation slightly. If the length of text nodes is
larger than the block reading size you use (4096 is usually a good size), then
the implementation is complicated slightly further because it may take more
than two reads to read a child node.
By intercepting the incoming XML at the stream level, you can make the XML
look like whatever you want for the XmlTextReader. In your situation, it's
definitely most efficient to make the document appear upside-down.
In the [idealized] case of the input source being a file in the file system, it's
easy to seek to the end of the file and read blocks starting from the end,
working your way up toward the beginning, and then to scan each of those
buffers from its end upward as well.
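The block-reading part might look roughly like this, again sketched in Python rather than against the .NET Stream class. The function name read_blocks_backward is hypothetical, and the 4096 default is just the block size suggested earlier. It only reverses the order of the blocks; scanning inside each block from its end upward is left to the caller.

```python
import io
import os

def read_blocks_backward(stream, block_size=4096):
    """Yield the stream's bytes as blocks, last block first.

    Bytes inside each block stay in forward order; only the order of
    the blocks is reversed. The stream must support seeking.
    """
    stream.seek(0, os.SEEK_END)
    position = stream.tell()
    while position > 0:
        size = min(block_size, position)   # the first block of the file may be short
        position -= size
        stream.seek(position)
        yield stream.read(size)

# Usage with an in-memory stream standing in for a file:
data = io.BytesIO(b"abcdefghij")
print(list(read_blocks_backward(data, block_size=4)))
# [b'ghij', b'cdef', b'ab']
```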
If you want to process the <Messages> elements from the top down, in
document order as they are now, then I'd just read the last cluster of the
file in directly and extract the <MessageStore> element using text processing
techniques. Directly accessing part of the file in this manner, when it's a
random-access Stream, is going to be much faster than making two passes.
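A rough sketch of that shortcut, in Python for illustration only: seek near the end, read the tail, and pull the value out with a regular expression. The helper name read_store_entry_id and the tail_size parameter are made up here, and it assumes the file is UTF-8 (or plain ASCII), that <PR_STORE_ENTRY_ID> carries no attributes, and that the whole element falls within the last tail_size bytes.

```python
import io
import os
import re

STORE_ID = re.compile(r"<PR_STORE_ENTRY_ID>([^<]*)</PR_STORE_ENTRY_ID>")

def read_store_entry_id(stream, tail_size=4096):
    """Return the PR_STORE_ENTRY_ID text found in the last tail_size
    bytes of a seekable binary stream, or None if it is not there."""
    stream.seek(0, os.SEEK_END)
    length = stream.tell()
    stream.seek(max(0, length - tail_size))
    # errors="replace" tolerates a seek that lands mid-character.
    tail = stream.read().decode("utf-8", errors="replace")
    match = STORE_ID.search(tail)
    return match.group(1).strip() if match else None

document = io.BytesIO(
    b"<MessagesToArchive>"
    b"<Messages><PR_ENTRYID>0000000003</PR_ENTRYID>"
    b"<MessageType>64</MessageType></Messages>"
    b"<MessageStore><PR_STORE_ENTRY_ID>FFFFFFFFF</PR_STORE_ENTRY_ID>"
    b"</MessageStore>"
    b"</MessagesToArchive>"
)
print(read_store_entry_id(document))   # FFFFFFFFF
```

After that, you rewind to the start and make the single forward pass over the <Messages> elements with the reader as usual. If the function returns None, the tail was too small and you'd retry with a larger one.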
The key point, when dealing with megabyte-plus XML documents efficiently,
is to never forget that they're coming to you via a Stream. The input source
may give you options.
Derek Harmon