
Querying Very Large XML

I am working on a project that will have about 500,000 records in an XML
document. This document will need to be queried with XPath, and records
will need to be updated. I was thinking about splitting the XML into
several documents (perhaps 50,000 records per document) to be more
efficient, but that would make things a lot more complex because searches
need to go across all 500,000 records. Can anyone point me to some best
practices / performance techniques for handling large XML documents?
Obviously, the XmlDocument object is probably not a good choice...

Nov 12 '05 #1
6 Replies



Greg,

The recommended store for querying large XML documents in .NET is the
XPathDocument. However, the XPathDocument, just like the XmlDocument,
keeps all of the document's data plus all the DOM-related information in
memory, so you will need sufficient memory on your server. On top of
that, you have to live with whatever query optimization the XPathDocument
does under the covers. If you wanted to add any custom indexing, you
would first have to walk the entire document to build your custom index.
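
For illustration, a minimal sketch of that kind of in-memory query (the
file name, element names and the XPath expression are just placeholders):

// Load the whole document into memory and run an XPath query over it.
// "records.xml" and the query below are placeholders.
using System;
using System.Xml.XPath;

class QueryDemo
{
    static void Main()
    {
        XPathDocument doc = new XPathDocument("records.xml"); // whole file is read into memory
        XPathNavigator nav = doc.CreateNavigator();

        // Select every record whose <id> element equals 42 (made-up structure).
        XPathNodeIterator it = nav.Select("/records/record[id='42']");
        while (it.MoveNext())
        {
            Console.WriteLine(it.Current.Value);
        }
    }
}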

Would you be able to add a SQL Server database (MSDE might do, but
preferably SQL 2005 Express, currently in Beta 2) to your environment?
Is your XML format strongly structured, so it's easily shredded into
relational tables? If that's the case, you'd save yourself the headache
of managing memory and indexes and let SQL Server do the work for you.
With SQL 2005 you can even store the XML document as a whole in a column
and let SQL Server do the indexing.
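
As a rough sketch of that approach (assuming .NET 2.0; the table, column
and XQuery path below are made up, and error handling is omitted):

// Store a whole XML document in a SQL Server 2005 xml column and let the
// server query it. "RecordDocs", "Doc" and the XQuery path are made up.
using System.Data;
using System.Data.SqlClient;

class XmlColumnDemo
{
    static void Save(string connectionString, string xml)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO RecordDocs (Doc) VALUES (@doc)", conn))
        {
            cmd.Parameters.Add("@doc", SqlDbType.Xml).Value = xml;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    static int CountMatches(string connectionString)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            // exist() returns 1 for rows whose document contains a matching record
            "SELECT COUNT(*) FROM RecordDocs " +
            "WHERE Doc.exist('/records/record[id=42]') = 1", conn))
        {
            conn.Open();
            return (int)cmd.ExecuteScalar();
        }
    }
}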

HTH,
Christoph Schittko
MS MVP XML
http://weblogs.asp.net/cschittko

Nov 12 '05 #2

Thanks for the info, Chris. I was thinking along the same lines with the
XML objects. Unfortunately, a database isn't really an option for us due
to the cost (or perceived cost... and databases need DBAs...). A big
reason for using XML is to avoid having to use and maintain a database.
We are phasing out an old VAX program that currently does everything
file-based, and we are trying to do a similar thing with XML on the .NET
platform. The data tends to be relatively simple: the general process is
taking a fixed-format flat file, converting it to XML, and then allowing
the user to build queries for tweaking some of the data. The queries
would be XPath (of course, built with a nice UI)... Perhaps one of the
biggest challenges is eliminating duplicate records across the entire
data set. I'll probably have to come up with an interesting data
structure to do that efficiently in conjunction with XML, since I won't
be loading everything into the XPathDocument at once. I would think
everything else that has to be done should be relatively doable by
chunking out the files and using XPathDocuments and XPath queries.
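
One way to spot duplicates without holding all of the chunks in memory at
once might be a single streaming pass that only keeps the key values. A
sketch, where the "chunks" directory, the element names and the idea of a
single <key> element per record are all invented:

// Detect duplicate keys across many chunk files without loading them as
// DOMs. Only the key values (not the documents) are kept in memory.
using System;
using System.Collections;
using System.IO;
using System.Xml;

class DuplicateScan
{
    static void Main()
    {
        Hashtable seen = new Hashtable(); // key value -> first file it appeared in

        foreach (string file in Directory.GetFiles("chunks", "*.xml"))
        {
            XmlTextReader reader = new XmlTextReader(file);
            try
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "key")
                    {
                        string key = reader.ReadString(); // text content of <key>
                        if (seen.ContainsKey(key))
                            Console.WriteLine("Duplicate key {0} in {1} and {2}",
                                              key, seen[key], file);
                        else
                            seen[key] = file;
                    }
                }
            }
            finally
            {
                reader.Close();
            }
        }
    }
}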

"Christoph Schittko [MVP]" <IN**********@austin.rr.com> wrote in message
news:ew****************@TK2MSFTNGP14.phx.gbl...

Greg,

The recommended store to query large XML documents in .NET is the
XPathDocument. However, the XPathDocument, just as the XmlDocument will
keep all data from the document plus all the DOM-related information in
memory, i.e. you will need sufficient memory in your server. On top of
that, you have to deal with whatever query optimizing the XPathDocument
does under the covers. If you wanted to add any custom indexing, you
would have first walk the entire document to build your custom index.

Would be able to add a SQL Server database (MSDE might do, but
preferably SQL 2005 Express, currently in Beta 2) to your environment?
Is your XML format strongly structured, so it's easily shredded into
relational tables? If that's the case, you'd save yourself the headache
of managing memory and indexes and let SQL Server do the work for you.
With SQL 2005 you can even store the XML document as a whole in a column
and let SQL Server do the indexing.

HTH,
Christoph Schittko
MS MVP XML
http://weblogs.asp.net/cschittko
-----Original Message-----
From: Greg [mailto:na]
Posted At: Monday, December 27, 2004 1:49 PM
Posted To: microsoft.public.dotnet.xml
Conversation: Querying Very Large XML
Subject: Querying Very Large XML

I am working on a project that will have about 500,000 records in an

XML
document. This document will need to be queried with XPath, and

records
will need to be updated. I was thinking about splitting up the XML

into
several XML documents (perhaps 50,000 per document) to be more

efficient
but
this will make things a lot more complex because the searching needs

to go
accross all 500,000 records. Can anyone point me to some best

practices
/
performance techniques for handling large XML documents? Obviously,

the
XmlDocument object is probably not a good choice...


Nov 12 '05 #3

Dare Obasanjo wrote this article about efficient ways to handle (read,
update) large XML files:

http://msdn.microsoft.com/webservice...l/largexml.asp
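
One broad pattern for reading and updating documents that are too big to
load as a DOM is a streaming copy: pull nodes with an XmlReader and write
them back out, modified where needed, with an XmlWriter. A rough sketch
with invented file and element names and an invented edit:

// Stream through records.xml, materialize one <record> at a time as a small
// DOM fragment, tweak it, and write it straight back out to a new file.
using System.Xml;

class StreamingUpdate
{
    static void Main()
    {
        XmlTextReader reader = new XmlTextReader("records.xml");
        XmlTextWriter writer = new XmlTextWriter("records.updated.xml",
                                                 System.Text.Encoding.UTF8);
        writer.Formatting = Formatting.Indented;
        XmlDocument fragmentDoc = new XmlDocument();

        writer.WriteStartDocument();
        writer.WriteStartElement("records");

        reader.MoveToContent(); // position on the root <records> element
        reader.Read();          // move to its first child
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
            {
                // Materialize just this record, not the whole file.
                XmlElement record = (XmlElement)fragmentDoc.ReadNode(reader);

                // Hypothetical update: flag every record that was touched.
                record.SetAttribute("processed", "true");

                record.WriteTo(writer);
            }
            else
            {
                reader.Read();
            }
        }

        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Close();
        reader.Close();
    }
}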

Mujtaba.

"Greg" <na> wrote in message news:Ou*************@TK2MSFTNGP11.phx.gbl...
I am working on a project that will have about 500,000 records in an XML
document. This document will need to be queried with XPath, and records
will need to be updated. I was thinking about splitting up the XML into
several XML documents (perhaps 50,000 per document) to be more efficient but this will make things a lot more complex because the searching needs to go
accross all 500,000 records. Can anyone point me to some best practices / performance techniques for handling large XML documents? Obviously, the
XmlDocument object is probably not a good choice...

Nov 12 '05 #4


Greg,

I was hoping that MSDE (or SQL 2005 Express) might let you get around
the "we don't want to run a database" argument. Both versions are free
and shouldn't require much maintenance, yet they provide the same XML
support as the full version of SQL Server. The only downside is that
they are not really built for concurrent access by a larger number of
simultaneous users.

You sound like you know what you're in for by not using a database, in
terms of concurrency management, access control, indexing across the
individual chunks, transactional integrity, etc., i.e. all those reasons
why databases are popular ;).

If you've determined that it's still more economical to build that
functionality yourself, then that's hard to argue with. The trickiest
pieces to figure out are which file to add new XML to and how to perform
updates that span multiple files, but again... you sound like you're
well aware of what you're in for.

HTH,
Christoph Schittko
MVP XML
http://weblogs.asp.net/cschittko




Nov 12 '05 #5

Mujtaba, thanks for the link to the article. Those are some interesting
ideas!

Greg

"Mujtaba Syed" <mu*****@marlabs.com> wrote in message
news:ew**************@TK2MSFTNGP09.phx.gbl...
Dare Obasanjo wrote this article about efficient ways to handle (read,
update) large XML files:

http://msdn.microsoft.com/webservice...l/largexml.asp
Mujtaba.

"Greg" <na> wrote in message news:Ou*************@TK2MSFTNGP11.phx.gbl...
I am working on a project that will have about 500,000 records in an XML
document. This document will need to be queried with XPath, and records
will need to be updated. I was thinking about splitting up the XML into
several XML documents (perhaps 50,000 per document) to be more efficient but
this will make things a lot more complex because the searching needs to go accross all 500,000 records. Can anyone point me to some best practices /
performance techniques for handling large XML documents? Obviously,

the XmlDocument object is probably not a good choice...


Nov 12 '05 #6

Chris, please see my responses inline...

"Christoph Schittko [MVP]" <IN**********@austin.rr.com> wrote in message
news:e9**************@TK2MSFTNGP15.phx.gbl...

> Greg,
>
> I was hoping that MSDE (or SQL 2005 Express) might let you get around
> the "we don't want to run a database" argument. Both versions are free
> and shouldn't require much maintenance. Yet they provide the same XML
> support as the full version of SQL Server. The only downside is that
> they are not really built for concurrent access by a bigger number of
> users simultaneously.

MSDE is a good alternative, but there are definitely some costs
associated with running it (at least that is what my manager will tell
me). MSDE is vulnerable to many of the same exploits that SQL Server is,
so it will have to be patched periodically. With my particular
application, that's probably the only real maintenance cost that would
need to be considered, since I will be reloading the entire data set
frequently. However, I will definitely have to think about it as an
alternative. It would be interesting to estimate what it would take to
do an MSDE solution vs. an XML solution. Even if it were cheaper to
develop initially, I think I could be challenged with the "what about
maintenance and security" concerns. Concurrency is definitely not an
issue because it is a one-user application. The only technical issue
would be if there is a limit to how much data you can store in MSDE,
though I don't believe there is one.

> You sound like you know what you're in for with not using a database in
> terms of concurrency management, access control, indexing across the
> individual chunks, transactional integrity, etc., i.e. all those reasons
> why databases are popular ;).

Transactional integrity and indexing are other good points. With the
data spanning multiple files, I'll probably need to be able to roll back
changes if an update on one of the files fails. That may mean creating
new files and then deleting the old ones once all of the updates
succeed. I'm not that concerned about indexing, since the searching I'm
doing can be on just about any field. XPath seems to do a pretty good
job since almost everything is loaded in memory (at least for the file
I'm searching...)
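
A common way to get that kind of all-or-nothing behavior at the file
level is to write every changed chunk to a temporary file first and only
swap the new files in once all of them have been written successfully.
A rough sketch, where UpdateChunk stands in for whatever per-file change
is actually being made:

// Poor-man's multi-file "transaction": write all updated chunks to temp
// files first, and only replace the originals once every write succeeded.
using System;
using System.Collections;
using System.IO;

class ChunkedCommit
{
    static void ApplyUpdates(string[] chunkFiles)
    {
        ArrayList written = new ArrayList(); // temp files created so far

        // Phase 1: write every updated chunk to a side file; on any failure,
        // discard the temp files and leave the originals untouched.
        try
        {
            foreach (string file in chunkFiles)
            {
                string temp = file + ".new";
                UpdateChunk(file, temp); // placeholder: stream file -> temp with edits applied
                written.Add(temp);
            }
        }
        catch (Exception)
        {
            foreach (string temp in written)
                File.Delete(temp);
            throw;
        }

        // Phase 2: all writes succeeded, so swap the new files in.
        // (A failure here still needs manual recovery from the .bak copies.)
        foreach (string file in chunkFiles)
        {
            File.Copy(file, file + ".bak", true); // keep the old chunk as a backup
            File.Copy(file + ".new", file, true);
            File.Delete(file + ".new");
        }
    }

    static void UpdateChunk(string source, string destination)
    {
        // Placeholder for the real update (e.g. a streaming XmlReader/XmlWriter pass).
        File.Copy(source, destination, true);
    }
}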

> If you've determined that it's still more economical to build that
> functionality yourself, then that's hard to argue with. The trickiest
> pieces to figure out are which file to add new XML to and how to
> perform updates that span multiple files, but again... you sound like
> you're well aware of what you're in for.

I won't actually need to add new XML; I'll just need to update certain
records in my particular case. That definitely simplifies things.
Regardless, I think I'm going to take a look at what it would take to do
an MSDE solution. Thanks for the suggestion.

Greg




Nov 12 '05 #7
