
Querying Very Large XML

I am working on a project that will have about 500,000 records in an XML
document. This document will need to be queried with XPath, and records
will need to be updated. I was thinking about splitting the XML into
several documents (perhaps 50,000 records per document) to be more
efficient, but that would make things a lot more complex because searches
need to go across all 500,000 records. Can anyone point me to some best
practices / performance techniques for handling large XML documents?
Obviously, the XmlDocument object is probably not a good choice...

Nov 12 '05 #1
6 Replies



Greg,

The recommended store for querying large XML documents in .NET is the
XPathDocument. However, the XPathDocument, just like the XmlDocument,
keeps all of the document's data plus all the DOM-related information in
memory, so you will need sufficient memory on your server. On top of
that, you have to live with whatever query optimization the XPathDocument
does under the covers. If you wanted to add any custom indexing, you
would first have to walk the entire document to build your custom index.
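
For illustration, a minimal sketch of that kind of in-memory query (the
file name, element names and the XPath expression are just placeholders):

// Load the whole document into memory and run an XPath query over it.
// "records.xml" and the query below are placeholders.
using System;
using System.Xml.XPath;

class QueryDemo
{
    static void Main()
    {
        XPathDocument doc = new XPathDocument("records.xml"); // whole file is read into memory
        XPathNavigator nav = doc.CreateNavigator();

        // Select every record whose <id> element equals 42 (made-up structure).
        XPathNodeIterator it = nav.Select("/records/record[id='42']");
        while (it.MoveNext())
        {
            Console.WriteLine(it.Current.Value);
        }
    }
}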

Would you be able to add a SQL Server database (MSDE might do, but
preferably SQL 2005 Express, currently in Beta 2) to your environment?
Is your XML format strongly structured, so it's easily shredded into
relational tables? If that's the case, you'd save yourself the headache
of managing memory and indexes and let SQL Server do the work for you.
With SQL 2005 you can even store the XML document as a whole in a column
and let SQL Server do the indexing.
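
As a rough sketch of that approach (assuming .NET 2.0; the table, column
and XQuery path below are made up, and error handling is omitted):

// Store a whole XML document in a SQL Server 2005 xml column and let the
// server query it. "RecordDocs", "Doc" and the XQuery path are made up.
using System.Data;
using System.Data.SqlClient;

class XmlColumnDemo
{
    static void Save(string connectionString, string xml)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO RecordDocs (Doc) VALUES (@doc)", conn))
        {
            cmd.Parameters.Add("@doc", SqlDbType.Xml).Value = xml;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    static int CountMatches(string connectionString)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            // exist() returns 1 for rows whose document contains a matching record
            "SELECT COUNT(*) FROM RecordDocs " +
            "WHERE Doc.exist('/records/record[id=42]') = 1", conn))
        {
            conn.Open();
            return (int)cmd.ExecuteScalar();
        }
    }
}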

HTH,
Christoph Schittko
MS MVP XML
http://weblogs.asp.net/cschittko

Nov 12 '05 #2

Thanks for the info, Chris. I was thinking along the same lines with the
XML objects. Unfortunately, a database isn't really an option for us due
to the cost (or perceived cost... and databases need DBAs...). A big
reason for using XML is to avoid having to use and maintain a database.
We are phasing out an old VAX program that currently does everything
file-based, and we are trying to do a similar thing with XML on the .NET
platform. The data tends to be relatively simple: the general process is
taking a fixed-format flat file, converting it to XML, and then allowing
the user to build queries for tweaking some of the data. The queries
would be XPath (of course, built with a nice UI)... Perhaps one of the
biggest challenges is eliminating duplicate records across the entire
data set. I'll probably have to come up with an interesting data
structure to do that efficiently in conjunction with XML, since I won't
be loading everything into the XPathDocument at once. I would think
everything else that has to be done should be relatively doable by
chunking out the files and using XPathDocuments and XPath queries.
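
One way to spot duplicates without holding all of the chunks in memory at
once might be a single streaming pass that only keeps the key values. A
sketch, where the "chunks" directory, the element names and the idea of a
single <key> element per record are all invented:

// Detect duplicate keys across many chunk files without loading them as
// DOMs. Only the key values (not the documents) are kept in memory.
using System;
using System.Collections;
using System.IO;
using System.Xml;

class DuplicateScan
{
    static void Main()
    {
        Hashtable seen = new Hashtable(); // key value -> first file it appeared in

        foreach (string file in Directory.GetFiles("chunks", "*.xml"))
        {
            XmlTextReader reader = new XmlTextReader(file);
            try
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "key")
                    {
                        string key = reader.ReadString(); // text content of <key>
                        if (seen.ContainsKey(key))
                            Console.WriteLine("Duplicate key {0} in {1} and {2}",
                                              key, seen[key], file);
                        else
                            seen[key] = file;
                    }
                }
            }
            finally
            {
                reader.Close();
            }
        }
    }
}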

"Christoph Schittko [MVP]" <IN**********@austin.rr.com> wrote in message
news:ew****************@TK2MSFTNGP14.phx.gbl...

Greg,

The recommended store to query large XML documents in .NET is the
XPathDocument. However, the XPathDocument, just as the XmlDocument will
keep all data from the document plus all the DOM-related information in
memory, i.e. you will need sufficient memory in your server. On top of
that, you have to deal with whatever query optimizing the XPathDocument
does under the covers. If you wanted to add any custom indexing, you
would have first walk the entire document to build your custom index.

Would be able to add a SQL Server database (MSDE might do, but
preferably SQL 2005 Express, currently in Beta 2) to your environment?
Is your XML format strongly structured, so it's easily shredded into
relational tables? If that's the case, you'd save yourself the headache
of managing memory and indexes and let SQL Server do the work for you.
With SQL 2005 you can even store the XML document as a whole in a column
and let SQL Server do the indexing.

HTH,
Christoph Schittko
MS MVP XML
http://weblogs.asp.net/cschittko
-----Original Message-----
From: Greg [mailto:na]
Posted At: Monday, December 27, 2004 1:49 PM
Posted To: microsoft.public.dotnet.xml
Conversation: Querying Very Large XML
Subject: Querying Very Large XML

I am working on a project that will have about 500,000 records in an

XML
document. This document will need to be queried with XPath, and

records
will need to be updated. I was thinking about splitting up the XML

into
several XML documents (perhaps 50,000 per document) to be more

efficient
but
this will make things a lot more complex because the searching needs

to go
accross all 500,000 records. Can anyone point me to some best

practices
/
performance techniques for handling large XML documents? Obviously,

the
XmlDocument object is probably not a good choice...


Nov 12 '05 #3

Dare Obasanjo wrote this article about efficient ways to handle (read,
update) large XML files:

http://msdn.microsoft.com/webservice...l/largexml.asp
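
One broad pattern for reading and updating documents that are too big to
load as a DOM is a streaming copy: pull nodes with an XmlReader and write
them back out, modified where needed, with an XmlWriter. A rough sketch
with invented file and element names and an invented edit:

// Stream through records.xml, materialize one <record> at a time as a small
// DOM fragment, tweak it, and write it straight back out to a new file.
using System.Xml;

class StreamingUpdate
{
    static void Main()
    {
        XmlTextReader reader = new XmlTextReader("records.xml");
        XmlTextWriter writer = new XmlTextWriter("records.updated.xml",
                                                 System.Text.Encoding.UTF8);
        writer.Formatting = Formatting.Indented;
        XmlDocument fragmentDoc = new XmlDocument();

        writer.WriteStartDocument();
        writer.WriteStartElement("records");

        reader.MoveToContent(); // position on the root <records> element
        reader.Read();          // move to its first child
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
            {
                // Materialize just this record, not the whole file.
                XmlElement record = (XmlElement)fragmentDoc.ReadNode(reader);

                // Hypothetical update: flag every record that was touched.
                record.SetAttribute("processed", "true");

                record.WriteTo(writer);
            }
            else
            {
                reader.Read();
            }
        }

        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Close();
        reader.Close();
    }
}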

Mujtaba.

"Greg" <na> wrote in message news:Ou*************@TK2MSFTNGP11.phx.gbl...
I am working on a project that will have about 500,000 records in an XML
document. This document will need to be queried with XPath, and records
will need to be updated. I was thinking about splitting up the XML into
several XML documents (perhaps 50,000 per document) to be more efficient but this will make things a lot more complex because the searching needs to go
accross all 500,000 records. Can anyone point me to some best practices / performance techniques for handling large XML documents? Obviously, the
XmlDocument object is probably not a good choice...

Nov 12 '05 #4


Greg,

I was hoping that MSDE (or SQL 2005 Express) might let you get around
the "we don't want to run a database" argument. Both versions are free
and shouldn't require much maintenance, yet they provide the same XML
support as the full version of SQL Server. The only downside is that
they are not really built for concurrent access by a larger number of
simultaneous users.

You sound like you know what you're in for by not using a database, in
terms of concurrency management, access control, indexing across the
individual chunks, transactional integrity, etc., i.e. all those reasons
why databases are popular ;).

If you've determined that it's still more economical to build that
functionality yourself, then that's hard to argue with. The trickiest
pieces to figure out are which file to add new XML to and how to perform
updates that span multiple files, but again... you sound like you're
well aware of what you're in for.

HTH,
Christoph Schittko
MVP XML
http://weblogs.asp.net/cschittko




Nov 12 '05 #5

Mujtaba, thanks for the link to the article. Those are some interesting
ideas!

Greg

"Mujtaba Syed" <mu*****@marlabs.com> wrote in message
news:ew**************@TK2MSFTNGP09.phx.gbl...
Dare Obasanjo wrote this article about efficient ways to handle (read,
update) large XML files:

http://msdn.microsoft.com/webservice...l/largexml.asp
Mujtaba.

"Greg" <na> wrote in message news:Ou*************@TK2MSFTNGP11.phx.gbl...
I am working on a project that will have about 500,000 records in an XML
document. This document will need to be queried with XPath, and records
will need to be updated. I was thinking about splitting up the XML into
several XML documents (perhaps 50,000 per document) to be more efficient but
this will make things a lot more complex because the searching needs to go accross all 500,000 records. Can anyone point me to some best practices /
performance techniques for handling large XML documents? Obviously,

the XmlDocument object is probably not a good choice...


Nov 12 '05 #6

Chris, please see my responses inline...

"Christoph Schittko [MVP]" <IN**********@austin.rr.com> wrote in message
news:e9**************@TK2MSFTNGP15.phx.gbl...

> Greg,
>
> I was hoping that MSDE (or SQL 2005 Express) might let you get around
> the "we don't want to run a database" argument. Both versions are free
> and shouldn't require much maintenance. Yet they provide the same XML
> support as the full version of SQL Server. The only downside is that
> they are not really built for concurrent access by a bigger number of
> users simultaneously.

MSDE is a good alternative, but there are definitely some costs
associated with running it (at least that is what my manager will tell
me). MSDE is vulnerable to many of the same exploits that SQL Server is,
so it will have to be patched periodically. With my particular
application, that's probably the only real maintenance cost that would
need to be considered, since I will be reloading the entire data set
frequently. However, I will definitely have to think about it as an
alternative. It would be interesting to estimate what it would take to
do an MSDE solution vs. an XML solution. Even if it were cheaper to
develop initially, I think I could be challenged with the "what about
maintenance and security" concerns. Concurrency is definitely not an
issue because it is a one-user application. The only technical issue
would be if there is a limit to how much data you can store in MSDE,
though I don't believe there is one.

> You sound like you know what you're in for with not using a database in
> terms of concurrency management, access control, indexing across the
> individual chunks, transactional integrity, etc., i.e. all those reasons
> why databases are popular ;).

Transactional integrity and indexing are other good points. With the
data spanning multiple files, I'll probably need to be able to roll back
changes if an update on one of the files fails. That may mean creating
new files and then deleting the old ones once all of the updates
succeed. I'm not that concerned about indexing, since the searching I'm
doing can be on just about any field. XPath seems to do a pretty good
job since almost everything is loaded in memory (at least for the file
I'm searching...)
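
A common way to get that kind of all-or-nothing behavior at the file
level is to write every changed chunk to a temporary file first and only
swap the new files in once all of them have been written successfully.
A rough sketch, where UpdateChunk stands in for whatever per-file change
is actually being made:

// Poor-man's multi-file "transaction": write all updated chunks to temp
// files first, and only replace the originals once every write succeeded.
using System;
using System.Collections;
using System.IO;

class ChunkedCommit
{
    static void ApplyUpdates(string[] chunkFiles)
    {
        ArrayList written = new ArrayList(); // temp files created so far

        // Phase 1: write every updated chunk to a side file; on any failure,
        // discard the temp files and leave the originals untouched.
        try
        {
            foreach (string file in chunkFiles)
            {
                string temp = file + ".new";
                UpdateChunk(file, temp); // placeholder: stream file -> temp with edits applied
                written.Add(temp);
            }
        }
        catch (Exception)
        {
            foreach (string temp in written)
                File.Delete(temp);
            throw;
        }

        // Phase 2: all writes succeeded, so swap the new files in.
        // (A failure here still needs manual recovery from the .bak copies.)
        foreach (string file in chunkFiles)
        {
            File.Copy(file, file + ".bak", true); // keep the old chunk as a backup
            File.Copy(file + ".new", file, true);
            File.Delete(file + ".new");
        }
    }

    static void UpdateChunk(string source, string destination)
    {
        // Placeholder for the real update (e.g. a streaming XmlReader/XmlWriter pass).
        File.Copy(source, destination, true);
    }
}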

> If you've determined that it's still more economical to build that
> functionality yourself, then that's hard to argue with. The trickiest
> pieces to figure out are which file to add new XML to and how to
> perform updates that span multiple files, but again... you sound like
> you're well aware of what you're in for.

I won't actually need to add new XML; I'll just need to update certain
records in my particular case. That definitely simplifies things.
Regardless, I think I'm going to take a look at what it would take to do
an MSDE solution. Thanks for the suggestion.

Greg




Nov 12 '05 #7
