Bytes | Software Development & Data Engineering Community

huge xml

joe
hi,

I have to import a huge XML file into our system. The problem is that
the XML source can be larger than 2GB. I wanted to use the
XmlTextReader class for this, but it is limited to 2GB (which makes
sense to me; why would anyone create such big XML files?).
The XML file basically contains a big list of products, which I want
to read one by one. The XML itself refers to a DTD. Now I have two
problems. First, full validation would of course take ages (the import
needs to run later in a server environment), so I need a way to tell
the XmlTextReader to ignore the DTD. Second, I don't want to check
that the whole XML document is well-formed (that would also require
processing the entire document); I only need to know that each
'product' node is well-formed. Now my question: which class should I
use to accomplish this? Any ideas?

thank you,
Sergio
Nov 12 '05 #1
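For the DTD half of the question, a minimal sketch (not any poster's actual code) of streaming the file with XmlTextReader while refusing to fetch the external DTD might look like the following. The 'product' element name comes from the question; the file name and catalog root are assumptions for illustration:

```csharp
using System;
using System.Xml;

class ProductScan
{
    // Stream the file; XmlTextReader never loads the whole document.
    // Setting XmlResolver to null stops it from fetching the external DTD.
    public static int CountProducts(string path)
    {
        int count = 0;
        XmlTextReader reader = new XmlTextReader(path);
        reader.XmlResolver = null;   // don't resolve the DTD reference
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "product")
                count++;             // handle one product here
        }
        reader.Close();
        return count;
    }

    public static void Main()
    {
        System.IO.File.WriteAllText("products.xml",
            "<catalog><product id=\"1\"/><product id=\"2\"/></catalog>");
        Console.WriteLine(CountProducts("products.xml"));   // prints 2
    }
}
```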
First, where did you read that XmlTextReader is limited to 2GB? I can't
imagine why this would be so, because XmlTextReader does not read the
entire document in order to check for well-formedness or validity. Your
statement that the reader has to read the entire document to check for
these things is false: so far as I know, XmlTextReader reads the
document line-by-line, or even character-by-character, remembering only
enough state to perform validation. The size of the file should not
matter.

(If that didn't make sense, think of it this way: XmlTextReader reads
the document bit by bit, and only knows that the document is
well-formed and valid _up to the point it's read so far_.)

I can think of only one situation in which a parser would have to
remember a lot of past history in order to validate, and that is if
you're using keys and references in XML, and the parser wants to
guarantee that the keys are unique and that the references match up.
Since DTDs can't express keys and references, you're safe there.

Besides, you can always instruct the reader not to validate, although
you can't tell it not to check for well-formedness (which definitely
_doesn't_ require having the entire document in hand).

I wrote something I call an XmlFragmentReader on top of XmlTextReader
that does what you want: it reads an XML document in chunks, returning
each chunk (in your case each product) as an XmlDocument (a DOM tree).
I designed it precisely for parsing documents like yours.
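The XmlFragmentReader itself isn't shown in the thread, but the same idea can be sketched with stock APIs: XmlDocument.ReadNode lifts exactly one element subtree off a streaming reader into a DOM node, so only one product is ever in memory at a time. The element and file names below are assumptions for illustration:

```csharp
using System;
using System.Collections;
using System.Xml;

class FragmentDemo
{
    // Pull each <product> subtree off the streaming reader as a small
    // DOM tree, without ever building a DOM for the whole document.
    public static ArrayList ProductNames(string path)
    {
        ArrayList names = new ArrayList();
        XmlTextReader reader = new XmlTextReader(path);
        reader.XmlResolver = null;           // skip any external DTD
        XmlDocument doc = new XmlDocument();
        reader.MoveToContent();              // position on the root element
        reader.Read();                       // step inside the root
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "product")
            {
                // ReadNode consumes the whole <product> subtree and leaves
                // the reader positioned just after it.
                XmlNode product = doc.ReadNode(reader);
                names.Add(product.SelectSingleNode("name").InnerText);
            }
            else
            {
                reader.Read();
            }
        }
        reader.Close();
        return names;
    }

    public static void Main()
    {
        System.IO.File.WriteAllText("catalog.xml",
            "<catalog><product><name>hammer</name></product>" +
            "<product><name>saw</name></product></catalog>");
        foreach (string n in ProductNames("catalog.xml"))
            Console.WriteLine(n);            // prints hammer, then saw
    }
}
```

Note that ReadNode advances the reader past the element it consumes, which is why the loop above only calls Read() on the non-product branch.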

However, I'm still wondering where you read that XmlTextReader can
handle only up to 2GB of XML. I'm always open to being proven wrong. :)

Nov 12 '05 #2
Bruce,

See
http://msdn.microsoft.com/library/de...hXmlReader.asp

After the class table, there is the following paragraph:
Note The XmlTextReader and XmlValidatingReader are constrained on the
size of files they can read. They cannot read files larger than 2 gigabytes.
If it is possible, split the source file into smaller, multiple files.
Like you, I'm baffled why there is such a limitation. Does anyone know if
this limitation will still be true with 2.0?

Richard Rosenheim

Nov 12 '05 #3
How, exactly, does Microsoft expect one to "split the source files into
smaller, multiple files"?

Using msxsl, which is probably based on XmlTextReader in the first
place? That would be rich: "Please split the file up so that
XmlTextReader can read it. By the way, the only way to split the file
up is with a tool that uses XmlTextReader...."

I would love to know what that limitation is all about, since all
documentation I've read and everything I know about XmlTextReader
indicates that it does _not_ read the entire file at once.

Sergio,

It shouldn't be too hard to write a cheap-n-nasty little application
that just reads the XML as a text stream, recognizes your (unique) XML
format, and breaks the file into manageable chunks (say every n
products) that you could read using XmlTextReader.
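That cheap-n-nasty splitter could be sketched like this; it assumes, purely for illustration, that every </product> close tag ends its line and that each chunk gets wrapped in a fresh root element so it stays well-formed:

```csharp
using System;
using System.IO;

class Splitter
{
    // Split a huge product list into chunk files of up to
    // productsPerChunk products each, purely by text scanning.
    public static int Split(string path, int productsPerChunk)
    {
        int chunk = 0, inChunk = 0;
        StreamWriter writer = null;
        using (StreamReader sr = new StreamReader(path))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                if (line.Contains("<catalog") || line.Contains("</catalog"))
                    continue;                 // drop the original root tags
                if (writer == null)
                {
                    writer = new StreamWriter("chunk" + chunk + ".xml");
                    writer.WriteLine("<catalog>");   // new root per chunk
                }
                writer.WriteLine(line);
                if (line.Contains("</product>") && ++inChunk == productsPerChunk)
                {
                    writer.WriteLine("</catalog>");
                    writer.Close();
                    writer = null;
                    chunk++;
                    inChunk = 0;
                }
            }
        }
        if (writer != null) { writer.WriteLine("</catalog>"); writer.Close(); chunk++; }
        return chunk;   // number of chunk files written
    }

    public static void Main()
    {
        File.WriteAllText("big.xml",
            "<catalog>\n<product>a</product>\n<product>b</product>\n<product>c</product>\n</catalog>\n");
        Console.WriteLine(Split("big.xml", 2));   // prints 2
    }
}
```

Each chunk file can then be handed to XmlTextReader normally, since every one of them is comfortably under the limit.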

Nov 12 '05 #4
Bruce Wood wrote:
First, where did you read that XmlTextReader is limited to 2GB? I can't
imagine why this would be so, because XmlTextReader does not read the
entire document in order to check for well-formedness or validity.


It happens to be Int32.MaxValue :)

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #5
joe wrote:
I have to import a huge XML file into our system. The problem is that
the XML source can be larger than 2GB. [...]

First of all, XmlTextReader never validates. It also can't resolve
external entities; it only checks that the referred DTD exists. And it
doesn't read the whole document unless you keep calling the Read() method.
The 2GB limit is apparently the int limit on some internal counters. I'm
not sure about .NET 2.0, but I will test this case for sure.
I believe the best solution would be to split the document at the time it
gets created. How do you get the document?

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #6
I thought of the same question. The best I could figure was that the author
of that recommendation assumed that it would be possible for the source
program to split the XML data into multiple files. Of course, that is also
assuming that you have any say/control/input over the program that
originally generated the XML file.

Richard Rosenheim


Nov 12 '05 #7
Ahh, I hadn't thought of that. Yes, you're right.

Of course, I still wonder why XmlTextReader is keeping internal state
of the kind that would care how long the document is...?

Nov 12 '05 #8
Joe,
I'm not sure how real the 2GB limit is, as a FileStream has an
Int64.MaxValue file-size limit! In other words, the size of the file should
really be limited only by the amount of disk space you have!

I just created a 2.2GB XML file via XmlTextWriter and was able to read the
entire file via XmlTextReader.

I suspect the 2GB limit has more to do with XmlTextReader.LineNumber and
XmlTextReader.LinePosition, both of which are Int32 values (as Oleg
suggests, 2GB is Int32.MaxValue).

Unfortunately it took half an hour to create the file and an hour to read
the file on my desktop; I may set up a couple of other test cases on my
test server (faster CPU, more disk space) with larger files...

Hope this helps
Jay
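A scaled-down version of that experiment can be sketched as below: write n products with XmlTextWriter, then stream them back with XmlTextReader. The element names are placeholders, and the count here is tiny just to keep the sketch fast (the real file was 2.2GB):

```csharp
using System;
using System.Xml;

class WriterDemo
{
    // Write n <product> elements, then count them back by streaming.
    public static int RoundTrip(int n)
    {
        XmlTextWriter writer = new XmlTextWriter("generated.xml", null);
        writer.Formatting = Formatting.Indented;   // one node per line
        writer.WriteStartElement("catalog");
        for (int i = 0; i < n; i++)
        {
            writer.WriteStartElement("product");
            writer.WriteAttributeString("id", i.ToString());
            writer.WriteEndElement();
        }
        writer.WriteEndElement();
        writer.Close();

        int count = 0;
        XmlTextReader reader = new XmlTextReader("generated.xml");
        while (reader.Read())
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "product")
                count++;
        reader.Close();
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip(1000));   // prints 1000
    }
}
```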


Nov 12 '05 #9
Interesting. In case anyone is interested:

The 2GB limit appears to apply to XmlTextReader.LinePosition only.

If I create an XML file (3.9GB in my test case) without formatting, so that
the file is basically a single line longer than 2GB, I get an
IndexOutOfRangeException as LinePosition reaches 2G.

However, if I create an XML file where each node is on its own line (8.5GB
in my test case), plus bunches of blank lines (to get more than 2G lines),
then the XmlTextReader.LineNumber property starts returning negative
numbers. However, no exception...

So it appears to me that the hard 2GB limit is really on the length of one
line within the document...

Jay
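That diagnosis lines up with the property types: XmlTextReader.LineNumber and XmlTextReader.LinePosition are both declared as plain int, so on a single line longer than Int32.MaxValue characters the position counter has nowhere to go. A toy illustration of the two counters (nothing overflows here, of course):

```csharp
using System;
using System.IO;
using System.Xml;

class LineInfoDemo
{
    public static void Main()
    {
        // The whole document sits on one line, so LineNumber stays 1
        // while LinePosition keeps climbing as the reader advances.
        XmlTextReader reader = new XmlTextReader(
            new StringReader("<catalog><product/><product/></catalog>"));
        while (reader.Read())
        {
            Console.WriteLine(reader.NodeType + " " + reader.Name
                + " line=" + reader.LineNumber
                + " pos=" + reader.LinePosition);
        }
    }
}
```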



Nov 12 '05 #10
Thanks, Jay. That's very useful information.

Nov 12 '05 #11
