huge xml

joe
Hi,

I have to import a huge XML file into our system. The problem is that
the XML source can be larger than 2GB. I wanted to use the
XmlTextReader class for this purpose, but it's limited to 2GB (which
makes sense to me; why would people make such big XML files?).
The XML file basically contains a big list of products, which I want
to read one by one. The XML itself refers to a DTD. Now I have two
problems. First, validation will of course take ages (this needs to
run later in a server environment), so I need a way to tell the
XmlTextReader to ignore the DTD. Second, I don't want to check that
the whole XML document is well formed (that also requires processing
the whole document); I only need to know that each 'product' node is
well formed. Now my question: which class should I use to accomplish
this? Any ideas?

thank you,
Sergio
Nov 12 '05 #1
First, where did you read that XmlTextReader is limited to 2GB? I can't
imagine why this would be so, because XmlTextReader does not read the
entire document in order to check for well-formedness or validity. Your
statement that the reader has to read the entire document to check for
these things is false: so far as I know, XmlTextReader reads the
document line-by-line, or even character-by-character, remembering only
enough state to perform validation. The size of the file should not
matter.

(If that didn't make sense, think of it this way: XmlTextReader reads
the document bit by bit, and only knows that the document is
well-formed and valid _up to the point it's read so far_.)
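
A rough illustration of that "up to the point it's read so far" behaviour. This is a Python sketch using the standard library's pull parser, not the .NET XmlTextReader, on the assumption that the idea carries over to any streaming parser. The function name and the sample chunks are made up for the example.

```python
import xml.etree.ElementTree as ET

def products_before_error(chunks):
    """Feed text chunks to a pull parser; return (ids seen, first error or None)."""
    parser = ET.XMLPullParser(events=("end",))
    seen, error = [], None
    try:
        for chunk in chunks:
            parser.feed(chunk)
            # Elements completed so far can be collected without waiting
            # for the end of the document.
            for _event, elem in parser.read_events():
                if elem.tag == "product":
                    seen.append(elem.get("id"))
    except ET.ParseError as exc:
        # The parser only complains once it reaches the broken part.
        error = exc
    return seen, error

# Hypothetical input: two good products, then a mismatched closing tag.
chunks = ['<products><product id="1"/>',
          '<product id="2"/>',
          '<product id="3"></oops>']
seen, error = products_before_error(chunks)
print(seen, "error" if error else "no error")
```

The products that arrived before the damaged part are already in hand when the error is reported, which is the property that matters for a file too large to hold in memory.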

I can think of only one situation in which a parser would have to
remember a lot of past history in order to validate, and that is if
you're using keys and references in XML, and the parser wants to
guarantee that the keys are unique and that the references match up.
Since DTDs can't express keys and references, you're safe there.

Besides, you can always instruct the reader not to validate, although
you can't tell it not to check for well-formedness (which definitely
_doesn't_ require having the entire document in hand).

I wrote something I call an XmlFragmentReader on top of XmlTextReader
that does what you want: it reads an XML document in chunks, returning
each chunk (in your case each product) as an XmlDocument (a DOM tree).
I designed it precisely for parsing documents like yours.
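
The XmlFragmentReader source isn't shown in this thread, so the following is only a sketch of the same idea in Python: stream through the document and hand back each product as its own small tree. It assumes the products are direct children of the root element; `iter_fragments`, `catalog` and the sample data are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

def iter_fragments(source, tag="product"):
    """Yield each <tag> element as a small tree, then release it."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)      # the first event is the root's start tag
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Assumption: products sit directly under the root, so clearing
            # the root drops the product that was just handed out.
            root.clear()

# Hypothetical two-product document held in memory for the demo.
doc = io.BytesIO(
    b'<catalog>'
    b'<product id="a"><name>Lamp</name></product>'
    b'<product id="b"><name>Desk</name></product>'
    b'</catalog>'
)
pairs = [(p.get("id"), p.findtext("name")) for p in iter_fragments(doc)]
print(pairs)
```

Because each product is cleared after the caller is done with it, memory use should stay roughly proportional to one product rather than to the whole file.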

However, I'm still wondering where you read that XmlTextReader can
handle only up to 2GB of XML. I'm always open to being proven wrong. :)

Nov 12 '05 #2
Bruce,

See
http://msdn.microsoft.com/library/de...hXmlReader.asp

After the class table, there is the following paragraph:
Note: The XmlTextReader and XmlValidatingReader are constrained on the
size of files they can read. They cannot read files larger than 2 gigabytes.
If it is possible, split the source file into smaller, multiple files.

Like you, I'm baffled as to why there is such a limitation. Does anyone know if
this limitation will still be true with .NET 2.0?

Richard Rosenheim
Nov 12 '05 #3
How, exactly, does Microsoft expect one to "split the source files into
smaller, multiple files"?

Using msxsl, which is probably based on XmlTextReader in the first
place? That would be rich: "Please split the file up so that
XmlTextReader can read it. By the way, the only way to split the file
up is with a tool that uses XmlTextReader...."

I would love to know what that limitation is all about, since all
documentation I've read and everything I know about XmlTextReader
indicates that it does _not_ read the entire file at once.

Sergio,

It shouldn't be too hard to write a cheap-n-nasty little application
that just reads the XML as a text stream, recognizes your (unique) XML
format, and breaks the file into manageable chunks (say every n
products) that you could read using XmlTextReader.
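
A minimal sketch of such a splitter, written in Python purely for illustration. It treats the file as plain text and cuts it into batches of n products. Assumptions that would need checking against the real file: products are not nested, are never written as self-closing tags, and the text `</product>` never appears inside a comment or CDATA section. All names here are hypothetical.

```python
import io
import re

# "<product" followed by whitespace or ">", so "<products>" is not matched.
OPEN = re.compile(r"<product(?=[\s>])")
CLOSE = "</product>"

def split_products(stream, per_chunk, block_size=65536):
    """Yield lists of raw product strings, at most per_chunk per list."""
    buffer, batch = "", []
    for block in iter(lambda: stream.read(block_size), ""):
        buffer += block
        while True:
            match = OPEN.search(buffer)
            if match is None:
                # Keep a short tail in case a tag straddles two blocks.
                buffer = buffer[-len(CLOSE):]
                break
            end = buffer.find(CLOSE, match.end())
            if end == -1:
                # Product not complete yet: wait for more text.
                buffer = buffer[match.start():]
                break
            end += len(CLOSE)
            batch.append(buffer[match.start():end])
            buffer = buffer[end:]
            if len(batch) == per_chunk:
                yield batch
                batch = []
    if batch:
        yield batch

# Hypothetical five-product document; a tiny block size exercises the
# case where tags are cut in half between reads.
xml = "<products>" + "".join(
    f'<product id="{i}"><n>p{i}</n></product>' for i in range(5)
) + "</products>"
chunks = list(split_products(io.StringIO(xml), per_chunk=2, block_size=7))
print([len(c) for c in chunks])
```

Each batch could then be wrapped in a root element of its own and written to a separate file, which a normal XML reader can handle.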

Nov 12 '05 #4
Bruce Wood wrote:
First, where did you read that XmlTextReader is limited to 2GB? I can't
imagine why this would be so, because XmlTextReader does not read the
entire document in order to check for well-formedness or validity.


2GB happens to be the value of Int32.MaxValue :)

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #5
First of all, XmlTextReader never validates. It also can't resolve
external entities. It only checks whether the referenced DTD exists. And it
doesn't read the whole document unless you keep calling the Read() method.
The 2GB limit is apparently the int limit of some internal counters. I'm
not sure about .NET 2.0, but I will test this case for sure.
I believe the best solution would be to split the document at the time it
gets created. How do you get the document?
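
To make the distinction between checking well-formedness and validating concrete, here is a small Python sketch (an analogue, not XmlTextReader itself). Python's ElementTree parser is also non-validating, and as far as I know it does not fetch the external DTD at all, which differs slightly from the existence check described above. The document text and the DTD file name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: it names a DTD file that does not exist anywhere.
WITH_DTD = ('<?xml version="1.0"?>'
            '<!DOCTYPE products SYSTEM "products.dtd">'
            '<products><product id="1"><name>Lamp</name></product></products>')

def is_well_formed(text):
    """True if the text parses; nothing is checked against the DTD."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok_with_dtd = is_well_formed(WITH_DTD)
# Break a closing tag so the document is no longer well formed.
ok_broken = is_well_formed(WITH_DTD.replace("</name>", "</nme>"))
print(ok_with_dtd, ok_broken)
```

The first document is accepted even though nothing was validated; only the second one, with the mismatched tag, is rejected.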

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #6
I thought of the same question. The best I could figure was that the author
of that recommendation assumed that it would be possible for the source
program to split the XML data into multiple files. Of course, that is also
assuming that you have any say/control/input over the program that
originally generated the XML file.

Richard Rosenheim
Nov 12 '05 #7
Ahh, I hadn't thought of that. Yes, you're right.

Of course, I still wonder why XmlTextReader is keeping internal state
of the kind that would care how long the document is...?

Nov 12 '05 #8
Joe,
I'm not sure how real the 2GB limit is, as a FileStream has an Int64.MaxValue
file size limit! In other words, the size of the file should really be limited
only by the amount of disk space you have!

I just created a 2.2GB XML file via XmlTextWriter and was able to read the
entire file via XmlTextReader.

I suspect the 2GB limit has more to do with XmlTextReader.LineNumber and
XmlTextReader.LinePosition, both of which are Int32 values (as Oleg suggests,
2GB is Int32.MaxValue).

Unfortunately it took half an hour to create the file and an hour to read it
on my desktop. I may set up a couple of other test cases on my test server
(faster CPU, more disk space) with larger files...

Hope this helps
Jay
Nov 12 '05 #9
Interesting. In case anyone is interested:

The 2GB limit appears to apply to XmlTextReader.LinePosition only.

If I create an XML file (3.9GB in my test case) without formatting, so that
the file is basically a single line longer than 2GB, I get an IndexOutOfRange
exception as the LinePosition reaches 2G.

However, if I create an XML file where each node is on its own line (8.5GB in
my test case), plus bunches of blank lines (to get more than 2G lines), then the
XmlTextReader.LineNumber property starts returning negative numbers, but
there is no exception...

So it appears to me that the 2GB limit is really a limit on the length of one
line within the document...
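
The negative line numbers are what you would expect from a signed 32-bit counter that is incremented past Int32.MaxValue without overflow checking. The arithmetic can be sketched as below; this is Python modelling the wraparound, and whether XmlTextReader's counters really behave this way internally is an assumption based only on the observations above.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647, the value of Int32.MaxValue

def as_int32(n):
    """Interpret n as a signed 32-bit integer with unchecked wraparound."""
    return (n + 2**31) % 2**32 - 2**31

# A counter that has just reached the limit, and the same counter one
# increment later.
last_good_line = as_int32(INT32_MAX)
next_line = as_int32(INT32_MAX + 1)
print(last_good_line, next_line)
```

A position that wraps to a negative value and is then used as an array index would plausibly explain an index-out-of-range error on a single very long line, though that part is speculation.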

Jay
Nov 12 '05 #10
