
C# & GC

Hi,

I created an application which transforms huge XML files (~150 MB) to CSV files,
and I am facing a strange problem. The first 1000 rows are parsed in 1 sec; after 20,000
rows the speed drops to 100 rows per sec, and after 70,000 rows it drops to 20 rows
per sec (I need to parse ~2,500,000 rows).

To me it looks like a GC problem, but I have no idea how to fix it :(

Any ideas are welcome.

--
Thanks,
Maxim
Jul 21 '05 #1
Maxim Kazitov wrote:
<snip>


If you do this transform by reading the whole file into a
representation of the XML file and then generating CSV, you are
imposing serious memory pressure. If you can read an XML element and
write a CSV record, without each iteration adding (much, if at all)
to your working set, you might go much faster.

If you do need to build a representation of the whole file, and each
XML attribute name and value is a distinct string, you can often save
a lot by "interning" string values, eliminating duplicate string
values.
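
For illustration, a minimal C# sketch of the interning idea (the file name
huge.xml and the attribute handling are hypothetical):

using System;
using System.Xml;

class InternDemo
{
    static void Main()
    {
        XmlTextReader reader = new XmlTextReader("huge.xml");
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.HasAttributes)
            {
                while (reader.MoveToNextAttribute())
                {
                    // String.Intern returns the canonical instance for equal
                    // strings, so thousands of identical values share one object.
                    string name = String.Intern(reader.Name);
                    string value = String.Intern(reader.Value);
                    Console.WriteLine("{0}={1}", name, value);
                }
                reader.MoveToElement();
            }
        }
        reader.Close();
    }
}

One caveat: interned strings live for the lifetime of the process, so this
only pays off when the same values really do repeat.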

It's also entirely possible that this has nothing to do with the GC.
What you describe is compatible with some code that's walking a linked
list that keeps growing ....

--

www.midnightbeach.com
Jul 21 '05 #2
Hi Jon,

I use XmlTextReader, so I don't read all the XML at once. During the
parsing I build small XML documents (one XmlDocument per row) and apply a set
of XPaths to each document. I have a couple of Hashtables in my code, but
they're pretty small.
Thanks,
Max
"Jon Shemitz" <jo*@midnightbeach.com> wrote in message
news:42***************@midnightbeach.com...
<snip>

Jul 21 '05 #3
Are you creating a new XmlDocument each time, or reusing the same one? You
should ensure that you are simply reusing the same instance and loading each
row's XML string into it.

I ran into memory issues when I created a lot of XmlDocument instances.
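
A minimal sketch of that reuse pattern (the row fragments and the XPath
expression are hypothetical):

using System;
using System.Xml;

class ReuseDemo
{
    static void ProcessRows(string[] rowXmlFragments)
    {
        // One XmlDocument for the whole run; LoadXml replaces its contents,
        // so every row reuses the same instance instead of allocating a new one.
        XmlDocument doc = new XmlDocument();
        foreach (string rowXml in rowXmlFragments)
        {
            doc.LoadXml(rowXml);    // e.g. "<row><id>1</id></row>"
            XmlNode id = doc.SelectSingleNode("/row/id");
            if (id != null)
                Console.WriteLine(id.InnerText);
        }
    }
}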

"Maxim Kazitov" <mv*****@tut.by> wrote in message
news:%2****************@TK2MSFTNGP14.phx.gbl...
<snip>


Jul 21 '05 #4
Maxim,

The reason is probably the way you build your CSV files.
If you first create them as long strings in memory, then the problem is
clear.

Can you show that code?

Cor
Jul 21 '05 #5
On Mon, 28 Mar 2005 00:01:40 -0500, "Maxim Kazitov" <mv*****@tut.by>
wrote:
I use XmlTextReader, so I don't read all the XML at once. During the parsing
I build small XML documents (one XmlDocument per row) and apply a set of
XPaths to each document.


1. Make sure that you "let go" of each XmlDocument when you no longer
use it. All references must have gone out of scope, been set to null, or
been reassigned to the new XmlDocument. The old documents must not stay
around in memory.

2. Call System.GC.Collect() immediately before you create a new
XmlDocument. Microsoft says this shouldn't be necessary, but I have seen
for myself that the garbage collector's performance can completely break
down if you repeatedly allocate large pools of objects without manual
Collect calls in-between.
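
A sketch of what that suggestion would look like; note that forcing
collections is generally discouraged and worth measuring before and after
(the row source here is hypothetical):

using System;
using System.Xml;

class CollectDemo
{
    static void ProcessRows(string[] rowXmlFragments)
    {
        foreach (string rowXml in rowXmlFragments)
        {
            // Force a full collection before each allocation burst, as
            // suggested above; profile this -- it often hurts throughput.
            GC.Collect();
            GC.WaitForPendingFinalizers();

            XmlDocument doc = new XmlDocument();
            doc.LoadXml(rowXml);
            // ... apply XPath queries, emit the CSV line ...
        } // doc goes out of scope here, so the next Collect can reclaim it
    }
}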
--
http://www.kynosarges.de
Jul 21 '05 #6

"Maxim Kazitov" <mv*****@tut.by> wrote in message
news:u0*************@TK2MSFTNGP15.phx.gbl...
<snip>


I could be wrong, but it looks like you are using more memory than is
physically available; as a result the system starts paging and finally starts
thrashing. That would mean you are holding references to objects that could
otherwise be collected by the GC, so it's not a GC problem, it's a design
problem.
I suggest you start looking at the memory consumption using Perfmon (the GC
Gen 0, 1 and 2 memory counters) and the paging activity.
If it looks like I'm right, you should check your object allocation pattern
and check whether you are holding references that could otherwise be
released; for instance, references stored in arrays/collections that are no
longer needed should be set to null.
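
For instance, a tiny sketch of releasing such references (the cache array
is hypothetical):

using System.Xml;

class ReleaseDemo
{
    static void DropProcessedRows(XmlDocument[] rowCache, int processedUpTo)
    {
        // An array slot keeps its XmlDocument reachable even after the row
        // has been written out; nulling the slot lets the GC reclaim it.
        for (int i = 0; i < processedUpTo; i++)
            rowCache[i] = null;
    }
}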

Willy.

Jul 21 '05 #7
You should also use a StringBuilder to build your output string. If
you are using string concatenation, you are creating many string
instances, and that is very inefficient. If you are concatenating a
string in a loop, always use StringBuilder.
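
A minimal sketch of the difference (the field array is hypothetical):

using System.Text;

class ConcatDemo
{
    static string BuildCsvLine(string[] fields)
    {
        // Each += on a string allocates a new instance and copies the old one:
        //   string line = ""; foreach (string f in fields) line += f + ",";
        // StringBuilder appends into one growable buffer instead.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.Length; i++)
        {
            if (i > 0) sb.Append(',');
            sb.Append(fields[i]);
        }
        return sb.ToString();
    }
}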

Jul 21 '05 #8
Maxim,
Do you need the XmlDocument? Have you considered using the XPathDocument
class instead? I don't know if it's more memory friendly than XmlDocument,
but I do know it is faster than XmlDocument...

Have you used PerfMon or the CLR Profiler to see what the lifetime of your
objects is? I would use PerfMon first, as Willy suggests, and if it suggests
a memory problem, then use the CLR Profiler to identify specific problems...

Info on the CLR Profiler:
http://msdn.microsoft.com/library/de...nethowto13.asp

http://msdn.microsoft.com/library/de...anagedapps.asp
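
A minimal sketch of the XPathDocument alternative (the per-row XML and the
XPath expression are hypothetical):

using System;
using System.IO;
using System.Xml.XPath;

class XPathDocDemo
{
    static void QueryRow(string rowXml)
    {
        // XPathDocument is a read-only store optimized for XPath queries,
        // so it carries less overhead than a fully editable XmlDocument.
        XPathDocument doc = new XPathDocument(new StringReader(rowXml));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator it = nav.Select("/row/field");
        while (it.MoveNext())
            Console.WriteLine(it.Current.Value);
    }
}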

Hope this helps
Jay
"Maxim Kazitov" <mv*****@tut.by> wrote in message
news:%2****************@TK2MSFTNGP14.phx.gbl...
<snip>


Jul 21 '05 #9
I already use StringBuilder

"Pat A" <pw*******@gmail.com> wrote in message
news:11*********************@o13g2000cwo.googlegro ups.com...
<snip>

Jul 21 '05 #10
Maxim Kazitov wrote:
I already use StringBuilder


Well, that's probably the chief source of your slowdown. Appending to
a StringBuilder is a lot faster than repeatedly doing BigString +=
SmallString, but it still periodically reallocates its internal
buffer and copies the data from the old buffer to the new one.
That's slow in several ways:

* Copying a big buffer runs at bus speeds AND purges the cache.
* Allocating a large object forces a garbage collection.
* Large objects are allocated on the Large Object Heap, a traditional
linked-list heap, not a compacted heap.

Can you write your CSV to a file, line by line? That would keep your
working set from growing, and would tend to make your algorithm's cost
linear with the number of rows read.
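
A minimal sketch of that line-by-line approach (the output path and line
source are hypothetical):

using System.IO;

class LineByLineDemo
{
    static void WriteCsv(string outputPath, string[] csvLines)
    {
        // One buffered writer for the whole run; each line goes straight
        // to the stream, so no giant in-memory string ever builds up.
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            foreach (string line in csvLines)
                writer.WriteLine(line);
        }
    }
}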

--

www.midnightbeach.com
Jul 21 '05 #11
Hi Maxim,
I think you're going to have to post a code snippet of what you're doing. My
general rule is to use the XmlTextReader to read node by node. DON'T create
XmlDocument objects, because they are heavy. DON'T do XPath. DON'T do XSLT.

My recommendation, without seeing what you are doing, is to read the XML
document node by node, and write to a file stream for each node.
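
A minimal sketch of that node-by-node pattern (the element and attribute
names are hypothetical):

using System.IO;
using System.Xml;

class StreamingDemo
{
    static void Convert(string xmlPath, string csvPath)
    {
        XmlTextReader reader = new XmlTextReader(xmlPath);
        using (StreamWriter writer = new StreamWriter(csvPath))
        {
            while (reader.Read())
            {
                // hypothetical input: <row id="..." name="..."/>
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
                {
                    writer.WriteLine(reader.GetAttribute("id") + "," +
                                     reader.GetAttribute("name"));
                }
            }
        }
        reader.Close();
    }
}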

"Maxim Kazitov" wrote:
<snip>


Jul 21 '05 #12
Hi, Maxim

I think you have several issues in your code. When you try to do stuff that
is way off from normal usage, you will need to handle a lot of low-level
stuff yourself instead of trusting the nice DOM and XPath objects.

Here are some rules of thumb for performance when parsing large files:
1. Avoid in-memory parsing. If your input file is 150 MB on disk you will
need A LOT of memory, and this will put unreasonable pressure on the entire
machine.
2. Avoid heavy object creation. You wrote that you create an XmlDocument for
every row; this will be VERY heavy.
3. Avoid XPath. I know that it is a great technology, but you are way out of
range and therefore need to handle everything down to the "metal".
4. Use XmlTextReader.
5. Combine FileStream, StreamWriter and StringBuilder to create a solution
that behaves well. Don't create the entire file in a StringBuilder; instead,
build parts of it in the StringBuilder, then flush these to the file to keep
memory consumption down (see the sketch below).
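
A minimal sketch of that buffered flush pattern (the 64 KB threshold and the
line source are hypothetical):

using System.IO;
using System.Text;

class ChunkedWriteDemo
{
    static void WriteInChunks(string path, string[] csvLines)
    {
        StringBuilder buffer = new StringBuilder();
        using (StreamWriter writer = new StreamWriter(new FileStream(path, FileMode.Create)))
        {
            for (int i = 0; i < csvLines.Length; i++)
            {
                buffer.Append(csvLines[i]).Append("\r\n");
                if (buffer.Length > 64 * 1024)   // flush roughly every 64 KB
                {
                    writer.Write(buffer.ToString());
                    buffer.Length = 0;           // reset without reallocating
                }
            }
            writer.Write(buffer.ToString());     // flush the tail
        }
    }
}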

// Daniel

"Maxim Kazitov" wrote:
<snip>

Jul 21 '05 #13
Have you tried running a profiling tool such as ANT?

Also... could you comment out some code to try to isolate the problem?
Look at Perfmon / % time in GC; this should be around 20%.

Steve

"Maxim Kazitov" <mv*****@tut.by> wrote in message
news:u0*************@TK2MSFTNGP15.phx.gbl...
<snip>

Jul 21 '05 #14

"Steve Drake" <Steve@_NOSPAM_.Drakey.co.uk> wrote in message
news:Ov**************@TK2MSFTNGP12.phx.gbl...
look at perfmon / % time in GC, this should be around 20%.

What makes you think that?

Willy.
Jul 21 '05 #15
How are you parsing the XML? Using DOM or XMLReaders?

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:%2****************@TK2MSFTNGP15.phx.gbl...

"Steve Drake" <Steve@_NOSPAM_.Drakey.co.uk> wrote in message
news:Ov**************@TK2MSFTNGP12.phx.gbl...
look at perfmon / % time in GC, this should be around 20%.

What makes you think that?

Willy.

Jul 21 '05 #16
You can feed the reader itself into the XSLT processor; is that an option in
your app design? This essentially means XSLT itself iterates through the
reader (and outputs to a stream).

Untested, but perhaps this gives you better results, unless you explicitly
need to do something specific in between each intermediate step. (In which
case, you could also have XSLT invoke a specified callback, BTW.)
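
Roughly, with the .NET 1.1 classes, that could look like this (the file names
are hypothetical, and the stylesheet would have to emit CSV text):

using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

class XsltDemo
{
    static void Transform(string xmlPath, string xslPath, string csvPath)
    {
        XslTransform xslt = new XslTransform();
        xslt.Load(xslPath);

        // XPathDocument consumes the reader; the transform then writes
        // directly to the output stream, never building a CSV string.
        XmlTextReader reader = new XmlTextReader(xmlPath);
        XPathDocument input = new XPathDocument(reader);

        using (FileStream output = new FileStream(csvPath, FileMode.Create))
        {
            xslt.Transform(input, null, output, null);
        }
        reader.Close();
    }
}

Note that XPathDocument still materializes the whole document in memory,
though more compactly than XmlDocument, so for a 150 MB input this trades
speed for a large working set.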

"Maxim Kazitov" <mv*****@tut.by> schrieb im Newsbeitrag
news:%2****************@TK2MSFTNGP14.phx.gbl...
<snip>


Jul 21 '05 #17
Hello Maxim,

I looked at each of your responses. Here is what you appear to be doing:
You read a very large XML document using XmlTextReader.

You apply XPath queries... but you want to use XmlDocument (for some reason)
because you want to change the nodes.

Some folks assume that you are using XSLT but, reading the messages in the
microsoft.public.dotnet.framework newsgroup, I don't see you saying anything
about XSLT. Perhaps you posted a response to only one newsgroup? Or did
others read that in?

If your input is XML and your output is CSV, and you are using
XmlTextReader, there is no reason to ever use XmlDocument. You can load
data from the XML into a class, manipulate the data as methods and
properties, and write it out as CSV, without ever using XmlDocument.

In fact, I'm wondering about something. How complex is the node structure
that is used to generate a single CSV record? Are we talking about hundreds
of attributes and tightly wound rules (like with a HIPAA XML transaction), or
are we talking about a sales invoice (with a few dozen fields and some
repeated columns)? If the latter, then use the XmlTextReader to get the
text for each CSV record, extract the InnerXml, and parse it by hand. You
are very likely to get performance that is useful and that you can
understand and optimize.
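
A minimal sketch of that hand-parsing idea (the invoice element and the
ExtractField helper are hypothetical):

using System.IO;
using System.Xml;

class HandParseDemo
{
    static void Convert(string xmlPath, string csvPath)
    {
        XmlTextReader reader = new XmlTextReader(xmlPath);
        using (StreamWriter writer = new StreamWriter(csvPath))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "invoice")
                {
                    // ReadInnerXml returns the record's markup as one string
                    // and advances past the element; parse the fields by hand.
                    string inner = reader.ReadInnerXml();
                    writer.WriteLine(ExtractField(inner, "customer") + "," +
                                     ExtractField(inner, "total"));
                }
                else
                {
                    reader.Read();
                }
            }
        }
        reader.Close();
    }

    // Hypothetical helper: naive extraction of <tag>value</tag>.
    static string ExtractField(string xml, string tag)
    {
        int start = xml.IndexOf("<" + tag + ">");
        int end = xml.IndexOf("</" + tag + ">");
        if (start < 0 || end < 0) return "";
        start += tag.Length + 2;
        return xml.Substring(start, end - start);
    }
}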

I hope this helps,

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--
"Maxim Kazitov" <mv*****@tut.by> wrote in message
news:u0*************@TK2MSFTNGP15.phx.gbl...
<snip>
Maxim

Jul 21 '05 #18
I do the same thing: read a huge XML file with XmlReader just in order to
split it into individual records. Then I read each individual record into an
XmlDocument and retrieve data from it with XPath, or transform it into
another XML with XSLT. And there is no performance drop during the process;
it keeps a constant pace of 600-900 records/sec, depending on the
complexity of the operation involved.
So I think there is no problem with allocating lots of XmlDocuments; the
problem must lie somewhere else...
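
For reference, a sketch of that split-then-load pattern (the record element
name and the XPath expression are hypothetical):

using System;
using System.Xml;

class SplitDemo
{
    static void Process(string xmlPath)
    {
        XmlTextReader reader = new XmlTextReader(xmlPath);
        XmlDocument doc = new XmlDocument();
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
            {
                // ReadOuterXml hands back one record's markup and advances the
                // reader, so only one small document is in memory at a time.
                doc.LoadXml(reader.ReadOuterXml());
                XmlNode node = doc.SelectSingleNode("//field");
                if (node != null)
                    Console.WriteLine(node.InnerText);
            }
            else
            {
                reader.Read();
            }
        }
        reader.Close();
    }
}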

Best Regards
Rafal Gwizdala

"john conwell" <john co*****@discussions.microsoft.com> wrote in message
news:65**********************************@microsof t.com...
<snip>


Jul 21 '05 #19
