Fast string operations

I've been perf testing an application of mine and I've noticed that there
are a lot (and I mean A LOT -- megabytes and megabytes of 'em) of System.String
instances being created.

I've done some analysis and I'm led to believe (but can't yet quantitatively
establish as fact) that the two basic culprits are a lot of calls to:

1.) if( someString.ToLower() == "somestring" )

and

2.) if( someString != null && someString.Trim().Length > 0 )

ToLower() generates a new string instance, as does Trim().

I believe that these are getting called many times and churning up a bunch
of strings faster than the GC can collect them, or perhaps there's some
weird interning/caching thing going on. Regardless, the number of string
instances grows and grows. It gets bumped down occasionally, but it's
basically 5 steps forward, 1 back.

For reference, this is an ASP application calling into .NET ComVisible
objects. So I assume this uses the workstation GC, right?

Anyhow, I think that I can solve problem (1) with String.Compare(), which
can perform in-place case-insensitive comparisons without generating new
string instances.
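
A minimal sketch of that idea for problem (1), assuming a hypothetical helper class named CaseInsensitive (the name is illustrative, not from the BCL). String.Compare(string, string, bool) compares in place and tolerates null arguments. It uses the current culture, so for some cultures the result may differ from the ToLower() version; on .NET 2.0 and later, String.Equals with StringComparison.OrdinalIgnoreCase is an ordinal alternative.

```csharp
using System;

public static class CaseInsensitive
{
    // Hypothetical helper: case-insensitive equality without calling ToLower().
    // String.Compare accepts null arguments, so no explicit null check is needed.
    public static bool EqualsIgnoreCase(string a, string b)
    {
        // The third argument (true) means "ignore case"; the comparison
        // is culture-sensitive and uses the current culture.
        return String.Compare(a, b, true) == 0;
    }
}
```

The original check would then read: if( CaseInsensitive.EqualsIgnoreCase(someString, "somestring") )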

Problem (2), however, is more complicated. There doesn't appear to be a
TrimmedLength or any type of method or property that can give me the length
of a string, minus whitespace and without generating a new string instance,
in the BCL.

I suppose I could do some unsafe, or even unmanaged code (which is what MSFT
did for all their string handling stuff inside System.String and using the
COMString stuff), but I'd like to try to avoid that, or at least use a
library that's already written and well tested.

Any thoughts?

Thanks in advance,
Chad Myers
Nov 17 '05 #1
Chad,

For the first scenario, your solution should give you a performance increase.

For the second scenario, you should use reflection once to get a
reference to the internal static character array WhitespaceChars on the
string class. Then, you can write a method which will cycle through a
string passed to it, like so:

public static bool TrimIsNullOrEmpty(string value)
{
    // A null string counts as empty.
    if (value == null)
    {
        return true;
    }

    // WhitespaceArray is the char[] obtained once through reflection.
    // Return false at the first character that is not found in it.
    foreach (char c in value)
    {
        if (Array.IndexOf<char>(WhitespaceArray, c) == -1)
        {
            return false;
        }
    }

    // The string is empty or made up entirely of whitespace.
    return true;
}

I used the generic version of the IndexOf method on the Array class in
order to eliminate boxing. Also, if you really want to squeeze out every
last bit of performance from this, you can take the WhitespaceArray and use
the characters as keys in a dictionary. The number of whitespace characters
is 25 (right now, that is). However, if your strings typically are padded
with spaces, then you could get a big speed boost by copying the array
initially, and then placing the space character as the first element in the
array (which would cause most of the calls to IndexOf to return very
quickly, probably quicker than a lookup in a dictionary).
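
A rough sketch of the "space first" ordering described above. The array here is a short hand-picked list used purely for illustration; it is an assumption and not the BCL's internal whitespace table, and the class and method names are hypothetical.

```csharp
using System;

public static class WhitespaceLookup
{
    // Assumption: a short illustrative list, not the BCL's internal table.
    // The space character comes first so the most common case matches
    // on the first comparison inside Array.IndexOf.
    private static readonly char[] Whitespace =
        { ' ', '\t', '\r', '\n', '\v', '\f', '\u00A0' };

    // Returns true when value is null, empty, or contains only characters
    // from the list above.
    public static bool IsBlank(string value)
    {
        if (value == null)
        {
            return true;
        }

        foreach (char c in value)
        {
            // The generic overload avoids boxing each char.
            if (Array.IndexOf<char>(Whitespace, c) == -1)
            {
                return false;
            }
        }

        return true;
    }
}
```

With space-padded input, most lookups stop at index 0, which is the speed boost being suggested; whether it actually beats a dictionary would need measuring.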

I am curious though, are you seeing a performance issue, or do you just
see the numbers and are worried about them? ASP.NET applications tend to
get in a nice groove with the GC over time.

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Nov 17 '05 #2
> 1.) if( someString.ToLower() == "somestring" )

FxCop will actually catch and report instances of this for you. It is my 2nd
favorite tool outside of Visual Studio.

> 2.) if( someString != null && someString.Trim().Length > 0 )

I would recommend using

if (someString != null)
    someString = someString.Trim();
else
    someString = "";

if( someString.Length > 0 )

My assumption here is that you already intend to trim the string before it
is used.

--
Jonathan Allen

Nov 17 '05 #3
Nicholas,

Thanks for the quick reply. Unfortunately I'm not using .NET 2.0 (yet!), so
I can't use Generics.

Would looping over chars like that slow things down significantly? Also, is
the char[] for each string cached with the string, or is a new one created
when you call things like ToCharArray() or foreach() on the string (not
every loop iteration, but on the first iteration)? Wouldn't I just be
replacing a new string instance with a new char[] and not get any net gain
over just calling .Trim()?

In your opinion, if I weren't against unsafe code, could I make this
significantly faster, or would it not afford me much difference?

As far as performance, on some of our clients' instances, memory growth is
rapid. It seems the more memory they have, the faster it grows which leads
me to believe that the GC is being lax since it has so much free memory and
doesn't see the need to aggressively collect memory. But it bothers our
clients and they perceive this to be a memory leak.

I realize it's an education issue, but I want to make sure that I'm
educating them correctly, as opposed to just making up a B.S. excuse and
Jedi hand-waving about the GC stuff.

Also, it's not an ASP.NET application, it's an ASP app that used to call
into VB6 COM objects. We've replaced the VB6 objects with .NET objects
exposing a "compatibility layer" that has a ComVisible API that is identical
(though not binary compatible) with the old VB6 stuff. Late-bound clients
don't know the difference other than a different ProgID for the COM objects.

So we're dealing with the wkst GC, as far as I know (since only ASP.NET uses
svr unless you host the CLR yourself, from what I understand). I'm not sure
how I'd even do that in an ASP/COM-interop situation, but, assuming it's
possible, would writing our own CLR host to use the svr GC help matters at
all?

Most of our clients' servers are dual-or-more processor boxes.

Thanks again,
Chad Myers

Nov 17 '05 #4
Jonathan:

Thanks for your quick response.

Unfortunately, in (2), we're not doing that. In most cases, it's OK to have
padded strings, just not all-whitespace strings.

Regardless, even with your suggestion, the .Trim() still creates a new
string instance and fills the heap with crap :(

Thanks again,
Chad

Nov 17 '05 #5
KH
For the second scenario -- trimming white space -- you could check the first
and last chars to see if they're whitespace and only perform the Trim() if
that condition is true:

string str = " lalala ";

if (Char.IsWhiteSpace(str[0]) || Char.IsWhiteSpace(str[str.Length - 1]))
{
    str = str.Trim();
}

Nov 17 '05 #6
KH
Better yet (I didn't notice this overload before) ...

string str = " lalala ";

// Be sure to check that the string variable is not a null reference
// and its length is at least 1, otherwise you'll get index out of range
// exceptions

if (Char.IsWhiteSpace(str, 0) || Char.IsWhiteSpace(str, str.Length - 1))
{
    str = str.Trim();
}
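
A minimal sketch that folds the null and length checks mentioned in the comment into one helper. ConditionalTrim and TrimIfNeeded are hypothetical names chosen for illustration; the only BCL calls are Char.IsWhiteSpace(string, int) and String.Trim().

```csharp
using System;

public static class ConditionalTrim
{
    // Hypothetical helper: returns the same instance when there is
    // nothing to trim, and a trimmed copy otherwise.
    public static string TrimIfNeeded(string str)
    {
        // Guard against null and empty input before indexing into the string.
        if (str == null || str.Length == 0)
        {
            return str;
        }

        // Only call Trim() when the first or last character is whitespace.
        if (Char.IsWhiteSpace(str, 0) || Char.IsWhiteSpace(str, str.Length - 1))
        {
            return str.Trim();
        }

        return str;
    }
}
```

This only avoids the new string when the input has no leading or trailing whitespace; padded strings still pay for the copy.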

Nov 17 '05 #7
KH,

Hrm, that's a good idea. In your first suggestion, I'm afraid of indexing
the char[] in the strings because I'm not sure when the char[] is created
(does it always tag along with the string, or is it only created the first
time you try to access the char[] -- through the indexer or through a call
to ToCharArray, etc).

But this suggestion just might work...

I'll look into it and let everyone know how it goes.

Thanks again everyone.

Sincerely,
Chad

Nov 17 '05 #8
Chad,

Looping over characters like that can't slow things down that much. No
matter what you do, you will have to perform some sort of loop operation to
check the string. There is no other way to do it.

Also, the char[] that is enumerated through is not created for every
iteration through the string. Rather, the string implements IEnumerable,
and then the IEnumerator implementation returned will return a new char for
each iteration.

I don't think that using unsafe code is going to make things any better,
only because it's going to do the same thing you are going to do, maybe with
one or two operations eliminated in between (and I mean IL operations, not
function calls).

When you call Trim, a loop is going to start from the beginning of the
string, counting the whitespace characters that are at the beginning. Then
it is going to perform another loop to scan the end of the string for
whitespace characters. Once that is done, it will get the substring, which
will have to loop through the characters to copy them into a new string (on
some level or another, a loop is going to execute).
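
The two scans described above can be done without the final substring copy, which gives the "TrimmedLength" that was asked about. This is only a sketch: TrimmedLengthSketch and TrimmedLength are hypothetical names, and Char.IsWhiteSpace is used as a stand-in for whatever whitespace table Trim() uses internally, so the result may not match Trim().Length for every character on every framework version.

```csharp
using System;

public static class TrimmedLengthSketch
{
    // Hypothetical helper: the length Trim() would roughly produce,
    // computed with the string indexer and no new string instance.
    public static int TrimmedLength(string value)
    {
        if (value == null)
        {
            return 0;
        }

        int start = 0;
        int end = value.Length - 1;

        // First loop: skip leading whitespace.
        while (start <= end && Char.IsWhiteSpace(value[start]))
        {
            start++;
        }

        // Second loop: skip trailing whitespace.
        while (end >= start && Char.IsWhiteSpace(value[end]))
        {
            end--;
        }

        // Zero when the string is empty or all whitespace.
        return end - start + 1;
    }
}
```

Check (2) would then become: if( TrimmedLengthSketch.TrimmedLength(someString) > 0 )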

Also, the original issue was the amount of memory that is being consumed
(which in reality is not a leak, but a customer education issue). If the
performance of the application is suffering, it is not because of these
operations. I would look elsewhere. The fact that you are using COM
interop means that for every call you make across that boundary, you are
adding something on the order of 40 extra operations. Depending on how
chunky your calls are, this could be a factor.

In the end, the GC is going to take up as much memory as possible, and
give it up only when the OS tells it (from a high level view). That's part
of what you sign up for when you use .NET. I'd work on educating your
customers to NOT look at task manager in order to determine whether or not
there is a memory leak. Rather, they should look at the performance
counters (many of which exist for .NET), which give a MUCH clearer
performance picture.
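
As a rough in-process complement to the performance counters, here is a sketch that separates churn from retained memory. ChurnCheck and RetainedBytesAfterChurn are hypothetical names; GC.GetTotalMemory(true) asks the runtime to collect before it reports, so the difference approximates what is still held after the temporary strings are gone. Exact figures will vary by runtime and machine.

```csharp
using System;

public static class ChurnCheck
{
    // Hypothetical helper: approximate bytes still in use after creating
    // many short-lived strings with Trim().
    public static long RetainedBytesAfterChurn(int iterations)
    {
        // Passing true requests a collection before the figure is reported.
        long before = GC.GetTotalMemory(true);

        string padded = "   some padded value   ";
        int totalLength = 0;
        for (int i = 0; i < iterations; i++)
        {
            // Each Trim() call here allocates a temporary string.
            totalLength += padded.Trim().Length;
        }

        long after = GC.GetTotalMemory(true);
        return after - before;
    }
}
```

If the returned figure stays small while the process size in Task Manager keeps climbing, that supports the "garbage waiting to be collected" explanation rather than a leak.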

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Nov 17 '05 #9
Chad,

When you use the indexer on the string, it does not create a new
character array representing the whole string. Rather, it just fetches the
character and returns a copy of that single character to the user. A
character array is never created for the return value of an indexer.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Nov 17 '05 #10
> As far as performance, on some of our clients' instances, memory growth is
> rapid. It seems the more memory they have, the faster it grows which leads
> me to believe that the GC is being lax since it has so much free memory
> and doesn't see the need to aggressively collect memory. But it bothers
> our clients and they perceive this to be a memory leak.

If it was an ASP.NET application, you could limit the amount of memory used
before the application recycles itself. However, I don't know if that is an
option for ASP. I think your goal of educating the client is probably the
best bet.

May I suggest using the PerfMon tool to show them how often the GC runs and
its effect on memory.
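
Alongside PerfMon, the process can also log its own GC figures. The sketch below is only an illustration: the GcSnapshot class and its output format are made up, GC.GetTotalMemory and GC.MaxGeneration are standard, and GC.CollectionCount needs .NET 2.0 or later, so on 1.x you would be limited to the first figure.

```csharp
using System;
using System.Text;

public static class GcSnapshot
{
    // Builds a one-line summary of the managed heap size and the number of
    // collections that have run so far for each generation.
    public static string Describe()
    {
        StringBuilder line = new StringBuilder();

        // false: report the current estimate without forcing a collection.
        line.Append("managed bytes=").Append(GC.GetTotalMemory(false));

        for (int gen = 0; gen <= GC.MaxGeneration; gen++)
        {
            line.Append(", gen").Append(gen)
                .Append(" collections=").Append(GC.CollectionCount(gen));
        }

        return line.ToString();
    }
}
```

Logging this line every few minutes should show the heap dropping each time the collection counts tick up, which is the pattern a real leak would not show.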

--
Jonathan Allen
"Chad Myers" <cm****@N0.SP4M.austin.rr.com> wrote in message
news:P3*******************@tornado.texas.rr.com...
Nicholas,

Thanks for the quick reply. Unfortunately I'm not using .NET 2.0 (yet!),
so I can't use Generics.

Would looping over chars like that slow things down significantly? Also,
is the char[] for each string cached with the string, or is a new one
created when you call things like ToCharArray() or foreach() on the string
(not every loop iteration, but on the first iteration)? Wouldn't I just be
replacing a new string instance with a new char[] and not get any net gain
over just calling .Trim()?

In your opinion, if I weren't against unsafe code, could I make this
significantly faster, or would it not afford me much difference?

As far as performance, on some of our clients' instances, memory growth is
rapid. It seems the more memory they have, the faster it grows which leads
me to believe that the GC is being lax since it has so much free memory
and doesn't see the need to aggressively collect memory. But it bothers
our clients and they perceive this to be a memory leak.

I realize it's an education issue, but I want to make sure that I'm
educating them correctly, as opposed to just making up a B.S. excuse and
Jedi hand-waving about the GC stuff.

Also, it's not an ASP.NET application, it's an ASP app that used to call
into VB6 COM objects. We've replaced the VB6 objects with .NET objects
exposing a "compatibility layer" that has a ComVisible API that is
identical (though not binary compatible) with the old VB6 stuff.
Late-bound clients don't know the difference other than a different ProgID
for the COM objects.

So we're dealing with the wkst GC, as far as I know (since only ASP.NET
uses svr unless you host the CLR yourself, from what I understand). I'm
not sure how I'd even do that in an ASP/COM-interop situation, but,
assuming it's possible, would writing our own CLR host to use the svr GC
help matters at all?

Most of our clients' servers are dual-or-more processor boxes.

Thanks again,
Chad Myers

"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote
in message news:Oz*************@TK2MSFTNGP15.phx.gbl...
Chad,

For the first scenario, your solution should give you an increase.
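
For reference, the first scenario might look something like the sketch below. String.Compare with ignoreCase set to true compares the two strings in place instead of allocating a lowered copy. The helper name EqualsIgnoreCase is hypothetical, and the invariant culture is an assumption; on .NET 2.0 or later, String.Equals with StringComparison.OrdinalIgnoreCase would be the more usual choice.

```csharp
using System;
using System.Globalization;

public static class CaseInsensitive
{
    // Hypothetical helper: case-insensitive equality without creating a
    // lower-cased copy of either string. Two nulls compare as equal.
    public static bool EqualsIgnoreCase(string a, string b)
    {
        return String.Compare(a, b, true, CultureInfo.InvariantCulture) == 0;
    }
}
```

So `someString.ToLower() == "somestring"` would become `CaseInsensitive.EqualsIgnoreCase(someString, "somestring")`, which also no longer throws when someString is null.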

For the second scenario, you should use reflection once to get a
reference to the internal static character array WhitespaceChars on the
string class. Then, you can write a method which will cycle through a
string passed to it, like so:

// WhitespaceArray is assumed to be a static char[] field, filled once (via
// reflection) from the internal WhitespaceChars array on the string class.
public static bool TrimIsNullOrEmpty(string value)
{
    // If null, then get out.
    if (value == null)
    {
        // Return true.
        return true;
    }

    // Cycle through the characters in the string. If the character is not found
    // in the whitespace array, return false, otherwise, when done, return true.
    foreach (char c in value)
    {
        // If the character is not found in the WhitespaceArray, then return
        // false.
        if (Array.IndexOf<char>(WhitespaceArray, c) == -1)
        {
            // Return false.
            return false;
        }
    }

    // Return true, the string is full of whitespace.
    return true;
}

I used the generic version of the IndexOf method on the Array class in
order to eliminate boxing. Also, if you really want to squeeze out every
last bit of performance from this, you can take the WhitespaceArray and
use the characters as keys in a dictionary. The number of whitespace
characters is 25 (right now, that is). However, if your strings
typically are padded with spaces, then you could get a big speed boost by
copying the array initially, and then placing the space character as the
first element in the array (which would cause most of the calls to
IndexOf to return very quickly, probably quicker than a lookup in a
dictionary).
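
The reordering idea could be sketched roughly as follows. WithSpaceFirst is a hypothetical name, and the method works on whatever whitespace array is passed in rather than on the internal one, since that field is an implementation detail that may not exist in every framework version.

```csharp
using System;

public static class WhitespaceOrder
{
    // Returns a copy of the given array with the space character swapped into
    // the first slot, so a linear IndexOf finds ' ' on the first comparison.
    // The caller's array is left untouched.
    public static char[] WithSpaceFirst(char[] whitespace)
    {
        char[] copy = (char[])whitespace.Clone();
        int index = Array.IndexOf(copy, ' ');

        // Swap only when the space exists and is not already first.
        if (index > 0)
        {
            copy[index] = copy[0];
            copy[0] = ' ';
        }

        return copy;
    }
}
```

The copy would be made once at startup and then used in place of the original array in the lookup.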

I am curious though, are you seeing a performance issue, or do you
just see the numbers and are worried about them? ASP.NET applications
tend to get in a nice groove with the GC over time.

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com



Nov 17 '05 #11
Nicholas,

Looping: I thought looping over arrays in managed code was "slow"
(relatively speaking) because of all the bounds checking and whatnot. This
is why people use unsafe code now and then to use pointer arithmetic to loop
over arrays without all the unnecessary bounds checking.

I'm aware that looping has to occur one way or the other, but with Trim(),
the looping is happening in unmanaged code (COMString::TrimHelper to be
exact) and is much faster since it isn't required to do all the bloated .NET
array handling and such.

The problem with TrimHelper is that it always returns a new string instance.
There's no way to say "Trim and tell me what the length is when you're done,
don't return the trimmed string".
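
Something along these lines might fill that gap. It is only a sketch: TrimmedLength is a hypothetical helper, not a BCL method, and it relies on Char.IsWhiteSpace, which may not match Trim()'s whitespace set exactly on every framework version.

```csharp
using System;

public static class TrimInfo
{
    // Hypothetical helper: the length that value.Trim() would have, computed
    // by scanning inward from both ends without building the trimmed string.
    // Returns 0 for null.
    public static int TrimmedLength(string value)
    {
        if (value == null)
        {
            return 0;
        }

        int start = 0;
        int end = value.Length - 1;

        // Skip leading whitespace.
        while (start <= end && Char.IsWhiteSpace(value, start))
        {
            start++;
        }

        // Skip trailing whitespace.
        while (end >= start && Char.IsWhiteSpace(value, end))
        {
            end--;
        }

        return end - start + 1;
    }
}
```

The original test then becomes `TrimInfo.TrimmedLength(someString) > 0`, with the null check folded in.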

There are no performance issues that I'm aware of (yet). The concern is
rapid memory growth. The customer perceives this as a memory leak. I'm 99%
sure that it's just the GC being lazy/stand-offish until there's something
to worry about (we approach the 2GB limit of a process), but I wanted to
double-check before I unknowingly fed a line of B.S. to the customer.

The customer will eventually learn to understand this and accept it, but I
wanted to make sure that I understood it completely.

We have done extensive performance counting, so we're well aware of the
memory picture. I have followed numerous profiling guides and established
that the majority of allocations are of System.String's and that the
majority of the process's memory is being taken up with the Gen2 heap. This
is what concerns me. My understanding is that you have to survive several
successive garbage collections in order to make it to the Gen2 heap. How are
temporary strings making it to Gen2?

That's the only reason for the 1% of doubt I have left.

Thanks again,
Chad

P.S.- the previous COM stuff was written in VB6 and was STA. We had
customers running on our legacy stuff (it was written before I got here,
don't blame me! ;) ) with many concurrent users. Some were starting to run
into the STA limitations, but, for the most part, it was running
surprisingly (and I mean surprisingly!) well. Eventually some hit a wall.

When our .NET rewrite/rearchitecture was finished, we wrote a COM
facade/compatibility layer so that existing COM- or ASP/SCRIPT-based code
could run against the new stuff.

We saw several orders of magnitude difference in performance, even with the
COM interop overhead, not to mention the now highly scalable multi-threaded
interface. Score +1 for .NET, again :)
"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote in
message news:ef**************@TK2MSFTNGP14.phx.gbl...
Chad,

Looping over characters like that can't slow things down that much. No
matter what you do, you will have to perform some sort of loop operation
to check the string. There is no other way to do it.

Also, the char[] that is enumerated through is not created for every
iteration through the string. Rather, the string implements IEnumerable,
and then the IEnumerator implementation returned will return a new char
for each iteration.

I don't think that using unsafe code is going to make things any
better, only because it's going to do the same thing you are going to do,
maybe with one or two operations eliminated in between (and I mean IL
operations, not function calls).

When you call Trim, a loop is going to start from the beginning of the
string, counting the whitespace characters that are at the beginning.
Then it is going to perform another loop to scan the end of the string for
whitespace characters. Once that is done, it will get the substring,
which will have to loop through the characters to copy them into a new
string (on some level or another, a loop is going to execute).

Also, the original issue was the amount of memory that is being
consumed (which in reality, it is not, but it is a customer education
issue). If the performance of the application is suffering, it is not
because of these operations. I would look elsewhere. The fact that you
are using COM interop means that for every call you make across that
boundary, you are adding something on the order of 40 extra operations.
Depending on how chunky your calls are, this could be a factor.

In the end, the GC is going to take up as much memory as possible, and
give it up only when the OS tells it (from a high level view). That's
part of what you sign up for when you use .NET. I'd work on educating
your customers to NOT look at task manager in order to determine whether
or not there is a memory leak. Rather, they should look at the
performance counters (many of which exist for .NET) which give a MUCH more
clear performance picture.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com



Nov 17 '05 #12
KH
Array looping: The CLR has an optimization to remove bounds checking under
certain conditions, mainly your basic for loop using Array.Length as the
condition:

for (int i=0; i < myarray.Length; ++i)
{
// presumably playing with the indexer or re-assigning the array variable
// here would disable the optimization, but I don't really know.
}

Anyways I don't know what your app is but it must be mighty big to be so
worried about string performance. It's usually over-use of strings that
causes problems, like building strings by conditionally concatenating them,
stuff like that that people don't realize causes a new instance of string to
be created with EACH operation:

string str1 = " ABC";
string str2 = "DEF ";
string str3 = (str1 + str2).ToLower().Trim(); // 3 Strings created here

If that's the real issue you might look into the StringBuilder class, which
is mutable unlike String.
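
As a rough illustration of the StringBuilder approach (the Joiner class and JoinParts method are invented for this example): the parts are appended into one growable buffer and a single string is created at the end, instead of one intermediate string per concatenation.

```csharp
using System;
using System.Text;

public static class Joiner
{
    // Appends every part into one buffer, separated by the given separator,
    // and creates a single string when the loop is done.
    public static string JoinParts(string[] parts, string separator)
    {
        StringBuilder builder = new StringBuilder();

        for (int i = 0; i < parts.Length; i++)
        {
            if (i > 0)
            {
                builder.Append(separator);
            }

            builder.Append(parts[i]);
        }

        return builder.ToString();
    }
}
```

The builder still allocates as its internal buffer grows, so this reduces garbage rather than eliminating it.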

- KH



Nov 17 '05 #13

Array bound checking in a loop is optimized such that the bounds are
only verified once outside the loop when the JIT knows for sure that
every array index within the loop will be valid. This is the case
when looping over elements in an array (a built-in array, not a
collection) and only using the loop variable to index into the array
and not another variable or calculation.

Sam
On Wed, 01 Jun 2005 21:35:01 GMT, "Chad Myers"
<cm****@N0.SP4M.austin.rr.com> wrote:
Nicholas,

Looping: I thought looping over arrays in managed code was "slow"
(relatively speaking) because of all the bounds checking and whatnot. This
is why people use unsafe code now and then to use pointer arithmetic to loop
over arrays without all the unnecessary bounds checking.


Nov 17 '05 #14
> Looping: I thought looping over arrays in managed code was "slow"
> (relatively speaking) because of all the bounds checking and whatnot. This
> is why people use unsafe code now and then to use pointer arithmetic to
> loop over arrays without all the unnecessary bounds checking.

That's not necessarily true.

for (int i = 0; i < arr.Length; i++)
{
    sum += arr[i];
}

The optimizer will recognize this pattern and not perform the array bound
checks. That said, the chances of the array bound check being significant are
very low. On the other hand, the chances of messing this up big time are
great. Even greater are the chances of messing up the real performance
improvements that the compiler can do. In a supercomputing class, we saw
that the following unrolled loop can be faster on some systems. The CLR knows
this and uses it when appropriate.

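int i = 0;
// Four elements per iteration while at least four remain.
for (; i + 3 < arr.Length; i += 4)
{
    sum += arr[i];
    sum += arr[i + 1];
    sum += arr[i + 2];
    sum += arr[i + 3];
}
// Pick up the 0-3 elements left over when Length is not a multiple of 4.
for (; i < arr.Length; i++)
{
    sum += arr[i];
}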
--
Jonathan Allen

"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote
in message news:ef**************@TK2MSFTNGP14.phx.gbl...
Chad,

Looping over characters like that can't slow things down that much.
No matter what you do, you will have to perform some sort of loop
operation to check the string. There is no other way to do it.

Also, the char[] that is enumerated through is not created for every
iteration through the string. Rather, the string implements IEnumerable,
and then the IEnumerator implementation returned will return a new char
for each iteration.

I don't think that using unsafe code is going to make things any
better, only because it's going to do the same thing you are going to do,
maybe with one or two operations eliminated in between (and I mean IL
operations, not function calls).

When you call Trim, a loop is going to start from the beginning of the
string, counting the whitespace characters that are at the beginning.
Then it is going to perform another loop to scan the end of the string
for whitespace characters. Once that is done, it will get the substring,
which will have to loop through the characters to copy them into a new
string (on some level or another, a loop is going to execute).

Also, the original issue was the amount of memory that is being
consumed (which in reality, it is not, but it is a customer education
issue). If the performance of the application is suffering, it is not
because of these operations. I would look elsewhere. The fact that you
are using COM interop means that for every call you make across that
boundary, you are adding something on the order of 40 extra operations.
Depending on how chunky your calls are, this could be a factor.

In the end, the GC is going to take up as much memory as possible, and
give it up only when the OS tells it (from a high level view). That's
part of what you sign up for when you use .NET. I'd work on educating
your customers to NOT look at task manager in order to determine whether
or not there is a memory leak. Rather, they should look at the
performance counters (many of which exist for .NET) which give a MUCH
more clear performance picture.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"Chad Myers" <cm****@N0.SP4M.austin.rr.com> wrote in message
news:P3*******************@tornado.texas.rr.com...
Nicholas,

Thanks for the quick reply. Unfortunately I'm not using .NET 2.0 (yet!),
so I can't use Generics.

Would looping over chars like that slow things down significantly? Also,
is the char[] for each string cached with the string, or is a new one
created when you call things like ToCharArray() or foreach() on the
string (not every loop iteration, but on the first iteration)? Wouldn't
I just be replacing a new string instance with a new char[] and not get
any net gain over just calling .Trim()?

In your opinion, if I weren't against unsafe code, could I make this
significantly faster, or would it not afford me much difference?

As far as performance, on some of our clients' instances, memory growth
is rapid. It seems the more memory they have, the faster it grows which
leads me to believe that the GC is being lax since it has so much free
memory and doesn't see the need to aggressively collect memory. But it
bothers our clients and they perceive this to be a memory leak.

I realize it's an education issue, but I want to make sure that I'm
educating them correctly, as opposed to just making up a B.S. excuse and
Jedi hand-waving about the GC stuff.

Also, it's not an ASP.NET application, it's an ASP app that used to call
into VB6 COM objects. We've replaced the VB6 objects with .NET objects
exposing a "compatibility layer" that has a ComVisible API that is
identical (though not binary compatible) with the old VB6 stuff.
Late-bound clients don't know the difference other than a different
ProgID for the COM objects.

So we're dealing with the wkst GC, as far as I know (since only ASP.NET
uses svr unless you host the CLR yourself, from what I understand). I'm
not sure how I'd even do that in an ASP/COM-interop situation, but,
assuming it's possible, would writing our own CLR host to use the svr GC
help matters at all?

Most of our clients' servers are dual-or-more processor boxes.

Thanks again,
Chad Myers

"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote
in message news:Oz*************@TK2MSFTNGP15.phx.gbl...
Chad,

For the first scenario, your solution should give you an increase.
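
As a minimal sketch of that first scenario (the class and method names here are only illustrative, not from the original code), a case-insensitive check can go through String.Compare so that no lower-cased copy is allocated. The invariant culture is an assumption; pick whichever culture matches the data being compared.

```csharp
using System;
using System.Globalization;

public static class CaseInsensitiveCheck
{
    // Hypothetical helper: compares two strings ignoring case without
    // creating a new string. This String.Compare overload is available
    // on .NET 1.x as well as later versions.
    public static bool EqualsIgnoreCase(string left, string right)
    {
        // InvariantCulture is an assumption made for this sketch.
        return String.Compare(left, right, true, CultureInfo.InvariantCulture) == 0;
    }
}
```

Unlike someString.ToLower(), this form also tolerates a null argument instead of throwing. On .NET 2.0 and later, String.Equals with StringComparison.OrdinalIgnoreCase is another option.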

For the second scenario, you should use reflection once to get a
reference to the internal static character array WhitespaceChars on the
string class. Then, you can write a method which will cycle through a
string passed to it, like so:

public static bool TrimIsNullOrEmpty(string value)
{
    // If null, then get out.
    if (value == null)
    {
        return true;
    }

    // Cycle through the characters in the string. If a character is not
    // found in the whitespace array, return false; otherwise, when done,
    // return true.
    foreach (char c in value)
    {
        // WhitespaceArray is the copy of String.WhitespaceChars obtained
        // once through reflection, as described above.
        if (Array.IndexOf<char>(WhitespaceArray, c) == -1)
        {
            return false;
        }
    }

    // The string is empty or consists entirely of whitespace.
    return true;
}

I used the generic version of the IndexOf method on the Array class
in order to eliminate boxing. Also, if you really want to squeeze out
every last bit of performance from this, you can take the
WhitespaceArray and use the characters as keys in a dictionary. The
number of whitespace characters is 25 (right now, that is). However,
if your strings typically are padded with spaces, then you could get a
big speed boost by copying the array initially, and then placing the
space character as the first element in the array (which would cause
most of the calls to IndexOf to return very quickly, probably quicker
than a lookup in a dictionary).

I am curious though, are you seeing a performance issue, or do you
just see the numbers and are worried about them? ASP.NET applications
tend to get in a nice groove with the GC over time.

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"Chad Myers" <cm****@N0.SP4M.austin.rr.com> wrote in message
news:2r******************@tornado.texas.rr.com...
> I've been perf testing an application of mine and I've noticed that
> there are a lot (and I mean A LOT -- megabytes and megabytes of 'em)
> System.String instances being created.
>
> I've done some analysis and I'm led to believe (but can't yet
> quantitatively establish as fact) that the two basic culprits are a
> lot of calls to:
>
> 1.) if( someString.ToLower() == "somestring" )
>
> and
>
> 2.) if( someString != null && someString.Trim().Length > 0 )
>
>
> ToLower() generates a new string instance as does Trim().
>
> I believe that these are getting called many times and churning up a
> bunch of strings faster than the GC can collect them, or perhaps
> there's some weird interning/caching thing going on. Regardless, the
> number of string instances grows and grows. It gets bumped down
> occasionally, but it's basically 5 steps forward, 1 back.
>
> For reference, this is an ASP application calling into .NET ComVisible
> objects. So I assume this uses the workstation GC, right?
>
>
> Anyhow, so I think that I can solve problem (1) with String.Compare()
> which can perform in-place case-insensitive comparisons without
> generating new string instances.
>
> Problem (2), however, is more complicated. There doesn't appear to be
> a TrimmedLength or any type of method or property that can give me the
> length of a string, minus whitespace and without generating a new
> string instance, in the BCL.
>
> I suppose I could do some unsafe, or even unmanaged code (which is
> what MSFT did for all their string handling stuff inside System.String
> and using the COMString stuff), but I'd like to try to avoid that, or
> at least use a library that's already written and well tested.
>
> Any thoughts?
>
> Thanks in advance,
> Chad Myers
>
>



Nov 17 '05 #15
It is a very big app. And there are two main reasons for string usage:

1.) Much frequently-used data is cached in memory. This accounts for a
large, static memory block. However, due to legacy conditions, when we do
cache lookups from the legacy apps, they may pass in strings with whitespace
padding and such, so we have to trim before looking in the Hashtable.
Unfortunately, until we can completely get rid of the legacy apps, we have
to deal with the trimming.

2.) Since it's a CRM app, there are lots and lots and lots of strings.
Almost everything is string data... customer names, phone numbers,
addresses, customer interaction logs, phone logs, email logs, etc, etc, etc.
The vast majority (I'd say 70+%) of the data in the database (and thus data
that gets passed around through our app) are string data.

There isn't a lot of concatenation (well, maybe on the ASP side, but that
has no effect on .NET gen2 heap size) and we're cautious about using
StringBuilders and such.

-c
"KH" <KH@discussions.microsoft.com> wrote in message
news:C6**********************************@microsoft.com...
Array looping: The CLR has an optimization to remove bounds checking under
certain conditions, mainly your basic for loop using Array.Length as the
condition:

for (int i = 0; i < myarray.Length; ++i)
{
    // presumably playing with the indexer or re-assigning the array variable
    // here would disable the optimization, but I don't really know.
}

Anyways I don't know what your app is but it must be mighty big to be so
worried about string performance. It's usually over-use of strings that
causes problems, like building strings by conditionally concatenating them,
stuff like that that people don't realize causes a new instance of string to
be created with EACH operation:

string str1 = " ABC";
string str2 = "DEF ";
string str3 = (str1 + str2).ToLower().Trim(); // 3 Strings created here

If that's the real issue you might look into the StringBuilder class, which
is mutable unlike String.
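
A rough sketch of the StringBuilder approach, under the assumption that the strings are being assembled in a loop (the class and method names are made up for illustration):

```csharp
using System;
using System.Text;

public static class JoinExample
{
    // Hypothetical helper: appends every part to one growing buffer
    // instead of creating a new string for each concatenation.
    public static string JoinWithCommas(string[] parts)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.Length; i++)
        {
            if (i > 0)
            {
                sb.Append(',');
            }
            sb.Append(parts[i]);
        }

        // Only this call produces the final string instance.
        return sb.ToString();
    }
}
```

The intermediate strings that repeated + would create never exist here; the only string allocated is the one returned by ToString().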

- KH

Nov 17 '05 #16
To all who posted on the array bounds: Thank you!

That helps a lot. I'll try the IsWhiteSpace as well as the array loops that
Nicholas suggested. I'm sure I'll find that one way works better in some
situations, and the other works better in others.
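
For reference, a minimal sketch of the IsWhiteSpace variant that works without generics (the names StringScan, TrimmedLength and IsNullOrBlank are only illustrative). It assumes Char.IsWhiteSpace is an acceptable stand-in for the character set Trim() uses; the two sets may not match exactly on every framework version.

```csharp
using System;

public static class StringScan
{
    // Returns the length the string would have after trimming leading and
    // trailing whitespace, without allocating the trimmed copy.
    // A null string is treated as length 0.
    public static int TrimmedLength(string value)
    {
        if (value == null)
        {
            return 0;
        }

        int start = 0;
        int end = value.Length - 1;

        // Skip leading whitespace.
        while (start <= end && Char.IsWhiteSpace(value[start]))
        {
            start++;
        }

        // Skip trailing whitespace.
        while (end >= start && Char.IsWhiteSpace(value[end]))
        {
            end--;
        }

        return end - start + 1;
    }

    // Allocation-free counterpart of:
    // value == null || value.Trim().Length == 0
    public static bool IsNullOrBlank(string value)
    {
        return TrimmedLength(value) == 0;
    }
}
```

The original check then becomes if( !StringScan.IsNullOrBlank(someString) ), and no temporary string is created per call.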

Thanks to all,
Chad

"Samuel R. Neff" <in**********@newsgroup.nospam> wrote in message
news:cu********************************@4ax.com...

Array bound checking in a loop is optimized such that the bounds are
only verified once outside the loop when the JIT knows for sure that
every array index within the loop will be valid. This is the case
when looping over elements in an array (a built-in array, not a
collection) and only using the loop variable to index into the array
and not another variable or calculation.

Sam
On Wed, 01 Jun 2005 21:35:01 GMT, "Chad Myers"
<cm****@N0.SP4M.austin.rr.com> wrote:
Nicholas,

Looping: I thought looping over arrays in managed code was "slow"
(relatively speaking) because of all the bounds checking and whatnot. This
is why people use unsafe code now and then to use pointer arithmetic to loop
over arrays without all the unnecessary bounds checking.

Nov 17 '05 #17
Chad Myers <cm****@N0.SP4M.austin.rr.com> wrote:
Looping: I thought looping over arrays in managed code was "slow"
(relatively speaking) because of all the bounds checking and whatnot. This
is why people use unsafe code now and then to use pointer arithmetic to loop
over arrays without all the unnecessary bounds checking.


It's not particularly slow, but it's actually irrelevant to what
Nicholas was suggesting - no arrays are created when you either use the
indexer or foreach.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #18
