By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
440,152 Members | 1,837 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 440,152 IT Pros & Developers. It's quick & easy.

Cleaning data - performance issue

P: n/a
I have developed a data-cleaner that extracts some data from a database,
cleans it for illegal/unwanted data and writes it to a CSV-fil for later
insertion to a SQL Server 2000 database. My problem is that it performs
like an old, limb man :o(

The method is:

public static StringBuilder RemoveChars(StringBuilder dataToClean_, string[] illegalChars_)
{
// only try to remove chars if there is data to clean
if (dataToClean_.Length > 0)
{
foreach (string s in illegalChars_)
{
MatchCollection reg = Regex.Matches(s, @"\\u([0-9A-F]{4})");
for(int i = 0; i < reg.Count; i++)
{
dataToClean_.Replace((char)int.Parse(reg[i].Groups[1].Value, NumberStyles.HexNumber), ' ');
}
}
}
return dataToClean_;
}

The illegal chars is defined in a config-file and is used as a
string array. They are defined in the config file as

\u0000;
\u0009;
\u000A;

The config file is read using EnterpriseLibrary.

The problem is not that it takes some time to use this method - the
problem is that the timespan increases as the method is called which
indicates to me that there might be some string-issues that I have
not taken care of.

So my question is roughly:

How do I most efficiently clean a string for unwanted chars?
Should I work on the individual bytes instead of using a
StringBuilder? The system creates roughly 1Gb of CSV-files
as it performs its largest job, so we really need to be able
to clean this amount of data most efficiently.

Any help will be greatly appreciated.

:o)

--
Jesper Stocholm
http://stocholm.dk
Apr 2 '06 #1
Share this Question
Share on Google+
3 Replies


P: n/a
Jesper Stocholm <j@stocholm.invalid> wrote:
I have developed a data-cleaner that extracts some data from a database,
cleans it for illegal/unwanted data and writes it to a CSV-fil for later
insertion to a SQL Server 2000 database. My problem is that it performs
like an old, limb man :o(

The method is:

public static StringBuilder RemoveChars(StringBuilder dataToClean_, string[] illegalChars_)
{
// only try to remove chars if there is data to clean
if (dataToClean_.Length > 0)
{
foreach (string s in illegalChars_)
{
MatchCollection reg = Regex.Matches(s, @"\\u([0-9A-F]{4})");
for(int i = 0; i < reg.Count; i++)
{
dataToClean_.Replace((char)int.Parse(reg[i].
Groups[1].Value, NumberStyles.HexNumber), ' ');
}
}
}
return dataToClean_;
}

The illegal chars is defined in a config-file and is used as a
string array. They are defined in the config file as

\u0000;
\u0009;
\u000A;


It strikes me that the principle problem here is that you're parsing
the illegal characters every time you call the method. You should have
one method which converts the list of illegal characters from a string
array into a char array, and then you can reuse that char array each
time you call the method.

I expect that will be *much* faster than using regular expressions on
each iteration.

One of the key things to spot here is that you're doing the same work
every time the method is called - you're matching the same strings with
the same regular expressions each time. Any time you're looking for
performance gains and you find yourself duplicating effort, that's
somewhere to start.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 2 '06 #2

P: n/a
Another thing here, an internal RegEx object is being created through
every iteration of the loop. The OP would see much better performance by
creating one RegEx instance and setting the options on it to do a
pre-compile of the expression. The performance would probably increase
dramatically.

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
Jesper Stocholm <j@stocholm.invalid> wrote:
I have developed a data-cleaner that extracts some data from a database,
cleans it for illegal/unwanted data and writes it to a CSV-fil for later
insertion to a SQL Server 2000 database. My problem is that it performs
like an old, limb man :o(

The method is:

public static StringBuilder RemoveChars(StringBuilder dataToClean_,
string[] illegalChars_)
{
// only try to remove chars if there is data to clean
if (dataToClean_.Length > 0)
{
foreach (string s in illegalChars_)
{
MatchCollection reg = Regex.Matches(s, @"\\u([0-9A-F]{4})");
for(int i = 0; i < reg.Count; i++)
{
dataToClean_.Replace((char)int.Parse(reg[i].
Groups[1].Value, NumberStyles.HexNumber), ' ');
}
}
}
return dataToClean_;
}

The illegal chars is defined in a config-file and is used as a
string array. They are defined in the config file as

\u0000;
\u0009;
\u000A;


It strikes me that the principle problem here is that you're parsing
the illegal characters every time you call the method. You should have
one method which converts the list of illegal characters from a string
array into a char array, and then you can reuse that char array each
time you call the method.

I expect that will be *much* faster than using regular expressions on
each iteration.

One of the key things to spot here is that you're doing the same work
every time the method is called - you're matching the same strings with
the same regular expressions each time. Any time you're looking for
performance gains and you find yourself duplicating effort, that's
somewhere to start.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Apr 2 '06 #3

P: n/a
"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote
in news:Op**************@TK2MSFTNGP12.phx.gbl:
Another thing here, an internal RegEx object is being created
through
every iteration of the loop. The OP would see much better performance
by creating one RegEx instance and setting the options on it to do a
pre-compile of the expression. The performance would probably
increase dramatically.


Hi guys,

I did as suggested and moved the parsing of chars outside the method itself
and I now pass this as a parameter. Also I have skipped the check on
StringBuilder.Length, since it was basically not needed.

The method now works as intended and memory consumption is moderate.

Thanks for your input.

:o)

--
Jesper Stocholm
http://stocholm.dk
Hvor køber du slik, cola eller smøger online?
Send linket til mig via http://ekiosk.dk
Apr 4 '06 #4

This discussion thread is closed

Replies have been disabled for this discussion.