Brian Kitt <Br*******@discussions.microsoft.com> wrote:
I have a process where I do some minimal reformating on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?
ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);
ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.
Okay, there are some problems here:
1) By using the static Regex.Replace method, you're making the
framework parse your pattern every time you call it.
2) You're also not using RegexOptions.Compile, which can help
performance
3) You don't need to use a regular expression at all.
Here's a test program:
using System;
using System.Text;
using System.Text.RegularExpressions;
public class Test
{
// Number of strings
const int Size = 1000;
// Length of string
const int Length = 100;
// How often to include a non-alphanumeric character
const double NonAlphaProportion = 0.01;
// How many iterations to run
const int Iterations = 1000;
// Non-alphanumeric characters to pick from (at random)
static readonly char[] NonAlphaChars =
".,!\"$%^&*()_+-=".ToCharArray();
// Alphanumeric characters to pick from (at random)
static readonly char[] AlphaChars =
("ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
"abcedfghijklmnopqrstuvwxyz"+
"0123456789 ").ToCharArray();
static void Main()
{
string[] strings = GenerateTestData();
int total=0;
DateTime start = DateTime.Now;
for (int i=0; i < Iterations; i++)
{
foreach (string x in strings)
{
total += RemoveNonAlpha1(x).Length;
}
}
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
Console.WriteLine ("Total length: {0}", total);
}
static string RemoveNonAlpha1(string x)
{
return Regex.Replace(x, "[^a-zA-Z0-9 ]", string.Empty);
}
static Regex regex = new Regex("[^a-zA-Z0-9 ]",
RegexOptions.Compiled);
static string RemoveNonAlpha2(string x)
{
return regex.Replace(x, string.Empty);
}
static string RemoveNonAlpha3(string x)
{
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}
static string RemoveNonAlpha4(string x)
{
bool foundNonAlpha = false;
foreach (char c in x)
{
if (!((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
foundNonAlpha = true;
break;
}
}
if (!foundNonAlpha)
{
return x;
}
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}
static string[] GenerateTestData()
{
Random random = new Random(0);
string[] ret = new string[Size];
for (int i=0; i < Size; i++)
{
StringBuilder builder = new StringBuilder(Length);
for (int j=0; j < Length; j++)
{
char[] selection;
if (random.NextDouble() < NonAlphaProportion)
{
selection = NonAlphaChars;
}
else
{
selection = AlphaChars;
}
builder.Append (
selection[random.Next(selection.Length)]);
}
ret[i] = builder.ToString();
}
return ret;
}
}
Here, the first version is what you've currently got.
The second version is a version using a cached, compiled regular
expression instance.
The third version goes through each character in the string in a hard-
coded manner, and appends each alpha-numeric character to a
StringBuilder.
The fourth version checks whether or not there's anything to trim
before even creating the StringBuilder.
Now, with the sample data from above, here's the timing on my machine:
1) 40 seconds
2) 10 seconds
3) 3.73 seconds
4) 3 seconds
Now, here's the timing when the proportion of non-alphanumeric
characters is changed to 5% instead of 1%:
1) 44 seconds
2) 12 seconds
3) 3.8 seconds
4) 3.86 seconds
Finally, when the proportion is changed to 0.1%:
1) 37 seconds
2) 9 seconds
3) 3.7 seconds
4) 1.3 seconds
So, as you can see, the performance depends on the data. However, I
would actually suggest you go for number 2 unless it's absolutely
performance critical. A regular expression is the simplest way of
expressing what you're after in this case, and the performance is a lot
better than with the first case.
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog:
http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too