472,808 Members | 3,058 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,808 software developers and data experts.

Regular Expression taking excessive CPU

I have a process where I do some minimal reformating on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?

ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);

ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.
Jan 3 '06 #1
2 2955
Brian Kitt <Br*******@discussions.microsoft.com> wrote:
I have a process where I do some minimal reformating on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?

ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);

ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.


Okay, there are some problems here:

1) By using the static Regex.Replace method, you're making the
framework parse your pattern every time you call it.

2) You're also not using RegexOptions.Compile, which can help
performance

3) You don't need to use a regular expression at all.

Here's a test program:

using System;
using System.Text;
using System.Text.RegularExpressions;

public class Test
{
// Number of strings
const int Size = 1000;
// Length of string
const int Length = 100;
// How often to include a non-alphanumeric character
const double NonAlphaProportion = 0.01;

// How many iterations to run
const int Iterations = 1000;

// Non-alphanumeric characters to pick from (at random)
static readonly char[] NonAlphaChars =
".,!\"$%^&*()_+-=".ToCharArray();
// Alphanumeric characters to pick from (at random)
static readonly char[] AlphaChars =
("ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
"abcedfghijklmnopqrstuvwxyz"+
"0123456789 ").ToCharArray();

static void Main()
{
string[] strings = GenerateTestData();

int total=0;
DateTime start = DateTime.Now;
for (int i=0; i < Iterations; i++)
{
foreach (string x in strings)
{
total += RemoveNonAlpha1(x).Length;
}
}
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
Console.WriteLine ("Total length: {0}", total);
}

static string RemoveNonAlpha1(string x)
{
return Regex.Replace(x, "[^a-zA-Z0-9 ]", string.Empty);
}

static Regex regex = new Regex("[^a-zA-Z0-9 ]",
RegexOptions.Compiled);
static string RemoveNonAlpha2(string x)
{
return regex.Replace(x, string.Empty);
}

static string RemoveNonAlpha3(string x)
{
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string RemoveNonAlpha4(string x)
{
bool foundNonAlpha = false;
foreach (char c in x)
{
if (!((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
foundNonAlpha = true;
break;
}
}
if (!foundNonAlpha)
{
return x;
}
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string[] GenerateTestData()
{
Random random = new Random(0);
string[] ret = new string[Size];

for (int i=0; i < Size; i++)
{
StringBuilder builder = new StringBuilder(Length);
for (int j=0; j < Length; j++)
{
char[] selection;

if (random.NextDouble() < NonAlphaProportion)
{
selection = NonAlphaChars;
}
else
{
selection = AlphaChars;
}
builder.Append (
selection[random.Next(selection.Length)]);
}
ret[i] = builder.ToString();
}
return ret;
}
}

Here, the first version is what you've currently got.

The second version is a version using a cached, compiled regular
expression instance.

The third version goes through each character in the string in a hard-
coded manner, and appends each alpha-numeric character to a
StringBuilder.

The fourth version checks whether or not there's anything to trim
before even creating the StringBuilder.

Now, with the sample data from above, here's the timing on my machine:

1) 40 seconds
2) 10 seconds
3) 3.73 seconds
4) 3 seconds
Now, here's the timing when the proportion of non-alphanumeric
characters is changed to 5% instead of 1%:

1) 44 seconds
2) 12 seconds
3) 3.8 seconds
4) 3.86 seconds

Finally, when the proportion is changed to 0.1%:

1) 37 seconds
2) 9 seconds
3) 3.7 seconds
4) 1.3 seconds

So, as you can see, the performance depends on the data. However, I
would actually suggest you go for number 2 unless it's absolutely
performance critical. A regular expression is the simplest way of
expressing what you're after in this case, and the performance is a lot
better than with the first case.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Jan 3 '06 #2
Wow, thanks. That was an extremely more helpful explantion that I had even
hoped for. This process is not entirely CPU critical, it's just parsing a
series of files and preparing them for DTS, but it used to run in about an
hour, and is now taking nearly 5 hours. I expect to be processing much more
data in the future, so more than anything, it's just irritating that it takes
so long to process. I will change to option #2, that should be sufficient
for what I'm doing. I agree about the regular expression, in the "Old Days"
I would have hard coded it much like you did in Option 3 or 4, but I like to
use the tools we have to make our life simpler.

"Jon Skeet [C# MVP]" wrote:
Brian Kitt <Br*******@discussions.microsoft.com> wrote:
I have a process where I do some minimal reformating on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?

ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);

ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.


Okay, there are some problems here:

1) By using the static Regex.Replace method, you're making the
framework parse your pattern every time you call it.

2) You're also not using RegexOptions.Compile, which can help
performance

3) You don't need to use a regular expression at all.

Here's a test program:

using System;
using System.Text;
using System.Text.RegularExpressions;

public class Test
{
// Number of strings
const int Size = 1000;
// Length of string
const int Length = 100;
// How often to include a non-alphanumeric character
const double NonAlphaProportion = 0.01;

// How many iterations to run
const int Iterations = 1000;

// Non-alphanumeric characters to pick from (at random)
static readonly char[] NonAlphaChars =
".,!\"$%^&*()_+-=".ToCharArray();
// Alphanumeric characters to pick from (at random)
static readonly char[] AlphaChars =
("ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
"abcedfghijklmnopqrstuvwxyz"+
"0123456789 ").ToCharArray();

static void Main()
{
string[] strings = GenerateTestData();

int total=0;
DateTime start = DateTime.Now;
for (int i=0; i < Iterations; i++)
{
foreach (string x in strings)
{
total += RemoveNonAlpha1(x).Length;
}
}
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
Console.WriteLine ("Total length: {0}", total);
}

static string RemoveNonAlpha1(string x)
{
return Regex.Replace(x, "[^a-zA-Z0-9 ]", string.Empty);
}

static Regex regex = new Regex("[^a-zA-Z0-9 ]",
RegexOptions.Compiled);
static string RemoveNonAlpha2(string x)
{
return regex.Replace(x, string.Empty);
}

static string RemoveNonAlpha3(string x)
{
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string RemoveNonAlpha4(string x)
{
bool foundNonAlpha = false;
foreach (char c in x)
{
if (!((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
foundNonAlpha = true;
break;
}
}
if (!foundNonAlpha)
{
return x;
}
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string[] GenerateTestData()
{
Random random = new Random(0);
string[] ret = new string[Size];

for (int i=0; i < Size; i++)
{
StringBuilder builder = new StringBuilder(Length);
for (int j=0; j < Length; j++)
{
char[] selection;

if (random.NextDouble() < NonAlphaProportion)
{
selection = NonAlphaChars;
}
else
{
selection = AlphaChars;
}
builder.Append (
selection[random.Next(selection.Length)]);
}
ret[i] = builder.ToString();
}
return ret;
}
}

Here, the first version is what you've currently got.

The second version is a version using a cached, compiled regular
expression instance.

The third version goes through each character in the string in a hard-
coded manner, and appends each alpha-numeric character to a
StringBuilder.

The fourth version checks whether or not there's anything to trim
before even creating the StringBuilder.

Now, with the sample data from above, here's the timing on my machine:

1) 40 seconds
2) 10 seconds
3) 3.73 seconds
4) 3 seconds
Now, here's the timing when the proportion of non-alphanumeric
characters is changed to 5% instead of 1%:

1) 44 seconds
2) 12 seconds
3) 3.8 seconds
4) 3.86 seconds

Finally, when the proportion is changed to 0.1%:

1) 37 seconds
2) 9 seconds
3) 3.7 seconds
4) 1.3 seconds

So, as you can see, the performance depends on the data. However, I
would actually suggest you go for number 2 unless it's absolutely
performance critical. A regular expression is the simplest way of
expressing what you're after in this case, and the performance is a lot
better than with the first case.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Jan 3 '06 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: gilad | last post by:
I'm working on an application in C# that will perform regular expression matching against a small string. Usually regular expressions are used such that the text being searched is large while the...
7
by: Patient Guy | last post by:
Coding patterns for regular expressions is completely unintuitive, as far as I can see. I have been trying to write script that produces an array of attribute components within an HTML element. ...
7
by: Billa | last post by:
Hi, I am replaceing a big string using different regular expressions (see some example at the end of the message). The problem is whenever I apply a "replace" it makes a new copy of string and I...
2
by: shomun | last post by:
Hi, I am new to regular expression stuffs. I am facing problem while implementing a reg. exp. for a textbox using regular expression validator in ASP page. Requirement: It will take only...
18
by: Lit | last post by:
Hi, I am looking for a Regular expression for a password for my RegExp ValidationControl Requirements are, At least 8 characters long. At least one digit At least one upper case character
6
by: David | last post by:
I'm having trouble getting the regular expression validator to work with a text box. In this simple example I only want lower case letters to be allowed. So I tried the following and it doesn't...
10
by: venugopal.sjce | last post by:
Hi Friends, I'm constructing a regular expression for validating an expression which looks as any of the following forms: 1. =4*++2 OR 2. =Sum()*6 Some of the samples I have constructed...
5
karlectomy
by: karlectomy | last post by:
Hey all, I have a parsing application where I am taking text files and choppiing them up (so to speak) There are small formatting differences which I was hoping to take care of with regular...
9
by: Kirk | last post by:
Hi All, the following regular expression matching seems to enter in a infinite loop: ################ import re text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 2 August 2023 starting at 18:00 UK time (6PM UTC+1) and finishing at about 19:15 (7.15PM) The start time is equivalent to 19:00 (7PM) in Central...
0
by: erikbower65 | last post by:
Using CodiumAI's pr-agent is simple and powerful. Follow these steps: 1. Install CodiumAI CLI: Ensure Node.js is installed, then run 'npm install -g codiumai' in the terminal. 2. Connect to...
0
linyimin
by: linyimin | last post by:
Spring Startup Analyzer generates an interactive Spring application startup report that lets you understand what contributes to the application startup time and helps to optimize it. Support for...
0
by: erikbower65 | last post by:
Here's a concise step-by-step guide for manually installing IntelliJ IDEA: 1. Download: Visit the official JetBrains website and download the IntelliJ IDEA Community or Ultimate edition based on...
0
by: kcodez | last post by:
As a H5 game development enthusiast, I recently wrote a very interesting little game - Toy Claw ((http://claw.kjeek.com/))。Here I will summarize and share the development experience here, and hope it...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Sept 2023 starting at 18:00 UK time (6PM UTC+1) and finishing at about 19:15 (7.15PM) The start time is equivalent to 19:00 (7PM) in Central...
14
DJRhino1175
by: DJRhino1175 | last post by:
When I run this code I get an error, its Run-time error# 424 Object required...This is my first attempt at doing something like this. I test the entire code and it worked until I added this - If...
5
by: DJRhino | last post by:
Private Sub CboDrawingID_BeforeUpdate(Cancel As Integer) If = 310029923 Or 310030138 Or 310030152 Or 310030346 Or 310030348 Or _ 310030356 Or 310030359 Or 310030362 Or...
0
by: lllomh | last post by:
Define the method first this.state = { buttonBackgroundColor: 'green', isBlinking: false, // A new status is added to identify whether the button is blinking or not } autoStart=()=>{

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.