Regular Expression taking excessive CPU

Brian Kitt

I have a process where I do some minimal reformatting on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?

ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);

ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.

Jan 3 '06 #1

Jon Skeet [C# MVP]

Okay, there are some problems here:

1) By using the static Regex.Replace method, you're making the
framework parse your pattern every time you call it.

2) You're also not using RegexOptions.Compiled, which can help
performance.

3) You don't need to use a regular expression at all.

Here's a test program:

using System;
using System.Text;
using System.Text.RegularExpressions;

public class Test
{
    // Number of strings
    const int Size = 1000;
    // Length of string
    const int Length = 100;
    // How often to include a non-alphanumeric character
    const double NonAlphaProportion = 0.01;

    // How many iterations to run
    const int Iterations = 1000;

    // Non-alphanumeric characters to pick from (at random)
    static readonly char[] NonAlphaChars =
        ".,!\"$%^&*()_+-=".ToCharArray();
    // Alphanumeric characters to pick from (at random)
    static readonly char[] AlphaChars =
        ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
         "abcdefghijklmnopqrstuvwxyz"+
         "0123456789 ").ToCharArray();

    static void Main()
    {
        string[] strings = GenerateTestData();

        int total=0;
        DateTime start = DateTime.Now;
        for (int i=0; i < Iterations; i++)
        {
            foreach (string x in strings)
            {
                // Change the method called here to time each version
                total += RemoveNonAlpha1(x).Length;
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine ("Time taken: {0}", end-start);
        Console.WriteLine ("Total length: {0}", total);
    }

    // Version 1: static Regex.Replace call with the pattern passed each time
    static string RemoveNonAlpha1(string x)
    {
        return Regex.Replace(x, "[^a-zA-Z0-9 ]", string.Empty);
    }

    // Version 2: a single cached, compiled Regex instance
    static Regex regex = new Regex("[^a-zA-Z0-9 ]",
                                   RegexOptions.Compiled);
    static string RemoveNonAlpha2(string x)
    {
        return regex.Replace(x, string.Empty);
    }

    // Version 3: copy each allowed character into a StringBuilder
    static string RemoveNonAlpha3(string x)
    {
        StringBuilder builder = new StringBuilder(x.Length);
        foreach (char c in x)
        {
            if (((c >= '0' && c <= '9') ||
                 (c >= 'A' && c <= 'Z') ||
                 (c >= 'a' && c <= 'z') ||
                 c==' '))
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    // Version 4: as version 3, but return the original string
    // untouched if there is nothing to remove
    static string RemoveNonAlpha4(string x)
    {
        bool foundNonAlpha = false;
        foreach (char c in x)
        {
            if (!((c >= '0' && c <= '9') ||
                  (c >= 'A' && c <= 'Z') ||
                  (c >= 'a' && c <= 'z') ||
                  c==' '))
            {
                foundNonAlpha = true;
                break;
            }
        }
        if (!foundNonAlpha)
        {
            return x;
        }
        StringBuilder builder = new StringBuilder(x.Length);
        foreach (char c in x)
        {
            if (((c >= '0' && c <= '9') ||
                 (c >= 'A' && c <= 'Z') ||
                 (c >= 'a' && c <= 'z') ||
                 c==' '))
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    static string[] GenerateTestData()
    {
        // Fixed seed so every run uses the same data
        Random random = new Random(0);
        string[] ret = new string[Size];

        for (int i=0; i < Size; i++)
        {
            StringBuilder builder = new StringBuilder(Length);
            for (int j=0; j < Length; j++)
            {
                char[] selection;

                if (random.NextDouble() < NonAlphaProportion)
                {
                    selection = NonAlphaChars;
                }
                else
                {
                    selection = AlphaChars;
                }
                builder.Append (
                    selection[random.Next(selection.Length)]);
            }
            ret[i] = builder.ToString();
        }
        return ret;
    }
}

Here, the first version is what you've currently got.

The second version is a version using a cached, compiled regular
expression instance.

The third version goes through each character in the string in a hard-
coded manner, and appends each alpha-numeric character to a
StringBuilder.

The fourth version checks whether or not there's anything to trim
before even creating the StringBuilder.

Now, with the sample data from above, here's the timing on my machine:

1) 40 seconds
2) 10 seconds
3) 3.73 seconds
4) 3 seconds

Now, here's the timing when the proportion of non-alphanumeric
characters is changed to 5% instead of 1%:

1) 44 seconds
2) 12 seconds
3) 3.8 seconds
4) 3.86 seconds

Finally, when the proportion is changed to 0.1%:

1) 37 seconds
2) 9 seconds
3) 3.7 seconds
4) 1.3 seconds
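
To put these totals in per-call terms: the test program makes Size x Iterations = 1,000 x 1,000 = 1,000,000 calls per run, so each total in seconds works out to the same number of microseconds per call on average. A small sketch of that arithmetic, using the 1% figures above (the class and method names are made up for illustration):

```csharp
using System;

public static class PerCallCost
{
    // Size * Iterations from the test program: 1000 strings, 1000 passes.
    const int Calls = 1000 * 1000;

    // Converts a total run time in seconds to an average cost per call
    // in microseconds.
    public static decimal MicrosecondsPerCall(decimal totalSeconds)
    {
        return totalSeconds * 1000000m / Calls;
    }

    public static void Main()
    {
        // Totals reported above for the 1% non-alphanumeric data set.
        decimal[] totals = { 40m, 10m, 3.73m, 3m };
        for (int i = 0; i < totals.Length; i++)
        {
            Console.WriteLine("Version {0}: {1} microseconds per call",
                i + 1, MicrosecondsPerCall(totals[i]));
        }
    }
}
```

Read that way, the static call averages about 40 microseconds per 100-character string, and the cached, compiled instance about 10.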

So, as you can see, the performance depends on the data. However, I
would actually suggest you go for number 2 unless it's absolutely
performance critical. A regular expression is the simplest way of
expressing what you're after in this case, and the performance is a lot
better than with the first case.
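
If the regular expression ever does become the bottleneck, the hand-coded version can be swapped in later. Here is a minimal sketch of a sanity check that the two approaches agree on a few inputs before switching. The class name, method names, and sample strings are illustrative and not part of the test program above:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class EquivalenceCheck
{
    static readonly Regex NonAlphanumeric =
        new Regex("[^a-zA-Z0-9 ]", RegexOptions.Compiled);

    // Option 2: cached, compiled regular expression.
    public static string WithRegex(string x)
    {
        return NonAlphanumeric.Replace(x, string.Empty);
    }

    // Option 3: hand-coded loop over the characters.
    public static string WithLoop(string x)
    {
        StringBuilder builder = new StringBuilder(x.Length);
        foreach (char c in x)
        {
            if ((c >= '0' && c <= '9') ||
                (c >= 'A' && c <= 'Z') ||
                (c >= 'a' && c <= 'z') ||
                c == ' ')
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // Made-up inputs, including an empty string and non-ASCII letters.
        string[] samples =
        {
            "", "plain keywords 123", "C#, .NET & regex!", "caf\u00e9 na\u00efve"
        };
        foreach (string s in samples)
        {
            string a = WithRegex(s);
            string b = WithLoop(s);
            Console.WriteLine("{0}: \"{1}\"", a == b ? "same" : "DIFFERENT", a);
        }
    }
}
```

Both versions treat only ASCII letters, digits, and the space as keepers, so accented letters are stripped by either one.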

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Jan 3 '06 #2

Brian Kitt

Wow, thanks. That was a far more helpful explanation than I had even
hoped for. This process is not entirely CPU critical; it's just parsing a
series of files and preparing them for DTS. But it used to run in about an
hour, and is now taking nearly 5 hours. I expect to be processing much more
data in the future, so more than anything, it's just irritating that it takes
so long to process. I will change to option #2; that should be sufficient
for what I'm doing. I agree about the regular expression. In the "Old Days"
I would have hard coded it much like you did in Option 3 or 4, but I like to
use the tools we have to make our life simpler.
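
A minimal sketch of what option #2 might look like when applied to one column of a TAB-delimited row. The column position, class name, and sample row are assumptions for illustration; the actual file layout isn't shown in this thread:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TabFileCleaner
{
    // Created once for the whole run rather than once per row.
    static readonly Regex NonAlphanumeric =
        new Regex("[^a-zA-Z0-9 ]", RegexOptions.Compiled);

    // Cleans one column of a TAB-delimited line and leaves the others
    // untouched. ftIndexColumn is a hypothetical zero-based position.
    public static string CleanLine(string line, int ftIndexColumn)
    {
        string[] fields = line.Split('\t');
        if (ftIndexColumn < fields.Length)
        {
            fields[ftIndexColumn] =
                NonAlphanumeric.Replace(fields[ftIndexColumn], string.Empty);
        }
        return string.Join("\t", fields);
    }

    public static void Main()
    {
        // Made-up row: id, name, full-text keywords.
        string row = "42\tWidget (large)\tblue, metal; 10-pack!";
        Console.WriteLine(CleanLine(row, 2));
    }
}
```

Only the keyword column is rewritten here, so punctuation in the other columns (such as the parentheses in the name) is preserved for the DTS load.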

Jan 3 '06 #3
