By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
446,386 Members | 1,817 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 446,386 IT Pros & Developers. It's quick & easy.

regex high cpu utilization

P: n/a
rh
hi all,
take the following 2 c# lines:
1) str = Regex.Replace(str, ".*AAA", "");
2) str = Regex.Replace(str, "^.*AAA", "");

notice that the only difference is that the pattern in line 2 has a
starter
marker (^). if str is large and does not contain the pattern, line 1
takes much much longer to complete than line 2 and cpu goes nearly
full throttle.

i have reproduced this behavior in .net 1.1 and 2.0. but jscript does
not exhibit this.

is this expected behavior? both lines do the exact same operation,
expect for line 2 we're explicitly instruction regex to start at the
beginning.
i'm baffled.

thanks,
rh

May 18 '06 #1
Share this Question
Share on Google+
1 Reply


P: n/a
"rh" <rh********@hotmail.com> wrote:
take the following 2 c# lines:
1) str = Regex.Replace(str, ".*AAA", "");
2) str = Regex.Replace(str, "^.*AAA", "");

notice that the only difference is that the pattern in line 2 has a
starter
marker (^). if str is large and does not contain the pattern, line 1
takes much much longer to complete than line 2 and cpu goes nearly
full throttle.

i have reproduced this behavior in .net 1.1 and 2.0. but jscript does
not exhibit this.

is this expected behavior? both lines do the exact same operation,
expect for line 2 we're explicitly instruction regex to start at the
beginning.


This is one possible expected behaviour under the current
implementation, which does some kinds of optimizations (such as
compiling to a dynamic assembly in memory if you wish) but not others
(as you have seen). Here's the way each one works in a string of length
n, assuming no optimization of any kind:

1)
Start at first character,
match next character * n times
end of string found, so back-track 1
try to match A, then end of string found | backtrack to .*
backtrack 1, try to match AA, then end of string | backtrack to .*
backtrack 1, try to match AAA, if found then success
not found, so backtrack the .* matching
Start at second character,
match next character * n-1 times
end of string found, so back-track 1
try to match A, then end of string found
backtrack 1, try to match AA, then end of string
backtrack 1, try to match AA, then end of string
not found, so backtrack the .* matching
Start at third character, ...

As you can see, this will take time proportional to n*n, or quadratic
time.

2)
Start at first character,
match next character * n times
end of string found, so back-track 1
try to match A, then end of string found
backtrack 1, try to match AA, then end of string
backtrack 1, try to match AA, then end of string
not found, so *NO MATCH*

This one can exit early because the '^' forces the expression to only
match the start of the string. It takes time proportional to n, or
linear time.

Now, semantically, one can see that '.*AAA' is simply the start of the
string up to AAA, and the whole string could be found quickly by just
searching for AAA. Since the '*' operator in .NET Regex is greedy, the
'^' is implied.

I don't know if the JScript test you ran has the same semantics, but it
appears it certainly has a different implementation with some more
optimizations. To make it quicker, you can use '^' yourself.

(I think this is a .net framework issue, not a general issue or a C#
issue, so I suppose it should have been posted to that newsgroup only.)

-- Barry

--
http://barrkel.blogspot.com/
May 18 '06 #2

This discussion thread is closed

Replies have been disabled for this discussion.