Regex optimization

In C#, which Regex would be faster when run against, say, 10,000
strings:

new Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

new Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally and
it's not clear why one would be faster than the other.
Sep 26 '07 #1
Try both tests and then you will know!
Sep 26 '07 #2
Yes I will - but then that won't tell me much about what went on internally.

I was hoping that someone with knowledge of the Regex engine could help me
understand why one was better than the other.

Thanks for your help. Oh wait... nm... ;)
"Stephany Young" <noone@localhost> wrote in message
news:em**************@TK2MSFTNGP03.phx.gbl...
> Try both tests and then you will know!
Sep 26 '07 #3
The Regex engine works by looping through the string one character at a
time, applying the rules of the test to each fragment. In some cases, it
may backtrack, depending on the rules. Therefore, the more rules there are
in the regular expression, the longer it will take. The rules in the regular
expression are defined by the characters in it.

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
Sep 26 '07 #4
Chuck B wrote:
> Yes I will - but then that won't tell me much about what went on internally.
>
> I was hoping that someone with knowledge of the Regex engine could help me
> understand why one was better than the other.
>
> Thanks for your help. Oh wait... nm... ;)

Trying it for yourself and seeing the results is a perfectly reasonable
answer.

While it may not allow you to recreate the source code, it certainly
will tell you which way is more efficient. Because your original regexes
were completely unequal, it's hard to say whether you could actually
learn anything about what is going on "internally" since in one instance
you'd get a LOT more hits than the other.

Assuming a somewhat normal dataset where some records have "The quick
brown fox" and more records have "The" in them, I'd suggest that it's
probably more work to return more results than it is simply to find
them, so #1 wins out. On 10,000 lines of "The quick brown fox" where the
returned values would be equal, I'd wager on the second regex being
faster (equal time spent returning values, less time spent searching).

Chris.
Sep 26 '07 #5

"Chris Shepherd" <ch**@nospam.chsh.ca> wrote in message
news:%2***************@TK2MSFTNGP03.phx.gbl...
> Trying it for yourself and seeing the results is a perfectly reasonable
> answer.

I have to disagree here. Trying it for myself will give me numbers, but not
understanding, which is the ultimate goal.

> While it may not allow you to recreate the source code, it certainly will
> tell you which way is more efficient. Because your original regexes were
> completely unequal, it's hard to say whether you could actually learn
> anything about what is going on "internally" since in one instance you'd
> get a LOT more hits than the other.
>
> Assuming a somewhat normal dataset where some records have "The quick
> brown fox" and more records have "The" in them, I'd suggest that it's
> probably more work to return more results than it is simply to find them,
> so #1 wins out. On 10,000 lines of "The quick brown fox" where the
> returned values would be equal, I'd wager on the second regex being faster
> (equal time spent returning values, less time spent searching).

The fault here is mine for not explaining adequately what I wanted.

In the case above I'm looking at the special case where there is 1 match
per string for either Regex; for instance, the result of running both
examples above against "09/26/07 The quick brown fox ran away."

The date would probably take just as long for each regex. However, it seems
like it might be more efficient with the static characters to search for a
longer string than a shorter one (assuming that there was no match embedded
inside of another match). The reason for the increase is that the pointer
pointing to the head of the search would move a greater length after a
successful match.
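
A minimal sketch of that single-match case, for anyone who wants to see it run. It assumes .NET 6 or later (top-level statements); the variable names are illustrative only. It runs both patterns from the original question against the sample line and prints what each one matched.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample line from the post above.
string input = "09/26/07 The quick brown fox ran away.";

// The two patterns from the original question.
var longer = new Regex(@"\d+/\d+/\d+ The quick brown fox.*");
var shorter = new Regex(@"\d+/\d+/\d+ The.*");

Match m1 = longer.Match(input);
Match m2 = shorter.Match(input);

// Both patterns end in .*, which runs to the end of the line,
// so each one finds a single match covering the same text.
Console.WriteLine($"{m1.Success} at {m1.Index}: {m1.Value}");
Console.WriteLine($"{m2.Success} at {m2.Index}: {m2.Value}");
Console.WriteLine(m1.Value == m2.Value);
```

Since the two matches are identical on this input, any timing difference would have to come from how the engine gets there, not from what it returns.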
Sep 26 '07 #6
Thanks Kevin.

I ran a test with 10,000 iterations against the test string "9/19/09 This
is a test. This is only a test of the quick brown fox. If this had been a
real quick brown fox it would have eaten yer toes." using the Regexes
"\d+/\d+/\d+.*real.*" and "\d+/\d+/\d+.*this had been a real.*", and it turns
out that there's about a 3 microsecond difference, with the shorter
expression being faster.
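
A rough sketch of how a timing test like this could be written, assuming .NET 6 or later (top-level statements). The helper name `Time`, the use of `IsMatch`, and the warm-up call are assumptions for illustration, not the code that produced the number above.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

string input = "9/19/09 This is a test. This is only a test of the quick brown fox. "
             + "If this had been a real quick brown fox it would have eaten yer toes.";

var shorter = new Regex(@"\d+/\d+/\d+.*real.*");
var longer = new Regex(@"\d+/\d+/\d+.*this had been a real.*");

const int Iterations = 10_000;

// Hypothetical helper: times Iterations calls to IsMatch on the sample string.
TimeSpan Time(Regex regex)
{
    regex.IsMatch(input); // warm-up call so first-use setup cost is not timed
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < Iterations; i++)
    {
        regex.IsMatch(input);
    }
    sw.Stop();
    return sw.Elapsed;
}

TimeSpan shortTime = Time(shorter);
TimeSpan longTime = Time(longer);

Console.WriteLine($"shorter: {shortTime.TotalMilliseconds:F3} ms");
Console.WriteLine($"longer:  {longTime.TotalMilliseconds:F3} ms");
```

Differences of a few microseconds are easily within run-to-run noise, so it is worth repeating the measurement several times before reading much into it.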

I'm guessing that rules that involve escape characters like \d and \w would
take longer to match because there are more candidate characters to sort
through.
Sep 26 '07 #7
Hi Chuck,

Actually, it's just a simple matter of more rules in the regular expression.

\d+/\d+/\d+ The quick brown fox.*
\d+/\d+/\d+ The.*

Note that each character represents a rule. "quick brown fox" is actually 15
rules, indicating specific characters that must match. So, the Regex engine
must test each successive character in the target string against each of
these characters/rules to ascertain a match before moving on. With the
wildcard (.*), only the newline character (1 character) must be looked for.
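
The "one character, one rule" idea can be illustrated with a toy model. This is only a sketch and not how the .NET engine is actually implemented (the real engine applies optimizations this ignores); `CountComparisons` is a made-up helper, and the code assumes .NET 6 or later (top-level statements). It counts how many character comparisons a plain left-to-right check of a literal would make.

```csharp
using System;

// Toy model only, not the real engine: checks a literal against the text one
// character at a time starting at `start`, and returns how many character
// comparisons were made before the attempt succeeded or failed.
int CountComparisons(string text, int start, string literal)
{
    int comparisons = 0;
    for (int i = 0; i < literal.Length && start + i < text.Length; i++)
    {
        comparisons++;
        if (text[start + i] != literal[i])
        {
            break; // first mismatch ends the attempt
        }
    }
    return comparisons;
}

string input = "09/26/07 The quick brown fox ran away.";
int afterDate = input.IndexOf(' '); // index of the space that follows the date

Console.WriteLine(CountComparisons(input, afterDate, " The quick brown fox")); // 20
Console.WriteLine(CountComparisons(input, afterDate, " The"));                 // 4
Console.WriteLine(CountComparisons("09/26/07 The slow fox", 8, " The quick brown fox")); // 6
```

On a matching line the longer literal costs 16 more comparisons in this model; on a line that diverges early, the attempt is abandoned at the first mismatch.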

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
Sep 27 '07 #8
