473,769 Members | 2,355 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Regex optimization

In a C# Regex expression which would be faster when run against say 10,000
strings:

Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally and
it's not clear why one would be faster than the other.
Sep 26 '07 #1
7 2051
Try both tests and then you will know!
"Chuck B" <ch****@shc1.co mwrote in message
news:Ou******** ******@TK2MSFTN GP05.phx.gbl...
In a C# Regex expression which would be faster when run against say 10,000
strings:

Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally and
it's not clear why one would be faster than the other.
Sep 26 '07 #2
Yes I will - but then that won't tell me much about what went on internally.

I was hoping that someone with knowledge of the Regex engine could help me
understand why one was better than the other.

Thanks for your help. Oh wait... nm... ;)
"Stephany Young" <noone@localhos twrote in message
news:em******** ******@TK2MSFTN GP03.phx.gbl...
Try both tests and then you will know!
"Chuck B" <ch****@shc1.co mwrote in message
news:Ou******** ******@TK2MSFTN GP05.phx.gbl...
>In a C# Regex expression which would be faster when run against say
10,000 strings:

Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally and
it's not clear why one would be faster than the other.

Sep 26 '07 #3
The Regex engine works by looping through the string one character at a
time, applying the rules of the test to each fragment. In some cases, it
may backtrack, depending on the rules. Therefore, the more rules there are
in the regular expression, the longer it will take. The rules in the regular
expression are defined by the characters in it.

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

"Chuck B" <ch****@shc1.co mwrote in message
news:Ou******** ******@TK2MSFTN GP05.phx.gbl...
In a C# Regex expression which would be faster when run against say 10,000
strings:

Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally and
it's not clear why one would be faster than the other.

Sep 26 '07 #4
Chuck B wrote:
Yes I will - but then that won't tell me much about what went on internally.

I was hoping that someone with knowledge of the Regex engine could help me
understand why one was better than the other.

Thanks for your help. Oh wait... nm... ;)
Trying it for yourself and seeing the results is a perfectly reasonable
answer.

While it may not allow you to recreate the source code, it certainly
will tell you which way is more efficient. Because your original regexes
were completely unequal, it's hard to say whether you could actually
learn anything about what is going on "internally " since in one instance
you'd get a LOT more hits than the other.

Assuming a somewhat normal dataset where some records have "The quick
brown fox" and more records have "The" in them, I'd suggest that it's
probably more work to return more results than it is simply to find
them, so #1 wins out. On 10,000 lines of "The quick brown fox" where the
returned values would be equal, I'd wager on the second regex being
faster (equal time spent returning values, less time spent searching).

Chris.
Sep 26 '07 #5

"Chris Shepherd" <ch**@nospam.ch sh.cawrote in message
news:%2******** *******@TK2MSFT NGP03.phx.gbl.. .
Chuck B wrote:
>Yes I will - but then that won't tell me much about what went on
internally.

I was hoping that someone with knowledge of the Regex engine could help
me understand why one was better than the other.

Thanks for your help. Oh wait... nm... ;)

Trying it for yourself and seeing the results is a perfectly reasonable
answer.
I have to disagree here. Trying it for myself will give me numbers but not
understanding which is the untimate goal.
While it may not allow you to recreate the source code, it certainly will
tell you which way is more efficient. Because your original regexes were
completely unequal, it's hard to say whether you could actually learn
anything about what is going on "internally " since in one instance you'd
get a LOT more hits than the other.

Assuming a somewhat normal dataset where some records have "The quick
brown fox" and more records have "The" in them, I'd suggest that it's
probably more work to return more results than it is simply to find them,
so #1 wins out. On 10,000 lines of "The quick brown fox" where the
returned values would be equal, I'd wager on the second regex being faster
(equal time spent returning values, less time spent searching).
The fault here is mine for not explaining adequately what I wanted.

In the case above I'm looking for the special case where there is 1 match
per string for either Regex. For instance; the result of running both
examples above against "09/26/07 The quick brown fox ran away."

The date would probably take just as long for each regex. However, it seems
like it might be more efficient with the static characters to search for a
longer string than a shorter one (assuming that there was no match embedded
inside of another match). The reason for the increase is that the pointer
pointing to the head of the search would move a greater length after a
successful match.
Sep 26 '07 #6
Thanks Kevin.

I ran a test with both 10,000 iterations of the test string - "9/19/09 This
is a test. This is only a test of the quick brown fox. If this had been a
real quick brown fox it would have eaten yer toes." with the Regex
"\d+/\d+/\d+.*real.*" and "\d+/\d+/\d+.*this had been a real.*" and it turns
out that there's about a 3 microsecond difference with the shorter
expression being faster.

I'm guessing that rules that involve escape characters like \d, \w would
take longer to match due to the fact that there are more candidates to sort
thru.
"Kevin Spencer" <un**********@n othinks.comwrot e in message
news:eO******** ******@TK2MSFTN GP03.phx.gbl...
The Regex engine works by looping through the string one character at a
time, applying the rules of the test to each fragment. In some cases, it
may backtrack, depending on the rules. Therefore, the more rules there are
in the regular expression, the longer it will take. The rules in the
regular expression are defined by the characters in it.

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

"Chuck B" <ch****@shc1.co mwrote in message
news:Ou******** ******@TK2MSFTN GP05.phx.gbl...
>In a C# Regex expression which would be faster when run against say
10,000 strings:

Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally and
it's not clear why one would be faster than the other.


Sep 26 '07 #7
Hi Chuck,

Actually, it's just a simple matter of more rules in the regular expression.

\d+/\d+/\d+ The quick brown fox.*
\d+/\d+/\d+ The.*

Note that each character represents a rule. "quick brown fox" is actually 15
rules, indicating specific characters that must match. So, the Regex engine
must test each successive character in the target string against each of
these characters/rules to ascertain a match before moving on. With the
wildcard (.*), only the newline character (1 character) must be looked for.

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
"Chuck B" <ch****@shc1.co mwrote in message
news:OE******** ******@TK2MSFTN GP03.phx.gbl...
Thanks Kevin.

I ran a test with both 10,000 iterations of the test string - "9/19/09
This is a test. This is only a test of the quick brown fox. If this had
been a real quick brown fox it would have eaten yer toes." with the Regex
"\d+/\d+/\d+.*real.*" and "\d+/\d+/\d+.*this had been a real.*" and it
turns out that there's about a 3 microsecond difference with the shorter
expression being faster.

I'm guessing that rules that involve escape characters like \d, \w would
take longer to match due to the fact that there are more candidates to
sort thru.
"Kevin Spencer" <un**********@n othinks.comwrot e in message
news:eO******** ******@TK2MSFTN GP03.phx.gbl...
>The Regex engine works by looping through the string one character at a
time, applying the rules of the test to each fragment. In some cases, it
may backtrack, depending on the rules. Therefore, the more rules there
are in the regular expression, the longer it will take. The rules in the
regular expression are defined by the characters in it.

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

"Chuck B" <ch****@shc1.co mwrote in message
news:Ou******* *******@TK2MSFT NGP05.phx.gbl.. .
>>In a C# Regex expression which would be faster when run against say
10,000 strings:

Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally
and it's not clear why one would be faster than the other.



Sep 27 '07 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

14
1894
by: Reinhold Birkenfeld | last post by:
Hello, I recently ported a simple utility script to analyze a data file from Perl to Python that uses regex substitutions, not more complex than re1 = re.compile(r"\s*<.*>\s*") re2 = re.compile(r".*\((.*)\).*") re3 = re.compile(r'^"(.*)"$') When run without these regex substitutions, the scripts' speed is nearly
10
3092
by: Chance Hopkins | last post by:
I'm trying to match a set of matches after some initial text: mytext: "something" "somethingelse" "another thing" "maybe another" (?:mytext: )(?<mymatch>{1,1}+{1,1}+)+ I only get the last one "maybe another". I want to get all the values with quotes as a group, hence the + after the ()'s. Is this possible?
2
4285
by: John Grandy | last post by:
Is it advisable to compile a Regex for a massively scalable ASP.NET web-application ? How exactly does this work ? Do you create a separate class library and expose the Regex.Replace() as a method ..... ..... or can you just use the syntax :
2
1384
by: .NET Developer | last post by:
Hello, I'm trying to write a RegEx that will find all occurances of a particular type of HTML anchor <a> element in a big block of HTML. Here are the pattern requirements - they consist of certain attributes being present, basically: must start with "<a" followed by 1 or more white-space characters then, before the closing tag ">" it must contain 3 attributes: 1. class 2. href
1
2011
by: rh | last post by:
hi all, take the following 2 c# lines: 1) str = Regex.Replace(str, ".*AAA", ""); 2) str = Regex.Replace(str, "^.*AAA", ""); notice that the only difference is that the pattern in line 2 has a starter marker (^). if str is large and does not contain the pattern, line 1 takes much much longer to complete than line 2 and cpu goes nearly full throttle.
15
3232
by: Kay Schluehr | last post by:
I have a list of strings ls = and want to create a regular expression sx from it, such that sx.match(s) yields a SRE_Match object when s starts with an s_i for one i in . There might be relations between those strings: s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True. An extreme case would be ls = . For this reason SRE_Match should provide the longest possible match. Is there a Python module able to create an optimized regex rx...
6
2503
by: Extremest | last post by:
I have a huge regex setup going on. If I don't do each one by itself instead of all in one it won't work for. Also would like to know if there is a faster way tried to use string.replace with all the right parts in there in one big line and for some reason that did not work either. Here is my regex's. static Regex rar = new Regex("\\.part.*", RegexOptions.IgnoreCase); static Regex par = new Regex("\\.vol.*", RegexOptions.IgnoreCase);
15
50260
by: morleyc | last post by:
Hi, i would like to remove a number of characters from my string (\t \r \n which are throughout the string), i know regex can do this but i have no idea how. Any pointers much appreciated. Chris
20
2357
by: Ravikiran | last post by:
Hi Friends, I wanted know about whatt is ment by zero optimization and sign optimization and its differences.... Thank you...
0
10050
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9999
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9866
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8876
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7413
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6675
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5310
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5448
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
2
3570
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.