
Regular expression optimization

Hi,
I am replacing text in a big string using several different regular
expressions (see the example at the end of the message). The problem is
that every "replace" makes a new copy of the string, and I want to avoid
that. Is there a way to pass a memory stream, or an array of "find" and
"replace" expressions, or some other way to avoid multiple copies of the
string?

Any help will be highly appreciated

Regards,
Billla

Example:
' Delete all namespaces from the <multistatus> root element.
' Note: with IgnorePatternWhitespace, the unescaped space in the pattern
' is ignored by the regex engine.
Pattern = "\<a\:multistatus [^>]*\>"
ReplaceWith = "<a:multistatus>"
sContents = Regex.Replace(sContents, Pattern, ReplaceWith, _
    RegexOptions.IgnorePatternWhitespace)

' Step#8-B.
' Do the closing tag part.
Pattern = "\<\/response\>"
ReplaceWith = "</nsResp:response>"
sContents = Regex.Replace(sContents, Pattern, ReplaceWith, _
    RegexOptions.IgnorePatternWhitespace)

Jan 26 '06 #1
7 Replies


Hi Billa,

This was an interesting problem to me, so I took a crack at solving it.

I constructed a class which uses the Regex.Replace overload taking an input
string and a MatchEvaluator delegate. If you've never looked at this
overload, the Regular Expression is evaluated against the input string, and
the MatchEvaluator Delegate is called for each Match in the result.

The first part required combining the separate Regular Expression strings
into a single Regular Expression. I used grouping with the "|" (or)
operator, so that the Regular Expression would match any of the Regular
Expressions in the single expression.

The trick was to get the MatchEvaluator delegate to recognize which of the
sub-expressions formed the particular Match being passed to it. For this, I
used named groups. The class has a private string array of replacement
strings which is passed along with the Regular Expression to the
Constructor. The replacement string array must match the number of
sub-expressions in the Regular Expression, and in the same order.

The Replace Method of the class loops through the array of Groups in the
Match, which is the same as the array of Groups in the Regular Expression.
It ignores the first group in the array, as that is always the match itself.
It looks for the Group with the Success property as true, and returns the
replacement string in the array that corresponds to that position in the
Groups collection.

I tested this, and it is bug-free. The class definition follows, followed by
an example:

using System.Text.RegularExpressions;

/// <summary>
/// Replaces multiple Regular Expressions in a string with multiple replacement
/// strings without having to perform separate replacements.
/// </summary>
/// <remarks>The <c>System.Text.RegularExpressions.Regex.Replace(string, MatchEvaluator)</c>
/// overloads replace a single Regular Expression in a string. To replace many
/// Regular Expressions in a string would entail creating many strings. This class
/// enables the replacement to be done with a single string returned.</remarks>
public class MultiReplacer
{
    private string[] Replacers;
    private Regex r;

    // Called once per match. Group 0 is the whole match, so the search starts
    // at group 1 and returns the replacement at the corresponding position.
    private string Replacer(Match m)
    {
        for (int i = 1; i < m.Groups.Count; i++)
        {
            if (m.Groups[i].Success)
                return Replacers[i - 1];
        }
        return "";
    }

    /// <summary>
    /// Replaces all groups matching the expression initializer variable in the input
    /// string with the matching replacement string from the array of replacement
    /// strings passed in the initializer.
    /// </summary>
    /// <param name="input">String to evaluate.</param>
    /// <returns>The fully-replaced string.</returns>
    public string Replace(string input)
    {
        MatchEvaluator meval = new MatchEvaluator(Replacer);
        return r.Replace(input, meval);
    }

    /// <summary>
    /// Constructor.
    /// </summary>
    /// <param name="expression">Regular Expression String.</param>
    /// <param name="replacers">Array of replacement strings.</param>
    /// <remarks>The <paramref name="expression"/> parameter must be a Regular Expression
    /// made of top-level groups (the example uses named groups), combined with "|" to
    /// match any of the groups in the <paramref name="expression"/>. The
    /// <paramref name="replacers"/> array must have the same number of elements as the
    /// number of groups in the <paramref name="expression"/>, and in the same order.</remarks>
    public MultiReplacer(string expression, string[] replacers)
    {
        r = new Regex(expression);
        Replacers = replacers;
    }
}

Example (from my test form):

private void btnMultiReplace_Click(object sender, EventArgs e)
{
    // One group per sub-expression, in the same order as the replacers array.
    // The pattern must stay on one line: a line break inside a verbatim
    // string would become part of the pattern.
    string s = @"(?<multistatus><a\:multistatus [^>]*\>)|(?<endresponse>\<\/response\>)";
    string[] replacers = new string[] { @"<a:multistatus>", @"</nsResp:response>" };
    MultiReplacer replacer = new MultiReplacer(s, replacers);

    // Calls a method that adds the text to the same TextBox
    SetTextMessage(replacer.Replace(txtMessage.Text), false);
}
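
Since the original question asked about passing an array of "find" and "replace" expressions, here is a minimal sketch of how the combined pattern could be built automatically from such pairs. The class name PairReplacer and the generated group names r0, r1, ... are assumptions made for this illustration and are not part of the class above. The sketch also assumes the sub-patterns do not define groups with those names themselves, and that replacement strings are inserted literally (no $1-style substitutions).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: builds one alternation from (find, replace) pairs
// and looks the replacement up by generated group name.
public static class PairReplacer
{
    public static string ReplaceAll(string input, IList<KeyValuePair<string, string>> pairs)
    {
        var parts = new List<string>();
        for (int i = 0; i < pairs.Count; i++)
        {
            // Wrap each sub-pattern in its own named group: (?<r0>...), (?<r1>...)
            parts.Add("(?<r" + i + ">" + pairs[i].Key + ")");
        }
        var regex = new Regex(string.Join("|", parts.ToArray()));

        return regex.Replace(input, m =>
        {
            for (int i = 0; i < pairs.Count; i++)
            {
                if (m.Groups["r" + i].Success)
                    return pairs[i].Value;
            }
            return m.Value; // no group matched: leave the text unchanged
        });
    }
}
```

Looking groups up by name rather than by position keeps the mapping stable even when a sub-pattern contains parentheses of its own.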

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.

Jan 26 '06 #2

I have to correct myself. The code is fine, but my explanation has an error
in it. I *initially* thought of using named groups, but as I continued to
optimize the code, I realized that I didn't need to use named groups, as the
position of the group in the GroupNames array would correspond to the
position in the replacement array of strings, to identify the replacement
string to use. You'll notice the MatchEvaluator delegate doesn't use the
group names, but only iterates through the Collection.
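
The positional mapping described above depends on how .NET numbers groups. As a small sketch (the class and method names are made up for this illustration), Regex.GetGroupNames lists the groups in index order, which makes the ordering easy to inspect. One thing worth checking with it: .NET numbers unnamed groups before named ones, so a pattern that mixes both styles may not line up with the replacement array the way its left-to-right order suggests.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for inspecting group order; not part of MultiReplacer.
public static class GroupOrderDemo
{
    // Returns the group names of the pattern in index order, comma-separated.
    // Index 0 is always the whole match and is reported as "0".
    public static string GroupOrder(string pattern)
    {
        return string.Join(",", new Regex(pattern).GetGroupNames());
    }
}
```

For example, GroupOrderDemo.GroupOrder(@"(?<multistatus>a)|(?<endresponse>b)") lists the two named groups in the order they were written, which is the case the class relies on.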

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.

Jan 26 '06 #3

On 26 Jan 2006 06:40:13 -0800, "Billa" <Bi********@gmail.com> wrote:
[...] The problem is whenever I apply a "replace" it makes a new copy of
string and I want to avoid that. [...]

In .NET the String class is immutable, it cannot be changed. When you
make any change to a string a new copy is created leaving the old copy
behind. As you have noticed, multiple changes mean multiple copies.

If you are going to be making a lot of changes to a string then use
the StringBuilder class instead. A StringBuilder is mutable so the
changes are made in the object itself, hence no extra copies.

For best efficiency, give the StringBuilder an initial capacity large
enough to hold the longest text that you want to put in it. This saves
reallocating memory when the StringBuilder expands.
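
A caveat when combining this with regular expressions: StringBuilder.Replace only handles literal strings and characters, and the Regex methods take a string as input. A minimal sketch of one way to put the two together follows. The helper name BuilderReplace is hypothetical, and the initial capacity is only a guess that the output is about as long as the input.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: a single pass over the matches, appending the
// untouched text and the replacements into one pre-sized StringBuilder.
public static class BuilderReplace
{
    public static string Replace(string input, Regex regex, MatchEvaluator evaluator)
    {
        var sb = new StringBuilder(input.Length); // assumed capacity: roughly the input length
        int last = 0;
        foreach (Match m in regex.Matches(input))
        {
            sb.Append(input, last, m.Index - last); // text between the previous match and this one
            sb.Append(evaluator(m));                // replacement for this match
            last = m.Index + m.Length;
        }
        sb.Append(input, last, input.Length - last); // tail after the last match
        return sb.ToString();
    }
}
```

This produces one result string for any number of sub-expressions, provided they are combined into a single Regex; the input string itself is still needed in full.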

rossum

--

The ultimate truth is that there is no ultimate truth
Jan 26 '06 #4

Thanks a lot Kevin!
I'll look into it and will come back to you if I have more questions. I
truly appreciate your time.

Jan 27 '06 #5

Thanks rossum!
Can you please tell me how I can use StringBuilder with regular expressions?

Jan 27 '06 #6

I don't believe he understood the nature of your question.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.

Jan 27 '06 #7

Your posted solution helped me to better understand how the replace
function works. Care to ponder another one? I need to parse street
addresses and break them into individual components. I have successfully
done this using the following regular expression written by John
Sample.

^(?'number'\d+)? (\s+)? (?# Optional House/Place number)
(?'dirp'NORTH|SOUTH|EAST|WEST|N|S|E|W|NE|NW|SE|SW|NORTHEAST|NORTHWEST|SOUTHEAST|SOUTHWEST)?(\s+)? (?# Dirp is optional)
(?'street'\b\w[\w ]+\b) (?# Street Name - required)
\b(?'streetType'ALY|ARC|AVE|BLVD|BR|BRG|BYP|CIR|CRES|CSWY|CT|CTR|CV|DR|EXPY|FMRD|FWY|GRD|HWY|LN|LOOP|MAL|MTWY|OVPS|PASS|PATH|PIKE|PKY|PL|PLZ|RAMP|RD|RMRD|ROW|RTE|RUE|RUN|SKWY|SPUR|SQ|ST|TER|TFWY|THFR|THWY|TPKE|TRCE|TRL|TUNL|WALK|WAY|WKWY|XING|STREET|DRIVE|ROAD|AVENUE|FREEWAY|PARKWAY|HIGHWAY|BOULEVARD|BYPASS|TURNPIKE|TRAIL|SQUARE)\b\.?(\s*)
((?'dirs'NORTH|SOUTH|EAST|WEST|N|S|E|W|NE|NW|SE|SW|NORTHEAST|NORTHWEST|SOUTHEAST|SOUTHWEST)\b)?
(,?(?'city'[\w ]{2,}\b),?\s(?'state'[A-Z]{2}))? (?# entire section is optional, but the pieces, if they exist, are not)
([^\r\n\w])?,?\s?(?'zip'\d{5})?

As you can see, it contains several named groups that break the street
address into components. For example, "1000 North Thomas Jefferson St
NW, Washington, DC 20007" is broken down into

number: 1000
dirp: North
street: Thomas Jefferson
streetType: St
dirs: NW
city: Washington
state: DC
zip: 20007

Now I need to normalize the dirp, streetType, and dirs groups. Using
the example above, dirp should be normalized to "N". Of course I can
do this with the C# Replace function easily enough, but I was hoping
there was a way to use the Regex Replace function instead. Thoughts?
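
One possible approach, sketched here under assumptions rather than given as a definitive answer, is to run Regex.Replace with a MatchEvaluator over the captured group value and look each spelled-out word up in a table. The class name DirectionNormalizer and the lookup table are hypothetical; only directions are shown, and street types could be handled with a second table in the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: maps spelled-out directions to their abbreviations.
public static class DirectionNormalizer
{
    // Assumed lookup table, case-insensitive so "North" and "NORTH" both match.
    private static readonly Dictionary<string, string> Abbrev =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "NORTH", "N" }, { "SOUTH", "S" }, { "EAST", "E" }, { "WEST", "W" },
            { "NORTHEAST", "NE" }, { "NORTHWEST", "NW" },
            { "SOUTHEAST", "SE" }, { "SOUTHWEST", "SW" }
        };

    // Matches only the words that appear as keys in the table above.
    private static readonly Regex Words = new Regex(
        @"\b(NORTHEAST|NORTHWEST|SOUTHEAST|SOUTHWEST|NORTH|SOUTH|EAST|WEST)\b",
        RegexOptions.IgnoreCase);

    // Replaces each spelled-out direction in 'value' with its abbreviation;
    // anything else, including values that are already abbreviated, is kept.
    public static string Normalize(string value)
    {
        return Words.Replace(value, m => Abbrev[m.Value]);
    }
}
```

It is probably safest to apply this to the individual group values, for instance DirectionNormalizer.Normalize(match.Groups["dirp"].Value), rather than to the whole address, so that a street name that happens to contain "North" is left alone.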

- Brian
--
Brian Hardwick

Jan 31 '06 #8
