Regular expression optimization

Hi,
I am running a series of regular-expression replacements on a large string
(see the examples at the end of this message). The problem is that every
"replace" makes a new copy of the string, and I want to avoid that. Is there
a way to pass a memory stream, or an array of "find"/"replace" expressions,
or any other way to avoid making multiple copies of the string?

Any help will be highly appreciated.

Regards,
Billla

Example:
'Delete all namespaces from <multistatus> root element
Pattern = "\<a\:multistatus [^>]*\>"
ReplaceWith = "<a:multistatus>"
sContents = Regex.Replace(sContents, Pattern, ReplaceWith, RegexOptions.IgnorePatternWhitespace)

'Step#8-B.
'doing the closing tag part
Pattern = "\<\/response\>"
ReplaceWith = "</nsResp:response>"
sContents = Regex.Replace(sContents, Pattern, ReplaceWith, RegexOptions.IgnorePatternWhitespace)

Jan 26 '06 #1
Hi Billa,

This was an interesting problem to me, so I took a crack at solving it.

I constructed a class which uses the Regex.Replace overload taking an input
string and a MatchEvaluator delegate. If you've never looked at this
overload, the Regular Expression is evaluated against the input string, and
the MatchEvaluator Delegate is called for each Match in the result.

The first part required combining the separate Regular Expression strings
into a single Regular Expression. I used grouping with the "|" (or)
operator, so that the Regular Expression would match any of the Regular
Expressions in the single expression.
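
As a rough illustration of the combining step, the sketch below wraps each sub-expression in its own capturing group and joins them with "|". The `PatternCombiner` class and its `Combine` method are hypothetical names, not part of the class that follows, and the sketch assumes the sub-expressions contain no capturing groups of their own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCombiner
{
    // Wraps each sub-pattern in its own capturing group and joins them with "|".
    // Assumption: the sub-patterns contain no capturing groups of their own,
    // otherwise the group numbers no longer line up with the array positions.
    public static string Combine(params string[] patterns)
    {
        string[] grouped = new string[patterns.Length];
        for (int i = 0; i < patterns.Length; i++)
        {
            grouped[i] = "(" + patterns[i] + ")";
        }
        return string.Join("|", grouped);
    }
}
```

For example, `PatternCombiner.Combine(@"<a:multistatus [^>]*>", @"</response>")` returns `(<a:multistatus [^>]*>)|(</response>)`, so group 1 belongs to the first sub-expression and group 2 to the second.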

The trick was to get the MatchEvaluator delegate to recognize which of the
sub-expressions formed the particular Match being passed to it. For this, I
used named groups. The class has a private string array of replacement
strings which is passed along with the Regular Expression to the
Constructor. The replacement string array must match the number of
sub-expressions in the Regular Expression, and in the same order.

The Replace Method of the class loops through the array of Groups in the
Match, which is the same as the array of Groups in the Regular Expression.
It ignores the first group in the array, as that is always the match itself.
It looks for the Group with the Success property as true, and returns the
replacement string in the array that corresponds to that position in the
Groups collection.

I tested this, and it is bug-free. The class definition follows, followed by
an example:

using System.Text.RegularExpressions;

/// <summary>
/// Replaces multiple Regular Expressions in a string with multiple replacement
/// strings without having to perform separate replacements.
/// </summary>
/// <remarks>The <c>System.Text.RegularExpressions.Regex.Replace(string, MatchEvaluator)</c>
/// overloads replace a single Regular Expression in a string. To replace many
/// Regular Expressions in a string would entail creating many strings. This class
/// enables the replacement to be done with a single string returned.</remarks>
public class MultiReplacer
{
    private string[] Replacers;
    private Regex r;

    // Called once per match; returns the replacement for whichever group matched.
    private string Replacer(Match m)
    {
        // Groups[0] is the whole match, so start at 1.
        for (int i = 1; i < m.Groups.Count; i++)
        {
            if (m.Groups[i].Success)
                return Replacers[i - 1];
        }
        return "";
    }

    /// <summary>
    /// Replaces all groups matching the expression initializer variable in the input
    /// string with the matching replacement string from the array of replacement
    /// strings passed in the initializer.
    /// </summary>
    /// <param name="input">String to evaluate.</param>
    /// <returns>The fully-replaced string.</returns>
    public string Replace(string input)
    {
        MatchEvaluator meval = new MatchEvaluator(Replacer);
        return r.Replace(input, meval);
    }

    /// <summary>
    /// Constructor.
    /// </summary>
    /// <param name="expression">Regular Expression String.</param>
    /// <param name="replacers">Array of replacement strings.</param>
    /// <remarks>The <paramref name="expression"/> parameter must be a Regular Expression using
    /// named groups, combined with "|" to match any of the groups in the
    /// <paramref name="expression"/>. The <paramref name="replacers"/> array
    /// must have the same number of elements as the number of named groups in the
    /// <paramref name="expression"/>, and in the same order.</remarks>
    public MultiReplacer(string expression, string[] replacers)
    {
        r = new Regex(expression);
        Replacers = replacers;
    }
}

example (from my test form):

private void btnMultiReplace_Click(object sender, EventArgs e)
{
    // One group per sub-expression, in the same order as the replacement strings.
    // The pattern must stay on one line: a line break inside a verbatim string
    // would become part of the pattern.
    string s = @"(?<multistatus><a\:multistatus [^>]*\>)|(?<endresponse>\<\/response\>)";
    string[] replacers = new string[] { @"<a:multistatus>", @"</nsResp:response>" };
    MultiReplacer replacer = new MultiReplacer(s, replacers);

    // Calls a method that adds the text to the same TextBox
    SetTextMessage(replacer.Replace(txtMessage.Text), false);
}

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.

Jan 26 '06 #2
I have to correct myself. The code is fine, but my explanation has an error
in it. I *initially* thought of using named groups, but as I continued to
optimize the code, I realized that I didn't need to use named groups, as the
position of the group in the GroupNames array would correspond to the
position in the replacement array of strings, to identify the replacement
string to use. You'll notice the MatchEvaluator delegate doesn't use the
group names, but only iterates through the Collection.
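
One caveat about relying on position alone: .NET numbers unnamed capturing groups first and named groups after them, and a nested capturing group takes a number of its own. The sketch below (the `GroupOrderDemo` class is a hypothetical helper, not part of the posted code) reports which group number succeeded, which makes it easy to check that the numbering lines up with the replacement array.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupOrderDemo
{
    // Returns the number of the first successful capturing group (1-based),
    // or 0 if the input does not match at all.
    public static int FirstSuccessfulGroup(Regex regex, string input)
    {
        Match m = regex.Match(input);
        if (!m.Success) return 0;
        for (int i = 1; i < m.Groups.Count; i++)
        {
            if (m.Groups[i].Success) return i;
        }
        return 0;
    }
}
```

With `new Regex("(a)|(?<x>b)|(c)")`, the input "b" yields 3 and "c" yields 2, because the named group is numbered last. With `new Regex("(a(b))|(c)")`, "c" yields 3 because the inner group takes number 2. Naming all of the top-level groups (or none of them) and writing inner groups as `(?:...)` keeps the positions aligned.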

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.

Jan 26 '06 #3
On 26 Jan 2006 06:40:13 -0800, "Billa" <Bi********@gmail.com> wrote:
> Hi,
> I am replacing a big string using different regular expressions (see
> some example at the end of the message). The problem is whenever I
> apply a "replace" it makes a new copy of string and I want to avoid
> that. My question here is if there is a way to pass either a memory
> stream or array of "find", "replace" expressions or any other way to
> avoid multiple copies of a string.

In .NET the String class is immutable, it cannot be changed. When you
make any change to a string a new copy is created leaving the old copy
behind. As you have noticed, multiple changes mean multiple copies.

If you are going to be making a lot of changes to a string then use
the StringBuilder class instead. A StringBuilder is mutable so the
changes are made in the object itself, hence no extra copies.

For best efficiency make sure that the StringBuilder is long enough to
hold the maximum size text that you want to put in it. This saves
reallocating memory when the StringBuilder expands.
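
Note that StringBuilder.Replace only accepts literal strings or characters, so it cannot take a pattern directly. As a minimal sketch of how the two could be combined, the hypothetical `BuilderReplace.ReplaceInto` helper below walks the matches once and appends the pieces to a pre-sized StringBuilder; using `input.Length` as the initial capacity is only a guess at the final size.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class BuilderReplace
{
    // Copies the text between matches unchanged and appends the evaluator's
    // result in place of each match, all into a single StringBuilder.
    public static string ReplaceInto(string input, Regex regex, MatchEvaluator evaluator)
    {
        // Pre-size the builder so it rarely has to grow.
        StringBuilder sb = new StringBuilder(input.Length);
        int last = 0;
        for (Match m = regex.Match(input); m.Success; m = m.NextMatch())
        {
            sb.Append(input, last, m.Index - last); // untouched text before this match
            sb.Append(evaluator(m));                 // replacement for this match
            last = m.Index + m.Length;
        }
        sb.Append(input, last, input.Length - last); // tail after the last match
        return sb.ToString();
    }
}
```

For example, `BuilderReplace.ReplaceInto("a1b22c", new Regex(@"\d+"), m => "#")` returns "a#b#c". The final ToString() call still allocates one result string, so this saves the intermediate copies rather than every copy.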

rossum

--

The ultimate truth is that there is no ultimate truth
Jan 26 '06 #4
Thanks a lot Kevin!
I'll look into it and will come back to you if I have more questions. I
truly appreciate your time.

Jan 27 '06 #5
Thanks rossum!
Can you please tell me how I can use StringBuilder with regular expressions?

Jan 27 '06 #6
I don't believe he understood the nature of your question.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.

Jan 27 '06 #7

Your posted solution helped me to better understand how the replace
function works. Care to ponder another one? I need to parse street
addresses and break them into individual components. I have successfully
done this using the following regular expression written by John
Sample.

^(?'number'\d+)? (\s+)? (?# Optional House/Place number)
(?'dirp'NORTH|SOUTH|EAST|WEST|N|S|E|W|NE|NW|SE|SW|NORTHEAST|NORTHWEST|SOUTHEAST|SOUTHWEST)?(\s+)? (?# Dirp is optional)
(?'street'\b\w[\w ]+\b) (?# Street Name - required)
\b(?'streetType'ALY|ARC|AVE|BLVD|BR|BRG|BYP|CIR|CRES|CSWY|CT|CTR|CV|DR|EXPY|FMRD|FWY|GRD|HWY|LN|LOOP|MAL|MTWY|OVPS|PASS|PATH|PIKE|PKY|PL|PLZ|RAMP|RD|RMRD|ROW|RTE|RUE|RUN|SKWY|SPUR|SQ|ST|TER|TFWY|THFR|THWY|TPKE|TRCE|TRL|TUNL|WALK|WAY|WKWY|XING|STREET|DRIVE|ROAD|AVENUE|FREEWAY|PARKWAY|HIGHWAY|BOULEVARD|BYPASS|TURNPIKE|TRAIL|SQUARE)\b\.?(\s*)
((?'dirs'NORTH|SOUTH|EAST|WEST|N|S|E|W|NE|NW|SE|SW|NORTHEAST|NORTHWEST|SOUTHEAST|SOUTHWEST)\b)?
(,?(?'city'[\w ]{2,}\b),?\s(?'state'[A-Z]{2}))? (?# entire section is optional, but the pieces, if they exist, are not)
([^\r\n\w])?,?\s?(?'zip'\d{5})?

As you can see, it contains several named groups that break the street
address into components. For example, "1000 North Thomas Jefferson St
NW, Washington, DC 20007" is broken down into:

number:*1000*
dirp:*North*
street:*Thomas Jefferson*
streetType:*St*
dirs:*NW*
city:*Washington*
state:*DC*
zip:*20007*

Now I need to normalize the dirp, streetType, and dirs groups. Using
the example above, dirp should be normalized to "N". Of course, I can
do this easily enough with the C# String.Replace method, but I was hoping
there was a way to use Regex.Replace instead. Thoughts?

- Brian
--
Brian Hardwick

Jan 31 '06 #8
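
One way the normalization might be approached with Regex.Replace is a MatchEvaluator that looks each match up in a table, in the same spirit as the MultiReplacer class earlier in the thread. This is only a sketch under stated assumptions: the `DirectionNormalizer` class and its lookup table are hypothetical, and only the direction words are covered.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DirectionNormalizer
{
    // Hypothetical lookup table; street types could be handled the same way.
    private static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "NORTH", "N" }, { "SOUTH", "S" }, { "EAST", "E" }, { "WEST", "W" },
            { "NORTHEAST", "NE" }, { "NORTHWEST", "NW" },
            { "SOUTHEAST", "SE" }, { "SOUTHWEST", "SW" }
        };

    // The \b anchors keep NORTH from matching inside NORTHEAST.
    private static readonly Regex Directions = new Regex(
        @"\b(NORTH|SOUTH|EAST|WEST|NORTHEAST|NORTHWEST|SOUTHEAST|SOUTHWEST)\b",
        RegexOptions.IgnoreCase);

    // Replaces every spelled-out direction word with its abbreviation.
    public static string Normalize(string value)
    {
        return Directions.Replace(value, m => Abbreviations[m.Value]);
    }
}
```

Applied to a single captured component, for example `DirectionNormalizer.Normalize(match.Groups["dirp"].Value)`, this turns "North" into "N" and leaves "NW" alone. Running it over the whole address would also abbreviate a street that happens to be named "North", so it is probably safer to normalize the captured groups one at a time.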
