any regex gurus out there?

bill tie

I'd appreciate it if you could advise.

1. How do I replace "\" (backslash) with anything?

2. Suppose I want to replace

(a) every occurrence of characters "a", "b", "c", "d" with "x",
(b) every occurrence of characters "p", "q", "r", "s" with "y".

Right now, I do it as follows:

myString = Regex.Replace( myString, "[abcd]", "x" );
myString = Regex.Replace( myString, "[pqrs]", "y" );

Is there a better way? Can the replacing be performed in one fell swoop?

Nov 16 '05 #1

Niki Estner

"bill tie" <bi*****@discussions.microsoft.com> wrote in
news:7F**********************************@microsoft.com...

> I'd appreciate it if you could advise.
>
> 1. How do I replace "\" (backslash) with anything?

Do you mean like this:

myString = Regex.Replace( myString, @"\\", "anything" );

> 2. Suppose I want to replace
>
> (a) every occurrence of characters "a", "b", "c", "d" with "x",
> (b) every occurrence of characters "p", "q", "r", "s" with "y".
>
> Right now, I do it as follows:
>
> myString = Regex.Replace( myString, "[abcd]", "x" );
> myString = Regex.Replace( myString, "[pqrs]", "y" );
>
> Is there a better way? Can the replacing be performed in one fell swoop?

You can do that with a MatchEvaluator, like this:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main()
    {
        string myString = "abcdefgpqrst";
        // Group 1 captures only when one of a, b, c, d matched.
        myString = Regex.Replace( myString, "([abcd])|[pqrs]",
            new MatchEvaluator( MyEvaluator ) );
        Console.WriteLine( myString );
    }

    // Called once per match; returns the replacement text for that match.
    static string MyEvaluator( Match m )
    {
        if (m.Groups[1].Length != 0)
            return "x";
        else
            return "y";
    }
}

However, I personally would prefer the 2-pass version for a simple problem
like this.

Niki

Nov 16 '05 #2

Niki Estner

"bill tie" <bi*****@discussions.microsoft.com> wrote in
news:D5**********************************@microsoft.com...

> Niki,
>
> Thank you. Your examples were useful.
>
> 1. I would never know about the mysterious at-sign. Searching for it in
> 1GB of the MSDN Library on my desktop was utterly futile. Only after I
> consulted Jesse Liberty's book, did I find VERBATIM.
>
> But this verbatim may actually raise another problem:
>
> I'm reading an input string entered by a user in the browser, and shaving
> undesired characters such as a backslash. If I "make" the string
> verbatim,

"Verbatimness" only applies to string literals: "\\n" is exactly the same
string constant as @"\n" - it's just that in the first case, the compiler
treats the backslash as an escape character. If you read strings from some
other source you can think of any string as "verbatim".

> I'm afraid -- I haven't tried yet -- I won't be able to remove whitespace
> ( \n \r \t ) easily like so:
>
> myString = Regex.Replace( myString, "[\n\r\t]+", "" );

You should make the search pattern verbatim, like this:

myString = Regex.Replace( myString, @"[\n\r\t]+", "" );

I haven't tried either, but I'm pretty sure this should work.

The backslash is normally used by the regex for regex metacharacters like \w
or \d. If you want to have a 'real' backslash in your pattern, you have to
escape it with another backslash (same as in C strings). If you use a string
constant as a pattern, write it as @"\\" or "\\\\". If you need to
escape a whole string, you can also use Regex.Escape, which escapes all
metacharacters in a string with backslashes. So Regex.Match(someString,
Regex.Escape("[a-z]\n")) matches (literally) "[a-z]\n", i.e. an opening
bracket, followed by an 'a', followed by a dash...

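To make the difference between the C# literal and the regex pattern concrete, here is a minimal sketch. The class and method names (EscapeDemo, StripBackslashes, RemoveLiteral) are made up for illustration; only Regex.Replace and Regex.Escape come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // The regex pattern is the two characters \\ (one escaped backslash).
    public static string StripBackslashes(string input)
    {
        return Regex.Replace(input, @"\\", "");
    }

    // Regex.Escape turns 'literal' into a pattern that matches it as plain text.
    public static string RemoveLiteral(string input, string literal)
    {
        return Regex.Replace(input, Regex.Escape(literal), "");
    }

    static void Main()
    {
        // Both literals denote the same two-character string.
        Console.WriteLine("\\\\" == @"\\");                    // True
        Console.WriteLine(StripBackslashes(@"C:\temp\x"));      // C:tempx
        Console.WriteLine(RemoveLiteral("x[a-z]y", "[a-z]"));   // xy
    }
}
```

Strings that arrive at run time (user input, files) never go through the compiler's escape processing, so only the regex-level escaping applies to them.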
> 2. You proposed a matching algorithm. I have fifteen patterns to search.
> The matching algorithm seems more efficient than making fifteen passes
> over a string of 100 or so characters.

As usual: trust no one but your own benchmark if performance is important.
Having said that: the problem with the MatchEvaluator technique is that
calling delegates is quite slow, so 15 compiled regexes might well be faster
than one MatchEvaluator pass.
And for 100-character strings I'd definitely favour readability over
performance. The time it takes to process such short strings will hardly be
measurable.

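A rough way to run such a benchmark yourself is sketched below. It assumes the two-class example from earlier in the thread; the names ReplaceBenchmark, TwoPass, OnePass and Time are hypothetical, and the iteration count is arbitrary. Timings depend on machine and runtime, so none are shown.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class ReplaceBenchmark
{
    static readonly Regex Abcd = new Regex("[abcd]", RegexOptions.Compiled);
    static readonly Regex Pqrs = new Regex("[pqrs]", RegexOptions.Compiled);
    static readonly Regex Both = new Regex("([abcd])|[pqrs]", RegexOptions.Compiled);

    // Two simple passes, one per character class.
    public static string TwoPass(string s)
    {
        return Pqrs.Replace(Abcd.Replace(s, "x"), "y");
    }

    // One pass; the delegate decides the replacement for each match.
    public static string OnePass(string s)
    {
        return Both.Replace(s, m => m.Groups[1].Success ? "x" : "y");
    }

    // Runs f repeatedly and returns the elapsed milliseconds.
    static long Time(Func<string, string> f, string input, int iterations)
    {
        f(input); // warm-up, so regex compilation is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            f(input);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        string input = "abcdefgpqrst";
        Console.WriteLine("two-pass: {0} ms", Time(TwoPass, input, 100000));
        Console.WriteLine("one-pass: {0} ms", Time(OnePass, input, 100000));
    }
}
```

The two methods only give identical results because the replacement characters ("x", "y") do not themselves appear in a later character class; with overlapping sets the two-pass version would re-replace its own output.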
> Let me make sure I understand. Suppose I'd like to replace
>
> [abcd] with "z"
> [pqrs] with "y"
> [123] with "x"
>
> Is this what it should look like?
>
> string myString = "abcd efg pqrs t 1 k 2 m 3 n";
> myString = Regex.Replace( myString, "(([abcd])|[pqrs])|[123]",
>     new MatchEvaluator( MyEvaluator ) );
>
> static string MyEvaluator( Match m )
> {
>     if (m.Groups[2].Length != 0)
>         return "z";
>     if (m.Groups[1].Length != 0)
>         return "y";
>     if (m.Groups[0].Length != 0)
>         return "x";
>
>     return null;
> }

Yes, exactly.

> If I were undaunted by the growing number of parentheses in
>
> "(([abcd])|[pqrs])|[123]"
>
> is there a way to obtain the index of the Groups collection?
>
> Could I do something like
>
> if ( m.Groups[i].Length != 0 )
> {
>     switch (i)
>     {
>         case 0: ...
>     }
> }

You could of course use a for loop over all groups. Is that what you meant?

Niki

Nov 16 '05 #3

bill tie

Niki,

Thank you for good explanations.

[1]

> If you want to have a 'real' backslash in your pattern,
> you have to escape it - with a backslash

This is precisely what I couldn't achieve without the at-sign.

myString = "foo & ! ( \ * bar";
// after we clean up we should have only "foo spaces bar"
myString = Regex.Replace( myString, "[\\\!&*(]", "" );

No matter what number or combination of backslashes I used it didn't work.

The point of this exercise is I want to remove everything except:
letters a through z, A through Z
numerals 0 through 9
hyphen -
space " "

I thought I could do it by negation as follows:

myString = Regex.Replace( myString, "[^0-9a-zA-Z ]", "" );

This would be fantastic. There seem to be two problems:

(a) Regex considers some characters, e.g. parentheses, as letters.

(b) Letters the user enters may not be ordinary English letters. In the
French language, the letter A may have three different accents. I'm not sure
how the pattern [a-z] treats these letters.

On this point, do you have any suggestions?

[2]

> The problem with the MatchEvaluator-technique is that
> calling delegates is quite slow, so 15 compiled regex's
> might well be faster than one MatchEvaluator-Pass.

I was wondering, too. Yet, some compilers optimize code. I'm terribly new
to C#. I don't know whether it collapses my "calls" into some sort of loop
at compile-time.

> I'd definitely favour readability over performance.

I concur the readability of the matching algorithm is awful.

Albeit I'm more inclined to apply the classic technique I started with, I'd
like to complete this intellectual exercise. Understanding how matching and
grouping work may be useful in other situations.

Using my last example:

Replace

[abcd] with "z"
[pqrs] with "y"
[123] with "x"

string myString = "abcd efg pqrs t 1 k 2 m 3 n";
myString = Regex.Replace( myString, "(([abcd])|[pqrs])|[123]",
    new MatchEvaluator( MyEvaluator ) );

static string MyEvaluator( Match m )
{
    if (m.Groups[2].Length != 0)
        return "z";
    if (m.Groups[1].Length != 0)
        return "y";
    if (m.Groups[0].Length != 0)
        return "x";

    return null;
}

is there a way to obtain the index of the Groups collection?

Could I do something like

if ( m.Groups[i].Length != 0 )
{
    switch (i)
    {
        case 0: ...
    }
}

> You could of course use a for loop over all groups. Is that
> what you meant?

I'm not sure what I meant. ;-)

Slavishly, I tried to apply examples I found in the MSDN Library.

My code looked something like this:

static string MyEvaluator( Match m )
{
    for ( int i = 0; i < m.Groups.Count; i++ )
    {
        if (m.Groups[i].Length != 0)
        {
            switch (i)
            {
                case 0:
                    return "x";
                case 1:
                    return "y";
                case 2:
                    return "z";
            }
        }
        return null;
    }
    return null;
}

This didn't do the job I wanted.

Any ideas?

Thank you.

Nov 16 '05 #4

Niki Estner

"bill tie" <bi*****@discussions.microsoft.com> wrote in
news:DD**********************************@microsoft.com...

> ...
> myString = "foo & ! ( \ * bar";
> // after we clean up we should have only "foo spaces bar"
> myString = Regex.Replace( myString, "[\\\!&*(]", "" );
>
> No matter what number or combination of backslashes I used it didn't work.

I assume the first string constant really is myString = "foo & ! ( \\ * bar";
the compiler wouldn't accept it otherwise.

The pattern you want to pass to the regex class looks like this: [\\!&*(]
You can either write it this way:

myString = Regex.Replace( myString, @"[\\!&*(]", "" );

or escape each backslash with a backslash, like this:

myString = Regex.Replace( myString, "[\\\\!&*(]", "" );

Note that this is exactly the same code, it's just a different way to write
the string constant.
Also, this has nothing to do with regular expressions: if you want
to code a UNC path in your code like \\SomeMachine\SomeFolder, you'd either
write @"\\SomeMachine\SomeFolder", or "\\\\SomeMachine\\SomeFolder".

> The point of this exercise is I want to remove everything except:
> letters a through z, A through Z
> numerals 0 through 9
> hyphen -
> space " "
>
> I thought I could do it by negation as follows:
>
> myString = Regex.Replace( myString, "[^0-9a-zA-Z ]", "" );
>
> This would be fantastic. There seem to be two problems:
>
> (a) Regex considers some characters, e.g. parentheses, as letters.

I don't think so. Maybe you have a sample?

> (b) Letters the user enters may not be ordinary English letters. In the
> French language, the letter A may have three different accents. I'm not
> sure how the pattern [a-z] treats these letters.

They're not included in this set. You'll have to use Unicode character
classes for that problem:

myString = Regex.Replace( myString, @"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd} ]", "" );

(As usual, if you want a non-verbatim string, replace all \ characters with
\\)

\p matches a Unicode character class:
\p{Ll} matches "Letter, lowercase", no matter what language
\p{Lu} matches "Letter, uppercase", no matter what language
\p{Lt} matches "Letter, titlecase", no matter what language
\p{Lo} matches "Letter, other", no matter what language
\p{Nd} matches "Number, decimal digit"

A list of all Unicode character classes can be found here:
http://msdn.microsoft.com/library/de...classtopic.asp

Note that if you also want to include underscores in this set you can use the
shortcut '\w' (or its negation, '\W').

myString = Regex.Replace( myString, @"[^\w ]", "" );

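Neither pattern above keeps the hyphen that the original requirement listed. A minimal sketch that does, assuming the combined category \p{L} ("any letter") is acceptable in place of the individual letter classes; the names Sanitizer and Clean are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class Sanitizer
{
    // Keeps letters of any language, decimal digits, spaces and hyphens.
    // The hyphen comes last in the class so it is taken literally.
    public static string Clean(string input)
    {
        return Regex.Replace(input, @"[^\p{L}\p{Nd} -]", "");
    }

    static void Main()
    {
        Console.WriteLine(Clean("a-b c"));   // a-b c
        Console.WriteLine(Clean("(1+2)"));   // 12
        Console.WriteLine(Clean("x_y"));     // xy
    }
}
```

Unlike the \w shortcut, this class drops underscores, which matches the list of wanted characters given in the question.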
> [2]
>> The problem with the MatchEvaluator-technique is that
>> calling delegates is quite slow, so 15 compiled regex's
>> might well be faster than one MatchEvaluator-Pass.
>
> I was wondering, too. Yet, some compilers optimize code. I'm terribly
> new to C#. I don't know whether it collapses my "calls" into some sort of
> loop at compile-time.

There's a good article on .NET optimizations:
http://msdn.microsoft.com/library/de...anagedcode.asp
As you can see there, delegates are a lot slower than usual function calls.

>> I'd definitely favour readability over performance.
>
> I concur the readability of the matching algorithm is awful.

Only if you don't know regular expressions. (And that probably applies to
every kind of code.)
To use one of your previous examples:

myString = Regex.Replace( myString, "[^0-9a-zA-Z ]", "" );

I think readability of this line is great: think about what the
corresponding C# code would look like; I'm pretty sure it would be a lot
more complex, and harder to read.
And you can always use tools like Expresso that make working with regexes
really easy.

So, I would disagree on that point.

> Albeit I'm more inclined to apply the classic technique I started with,
> I'd like to complete this intellectual exercise. Understanding how matching
> and grouping work may be useful in other situations.
>
> Using my last example:
>
> Replace
>
> [abcd] with "z"
> [pqrs] with "y"
> [123] with "x"
>
> string myString = "abcd efg pqrs t 1 k 2 m 3 n";
> myString = Regex.Replace( myString, "(([abcd])|[pqrs])|[123]",
>     new MatchEvaluator( MyEvaluator ) );

The parentheses are wrong in this pattern. Use this one:

myString = Regex.Replace( myString, "([abcd])|([pqrs])|([123])",
    new MatchEvaluator( MyEvaluator ) );

> ...
> My code looked something like this:
>
> static string MyEvaluator( Match m )
> {
>     for ( int i = 0; i < m.Groups.Count; i++ )
>     {
>         if (m.Groups[i].Length != 0)
>         {
>             switch (i)
>             {
>                 case 0:
>                     return "x";
>                 case 1:
>                     return "y";
>                 case 2:
>                     return "z";
>             }
>         }
>         return null;
>     }
>     return null;
> }
>
> This didn't do the job I wanted.

Yes, there's a twist here: Groups[0] always refers to the whole match;
Groups[1] returns the contents of the first parenthesis, and so on.
Try this one:

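static string MyEvaluator( Match m )
{
    // Start at 1: Groups[0] is the whole match and is never empty.
    for ( int i = 1; i < m.Groups.Count; i++ )
    {
        if (m.Groups[i].Length != 0)
        {
            // Group numbers follow the pattern ([abcd])|([pqrs])|([123]).
            switch (i)
            {
                case 1:
                    return "z";
                case 2:
                    return "y";
                case 3:
                    return "x";
            }
        }
    }
    // Not reached for this pattern; leave the text unchanged just in case.
    return m.Value;
}
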
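With fifteen patterns the switch gets long. One possible variation, sketched here under the assumption that each alternative has its own capture group, keeps the replacements in an array indexed by group number. The names MultiReplace, Replacements and Apply are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class MultiReplace
{
    // One capture group per alternative, numbered 1, 2, 3 from the left.
    static readonly Regex Pattern = new Regex("([abcd])|([pqrs])|([123])");

    // Replacements[i] belongs to group i; slot 0 is unused because
    // Groups[0] is the whole match.
    static readonly string[] Replacements = { null, "z", "y", "x" };

    static string Evaluate(Match m)
    {
        for (int i = 1; i < m.Groups.Count; i++)
        {
            if (m.Groups[i].Success)
            {
                return Replacements[i];
            }
        }
        return m.Value; // no group matched: keep the text as it is
    }

    public static string Apply(string input)
    {
        return Pattern.Replace(input, new MatchEvaluator(Evaluate));
    }

    static void Main()
    {
        Console.WriteLine(Apply("abcd efg pqrs t 1 k 2 m 3 n"));
        // zzzz efg yyyy t x k x m x n
    }
}
```

Adding a sixteenth pattern then means one more group in the pattern and one more entry in the array, with no change to the evaluator.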
Niki

Nov 16 '05 #5

Niki Estner

"bill tie" <bi*****@discussions.microsoft.com> wrote in
news:43**********************************@microsoft.com...

> ...
> I think I'm getting where I want to be. There seems to be one last thing.
>
> I need to replace accented letters with ordinary ASCII letters. In pseudo
> code, it looks as follows:
>
> replace ( myString, "[various and devious a's]", "a" )
>
> In the "[various and devious a's]" part I pasted letters I had copied from
> a source on the web. I saved and closed my file. After I re-opened the
> file some a's lost their accents.
>
> What do you suggest I do?

Usually .cs files are stored as ASCII files; I never tried this, but I think
you can tell Visual Studio to save your file in Unicode format via File -
Advanced Save Options. If this doesn't work, you can still copy your
character into Windows' Character Map and use its hex representation in
code, like '\x0041' for 'A'.
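As a small illustration of the escape approach, the sketch below writes the accented lowercase a's as \u escapes, so the source file itself stays plain ASCII. The range U+00E0 to U+00E5 (a with grave, acute, circumflex, tilde, diaeresis and ring) is an assumption about which characters are wanted; AccentDemo and FlattenA are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

static class AccentDemo
{
    // The compiler turns each \uXXXX escape into the actual character,
    // so the regex sees an ordinary character-class range.
    public static string FlattenA(string input)
    {
        return Regex.Replace(input, "[\u00E0-\u00E5]", "a");
    }

    static void Main()
    {
        // \u00E2 is a with circumflex.
        Console.WriteLine(FlattenA("\u00E2ge"));
    }
}
```

Uppercase variants and other base letters would each need their own range and replacement in the same style.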

Niki

Nov 16 '05 #6

bill tie

Niki,

Thank you. Your examples were useful.

1. I would never know about the mysterious at-sign. Searching for it in 1GB
of the MSDN Library on my desktop was utterly futile. Only after I consulted
Jesse Liberty's book, did I find VERBATIM.

But this verbatim may actually raise another problem:

I'm reading an input string entered by a user in the browser, and shaving
undesired characters such as a backslash. If I "make" the string verbatim,
I'm afraid -- I haven't tried yet -- I won't be able to remove whitespace (
\n \r \t ) easily like so:

myString = Regex.Replace( myString, "[\n\r\t]+", "" );

2. You proposed a matching algorithm. I have fifteen patterns to search.
The matching algorithm seems more efficient than making fifteen passes over a
string of 100 or so characters.

Let me make sure I understand. Suppose I'd like to replace

[abcd] with "z"
[pqrs] with "y"
[123] with "x"

Is this what it should look like?

string myString = "abcd efg pqrs t 1 k 2 m 3 n";
myString = Regex.Replace( myString, "(([abcd])|[pqrs])|[123]",
    new MatchEvaluator( MyEvaluator ) );

static string MyEvaluator( Match m )
{
    if (m.Groups[2].Length != 0)
        return "z";
    if (m.Groups[1].Length != 0)
        return "y";
    if (m.Groups[0].Length != 0)
        return "x";

    return null;
}

If I were undaunted by the growing number of parentheses in

"(([abcd])|[pqrs])|[123]"

is there a way to obtain the index of the Groups collection?

Could I do something like

if ( m.Groups[i].Length != 0 )
{
    switch (i)
    {
        case 0: ...
    }
}

Thank you again.

Nov 16 '05 #7

bill tie

Niki,

Thank you for your reply.

[1]

>> (a) Regex considers some characters, e.g. parentheses, as letters.
> I don't think so. Maybe you have a sample?

I stand corrected albeit I would've sworn I saw it. I ran a few new tests.

>> French language, the letter A may have three different accents. I'm not
>> sure how the pattern [a-z] treats these letters.
>
> They're not included in this set. You'll have to use unicode character
> classes for that problem:
> myString = Regex.Replace( myString, @"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd} ]", "" );

It's a good thing you wrote this. I checked "Character Classes" in the MSDN
Library. Lo and behold, my problem is solved.

myString = Regex.Replace( myString, "[^\\w\\d ]", "" );

removes everything except letters, digits, spaces, and underscores. Nice, very nice.

[2]

> The parentheses are wrong in this pattern. Use this one:
> myString = Regex.Replace( myString, "([abcd])|([pqrs])|([123])",
>     new MatchEvaluator( MyEvaluator ) );

This

"([abcd])|([pqrs])|([123])"

is much easier on the eyes than my

"(([abcd])|[pqrs])|[123]"

[3]
I think I'm getting where I want to be. There seems to be one last thing.

I need to replace accented letters with ordinary ASCII letters. In pseudo
code, it looks as follows:

replace ( myString, "[various and devious a's]", "a" )

In the "[various and devious a's]" part I pasted letters I had copied from a
source on the web. I saved and closed my file. After I re-opened the file
some a's lost their accents.

What do you suggest I do?

Nov 16 '05 #8
