Regex: Finding strings in a source file

Bob
I need to create a Regex to extract all strings (including quotations) from
a C# or C++ source file. After being unsuccessful myself, I found this
sample on the internet:

@"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'"

I am inputting the entire source file string and using it with
RegexOptions.Singleline. This works OK unless the string ends with a
back-slash. For example: "This is a test\\". Can anybody see how to fix
this sample so that back-slashes are considered?

Thanks
Mar 21 '06 #1
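
A note on why the sample trips over a trailing back-slash: `(?!\\).` only inspects the single character directly before the closing quote, so it cannot tell an escaped quote (`\"`) from an escaped back-slash followed by the real closing quote (`\\"`). A common alternative is to consume escape sequences as pairs. The following is a minimal sketch under that assumption, not a drop-in replacement; the class and member names are made up for illustration, and it covers only regular double-quoted literals.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EscapeAwareSketch
{
    // The pattern from the question, unchanged.
    public static readonly Regex Original = new Regex(
        @"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'",
        RegexOptions.Singleline);

    // Hypothetical alternative: inside the quotes, take either a back-slash
    // together with the character it escapes, or one character that is
    // neither a quote nor a back-slash.
    public static readonly Regex EscapeAware = new Regex(
        @"""(?:\\.|[^""\\])*""",
        RegexOptions.Singleline);

    // Returns the text of every match, in order of appearance.
    public static List<string> Find(Regex regex, string source)
    {
        var found = new List<string>();
        foreach (Match m in regex.Matches(source))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

For the source text `x = "This is a test\\"; y = "ok";`, `EscapeAware` reports the two literals separately, while `Original` produces one match that runs from the first quote up to the opening quote of the second literal.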
8 Replies


"Bob" <no****@nowhere.com> wrote:
I need to create a Regex to extract all strings (including
quotations) from a C# or C++ source file.


Well, it's not possible. You'd need a complete C# parser to extract
strings in a foolproof way. Here's one of the simpler examples that
can't be distinguished from a real string using a regex alone:

// "I am a comment but I look like a string"

Eq.
Mar 21 '06 #2

Bob
Nope, it very well is possible...

Regex codeRegex = new Regex(
    @"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')",
    RegexOptions.Singleline);

String result = codeRegex.Replace(input, new MatchEvaluator(MatchEval));

public String MatchEval(Match match)
{
    if (match.Groups[1].Success) { } // comment
    if (match.Groups[2].Success) { } // string literal
    ...
}

Back to my original question, if anybody knows why the regex isn't correctly
watching for back-slashes followed by a quotation, any input is appreciated.
Mar 21 '06 #3
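
The comment-first idea from the post above can be tried on the counterexample from the earlier reply. This is a minimal sketch, not the poster's code: the class and method names are hypothetical, the string branch is a simplified escape-aware pattern for regular double-quoted literals only, and `//[^\r\n]*` stands in for the lookahead form used in the thread. Because the regex engine always takes the leftmost match, a comment that starts before a quote swallows the quote, and a literal that starts before a `//` swallows the slashes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentFirstSketch
{
    // Simplified pieces, assumed for illustration.
    private const string CommentPart = @"/\*.*?\*/|//[^\r\n]*";
    private const string StringPart = @"""(?:\\.|[^""\\])*""";

    // Looks for quoted text only, so text inside comments is reported too.
    public static List<string> NaiveStrings(string source)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(source, StringPart, RegexOptions.Singleline))
        {
            found.Add(m.Value);
        }
        return found;
    }

    // Matches comments in group 1 and literals in group 2, then keeps
    // only the matches where group 2 took part.
    public static List<string> StringsOutsideComments(string source)
    {
        var regex = new Regex(
            "(" + CommentPart + ")|(" + StringPart + ")",
            RegexOptions.Singleline);
        var found = new List<string>();
        foreach (Match m in regex.Matches(source))
        {
            if (m.Groups[2].Success)
            {
                found.Add(m.Groups[2].Value);
            }
        }
        return found;
    }
}
```

Given a source consisting of the comment line `// "I am a comment but I look like a string"` followed by `string s = "real";`, `NaiveStrings` returns both quoted spans, while `StringsOutsideComments` returns only `"real"`. Preprocessor directives and other constructs are not considered here.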

Without examples of desired behaviour, here's what I came up with, using backreferences:

Regex regex = new Regex(@"(([""']).+\<2>)",
(RegexOptions) 0);

Sample input:
"This is a test\\"
This is also a test
Here's another "test"
'Now for another\\'
Using 'single quotes'
// Here 's a comment.
// And a "quoted" one.

Sample output:
Matching: "This is a test\\"
1 ="This is a test\\"=
2 ="=

Matching: This is also a test
No Match

Matching: Here's another "test"
1 ="test"=
2 ="=

Matching: 'Now for another\\'
1 ='Now for another\\'=
2 ='=

Matching: Using 'single quotes'
1 ='single quotes'=
2 ='=

Matching: // Here 's a comment.
No Match

Matching: // And a "quoted" one.
1 ="quoted"=
2 ="=

You'd want the group 1....

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
Mar 22 '06 #4
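
One thing the sample lines above do not exercise is two literals on the same line. With a greedy `.+`, the match runs from the first quote to the last matching quote on the line. The sketch below compares the greedy form with a lazy `.+?`. It assumes that `\<2>` in the post acts as a backreference to group 2 and writes it as `\2`; the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedySketch
{
    // Group 1 is the whole literal, group 2 the opening quote character.
    public static readonly Regex Greedy = new Regex(@"(([""']).+\2)");
    public static readonly Regex Lazy = new Regex(@"(([""']).+?\2)");

    // Collects group 1 of every match on a single line of input.
    public static List<string> Group1(Regex regex, string line)
    {
        var found = new List<string>();
        foreach (Match m in regex.Matches(line))
        {
            found.Add(m.Groups[1].Value);
        }
        return found;
    }
}
```

For the line `Foo("a", "b");` the greedy pattern returns the single span `"a", "b"`, and the lazy one returns `"a"` and `"b"`. Neither form handles escaped quotes inside a literal.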

Bob
Your Regex works very well Ken, thanks. Can you explain what exactly the
<2> does? It looks like a grouping construct, but it isn't in the format of
(?<group>.*?). I couldn't find any reference to this at
http://msdn.microsoft.com/library/en...geelements.asp.

Thanks again.
Mar 22 '06 #5

Bob
Also, I prepended your pattern to test for comments first:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(([""']).+\<2>)"

After prefixing the commenting part, comments are picked up but your literal
string part is completely ignored. For example:

Nothing is matched (should have gotten the "C"):
String str = "extern \"C\"\r\n";

The whole line is correctly matched for a comment:
String str = "//extern \"C\"\r\n";

Strangely enough the old pattern did work in this aspect:
@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')"

Unfortunately it fails to correctly end literal strings ending with a
back-slash (unlike yours, which does work).

Thanks

Mar 22 '06 #6

Bob
I figured out what it is... the <2> is a back reference to group number 2,
and my prefixing the comment group shifted the group numbers. I went ahead
and named the group, and now I have this:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?(?<comment>[""']).+?\<comment>)"

The only problem now is that it doesn't take into account escaped quotations
and double quotations when using the @ string literal prefix in C# files.
Mar 22 '06 #7
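
The numbering shift described above can be reproduced in a few lines. This is a minimal sketch, not the poster's code: the class and member names are made up for illustration, the patterns use `\2` and `\k<quote>` (the documented .NET backreference forms) in place of the `\<2>` spelling from the thread, and the lazy `.+?` is used throughout. Once a comment group is prepended, `\2` refers to the group that encloses it, which has not captured anything yet, so the string branch can never succeed. A named group keeps its meaning wherever it ends up.

```csharp
using System.Text.RegularExpressions;

public static class BackreferenceSketch
{
    // Standalone: group 1 is the literal, group 2 the opening quote.
    public const string Standalone = @"((['""]).+?\2)";

    // Comment group prepended: the literal is now group 2 and the quote
    // group 3, but the backreference still says \2.
    public const string Shifted =
        @"(/\*.*?\*/|//[^\r\n]*)|((['""]).+?\2)";

    // Same structure with a named quote group and a named backreference.
    public const string Named =
        @"(/\*.*?\*/|//[^\r\n]*)|((?<quote>['""]).+?\k<quote>)";

    // Returns the first match, or an empty string when nothing matches
    // (none of the patterns above can match an empty span).
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "";
    }
}
```

With the input `extern "C"` followed by a line break, `Standalone` and `Named` both find `"C"`, and `Shifted` finds nothing, which agrees with the behaviour reported earlier in the thread.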

Bob
So here is what I've gotten so far:

@"(/\*.*?\*/|//.*?(?=\r|\n))|((?:@(?<c1>[""'])(?:""""|.)*?\<c1>)|(?:(?<c2>[""'])(?:\\.|.)*?\<c2>))"

I am using non-capturing groups for a specific reason not seen here, just
ignore those.

Anyway, the first part is for comments, the second part is for literal
strings starting with @, the third part is for literal strings with
potential escape characters. Everything seems to work now except for
supporting double-quotation marks in literal strings starting with @. For
example, this input sample:

String str = "before @\"a\"\"b\"\"c\" after \"ok\"";

Captures:
@"a"
"b"
"c"
"ok"

When it should capture:
@"a""b""c"
"ok"

I tested making the capture non-lazy, but then it captures:
@"a""b""c" after "ok"

It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.).*?

If you know why this might be, please share...
Mar 23 '06 #8

Bob wrote:
It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.).*?


I'm out of ideas on this one. Probably something to do with not considering groups/patterns available for backreferencing if they're in an OR statement.
What I'd do is try to simplify the processing -- break your parsing into more than one pass to make the resulting strings more digestible. You might even find that regex isn't the best option -- string functions could wind up being more appropriate.

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
Mar 23 '06 #9
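
A likely explanation for the behaviour in the last two posts is the lazy quantifier rather than the alternation or the backreference. Before each repetition, `*?` first tries to leave the loop, so at the quote after `a` the closing-quote test succeeds and the match ends at `@"a"` before the doubled-quote branch is ever tried. Making the loop greedy overshoots because `.` also matches quotes. Excluding the quote from the single-character branch lets a greedy loop stop in the right place. The following is a minimal sketch under those assumptions; the class and method names are hypothetical, and each literal style gets its own branch so no backreference is needed.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LiteralSketch
{
    private static readonly Regex Pattern = new Regex(
        @"(/\*.*?\*/|//[^\r\n]*)" +    // group 1: block or line comment
        @"|(@""(?:""""|[^""])*""" +    // group 2: verbatim string, "" is a quote
        @"|""(?:\\.|[^""\\])*""" +     //          regular string with \ escapes
        @"|'(?:\\.|[^'\\])*')",        //          character literal
        RegexOptions.Singleline);

    // Returns every string or character literal found outside comments.
    public static List<string> Find(string source)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(source))
        {
            if (m.Groups[2].Success)
            {
                found.Add(m.Groups[2].Value);
            }
        }
        return found;
    }
}
```

For the sample text `before @"a""b""c" after "ok"` this returns `@"a""b""c"` and `"ok"`. The sketch covers only the literal and comment forms discussed in this thread.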