Regex: Finding strings in a source file

Bob
I need to create a Regex to extract all strings (including quotations) from
a C# or C++ source file. After being unsuccessful myself, I found this
sample on the internet:

@"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'"

I am passing in the entire source file as one string and using the pattern with
RegexOptions.Singleline. This works OK, unless the string ends with a
back-slash. For example: "This is a test\\". Can anybody see how to fix
this sample so that back-slashes are considered?
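
A minimal sketch of why the sample overruns, using only the double-quoted branch of the pattern. In Singleline mode `(?!\\).` just means "one character that is not a back-slash", so the character right before the closing quote may never be a back-slash, even when that back-slash is itself escaped. The `EscapeAware` pattern is an assumption of mine, not part of the sample: it consumes each back-slash together with the character after it, so an escaped back-slash can no longer hide the closing quote. The class and member names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Double-quoted branch of the sample pattern: ".*?(?!\\)."
    public const string Original = @""".*?(?!\\).""";

    // Hypothetical alternative: "(?:\\.|[^"\\])*"
    // Either an escape pair (back-slash plus any character) or a
    // character that is neither a quote nor a back-slash.
    public const string EscapeAware = @"""(?:\\.|[^""\\])*""";

    // Returns the first match of pattern in source (empty if none).
    public static string First(string pattern, string source)
    {
        return Regex.Match(source, pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        // Source text being scanned:  a = "test\\"; b = "x";
        string source = @"a = ""test\\""; b = ""x"";";

        Console.WriteLine(First(Original, source));    // "test\\"; b = "
        Console.WriteLine(First(EscapeAware, source)); // "test\\"
    }
}
```

The first line of output shows the original branch running past the real end of the literal and stopping at the next quote that is not preceded by a back-slash.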

Thanks
Mar 21 '06 #1
"Bob" <no****@nowhere.com> wrote:
I need to create a Regex to extract all strings (including
quotations) from a C# or C++ source file.


Well, it's not possible. You'd need a complete C# parser to extract
strings in a foolproof way. Here's one of the simpler examples that
can't be distinguished from a real string using a regex alone:

// "I am a comment but I look like a string"

Eq.
Mar 21 '06 #2
Bob
Nope, it very well is possible...

Regex regex = new Regex(
    @"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')",
    RegexOptions.Singleline);

String result = regex.Replace(input, new MatchEvaluator(MatchEval));

public String MatchEval(Match match)
{
    if (match.Groups[1].Success) { } // comment
    if (match.Groups[2].Success) { } // string literal
    ...
}

Back to my original question, if anybody knows why the regex isn't correctly
watching for back-slashes followed by a quotation, any input is appreciated.
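
For reference, here is a self-contained sketch of the comment-first idea described above. The regex engine tries the alternatives at each position from left to right, so whichever construct starts first (comment or literal) swallows the other. The string branch here is my own assumption (an escape-pair pattern rather than the look-ahead from the sample), it ignores @-prefixed literals entirely, and `LiteralExtractor` and `Extract` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LiteralExtractor
{
    // Group 1: block or line comments.
    // Group 2: double- or single-quoted literals with back-slash escapes.
    private static readonly Regex Scanner = new Regex(
        @"(/\*.*?\*/|//[^\r\n]*)|(""(?:\\.|[^""\\])*""|'(?:\\.|[^'\\])*')",
        RegexOptions.Singleline);

    // Returns every literal (quotes included) that is not inside a comment.
    public static List<string> Extract(string source)
    {
        var literals = new List<string>();
        foreach (Match m in Scanner.Matches(source))
        {
            // Comments are matched too, but only so they get skipped.
            if (m.Groups[2].Success)
            {
                literals.Add(m.Value);
            }
        }
        return literals;
    }
}
```

Calling `LiteralExtractor.Extract` on a file whose first line is `// "I am a comment"` should leave that quoted text out of the result, because the comment branch consumes the whole line before the string branch gets a chance.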
"Paul E Collins" <fi******************@CL4.org> wrote in message
news:dv**********@nwrdmz02.dmz.ncs.ea.ibs-infra.bt.com...
"Bob" <no****@nowhere.com> wrote:
I need to create a Regex to extract all strings (including
quotations) from a C# or C++ source file.


Well, it's not possible. You'd need a complete C# parser to extract
strings in a foolproof way. Here's one of the simpler examples that can't
be distinguished from a real string using a regex alone:

// "I am a comment but I look like a string"

Eq.

Mar 21 '06 #3
Bob wrote:
I need to create a Regex to extract all strings (including quotations) from
a C# or C++ source file. After being unsuccessful myself, I found this
sample on the internet:

@"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'"

I am inputting the entire source file string and using it with
RegexOptions.Singleline. This works OK, unless the string ends with a
back-slash. For example: "This is a test\\". Can anybody see how to fix
this sample so that back-slashes are considered?


Without examples of desired behaviour, here's what I came up with, using backreferences:

Regex regex = new Regex(@"(([""']).+\<2>)", RegexOptions.None);

Sample input:
"This is a test\\"
This is also a test
Here's another "test"
'Now for another\\'
Using 'single quotes'
// Here 's a comment.
// And a "quoted" one.

Sample output:
Matching: "This is a test\\"
1 ="This is a test\\"=
2 ="=

Matching: This is also a test
No Match

Matching: Here's another "test"
1 ="test"=
2 ="=

Matching: 'Now for another\\'
1 ='Now for another\\'=
2 ='=

Matching: Using 'single quotes'
1 ='single quotes'=
2 ='=

Matching: // Here 's a comment.
No Match

Matching: // And a "quoted" one.
1 ="quoted"=
2 ="=

You'd want the group 1....

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
Mar 22 '06 #4
Bob
Your Regex works very well Ken, thanks. Can you explain what exactly the
<2> does? It looks like a grouping construct, but it isn't in the format of
(?<group>.*?). I couldn't find any reference to this at
http://msdn.microsoft.com/library/en...geelements.asp.
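
As far as I can tell, the `\<2>` is being treated as a backreference to capture group 2, i.e. "whatever quote character `([""'])` captured must appear again here". The documented spellings are `\2` for a numbered group and `\k<name>` for a named one, and the sketch below sticks to those. It also shows a side effect worth knowing: numbered backreferences break silently when another group is added in front, while named ones do not. The patterns, the class name and the sample input are all made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackrefDemo
{
    // Group 1 captures the opening quote; \1 requires the same character to close.
    public const string Numbered = @"([""']).+?\1";

    // With a comment group prepended, the quote group becomes number 2.
    // \1 now points at the comment group, which has captured nothing when
    // the string branch is tried, so the backreference can never match.
    public const string Shifted = @"(//[^\r\n]*)|([""']).+?\1";

    // A named group keeps its meaning no matter what is added in front.
    public const string Named = @"(//[^\r\n]*)|(?<quote>[""']).+?\k<quote>";

    // Returns the first match, or a marker string when nothing matches.
    public static string First(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "(no match)";
    }

    public static void Main()
    {
        string input = "x = 'a' + \"b\";";
        Console.WriteLine(First(Numbered, input)); // 'a'
        Console.WriteLine(First(Shifted, input));  // (no match)
        Console.WriteLine(First(Named, input));    // 'a'
    }
}
```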

Thanks again.
Mar 22 '06 #5
Bob
Also, I prepended your pattern to test for comments first:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(([""']).+\<2>)"

After prefixing the commenting part, comments are picked up but your literal
string part is completely ignored. For example:

Nothing is matched (should have gotten the "C"):
String str = "extern \"C\"\r\n";

The whole line is correctly matched for a comment:
String str = "//extern \"C\"\r\n";

Strangely enough, the old pattern did work in this respect:
@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')"

Unfortunately it fails to correctly end literal strings ending with a
back-slash (unlike yours, which does work).

Thanks
Mar 22 '06 #6
Bob
I figured out what it is: the <2> is a back reference to group number 2, and
prefixing the comment group threw the numbering off. I went ahead and
named it and now I have this:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?(?<comment>[""']).+?\<comment>)"

The only problem now is that it doesn't take into account escaped quotations
and double quotations when using the @ string literal prefix in C# files.
Mar 22 '06 #7
Bob
So here is what I've gotten so far:

@"(/\*.*?\*/|//.*?(?=\r|\n))|((?:@(?<c1>[""'])(?:""""|.)*?\<c1>)|(?:(?<c2>[""'])(?:\\.|.)*?\<c2>))"

I am using non-capturing groups for a specific reason not seen here, just
ignore those.

Anyway, the first part is for comments, the second part is for literal
strings starting with @, and the third part is for literal strings with
potential escape characters. Everything seems to work now except for
supporting double-quotation marks in literal strings starting with @. For
example, this input sample:

String str = "before @\"a\"\"b\"\"c\" after \"ok\"";

Captures:
@"a"
"b"
"c"
"ok"

When it should capture:
@"a""b""c"
"ok"

I tested making the capture non-lazy, but then it captures:
@"a""b""c" after "ok"

It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.)*?

If you know why this might be, please share...
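
One possible explanation, offered as a sketch rather than a verified diagnosis of the full pattern: a lazy quantifier tries to leave the loop before it tries another iteration. Right after `@"a` the closing-quote backreference can already match the next quotation mark, so the match ends at `@"a"` and the `""""|.` alternation is never consulted at that position. Making the loop greedy overshoots because `.` also eats quotation marks. A form that avoids both problems is a greedy loop that can only consume a doubled quote or a non-quote character. The pattern and names below are my assumptions, and the sketch leaves out comments and single-quoted literals.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Branch 1: @"(?:""|[^"])*"       verbatim literal, "" is an escaped quote
    // Branch 2: "(?:\\.|[^"\\])*"     regular literal with back-slash escapes
    private static readonly Regex Literals = new Regex(
        @"@""(?:""""|[^""])*""|""(?:\\.|[^""\\])*""",
        RegexOptions.Singleline);

    // Returns every matched literal, quotes and @ prefix included.
    public static List<string> Find(string source)
    {
        var result = new List<string>();
        foreach (Match m in Literals.Matches(source))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Same input sample as in the post above.
        string str = "before @\"a\"\"b\"\"c\" after \"ok\"";
        foreach (string literal in Find(str))
        {
            Console.WriteLine(literal);
        }
        // Prints:
        // @"a""b""c"
        // "ok"
    }
}
```

Because neither loop can step over a lone quotation mark, there is no need for a lazy quantifier or a backreference in these two branches.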
Mar 23 '06 #8
Bob wrote:
It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.)*?


I'm out of ideas on this one. Probably something to do with not considering groups/patterns available for backreferencing if they're in an OR statement.
What I'd do is try to simplify the processing -- break your parsing into more than one pass to make the resulting strings more digestible. You might even find that regex isn't the best option -- string functions could wind up being more appropriate.
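
To make the "string functions" suggestion concrete, here is a rough sketch of a hand-written scanner that walks the source once and tracks whether it is inside a comment, a verbatim literal or a regular literal. Everything in it (class name, method names, the decision to return unterminated literals up to end of input) is an assumption for illustration. It does not handle C++ raw strings, C# interpolated or raw string literals, or preprocessor lines.

```csharp
using System;
using System.Collections.Generic;

public static class LiteralScanner
{
    // Returns every string and character literal in source, quotes included.
    public static List<string> Scan(string source)
    {
        var found = new List<string>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];
            char next = i + 1 < source.Length ? source[i + 1] : '\0';

            if (c == '/' && next == '/')
            {
                // Line comment: skip to the end of the line.
                while (i < source.Length && source[i] != '\n' && source[i] != '\r') i++;
            }
            else if (c == '/' && next == '*')
            {
                // Block comment: skip past the closing */ (or to end of input).
                int close = source.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = close < 0 ? source.Length : close + 2;
            }
            else if (c == '@' && next == '"')
            {
                int end = SkipVerbatim(source, i + 2);
                found.Add(source.Substring(i, end - i));
                i = end;
            }
            else if (c == '"' || c == '\'')
            {
                int end = SkipEscaped(source, i + 1, c);
                found.Add(source.Substring(i, end - i));
                i = end;
            }
            else
            {
                i++;
            }
        }
        return found;
    }

    // Verbatim body: "" is an escaped quote, a lone " closes the literal.
    // Returns the index just past the closing quote.
    private static int SkipVerbatim(string s, int i)
    {
        while (i < s.Length)
        {
            if (s[i] == '"')
            {
                if (i + 1 < s.Length && s[i + 1] == '"') { i += 2; continue; }
                return i + 1;
            }
            i++;
        }
        return s.Length; // unterminated literal
    }

    // Regular body: a back-slash escapes the next character.
    // Returns the index just past the closing quote.
    private static int SkipEscaped(string s, int i, char quote)
    {
        while (i < s.Length)
        {
            if (s[i] == '\\') { i += 2; continue; }
            if (s[i] == quote) return i + 1;
            i++;
        }
        return s.Length; // unterminated literal
    }
}
```

It is longer than a regex, but each rule lives in one obvious place, which tends to make cases like the doubled quotes in @ literals easier to reason about.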

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
Mar 23 '06 #9
