Regex: Finding strings in a source file

Bob
I need to create a Regex to extract all strings (including quotations) from
a C# or C++ source file. After being unsuccessful myself, I found this
sample on the internet:

@"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'"

I am inputting the entire source file string and using it with
RegexOptions.Singleline. This works OK, unless the string ends with a
backslash. For example: "This is a test\\". Can anybody see how to fix
this sample so that backslashes are considered?

Thanks
Mar 21 '06 #1
"Bob" <no****@nowhere.com> wrote:
I need to create a Regex to extract all strings (including
quotations) from a C# or C++ source file.


Well, it's not possible. You'd need a complete C# parser to extract
strings in a foolproof way. Here's one of the simpler examples that
can't be distinguished from a real string using a regex alone:

// "I am a comment but I look like a string"

Eq.
Mar 21 '06 #2
Bob
Nope, it very well is possible...

Regex regex = new Regex(
    @"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')",
    RegexOptions.Singleline);

String result = regex.Replace(input, new MatchEvaluator(MatchEval));

public String MatchEval(Match match)
{
    if (match.Groups[1].Success) { } // comment
    if (match.Groups[2].Success) { } // string literal
    ...
}

Back to my original question, if anybody knows why the regex isn't correctly
watching for back-slashes followed by a quotation, any input is appreciated.
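
A possible explanation, with a sketch of one common fix: `(?!\\).` only asserts that the single character before the closing quote is not a backslash, so a literal that ends in an escaped backslash (`\\` followed by the real closing quote) is not closed at the right place. The pattern below instead consumes every backslash together with the character after it. It is an illustration rather than the pattern from the post: the names `EscapeAwareStrings` and `Extract` are made up here, and comments and @-prefixed strings are deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EscapeAwareStrings
{
    // Regex text (after C# verbatim-string unescaping):
    //   "(?:\\.|[^"\\])*"   double-quoted literal: each item is either an
    //                       escape pair (backslash + any char) or a plain char
    //   '(?:\\.|[^'\\])*'   the same idea for single-quoted literals
    private static readonly Regex Literal = new Regex(
        @"""(?:\\.|[^""\\])*""|'(?:\\.|[^'\\])*'",
        RegexOptions.Singleline);

    // Returns every literal found in the source text, quotes included.
    public static List<string> Extract(string source)
    {
        var found = new List<string>();
        foreach (Match m in Literal.Matches(source))
            found.Add(m.Value);
        return found;
    }
}
```

Because an escape pair is always eaten as a unit, `\\"` ends the literal while `\"` does not, with no lookahead needed.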
Mar 21 '06 #3
Without examples of desired behaviour, here's what I came up with, using backreferences:

Regex regex = new Regex(@"(([""']).+\<2>)", RegexOptions.None);

Sample input:
"This is a test\\"
This is also a test
Here's another "test"
'Now for another\\'
Using 'single quotes'
// Here 's a comment.
// And a "quoted" one.

Sample output:
Matching: "This is a test\\"
1 =»"This is a test\\"«=
2 =»"«=

Matching: This is also a test
No Match

Matching: Here's another "test"
1 =»"test"«=
2 =»"«=

Matching: 'Now for another\\'
1 =»'Now for another\\'«=
2 =»'«=

Matching: Using 'single quotes'
1 =»'single quotes'«=
2 =»'«=

Matching: // Here 's a comment.
No Match

Matching: // And a "quoted" one.
1 =»"quoted"«=
2 =»"«=

You'd want the group 1....
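
For reference, a small sketch of the backreference idea: group 2 captures whichever quote character opened the literal, and the backreference requires that same character again. It uses the documented `\2` form rather than the `\<2>` spelling in the post, and a lazy `.+?` instead of `.+`; the class and method names (`QuoteBackreference`, `FirstLiteral`) are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteBackreference
{
    // Group 1: the whole literal. Group 2: the opening quote character.
    // \2 matches the same text group 2 captured, so "..." and '...' both
    // match, but a mixed pair such as "...' does not.
    private static readonly Regex Quoted =
        new Regex(@"(([""']).+?\2)", RegexOptions.None);

    // Returns the first quoted span on the line, or null if there is none.
    public static string FirstLiteral(string line)
    {
        Match m = Quoted.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With a greedy `.+`, a line holding two literals, such as `"a" + "b"`, is matched as one span from the first quote to the last; the lazy form stops at the first closing quote. Neither form understands escaped quotes inside a literal.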

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
Mar 22 '06 #4
Bob
Your Regex works very well Ken, thanks. Can you explain what exactly the
<2> does? It looks like a grouping construct, but it isn't in the format of
(?<group>.*?). I couldn't find any reference to this at
http://msdn.microsoft.com/library/en...geelements.asp.

Thanks again.
Mar 22 '06 #5
Bob
Also, I prepended your pattern to test for comments first:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(([""']).+\<2>)"

After prefixing the commenting part, comments are picked up but your literal
string part is completely ignored. For example:

Nothing is matched (should have gotten the "C"):
String str = "extern \"C\"\r\n";

The whole line is correctly matched for a comment:
String str = "//extern \"C\"\r\n";

Strangely enough the old pattern did work in this aspect:
@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')"

Unfortunately it fails to correctly end literal strings ending with a
back-slash (unlike yours, which does work).

Thanks

Mar 22 '06 #6
Bob
I figured out what it is... the <2> is a back reference to group number 2,
and my prefixing the comment group shifted the numbering, so the quote
character is no longer group 2. I went ahead and named it and now I have this:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?(?<comment>[""']).+?\<comment>)"

The only problem now is that it doesn't take into account escaped quotation
marks, or doubled quotation marks when using the @ string literal prefix in
C# files.
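
The numbering shift can be seen in a short sketch. Groups are numbered by their opening parenthesis, left to right, so once the comment group is placed in front, the quote character moves from group 2 to group 3. The names `GroupNumbering`, `Numbered`, `Named` and the group name `quote` are chosen for this example only; `\k<quote>` is the documented named-backreference form.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNumbering
{
    // Group 1: a block comment or a line comment.
    private const string Comment = @"(/\*.*?\*/|//.*?(?=\r|\n))";

    // Numbered version: 1 = comment, 2 = whole literal, 3 = quote character,
    // so the backreference has to be \3, not \2.
    public static readonly Regex Numbered = new Regex(
        Comment + @"|(([""']).+?\3)", RegexOptions.Singleline);

    // Named version: the backreference no longer depends on how many
    // groups appear earlier in the pattern.
    public static readonly Regex Named = new Regex(
        Comment + @"|((?<quote>[""']).+?\k<quote>)", RegexOptions.Singleline);
}
```

Naming the group means further alternatives can be added in front without having to recount.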
Mar 22 '06 #7
Bob
So here is what I've gotten so far:

@"(/\*.*?\*/|//.*?(?=\r|\n))|((?:@(?<c1>[""'])(?:""""|.)*?\<c1>)|(?:(?<c2>[""'])(?:\\.|.)*?\<c2>))"

I am using non-capturing groups for a specific reason not seen here, just
ignore those.

Anyway, the first part is for comments, the second part is for literal
strings starting with @, and the third part is for literal strings with
potential escape characters. Everything seems to work now except for
supporting doubled quotation marks in literal strings starting with @. For
example, this input sample:

String str = "before @\"a\"\"b\"\"c\" after \"ok\"";

Captures:
@"a"
"b"
"c"
"ok"

When it should capture:
@"a""b""c"
"ok"

I tested making the capture non-lazy, but then it captures:
@"a""b""c" after "ok"

It looks as if it is taking the second alternative instead of the first, even
though the first is available:
(?:""""|.)*?

If you know why this might be, please share...
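
A likely cause, sketched under the assumption that only double-quoted C# literals matter: `*?` is lazy, so after `@"a` the engine first tries to leave the loop, finds that the closing-quote backreference matches the very next `"`, and stops there. The `""""` alternative is never needed for a successful match, so it is never used. One way around this is to drop the laziness and make a lone quote impossible to consume inside the loop. The names `VerbatimStrings` and `Extract` are hypothetical, and comments and single-quoted literals are omitted to keep the pattern readable.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VerbatimStrings
{
    // Regex text (after C# verbatim-string unescaping):
    //   @"(?:""|[^"])*"     verbatim literal: a doubled quote counts as one
    //                       item, any non-quote character as one item
    //   "(?:\\.|[^"\\])*"   regular literal with backslash escapes
    // The verbatim alternative comes first so that @"..." is not read as
    // a stray @ followed by a regular literal.
    private static readonly Regex Literal = new Regex(
        @"@""(?:""""|[^""])*""|""(?:\\.|[^""\\])*""",
        RegexOptions.Singleline);

    // Returns every literal found in the source text, prefix and quotes included.
    public static List<string> Extract(string source)
    {
        var found = new List<string>();
        foreach (Match m in Literal.Matches(source))
            found.Add(m.Value);
        return found;
    }
}
```

Since `[^"]` cannot match a quote, the loop can only pass a quote as part of a `""` pair, and the first unpaired quote ends the literal. On the sample input above this yields `@"a""b""c"` and `"ok"`.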
Mar 23 '06 #8
Bob wrote:
It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.)*?


I'm out of ideas on this one. Probably something to do with not considering groups/patterns available for backreferencing if they're in an OR statement.
What I'd do is try to simplify the processing -- break your parsing into more than one pass to make the resulting strings more digestible. You might even find that regex isn't the best option -- string functions could wind up being more appropriate.
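
The string-function route can be sketched as a single pass over the characters. This is a minimal illustration, not a full lexer: the name `LiteralScanner` is invented here, and it assumes C#-style rules only (no C++ raw strings, no interpolated strings, no preprocessor handling). An unterminated literal is returned up to the end of the input.

```csharp
using System;
using System.Collections.Generic;

public static class LiteralScanner
{
    // Walks the source once and returns string/char literals (quotes
    // included), skipping anything inside // or /* */ comments.
    public static List<string> Extract(string src)
    {
        var found = new List<string>();
        int i = 0;
        while (i < src.Length)
        {
            char c = src[i];
            bool hasNext = i + 1 < src.Length;
            if (c == '/' && hasNext && src[i + 1] == '/')
            {
                // Line comment: skip to the end of the line.
                while (i < src.Length && src[i] != '\n' && src[i] != '\r') i++;
            }
            else if (c == '/' && hasNext && src[i + 1] == '*')
            {
                // Block comment: skip past the closing */ (or to end of input).
                int end = src.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? src.Length : end + 2;
            }
            else if (c == '@' && hasNext && src[i + 1] == '"')
            {
                // Verbatim string: "" is an escaped quote, a lone " ends it.
                int j = i + 2;
                while (j < src.Length)
                {
                    if (src[j] != '"') j++;
                    else if (j + 1 < src.Length && src[j + 1] == '"') j += 2;
                    else break;
                }
                int stop = Math.Min(j + 1, src.Length);
                found.Add(src.Substring(i, stop - i));
                i = stop;
            }
            else if (c == '"' || c == '\'')
            {
                // Regular literal: a backslash escapes the next character.
                int j = i + 1;
                while (j < src.Length && src[j] != c)
                    j += (src[j] == '\\') ? 2 : 1;
                int stop = Math.Min(j + 1, src.Length);
                found.Add(src.Substring(i, stop - i));
                i = stop;
            }
            else
            {
                i++;
            }
        }
        return found;
    }
}
```

Each branch handles one lexical state, so adding a new literal form means adding a branch rather than reworking one large pattern.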

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
Mar 23 '06 #9
