I need to create a Regex to extract all strings (including quotations) from
a C# or C++ source file. After being unsuccessful myself, I found this
sample on the internet:
@"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'"
I am inputting the entire source file string and using it with
RegexOptions.Singleline. This works OK, unless the string ends with a
back-slash. For example: "This is a test\\". Can anybody see how to fix
this sample so that back-slashes are considered?
Thanks
"Bob" <no****@nowhere.com> wrote: I need to create a Regex to extract all strings (including quotations) from a C# or C++ source file.
Well, it's not possible. You'd need a complete C# parser to extract
strings in a foolproof way. Here's one of the simpler examples that
can't be distinguished from a real string using a regex alone:
// "I am a comment but I look like a string"
Eq.
Nope, it very well is possible...
Regex regex = new Regex(
    @"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')",
    RegexOptions.Singleline);
String result = regex.Replace(input, new MatchEvaluator(MatchEval));

public String MatchEval(Match match)
{
    if (match.Groups[1].Success) { } // comment
    if (match.Groups[2].Success) { } // string literal
    ...
}
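For anyone trying this approach, here is a minimal, self-contained sketch of the two-group idea: group 1 swallows comments so that quotes inside them never reach group 2. The names `CommentAwareScan` and `StringLiterals` are made up for illustration. The comment alternative is adapted from the pattern above (with `$` added so a trailing `//` comment without a newline still matches), and the string alternative is a simplified double-quoted form that is an assumption of this sketch, not the pattern from the post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentAwareScan
{
    // Group 1: /* ... */ or // ... comments (adapted from the post above).
    // Group 2: a simplified double-quoted literal (an assumption for this sketch).
    private static readonly Regex Scanner = new Regex(
        @"(/\*.*?\*/|//.*?(?=\r|\n|$))|(""(?:\\.|[^""\\])*"")",
        RegexOptions.Singleline);

    // Returns only the matches that came from the string-literal group.
    public static List<string> StringLiterals(string source)
    {
        var literals = new List<string>();
        foreach (Match m in Scanner.Matches(source))
        {
            if (m.Groups[2].Success) literals.Add(m.Groups[2].Value);
        }
        return literals;
    }

    public static void Main()
    {
        string source = "// \"not a string\"\nstring s = \"real\"; /* \"also not\" */";
        foreach (string s in StringLiterals(source)) Console.WriteLine(s);
    }
}
```

Because the engine scans left to right, whichever construct starts first wins: a `//` inside a string literal stays part of the string, and a quote inside a comment stays part of the comment.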
Back to my original question, if anybody knows why the regex isn't correctly
watching for back-slashes followed by a quotation, any input is appreciated.
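One common way to handle the back-slash case, offered as a sketch rather than a definitive fix: `(?!\\).` only checks that the character right before the closing quote is not a back-slash, so it cannot tell an escaped quote (`\"`) from an escaped back-slash followed by a real closing quote (`\\"`). Consuming escapes as two-character pairs avoids that ambiguity. The names `EscapeAwareLiterals` and `Extract` are hypothetical, and `@` strings are not handled here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeAwareLiterals
{
    // Inside the quotes, accept either a back-slash plus any one character
    // (an escape pair) or any character that is neither a back-slash nor the quote.
    public const string Pattern =
        @"""(?:\\.|[^""\\])*""|'(?:\\.|[^'\\])*'";

    // Returns every double- or single-quoted literal, quotes included.
    public static string[] Extract(string source)
    {
        MatchCollection matches = Regex.Matches(source, Pattern, RegexOptions.Singleline);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++) result[i] = matches[i].Value;
        return result;
    }

    public static void Main()
    {
        // The text being scanned is: a = "This is a test\\"; b = "say \"hi\"";
        string source = "a = \"This is a test\\\\\"; b = \"say \\\"hi\\\"\";";
        foreach (string s in Extract(source)) Console.WriteLine(s);
    }
}
```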
Bob wrote: I need to create a Regex to extract all strings (including quotations) from a C# or C++ source file. After being unsuccessful myself, I found this sample on the internet:
@"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'"
I am inputting the entire source file string and using it with RegexOptions.Singleline. This works OK, unless the string ends with a back-slash. For example: "This is a test\\". Can anybody see how to fix this sample so that back-slashes are considered?
Without examples of desired behaviour, here's what I came up with, using backreferences:
Regex regex = new Regex(@"(([""']).+\<2>)",
(RegexOptions) 0);
Sample input:
"This is a test\\"
This is also a test
Here's another "test"
'Now for another\\'
Using 'single quotes'
// Here 's a comment.
// And a "quoted" one.
Sample output:
Matching: "This is a test\\"
1 =»"This is a test\\"«=
2 =»"«=
Matching: This is also a test
No Match
Matching: Here's another "test"
1 =»"test"«=
2 =»"«=
Matching: 'Now for another\\'
1 =»'Now for another\\'«=
2 =»'«=
Matching: Using 'single quotes'
1 =»'single quotes'«=
2 =»'«=
Matching: // Here 's a comment.
No Match
Matching: // And a "quoted" one.
1 =»"quoted"«=
2 =»"«=
You'd want the group 1....
--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
Your Regex works very well Ken, thanks. Can you explain what exactly the
<2> does? It looks like a grouping construct, but it isn't in the format of
(?<group>.*?). I couldn't find any reference to this at http://msdn.microsoft.com/library/en...geelements.asp.
Thanks again.
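On the `<2>` question: in Ken's pattern it acts as a backreference to group 2, the opening quote character, so the literal has to close with the same kind of quote it opened with. The sketch below uses the spellings I am sure are documented for .NET, `\2` for a numbered backreference and `\k<name>` for a named one. The class name `BackreferenceDemo` and the group name `quote` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Group 1 is the whole literal; group 2 is the opening quote.
    // \2 requires the closing quote to be the same character as the opening one.
    public static readonly Regex Numbered = new Regex(@"(([""']).+\2)");

    // Same idea with a named group; the name does not depend on group order.
    public static readonly Regex Named = new Regex(@"((?<quote>[""']).+\k<quote>)");

    public static void Main()
    {
        string[] lines = { "Here's another \"test\"", "Using 'single quotes'", "mixed \"quote'" };
        foreach (string line in lines)
        {
            Match m = Named.Match(line);
            Console.WriteLine(m.Success ? m.Groups[1].Value : "No Match");
        }
    }
}
```

The last sample line opens with a double quote and ends with a single quote, so neither pattern accepts it.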
Also, I prepended your pattern to test for comments first:
@"(/\*.*?\*/|//.*?(?=\r|\n))|(([""']).+\<2>)"
After prefixing the commenting part, comments are picked up but your literal
string part is completely ignored. For example:
Nothing is matched (should have gotten the "C"):
String str = "extern \"C\"\r\n";
The whole line is correctly matched for a comment:
String str = "//extern \"C\"\r\n";
Strangely enough the old pattern did work in this aspect:
@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')"
Unfortunately it fails to correctly end literal strings ending with a
back-slash (unlike yours, which does work).
Thanks
I figured out what it is... the <2> is a back reference by group number, and
prefixing the comment group shifted the numbering, so it no longer pointed at
the quote group. I went ahead and named it and now I have this:
@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?(?<comment>[""']).+?\<comment>)"
The only problem now is that it doesn't take into account escaped quotations
and double quotations when using the @ string literal prefix in C# files.
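The numbering shift can be reproduced in isolation. In this sketch (the class name `GroupShiftDemo` is hypothetical) the same comment prefix is put in front of Ken's pattern. `\2` then points at the enclosing literal group, which has not finished capturing when the backreference is tried, so the string branch never succeeds. Pointing at group 3, or naming the quote group, restores the match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupShiftDemo
{
    private const string Comments = @"(/\*.*?\*/|//.*?(?=\r|\n))|";

    // After the prefix: group 1 = comment, group 2 = whole literal, group 3 = quote.
    public static readonly Regex Shifted = new Regex(Comments + @"(([""']).+\2)");
    public static readonly Regex Renumbered = new Regex(Comments + @"(([""']).+\3)");
    public static readonly Regex Named = new Regex(Comments + @"((?<quote>[""']).+\k<quote>)");

    public static void Main()
    {
        string input = "extern \"C\"";
        Console.WriteLine(Shifted.IsMatch(input));          // string branch cannot succeed
        Console.WriteLine(Renumbered.Match(input).Value);   // quote group addressed by its new number
        Console.WriteLine(Named.Match(input).Value);        // quote group addressed by name
    }
}
```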
So here is what I've gotten so far:
@"(/\*.*?\*/|//.*?(?=\r|\n))|((?:@(?<c1>[""'])(?:""""|.)*?\<c1>)|(?:(?<c2>[""'])(?:\\.|.)*?\<c2>))"
I am using non-capturing groups for a specific reason not seen here, just
ignore those.
Anyway, the first part is for comments, the second part is for literal
strings starting with @, the third part is for literal strings with
potential escape characters. Everything seems to work now except for
supporting double-quotation marks in literal strings starting with @. For
example, this input sample:
String str = "before @\"a\"\"b\"\"c\" after \"ok\"";
Captures:
@"a"
"b"
"c"
"ok"
When it should capture:
@"a""b""c"
"ok"
I tested making the capture non-lazy, but then it captures:
@"a""b""c" after "ok"
It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.).*?
If you know why this might be, please share...
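A likely explanation, offered as a sketch: the lazy `*?` stops at the first position where the closing quote can match, so `@"a"` is accepted before the doubled-quote alternative is ever needed, and the `.` alternative is free to consume a quote once the quantifier is made greedy. Excluding the quote from the catch-all alternative lets the quantifier be greedy without running past the end of the literal. The names `VerbatimLiterals` and `Extract` are hypothetical, and single-quoted literals are left out for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimLiterals
{
    public const string Pattern =
        @"@""(?:""""|[^""])*""" +      // verbatim: @"..." where "" is an escaped quote
        @"|""(?:\\.|[^""\\])*""";      // regular: "..." with back-slash escapes

    // Returns every verbatim or regular double-quoted literal, quotes included.
    public static string[] Extract(string source)
    {
        MatchCollection matches = Regex.Matches(source, Pattern, RegexOptions.Singleline);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++) result[i] = matches[i].Value;
        return result;
    }

    public static void Main()
    {
        // Same sample as in the post: before @"a""b""c" after "ok"
        string str = "before @\"a\"\"b\"\"c\" after \"ok\"";
        foreach (string s in Extract(str)) Console.WriteLine(s);
    }
}
```

The verbatim alternative is listed first, and since it begins at the `@`, it is tried before the regular alternative can see the quote that follows.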
Bob wrote: It is like it is going to the second option instead of doing the first, even though the first is available: (?:""""|.).*?
I'm out of ideas on this one. Probably something to do with not considering groups/patterns available for backreferencing if they're in an OR statement.
What I'd do is try to simplify the processing -- break your parsing into more than one pass to make the resulting strings more digestible. You might even find that regex isn't the best option -- string functions could wind up being more appropriate.
--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
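Ken's suggestion of plain string functions could look roughly like the single-pass scanner below. It is only a sketch under simplifying assumptions: it ignores preprocessor lines, C++ raw strings and other dialect-specific forms, and the names `LiteralScanner` and `Scan` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LiteralScanner
{
    // Walks the source once and returns string and character literals,
    // quotes included, while skipping // and /* */ comments.
    public static List<string> Scan(string src)
    {
        var found = new List<string>();
        int i = 0;
        while (i < src.Length)
        {
            char c = src[i];
            char next = i + 1 < src.Length ? src[i + 1] : '\0';

            if (c == '/' && next == '/')
            {
                // Line comment: skip to the end of the line.
                while (i < src.Length && src[i] != '\n' && src[i] != '\r') i++;
            }
            else if (c == '/' && next == '*')
            {
                // Block comment: skip past the closing */ (or to the end if unterminated).
                int end = src.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? src.Length : end + 2;
            }
            else if (c == '@' && next == '"')
            {
                // Verbatim literal: "" is an escaped quote, a lone " ends it.
                int j = i + 2;
                while (j < src.Length)
                {
                    if (src[j] == '"')
                    {
                        if (j + 1 < src.Length && src[j + 1] == '"') { j += 2; continue; }
                        break;
                    }
                    j++;
                }
                int stop = Math.Min(j + 1, src.Length);
                found.Add(src.Substring(i, stop - i));
                i = stop;
            }
            else if (c == '"' || c == '\'')
            {
                // Regular literal: a back-slash escapes the next character.
                int j = i + 1;
                while (j < src.Length && src[j] != c)
                {
                    j += src[j] == '\\' ? 2 : 1;
                }
                int stop = Math.Min(j + 1, src.Length);
                found.Add(src.Substring(i, stop - i));
                i = stop;
            }
            else
            {
                i++;
            }
        }
        return found;
    }

    public static void Main()
    {
        string src = "// \"c\"\nx = @\"a\"\"b\"; y = \"t\\\\\"; z = 'q';";
        foreach (string s in Scan(src)) Console.WriteLine(s);
    }
}
```

Each branch advances `i` past the construct it recognised, so a quote inside a comment or a `//` inside a string is never re-examined.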