Regular Expressions in C#
Hello all,
I am attempting to create a small scripting application to be used
during testing. After extracting the commands from the script file, I
plan to tokenize each line, since one of the requirements is that there
is one command per line. I have always wanted to learn Regular
Expressions, so I was hoping to do this with them. A fair number of the
commands will have syntax like:
Write( 0x123, 0x12, 25, 100 ) <- Write three bytes to address 0x123
Write(varName1, 0x12) <- Write one byte to the address
expressed by the value of varName1
Read( 0x55, 5 ) <- Read five bytes from address 0x55
Read(0x3456, 0x12) <- Read eighteen bytes from address 0x3456
varName2 = Read( varName1 ) <- Read one byte from the address
expressed by the value of varName1
and store the value read in varName2
I know that the regular expression (^[a-zA-Z]*) will find the initial
keyword or variable name, which I can then check to make sure it is a
valid keyword or a variable that has already been declared. The hard
part is creating a regular expression to match the various forms of the
syntax. How would I create a regular expression for the first and last
script commands? I think with those I can work out the others. The
spaces between the arguments are optional and may be omitted if the
user so desires.
For the first script command I was attempting to craft one that looks
like..
(^[a-zA-Z]*)('\(')(['0x',0-9][a-zA-Z]*)(',')(['0x',0-9][a-zA-Z]*)
but this obviously doesn't work. Any help is greatly appreciated.
Mark
re: Regular Expressions in C#
Hi Mark,
For parsing script commands you might consider using a lexical analyser
like CsLex or C# Lex, maybe with a grammar parser such as GPPG.
To match your first command, try something like:
\w+\((\s*0x\d+\s*,\s*){2}\d+\s*,\s*\d+\s*\)
There's a great regexp reference here: http://www.regular-expressions.info/reference.html
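A minimal C# sketch of how such a pattern might be tried against the first command. It assumes the {2} quantifier is meant to apply to a parenthesized "0x..., " group; the class and method names are invented for illustration, and the ^ and $ anchors are an addition so that the whole line has to match.

```csharp
using System.Text.RegularExpressions;

public static class FirstCommandCheck
{
    // Two "0x" arguments followed by two decimal arguments, e.g.
    // Write( 0x123, 0x12, 25, 100 ). Note that \d does not match the hex digits A-F.
    private static readonly Regex Pattern = new Regex(
        @"^\s*\w+\((\s*0x\d+\s*,\s*){2}\d+\s*,\s*\d+\s*\)\s*$");

    // True when the line has exactly the shape of the first script command.
    public static bool IsMatch(string line)
    {
        return Pattern.IsMatch(line);
    }
}
```

Calling FirstCommandCheck.IsMatch on the first command should succeed, while a line such as Write(varName1, 0x12) is rejected because its first argument is not a 0x literal.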
HTH,
Chris
re: Regular Expressions in C#
I couldn't help but bite on this one. It is a very challenging problem. Here
is your solution:
(?i)(?:(?<function>Write|Read)\s*\()\s*|(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))
Let me break it down a bit. First, I used (?i) to make the match
case-insensitive.
Next, I had the problem of identifying *both* function names and parameters
in the same Regular Expression.
The function name Regular Expression is:
(?:(?<function>Write|Read)\s*\(\s*)
"function" is the name of the capturing group, which captures only the
function name. The rest of the match is to identify it as a function.
It will match only if the function name is "Read" or "Write" and is followed
by an opening parenthesis. I assumed that any token may have any number of
white-space characters before and after it. This was not too tricky.
The second one is a bit trickier:
(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))
The trick here is to identify a parameter from inside a set of function
parameters.
The rules break down as:
1. A parameter is preceded either by a function name followed by an open
parenthesis, as in:
Write (
2. Or it is preceded by another parameter followed by a comma, as in:
Write(param1,
- or -
Write(.......param3,
3. It is always followed by either a comma or an end-parenthesis.
param1,
- or -
param2 )
So, starting with the third rule, we get:
(?<parameter>[\d\w]+)(?=,\s*|\s*\))
"parameter" is the name of the capturing group, which according to these
rules is an alphanumeric token. The rest of it is how the parameter is
matched. It is a positive look-ahead, which means that it *must* be followed
by either a comma or an end parenthesis.
However, the problem here is that *any* word in the string that is not a
function and is followed by a comma or an end parenthesis will match this,
as in:
Read( 0x55, 5 ) <- Write one byte, to (address 0x55)
In this line, "byte" (before the comma) and the "0x55" inside
"(address 0x55)" will also match.
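The over-matching can be seen by running the look-ahead-only pattern on that line. This is only a sketch; the class and method names are hypothetical and not part of the original solution.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadOnlyDemo
{
    // A token followed by a comma or a closing parenthesis.
    // Nothing constrains what comes before the token.
    private static readonly Regex Candidate = new Regex(
        @"(?<parameter>[\d\w]+)(?=,\s*|\s*\))");

    // Returns every token the look-ahead-only pattern accepts, in order.
    public static List<string> Find(string line)
    {
        var found = new List<string>();
        foreach (Match m in Candidate.Matches(line))
        {
            found.Add(m.Groups["parameter"].Value);
        }
        return found;
    }
}
```

For the line above, Find returns the two real parameters plus two tokens from the trailing comment, which is why a look-behind is needed as well.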
So, how do we eliminate non-parameters? Well, obviously, a parameter is
defined as being inside the parentheses of a function call. So, first, use a
positive look-behind to see if it is preceded by a function call. We need to
identify the function, using the same syntax as before:
(?:(?:Write|Read)\s*\(\s*)
However, it may have a parameter before it, instead of the function call. So
we use an OR "|" operator to indicate that it may be preceded by:
(?:(?:[\d\w]+\s*,\s*))
Note that we have changed the rule slightly. Any parameter which precedes
another parameter will *not* be followed by an end-parenthesis. It will
*always* be followed by a comma.
So, we use the Positive Lookbehind syntax (?<=) coupled with an OR operator
("|"), and get:
(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))
Translated: Match any alphanumeric set of tokens which is followed by either
a comma or an end parenthesis, and is preceded either by a function call or
by another parameter.
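A small C# sketch of the parameter pattern on its own, to show the look-behind filtering out tokens in a trailing comment. It relies on .NET allowing variable-length look-behind; the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParameterFinder
{
    // Look-behind: the token must follow "Write(" / "Read(" or a previous "token,".
    // Look-ahead: the token must be followed by a comma or a closing parenthesis.
    private static readonly Regex Parameter = new Regex(
        @"(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))" +
        @"(?<parameter>[\d\w]+)(?=,\s*|\s*\))");

    // Returns the parameters found inside the function call, in order.
    public static List<string> Find(string line)
    {
        var found = new List<string>();
        foreach (Match m in Parameter.Matches(line))
        {
            found.Add(m.Groups["parameter"].Value);
        }
        return found;
    }
}
```

On the commented Read line used earlier, only the two parameters inside the call are returned. One limitation of the look-ahead as written is that a space before a comma (as in "0x55 , 5") prevents the match.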
Now to put them together, we use the OR operator:
(?i)(?:(?<function>Write|Read)\s*\()\s*|(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))
The function name will be captured into the "function" group, and all of the
parameters will be captured into the "parameter" group. This could be stated
as:
Match any token that is either "Read" or "Write" followed by an open
parenthesis, and call it "function," OR Match any alphanumeric set of tokens
which is followed by either a comma or an end parenthesis, and is preceded
either by a function call or by another parameter, and call it "parameter."
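One way the combined expression might be used from C# is sketched below. Each match is either a function match or a parameter match, so the code checks which named group succeeded. The pattern is written without any space inside the look-behind alternation (a space there would stop the first parameter of Write from matching at the start of a line); the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CallTokenizer
{
    private static readonly Regex Token = new Regex(
        @"(?i)(?:(?<function>Write|Read)\s*\()\s*" +
        @"|(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))" +
        @"(?<parameter>[\d\w]+)(?=,\s*|\s*\))");

    // Returns the function name (or null if none was found) and appends
    // each parameter to 'parameters' in the order it appears.
    public static string Tokenize(string line, List<string> parameters)
    {
        string function = null;
        foreach (Match m in Token.Matches(line))
        {
            if (m.Groups["function"].Success)
            {
                function = m.Groups["function"].Value;
            }
            else if (m.Groups["parameter"].Success)
            {
                parameters.Add(m.Groups["parameter"].Value);
            }
        }
        return function;
    }
}
```

Note that this only tokenizes; it does not verify that the line as a whole is well formed, and the variable on the left of an assignment such as varName2 = Read( varName1 ) is not captured.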
You sure picked a doozy to start out with!
--
HTH,
Kevin Spencer
Microsoft MVP
Professional Numbskull
Hard work is a medication for which
there is no placebo.
re: Regular Expressions in C#
Kevin,
Thanks for the response, and sorry for the long delay in my follow-up.
I found help in the RegEx group, which helped a great deal. I wanted to
share the RegExes that I have so far. They are not fully tested, but
they are functional for the most part. I used unnamed groups for just
about everything, since that is how I decided to parse everything out.
I might change that in the future if I find this approach problematic.
So here we go...
Syntax format: Write( address, data [, 44] )
\s*Write\s*\((?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*){1,1}(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$
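A sketch of how the Write expression might be consumed in C#. Because the second group sits inside a repeated (...)*, its values are read from Group.Captures rather than Group.Value, which would return only the last one. The leading ^ is an assumption added here so that nothing may precede the command; the class and method names are invented for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WriteParser
{
    private static readonly Regex WriteCall = new Regex(
        @"^\s*Write\s*\((?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*){1,1}" +
        @"(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$");

    // Returns the address followed by every data argument,
    // or null if the line is not a Write command.
    public static List<string> Parse(string line)
    {
        Match m = WriteCall.Match(line);
        if (!m.Success)
        {
            return null;
        }
        var args = new List<string> { m.Groups[1].Value };
        // Group 2 matched once per additional argument; each match is a Capture.
        foreach (Capture c in m.Groups[2].Captures)
        {
            args.Add(c.Value);
        }
        return args;
    }
}
```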
Syntax format: [variable3 =] Read( 0x44 [, 44] )
Group 1 : Optional: variable name with equal sign
(e.g. "variable2 =")
Group 2 : Required: Read keyword
Group 3 : Required: Address
Group 4 : Optional: Number of bytes to read starting at 'Address'
^\s*(?:([a-zA-Z][a-zA-Z\d]\w*)\s*=\s*){0,1}(?:\s*(Read){1,1}\s*)\((?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*)(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$
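A sketch of reading the Read groups in C#, using Group.Success to tell an absent optional group from a present one. The pattern here is written with no space after the first \d+ and with A-Z in the variable-name class, on the assumption that those were posting slips; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ReadParser
{
    private static readonly Regex ReadCall = new Regex(
        @"^\s*(?:([a-zA-Z][a-zA-Z\d]\w*)\s*=\s*){0,1}(?:\s*(Read){1,1}\s*)\(" +
        @"(?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*)" +
        @"(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$");

    // Returns { target variable, address, byte count }.
    // The target and the count are null when they are absent.
    // Returns null if the line is not a Read command.
    public static string[] Parse(string line)
    {
        Match m = ReadCall.Match(line);
        if (!m.Success)
        {
            return null;
        }
        return new string[]
        {
            m.Groups[1].Success ? m.Groups[1].Value : null,
            m.Groups[3].Value,
            m.Groups[4].Success ? m.Groups[4].Value : null
        };
    }
}
```

As written, the variable-name group needs at least two characters, so a one-letter target such as "x = Read(0x55)" would not match.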
The next one is rather long, but there are multiple cases that I need
to account for. I could have created a RegEx for each individual case,
but I would rather have one all-encompassing RegEx and then check each
of the groups, instead of trying each RegEx in turn, which I think
would be slower. In these, you can change byte to short, int or float,
which are the other types used in my application.
Syntax format: byte var1
Group 1 : Required: var1
Group 2 : Optional: Not Present
Group 3 : Optional: Not Present
Group 4 : Optional: Not Present
Group 5 : Optional: Not Present
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present
----------------------------------------------
Syntax format: byte var2 = variableNew
Group 1 : Required: var2
Group 2 : Optional: Not Present
Group 3 : Optional: Not Present
Group 4 : Optional: Not Present
Group 5 : Optional: variableNew
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present
----------------------------------------------
Syntax format: byte var3[3] = { 0x11, 0xAA, 0x33 }
Group 1 : Required: var3
Group 2 : Optional: 3
Group 3 : Optional: 0x11
Group 4 : Optional:
Capture 1: 0xAA
Capture 2: 0x33
Group 5 : Optional: Not Present
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present
----------------------------------------------
Syntax format: byte var4[] = { 0x33, 0x444 }
Group 1 : Required: var4
Group 2 : Optional: Not Present
Group 3 : Optional: 0x33
Group 4 : Optional: 0x444
Group 5 : Optional: Not Present
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present
----------------------------------------------
Syntax format: byte var5[5] = 5555
Group 1 : Required: var5
Group 2 : Optional: Not Present
Group 3 : Optional: Not Present
Group 4 : Optional: Not Present
Group 5 : Optional: Not Present
Group 6 : Optional: 5
Group 7 : Optional: 5555
^\s*byte
(?:\s*([a-zA-Z][\da-zA-Z]*))(?:\[(?:\s*(\d+)\s*)?\]\s*=\s*(?:\s*\{\s*(\d+|0x[\dA-Fa-f]*)(?:\s*,\s*(\d+|0x[\dA-Fa-f]*))*\s*\})|\s*=\s*(?:(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))|(?:\[(?:\s*(\d+)\s*)?\]\s*=\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*)))?\s*$
Syntax format: SetCommParam(COMn, BaudRate, DataBits, StopBits, Parity)
Note: The Comm Port and Parity strings are case sensitive
Group 1 : Required: Port Number { COMn }
Group 2 : Required: Baud Rate
Group 3 : Required: DataBits
Group 4 : Required: StopBits
Group 5 : Required: Parity { None, Odd, Even, Mark, Space }
^\s*SetCommParam\s*\(\s*(?:(COM\d+))\s*,\s*(?:(\d+))\s*,\s*(?:([5-8])){1,1}\s*,\s*(?:(1|1\.5|2))\s*,\s*(?:(None|Odd|Even|Mark|Space))\s*\)\s*$
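A sketch of pulling the five fields out in C#. The pattern is written without embedded spaces and with the dot in 1.5 escaped, which is assumed to be the intent; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class CommParamParser
{
    private static readonly Regex SetCommParam = new Regex(
        @"^\s*SetCommParam\s*\(\s*(?:(COM\d+))\s*,\s*(?:(\d+))\s*,\s*(?:([5-8])){1,1}\s*,\s*" +
        @"(?:(1|1\.5|2))\s*,\s*(?:(None|Odd|Even|Mark|Space))\s*\)\s*$");

    // Returns { port, baud rate, data bits, stop bits, parity } as strings,
    // or null if the line does not match.
    public static string[] Parse(string line)
    {
        Match m = SetCommParam.Match(line);
        if (!m.Success)
        {
            return null;
        }
        var fields = new string[5];
        for (int i = 0; i < fields.Length; i++)
        {
            // Groups are numbered from 1; group 0 is the whole match.
            fields[i] = m.Groups[i + 1].Value;
        }
        return fields;
    }
}
```

Since no RegexOptions.IgnoreCase is passed, "com1" or "none" is rejected, in line with the note that the port and parity strings are case sensitive.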
I hope this might help someone else in the future. Thanks to all of
the great people on the newsgroups and forums.
Mark