In article <O4**************@TK2MSFTNGP15.phx.gbl>,
Martin Robins <so***@noaddress.spam> wrote:
: The string I am trying to parse is as follows:
: commandText=insert into [Trace] (Text) values (@message + N': ' +
: @category);commandType=StoredProcedure; message=@message;
: category=@category
: [...]
: The regular expression code is as follows:
: Regex regex = new
: Regex(@"(?<name>[^=]*)=(?<value>[^(?:;|$)]*)(?:;|$)",
: RegexOptions.ExplicitCapture);
Part of your problem is that most metacharacters lose their special
meanings inside character classes. I doubt that you meant to say
that a value is zero or more characters, none of which is a
parenthesis, question mark, colon, semicolon, pipe, or dollar sign.
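To see the character class point in isolation, here is a minimal sketch. It assumes nothing beyond System.Text.RegularExpressions; the CharClassDemo class and its Accepts helper are names I made up for illustration. It compares the class from your pattern with the same class written as a plain list of characters.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // The class from the original pattern, and the same set of
    // excluded characters written as a plain list.
    public const string Original = @"[^(?:;|$)]";
    public const string Spelled = @"[^()?:;|$]";

    // Hypothetical helper: true when the one-character string
    // made from c is accepted by the given character class.
    public static bool Accepts(string cls, char c)
    {
        return Regex.IsMatch(c.ToString(), "^" + cls + "$");
    }

    static void Main()
    {
        // Every excluded character, plus a few ordinary ones.
        foreach (char c in "()?:;|$ax ")
        {
            Console.WriteLine("'{0}': original={1}, spelled={2}",
                c, Accepts(Original, c), Accepts(Spelled, c));
        }
    }
}
```

Both columns should agree for every character, which is the point: inside the brackets, (?: and | and $ are just literal characters to exclude.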
The trickier part was figuring out why it matched the first *name*
as "insert into...commandType". At first, I thought it might have
been a longest-leftmost issue[*], but then I realized it was due to
a combination of the character class misunderstanding and your trailing
"anchor."
[*] A POSIX thing -- see pg. 116 of Friedl's *Mastering Regular
Expressions* or
http://shurl.org/friedl-longest-leftmost
When the matching engine tries the real first value ("insert...
@category)"), it sees the first left parenthesis (the one before
Text) and says, 'Wait, a value can't have any parentheses because
of the given character class.'
It then tries to backtrack, but it can't match the trailing anchor,
i.e., there's no semicolon or end-of-string to the left of that
paren.
'Okay,' it thinks, 'I must've matched a bad substring for name,'
but a name is zero or more characters that aren't equals signs, so
shortening it doesn't help. The next place a match can start and get
past the equals sign is "insert into...", and the greedy star
quantifier sucks up everything through "commandType". The rest of the
pattern can match from there, and that explains the faulty match.
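If you want to watch that happen, here is a small sketch that runs your pattern unchanged. One assumption: I have joined your string onto a single line, since I can't tell whether the line breaks in your post are real. The FaultyMatchDemo class and its Run helper are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

static class FaultyMatchDemo
{
    // Assumption: the input sits on one line; the line breaks in the
    // original post may only be an artifact of wrapping.
    public const string Input =
        "commandText=insert into [Trace] (Text) values (@message + N': ' + @category);" +
        "commandType=StoredProcedure; message=@message; category=@category";

    // Hypothetical helper: applies the pattern exactly as quoted above.
    public static MatchCollection Run()
    {
        Regex regex = new Regex(
            @"(?<name>[^=]*)=(?<value>[^(?:;|$)]*)(?:;|$)",
            RegexOptions.ExplicitCapture);
        return regex.Matches(Input);
    }

    static void Main()
    {
        foreach (Match m in Run())
        {
            Console.WriteLine("name=[{0}], value=[{1}]",
                m.Groups["name"].Value, m.Groups["value"].Value);
        }
    }
}
```

By my reading, the first match should come back with a name that runs from "insert into" through "commandType" and a value of "StoredProcedure", and commandText never shows up as a name at all.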
Consider the following snippet:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Verbatim string, so the line breaks below are part of the input.
        string str =
@"commandText=insert into [Trace] (Text) values (@message + N': ' +
@category);commandType=StoredProcedure ; message=@message;
category=@category";

        // name: run of non-whitespace; val: lazy run of non-semicolons.
        Regex nameval = new Regex(
            @"(?<name>\S+)\s*=\s*(?<val>[^;]+?)\s*(;|$)",
            RegexOptions.Singleline);

        foreach (Match m in nameval.Matches(str))
        {
            Console.WriteLine(
                "name=[{0}], val=[{1}]",
                m.Groups["name"].ToString(),
                m.Groups["val"].ToString());
        }
    }
}
Its output is
name=[commandText], val=[insert into [Trace] (Text) values (@message + N': ' +
@category)]
name=[commandType], val=[StoredProcedure]
name=[message], val=[@message]
name=[category], val=[@category]
Here we define a name as a run of non-whitespace characters (\S+). By
matching optional whitespace (\s*) and excluding it from the capturing
parentheses, we save the trim steps from your code.
The val subpattern is similar: a val is a run of non-semicolon
characters. One place to pay attention is the +? quantifier. Remember
that * (zero or more) and + (one or more) are greedy: they grab as
much text as they can. The ? versions, *? and +? (think of them as
cautious or timid), take as little as they can and are very anxious
to turn control over to the next part of the expression.
If the val subpattern had been [^;]+ instead of [^;]+?, any trailing
whitespace would be consumed as part of val, but \s* would still happily
match the empty string. (Remember that starred expressions *always*
succeed, although perhaps by matching nothing.)
This is mostly a polish issue. Using the non-greedy plus gives \s*
a chance to throw away whitespace. Again, this saves the extra trim
steps.
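The difference is easy to see side by side. The sketch below is only an illustration: LazyVsGreedyDemo and its Val helper are names I invented, and the input is a single pair with a space before the semicolon. It swaps the val subpattern between the greedy and non-greedy forms and prints what the val group captured.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyVsGreedyDemo
{
    // Hypothetical helper: builds the name/val pattern around the
    // given val subpattern and returns what the val group captured
    // for the first match.
    public static string Val(string valSubpattern, string input)
    {
        string pattern =
            @"(?<name>\S+)\s*=\s*(?<val>" + valSubpattern + @")\s*(;|$)";
        return Regex.Match(input, pattern).Groups["val"].Value;
    }

    static void Main()
    {
        // Note the space before the semicolon.
        string input = "commandType=StoredProcedure ;";

        // Brackets make any trailing whitespace visible.
        Console.WriteLine("greedy: [{0}]", Val("[^;]+", input));
        Console.WriteLine("lazy:   [{0}]", Val("[^;]+?", input));
    }
}
```

The greedy form should keep the trailing space inside the brackets, while the non-greedy form leaves it for \s* to discard.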
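One more important note: because the final name-val pair may be
terminated by end-of-string instead of a semicolon, the pattern
relies on $ matching only at end-of-string. In .NET that is the
default (strictly, $ also matches just before a final trailing
newline); it is RegexOptions.Multiline that makes $ match at the end
of every line, so leave that option off. RegexOptions.Singleline only
makes . match newlines, so in the snippet it is harmless but not
required. (I wasn't sure if the newlines in your example were an
artifact of posting to Usenet or whether they might actually be
there, so I wrote the pattern to cope with them either way.)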
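A quick way to check how $ behaves under each option is to count matches on a two-line string. This is just a sketch; the DollarDemo class, its Count helper, and the test string are made up for illustration. As far as I know, in .NET it is RegexOptions.Multiline that changes the meaning of $, while RegexOptions.Singleline only affects the dot, and the counts should make that visible.

```csharp
using System;
using System.Text.RegularExpressions;

static class DollarDemo
{
    // Hypothetical helper: counts digits that sit immediately
    // before a position where $ matches, under the given options.
    public static int Count(RegexOptions options)
    {
        return Regex.Matches("a=1\nb=2", @"\d$", options).Count;
    }

    static void Main()
    {
        Console.WriteLine("None:       {0}", Count(RegexOptions.None));
        Console.WriteLine("Singleline: {0}", Count(RegexOptions.Singleline));
        Console.WriteLine("Multiline:  {0}", Count(RegexOptions.Multiline));
    }
}
```

With None and Singleline only the final digit should match; with Multiline the digit at the end of the first line should match as well.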
I hope this helps.
Greg