"ENIZIN" <EN****@discussions.microsoft.com> wrote in
news:A8**********************************@microsoft.com...
Hello,
I'm having a bit of trouble creating my regular expression and need a
guru's help!
Here's what I have...I have a sequence of characters that need to be
validated against the database.
string: ACCCGUCAU[5Br]IAACCU
What I'm trying to do is load the available values from the database and
create my regex pattern from that. Right now I'm basically just using the
"|" operator which gets a lot of it but it still needs more. I'm also
escaping the "[" and "]" characters during generation.
pattern: A|C|G|U|U\[5Br\]|C\[5F\]|U\[5F\]|U\[5I\]|5-M-C|2'-N-C|I
My problem is I think I'm escaping things improperly or something because
if I use this whole pattern I'm able to locate all of my "A,C,G,U,I"
characters.
You can use Regex.Escape to escape a string, however, I don't think that's
your problem.
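For what it's worth, here is a minimal sketch of generating the alternation with library escaping instead of hand-escaping "[" and "]". It's in Python rather than .NET, with re.escape standing in for Regex.Escape; the value list and the build_alternation name are made up for illustration and stand in for whatever you load from the database.

```python
import re

# Hypothetical stand-in for the values loaded from the database.
DB_VALUES = ["A", "C", "G", "U", "U[5Br]", "C[5F]", "U[5F]", "U[5I]",
             "5-M-C", "2'-N-C", "I"]

def build_alternation(values):
    """Join the values with '|' after escaping regex metacharacters."""
    return "|".join(re.escape(v) for v in values)

pattern = build_alternation(DB_VALUES)

# Each escaped value matches itself literally, brackets included.
assert re.fullmatch(re.escape("U[5Br]"), "U[5Br]") is not None
```

Note that this only takes care of the escaping. The order of the alternatives is unchanged, so it behaves just like the hand-built pattern.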
However, if I trim off those characters from my regex and start at
U\[5Br]...
I can then locate the U[5Br] in my string. This is why I think I've
screwed something up.
There's a 'U' in your alternation before the 'U\[5Br\]' part. This will
match the "U" in the input string. The following "[5Br]" part can't be
matched anymore, so the match ends. The regex engine has no reason to
backtrack, so it simply returns this match (although it's not the longest
possible). You can either give it a reason to backtrack like this:
(A|C|G|U|U\[5Br\]|C\[5F\]|U\[5F\]|U\[5I\]|5-M-C|2'-N-C|I)*$
This will backtrack after the failed attempt to match, and find the correct
match (if there is one).
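To make the difference concrete, here is a small sketch in Python, whose re module tries alternatives left to right in the same way. It is an illustration, not .NET code. One assumption beyond the pattern above: re.match pins the match to the start of the string, so an invalid input gives no match at all instead of a match on some valid tail.

```python
import re

ALTERNATION = r"A|C|G|U|U\[5Br\]|C\[5F\]|U\[5F\]|U\[5I\]|5-M-C|2'-N-C|I"
text = "U[5Br]IAACCU"

# Plain alternation: the bare 'U' alternative comes first and wins,
# and nothing forces the engine to reconsider.
first = re.match(ALTERNATION, text).group(0)
print(first)  # U

# Repeated group plus '$': after 'U' the engine gets stuck on '[',
# backtracks, tries the next 'U...' alternative and consumes the whole string.
anchored = re.match(r"(?:%s)*$" % ALTERNATION, text)
print(anchored.group(0))  # U[5Br]IAACCU

# An input with an unknown modification cannot be consumed to the end.
invalid = re.match(r"(?:%s)*$" % ALTERNATION, "ACCCGUCAU[5Bxxx]IAACCU")
print(invalid)  # None
```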
Another way to get a full match is to modify the original alternation
sequence. If you put the 'U' part after the 'U\[...' parts in the
alternation, those will be tried first, resulting in a good match, too. If I
got you right, you build the pattern programmatically anyway, so it seems
possible to me to eliminate this kind of situation (multiple alternation
members starting with the same substring). You should be able to build an
"alternation tree" from your input patterns, recursively combining the ones
starting with a common substring:
A|C(\[5F\]|)|G|U(\[5(Br\]|F\]|I\])|)|5-M-C|2'-N-C|I
I think this should always work, as it does more or less the same thing I'd
do if I had to do it without regexes.
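Building that tree could look roughly like the sketch below. It's Python, not .NET, and only lightly checked; build_trie and trie_to_regex are names I made up. It puts the tokens into a character trie and emits the empty alternative last at every node where a token ends, so the longer continuations are always tried first. It uses non-capturing groups, (?:...), where I wrote plain parentheses above.

```python
import re

TOKENS = ["A", "C", "G", "U", "U[5Br]", "C[5F]", "U[5F]", "U[5I]",
          "5-M-C", "2'-N-C", "I"]

def build_trie(tokens):
    """Nested dicts keyed by character; the key '' marks the end of a token."""
    root = {}
    for tok in tokens:
        node = root
        for ch in tok:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def trie_to_regex(node):
    """Turn a trie node into a regex fragment for everything below it."""
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in node.items() if ch != ""]
    if not branches:
        return ""
    if "" in node:
        branches.append("")  # shorter token is valid too, but is tried last
    if len(branches) == 1:
        return branches[0]
    return "(?:" + "|".join(branches) + ")"

TREE_PATTERN = trie_to_regex(build_trie(TOKENS))

print(re.match(TREE_PATTERN, "U[5Br]IAA").group(0))  # U[5Br]
print(re.match(TREE_PATTERN, "UCA").group(0))        # U
```

Because no two alternatives at the same level start with the same character, the engine never has to choose between 'U' and 'U[5Br]' by position in the list.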
What I would really like for this to do is not show me what matches but
what doesn't match.
string: ACCCGUCAU[5Bxxx]IAACCU
pattern: A|C|G|U|U\[5Br\]|C\[5F\]|U\[5F\]|U\[5I\]|5-M-C|2'-N-C|I
From this I'd hope to see "U[5Bxxx]" since it's not in the database.
But "U" is in the database, so why wouldn't the output be "[5Bxxx]"? If
there actually is a way to find out these characters belong together
(although they're not in the DB), that could make your task a lot easier.
Anyway, assuming you have a pattern that recognizes all correct input
sequences, and assuming you want the lowest number of "mismatch
characters" (which would be [5Bxxx] in your example), this should be possible.
I didn't test this too much, but it seems to work:
((?>(A|C(\[5F\]|)|G|U(\[5(Br\]|F\]|I\])|)|5-M-C|2'-N-C|I)*)(?<mismatch>.*?))*$
But I don't know how fast it is if input strings get longer. You can get the
"mismatch characters" from the "mismatch"-group's captures list.
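If the single-regex version turns out to be too slow, a plain scanning loop is an alternative. The sketch below is a different technique from the pattern above, not a translation of it: at each position it tries the known tokens longest first, and any character where no token starts is collected into a mismatch run. It's Python, find_mismatches is a made-up name, and it is greedy (it never reconsiders a token it has accepted), which I assume is fine for a token set like yours.

```python
import re

TOKENS = ["A", "C", "G", "U", "U[5Br]", "C[5F]", "U[5F]", "U[5I]",
          "5-M-C", "2'-N-C", "I"]

def find_mismatches(text, tokens):
    """Return (start_index, substring) for each run of unrecognized characters."""
    tokens = [t for t in tokens if t]
    if not tokens:
        raise ValueError("need at least one non-empty token")
    # Longest tokens first, so 'U[5Br]' is tried before the bare 'U'.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))

    mismatches = []
    run_start = None  # start of the current mismatch run, if any
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m:
            if run_start is not None:
                mismatches.append((run_start, text[run_start:pos]))
                run_start = None
            pos = m.end()
        else:
            if run_start is None:
                run_start = pos
            pos += 1
    if run_start is not None:
        mismatches.append((run_start, text[run_start:]))
    return mismatches

print(find_mismatches("ACCCGUCAU[5Bxxx]IAACCU", TOKENS))  # [(9, '[5Bxxx]')]
print(find_mismatches("ACCCGUCAU[5Br]IAACCU", TOKENS))    # []
```

The start index tells you which character precedes the bad run, so reporting "U[5Bxxx]" instead of "[5Bxxx]" is a matter of looking one token back.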
Hope this helps,
Niki