
Splitting a string with Regex and keep the separator

I need to split a string wherever a separator string is present (let's
say #Key(val), where val is a variable) and rejoin the pieces in the
proper order after doing some processing.

Is there a way to use the Regex.Split function to split the string
wherever #Key(val) occurs while keeping the #Key(val) occurrences, so
that I can reconstruct the final string after performing certain
operations on each token? (Basically, I need to convert each string
into an array of characters, but I need to do this differently if the
string is a #Key(val).)

Thanks.
Andrea
Jun 4 '07 #1
15 Replies


This code splits each line differently:

Regex rx = new Regex(@"(?=#\w+\(\w+\))|(?<=#\w+\(\w+\))",
RegexOptions.None);
string[] arr = rx.Split(input);

It looks for every position at the beginning or the end of the pattern
you're looking for. It probably isn't the fastest approach on large
inputs, but I haven't tested that.
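
As a minimal sketch of the lookaround split described above: the class and method names are hypothetical, and the filtering of empty strings is an added assumption, since a zero-width split point at the very start or end of the input produces an empty piece.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitKeepingSeparators
{
    // Zero-width split points: just before and just after each #Key(val).
    private static readonly Regex Boundary =
        new Regex(@"(?=#\w+\(\w+\))|(?<=#\w+\(\w+\))");

    public static string[] Split(string input)
    {
        // A boundary at the start or end of the input yields an empty
        // piece, so empty strings are filtered out (an assumption of
        // this sketch, not part of the original pattern).
        return Boundary.Split(input).Where(s => s.Length > 0).ToArray();
    }
}
```

Calling SplitKeepingSeparators.Split("test#Key(F1)more") returns the three pieces "test", "#Key(F1)" and "more", in that order.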

You might be better off experimenting with a MatchEvaluator and a well
written replace call, but to help you with that I'd need a little more
info on what kind of string manipulation you'd be doing.

Jesse
Jun 4 '07 #2

This should work even better:

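Regex rx2 = new Regex(
    @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)*)",
    RegexOptions.None);
// input is the string to process
string result = rx2.Replace(input, new MatchEvaluator(ManipulateString));

private string ManipulateString(Match target)
{
    if (target.Groups["keyval"].Success)
    {
        // The match is a #Key(val) separator
        return ManupulateKeyVal(target.Groups["keyval"].Value);
    }
    else if (target.Groups["other"].Success)
    {
        // The match is ordinary text between separators
        return ManupulateOther(target.Groups["other"].Value);
    }

    // Not reached for a successful match, but required so that
    // every code path returns a value
    return target.Value;
}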

This will pass the found pieces in order to the manipulate function and
pass the result into a new string when done.
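
For illustration, here is a self-contained sketch of the same Replace-with-evaluator idea. The class name and the two placeholder transformations (bracketing keys, upper-casing plain text) are made up for the example, and the "other" branch uses + instead of * as an assumption so that no empty matches reach the evaluator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceByTokenType
{
    // "+" in the "other" branch is an assumption of this sketch; it keeps
    // the pattern from producing empty matches.
    private static readonly Regex Tokens = new Regex(
        @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)+)");

    public static string Transform(string input)
    {
        // The lambda is converted to a MatchEvaluator delegate and is
        // called once per match, in input order.
        return Tokens.Replace(input, m =>
            m.Groups["keyval"].Success
                ? "[" + m.Value + "]"          // placeholder handling for #Key(val)
                : m.Value.ToUpperInvariant()); // placeholder handling for plain text
    }
}
```

With these placeholder rules, ReplaceByTokenType.Transform("ab#Key(F1)cd") returns "AB[#Key(F1)]CD".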

Kind regards,

Jesse
Jun 4 '07 #3

Hi Andrea,

I'm not sure if I fully understand your question. Would you please let us
know if Jesse's reply helps? Thanks.
Regards,
Walter Wang (wa****@online.microsoft.com, remove 'online.')
Microsoft Online Community Support

==================================================
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.

Jun 5 '07 #4

Thanks. I'll try that code. Just to explain things better, I don't
need to create an output string. I need to convert the strings to an
array of keycodes to send text. So I just need to have a list of
tokens (and know if they are normal text or #KeyVal) and treat them
differently.

I'll test the code and let you know.
Thanks.
Andrea

Jun 5 '07 #5

Ok, I think I get where you're going. Try this:
Regex rx2 = new Regex(
    @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)*)",
    RegexOptions.None);
// input is the string to process
MatchCollection ms = rx2.Matches(input);

foreach (Match m in ms)
{
    if (m.Groups["keyval"].Success)
    {
        // The match is a #Key(val) separator
        ManupulateKeyVal(m.Groups["keyval"].Value);
    }
    else if (m.Groups["other"].Success)
    {
        // The match is ordinary text between separators
        ManupulateOther(m.Groups["other"].Value);
    }
}
That should work best. I thought you were creating a new string at first ;)
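
A runnable sketch of the Matches loop that builds the kind of token list asked for (each piece plus a flag saying whether it is a #Key(val)) could look like this. The class name and the tuple shape are hypothetical, and + replaces * in the "other" branch as an assumption to avoid an empty trailing token.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyTokenizer
{
    // "+" in the "other" branch is an assumption of this sketch.
    private static readonly Regex Tokens = new Regex(
        @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)+)");

    // Returns the pieces in input order; IsKey is true for #Key(val) pieces.
    public static List<(bool IsKey, string Text)> Tokenize(string input)
    {
        var result = new List<(bool IsKey, string Text)>();
        foreach (Match m in Tokens.Matches(input))
        {
            result.Add((m.Groups["keyval"].Success, m.Value));
        }
        return result;
    }
}
```

For the input "test#Key(F1)" this yields two entries: (false, "test") followed by (true, "#Key(F1)").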

Jesse
Jun 5 '07 #6

Thanks Jesse. I'll try it and let you know.
I'm fascinated by what you can do with Regex :-) Could you briefly
comment on the regex you used? How do the Match groups work?
Thanks.
Andrea
Jun 5 '07 #7

It works like a charm. Thanks.
Is there also a regex to get the information inside the round brackets
and tokenize it where there's a space?

Given

#Key(SHIFT F2)

I would like to get
SHIFT
F2

Thanks so much.
Andrea
Jun 5 '07 #8

One more thing Jesse.
I noticed that the Key(val) is not interpreted correctly if I have two
expressions attached to one another.

For example

test#Key(F1) is interpreted correctly
test#Key(F1)#Key(F2) is not

How should I change the expression?
Thanks again.
Andrea
Jun 5 '07 #9

That would be possible.

It would look something like:

#Key\(((?<token>(?>\w+))\s*)+\)

This will put all the single tokens in match.Groups["token"].Captures[*]
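
As a small sketch of reading those captures, assuming a hypothetical helper class that takes a single #Key(...) string:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyArguments
{
    private static readonly Regex KeyCall =
        new Regex(@"#Key\(((?<token>(?>\w+))\s*)+\)");

    // Returns the whitespace-separated tokens inside the parentheses,
    // or an empty list when the input contains no #Key(...) expression.
    public static List<string> Parse(string keyCall)
    {
        var tokens = new List<string>();
        Match m = KeyCall.Match(keyCall);
        if (m.Success)
        {
            // A group that matched several times keeps every hit in
            // its Captures collection, in the order they were found.
            foreach (Capture c in m.Groups["token"].Captures)
            {
                tokens.Add(c.Value);
            }
        }
        return tokens;
    }
}
```

KeyArguments.Parse("#Key(SHIFT F2)") returns a list containing "SHIFT" and "F2".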

I'd have to look into the problem of two adjacent keyval thingies. I'll
get back to you on that.

Jesse
Jun 5 '07 #10

Thanks for your help Jesse.

I can parse the single tokens later.

I just need to be able to detect the

#Key(val) like #Key(SHIFT ALT) inside the text. I saw that the regex
you sent me doesn't work properly if:

1. The #Key(val) contains a space between the parentheses
2. There are two or more #Key(val) directly next to each other
3. It detects any string preceded by a # sign as a key

Thanks.
Andrea
Jun 5 '07 #11

After some careful testing I got the other issues fixed as well. The
regex is quite big already. I'll try to explain what's going on where.

First, the regex:

\G((?<other>((?!#[^\(]+\([^\)]+\)).)+)|(?<keyval>#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)))

To extract the fields you can use this:

foreach (Match m in ms)
{
    if (m.Groups["keyval"].Success)
    {
        string key = m.Groups["key"].Value;
        foreach (Capture c in m.Groups["token"].Captures)
        {
            string token = c.Value;
        }
    }
    else if (m.Groups["other"].Success)
    {
        ManupulateOther(m.Groups["other"].Value);
    }
}

And now for the explanation:

\G - Make sure every new match is directly adjacent to the previous
one, so we're not skipping invalid input.

(?<other>) - Match the 'other' text into a group named "other".
((?!#[^\(]+\([^\)]+\)).)+ - Match every character that isn't the start
of a key/val pair. I'm doing this by looking ahead to see whether a
keyval structure starts at the current position; if it doesn't, I add
one character to the match (.).

When we reach the end of an "other" section there are two options:
either we are at the end of the string, in which case the regex just
stops matching, or a key/val structure starts there.

(?<keyval>) - Match the whole key/val structure into a group named
"keyval".

#(?<key>\w+)\( - Match the key and put it in a group named "key". The
key comes directly after a "#" and consists of one or more word
characters (\w+: letters, digits or underscores), followed by "(".

((?<token>(?>\w+))\s*)+ - Match every token into a group called
"token". If this group captures multiple tokens, they're added to the
group's Captures collection in the order in which they're found. A
token is made up of one or more word characters (\w+). It can be
followed by zero or more whitespace characters. The (?>...)
construction is used to prevent excessive backtracking. The whole
token-followed-by-whitespace unit can repeat. Because the final token
has no whitespace after it, I used \s* rather than \s+.

\) - And finally the closing parenthesis.
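
To make the explanation concrete, the sketch below runs the expression over a sample string and reports what each match contains. The class name, the output format and the sample input are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyScriptParser
{
    private static readonly Regex Tokens = new Regex(
        @"\G((?<other>((?!#[^\(]+\([^\)]+\)).)+)" +
        @"|(?<keyval>#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)))");

    // Produces one descriptive line per match, in input order.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Tokens.Matches(input))
        {
            if (m.Groups["keyval"].Success)
            {
                // Every repetition of the "token" group is kept in Captures.
                var tokens = m.Groups["token"].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value);
                lines.Add("key " + m.Groups["key"].Value + ": " + string.Join(",", tokens));
            }
            else
            {
                lines.Add("text: " + m.Groups["other"].Value);
            }
        }
        return lines;
    }
}
```

For "test#Key(F1)#Key(SHIFT F2)end" the result is four lines: "text: test", "key Key: F1", "key Key: SHIFT,F2" and "text: end", so two adjacent key/val structures are reported as separate matches.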

Keep in mind that if you use RegexOptions.IgnorePatternWhitespace,
you can reflow the regex to make it easier to read. It's also easier to
add comments that way. In this mode an unescaped # starts a comment
that runs to the end of the line, so the literal # characters are
escaped as \#.

@"
(?# Start at the end of the previous match)
\G
(
    (?#
    Match any character until you find the start of
    a key/val pair.
    )
    (?<other>((?!\#[^\(]+\([^\)]+\)).)+)
    |
    (?#
    Match a key/val pair. Put the key name in a group
    and every token in another.
    )
    (?<keyval>\#(?<key>\w+)\(((?<token>\w+)\s*)+\))
)
";

One alternative to this whole approach, which I haven't tested yet but
which should work nonetheless, is to look only for the special key/val
structures, using only the right-hand subexpression:

#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)

Then query the start and end position of each match to determine
whether there were any other characters since the previous match. You
can extract those characters with a substring function. I'm not sure
which option is faster, but I would not be surprised if the substring
option performed better, though it requires more code.
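
A rough sketch of that substring-based alternative, with a hypothetical class name and no claim about its relative performance:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyScanner
{
    // Only the key/val structures are matched; everything between two
    // matches is recovered with Substring.
    private static readonly Regex KeyCall =
        new Regex(@"#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)");

    public static List<string> Split(string input)
    {
        var pieces = new List<string>();
        int lastEnd = 0; // index just past the previous match

        foreach (Match m in KeyCall.Matches(input))
        {
            if (m.Index > lastEnd)
            {
                // Plain text between the previous match and this one.
                pieces.Add(input.Substring(lastEnd, m.Index - lastEnd));
            }
            pieces.Add(m.Value);
            lastEnd = m.Index + m.Length;
        }

        if (lastEnd < input.Length)
        {
            // Trailing plain text after the last match.
            pieces.Add(input.Substring(lastEnd));
        }
        return pieces;
    }
}
```

KeyScanner.Split("test#Key(F1)#Key(SHIFT F2)end") returns "test", "#Key(F1)", "#Key(SHIFT F2)" and "end". The Groups of each match are still available inside the loop if the key name or tokens are needed.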

Jesse Houwing
Jun 5 '07 #12

* na***@community.nospam wrote, On 6-6-2007 0:11:
> Thanks for your help Jesse.
>
> I can parse the single tokens later.
>
> I just need to be able to detect the
>
> #Key(val) like #Key(SHIFT ALT) inside the text. I saw that the regex
> you sent me doesn't work properly if:
>
> 1. The #Key(val) contains a space between parenthesis

Just add \s* right after the opening parenthesis, which should fix this.

> 2. There are two or more #Key(val) attached

Fixed that, see my other long post.

> 3. I saw it detects as a key any string that is preceded by a # sign

Ahhh, I had guessed this was something variable. You can of course use
Key instead of \w+ here to fix it.

This is an updated expression (see the explanation in my other mail):

\G((?<other>((?!#Key\([^\)]+\)).)+)|(?<keyval>#Key\(\s*((?<token>\w+)\s*)+\)))

Jesse
Jun 5 '07 #13

Thanks so much for your help Jesse. I was able to implement it using
your suggestions.

I ended up using the following expression:
\G(?<keyval>#Key\([\w\s]+\))|(?<other>((?!#Key\([\w\s]+\)).)*)

which seems to work fine for me (it detects when two #Key() are placed
right next to each other).

I also want to thank you for the regex explanation. It's always a
difficult topic.
Andrea

Jun 7 '07 #14

You're welcome :)

Jesse

Jun 8 '07 #15

Hi Andrea,

Please feel free to let me know if there's anything I can help with. Thanks.
Regards,
Walter Wang (wa****@online.microsoft.com, remove 'online.')
Microsoft Online Community Support


Jun 11 '07 #16
