Easy (?!) regular expression -- find line breaks

So, I'm trying to learn how the Regex class works, and I've been trying to
use it to do what I think ought to be simple things. Except I can't
figure out how to do everything I want. :(

If I want to take a string and break it into individual lines based on a
specific pattern ("\r\n" in this case, but I don't think it matters), I
can easily write a loop that does this by scanning through the string
accumulating characters and spitting out a new string each time it hits
the "\r\n". But I figured Regex ought to be able to do the scanning for
me, so that all I have to loop through are the matches.

I've tried a wide variety of expression strings, but the ones that seem to
come closest to what I want are:

"(.+)\r\n" -- works great, except that if the string doesn't terminate
in a "\r\n", the last line isn't matched

"(.+)(\r\n)*" -- the idea being to allow the last line to match if no
"\r\n" is found. works great, except that the "\r" winds up getting
captured as well (presumably because the second capture group is just
ignored and everything up to the "\n" gets captured by the first capture
group because the default is to )

"(.+?)(\r\n)*" -- works great, except that it's _too_ lazy, and
happily matches just a single character at a time
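
For illustration, here is a minimal C# sketch that runs the three attempts above over a made-up sample string and collects capture group 1 of every match. The class name AttemptDemo, the helper FirstGroups, and the sample input are assumptions for this example, not part of the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttemptDemo
{
    // Returns capture group 1 of every match of `pattern` in `input`.
    public static List<string> FirstGroups(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample: three lines, no trailing "\r\n".
        string input = "one\r\ntwo\r\nthree";
        string[] patterns = { @"(.+)\r\n", @"(.+)(\r\n)*", @"(.+?)(\r\n)*" };

        foreach (string p in patterns)
        {
            // Show any captured carriage return as a visible "\r".
            List<string> shown = FirstGroups(p, input)
                .ConvertAll(s => "[" + s.Replace("\r", "\\r") + "]");
            Console.WriteLine(p + " -> " + string.Join(" ", shown));
        }
    }
}
```

With this sample, the first pattern yields [one] [two] and drops the last line, the second yields [one\r] [two\r] [three], and the third yields eleven single-character captures. The stray "\r" in the second case appears because "." matches every character except "\n", so the greedy ".+" swallows the "\r" before the optional group gets a chance to match.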

(Note: I'm using a replacement string specifying the first capture group
so that I can toss out the "\r\n", but if there's a way to match the
"\r\n" without it winding up in the match itself while at the same time
preventing it from being included in the subsequent match attempt, that
would be wonderful).

I also tried using single-line mode, trying to work around the problem in
the second example, but when I do that, the expression happily and
greedily captures _everything_ up to the very last "\r\n".

What I'm looking for is the expression that represents "capture all text
up to the first \r\n pair, allowing for the possibility of one last match
without the \r\n pair at the end of the string".

Is this actually impossible using Regex, or is there some combination of
options that will allow me to match the first \r\n pair without requiring
a \r\n pair at the end of the last match?

Thanks,
Pete
Jun 14 '07 #1

On Wed, 13 Jun 2007 20:55:10 -0700, Peter Duniho
<Np*********@nnowslpianmk.com> wrote:
> [...]
> If I want to take a string and break it into individual lines based on a
> specific pattern ("\r\n" in this case, but I don't think it matters), I
> can easily write a loop that does this by scanning through the string
> accumulating characters and spitting out a new string each time it hits
> the "\r\n". But I figured Regex ought to be able to do the scanning for
> me, so that all I have to loop through are the matches.

And just to clarify...

Yes, I understand that I can just use String.Split() to do this. I'm
talking about the more general question of the matching, and my little
self-assigned homework exercise to try to learn how Regex works.
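
For reference, the String.Split() route mentioned above might look like the following minimal sketch. The class name SplitDemo, the helper SplitLines, and the sample text are assumptions for illustration.

```csharp
using System;

public static class SplitDemo
{
    // Splits on the literal "\r\n" sequence; no regular expression involved.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n" }, StringSplitOptions.None);
    }

    public static void Main()
    {
        foreach (string line in SplitLines("one\r\ntwo\r\nthree"))
        {
            Console.WriteLine("[" + line + "]");
        }
    }
}
```

Note that with StringSplitOptions.None, input ending in "\r\n" produces a trailing empty string; StringSplitOptions.RemoveEmptyEntries drops it, along with any blank lines.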
Jun 14 '07 #2

Peter Duniho wrote:
> So, I'm trying to learn how the Regex class works, and I've been trying
> to use it to do what I think ought to be simple things. Except I can't
> figure out how to do everything I want. :(
>
> If I want to take a string and break it into individual lines based on a
> specific pattern ("\r\n" in this case, but I don't think it matters), I
> can easily write a loop that does this by scanning through the string
> accumulating characters and spitting out a new string each time it hits
> the "\r\n". But I figured Regex ought to be able to do the scanning for
> me, so that all I have to loop through are the matches.
>
> I've tried a wide variety of expression strings, but the ones that seem
> to come closest to what I want are:
>
> "(.+)\r\n" -- works great, except that if the string doesn't
> terminate in a "\r\n", the last line isn't matched
>
> "(.+)(\r\n)*" -- the idea being to allow the last line to match if
> no "\r\n" is found. works great, except that the "\r" winds up getting
> captured as well (presumably because the second capture group is just
> ignored and everything up to the "\n" gets captured by the first capture
> group because the default is to )
>
> "(.+?)(\r\n)*" -- works great, except that it's _too_ lazy, and
> happily matches just a single character at a time
>
> (Note: I'm using a replacement string specifying the first capture group
> so that I can toss out the "\r\n", but if there's a way to match the
> "\r\n" without it winding up in the match itself while at the same time
> preventing it from being included in the subsequent match attempt, that
> would be wonderful).

Use a non-capturing group: (?:\r\n)

> I also tried using single-line mode, trying to work around the problem
> in the second example, but when I do that, the expression happily and
> greedily captures _everything_ up to the very last "\r\n".
>
> What I'm looking for is the expression that represents "capture all text
> up to the first \r\n pair, allowing for the possibility of one last
> match without the \r\n pair at the end of the string".

Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
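
As a minimal sketch of how that pattern could be used from C#, the following collects group 1 of each match. The class name LineMatcher, the method Lines, and the sample input are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineMatcher
{
    // Lazily capture text up to the next "\r\n" or, failing that, the end of the input.
    private static readonly Regex LinePattern = new Regex(@"(.+?)(?:\r\n|$)");

    public static List<string> Lines(string text)
    {
        var lines = new List<string>();
        foreach (Match m in LinePattern.Matches(text))
        {
            // Group 1 holds the line text; the terminator sits in the non-capturing group.
            lines.Add(m.Groups[1].Value);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Lines("one\r\ntwo\r\nthree"))
        {
            Console.WriteLine("[" + line + "]");
        }
    }
}
```

For the sample input this prints [one], [two], and [three]. One caveat: because ".+?" needs at least one character, completely empty lines are skipped rather than returned as empty strings.
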
> Is this actually impossible using Regex, or is there some combination of
> options that will allow me to match the first \r\n pair without
> requiring a \r\n pair at the end of the last match?
>
> Thanks,
> Pete

--
Göran Andersson
_____
http://www.guffa.com
Jun 14 '07 #3

On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <gu***@guffa.com>
wrote:
> Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)

Ah. So simple. Thanks!
Jun 14 '07 #4

* Peter Duniho wrote, On 14-6-2007 19:25:
> On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <gu***@guffa.com>
> wrote:
>> Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
>
> Ah. So simple. Thanks!

Even easier would be to turn on RegexOptions.Multiline and look for
the following: "^.*$". This should match at the beginning of every line
(^), fetch the content (.*), and stop at the end of each line ($).

It's probably faster as well.

Jesse
Jun 14 '07 #5

On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
<je***********@nospam-sogeti.nl> wrote:
> Even easier would be to turn on RegexOptions.Multiline and look for
> the following: "^.*$". This should match at the beginning of every line
> (^), fetch the content (.*), and stop at the end of each line ($).

Except that as near as I can tell, Regex only uses Unix-style linebreaks.
That is, \n by itself. Which means that if I use the Multiline option
(which seems to be the default, actually), I wind up with the \r as part
of my matched strings, which I don't want.
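
The behaviour described here can be reproduced with a small sketch like the one below. The class name MultilineDemo, the method WholeLineMatches, and the sample text are assumptions for illustration. Two details as far as the current .NET documentation goes: in Multiline mode "$" matches only before "\n" (or at the end of the input), and the option is not on unless RegexOptions.Multiline is passed explicitly.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Returns the full text of every "^.*$" match in Multiline mode.
    public static List<string> WholeLineMatches(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, "^.*$", RegexOptions.Multiline))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        foreach (string s in WholeLineMatches("one\r\ntwo\r\nthree"))
        {
            // Make a trailing carriage return visible in the output.
            Console.WriteLine("[" + s.Replace("\r", "\\r") + "]");
        }
    }
}
```

For the sample input this prints [one\r], [two\r], and [three]: the "\r" stays inside the match because "." accepts it and "$" only stops at the "\n".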

Pete
Jun 14 '07 #6

* Peter Duniho wrote, On 14-6-2007 20:39:
> On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
> <je***********@nospam-sogeti.nl> wrote:
>> Even easier would be to turn on RegexOptions.Multiline and look for
>> the following: "^.*$". This should match at the beginning of every line
>> (^), fetch the content (.*), and stop at the end of each line ($).
>
> Except that as near as I can tell, Regex only uses Unix-style
> linebreaks. That is, \n by itself. Which means that if I use the
> Multiline option (which seems to be the default, actually), I wind up
> with the \r as part of my matched strings, which I don't want.

This shouldn't be so, but does seem to be the case in .NET 2.0. I've
filed a bug against it and it should be fixed in framework Orcas. It
wasn't this way in .NET 1.0 and 1.1 as far as I can remember.

^.*?\r?$ should fix it in the meantime, but is probably slower.
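
A hedged sketch of that workaround idea follows. It is written here as "^(.*?)\r?$" in Multiline mode; the capture group and the closing "$" anchor are my reading of the suggestion rather than a quote, and the class name CrTolerantLines, the method Lines, and the sample text are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrTolerantLines
{
    // Lazy group 1 stops before an optional "\r" that directly precedes the line end.
    private static readonly Regex LinePattern =
        new Regex(@"^(.*?)\r?$", RegexOptions.Multiline);

    public static List<string> Lines(string text)
    {
        var lines = new List<string>();
        foreach (Match m in LinePattern.Matches(text))
        {
            // The whole match may still end in "\r"; group 1 does not.
            lines.Add(m.Groups[1].Value);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Lines("one\r\ntwo\r\nthree"))
        {
            Console.WriteLine("[" + line + "]");
        }
    }
}
```

For the sample input this prints [one], [two], and [three]. Reading group 1 instead of the whole match is what keeps the "\r" out of the result. Unlike a pattern built on ".+?", this one also returns blank lines as empty strings.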

Please file a bug against this if you want to see it fixed in the next
service pack of .NET 2.0. I tried, but they keep closing the bug with the
message that they cannot reproduce it in Orcas, which is still far away
for quite a few of our customers.

Jesse
Jun 14 '07 #7
