Easy (?!) regular expression -- find line breaks

So, I'm trying to learn how the Regex class works, and I've been trying to
use it to do what I think ought to be simple things. Except I can't
figure out how to do everything I want. :(

If I want to take a string and break it into individual lines based on a
specific pattern ("\r\n" in this case, but I don't think it matters), I
can easily write a loop that does this by scanning through the string
accumulating characters and spitting out a new string each time it hits
the "\r\n". But I figured Regex ought to be able to do the scanning for
me, so that all I have to loop through are the matches.

I've tried a wide variety of expression strings, but the ones that seem to
come closest to what I want are:

"(.+)\r\n" -- works great, except that if the string doesn't terminate
in a "\r\n", the last line isn't matched

"(.+)(\r\n)*" -- the idea being to allow the last line to match if no
"\r\n" is found. works great, except that the "\r" winds up getting
captured as well (presumably because the second capture group is just
ignored and everything up to the "\n" gets captured by the first capture
group because the default is to be greedy)

"(.+?)(\r\n)*" -- works great, except that it's _too_ lazy, and
happily matches just a single character at a time
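
The behaviour of the three attempts above can be reproduced in a few lines. This is only a rough sketch: it uses Python's `re` module as a stand-in for the .NET Regex class, on the assumption that `.`, `+`, `+?` and `*` behave the same way for these patterns. The sample string and the helper name `first_groups` are made up for illustration.

```python
import re

# Hypothetical sample: three lines, the last one without a trailing "\r\n".
SAMPLE = "one\r\ntwo\r\nthree"


def first_groups(pattern, text):
    """Return what capture group 1 holds for each successive match."""
    return [m.group(1) for m in re.finditer(pattern, text)]


# Attempt 1: "\r\n" is mandatory, so the unterminated last line never matches.
attempt1 = first_groups(r"(.+)\r\n", SAMPLE)

# Attempt 2: "." matches "\r" (only "\n" is excluded), so the greedy ".+"
# swallows the "\r" and the optional group then matches zero times.
attempt2 = first_groups(r"(.+)(\r\n)*", SAMPLE)

# Attempt 3: the lazy ".+?" is satisfied after a single character.
attempt3 = first_groups(r"(.+?)(\r\n)*", SAMPLE)

print(attempt1)  # ['one', 'two']
print(attempt2)  # ['one\r', 'two\r', 'three']
print(attempt3)  # eleven one-character strings: 'o', 'n', 'e', 't', ...
```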

(Note: I'm using a replacement string specifying the first capture group
so that I can toss out the "\r\n", but if there's a way to match the
"\r\n" without it winding up in the match itself while at the same time
preventing it from being included in the subsequent match attempt, that
would be wonderful).

I also tried using single-line mode, trying to work around the problem in
the second example, but when I do that, the expression happily and
greedily captures _everything_ up to the very last "\r\n".

What I'm looking for is the expression that represents "capture all text
up to the first \r\n pair, allowing for the possibility of one last match
without the \r\n pair at the end of the string".

Is this actually impossible using Regex, or is there some combination of
options that will allow me to match the first \r\n pair without requiring
a \r\n pair at the end of the last match?

Thanks,
Pete
Jun 14 '07 #1
On Wed, 13 Jun 2007 20:55:10 -0700, Peter Duniho
<Np*********@nnowslpianmk.com> wrote:
> [...]
> If I want to take a string and break it into individual lines based on a
> specific pattern ("\r\n" in this case, but I don't think it matters), I
> can easily write a loop that does this by scanning through the string
> accumulating characters and spitting out a new string each time it hits
> the "\r\n". But I figured Regex ought to be able to do the scanning for
> me, so that all I have to loop through are the matches.

And just to clarify...

Yes, I understand that I can just use String.Split() to do this. I'm
talking about the more general question of the matching, and my little
self-assigned homework exercise to try to learn how Regex works.
Jun 14 '07 #2
Peter Duniho wrote:
> [...]
> (Note: I'm using a replacement string specifying the first capture group
> so that I can toss out the "\r\n", but if there's a way to match the
> "\r\n" without it winding up in the match itself while at the same time
> preventing it from being included in the subsequent match attempt, that
> would be wonderful).

Use a non-capturing group: (?:\r\n)

> I also tried using single-line mode, trying to work around the problem
> in the second example, but when I do that, the expression happily and
> greedily captures _everything_ up to the very last "\r\n".
>
> What I'm looking for is the expression that represents "capture all text
> up to the first \r\n pair, allowing for the possibility of one last
> match without the \r\n pair at the end of the string".

Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
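
As a quick check of the suggested pattern, here is a minimal sketch. It again uses Python's `re` module as a stand-in for .NET Regex (assumed to be equivalent for this pattern), and the sample strings and the name `split_lines` are invented for illustration.

```python
import re

# Lazy capture, then a non-capturing alternative: either "\r\n" or end of text.
LINE = re.compile(r"(.+?)(?:\r\n|$)")


def split_lines(text):
    """Collect group 1 of every match, i.e. each line without its terminator."""
    return [m.group(1) for m in LINE.finditer(text)]


print(split_lines("one\r\ntwo\r\nthree"))  # ['one', 'two', 'three']
print(split_lines("one\r\ntwo\r\n"))       # ['one', 'two']
print(split_lines("a\r\n\r\nb"))           # ['a', 'b']
```

Note that the last call drops the blank line: `.+?` needs at least one character, so an empty line never produces a match.
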
> Is this actually impossible using Regex, or is there some combination of
> options that will allow me to match the first \r\n pair without
> requiring a \r\n pair at the end of the last match?
>
> Thanks,
> Pete

--
Göran Andersson
_____
http://www.guffa.com
Jun 14 '07 #3
On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <gu***@guffa.com>
wrote:
> Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)

Ah. So simple. Thanks!
Jun 14 '07 #4
* Peter Duniho wrote, On 14-6-2007 19:25:
> On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <gu***@guffa.com>
> wrote:
>> Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
>
> Ah. So simple. Thanks!

Even easier would be to set RegexOptions.Multiline and look for
the following: "^.*$". This should match at the beginning of every line
(^), fetch the content (.*), and end at the end of each line ($).

It's probably faster as well.

Jesse
Jun 14 '07 #5
On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
<je***********@nospam-sogeti.nl> wrote:
> Even easier would be to set RegexOptions.Multiline and look for
> the following: "^.*$". This should match at the beginning of every line
> (^), fetch the content (.*), and end at the end of each line ($).

Except that as near as I can tell, Regex only uses Unix-style linebreaks.
That is, \n by itself. Which means that if I use the Multiline option
(which seems to be the default, actually), I wind up with the \r as part
of my matched strings, which I don't want.
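
The stray "\r" is easy to reproduce. The sketch below uses Python's `re` module as a stand-in for .NET Regex, which is an assumption; both engines are understood to treat only "\n" as a line break for `^`, `$` and `.`. The sample string is invented. It also runs the same pattern without the multiline flag for comparison (in .NET the default is RegexOptions.None, so Multiline has to be requested explicitly).

```python
import re

SAMPLE = "one\r\ntwo\r\nthree"  # invented sample with Windows line endings

# With MULTILINE, "^" and "$" anchor at every "\n"; the "\r" is just another
# character as far as "." is concerned, so it ends up inside the match.
with_multiline = re.findall(r"^.*$", SAMPLE, re.MULTILINE)

# Without MULTILINE, "$" only matches at the end of the text (or just before
# a final "\n"), and ".*" cannot cross a "\n", so nothing matches here.
without_multiline = re.findall(r"^.*$", SAMPLE)

print(with_multiline)     # ['one\r', 'two\r', 'three']
print(without_multiline)  # []
```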

Pete
Jun 14 '07 #6
* Peter Duniho wrote, On 14-6-2007 20:39:
> On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
> <je***********@nospam-sogeti.nl> wrote:
>> Even easier would be to set RegexOptions.Multiline and look for
>> the following: "^.*$". This should match at the beginning of every line
>> (^), fetch the content (.*), and end at the end of each line ($).
>
> Except that as near as I can tell, Regex only uses Unix-style
> linebreaks. That is, \n by itself. Which means that if I use the
> Multiline option (which seems to be the default, actually), I wind up
> with the \r as part of my matched strings, which I don't want.

This shouldn't be so, but does seem to be the case in .NET 2.0. I've
filed a bug against it and it should be fixed in the Orcas framework
release. It wasn't this way in .NET 1.0 and 1.1, as far as I can remember.

^.*?\r?$ should fix it in the meantime, but is probably slower.
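
A sketch of this workaround, assuming the pattern is meant to end in `$`, and once more using Python's `re` module as a stand-in for .NET Regex. Because the optional `\r?` is still part of the overall match, the sketch also wraps the content in a capture group and reads group 1; that group is an addition for illustration, not part of the pattern as posted. The sample strings are invented.

```python
import re

SAMPLE = "one\r\ntwo\r\nthree"  # invented sample

# Workaround as read here: lazy content, optional "\r", end of line.
PLAIN = re.compile(r"^.*?\r?$", re.MULTILINE)
# Same idea with a capture group around the content (added for this sketch).
GROUPED = re.compile(r"^(.*?)\r?$", re.MULTILINE)

whole_matches = PLAIN.findall(SAMPLE)
contents = [m.group(1) for m in GROUPED.finditer(SAMPLE)]
with_blank = [m.group(1) for m in GROUPED.finditer("a\r\n\r\nb")]

print(whole_matches)  # ['one\r', 'two\r', 'three']
print(contents)       # ['one', 'two', 'three']
print(with_blank)     # ['a', '', 'b']
```

Unlike (.+?)(?:\r\n|$), the grouped variant keeps blank lines, since `.*?` is allowed to match nothing.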

Please file a bug against this if you want to see it fixed in the next
service pack of .NET 2.0. I tried, but they keep closing the bug with the
message that they cannot reproduce it in Orcas, which is still far away
for quite a few of our customers.

Jesse
Jun 14 '07 #7
