Easy (?!) regular expression -- find line breaks

So, I'm trying to learn how the Regex class works, and I've been trying to
use it to do what I think ought to be simple things. Except I can't
figure out how to do everything I want. :(

If I want to take a string and break it into individual lines based on a
specific pattern ("\r\n" in this case, but I don't think it matters), I
can easily write a loop that does this by scanning through the string
accumulating characters and spitting out a new string each time it hits
the "\r\n". But I figured Regex ought to be able to do the scanning for
me, so that all I have to loop through are the matches.

I've tried a wide variety of expression strings, but the ones that seem to
come closest to what I want are:

"(.+)\r\n" -- works great, except that if the string doesn't terminate
in a "\r\n", the last line isn't matched

"(.+)(\r\n)*" -- the idea being to allow the last line to match if no
"\r\n" is found. works great, except that the "\r" winds up getting
captured as well (presumably because the second capture group is just
ignored and everything up to the "\n" gets captured by the first capture
group because the default is to be greedy)

"(.+?)(\r\n)*" -- works great, except that it's _too_ lazy, and
happily matches just a single character at a time
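
For anyone following along, here is a small sketch of how those three attempts behave. The sample string and the FirstGroups helper are made up for illustration and are not from the post itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AttemptsDemo
{
    // Hypothetical helper: collects what group 1 captured for every match.
    public static List<string> FirstGroups(string pattern, string input)
    {
        var captured = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            captured.Add(m.Groups[1].Value);
        }
        return captured;
    }

    static void Main()
    {
        // Assumed sample input: three lines, no "\r\n" after the last one.
        string text = "one\r\ntwo\r\nthree";
        string[] patterns = { @"(.+)\r\n", @"(.+)(\r\n)*", @"(.+?)(\r\n)*" };

        foreach (string pattern in patterns)
        {
            List<string> captured = FirstGroups(pattern, text);
            // Make any captured "\r" visible when printing.
            string shown = string.Join("|", captured.ToArray()).Replace("\r", "\\r");
            Console.WriteLine("{0} -> {1} match(es): {2}", pattern, captured.Count, shown);
        }
    }
}
```

With this input the first pattern should give 2 matches (the last line is dropped), the second 3 matches with a trailing "\r" on the first two, and the third 11 single-character matches.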

(Note: I'm using a replacement string specifying the first capture group
so that I can toss out the "\r\n", but if there's a way to match the
"\r\n" without it winding up in the match itself while at the same time
preventing it from being included in the subsequent match attempt, that
would be wonderful).

I also tried using single-line mode, trying to work around the problem in
the second example, but when I do that, the expression happily and
greedily captures _everything_ up to the very last "\r\n".

What I'm looking for is the expression that represents "capture all text
up to the first \r\n pair, allowing for the possibility of one last match
without the \r\n pair at the end of the string".

Is this actually impossible using Regex, or is there some combination of
options that will allow me to match the first \r\n pair without requiring
a \r\n pair at the end of the last match?

Thanks,
Pete
Jun 14 '07 #1
On Wed, 13 Jun 2007 20:55:10 -0700, Peter Duniho
<Np*********@nnowslpianmk.com> wrote:
> [...]
> If I want to take a string and break it into individual lines based on a
> specific pattern ("\r\n" in this case, but I don't think it matters), I
> can easily write a loop that does this by scanning through the string
> accumulating characters and spitting out a new string each time it hits
> the "\r\n". But I figured Regex ought to be able to do the scanning for
> me, so that all I have to loop through are the matches.

And just to clarify...

Yes, I understand that I can just use String.Split() to do this. I'm
talking about the more general question of the matching, and my little
self-assigned homework exercise to try to learn how Regex works.
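
For comparison, a minimal sketch of the String.Split() route mentioned above. The SplitLines wrapper and the sample text are assumptions for illustration only.

```csharp
using System;

static class SplitDemo
{
    // Splits on the literal "\r\n" pair only; a lone "\n" is left alone.
    public static string[] SplitLines(string text)
    {
        return text.Split(new string[] { "\r\n" }, StringSplitOptions.None);
    }

    static void Main()
    {
        foreach (string line in SplitLines("one\r\ntwo\r\nthree"))
        {
            Console.WriteLine("[" + line + "]");
        }
    }
}
```

With StringSplitOptions.None, text that ends in "\r\n" yields a final empty string; StringSplitOptions.RemoveEmptyEntries drops it, along with any blank lines.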
Jun 14 '07 #2
Peter Duniho wrote:
> [...]
> (Note: I'm using a replacement string specifying the first capture group
> so that I can toss out the "\r\n", but if there's a way to match the
> "\r\n" without it winding up in the match itself while at the same time
> preventing it from being included in the subsequent match attempt, that
> would be wonderful).

Use a non-capturing group: (?:\r\n)
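
A rough sketch of what the (?:...) form changes, assuming a made-up two-line sample string and a hypothetical GroupCount helper. The line break is still consumed by the overall match; it just no longer gets a group number of its own.

```csharp
using System;
using System.Text.RegularExpressions;

static class NonCapturingDemo
{
    // Hypothetical helper: number of groups in the first match,
    // counting group 0 (the whole match).
    public static int GroupCount(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    static void Main()
    {
        string text = "one\r\ntwo"; // assumed sample input

        Console.WriteLine(GroupCount(@"(.+)(\r\n)", text));   // whole match + 2 groups
        Console.WriteLine(GroupCount(@"(.+)(?:\r\n)", text)); // whole match + 1 group

        Match m = Regex.Match(text, @"(.+)(?:\r\n)");
        // The "\r\n" is part of m.Value, but group 1 holds only the line text.
        Console.WriteLine(m.Value == "one\r\n");
        Console.WriteLine(m.Groups[1].Value);
    }
}
```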

> [...]
> What I'm looking for is the expression that represents "capture all text
> up to the first \r\n pair, allowing for the possibility of one last
> match without the \r\n pair at the end of the string".

Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
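
Here is a minimal sketch of that pattern in use. The LineMatcher class, the Lines method, and the sample input are illustrative names, not anything from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LineMatcher
{
    // Line text, then either a "\r\n" pair or the end of the input.
    static readonly Regex LinePattern = new Regex(@"(.+?)(?:\r\n|$)");

    public static List<string> Lines(string text)
    {
        var lines = new List<string>();
        foreach (Match m in LinePattern.Matches(text))
        {
            lines.Add(m.Groups[1].Value); // group 1 excludes the "\r\n"
        }
        return lines;
    }

    static void Main()
    {
        foreach (string line in Lines("one\r\ntwo\r\nthree"))
        {
            Console.WriteLine("[" + line + "]");
        }
    }
}
```

One thing to keep in mind: because (.+?) needs at least one character, completely empty lines are skipped rather than returned as empty strings.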

--
Göran Andersson
_____
http://www.guffa.com
Jun 14 '07 #3
On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <gu***@guffa.com>
wrote:
> Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)

Ah. So simple. Thanks!
Jun 14 '07 #4
* Peter Duniho wrote, On 14-6-2007 19:25:
> On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <gu***@guffa.com>
> wrote:
>> Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
>
> Ah. So simple. Thanks!

Even easier would be to turn RegexOptions.Multiline on and look for
the following: "^.*$". This should match at the beginning of every line
(^), fetch the content (.*), and end at the end of each line ($).

It's probably faster as well.

Jesse
Jun 14 '07 #5
On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
<je***********@nospam-sogeti.nl> wrote:
> Even easier would be to turn RegexOptions.Multiline on and look for
> the following: "^.*$". This should match at the beginning of every line
> (^), fetch the content (.*), and end at the end of each line ($).

Except that as near as I can tell, Regex only uses Unix-style linebreaks.
That is, \n by itself. Which means that if I use the Multiline option
(which seems to be the default, actually), I wind up with the \r as part
of my matched strings, which I don't want.
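
A small sketch that shows the effect described here, using a made-up two-line string and a hypothetical Matches helper. It also runs the same pattern with RegexOptions.None, which as far as I know is what you get when no options are passed, so the two modes can be compared directly.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MultilineDemo
{
    // Hypothetical helper: every match of "^.*$" under the given options.
    public static List<string> Matches(string text, RegexOptions options)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, "^.*$", options))
        {
            found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        string text = "one\r\ntwo"; // assumed sample input

        // Multiline: $ matches just before each "\n", and '.' matches "\r",
        // so the "\r" ends up inside the first match.
        foreach (string s in Matches(text, RegexOptions.Multiline))
        {
            Console.WriteLine("[" + s.Replace("\r", "\\r") + "]");
        }

        // No options: ^ and $ only anchor to the whole string, and '.'
        // cannot cross the "\n", so nothing matches here.
        Console.WriteLine(Matches(text, RegexOptions.None).Count);
    }
}
```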

Pete
Jun 14 '07 #6
* Peter Duniho wrote, On 14-6-2007 20:39:
> On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
> <je***********@nospam-sogeti.nl> wrote:
>> Even easier would be to turn RegexOptions.Multiline on and look for
>> the following: "^.*$". This should match at the beginning of every line
>> (^), fetch the content (.*), and end at the end of each line ($).
>
> Except that as near as I can tell, Regex only uses Unix-style
> linebreaks. That is, \n by itself. Which means that if I use the
> Multiline option (which seems to be the default, actually), I wind up
> with the \r as part of my matched strings, which I don't want.

This shouldn't be so, but does seem to be the case in .NET 2.0. I've
filed a bug against it and it should be fixed in the Orcas framework. As
far as I can remember, it wasn't this way in .NET 1.0 and 1.1.

^.*?\r?$ should fix it in the meantime, but is probably slower.
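
A sketch of a workaround along these lines, assuming the pattern is meant to end in \r?$ and adding a capture group (my addition) so that the optional "\r" stays out of the captured text. CrLfWorkaround and Lines are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CrLfWorkaround
{
    // Group 1 is the line text; the optional "\r" before each line end
    // is matched outside the group, so it is not captured.
    static readonly Regex LinePattern =
        new Regex(@"^(.*?)\r?$", RegexOptions.Multiline);

    public static List<string> Lines(string text)
    {
        var lines = new List<string>();
        foreach (Match m in LinePattern.Matches(text))
        {
            lines.Add(m.Groups[1].Value);
        }
        return lines;
    }

    static void Main()
    {
        // Assumed sample input with a blank line in the middle.
        foreach (string line in Lines("one\r\n\r\ntwo"))
        {
            Console.WriteLine("[" + line + "]");
        }
    }
}
```

Note that m.Value still contains the "\r" here; only group 1 is clean. Unlike a (.+?) pattern, this one also reports blank lines as empty strings.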

Please file a bug against this as well if you want to see it fixed in the
next service pack of .NET 2.0. I tried, but they keep closing the bug
with the message that they cannot reproduce it in Orcas, which is still
a long way off for quite a few of our customers.

Jesse
Jun 14 '07 #7
