how do I handle linebreaks in Regex?

I'm trying to do some regex in C# but for some reason linebreaks are causing
my regex to not work.

the test string goes like this:

string ss = "<tagname something=45678&somethingelse=12345>blah</tagname>\r\n<tag2>stuff</tag2>";

and my regex code is like:

Regex pat = new Regex("something=([0-9]*).*>(.*)<.*<tag2>(.*)</tag2>",
    RegexOptions.Multiline);

foreach (Match m in pat.Matches(ss)) {

    foreach (Group g in m.Groups) {
        Console.Write(g + ", ");
    }
    Console.WriteLine();
}

but it never works unless I remove the "\r\n" from the test string.

how do I get around that? I thought that's what the RegexOptions.Multiline
was supposed to take care of?
Aug 22 '06 #1

Oops, looks like I should have used "Singleline" option, not Multiline!
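
For anyone else mixing up the two options, here is a minimal sketch of the difference. The sample string is made up for illustration, and the code assumes C# 9 or later so it can run as top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: a '>' and a '<' separated only by a line break.
string text = "first>\r\n<second";

// Default options: '.' matches any character except '\n',
// so ">.*<" cannot get across the line break.
bool plain = Regex.IsMatch(text, ">.*<");

// Multiline only changes '^' and '$' so they also match at line
// boundaries; '.' behaves exactly as before.
bool multiline = Regex.IsMatch(text, ">.*<", RegexOptions.Multiline);

// Singleline makes '.' match every character, including '\n'.
bool singleline = Regex.IsMatch(text, ">.*<", RegexOptions.Singleline);

Console.WriteLine($"{plain} {multiline} {singleline}"); // False False True
```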
Aug 22 '06 #2
Ok there's definitely something I am not understanding with Regular
Expression...

when I switched to Singleline mode it worked great for that small test
string but in the main file all hell broke loose and my Console seems to be
printing out the entire file over and over in some infinite loop.

As another test I tried this:

Regex pat = new Regex("<a.*>(.*)</a>", RegexOptions.Singleline);

on a regular web page, and again same problem... it just starts spitting out
the whole file over and over. If I switch to Multiline mode it works great
but it does not pick up any <a> tags which are broken up by one or more
newlines... how do I get around this?
Aug 22 '06 #3
Instead of . use [\S\s]
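
A rough sketch of that suggestion, using a made-up two-link string and assuming C# 9 or later for top-level statements. The lazy `*?` is an addition here, not part of the one-line tip above: without it, `[\s\S]*` is just as greedy as `.*` in Singleline mode.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: the first tag is split across two lines.
string html = "<a\r\nhref=\"x\">one</a>\r\n<a>two</a>";

// [\s\S] matches any character, newlines included, with no RegexOptions.
// The lazy quantifier *? keeps each match as short as possible.
MatchCollection links = Regex.Matches(html, @"<a[\s\S]*?>([\s\S]*?)</a>");

foreach (Match m in links)
{
    Console.WriteLine(m.Groups[1].Value); // prints "one", then "two"
}

// The greedy form swallows everything up to the last </a>: one big match.
int greedyCount = Regex.Matches(html, @"<a[\s\S]*>([\s\S]*)</a>").Count;
Console.WriteLine(greedyCount); // 1
```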

-- Alan
Aug 22 '06 #4
Here's the problem:

<a.*>(.*)</a>

You're using the '.' metacharacter. By default this means "any character
that is not a newline character". When the Singleline option is on (not
Multiline, which only changes how '^' and '$' behave), the '.' matches
newlines too, and therefore matches everything. So, when you have Singleline
off, the newline character sequence breaks the match. When you have it turned
on, the greedy '.*' matches everything from the first tag until the end of
the last one. Here:

<a
something=45678&somethingelse=12345>blah</a>
<a>stuff</a>

With Singleline ON, the first tag matches, and so does every character after
it, until the last "</a>" in the string.

Using the '.' wildcard is to Regular Expressions what using an
Atomic Bomb is to warfare. You want to be as specific as possible, rather
than the opposite.

In the example below, I use a very specific character class: [^<] - This
means any character that is NOT a '<' (or a '>' in another case). This way,
the match stops where the first '<' character is found, and the rest of the
match is evaluated from the remaining portion of the string. The following
will find the 2 (and only 2) matches in your example. Each matching value
will be in Group 1 of the match.

<a[^>]*>([^<]*)</a>
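
As a quick check of the pattern above, here is a small sketch that runs it over the three-line example. It assumes C# 9 or later (top-level statements) and that the line breaks in the input are "\r\n".

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<a\r\nsomething=45678&somethingelse=12345>blah</a>\r\n<a>stuff</a>";

// [^>]* and [^<]* are negated character classes, so they match newlines
// without needing any RegexOptions.
Regex anchor = new Regex("<a[^>]*>([^<]*)</a>");
MatchCollection found = anchor.Matches(html);

foreach (Match m in found)
{
    Console.WriteLine(m.Groups[1].Value); // prints "blah", then "stuff"
}
```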

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.
Aug 22 '06 #5
Kevin, thanks for that tip, it works great for that example!

So there is no way in regex to say something like, accept all characters
until you hit a specific group of characters, like "</div>"? Like let's say
you are scanning a web page for a specific opening <div> tag, and you want to
grab all the text between that and the next closing </div> tag, so the
contents may include many <'s and >'s inside. I guess regex is not the way to
go for doing something like that?
Aug 22 '06 #6
Hi Mr. Nobody,

Actually, Regex is quite capable of handling this sort of situation. In the
solution I gave you I went for the simplest solution necessary, as I
understood it. Your second example was using an anchor tag, which I assumed
would not contain other tags. In a case where other tags might be nested, you
would need to use a different set of Regex tools.

For example, to get all text between 2 matching beginning and ending tags,
when there are no nested tags, you would use something like:

<([^>]*)>([^<]*)</\1>

This indicates that a match begins with the left angle bracket. The left
angle bracket is followed by a sequence of any length of characters that are
NOT a right angle bracket. This sequence of characters is put into Group 1
and is followed by a right angle bracket. Next comes a sequence of any length
of characters that are NOT a left angle bracket (Group 2), followed by a left
angle bracket and a forward-slash. The last part of the match is that the
text from the first tag (Group 1) is matched, followed by a right angle
bracket. Because Group 1 holds everything inside the opening tag, this only
matches tags that have no attributes.
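
A small sketch of the backreference pattern in action, on a made-up string and assuming C# 9 or later for top-level statements. The second pattern, which captures only the tag name with `\w+`, is an illustrative variant for tags with attributes and not something taken from the explanation above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: two plain tags and one tag with an attribute.
string html = "<b>bold</b> <i>italic</i> <p class=\"x\">para</p>";

// Group 1 captures everything between '<' and '>', so </\1> can only
// match when the opening tag carries no attributes.
Regex pair = new Regex(@"<([^>]*)>([^<]*)</\1>");
MatchCollection plainTags = pair.Matches(html);

foreach (Match m in plainTags)
{
    Console.WriteLine($"{m.Groups[1].Value}: {m.Groups[2].Value}");
}
// Prints "b: bold" and "i: italic"; the <p class="x"> element is skipped.

// Variant (an assumption): capture just the tag name, then allow attributes.
Regex named = new Regex(@"<(\w+)[^>]*>([^<]*)</\1>");
MatchCollection allTags = named.Matches(html);
Console.WriteLine(allTags.Count); // 3
```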

For tags that contain nested tags, something like the following might work:

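<(table|form|div)[^>]*>(.*?)</\1>

This indicates that tables, forms, and divs (I'm sure I may have missed one
or two) are matched. The ending tag uses the group captured from the first
tag. Group 2 contains the content. The lazy '.*?' takes as few characters as
possible, so the content ends at the first closing tag of the same name. Turn
on RegexOptions.Singleline if the content can span several lines.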
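
Roughly how that might look in C#, as a sketch on a made-up fragment (C# 9 or later assumed for top-level statements). Singleline is switched on here because the sample content spans lines.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: a div whose content contains other tags and line breaks.
string html = "<div id=\"post\">\r\n<b>Hello</b> <i>world</i>\r\n</div>\r\n<form>x</form>";

// Singleline lets '.' cross line breaks; the lazy .*? stops at the first
// closing tag whose name equals Group 1.
Regex block = new Regex(@"<(table|form|div)[^>]*>(.*?)</\1>", RegexOptions.Singleline);
MatchCollection blocks = block.Matches(html);

foreach (Match m in blocks)
{
    Console.WriteLine($"{m.Groups[1].Value}: [{m.Groups[2].Value.Trim()}]");
}
// div: [<b>Hello</b> <i>world</i>]
// form: [x]

// Without Singleline the multi-line div is missed and only the form matches.
int withoutSingleline = Regex.Matches(html, @"<(table|form|div)[^>]*>(.*?)</\1>").Count;
Console.WriteLine(withoutSingleline); // 1
```

Keep in mind that a div nested inside another div would end the match at the inner closing tag, since the lazy quantifier stops at the first `</div>` it meets.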

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.
Aug 23 '06 #7
Awesome!

Thanks Kevin, you were a BIG help. Thanks to you I was able to do what I
needed without resorting to trying to parse the HTML response, which would
have been a nightmare and really slow compared to regex!

I needed to write a program which basically monitors forum responses by
stripping out key strings like the post itself, its thread ID and the user's
name and then performing some checks on them.

Thanks again for your help
Aug 23 '06 #8
