Regex greedy/lazy problem

I have a scenario where a string is sent in chunks to my app. I need to be
able to identify certain tags in this partial string as it arrives.
e.g.
<DALFile>xxxxxxxxx</DALFile>

I need to be able to have a regex that will capture the start, middle and
end of this file based on the tags. The problem is that the end tag may not
always be present, and the content (xxxxxxx in this case) may contain
carriage returns.

My attempt so far would be along the lines:
(?<startTag><DALFile>)(?<content>.*)(?<endTag></DALFile>)?

This will work but will return the </DALFile> in the <content> group.
Making the .* lazy (i.e. .*?) will work but only if the end tag is present,
which may not always be the case as the string is chunked.
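
To make the greedy/lazy behaviour described above concrete, here is a minimal sketch. It is written in Python rather than .NET, so named groups are spelled (?P<name>...) instead of (?<name>...); the helper name content_of is made up for illustration. One thing the sketch suggests: with the end-tag group left optional, the lazy form seems to stop immediately and capture nothing, and it only behaves as described when the end tag is made mandatory.

```python
import re

# Python spells named groups (?P<name>...); the thread's .NET syntax is (?<name>...).
START = r"(?P<startTag><DALFile>)"
END = r"(?P<endTag></DALFile>)"

greedy_optional = re.compile(START + r"(?P<content>.*)" + END + "?")
lazy_optional = re.compile(START + r"(?P<content>.*?)" + END + "?")
lazy_required = re.compile(START + r"(?P<content>.*?)" + END)

complete = "<DALFile>xxxxxxxxx</DALFile>"
partial = "<DALFile>xxxxxxxxx"


def content_of(pattern, text):
    """Return the content group, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group("content") if m else None


# Greedy: the content group swallows the end tag.
print(repr(content_of(greedy_optional, complete)))
# Lazy with an optional end tag: the content group is empty.
print(repr(content_of(lazy_optional, complete)))
# Lazy with a mandatory end tag: fine on a complete element...
print(repr(content_of(lazy_required, complete)))
# ...but no match at all while the end tag has not arrived yet.
print(repr(content_of(lazy_required, partial)))
```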

The following also works but if there's a carriage return in xxxxxx it does
not return all the content:
(?<startTag><DALFile>)(?<content>[^</DALFile>]*)(?<endTag></DALFile>)?
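
A possible explanation for the truncated content, sketched in Python (named groups spelled (?P<name>...) there; content_of is a made-up helper): [^</DALFile>] is a character class, so it excludes each of the individual characters < / D A L F i l e > rather than the sequence "</DALFile>". A newline is still accepted by the class, but any content containing one of those letters is cut short. The sample strings are invented for illustration.

```python
import re

# [^</DALFile>] is a set of single characters, not the literal end tag:
# it rejects any one of  <  /  D  A  L  F  i  l  e  >
pattern = re.compile(r"<DALFile>(?P<content>[^</DALFile>]*)")


def content_of(text):
    """Return whatever the negated class managed to capture."""
    m = pattern.search(text)
    return m.group("content") if m else None


# A newline is not in the set, so it is captured like any other character.
print(repr(content_of("<DALFile>xx\nxx</DALFile>")))
# The letter 'e' is in the set, so the capture stops after the first 'r'.
print(repr(content_of("<DALFile>report.txt</DALFile>")))
```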

Would someone be able to point out a way that would suit all scenarios?
Thanks in advance.
Jun 19 '07 #1
You really haven't clarified your rules. Several things are not clear.

Will the "chunks" ever split the tags themselves?

What sort of characters may be in the "content" between the tags?

Making a couple of assumptions, I came up with the following:

(?:(?<startTag><DALFile>))?(?<content>[^<]*)(?:(?<endTag></DALFile>))?

This can be broken up into 3 sections:

(?:(?<startTag><DALFile>))?

0 or 1 sequence of "<DALFile>" - assumption that it is never broken.

(?<content>[^<]*)

0 or more characters that are NOT '<' - assumption that the '<' character
may not appear between the tags.

(?:(?<endTag></DALFile>))?

0 or 1 sequences of "</DALFile>" - assumption that it is never broken.
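
The three sections above can be tried out with a short sketch. It is in Python rather than .NET (named groups become (?P<name>...)), and the function name parse and the sample chunks are made up for illustration. It suggests that [^<]* does accept newlines, and that the content stops at the first '<' that does not begin the end tag.

```python
import re

# The same three optional sections, in Python's named-group syntax.
three_part = re.compile(
    r"(?:(?P<startTag><DALFile>))?"   # 0 or 1 start tag
    r"(?P<content>[^<]*)"             # anything that is not '<'
    r"(?:(?P<endTag></DALFile>))?"    # 0 or 1 end tag
)


def parse(chunk):
    """Return (startTag, content, endTag); missing tags come back as None."""
    # Every section is optional, so match() always succeeds.
    m = three_part.match(chunk)
    return m.group("startTag"), m.group("content"), m.group("endTag")


print(parse("<DALFile>xxx\nyyy</DALFile>"))  # complete element, with a newline
print(parse("<DALFile>xxx\nyy"))             # end tag not yet received
print(parse("yyy</DALFile>"))                # start tag was in an earlier chunk
print(parse("<DALFile>a<b</DALFile>"))       # '<' inside the content cuts it short
```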

Why are you not waiting until you get all of the string to parse it, rather
than attempting to parse "chunks"?

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
Jun 19 '07 #2
Hi Kevin, and thanks for the response.

Yes - I was concerned about overcomplicating the message so I omitted a few
rules.
Basically I have a series of files transferred through sockets and the
receiving socket is parsing the data as it arrives - and is not waiting for
the whole stream to arrive.

The files may be either ASCII or binary but all are transferred as binary. The
receiving socket uses the GetString method on the byte array and parses that
when it determines that the start/end of a file is in the current chunk. So
there may well be angle brackets inside the string in addition to those
introduced by the sending socket.

Yes, the tags may be split but I can handle the case when there is no match
easily enough.

I've taken a look at your solution and it doesn't appear to handle newline
characters for the content. From my reading it appears that the DOT can treat
carriage returns as characters but I am unsure what other constructs are
available for this.

Thanks again for the reply.

Sean
Jun 19 '07 #3
You can make the dot match a newline by preceding the expression with
"(?s)" - the inline option for single-line mode ("dot matches newline") - as
in the following:

(?s)(?:(?<startTag><DALFile>))?(?<content>.*)(?<endTag></DALFile>)?

The problem here is that the "content" group will now absorb the entire
remaining part of the string.
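
A small sketch of that absorption, in Python (named groups spelled (?P<name>...)). The second pattern, lazy_or_end, is not something proposed in this thread; it is one possible variation, assuming a lazy content group that must be followed either by the end tag or by the end of the input (\Z in Python; .NET spells the strict end-of-string anchor \z). It still cannot tell a split end tag from content, which is the problem discussed next.

```python
import re

# Greedy content with "dot matches newline": everything lands in <content>.
greedy = re.compile(
    r"(?s)(?:(?P<startTag><DALFile>))?(?P<content>.*)(?P<endTag></DALFile>)?"
)

# Hypothetical variation: lazy content, then either the end tag or end of input.
lazy_or_end = re.compile(
    r"(?s)(?P<startTag><DALFile>)(?P<content>.*?)(?:(?P<endTag></DALFile>)|\Z)"
)

complete = "<DALFile>a\nb</DALFile>"
partial = "<DALFile>a\nb"
split_tag = "<DALFile>a<b</DAL"

g = greedy.match(complete)
print(repr(g.group("content")), g.group("endTag"))   # end tag absorbed

for text in (complete, partial, split_tag):
    m = lazy_or_end.match(text)
    # Content stops at the end tag when present, otherwise runs to the end.
    print(repr(m.group("content")), m.group("endTag"))
```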

In addition, one of your conditions makes the situation highly problematic:

"there may well be angle brackets inside the string in addition to those
introduced by the sending socket."

I suspected that you might simply be trying to parse each bit that comes
through, and I think the solution is a compromise on your original
requirement. Parse the text in chunks that begin and end with the beginning
and ending tags. That is, don't attempt to use a regular expression until
you have a string ending with the end tag. This can be done by using a
second string buffer and appending each chunk received to it. When the end
tag arrives in a chunk, append only the part of the chunk up to and
including the end tag (keeping the remainder for the next element), then
parse the resulting string and continue receiving.
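
The buffering idea can be sketched roughly as follows. This is Python, not .NET, and the class name ElementBuffer and its feed method are invented for illustration; it also assumes the literal text "</DALFile>" never occurs inside the content itself. Because the search runs on the accumulated buffer, a tag split across two chunks is found once both halves have arrived.

```python
import re

END_TAG = "</DALFile>"
# Only applied to strings that are already known to end with the end tag.
ELEMENT = re.compile(r"(?s)<DALFile>(?P<content>.*?)</DALFile>")


class ElementBuffer:
    """Accumulates chunks and hands back the content of completed elements."""

    def __init__(self):
        self._buffer = ""

    def feed(self, chunk):
        """Append a chunk; return the contents of all elements it completed."""
        self._buffer += chunk
        completed = []
        while True:
            end = self._buffer.find(END_TAG)
            if end == -1:
                break  # no complete element yet, keep waiting
            cut = end + len(END_TAG)
            # Split off everything up to and including the end tag.
            element, self._buffer = self._buffer[:cut], self._buffer[cut:]
            m = ELEMENT.search(element)
            if m:
                completed.append(m.group("content"))
        return completed


buf = ElementBuffer()
print(buf.feed("<DALFile>xxx\nxx</DAL"))   # end tag split: nothing yet
print(buf.feed("File><DA"))                # first element completes
print(buf.feed("LFile>y<z</DALFile>"))     # second element, '<' in content
```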

Here's why. Imagine a section that comes through as follows:

<DALFile>xxxxxxxxx</DAL

How do you identify the content?

According to your requirements, the following would be a legitimate element,
as you've said that angle brackets may appear prior to the end tag:

<DALFile>xxxxxxxxx</DAL</DALFile>

Again, what if a chunk comes through as follows:

LFile>xxx

The only way to ensure that you have a complete element is to get a complete
element to parse. In that case, you can use:

(?s)(?:(?<startTag><DALFile>))(?<content>.*)(?<endTag></DALFile>)

This requires that both the start and end tags are present, and will match
correctly.

If you have more than one tag, you can use a more generic approach:

(?s)(?:(?<startTag><([^>]+)>))(?<content>.*)(?<endTag></\1>)

This identifies the tag name of the start tag with a numbered capturing
group, and uses a reference to that tag name in the end capturing group.
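
A rough Python equivalent of the generic pattern, for anyone who wants to experiment. Two differences are worth flagging: Python spells named groups (?P<name>...), and it numbers all groups in order of their opening parenthesis, whereas .NET numbers unnamed groups before named ones (which is why \1 points at the tag name in the expression above). To sidestep the numbering question, this sketch gives the tag name its own group, tagName (a name made up here), and refers back to it with (?P=tagName).

```python
import re

# Same idea as the generic expression, with a named backreference.
generic = re.compile(
    r"(?s)"
    r"(?P<startTag><(?P<tagName>[^>]+)>)"   # start tag; remember its name
    r"(?P<content>.*)"                      # everything in between
    r"(?P<endTag></(?P=tagName)>)"          # end tag must repeat the same name
)

m = generic.search("<DALFile>line one\nline two</DALFile>")
print(m.group("tagName"), repr(m.group("content")))

m = generic.search("<Report>a<b</Report>")
print(m.group("tagName"), repr(m.group("content")))

# Mismatched tag names do not match at all.
print(generic.search("<DALFile>x</Other>"))
```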

--
HTH,

Kevin Spencer
Microsoft MVP

Jun 20 '07 #4
