Using regular expressions to parse text (HTML)

I am trying to scan an HTML string for all content within certain tags. In the
simplified example below, I would like to scan the source text for all
occurrences of text that start with "A" and end in "c" without any
overlapping. In the following example, the regular expression finds only one
result that includes everything from the first "A" to the last "c". This is
not what I want. I want every non-overlapping occurrence of "A" through "c".
The result set should be

Abc, Abc, Abxc, Abxc

NOT

AbcbbAbc something elseAbxcXYZAbxc

as it is now.

Does anyone know how this can be done?

Thanks

Earl

private void ParseTest()
{
    ListBox2.Items.Clear();
    string SourceString = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";
    Regex r = new Regex("A.+c");
    MatchCollection mc = r.Matches(SourceString);
    foreach (Match m in mc)
    {
        ListBox2.Items.Add(m.ToString());
    }
}

Outputs only one result
==> AbcbbAbc something elseAbxcXYZAbxc
Nov 16 '05 #1
Earl Teigrob wrote:
I am trying to scan an HTML string for all content within certain tags.
In the simplified example below, I would like to scan the source text
for all occurrences of text that start with "A" and end in "c" without
any overlapping. In the following example, the regular expression
finds only one result that includes everything from the first "A" to
the last "c". This is not what I want. I want every non-overlapping
occurrence of "A" through "c". The result set should be
[...]
Outputs only one result
==> AbcbbAbc something elseAbxcXYZAbxc


You ALWAYS get only one result for regex strings!

Maybe you should try to match ONE occurrence, find out its length
(match[1]), and then use the remaining right-hand part of the string to do
another match, until the whole string has been matched...

And the regex should look like: "(^(A[^c]*c))"
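
A minimal sketch of the one-match-at-a-time idea, for illustration only. It assumes the .NET calls Regex.Match and Match.NextMatch in place of slicing the string by hand, and it drops the "^" anchor from the pattern above, since the anchored form can only match at the very start of the input. The class name is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

class OneMatchAtATime
{
    static void Main()
    {
        string source = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";

        // Assumption: unanchored variant of the suggested pattern.
        Regex r = new Regex("A[^c]*c");

        // Match returns the first hit; NextMatch resumes the search
        // just past the end of the previous match.
        for (Match m = r.Match(source); m.Success; m = m.NextMatch())
        {
            Console.WriteLine(m.Value);   // Abc, Abc, Abxc, Abxc (one per line)
        }
    }
}
```

The loop stops when m.Success turns false, so the trailing "Ab" (which has no closing "c") is simply skipped.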
--
Greetings
Jochen

Do you need a memory-leak finder ?
http://www.codeproject.com/tools/leakfinder.asp

Do you need daily reports from your server?
http://sourceforge.net/projects/srvreport/
Nov 16 '05 #2
".+" and ".*" will match as many characters as possible.
Use lazy quantifiers ".+?" or ".*?" instead.
That should do what you want.
Two pieces of advice:
1. Get some regular expression testing environment - I'm using Expresso, but
I guess there are others, too.
2. If you want to understand what you're doing, get a good book on the
topic!

Niki

PS: Maybe I misunderstood that other post: Of course "Matches" returns more
than one match if there is more than one match. I didn't test it, but I
think your code should run fine if you use ".+?" or "[^c]+".
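
To make the greedy/lazy difference concrete, here is a small console sketch that runs both quantifiers over the string from the original post. The class name and the console output are assumptions for illustration; the original code writes to a ListBox instead.

```csharp
using System;
using System.Text.RegularExpressions;

class GreedyVersusLazy
{
    static void Main()
    {
        string source = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";

        // Greedy: ".+" grabs as much as it can, so the match runs from the
        // first "A" to the last "c" and there is only one result.
        foreach (Match m in Regex.Matches(source, "A.+c"))
            Console.WriteLine("greedy: " + m.Value);
        // greedy: AbcbbAbc something elseAbxcXYZAbxc

        // Lazy: ".+?" stops at the first "c" it can reach, so each
        // occurrence becomes its own match.
        foreach (Match m in Regex.Matches(source, "A.+?c"))
            Console.WriteLine("lazy:   " + m.Value);
        // lazy:   Abc, Abc, Abxc, Abxc (one per line)
    }
}
```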
Nov 16 '05 #3
Niki Estner wrote:
PS: Maybe I misunderstood that other post: Of course "Matches" returns
more than one match if there is more than one match. I didn't test it,
but I think your code should run fine if you use ".+?" or "[^c]+".


Sorry for the misunderstanding from my side...
The following works well:

<code>
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Class1
    {
        static void Main(string[] args)
        {
            string SourceString = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";
            // "[^c]*" cannot run past a "c", so each match ends at the first "c" after "A".
            Regex r = new Regex("A[^c]*c");

            MatchCollection mc = r.Matches(SourceString);
            foreach (Match m in mc)
            {
                // Groups[0] is the whole match.
                System.Console.WriteLine(m.Groups[0]);
            }
        }
    }
}
</code>

It also matches "Ac". If you do not want this, use "A[^c]+c" instead.
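
A short sketch of the difference between the two patterns. The input string here is a hypothetical one chosen because it contains a bare "Ac"; the class name is also made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

class StarVersusPlus
{
    static void Main()
    {
        // Hypothetical input that contains a bare "Ac".
        string source = "Ac, Abc, Abxc";

        // "*" allows zero characters between "A" and "c".
        foreach (Match m in Regex.Matches(source, "A[^c]*c"))
            Console.WriteLine("star: " + m.Value);   // Ac, Abc, Abxc

        // "+" requires at least one non-"c" character in between,
        // so the bare "Ac" is skipped.
        foreach (Match m in Regex.Matches(source, "A[^c]+c"))
            Console.WriteLine("plus: " + m.Value);   // Abc, Abxc
    }
}
```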

--
Greetings
Jochen

Nov 16 '05 #4
Perfect, thanks for the info. I have done a fair bit with regular
expressions but I was not aware of the concept of lazy quantifiers. This gets
me on track...

and... I will check out one of the regex testing environments... great advice!

Earl
Nov 16 '05 #5
