Using regular expressions to parse text (HTML)

Earl Teigrob

I am trying to scan an HTML string for all content within certain tags. In the
simplified example below, I would like to scan the source text for all
occurrences of text that start with "A" and end in "c" without any
overlapping. In the following example, the regular expression finds only one
result that includes everything from the first "A" to the last "c". This is
not what I want. I want every non-overlapping occurrence that runs from an
"A" to the next "c". The result set should be

Abc, Abc, Abxc, Abxc

NOT

AbcbbAbc something elseAbxcXYZAbxc

as it is now

Does anyone know how this can be done?

Thanks

Earl

private void ParseTest()
{
    ListBox2.Items.Clear();

    string SourceString = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";

    Regex r = new Regex("A.+c");

    MatchCollection mc = r.Matches(SourceString);

    foreach (Match m in mc)
    {
        ListBox2.Items.Add(m.ToString());
    }
}

Outputs only one result
==> AbcbbAbc something elseAbxcXYZAbxc

Nov 16 '05 #1


Jochen Kalmbach

Earl Teigrob wrote:

I am trying to scan an HTML string for all content within certain tags.
In the simplified example below, I would like to scan the source text
for all occurrences of text that start with "A" and end in "c" without
any overlapping. In the following example, the regular expression
finds only one result that includes everything from the first "A" to
the last "c". This is not what I want. I want every non-overlapping
occurrence that runs from an "A" to the next "c". The result set should be
[...]
Outputs only one result
==> AbcbbAbc something elseAbxcXYZAbxc

You ALWAYS get only one result for regex-strings!

Maybe you should try to match ONE occurrence, then find out its length
(match[1]) and use the remaining right-hand part of the string to do another
match, until the whole string has been matched...

And the regex should look like: "(^(A[^c]*c))"
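
The one-match-at-a-time idea can be sketched without cutting the string up by hand, since .NET's Match class has a NextMatch() method that resumes the search where the previous match ended. This is only a rough sketch: the class name is made up, and the pattern is used without the leading "^", because the sample string starts with "XYZ" and a pattern anchored to the start of the string would find nothing in it.

```csharp
using System;
using System.Text.RegularExpressions;

class OneMatchAtATime   // hypothetical name, for illustration only
{
    static void Main()
    {
        string source = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";

        // Unanchored pattern: "A", then any run of non-"c" characters, then "c"
        Regex r = new Regex("A[^c]*c");

        // Take the first match, then keep asking for the next one
        Match m = r.Match(source);
        while (m.Success)
        {
            Console.WriteLine("{0} (starts at index {1})", m.Value, m.Index);
            m = m.NextMatch();   // continues after the end of the previous match
        }
    }
}
```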
--
Greetings
Jochen

Do you need a memory-leak finder ?
http://www.codeproject.com/tools/leakfinder.asp

Do you need daily reports from your server?
http://sourceforge.net/projects/srvreport/

Nov 16 '05 #2

Niki Estner

".+" and ".*" will match as many characters as possible.
Use lazy quantifiers ".+?" or ".*?" instead.
That should do what you want.
Two pieces of advice:
1. Get some regular expression testing environment - I'm using Expresso, but
I guess there are others, too.
2. If you want to understand what you're doing, get a good book on the
topic!

Niki

PS: Maybe I misunderstood that other post: Of course "Matches" returns more
than one match if there is more than one match. I didn't test it, but I
think your code should run fine if you use ".+?" or "[^c]+".
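
A small console sketch of the difference, using the sample string from the question (the class name is arbitrary, and Console output stands in for the ListBox):

```csharp
using System;
using System.Text.RegularExpressions;

class LazyQuantifierDemo   // hypothetical name, for illustration only
{
    static void Main()
    {
        string source = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";

        // Greedy: ".+" grabs as much as it can, so the match runs to the last "c"
        foreach (Match m in Regex.Matches(source, "A.+c"))
            Console.WriteLine("greedy: " + m.Value);
        // greedy: AbcbbAbc something elseAbxcXYZAbxc

        // Lazy: ".+?" grabs as little as it can, so each match stops at the first "c"
        foreach (Match m in Regex.Matches(source, "A.+?c"))
            Console.WriteLine("lazy:   " + m.Value);
        // lazy:   Abc
        // lazy:   Abc
        // lazy:   Abxc
        // lazy:   Abxc
    }
}
```

The trailing "Ab" is never matched by either pattern because no "c" follows it.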


Nov 16 '05 #3

Jochen Kalmbach

Niki Estner wrote:

PS: Maybe I misunderstood that other post: Of course "Matches" returns
more than one match if there is more than one match. I didn't test it,
but I think your code should run fine if you use ".+?" or "[^c]+".

Sorry for the misunderstanding from my side...
The following works well:

<code>
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Class1
    {
        static void Main(string[] args)
        {
            string SourceString = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";
            // "[^c]*" cannot run past a "c", so each match ends at the first "c" after an "A"
            Regex r = new Regex("A[^c]*c");

            MatchCollection mc = r.Matches(SourceString);
            foreach (Match m in mc)
            {
                // Groups[0] is the whole match
                System.Console.WriteLine(m.Groups[0]);
            }
        }
    }
}
</code>

It also matches "Ac"; if you do not want this, use "A[^c]+c" instead.
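
To see the "*" versus "+" difference in isolation, here is a short sketch with a made-up input that contains a bare "Ac" (the input string and class name are assumptions, not part of the original example):

```csharp
using System;
using System.Text.RegularExpressions;

class StarVersusPlus   // hypothetical name, for illustration only
{
    static void Main()
    {
        // Hypothetical input: one "Ac" with nothing in between, one "Abc"
        string source = "Ac Abc";

        // "*" allows zero characters between "A" and "c"
        foreach (Match m in Regex.Matches(source, "A[^c]*c"))
            Console.WriteLine("star: " + m.Value);   // star: Ac, then star: Abc

        // "+" requires at least one non-"c" character in between
        foreach (Match m in Regex.Matches(source, "A[^c]+c"))
            Console.WriteLine("plus: " + m.Value);   // plus: Abc only
    }
}
```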

--
Greetings
Jochen

Nov 16 '05 #4

Earl Teigrob

Perfect, thanks for the info. I have done a fair bit with regular
expressions but I was not aware of the concept of lazy quantifiers. This gets
me on track...

and...I will check out one of the regex testing environments...great advice!

Earl
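
Carried back to the original goal of pulling content out of certain tags, the lazy quantifier might be used roughly as below. This is a sketch under assumptions: the "<b>" tag and the input string are invented for illustration, and a plain regex like this does not cope with attributes on the tag or with nested tags of the same name.

```csharp
using System;
using System.Text.RegularExpressions;

class TagContentDemo   // hypothetical name, for illustration only
{
    static void Main()
    {
        // Hypothetical input; the <b> tag is just an example
        string html = "<p>One <b>first</b> and\n<b>second\nline</b> done</p>";

        // Lazy ".*?" stops at the nearest closing tag.
        // RegexOptions.Singleline lets "." match newlines, so content may span lines.
        Regex r = new Regex("<b>(.*?)</b>", RegexOptions.Singleline);

        foreach (Match m in r.Matches(html))
        {
            // Groups[1] is the text captured between the tags
            Console.WriteLine("[" + m.Groups[1].Value + "]");
        }
    }
}
```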


Nov 16 '05 #5
