Regular Expression Matches

Pete Davis

I'm using regular expressions to extract some data and some links from some
web pages. I download the page and then I want to get a list of certain
links.

For building regular expressions, I use an app called The Regulator, which
makes it pretty easy to build and test regular expressions.

As a warning, I'm real weak with regular expressions. Let's say my regular
expression is:

(href=)(?<caturl>.*)(class=title>\[   )

Now, using The Regulator and giving it the source for a particular web page,
I get 8 matches.

According to The Regulator, the options it's using are:

Multiline, ignore case, ignore whitespace

In my own code, I'm doing:

Regex indexRegex = new Regex(categoryListRegex,
    RegexOptions.Multiline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.IgnoreCase);
MatchCollection indexMatches = indexRegex.Matches(pageText);

This only returns one match in indexMatches with the same page that I'm
giving The Regulator. It seems that no matter what combination of regex
options I use, I'm only getting one match.

Why is that? How do I get all 8 matches?

Thanks.

pete

Feb 21 '06 #1

Kevin Spencer

Hi Pete,

You need to escape the '<' and '>' characters in your Regular Expression.
These are used in some flavors of Regular Expression language to indicate a
named group. If the first (<caturl>) is a group name, name both groups or
neither.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.

Feb 21 '06 #2

Pete Davis

Well, there are 3 groups. <caturl> is a group name. The other 2 are unnamed.
Why do they need to be named if <caturl> is named? I'm not interested in the
other groups. I'm simply using them as "delimiters" for lack of a better
word.
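
To illustrate the point about mixing groups: as far as I know, .NET lets named and unnamed groups sit side by side in one pattern, and the named one can be read back by name. The sketch below is only an illustration. The class name, method name, and sample HTML are made up, and `[^ ]*` stands in for `.*` just to keep the sample tidy.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns the text captured by the named group "caturl",
    // or an empty string when the pattern does not match.
    public static string ExtractCatUrl(string html)
    {
        // Two unnamed groups act as delimiters around one named group.
        Match m = Regex.Match(html, @"(href=)(?<caturl>[^ ]*) (class=title)");
        return m.Success ? m.Groups["caturl"].Value : string.Empty;
    }
}
```

For example, `GroupDemo.ExtractCatUrl("<a href=/cat/books class=title>Books</a>")` returns `/cat/books`, with the two unnamed groups left unused.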

I've modified the expression to look like this:

(href=)(?<caturl>.*)(class=title.*\[   )

This gives the exact same results.

The escaping stuff gets a little confusing because the regular expressions
are actually stored in an XML file, so they get escaped for that.

In the XML file that looks like:

(href=)(?<caturl>.*)(class=title.*\[&nbsp;&nbsp;&nbsp;)

This still isn't returning multiple results. Just the last match. I don't
think the < was the problem.

Pete

Feb 21 '06 #3

Pete Davis

I found the solution to the problem, so now if someone could explain why,
I'd appreciate that:

The solution is to replace carriage returns with line feeds.

If I do that, I get all 8 results instead of the 1 result I was getting.

Now, can anyone tell me WHY? And aren't there other regex options for
dealing with that without having to change the source text?

Thanks.

Pete

Feb 22 '06 #4
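
A possible explanation, offered as a sketch and not confirmed anywhere in this thread: in .NET, `.` matches every character except `\n` unless the Singleline option is set, so it does match `\r`. If the lines of the page are separated only by carriage returns, a greedy `.*` can run from the first `href=` to the last `class=title` and produce one long match. The pattern below is a simplified version of the one in the question, and the class, method names, and sample HTML are invented for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotAndCarriageReturnDemo
{
    // Counts matches of a simplified version of the thread's pattern.
    public static int CountMatches(string text)
    {
        // No Singleline option is set, so the greedy .* stops only at \n.
        return Regex.Matches(text, @"href=(?<caturl>.*)class=title").Count;
    }

    // Builds two made-up "lines" of HTML joined by the given separator.
    public static string BuildPage(string separator)
    {
        return "<a href=/a class=title>A</a>" + separator +
               "<a href=/b class=title>B</a>";
    }
}
```

With these inputs, `CountMatches(BuildPage("\r"))` gives 1 because `.*` crosses the carriage return, while `CountMatches(BuildPage("\n"))` gives 2 because `.*` stops at each line feed.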

Kevin Spencer

> The solution is to replace carriage returns with line feeds.

I didn't see any carriage returns or line feeds in your regular expression.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.

Feb 22 '06 #5

Pete Davis

The Carriage Returns and Linefeeds are in the web page I'm scanning, not the
regex.

Pete

Feb 22 '06 #6

Kevin Spencer

Well, Pete, I'm not sure, but I do know that Microsoft and Unix text
documents have a distinct difference in terms of line feeds. The Microsoft
text document uses CR/LF (\r\n), while the Unix model uses only LF (\n). Does
that tell you anything?

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.

Feb 22 '06 #7

Pete Davis

Kevin,

Thanks for the help. I'm aware of the Unix/Windows difference between CRLF
and LF, but that's not really the issue.

I would assume the Regex engine in .NET is geared more towards Windows
documents, if anything, but maybe not.

There are Regex options for singleline and multiline to help determine how
the parser handles the data, but it's unclear to me how to get all matches
from my document without replacing the carriage returns. Nothing I've tried
has worked. I suspect there are changes I can make to the regular
expressions themselves to handle it, but I don't know how to do that.
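
One pattern-level change that might avoid touching the source text, assuming the trouble comes from `.*` running across carriage returns: swap `.*` for a lazy `[^\r\n]*?`, which excludes both line-ending characters and stops at the first delimiter it can. This is a sketch only. The pattern is simplified from the one in the question, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineBoundedPatternDemo
{
    // Collects every "caturl" capture without letting a capture cross \r or \n.
    public static List<string> ExtractUrls(string pageText)
    {
        var urls = new List<string>();
        // [^\r\n]*? is lazy and cannot match either line-ending character.
        Regex regex = new Regex(@"href=(?<caturl>[^\r\n]*?)\s*class=title",
                                RegexOptions.IgnoreCase);
        foreach (Match m in regex.Matches(pageText))
        {
            urls.Add(m.Groups["caturl"].Value);
        }
        return urls;
    }
}
```

Given two made-up links separated by a bare carriage return, this returns one entry per link. Note that it will not match when `href=` and `class=title` sit on different lines.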

Pete

Feb 23 '06 #8

Larry Lard

Pete Davis wrote:

> I'm using regular expressions to extract some data and some links from some
> web pages. I download the page and then I want to get a list of certain
> links.

Completely not answering your actual question, but maybe this will save
you a lot of bother: the free HtmlAgilityPack will convert an (even
malformed) HTML document into a nice XML tree which makes getting
content out much much easier.

--
Larry Lard
Replies to group please

Feb 23 '06 #9

Kevin Spencer

Hi Pete,

Here are some inline Regular Expression options that may help with this sort
of thing. I tend to use these rather than the options in the Regex engine:

(?i)            Turns off case sensitivity for the remainder of the
                regular expression.
(?s)            Turns on "dot matches newline" for the remainder of the
                regular expression.
(?m)            Makes the caret and dollar sign match after and before
                newlines for the remainder of the regular expression.
(?im-s)         Turns on the "i" and "m" options for the remainder of
                the regular expression, and turns off "s".
(?im-s:regex)   Turns on the "i" and "m" options for the regular
                expression inside the parentheses, and turns off "s".
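
As a rough illustration of two of the inline options listed above: `(?s)` lets `.` cross a line feed, and `(?i)` makes the rest of the pattern ignore case. The class name, method names, patterns, and sample strings are all invented for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionsDemo
{
    // (?s) lets . match \n, so "start" and "end" may be on different lines.
    public static bool MatchesAcrossLines(string text)
    {
        return Regex.IsMatch(text, @"(?s)start.*end");
    }

    // Same pattern without (?s): . stops at \n.
    public static bool MatchesWithinLine(string text)
    {
        return Regex.IsMatch(text, @"start.*end");
    }

    // (?i) makes the remainder of the pattern case-insensitive.
    public static bool MatchesIgnoringCase(string text)
    {
        return Regex.IsMatch(text, @"(?i)href=");
    }
}
```

For the input `"start\nend"`, `MatchesAcrossLines` returns true and `MatchesWithinLine` returns false, and `MatchesIgnoringCase("<A HREF=/x>")` returns true.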

Here is the Microsoft .Net Framework Regular Expressions reference:

http://msdn.microsoft.com/library/de...geElements.asp

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.

Feb 23 '06 #10
