I'm using regular expressions to extract some data and some links from some
web pages. I download the page and then I want to get a list of certain
links.
For building regular expressions, I use an app called The Regulator, which
makes it pretty easy to build and test regular expressions.
As a warning, I'm real weak with regular expressions. Let's say my regular
expression is:
(href=)(?<caturl>.*)(class=title>\[ &nbsp;)
Now, using The Regulator and giving it the source for a particular web page,
I get 8 matches.
According to The Regulator, the options it's using are:
Multiline, ignore case, ignore whitespace
In my own code, I'm doing:
Regex indexRegex = new Regex(categoryListRegex,
    RegexOptions.Multiline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.IgnoreCase);
MatchCollection indexMatches = indexRegex.Matches(pageText);
This only returns one match in indexMatches with the same page that I'm
giving The Regulator. It seems that no matter what combination of regex
options I use, I'm only getting one match.
Why is that? How do I get all 8 matches?
Thanks.
pete
Hi Pete,
You need to escape the '<' and '>' characters in your Regular Expression.
These are used in some flavors of Regular Expression language to indicate a
named group. If the first (<caturl>) is a group name, name both groups or
neither.
--
HTH,
Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.
Well, there are 3 groups. <caturl> is a group name. The other 2 are unnamed.
Why do they need to be named if <caturl> is named? I'm not interested in the
other groups. I'm simply using them as "delimiters" for lack of a better
word.
I've modified the expression to look like this:
(href=)(?<caturl>.*)(class=title.*\[ &nbsp;)
This gives the exact same results.
The escaping stuff gets a little confusing because the regular expressions
are actually stored in an XML file, so they get escaped for that.
In the XML file that looks like:
(href=)(?<caturl>.*)(class=title.*\[&nbsp;&nbsp;&nbsp;)
This still isn't returning multiple results. Just the last match. I don't
think the < was the problem.
Pete
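On the XML escaping point, here is a minimal sketch. It assumes the pattern sits in an element called `pattern`, which is a made-up name because the thread never shows the real file layout. In XML text a `<` has to be written as `&lt;`, and `&nbsp;` is not a predefined XML entity, so a literal `&nbsp;` inside the pattern would be stored as `&amp;nbsp;`. Once the XML parser has decoded the value, the regex engine sees a plain `(?<caturl>...)`, and .NET accepts one named group next to unnamed ones.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XmlPatternDemo
{
    // Hypothetical XML fragment; the element name "pattern" is an assumption.
    // "\\[" in this C# literal is the two characters \[ in the XML text.
    public const string Xml =
        "<pattern>(href=)(?&lt;caturl&gt;.*)(class=title.*\\[&amp;nbsp;)</pattern>";

    // Returns the element text with the XML escapes already decoded.
    public static string LoadPattern(string xml)
    {
        return XElement.Parse(xml).Value;
    }

    public static void Main()
    {
        string pattern = LoadPattern(Xml);
        Console.WriteLine(pattern); // the decoded pattern, with < > and & restored

        // Invented one-line sample; the named group is read by name.
        Match m = Regex.Match("<a href=cat1.htm class=title>[&nbsp;", pattern,
                              RegexOptions.IgnoreCase);
        Console.WriteLine(m.Success);
        Console.WriteLine(m.Groups["caturl"].Value.Trim());
    }
}
```

If the decoded pattern printed here differs from what The Regulator was given, the escaping is the place to look; if it is identical, the `<` and `>` are not the cause.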
I found the solution to the problem, so now if someone could explain why,
I'd appreciate that:
The solution is to replace carriage returns with line feeds.
If I do that, I get all 8 results instead of the 1 result I was getting.
Now, can anyone tell me WHY? And aren't there other regex options for
dealing with that without having to change the source text?
Thanks.
Pete
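A likely explanation, offered as an assumption because the page source is not shown in the thread: in .NET, `.` matches every character except `\n` unless RegexOptions.Singleline is set, so it matches `\r` without complaint. If the downloaded page separates lines with bare carriage returns, a greedy `.*` can run from the first `href=` all the way to the last `class=title`, and the whole page collapses into a single match. Turning each `\r` into `\n` gives `.*` somewhere to stop on every line. The sketch below uses a simplified stand-in pattern and invented sample text, not the real page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndingDemo
{
    // Simplified stand-in for the pattern in the thread (an assumption).
    public const string Pattern = @"href=(?<caturl>.*)class=title";

    // Counts matches using the same kind of options as the original post.
    public static int CountMatches(string text)
    {
        return Regex.Matches(text, Pattern,
            RegexOptions.IgnoreCase | RegexOptions.Multiline).Count;
    }

    public static void Main()
    {
        // Three links separated by bare carriage returns (invented sample).
        string crOnly = "<a href=one.htm class=title>\r" +
                        "<a href=two.htm class=title>\r" +
                        "<a href=three.htm class=title>";
        string lfOnly = crOnly.Replace('\r', '\n');

        Console.WriteLine(CountMatches(crOnly)); // 1: ".*" swallows the \r characters
        Console.WriteLine(CountMatches(lfOnly)); // 3: ".*" stops at each \n
    }
}
```

Note that RegexOptions.Multiline only changes what `^` and `$` mean; it has no effect on what `.` matches, which would explain why toggling the options made no difference.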
> The solution is to replace carriage returns with line feeds.
I didn't see any carriage returns or line feeds in your regular expression.
--
HTH,
Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.
The Carriage Returns and Linefeeds are in the web page I'm scanning, not the
regex.
Pete
Well, Pete, I'm not sure, but I do know that Microsoft and Unix text
documents have a distinct difference in terms of line feeds. The Microsoft
text document uses CR/LF (\r\n), while the Unix model uses only LF (\n). Does
that tell you anything?
--
HTH,
Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.
Kevin,
Thanks for the help. The Unix/Windows difference on crlf vs. lf I'm aware
of, but that's not really the issue.
The Regex engine in .NET, I would assume, is more geared towards Windows
documents if anything, but maybe not.
There are Regex options for singleline and multiline to help determine how
the parser handles the data, but it's unclear to me how to get all matches
from my document without replacing the carriage returns. Nothing I've tried
has worked. I suspect there are changes I can make to the regular
expressions themselves to handle it, but I don't know how to do that.
Pete
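One way to change the expression rather than the page text, sketched with a simplified stand-in pattern and made-up sample HTML (both are assumptions, not the real data): swap the greedy `.*` for a lazy, negated character class such as `[^\r\n]*?`. It cannot cross a carriage return or a line feed, and it stops at the first `class=title` it reaches, so the kind of line ending in the page stops mattering.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineSafePattern
{
    // Simplified stand-in for the thread's pattern (an assumption).
    // [^\r\n]*? never crosses a CR or LF and stops at the first "class=title".
    public const string Pattern = @"href=(?<caturl>[^\r\n]*?)class=title";

    // Returns the trimmed text of the "caturl" group for every match.
    public static List<string> ExtractUrls(string pageText)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(pageText, Pattern, RegexOptions.IgnoreCase))
        {
            urls.Add(m.Groups["caturl"].Value.Trim());
        }
        return urls;
    }

    public static void Main()
    {
        // Mixed line endings on purpose: bare CR, then CR/LF.
        string page = "<a href=one.htm class=title>\r" +
                      "<a href=two.htm class=title>\r\n" +
                      "<a href=three.htm class=title>";

        foreach (string url in ExtractUrls(page))
        {
            Console.WriteLine(url);
        }
    }
}
```

A tighter class such as `[^>]*?` would also keep the capture inside a single tag, at the cost of assuming the attributes never contain `>`.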
Pete Davis wrote:
> I'm using regular expressions to extract some data and some links from some web pages. I download the page and then I want to get a list of certain links.
Completely not answering your actual question, but maybe this will save
you a lot of bother: the free HtmlAgilityPack will convert an (even
malformed) HTML document into a nice XML tree which makes getting
content out much much easier.
--
Larry Lard
Replies to group please
Hi Pete,
Here are some Regular Expressions that may help with this sort of thing. I
tend to use these rather than the options in the Regex engine:
(?i) Turn off case-sensitivity for the remainder
of the regular expression.
(?s) Turn on "dot matches newline" for the
remainder of the regular expression.
(?m) Caret and dollar sign match after and before
newlines for the remainder of the regular expression.
(?i-sm) Turns on the "i" option and turns off "s" and "m"
for the remainder of the regular expression.
(?i-sm:regex) Turns on the "i" option and turns off "s" and "m"
for the regular expression inside the parentheses.
Here is the Microsoft .Net Framework Regular Expressions reference: http://msdn.microsoft.com/library/de...geElements.asp
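The inline options listed above can be tried in isolation. This is only a sketch with invented inputs; the helper names are made up for the example. In .NET, `(?i)`, `(?s)` and `(?m)` correspond to RegexOptions.IgnoreCase, RegexOptions.Singleline and RegexOptions.Multiline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionsDemo
{
    // (?i): case-insensitive matching for the rest of the pattern.
    public static bool CaseInsensitive()
    {
        return Regex.IsMatch("HREF=x", "(?i)href=");
    }

    // (?s): lets "." match "\n" as well.
    public static bool DotCrossesNewline(bool singleline)
    {
        string prefix = singleline ? "(?s)" : "";
        return Regex.IsMatch("a\nb", prefix + "a.b");
    }

    // (?m): "^" and "$" match at line boundaries marked by "\n".
    public static int WholeLineCount(bool multiline)
    {
        string prefix = multiline ? "(?m)" : "";
        return Regex.Matches("one\ntwo", prefix + @"^\w+$").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CaseInsensitive());          // True
        Console.WriteLine(DotCrossesNewline(false));   // False
        Console.WriteLine(DotCrossesNewline(true));    // True
        Console.WriteLine(WholeLineCount(false));      // 0
        Console.WriteLine(WholeLineCount(true));       // 2
    }
}
```

None of these options makes `.` stop at a carriage return, so they are unlikely to help on their own with a page that uses bare `\r` line endings.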
--
HTH,
Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.