Regex Issues - Finding Qualified URLS

Hi,
I am using the following function to match any URLs within a string
containing the HTML of a webpage:

public List<string> DumpHrefs(String inputString)
{
    Regex r;
    Match m;
    List<string> LstURLs = new List<string>();

    // Group 1 holds the href value, whether quoted or unquoted
    r = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);
    for (m = r.Match(inputString); m.Success; m = m.NextMatch())
    {
        LstURLs.Add(m.Groups[1].ToString());
    }
    return LstURLs;
}
However, the problem with this is that it returns all links on the page, and
I only wish to return fully qualified links such as
http://www.domain.com/page.html and not relative links.
I was given the following information by Kevin Spencer:
/* Start */
(?i)href\s*=\s*"?(?<1>http://[^"]+\"?[^>]*)>
First, rather than using an alternation, I just gave a rule that it could
have 0 or 1 quotes at the beginning and end. The (?i) indicates that the
regex is not case-sensitive. The group 1 consists of the character sequence
"http://" followed by any character that is not a quote mark, followed by
zero or 1 quote marks, followed by any character that is not ">". The
expression ends with the ">" character.
/* End */

I am unsure how to incorporate the regex given by Kevin into my
function. Does anyone have any suggestions?

Regards
Dec 11 '07 #1
> /* Start */
> (?i)href\s*=\s*"?(?<1>http://[^"]+\"?[^>]*)>
Hmm, this link *might* help

http://www.regexplib.com/Search.aspx...=-1&m=-1&ps=20

I'm sort of new to regex, but I think there is a group in the regexp... I can't
see how the pattern above masks out links to the same page, though...

//CY
Dec 12 '07 #2
> Hmm, this link *might* help
>
> http://www.regexplib.com/Search.aspx...=-1&m=-1&ps=20
but it might be for POSIX... now that I think a bit about the "p"

//CY
Dec 12 '07 #3
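
For anyone landing on this thread later, here is a rough sketch of one way to get only fully qualified links out of the original function. It is not Kevin's pattern verbatim: as quoted above, his group 1 runs on past the closing quote up to the ">", so the captured value would include the trailing quote and any other attributes. The sketch instead keeps the original quoted/unquoted alternation and requires the captured value to start with a scheme. The class name LinkExtractor, the method name DumpAbsoluteHrefs, and the decision to accept https:// as well as http:// are assumptions for illustration, not something from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same shape as the original pattern, but group 1 must begin with
    // http:// or https://. Accepting https is an assumption; remove the
    // "s?" to match http:// only, as in Kevin's rule.
    // Single-quoted href values are not handled by this sketch.
    private static readonly Regex AbsoluteHref = new Regex(
        @"href\s*=\s*(?:""(?<1>https?://[^""]*)""|(?<1>https?://[^\s>""]+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> DumpAbsoluteHrefs(string inputString)
    {
        List<string> urls = new List<string>();

        // Walk the matches the same way the original function does.
        for (Match m = AbsoluteHref.Match(inputString); m.Success; m = m.NextMatch())
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Called on a string containing href="http://www.domain.com/page.html" and href="/relative.html", this should return only the first URL, because the relative value fails the scheme check in both branches of the alternation. A regex is still a blunt tool for HTML, so treat this as a starting point rather than a complete link parser.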
