
Need help with a Regular Expression

After much scouring of information on Regular Expressions from books and the
web, I've come up with this handy little Regex to parse links from HTML:

<a\s+href(?:\s+)?=(?:\s+)?[""']+(.?[^'""]+)['""]+(?:\s+)?>(.*?)</a>

It works quite well at extracting the url and title of a link from an anchor
tag, with one major problem--if the anchor tag includes other attributes
after the HREF= attribute, such as TITLE= or TARGET=, it doesn't consider it
a match. Here are some examples:

This one matches:
<a href="/">Home</a>
Group 1: "/"
Group 2: "Home"

This one doesn't:
<a href="/" target="_blank">Home</a>

I can't figure out how to match just the href attribute and just the link
text. Any help would be appreciated.

Thanks.
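
The second example fails because, after the closing quote, the pattern allows only optional whitespace before the ">", so target="_blank" blocks the match. Below is a minimal sketch of one possible fix, written in Python for illustration. The pattern syntax it uses is also accepted by the .NET Regex class. RELAXED and extract are names made up for this sketch, not part of the original post.

```python
import re

# The pattern from the post, with VB's doubled quotes ("") written as a plain ".
ORIGINAL = re.compile(
    r'''<a\s+href(?:\s+)?=(?:\s+)?["']+(.?[^'"]+)['"]+(?:\s+)?>(.*?)</a>'''
)

# Hypothetical variant: [^>]* skips any further attributes before the ">".
RELAXED = re.compile(
    r'''<a\s+href\s*=\s*["']([^'"]+)["'][^>]*>(.*?)</a>'''
)

def extract(pattern, html):
    """Return (url, text) for the first anchor the pattern finds, or None."""
    m = pattern.search(html)
    return m.groups() if m else None

plain = '<a href="/">Home</a>'
extra = '<a href="/" target="_blank">Home</a>'

print(extract(ORIGINAL, plain))
print(extract(ORIGINAL, extra))
print(extract(RELAXED, plain))
print(extract(RELAXED, extra))
```

The original pattern returns ('/', 'Home') for the first anchor and None for the second. The relaxed variant returns ('/', 'Home') for both. It still assumes href is the first attribute and that the value is quoted.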
Nov 20 '05 #1


Hi,
Maybe this will help.
Dim wc As New System.Net.WebClient
Dim sr As New System.IO.StreamReader(wc.OpenRead("http://news.google.com/"))
Dim strHtml As String

' Any double-quoted value inside the matched anchor
Dim regLink As New System.Text.RegularExpressions.Regex("\""(?<url>[^\""]*)\""")
' Text between a ">" and the next "<"
Dim regTitle As New System.Text.RegularExpressions.Regex(">(.*?)\<")
' A complete anchor element that starts with <a href="
Dim regHref As New System.Text.RegularExpressions.Regex("\<a href=""(.*?)""\>(.*?)\<\/a\>")

Dim m As System.Text.RegularExpressions.Match

strHtml = sr.ReadToEnd

Try
    For Each m In regHref.Matches(strHtml)
        Dim mLink As System.Text.RegularExpressions.Match

        ' Print every quoted value found in the matched anchor
        For Each mLink In regLink.Matches(m.ToString())
            Trace.WriteLine(String.Format("Link {0}", mLink.ToString))
        Next

        ' Print the link text with the surrounding ">" and "<" stripped
        For Each mLink In regTitle.Matches(m.ToString())
            Dim strTitle As String = mLink.ToString
            strTitle = strTitle.Replace(">", "")
            strTitle = strTitle.Replace("<", "")
            Trace.WriteLine(String.Format("Title {0}", strTitle))
        Next
    Next
Catch
    ' Errors are silently ignored
End Try

sr.Close()
wc.Dispose()

Ken
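
To see what the three patterns above do with the two anchors from the question, here is a rough Python transcription of the loop, run on literal strings instead of a downloaded page. The function name two_pass is made up for this sketch, and it collects the url named group rather than the whole quoted match that mLink.ToString returns.

```python
import re

# Patterns transcribed from the VB.NET snippet above, with the doubled quotes
# and the unnecessary backslashes removed.
HREF_RE = re.compile(r'<a href="(.*?)">(.*?)</a>')
LINK_RE = re.compile(r'"(?P<url>[^"]*)"')
TITLE_RE = re.compile(r'>(.*?)<')

def two_pass(html):
    """For each matched anchor, return (quoted values, texts between > and <)."""
    results = []
    for anchor in HREF_RE.finditer(html):
        tag = anchor.group(0)  # the whole matched anchor element
        links = [m.group("url") for m in LINK_RE.finditer(tag)]
        titles = [m.group(1) for m in TITLE_RE.finditer(tag)]
        results.append((links, titles))
    return results

print(two_pass('<a href="/">Home</a>'))
print(two_pass('<a href="/" target="_blank">Home</a>'))
print(two_pass('<a class="x" href="/">Home</a>'))
```

With the target="_blank" anchor the outer pattern still matches, because the lazy (.*?) keeps expanding until it reaches the first "> it can find. The inner pass then reports every quoted value, so _blank is listed as a link alongside /. The third anchor is skipped entirely because its first attribute is not href.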


Nov 20 '05 #2

Thanks Ken. I'll try that.

While I was waiting for somebody to reply, I refined my original regex a bit
and it seems to find the url and text of href tags, no matter how the tag is
formatted. Here it is:
<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?
Looks ugly, but it seems to do the job. Group 0 (the whole match) is the
entire element from the opening <a... to the closing ...</a>. Capture group 1
is just the url, including the protocol, host, path, fragment, and query if
they exist. Capture group 2 is the text between the <a> and </a> tags,
including any embedded html tags.
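
As a quick check of how the groups are numbered, here is a small Python sketch that runs the pattern on a made-up anchor. The sample string is an assumption for illustration only. The pattern has two pairs of capturing parentheses, so the whole match is group 0, the url is group 1, and the link text is group 2, which is also how the .NET Match.Groups collection is indexed.

```python
import re

# The refined pattern from this post, with VB's "" written as a plain ".
REFINED = re.compile(
    r'''<\s*a\s*.*[^href]?href\s*={1}?[\s"']*([^\s'">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?'''
)

# Made-up sample anchor with extra attributes and embedded markup.
sample = '<a class="nav" href="/docs/index.html#top" target="_blank">Read <b>more</b></a>'

m = REFINED.search(sample)
whole, url, text = m.group(0), m.group(1), m.group(2)

print(REFINED.groups)  # number of capture groups in the pattern
print(whole)           # group 0: the entire matched element
print(url)             # group 1: the href value
print(text)            # group 2: the link text, embedded tags included
```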

If anybody can think of a situation where this regex doesn't match an anchor
tag in html code, please let me know.
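
Two situations where it appears to fall short are sketched below in Python: several anchors on one line, and link text that spans more than two lines. The greedy .* near the start runs to the last href on the line, and (.*?\n*.*?) allows only one run of line breaks. TIGHTER is a hypothetical alternative, not a tested replacement. It assumes the href value is quoted and that no attribute value contains a ">". The helper name links is also made up.

```python
import re

# The refined pattern from the post (VB's "" written as a plain ").
REFINED = re.compile(
    r'''<\s*a\s*.*[^href]?href\s*={1}?[\s"']*([^\s'">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?'''
)

# Hypothetical tighter variant: [^>]*? keeps the scan for href inside one tag,
# and DOTALL lets the link text span any number of lines.
TIGHTER = re.compile(
    r'''<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a\s*>''',
    re.IGNORECASE | re.DOTALL,
)

def links(pattern, html, url_group, text_group):
    """Return a list of (url, text) pairs, one for every match of pattern."""
    return [(m.group(url_group), m.group(text_group))
            for m in pattern.finditer(html)]

two_on_one_line = '<a href="/one">One</a> <a href="/two">Two</a>'
multi_line_text = '<a href="/x">one\ntwo\nthree</a>'

print(links(REFINED, two_on_one_line, 1, 2))
print(links(REFINED, multi_line_text, 1, 2))
print(links(TIGHTER, two_on_one_line, 2, 3))
print(links(TIGHTER, multi_line_text, 2, 3))
```

The refined pattern reports only ('/two', 'Two') for the first input and nothing for the second. The tighter variant finds both links on the one line and keeps the three-line text intact. In .NET the same two flags are RegexOptions.IgnoreCase and RegexOptions.Singleline.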

Thanks again,

Luhar


Nov 20 '05 #3
