Thanks Ken. I'll try that.
While I was waiting for somebody to reply, I refined my original regex a bit,
and it seems to find the url and text of anchor tags no matter how the tag is
formatted. Here it is:
<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?
Looks ugly, but it seems to do the job. Group 0 (the whole match) is the
entire tag from the opening <a... to the closing ...</a>. Group 1 is just
the url, including the protocol, host, destination, fragments, and queries
if they exist. Group 2 is the text between the <a> and </a> tags, including
any embedded html tags.
If anybody can think of a situation where this regex doesn't match an anchor
tag in html code, please let me know.
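For reference, here is a minimal sketch of how the groups could be read back in VB.NET. In .NET the whole match is Groups(0) and the two parenthesized groups are Groups(1) and Groups(2). The module name, the sample input string, and the RegexOptions.IgnoreCase flag are assumptions added for illustration; they are not part of the pattern above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module AnchorGroupsDemo
    Sub Main()
        ' Pattern copied from the post; "" is an escaped quote inside a VB string literal.
        Dim pattern As String = "<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?"

        ' IgnoreCase is an assumption so that <A HREF=...> would also be considered.
        Dim regAnchor As New Regex(pattern, RegexOptions.IgnoreCase)

        ' Hypothetical sample input with an extra attribute after href.
        Dim html As String = "<a href=""/"" target=""_blank"">Home</a>"

        For Each m As Match In regAnchor.Matches(html)
            ' Groups(0) is the whole match, Groups(1) the url, Groups(2) the link text.
            Console.WriteLine("Whole tag: {0}", m.Groups(0).Value)
            Console.WriteLine("Url:       {0}", m.Groups(1).Value)
            Console.WriteLine("Text:      {0}", m.Groups(2).Value)
        Next
    End Sub
End Module
```

Running a sketch like this against a page that puts several links on one line would be a reasonable first test, since the greedy .* after <a can span more than one tag.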
Thanks again,
Luhar
"Ken Tucker [MVP]" <vb***@bellsouth.net> wrote in message
news:Oj**************@TK2MSFTNGP09.phx.gbl...
Hi,
Maybe this will help.
Dim wc As New System.Net.WebClient
Dim sr As New System.IO.StreamReader(wc.OpenRead("http://news.google.com/"))
Dim strHtml As String
Dim regLink As New System.Text.RegularExpressions.Regex("\""(?<url>[^\""]*)\""")
Dim regTitle As New System.Text.RegularExpressions.Regex(">(.*?)\<")
Dim regHref As New System.Text.RegularExpressions.Regex("\<a href=""(.*?)""\>(.*?)\<\/a\>")
Dim m As System.Text.RegularExpressions.Match
strHtml = sr.ReadToEnd
Try
For Each m In regHref.Matches(strHtml)
Dim mLink As System.Text.RegularExpressions.Match
For Each mLink In regLink.Matches(m.ToString())
Trace.WriteLine(String.Format("Link {0}", mLink.ToString))
Next
For Each mLink In regTitle.Matches(m.ToString())
Dim strTitle As String = mLink.ToString
strTitle = strTitle.Replace(">", "")
strTitle = strTitle.Replace("<", "")
Trace.WriteLine(String.Format("Title {0}", strTitle))
Next
Next
Catch ex As Exception
' Report the failure instead of silently swallowing it
Trace.WriteLine(ex.Message)
End Try
sr.Close()
wc.Dispose()
Ken
----------------------------
"Luhar" <lu***@luharsoftworks.com> wrote in message
news:uH*************@TK2MSFTNGP10.phx.gbl...
After much scouring of information on Regular Expressions from books and
the web, I've come up with this handy little Regex to parse links from
HTML:
<a\s+href(?:\s+)?=(?:\s+)?[""']+(.?[^'""]+)['""]+(?:\s+)?>(.*?)</a>
It works quite well at extracting the url and title of a link from an
anchor tag, with one major problem--if the anchor tag includes other attributes
after the HREF= attribute, such as TITLE= or TARGET=, it doesn't consider
it a match. Here are some examples:
This one matches:
<a href="/">Home</a>
Group 1: "/"
Group 2: "Home"
This one doesn't:
<a href="/" target="_blank">Home</a>
I can't figure out how to match just the href attribute and just the link
text. Any help would be appreciated.
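The behavior described above can be reproduced with a short VB.NET sketch that runs the pattern against the two sample tags. The module and variable names are made up for illustration. The second tag fails because, after the closing quote of the href value, the pattern allows only optional whitespace before the >.

```vb
Imports System
Imports System.Text.RegularExpressions

Module OriginalPatternDemo
    Sub Main()
        ' Pattern copied from the post; "" is an escaped quote inside a VB string literal.
        Dim regLink As New Regex("<a\s+href(?:\s+)?=(?:\s+)?[""']+(.?[^'""]+)['""]+(?:\s+)?>(.*?)</a>")

        ' The two sample tags from the post.
        Dim samples() As String = { _
            "<a href=""/"">Home</a>", _
            "<a href=""/"" target=""_blank"">Home</a>"}

        For Each html As String In samples
            Dim m As Match = regLink.Match(html)
            If m.Success Then
                ' Groups(1) is the url, Groups(2) is the link text.
                Console.WriteLine("Match:    {0}", html)
                Console.WriteLine("  Group 1: {0}", m.Groups(1).Value)
                Console.WriteLine("  Group 2: {0}", m.Groups(2).Value)
            Else
                Console.WriteLine("No match: {0}", html)
            End If
        Next
    End Sub
End Module
```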
Thanks.