Thanks Ken. I'll try that.
While I was waiting for somebody to reply, I refined my original regex a bit
and it seems to find the url and text of href tags, no matter how the tag is
formatted. Here it is:
<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?
Looks ugly, but it seems to do the job. Group 0 (the whole match) is the
entire tag from the opening <a... to the closing ...</a>. The first capture
group is just the url, including the protocol, host, path, fragment, and
query if they exist. The second capture group is the text between the <a>
and </a> tags, including any embedded html tags.
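A minimal sketch of how the groups can be read back in VB.NET, in case it helps anyone following along. In .NET's numbering, Groups(0) is always the whole match and the parenthesised groups start at 1. The pattern below is a simplified stand-in chosen for illustration (an assumption, not the regex posted above), and the module name and sample HTML are made up.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AnchorGroupsDemo
    Sub Main()
        ' Illustrative pattern (an assumption, not the regex from the post):
        ' group 1 = href value, group 2 = everything between <a ...> and </a>.
        Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*[""']?([^\s'"">]*)[^>]*>(.*?)</a>"
        ' Singleline lets "." cross line breaks inside the link text.
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Made-up sample input with attributes before and after href.
        Dim html As String = "<p><a href=""/"" target=""_blank"" >Home</a> | " & _
                             "<a class='nav' href='/about.html#team'><b>About</b></a></p>"

        For Each m As Match In Regex.Matches(html, pattern, options)
            Console.WriteLine("Whole tag: " & m.Groups(0).Value)
            Console.WriteLine("Url:       " & m.Groups(1).Value)
            Console.WriteLine("Text:      " & m.Groups(2).Value)
        Next
    End Sub
End Module
```

For the first sample link this should report "/" as the url and "Home" as the text. For the second, the text keeps the embedded <b> tags.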
If anybody can think of a situation where this regex doesn't match an anchor
tag in html code, please let me know.
Thanks again,
Luhar
"Ken Tucker [MVP]" <vb***@bellsouth.net> wrote in message
news:Oj**************@TK2MSFTNGP09.phx.gbl...
Hi,
Maybe this will help.
' Download the page and read the whole response into a string.
Dim wc As New System.Net.WebClient
Dim sr As New System.IO.StreamReader(wc.OpenRead("http://news.google.com/"))
Dim strHtml As String
' regLink: a double-quoted value (the href), quotes included in the match.
Dim regLink As New System.Text.RegularExpressions.Regex("\""(?<url>[^\""]*)\""")
' regTitle: the text between > and < (the link text).
Dim regTitle As New System.Text.RegularExpressions.Regex(">(.*?)\<")
' regHref: anchors whose only attribute is a double-quoted href.
Dim regHref As New System.Text.RegularExpressions.Regex("\<a href=""(.*?)""\>(.*?)\<\/a\>")
Dim m As System.Text.RegularExpressions.Match
strHtml = sr.ReadToEnd
Try
    For Each m In regHref.Matches(strHtml)
        Dim mLink As System.Text.RegularExpressions.Match
        For Each mLink In regLink.Matches(m.ToString())
            Trace.WriteLine(String.Format("Link {0}", mLink.ToString))
        Next
        For Each mLink In regTitle.Matches(m.ToString())
            Dim strTitle As String = mLink.ToString
            strTitle = strTitle.Replace(">", "")
            strTitle = strTitle.Replace("<", "")
            Trace.WriteLine(String.Format("Title {0}", strTitle))
        Next
    Next
Catch
    ' Any exception is silently ignored here.
End Try
sr.Close()
wc.Dispose()
Ken
----------------------------
"Luhar" <lu***@luharsoftworks.com> wrote in message
news:uH*************@TK2MSFTNGP10.phx.gbl...
After much scouring of information on Regular Expressions from books and
the web, I've come up with this handy little Regex to parse links from
HTML:
<a\s+href(?:\s+)?=(?:\s+)?[""']+(.?[^'""]+)['""]+(?:\s+)?>(.*?)</a>
It works quite well at extracting the url and title of a link from an
anchor tag, with one major problem--if the anchor tag includes other attributes
after the HREF= attribute, such as TITLE= or TARGET=, it doesn't consider
it a match. Here are some examples:
This one matches:
<a href="/">Home</a>
Group 1: "/"
Group 2: "Home"
This one doesn't:
<a href="/" target="_blank" >Home</a>
I can't figure out how to match just the href attribute and just the link
text. Any help would be appreciated.
Thanks.
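The behaviour described in the question can be reproduced with a short VB.NET sketch. It assumes the pattern from the post with each (?:\s+)? written as the equivalent \s*. The "relaxed" variant is a hypothetical tweak added for illustration only: it puts [^>]* in front of the closing > so that attributes after href are skipped. It still expects href to be the first attribute, so treat it as a starting point rather than a complete answer. The module and variable names are made up.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ExtraAttributesDemo
    Sub Main()
        ' Pattern from the post, with (?:\s+)? written as the equivalent \s*.
        Dim original As String = "<a\s+href\s*=\s*[""']+(.?[^'""]+)['""]+\s*>(.*?)</a>"
        ' Hypothetical tweak: [^>]* skips any attributes that follow href.
        Dim relaxed As String = "<a\s+href\s*=\s*[""']+(.?[^'""]+)['""]+[^>]*>(.*?)</a>"

        ' The two sample anchors from the post.
        Dim samples() As String = { _
            "<a href=""/"">Home</a>", _
            "<a href=""/"" target=""_blank"" >Home</a>"}

        For Each html As String In samples
            Console.WriteLine(html)
            Console.WriteLine("  original matches: " & Regex.IsMatch(html, original).ToString())
            Dim m As Match = Regex.Match(html, relaxed)
            If m.Success Then
                Console.WriteLine("  relaxed url:  " & m.Groups(1).Value)
                Console.WriteLine("  relaxed text: " & m.Groups(2).Value)
            End If
        Next
    End Sub
End Module
```

The original pattern fails on the second sample because, after the closing quote, it allows only whitespace before the >, so it cannot get past target="_blank".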