
How to capture URLs in an HTML file

Newbie
 
Join Date: Dec 2008
Posts: 12
#1: Feb 1 '09
Hello everyone,

I am trying to capture URLs in an HTML file, where the links appear in this form:

    <a[string]href[space(s) or nothing]=[space(s) or nothing]["][url]["][string]>

I found this code but it does not work well.

    Imports System.Text.RegularExpressions

    Public Class Form1
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Dim rx As New Regex("[<]a[\s][\w\W]*[href=](?<word>\S*)[\s\W\w]*[>]", _
                                RegexOptions.Compiled Or RegexOptions.IgnoreCase)
            Dim text As String = "<a href=http:// name=as>"
            Dim matches As MatchCollection = rx.Matches(text)
            For Each m As Match In matches
                MsgBox(m.Groups("word").Value)
            Next
        End Sub
    End Class

Thank You
Plater
Moderator
 
Join Date: Apr 2007
Location: New England
Posts: 7,148
#2: Feb 2 '09

Assuming the regex matches your given pattern, it would probably work.
But remember, not all websites put double quotes (") around the href value (they should, but they don't).
You might find a single quote (') or no quoting at all.
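
To illustrate the point about quoting, here is a minimal sketch of a pattern that accepts double-quoted, single-quoted, and unquoted href values. The sample HTML, the module name HrefQuoteDemo, and the group name "url" are made up for this example, and the pattern is only one possible approach, not a full HTML parser.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HrefQuoteDemo
    Sub Main()
        ' Hypothetical sample input covering the three quoting styles.
        Dim html As String = "<a href=""http://example.com/a"">A</a> " & _
                             "<a class='x' href='page.htm'>B</a> " & _
                             "<a href=contact.htm name=as>C</a>"

        ' One alternative per quoting style. .NET allows the same group
        ' name ("url") to be reused across the alternatives.
        Dim pattern As String = _
            "<a\b[^>]*?\bhref\s*=\s*" & _
            "(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine(m.Groups("url").Value)
        Next
    End Sub
End Module
```

Run as a console application, this should print one URL per link in the sample string. Listing the quoted alternatives first means the unquoted branch is only tried when the value does not start with a quote.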
Newbie
 
Join Date: Dec 2008
Posts: 12
#3: Feb 4 '09

Hello everyone,
Thank you Plater for your answer and your important tip.
I made some modifications to the code and it should work, but in the end I get only the last URL. Can you help me with this?

    Dim sr As New StreamReader("c:\cas.html")
    Dim text As String = sr.ReadToEnd()
    sr.Close()
    text = text.Replace(Chr(13), "")
    text = text.Replace("  ", " ")
    Dim spattern As String = "<\s*a\s+[\w\W]*href\s*=[\s'" & Chr(34) & "]*(?<word>[^" & Chr(34) & "'\s]+)[\s\S\W\w]*[>]"
    Dim rx As New Regex(spattern, _
                        RegexOptions.Compiled Or RegexOptions.IgnoreCase)
    Dim matches As MatchCollection = rx.Matches(text)
    For Each m As Match In matches
        ListBox1.Items.Add(m.Groups("word").Value)
    Next

The HTML page I am using is attached. If you run this code on that page, the only result you get is contact.htm.

Thank you
Attached Files
File Type: zip cas.zip (11.8 KB, 1 views)
Newbie
 
Join Date: Dec 2008
Posts: 12
#4: Feb 4 '09

I think I found the solution:

    ' Requires Imports System.IO and Imports System.Text.RegularExpressions
    Dim text As String
    Using sr As New StreamReader("c:\cas.html")
        text = sr.ReadToEnd()
    End Using
    text = text.Replace(Chr(13), "")
    ' Collapse runs of spaces into a single space
    Do While text.Contains("  ")
        text = text.Replace("  ", " ")
    Loop
    ' [^>]* cannot run past the end of the current tag
    Dim spattern As String = "<\s*a\s+[^>]*href\s*=[\s'" & Chr(34) & "]*(?<word>[^" & Chr(34) & "'\s]+)[^>]*>"
    For Each m As Match In Regex.Matches(text, spattern, RegexOptions.Compiled Or RegexOptions.IgnoreCase)
        ListBox1.Items.Add(m.Groups("word").Value)
    Next

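The problem was that [\w\W]* also matches the > symbol, so a single match ran from the first <a to the last > in the file and only the last href was captured. Using [^>]* keeps each match inside one tag.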
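
A small sketch of the difference, assuming a made-up two-link input rather than the attached cas.html. The module name GreedyDemo and the variable names are hypothetical. The two patterns are the ones from the posts above, differing only in [\w\W]* versus [^>]*.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' Hypothetical input with two links (not the attached file).
        Dim html As String = "<a href=""first.htm"">1</a> <a href=""second.htm"">2</a>"

        Dim q As String = Chr(34)
        ' Shared middle part: href, optional quotes, then the captured URL.
        Dim core As String = "href\s*=[\s'" & q & "]*(?<word>[^" & q & "'\s]+)"

        ' [\w\W]* matches any character, including ">".
        Dim greedy As String = "<\s*a\s+[\w\W]*" & core & "[\w\W]*>"
        ' [^>]* stops at the end of the current tag.
        Dim bounded As String = "<\s*a\s+[^>]*" & core & "[^>]*>"

        Console.WriteLine("With [\w\W]*:")
        For Each m As Match In Regex.Matches(html, greedy, RegexOptions.IgnoreCase)
            Console.WriteLine("  " & m.Groups("word").Value)
        Next

        Console.WriteLine("With [^>]*:")
        For Each m As Match In Regex.Matches(html, bounded, RegexOptions.IgnoreCase)
            Console.WriteLine("  " & m.Groups("word").Value)
        Next
    End Sub
End Module
```

With the first pattern the whole string is consumed as one match and the captured group ends up on the last href. With the second, each anchor tag produces its own match.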

Thank you
Member
 
Join Date: Nov 2007
Location: Alberta, Canada
Posts: 59
#5: Feb 6 '09

Wouldn't it be easier to use the mshtml class?