
How to capture URLs in HTML file

P: 24
Hello everyone,

I am trying to capture the URLs in an HTML file. They appear in anchor tags of this form:

    <a[string]href[space(s) or nothing]=[space(s) or nothing]["][url]["][string]>

I found this code, but it does not work well.

    Imports System.Text.RegularExpressions
    Public Class Form1
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Dim rx As New Regex("[<]a[\s][\w\W]*[href=](?<word>\S*)[\s\W\w]*[>]", _
                RegexOptions.Compiled Or RegexOptions.IgnoreCase)
            Dim text As String = "<a href=http:// name=as>"
            Dim matches As MatchCollection = rx.Matches(text)
            For Each m As Match In matches
                MsgBox(m.Groups("word").Value)
            Next
        End Sub
    End Class

Thank You
Feb 1 '09 #1
4 Replies


Plater
Expert 5K+
P: 7,872
Assuming the regex matches your given pattern, it would probably work.
But remember, not all websites put " around the href value (they should, but they don't).
You might find a single quote ' or no quoting at all.
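
To illustrate the point about quoting, here is a minimal sketch of one way a pattern could accept all three styles by listing them as alternatives. The sample HTML string, the module name, and the pattern itself are assumptions made up for this illustration, not code from the thread; it also does not try to handle every edge case (for example, an attribute such as data-href would match too).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HrefQuoteDemo
    Sub Main()
        ' Hypothetical sample input covering the three quoting styles.
        Dim html As String = "<a href=""one.htm"">1</a> <a href='two.htm'>2</a> <a href=three.htm>3</a>"

        ' Three alternatives for the value: "double-quoted", 'single-quoted', or unquoted.
        ' .NET allows the same group name (word) to be reused in each alternative.
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*(?:""(?<word>[^""]*)""|'(?<word>[^']*)'|(?<word>[^\s>'""]+))"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine(m.Groups("word").Value)
        Next
    End Sub
End Module
```

Run as a console program, this should print one.htm, two.htm, and three.htm, one per line.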
Feb 2 '09 #2

P: 24
Hello everyone,
Thank you, Plater, for your answer and your important tip.
I made some modifications to the code and it should work, but in the end I get only the last URL. Can you help me with this?

    Dim sr As New StreamReader("c:\cas.html")
    Dim text As String = sr.ReadToEnd()
    sr.Close()
    text = text.Replace(Chr(13), "")
    text = text.Replace("  ", " ")
    Dim spattern As String = "<\s*a\s+[\w\W]*href\s*=[\s'" & Chr(34) & "]*(?<word>[^" & Chr(34) & "'\s]+)[\s\S\W\w]*[>]"
    Dim rx As New Regex(spattern, _
        RegexOptions.Compiled Or RegexOptions.IgnoreCase)
    Dim matches As MatchCollection = rx.Matches(text)
    For Each m As Match In matches
        ListBox1.Items.Add(m.Groups("word").Value)
    Next

The HTML page that I use is attached. If you run this code on that page, you get only contact.htm.

Thank you
Attached Files
File Type: zip cas.zip (11.8 KB, 794 views)
Feb 4 '09 #3

P: 24
I think I found the solution:

    Dim sr As New StreamReader("c:\cas.html")
    Dim text As String = sr.ReadToEnd()
    sr.Close()
    text = text.Replace(Chr(13), "")
    ' Collapse runs of spaces until only single spaces remain.
    Do While InStr(text, "  ") > 0
        text = text.Replace("  ", " ")
    Loop
    ' [^>]* keeps each match inside a single tag.
    Dim spattern As String = "<\s*a\s+[^>]*href\s*=[\s'" & Chr(34) & "]*(?<word>[^" & Chr(34) & "'\s]+)[^>]*>"
    For Each m As Match In Regex.Matches(text, spattern, RegexOptions.Compiled Or RegexOptions.IgnoreCase)
        ListBox1.Items.Add(m.Groups("word").Value)
    Next

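The problem was that [\w\W] also matches the > symbol, so the greedy [\w\W]* ran past the end of the tag and the whole file became one match that captured only the last URL.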
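
The difference can be seen on a small input. The following is a rough sketch that assumes a made-up two-link string and a hypothetical module name; it reuses the two patterns from the posts above (written with "" instead of Chr(34)) and prints how many matches each one finds and which URLs are captured.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' Hypothetical two-link input on a single line.
        Dim html As String = "<a href=""first.htm"">A</a> <a href=""second.htm"">B</a>"

        ' Shared middle part of both patterns: href, =, optional quote, captured URL.
        Dim tail As String = "href\s*=[\s'""]*(?<word>[^""'\s]+)"

        ' [\w\W]* matches any character, including >, so it can run across tags.
        Dim greedy As String = "<\s*a\s+[\w\W]*" & tail & "[\w\W]*>"
        ' [^>]* stops at the first >, so each match stays inside one tag.
        Dim bounded As String = "<\s*a\s+[^>]*" & tail & "[^>]*>"

        For Each pattern As String In New String() {greedy, bounded}
            Dim matches As MatchCollection = Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine("Matches: " & matches.Count)
            For Each m As Match In matches
                Console.WriteLine("  " & m.Groups("word").Value)
            Next
        Next
    End Sub
End Module
```

With this input, the first pattern should report a single match capturing second.htm, while the second pattern should report two matches, first.htm and second.htm.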

Thank you
Feb 4 '09 #4

P: 59
Wouldn't it be easier to use the mshtml class?
Feb 6 '09 #5
