Extracting Links from a Document

I am pretty new to VB, so please forgive the simplistic question. This is
using VB .NET Standard 2003.

My form has three objects on it: a TextBox named URL, a Button named Extract
and a WebBrowser named AxWebBrowser1. The goal is to have the user enter a
URL in the TextBox, hit the Extract button, and then get the links from the
web page they entered.

So far I have:

AxWebBrowser1.Navigate(URL.Text)

which does display the page in the browser frame (not that I need to see it;
I am doing this to parse the data, so it is not necessary to display the
pages),

and then:

Dim doc As mshtml.HTMLDocument = _
AxWebBrowser1.Document()

which I got off some site and which I assume creates a document object
named doc. Now, how do I extract the links and convert them to strings so
that I can parse them, looking for the keywords I am trying to find?

Thanks
-John
--
http://www.johnstexas.com http://stores.ebay.com/Johnstexas
Nov 21 '05 #1
I wrote some code to do pretty much exactly that a while back. First, you
don't need to use Internet Explorer if you don't want to. The Microsoft .NET
Framework has classes that make it easy to download a page.

Here's the code. Just create a new Console application, and paste this in:

Option Compare Text
Imports System.Collections
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        SiteSweep("http://www.asp.net/whidbey/pdc.aspx?tabindex=0&tabid=1", _
            "c:\PDC")
        SiteSweep("http://msdn.microsoft.com/events/pdc/agendaandsessions/sessions/default.aspx", _
            "c:\PDC")
        Console.WriteLine("Done")
        Console.ReadLine()
    End Sub

    Public Sub SiteSweep(ByVal source As String, ByVal dest As String)
        ' needed to deal with relative paths
        Dim root As String = Left(source, source.IndexOf("/", 7))
        Dim current As String = Left(source, source.LastIndexOf("/") + 1)
        ' pull page
        Dim w As New WebClient
        Dim sr As New StreamReader(w.OpenRead(source))
        Dim s As String = sr.ReadToEnd()
        sr.Close()
        ' find hrefs, with quoted or unquoted values
        Dim r As New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        ' get rid of dups: key = absolute URL, value = local file name
        Dim d As New Hashtable
        For Each m As Match In r.Matches(s)
            Dim url As String = m.Groups(1).Value
            ' find only certain file types. This could have been done with the
            ' previous regex, except (1) I ripped that regex off of MSDN, and (2)
            ' I plan on running the app all of one time, so who cares.
            If Right(url, 4) = ".ppt" Or Right(url, 4) = ".zip" Or _
                Right(url, 4) = ".doc" Then
                If Left(url, 7) <> "http://" Then
                    If url.StartsWith("/") Then
                        url = root & url
                    Else
                        url = current & url
                    End If
                End If
                d(url) = Right(url, Len(url) - url.LastIndexOf("/") - 1)
            End If
        Next
        If Not Directory.Exists(dest) Then
            Directory.CreateDirectory(dest)
        End If
        ' download each file. If the download bombs, try again, unless you get
        ' a 415 or 404, because there appears to be a problem with some of the
        ' files, or they are hrefs that are commented out, and my regex ain't
        ' smart enough to figure that out.
        For Each s In d.Keys
            Dim target As String = dest & "\" & CStr(d(s))
            Dim isDownloaded As Boolean = False
            While Not isDownloaded
                Try
                    Console.WriteLine("Downloading:" & s)
                    If Not File.Exists(target) Then
                        w.DownloadFile(s, target)
                    End If
                    isDownloaded = True
                Catch exc As Exception
                    Console.WriteLine(exc.Message)
                    ' both tests need ">= 0"; a bare IndexOf returns -1 (True) on a miss
                    If exc.Message.IndexOf("(415)") >= 0 Or _
                        exc.Message.IndexOf("(404)") >= 0 Then
                        isDownloaded = True
                    End If
                End Try
            End While
        Next
    End Sub
End Module
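
To get back to the original question (links as strings, filtered by keyword), the same regex can be used without the download step. The following is only a minimal sketch: the module name LinkFilter, the function FindLinks, and the sample HTML are made up for illustration, and in a real program the html argument would be the page text, such as the string s read in SiteSweep above.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module LinkFilter
    ' Hypothetical helper: returns every href value in html that contains keyword.
    Function FindLinks(ByVal html As String, ByVal keyword As String) As ArrayList
        Dim found As New ArrayList
        ' Same pattern as in SiteSweep: quoted value, or an unquoted run of non-spaces
        Dim r As New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
            RegexOptions.IgnoreCase)
        For Each m As Match In r.Matches(html)
            Dim link As String = m.Groups(1).Value
            ' case-insensitive keyword test
            If link.ToLower().IndexOf(keyword.ToLower()) >= 0 Then
                found.Add(link)
            End If
        Next
        Return found
    End Function

    Sub Main()
        ' Sample markup for illustration only
        Dim html As String = "<a href=""/slides/intro.ppt"">Intro</a> " & _
            "<a href=""http://example.com/about.html"">About</a>"
        For Each link As String In FindLinks(html, "slides")
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

With the sample markup above, only /slides/intro.ppt is printed. As with the original, the regex will also pick up hrefs inside HTML comments.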
Scott Swigart
www.swigartconsulting.com
blog.swigartconsulting.com
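
The root and current strings in SiteSweep resolve relative links by hand. As an alternative (not what the code above does), the framework's System.Uri class has a constructor that takes a base Uri and a relative string and does the same job. A small sketch, with example.com addresses standing in as placeholders:

```vbnet
Imports System

Module ResolveDemo
    Sub Main()
        ' Placeholder page address, not taken from the post above
        Dim page As New Uri("http://example.com/events/sessions/default.aspx")

        ' Site-relative link: resolved against the host
        Console.WriteLine(New Uri(page, "/files/a.zip").AbsoluteUri)
        ' Page-relative link: resolved against the page's folder
        Console.WriteLine(New Uri(page, "slides/b.ppt").AbsoluteUri)
        ' Already absolute: returned unchanged
        Console.WriteLine(New Uri(page, "http://other.example.com/c.doc").AbsoluteUri)
    End Sub
End Module
```

This prints http://example.com/files/a.zip, http://example.com/events/sessions/slides/b.ppt and http://other.example.com/c.doc, and it also copes with "../" segments, which the string approach does not.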

Nov 21 '05 #2