Ferreting out broken links, Part 2

Dave

Hello All,

A couple of weeks ago, I undertook to write a utility that would loop
through various URLs and test whether they were valid. I got some good help
from this list, and was able to write the utility.

Now, I have run into a problem that is difficult for me to solve. It is
this: When looping through a large set of URLs, if many of the URLs are
bad, the program will time out. Conversely, if most of the URLs are good,
it will perform as expected and complete.

I am including code stubs below that will illustrate this.
Dim req As System.Net.HttpWebRequest
Dim resp As System.Net.HttpWebResponse

For i = 0 To 1000

    s = "www.google.com"

    req = System.Net.WebRequest.Create(s)

    Try
        resp = req.GetResponse()
        LinkStatus = resp.StatusCode.ToString

        resp.Close()

    Catch exWeb As System.Net.WebException

        LinkStatus = exWeb.Message

    End Try

Next i

' The preceding block will work because finding www.google.com 1000 times is
' not time-consuming.

' But the next block tries to access a non-existent site. Even doing this
' "only" 500 times causes the app to time out, evidently because it takes
' longer to GetResponse() a non-existent site.
For i = 0 To 500

    ' This time we try a non-existent site
    s = "www.google.edu"

    req = System.Net.WebRequest.Create(s)
    Try
        resp = req.GetResponse()
        LinkStatus = resp.StatusCode.ToString

        resp.Close()

    Catch exWeb As System.Net.WebException

        LinkStatus = exWeb.Message

    End Try

Next i

So can anybody provide any pointers or documentation that would help me
solve this problem? I need the program to be able to handle large sets of
invalid URLs.

Thanks very much in advance,

Dave

Nov 21 '05 #1

AMercer

> Now, I have run into a problem that is difficult for me to solve. It is
> this: When looping through a large set of URLs, if many of the URLs are
> bad, the program will time out. Conversely, if most of the URLs are good,
> it will perform as expected and complete.

The program will time out? What does that mean?

Regardless, the way I would approach this problem is by using multiple
threads. Create a queue of urls to be tested and an empty output queue
of urls and their status (success, non-existent, timeout, whatever). Create
the proper mechanisms to thread-safely dequeue a url to be tested and enqueue
a url and its status when the url has been tested. Launch a number of
threads where the processing of each thread is as follows:

while true
    if there are no urls left to test then return, thereby ending the thread
    dequeue a url to test
    test the url
    update the output queue with the result
end while

Your main program launches a number of these threads (10 is as good a
starting number as any) and waits until they are all completed. Each thread
will operate independently of the others. If one runs slowly because it ran
into a batch of invalid urls, others will run more quickly. The process will
bog down only when all threads are running slowly, in which case you should
launch more threads (i.e., if 10 is still too slow, then try 20). So long as at
least one thread is having good success testing urls, the entire process will
keep moving along.
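
The worker loop described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the thread: the names LinkChecker, CheckAll and testUrl are hypothetical, the URL test is passed in as a delegate so the queueing logic can be tried without network access, and the generic collections and lambda syntax assume a reasonably recent VB.NET compiler.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Threading

' Hypothetical helper: names and structure are illustrative only.
Public Module LinkChecker

    ' Tests every URL using workerCount threads and returns URL -> status.
    ' testUrl is supplied by the caller (e.g. a wrapper around GetResponse).
    Public Function CheckAll(urls As IEnumerable(Of String),
                             workerCount As Integer,
                             testUrl As Func(Of String, String)) As Dictionary(Of String, String)

        Dim pending As New Queue(Of String)(urls)
        Dim results As New Dictionary(Of String, String)
        Dim workers As New List(Of Thread)

        For n As Integer = 1 To workerCount
            Dim worker As New Thread(
                Sub()
                    While True
                        Dim url As String = Nothing
                        ' The emptiness check and the dequeue share one lock,
                        ' so two threads can never take the same URL.
                        SyncLock pending
                            If pending.Count = 0 Then Return
                            url = pending.Dequeue()
                        End SyncLock

                        ' The slow part runs outside any lock.
                        Dim status As String = Nothing
                        Try
                            status = testUrl(url)
                        Catch ex As Exception
                            status = ex.Message
                        End Try

                        SyncLock results
                            results(url) = status
                        End SyncLock
                    End While
                End Sub)
            workers.Add(worker)
            worker.Start()
        Next

        ' Wait until every worker has found the queue empty and exited.
        For Each worker As Thread In workers
            worker.Join()
        Next

        Return results
    End Function

End Module
```

A call such as LinkChecker.CheckAll(urls, 10, AddressOf MyTestUrl), where MyTestUrl is whatever function wraps the GetResponse code from the original post, would start with the 10 threads suggested above. Because only the dequeue and the result update are locked, one thread stuck on a dead host does not hold up the others.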

Nov 21 '05 #2

Dave

By timeout, I get this message:

Description: An unhandled exception occurred during the execution of the
current web request. Please review the stack trace for more information
about the error and where it originated in the code.

Exception Details: System.Web.HttpException: Request timed out.

[HttpException (0x80004005): Request timed out.]

It's an interesting solution you lay out - if I don't see anything simpler,
I will give it a try. Statements like this, from Help, which I don't fully
understand, take some of my optimism away, though:
Thread Safety
An application must run in full trust mode when using serialization.

Anyway, thanks for your quick response.

Dave

Nov 21 '05 #3

AMercer

1. I don't know if it will make a difference or not, but I suggest you put
    req = System.Net.WebRequest.Create(s)
inside the try block.

2. Re:
> An application must run in full trust mode when using serialization.

Don't worry about this - serialization as used by MS means preserving an
object to a medium like a disk, and later, deserializing it refers to
creating a clone of the object from disk. It is a poor choice of words.
Before MS appropriated the word, to serialize used to mean to place in
sequence. In some computer science settings, it means enqueue.

3. Re threading... Assuming you have queues (System.Collections.Queue) for
use by the threads, the problem you need to solve is the contention problem.
You have to prevent two threads from updating a queue at the same time. Two
among many choices are:
    Threading.Monitor.Enter(MyQueue)
    ... enqueue or dequeue here
    Threading.Monitor.Exit(MyQueue)
and
    SyncLock MyQueue
    ... enqueue or dequeue here
    End SyncLock
Additionally, the .NET Queue object has a property, IsSynchronized, which sounds
like it solves the contention problem, but I have no experience with it.
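
On the IsSynchronized point, a small sketch may help. IsSynchronized is read-only and only reports whether a queue is a thread-safe wrapper; the wrapper itself comes from Queue.Synchronized. Even with the wrapper, "check Count, then Dequeue" is two separate calls, so it still needs a lock around both. The module and the TryTake helper below are hypothetical names used only for illustration.

```vbnet
Imports System
Imports System.Collections

Public Module QueueLocking

    ' Hypothetical helper: removes the next item, or returns Nothing when the
    ' queue is empty. The Count test and the Dequeue share one lock so no
    ' other thread can empty the queue in between.
    Public Function TryTake(q As Queue) As Object
        SyncLock q.SyncRoot
            If q.Count = 0 Then Return Nothing
            Return q.Dequeue()
        End SyncLock
    End Function

    Public Sub Demo()
        Dim plain As New Queue()
        Dim wrapped As Queue = Queue.Synchronized(plain)

        Console.WriteLine(plain.IsSynchronized)         ' False
        Console.WriteLine(wrapped.IsSynchronized)       ' True

        wrapped.Enqueue("http://www.google.com")
        Console.WriteLine(TryTake(wrapped))             ' the URL just enqueued
        Console.WriteLine(TryTake(wrapped) Is Nothing)  ' True
    End Sub

End Module
```

Locking on SyncRoot rather than on the queue object itself means the helper cooperates with the wrapper's own internal locking.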

Good luck.

Nov 21 '05 #4

Dave

Okay, thanks again.

1) Actually, I have the req = ...Create(s) stmt in its own Try block since
some of the URLs are so poorly formed they create a URI error on that
statement. I left all that out for clarity.

2) Thanks for that good info. That is encouraging.

3) I will probably try this, but I'm going to have to come back to it
since it looks like I need to teach myself multi-threading.

Anyway, thanks a lot for all this.

Dave

Nov 21 '05 #5

Dave

This solution was almost too simple.

Just added the following line in the Page_Load() sub, and it was done.

Server.ScriptTimeout = 300 ' i.e. 5 minutes
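
Raising Server.ScriptTimeout gives the page more time overall. A complementary option, not discussed in the thread, is to cap how long each individual request may wait by setting the Timeout property of HttpWebRequest, which is expressed in milliseconds. The sketch below is an assumption-laden illustration: LinkProbe, GetLinkStatus and timeoutMs are hypothetical names, the HEAD method is a suggestion rather than something the original code used, and name resolution for a non-existent host can still take longer than the Timeout value. In recent .NET versions these classes are marked obsolete in favour of HttpClient, so treat this as matching the era of the thread.

```vbnet
Imports System
Imports System.Net

Public Module LinkProbe

    ' Hypothetical helper: returns the HTTP status name, or the error text.
    ' timeoutMs caps how long one request may wait for a response.
    ' Assumes http/https URLs; other schemes would fail the HttpWebRequest cast.
    Public Function GetLinkStatus(url As String, timeoutMs As Integer) As String
        Try
            Dim req As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
            req.Timeout = timeoutMs   ' milliseconds
            req.Method = "HEAD"       ' headers only; some servers reject HEAD

            Using resp As HttpWebResponse = CType(req.GetResponse(), HttpWebResponse)
                Return resp.StatusCode.ToString()
            End Using
        Catch ex As WebException
            ' Unreachable host, timeout, or an HTTP error status.
            Return ex.Message
        Catch ex As UriFormatException
            ' Badly formed URL, as mentioned earlier in the thread.
            Return ex.Message
        End Try
    End Function

End Module
```

For example, GetLinkStatus("http://www.google.com", 5000) would give up on a silent server after roughly five seconds instead of waiting for the default timeout.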

Dave

Nov 21 '05 #6
