System.Net.WebClient screen scraping: how to gracefully handle 403 (and other) errors?

I've written a very small ASP.NET page to scrape thousands of pages of
content based on database IDs. It loops through a dataset to get the IDs. It
worked well in testing but now I am getting an annoying 403 error that
causes the script to abort halfway through my download.

I am wondering if there is a way in ASP.NET to have my code ignore 403
errors and other network errors, catch the error, and iterate to the next ID
in the dataset rather than aborting the whole job.

My code appears below. Thank you in advance.

-KF

string strConnection;
strConnection = ConfigurationSettings.AppSettings["connwhatever"];
SqlConnection conn = new SqlConnection(strConnection);
string query = // [my query];
SqlDataAdapter a = new SqlDataAdapter(query, conn);
DataSet s = new DataSet();
a.Fill(s);

int counter = 0;
foreach (DataRow dr in s.Tables[0].Rows)
{
    counter++;
    System.Net.WebClient wc = new WebClient();
    string strData = wc.DownloadString("http://whatever.org/article.asp?articleid=" + dr[0].ToString());
    FileStream fstream = new FileStream(@"c:\whateverpath\" + dr[0].ToString() + ".htm", FileMode.Create, FileAccess.Write);
    StreamWriter stream = new StreamWriter(fstream);
    stream.Write(strData);
    stream.Close();
    fstream.Close();
}


Jan 10 '07 #1
please read chapter on try/catch

-- bruce (sqlwork.com)

Jan 10 '07 #2
I understand try/catch in general. Which exception(s) should I be trying to catch?

Thank you,
-KF

Jan 10 '07 #3
Hello KF,

Based on your description, you are using the WebClient class to request many
web pages programmatically from ASP.NET page code. When one of those requests
throws an exception, the loop in your page aborts. Is that correct?

A 403 error normally means that an authorization check failed on the server
side. I'm not sure whether something more specific is going on here, but if
you simply want to catch and ignore such errors and continue the loop, wrap
the call to the WebClient DownloadXXX method in a try/catch block. If an
exception is caught, ignore it and skip to the next iteration, e.g.:

=======================
foreach (DataRow dr in s.Tables[0].Rows)
{
    counter++;
    System.Net.WebClient wc = new WebClient();

    // declared outside the try block so the rest of the loop body can use it
    string strData;
    try
    {
        strData = wc.DownloadString("http://whatever.org/article.asp?articleid=" + dr[0].ToString());
    }
    catch (Exception ex)
    {
        // ignore the error and move on to the next row
        continue;
    }

    ....................................
}
=========================
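To make the skip-and-continue idea concrete, here is a minimal sketch of the whole loop, offered as an illustration rather than a definitive implementation. The names ArticleScraper, DownloadAll and DownloadWithWebClient are hypothetical, the download step is passed in as a delegate so the loop can be exercised without a network, and File.WriteAllText is swapped in for the FileStream/StreamWriter pair from the original post. Only WebException is caught, so unrelated bugs still surface.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

public static class ArticleScraper
{
    // Downloads one page per ID and saves it as <id>.htm in outputDir.
    // A failed download is logged and skipped; the failed IDs are returned
    // so they can be retried or reported afterwards.
    public static List<string> DownloadAll(
        IEnumerable<string> ids, string outputDir, Func<string, string> download)
    {
        var failed = new List<string>();
        foreach (string id in ids)
        {
            try
            {
                string html = download(id);
                File.WriteAllText(Path.Combine(outputDir, id + ".htm"), html);
            }
            catch (WebException ex)
            {
                // 403s, timeouts, DNS failures and similar errors land here.
                failed.Add(id);
                Console.Error.WriteLine("Skipping " + id + ": " + ex.Message);
            }
        }
        return failed;
    }

    // Download step based on WebClient, using the URL pattern from the question.
    public static string DownloadWithWebClient(string id)
    {
        using (var wc = new WebClient())
        {
            return wc.DownloadString("http://whatever.org/article.asp?articleid=" + id);
        }
    }
}
```

With the IDs read from the DataSet into a list of strings, the call would look like ArticleScraper.DownloadAll(ids, @"c:\whateverpath", ArticleScraper.DownloadWithWebClient). Returning the failed IDs instead of silently dropping them makes it possible to rerun just those rows later.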

Does this work for your scenario?

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

==================================================

Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/subscripti...ult.aspx#notifications.

Note: The MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 1 business day is acceptable. Please note that each follow
up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions or complex
project analysis and dump analysis issues. Issues of this nature are best
handled working with a dedicated Microsoft Support Engineer by contacting
Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/subscripti...t/default.aspx.

==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.

Jan 10 '07 #4
WebClient and HttpWebRequest normally throw a System.Net.WebException.
However, any exception can be handled through the base class Exception,
so you can use either

try
{
}
catch (Exception)
{
}

or

try
{
}
catch (WebException)
{
}
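Catching WebException specifically also lets you see which error occurred. The sketch below is only an illustration: the WebErrors class and its two methods are hypothetical helpers, while WebException.Response, WebException.Status and HttpWebResponse.StatusCode are the framework members it relies on. Response is null when the request failed before any HTTP response arrived, so the cast has to be checked.

```csharp
using System;
using System.Net;

public static class WebErrors
{
    // Returns the HTTP status code carried by the exception, or null when the
    // failure happened before any HTTP response arrived (DNS failure, timeout).
    public static HttpStatusCode? GetStatusCode(WebException ex)
    {
        var response = ex.Response as HttpWebResponse;
        return response != null ? (HttpStatusCode?)response.StatusCode : null;
    }

    // Builds a short description suitable for a log line.
    public static string Describe(WebException ex)
    {
        HttpStatusCode? code = GetStatusCode(ex);
        if (code == HttpStatusCode.Forbidden) return "403 Forbidden";
        if (code.HasValue) return "HTTP " + (int)code.Value;
        return "network error: " + ex.Status;
    }
}
```

Inside the loop, a handler of the form catch (WebException ex) could log WebErrors.Describe(ex) before moving on to the next row, which keeps a record of which IDs failed and why.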

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

Jan 10 '07 #5
You can specify error documents for specific error codes.

In your web.config, add the following entries to the <system.web> section:

<customErrors>
  <error statusCode="403" redirect="403.aspx"/>
</customErrors>

Note that a 403 is generally returned by the web server and not by the
ASP.NET engine. At times, when authentication fails, a 403 may be returned
by an IHttpModule, such as one of the authentication modules (NTLM,
Kerberos, Digest, etc.).
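For context, here is a sketch of where that fragment sits in a complete web.config. The mode and defaultRedirect values and the error.aspx page are assumptions added for illustration; only the 403 entry comes from the suggestion above. These settings control the error pages served by the application that owns this web.config.

```xml
<configuration>
  <system.web>
    <!-- mode="On" shows the custom pages to all visitors;
         the default, RemoteOnly, shows them to remote visitors only -->
    <customErrors mode="On" defaultRedirect="error.aspx">
      <error statusCode="403" redirect="403.aspx" />
    </customErrors>
  </system.web>
</configuration>
```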

--
Happy Hacking,
Gaurav Vaish | www.mastergaurav.com
www.edujini-labs.com
http://eduzine.edujinionline.com

Jan 10 '07 #6
This worked great for my scenario. Thanks very much to everyone for the
timely assistance.

-KF

Jan 10 '07 #7
