System.Net.Webclient screen scraping: how to gracefully handle 403 (and other) errors?

I've written a very small ASP.NET page to scrape thousands of pages of
content based on database IDs. It loops through a dataset to get the IDs. It
worked well in testing but now I am getting an annoying 403 error that
causes the script to abort halfway through my download.

I am wondering if there is a way in ASP.NET to have my code ignore 403
errors and other network errors, catch the error, and iterate to the next ID
in the dataset rather than aborting the whole job.

My code appears below. Thank you in advance.

-KF

string strConnection = ConfigurationSettings.AppSettings["connwhatever"];
SqlConnection conn = new SqlConnection(strConnection);
string query = // [my query];
SqlDataAdapter a = new SqlDataAdapter(query, conn);
DataSet s = new DataSet();
a.Fill(s);

int counter = 0;

foreach (DataRow dr in s.Tables[0].Rows)
{
    counter++;

    System.Net.WebClient wc = new WebClient();
    string strData = wc.DownloadString("http://whatever.org/article.asp?articleid=" + dr[0].ToString());

    FileStream fstream = new FileStream(@"c:\whateverpath\" + dr[0].ToString() + ".htm", FileMode.Create, FileAccess.Write);
    StreamWriter stream = new StreamWriter(fstream);
    stream.Write(strData);
    stream.Close();
    fstream.Close();
}


Jan 10 '07 #1
Please read the chapter on try/catch.

-- bruce (sqlwork.com)


Jan 10 '07 #2
I understand try/catch in general. Which exception(s) should I be trying to catch?

Thank you,
-KF
Jan 10 '07 #3
Hello KF,

From your description, you're using the WebClient class to request many
web pages programmatically from ASP.NET page code, and when one of those
requests throws an exception, the loop in your page aborts. Is that
correct?

A 403 error normally means that an authorization check failed on the
server side. I'm not sure whether something more specific is going on
here, but if you simply want to catch such errors, ignore them, and
continue the loop, you can wrap the WebClient DownloadXXX call in a
try/catch block. If an exception is caught, skip the current iteration
and move on to the next one, e.g.

=======================
foreach (DataRow dr in s.Tables[0].Rows)
{
    counter++;

    System.Net.WebClient wc = new WebClient();

    // declared outside the try block so the code below can still use it
    string strData;

    try
    {
        strData = wc.DownloadString("http://whatever.org/article.asp?articleid=" + dr[0].ToString());
    }
    catch (Exception)
    {
        // ignore the error and move on to the next ID
        continue;
    }

    ....................................
}
=========================

Does this work for your scenario?

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

==================================================

Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/subscripti...ult.aspx#notifications.

Note: The MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 1 business day is acceptable. Please note that each follow
up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions or complex
project analysis and dump analysis issues. Issues of this nature are best
handled working with a dedicated Microsoft Support Engineer by contacting
Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/subscripti...t/default.aspx.

==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.

Jan 10 '07 #4
WebClient and HttpWebRequest normally throw a System.Net.WebException
when a request fails. Any exception can also be caught through the base
class "Exception", so you can use either

try
{
}catch(Exception)
{

}

or

try
{
}catch(WebException)
{

}
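
If you want to know why a request failed rather than just swallowing the error, a WebException carries that information. The following is only a rough sketch, not tested against your site: the ScrapeHelpers class and its GetStatusCode and TryDownload methods are names made up for illustration, and writing to the console is a placeholder for whatever logging you actually use. It relies on WebException.Response being an HttpWebResponse when the server answered with an error status such as 403, and being null when no HTTP response arrived at all (timeout, DNS failure and so on).

```csharp
using System;
using System.Net;

static class ScrapeHelpers
{
    // Hypothetical helper: returns the HTTP status code carried by a
    // WebException, or null when the request failed before any HTTP
    // response was received (timeout, DNS failure, refused connection).
    public static HttpStatusCode? GetStatusCode(WebException ex)
    {
        HttpWebResponse response = ex.Response as HttpWebResponse;
        return response != null ? response.StatusCode : (HttpStatusCode?)null;
    }

    // Hypothetical helper: downloads one page and returns null instead of
    // throwing when the request fails with a WebException.
    public static string TryDownload(string url)
    {
        using (WebClient wc = new WebClient())
        {
            try
            {
                return wc.DownloadString(url);
            }
            catch (WebException ex)
            {
                HttpStatusCode? code = GetStatusCode(ex);

                // Report 403, 404, ... when the server replied, otherwise
                // the transport-level status such as Timeout.
                Console.WriteLine("Skipping {0}: {1}", url,
                    code.HasValue ? ((int)code.Value).ToString() : ex.Status.ToString());
                return null;
            }
        }
    }
}
```

Inside the foreach loop you would then call ScrapeHelpers.TryDownload with the article URL and `continue` to the next row whenever it returns null. Catching WebException rather than Exception keeps unrelated bugs (a bad file path, a null reference) from being silently ignored.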

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead
This posting is provided "AS IS" with no warranties, and confers no rights.

Jan 10 '07 #5
You can specify error documents for specific error codes.

In your web.config, add the following entries in the <system.web> section:

<customErrors>
<error statusCode="403" redirect="403.aspx"/>
</customErrors>

Note that a 403 is generally issued by the web server rather than by the
ASP.NET engine. At times, when authentication fails, a 403 may be returned
by an IHttpModule, such as one of the authentication modules (NTLM,
Kerberos, Digest, etc.).

--
Happy Hacking,
Gaurav Vaish | www.mastergaurav.com
www.edujini-labs.com
http://eduzine.edujinionline.com

Jan 10 '07 #6
This worked great for my scenario. Thanks very much to everyone for the
timely assistance.

-KF

Jan 10 '07 #7
