Screen Scraping Issue
Hi guys,
I've got this working, but I have issues when there is any kind of C#
code on the page that I'm trying to scrape (pages within my own site;
it's for a print page view, basically). I get this error:
The remote server returned an error: (500) Internal Server Error
Now, I've stepped into the page that it's calling, and it doesn't come
across an error. Any ideas? Code below:
private string ReadHtmlFromUrl(string url)
{
    try
    {
        System.Net.WebRequest httpReq = System.Net.WebRequest.Create(url);
        System.Net.WebResponse httpRes = httpReq.GetResponse();
        byte[] buffer = new byte[1024];
        System.Text.StringBuilder sb = new System.Text.StringBuilder();
        while (httpRes.GetResponseStream().Read(buffer, 0, buffer.Length) != 0)
        {
            sb.Append(System.Text.Encoding.UTF8.GetString(buffer));
        }
        httpRes.Close();
        return sb.ToString();
    }
    catch (Exception ex)
    {
        return "";
    }
}
Cheers,
Andrew
re: Screen Scraping Issue
Try using Simon Mourier's Html Agility Pack: http://www.codeplex.com/Wiki/View.as...tmlagilitypack
It's excellent for screen scraping and very easy to use. It'll also
parse malformed HTML and pages containing code.
re: Screen Scraping Issue
Cheers for the info, Chris, appreciated.
At the minute, I'm really looking for what I'm doing wrong in my code though
:-)
Anyone have any ideas?
re: Screen Scraping Issue
In fact... it seems to break when I pass anything as a querystring
parameter...
e.g. url.aspx?param=val
Does that help at all?
re: Screen Scraping Issue
Hi Andrew,
Not sure about your particular problem, but could you not write your
function as:
private string ReadHtmlFromUrl(string url)
{
    try
    {
        System.Net.WebRequest httpReq = System.Net.WebRequest.Create(url);
        System.Net.WebResponse httpRes = httpReq.GetResponse();
        System.IO.StreamReader result = new System.IO.StreamReader(httpRes.GetResponseStream());
        try
        {
            return result.ReadToEnd();
        }
        finally
        {
            httpRes.Close();
        }
    }
    catch (Exception ex)
    {
        return "";
    }
}
As in, use the built in types and not worry about doing your own
buffering?
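For what it's worth, the manual loop in the original post has a separate problem from the 500 error: it calls GetString(buffer) on the whole 1024-byte array, so a short final read appends stale bytes left over from the previous pass. A minimal sketch of both approaches, written against a plain Stream so it can be tried without a web server. The class and method names (StreamText, ReadWithBuffer, ReadWithReader) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamText
{
    // Manual loop, corrected: keep only the bytes that Read() actually returned.
    public static string ReadWithBuffer(Stream stream)
    {
        byte[] buffer = new byte[1024];
        MemoryStream collected = new MemoryStream();
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            collected.Write(buffer, 0, read);
        }
        // Decode once at the end so a multi-byte character split across
        // two reads is not corrupted.
        return Encoding.UTF8.GetString(collected.ToArray());
    }

    // The built-in reader does the buffering and decoding for you.
    public static string ReadWithReader(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With a real response you would pass httpRes.GetResponseStream() once, rather than calling it again on every pass of the loop.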
Damien
re: Screen Scraping Issue
Cheers Damien - yeah, that does seem a little simpler :-)
I'm still fairly stumped on this one, mind. Do I need to do anything
with the URL querystring data before I use it or something? It just breaks
on the httpReq.GetResponse() call...
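On the querystring question: a plain ?param=val needs no special treatment, but a value containing spaces, & or = should be percent-encoded before it goes into the URL. A minimal sketch using Uri.EscapeDataString, assuming .NET 2.0 or later. The helper name PrintUrl.WithParameter and the sample URL are hypothetical, and this is a general precaution rather than a confirmed cause of the 500.

```csharp
using System;

public static class PrintUrl
{
    // Appends name=value to pageUrl, percent-encoding both parts.
    // Uses "&" if the URL already has a query string, otherwise "?".
    public static string WithParameter(string pageUrl, string name, string value)
    {
        string separator = pageUrl.Contains("?") ? "&" : "?";
        return pageUrl + separator
            + Uri.EscapeDataString(name) + "=" + Uri.EscapeDataString(value);
    }
}
```

For example, PrintUrl.WithParameter("http://localhost/print.aspx", "title", "Q&A page") gives http://localhost/print.aspx?title=Q%26A%20page. Note also that WebRequest.Create expects an absolute URL, so the http://host part has to be included.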
re: Screen Scraping Issue
Update:
I've just got back onto this and tried using the WebClient class
instead...
try
{
    System.Net.WebClient httpWeb = new System.Net.WebClient();
    return System.Text.Encoding.UTF8.GetString(httpWeb.DownloadData(url));
}
catch (Exception ex)
{
    return "";
}
I still get the same problem though :-( The ol' "The remote server
returned an error: (500) Internal Server Error", even though the page works
fine when I browse to the actual page I'm trying to scrape.
I'm stumped, anyone out there? :-)
re: Screen Scraping Issue
HTTP 500 Server Error means, um, server error. Without any insight into what
happens on the server-side, it's just guesswork ;-)
Having said that, you should send at least the HTTP headers User-Agent, Accept-Encoding,
and Accept.
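One way to get that server-side insight from the client: the catch block that returns "" throws away the WebException, which usually still carries the server's response. A minimal sketch that reports the status and, when an HTTP response came back, its body (which may contain the server's error page, depending on how the site is configured). The names ScrapeDiagnostics and Describe are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;

public static class ScrapeDiagnostics
{
    // Summarises a failed request: the WebException status and message,
    // plus the HTTP status code and response body when the server replied.
    public static string Describe(WebException ex)
    {
        string summary = "Status: " + ex.Status + ", message: " + ex.Message;

        HttpWebResponse response = ex.Response as HttpWebResponse;
        if (response == null)
        {
            // No HTTP response at all, e.g. the connection could not be opened.
            return summary;
        }

        using (response)
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return summary + ", HTTP " + (int)response.StatusCode
                + Environment.NewLine + reader.ReadToEnd();
        }
    }
}
```

To use it, catch WebException ahead of the general Exception handler and log or trace the result of ScrapeDiagnostics.Describe(ex) instead of returning an empty string straight away.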
Cheers,
--
Joerg Jooss news-reply@joergjooss.de
re: Screen Scraping Issue
Thanks for the reply, Joerg.
Just letting you know that it was a simple case of a user control used
by the page that was failing when it read info from the
Request object.
But I then got another problem when I uploaded it to the live server
(it worked on my machine). I loaded up a trace.axd page with my
errors written into it:
The underlying connection was closed: Unable to connect to the remote
server
Now I'm not passing any HTTP headers - will this cause a problem on a
secured web server and if so why? :-)
Cheers,
Knoxy
re: Screen Scraping Issue
PS: by adding those headers...
httpWeb.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
httpWeb.Headers.Add("accept", "*/*");
httpWeb.Headers.Add("accept-encoding", "gzip, deflate");
... it still came back with the error:
The underlying connection was closed: Unable to connect to the remote
server
Any ideas why this might be different on the live server as opposed to
my dev machine?
Regards,
Knoxy
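A hedged note on the headers: "Unable to connect to the remote server" is raised when the connection itself cannot be opened, which happens before any headers are sent, so the headers are unlikely to be the cause. It is probably worth checking whether the live server can resolve and reach the site's own host name (DNS, firewall or proxy rules), though that is a guess without seeing the setup. If you do want browser-like headers, the sketch below sets them on an HttpWebRequest, assuming .NET 2.0 or later. The class name BrowserLikeRequest is made up for illustration.

```csharp
using System;
using System.Net;

public static class BrowserLikeRequest
{
    // Builds a request with browser-like headers. Nothing is sent until
    // GetResponse() is called on the returned object.
    public static HttpWebRequest Create(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // User-Agent and Accept are set through properties on HttpWebRequest.
        request.UserAgent =
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
        request.Accept = "*/*";

        // Sends Accept-Encoding: gzip, deflate and decompresses the body
        // transparently when the response is read.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;

        return request;
    }
}
```

The reason for AutomaticDecompression rather than a hand-written accept-encoding header: as far as I know, WebClient does not decompress the response by itself, so asking for gzip and then decoding the raw bytes as UTF-8 can return garbage.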