Screen Scraping Issue

Knoxy (Guest)
#1: Sep 29 '06
Hi guys,
I've got this working, but I have issues when there is any kind of C#
code on the page that I'm trying to scrape (pages within my own site;
it's for a print page view, basically). I get this error:

The remote server returned an error: (500) Internal Server Error

Now, I've stepped into the page that it's calling, and it doesn't come
across an error. Any ideas? Code below:

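private string ReadHtmlFromUrl(string url)
{
    try
    {
        System.Net.WebRequest httpReq = System.Net.WebRequest.Create(url);
        System.Net.WebResponse httpRes = httpReq.GetResponse();

        // Fetch the response stream once and reuse it for every read.
        System.IO.Stream stream = httpRes.GetResponseStream();
        byte[] buffer = new byte[1024];
        System.Text.StringBuilder sb = new System.Text.StringBuilder();

        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) != 0)
        {
            // Decode only the bytes actually read, not the whole buffer.
            // A multi-byte character split across two reads can still be mangled.
            sb.Append(System.Text.Encoding.UTF8.GetString(buffer, 0, bytesRead));
        }

        httpRes.Close();

        return sb.ToString();
    }
    catch (Exception ex)
    {
        // Swallowing the exception hides the cause of any failure.
        return "";
    }
}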



Cheers,
Andrew


Chris Fulstow (Guest)
#2: Sep 29 '06

Try using Simon Mourier's Html Agility Pack:
http://www.codeplex.com/Wiki/View.as...tmlagilitypack

It's excellent for screen scraping and very easy to use. It'll also
parse malformed HTML and pages containing code.

Knoxy (Guest)
#3: Sep 29 '06

Cheers for the info Chris, appreciated.

At the minute, I'm really looking for what I'm doing wrong in my code
though :-)

Anyone any ideas?


Knoxy (Guest)
#4: Sep 29 '06

In fact... it seems to break when I pass anything as a querystring
parameter...

eg: url.aspx?param=val

Does that help at all?


Damien (Guest)
#5: Sep 29 '06

Hi Andrew,

Not sure about your particular problem, but could you not write your
function as:

private string ReadHtmlFromUrl(string url)
{
    try
    {
        System.Net.WebRequest httpReq = System.Net.WebRequest.Create(url);

        System.Net.WebResponse httpRes = httpReq.GetResponse();

        System.IO.StreamReader result =
            new System.IO.StreamReader(httpRes.GetResponseStream());

        try
        {
            return result.ReadToEnd();
        }
        finally
        {
            httpRes.Close();
        }
    }
    catch (Exception ex)
    {
        return "";
    }
}

As in, use the built in types and not worry about doing your own
buffering?

Damien

Knoxy (Guest)
#6: Sep 29 '06

Cheers Damien - yeah, that does seem a little simpler :-)

I'm still fairly stumped on this one, mind. Do I need to do anything
with the URL querystring data before I use it or something? It just
breaks on the httpReq.GetResponse() call...
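
On the question of preparing querystring data: as a general precaution, each parameter name and value can be escaped before it is appended to the URL, so that spaces, `&` and `=` inside a value do not change the shape of the query. The sketch below is only an illustration; `QueryStringExample` and `AppendParam` are hypothetical names, not anything from the posted code, and nothing in the thread shows that unescaped values were the cause of the 500 error.

```csharp
using System;

public static class QueryStringExample
{
    // Hypothetical helper (not from the thread): appends one escaped
    // name/value pair to a page URL.
    public static string AppendParam(string pageUrl, string name, string value)
    {
        // Use '?' for the first parameter and '&' for any later ones.
        string separator = pageUrl.Contains("?") ? "&" : "?";
        return pageUrl + separator
            + Uri.EscapeDataString(name) + "=" + Uri.EscapeDataString(value);
    }
}
```

For example, `QueryStringExample.AppendParam("http://localhost/print.aspx", "param", "a b&c")` yields `http://localhost/print.aspx?param=a%20b%26c` (the host and page name here are made up). A plain value such as `val` passes through unchanged.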


Knoxy (Guest)
#7: Oct 4 '06

Update:
I've just got back onto this and tried using the webclient class
instead...

try
{
    System.Net.WebClient httpWeb = new System.Net.WebClient();

    return System.Text.Encoding.UTF8.GetString(httpWeb.DownloadData(url));
}
catch (Exception ex)
{
    return "";
}

I still get the same problem though :-( The ol' "The remote server
returned an error: (500) Internal Server Error", even though the page
works fine when I browse to the actual page I'm trying to scrape.

I'm stumped, anyone out there? :-)


Joerg Jooss (Guest)
#8: Oct 5 '06

HTTP 500 Server Error means, um, server error. Without any insight into what
happens on the server-side, it's just guesswork ;-)

Having said that, you should send at least the HTTP headers User-Agent, Accept-Encoding,
and Accept.

Cheers,
--
Joerg Jooss
news-reply@joergjooss.de
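
Both points in this reply can be sketched in code. The example below is a minimal illustration, assuming .NET 2.0 or later: it sets the three headers on an `HttpWebRequest` and, when the server answers with an error status, reads the body of the error response instead of discarding it, since that body often carries the server's own description of the failure. `ScrapeDiagnostics`, `BuildRequest`, `ReadHtmlOrError` and the User-Agent value are assumptions made for the example, not part of the posted code.

```csharp
using System;
using System.IO;
using System.Net;

public static class ScrapeDiagnostics
{
    // Assumed browser-like User-Agent string; any realistic value would do.
    public const string UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)";

    // Builds a request carrying User-Agent, Accept and Accept-Encoding.
    public static HttpWebRequest BuildRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = UserAgent;
        request.Accept = "*/*";
        // Advertises gzip/deflate and decompresses the reply automatically,
        // so the body can be decoded as text afterwards.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }

    // Returns the page HTML, or the server's error page on an HTTP error.
    public static string ReadHtmlOrError(string url)
    {
        try
        {
            using (WebResponse response = BuildRequest(url).GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            if (ex.Response == null)
            {
                // No HTTP response at all, e.g. the connection failed.
                return "Request failed: " + ex.Status;
            }
            // A 500 still has a body; read it to see what the server reported.
            using (StreamReader reader = new StreamReader(ex.Response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

One design note: requesting compressed content by adding an accept-encoding header by hand, without enabling decompression, would hand back compressed bytes that cannot be decoded directly as UTF-8 text. Letting `AutomaticDecompression` manage the header avoids that.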


Knoxy (Guest)
#9: Oct 26 '06

Thanks for the reply, Joerg.

Just letting you know that it was a simple case of a user control used
by the page that was failing when I was reading in info from the
Request object.

But I've got another problem now that I've uploaded it to the live
server (it worked on my machine). I wrote my errors out to the
trace.axd page and found this:

The underlying connection was closed: Unable to connect to the remote
server

Now I'm not passing any HTTP headers - will this cause a problem on a
secured web server and if so why? :-)

Cheers,
Knoxy


Knoxy (Guest)
#10: Oct 26 '06

ps: by adding those headers...

httpWeb.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
httpWeb.Headers.Add("accept", "*/*");
httpWeb.Headers.Add("accept-encoding", "gzip, deflate");

... it still came back with the error:

The underlying connection was closed: Unable to connect to the remote
server


Any ideas why this might be different on the live server as opposed to
my dev machine?

Regards,
Knoxy
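
The thread was closed without an answer to this, so the following is only a diagnostic sketch, not a confirmed fix. HTTP headers are sent once a connection exists, so they seem unlikely to explain a failure to connect. A first step could be to record `WebException.Status`, which separates DNS, connection, and HTTP-level failures. `ConnectionDiagnostics`, `Describe` and `Probe` are hypothetical names, and the hint texts are assumptions about common causes (for instance, a live server that cannot reach its own public host name through a firewall or proxy).

```csharp
using System;
using System.Net;

public static class ConnectionDiagnostics
{
    // Hypothetical helper: maps a WebExceptionStatus to a hint about where to look.
    public static string Describe(WebExceptionStatus status)
    {
        switch (status)
        {
            case WebExceptionStatus.NameResolutionFailure:
                return "DNS: the server cannot resolve the host name in the URL";
            case WebExceptionStatus.ConnectFailure:
                return "Network: no connection could be made (firewall, proxy, or wrong host/port)";
            case WebExceptionStatus.ProtocolError:
                return "HTTP: the server answered with an error status such as 500";
            case WebExceptionStatus.Timeout:
                return "Timeout: no response within the allowed time";
            default:
                return "Other: " + status;
        }
    }

    // Fetches the URL and returns either "OK" or a hint for the failure.
    public static string Probe(string url)
    {
        try
        {
            using (WebClient client = new WebClient())
            {
                client.DownloadData(url);
                return "OK";
            }
        }
        catch (WebException ex)
        {
            return Describe(ex.Status) + " (" + ex.Message + ")";
        }
    }
}
```

Writing the result of `Probe(url)` to the trace output on both the dev machine and the live server would show whether the two environments fail at the same stage.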

Closed Thread