Bytes IT Community

Downloading WebSites using HttpWebRequest

I am building a precache engine: one that requests over 100 pages on a
remote server in order to cache them remotely.
Can I use the HttpWebRequest and HttpWebResponse classes for this, or must I
use the MSHTML objects to really load the HTML and request all of the images
on the site?

string lcUrl = "http://www.cnn.com";

// *** Establish the request
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);

// *** Set properties
loHttp.Timeout = 10000; // 10 secs
loHttp.UserAgent = "Code Sample Web Client";

// *** Retrieve request info headers
HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();

Encoding enc = Encoding.GetEncoding(1252); // Windows default code page

StreamReader loResponseStream =
    new StreamReader(loWebResponse.GetResponseStream(), enc);

string lcHtml = loResponseStream.ReadToEnd();

loResponseStream.Close();
loWebResponse.Close();
Nov 16 '05 #1
16 Replies


Hi Thomas,

As for requesting and caching remote pages, HttpWebRequest is capable of
handling this. You can use HttpWebRequest to send a request to a given URL
and get its response stream; from there you can store the response (HTML or
any other MIME type) in whatever persistent medium you want, for example the
file system, memory, or a database.

The MSHTML components, by contrast, are a library for programmatically
processing a page's response as a document (a DOM structure), just as a web
browser does. If you just want the response itself (the HTML output or a
file stream), HttpWebRequest is enough and MSHTML is not necessary.
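That fetch-and-store flow can be sketched as follows (the URL, save path,
and the CopyStream helper are illustrative, not from the thread):

```csharp
using System;
using System.IO;
using System.Net;

class Precache
{
    // Copies the response body byte-for-byte so any MIME type survives.
    public static void CopyStream(Stream input, Stream output)
    {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            output.Write(buffer, 0, read);
    }

    // Fetches one URL and persists the raw response to disk.
    public static void SavePage(string url, string savePath)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10000; // 10 seconds, as in the sample above

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        using (FileStream file = new FileStream(savePath, FileMode.Create))
        {
            CopyStream(body, file);
        }
    }
}
```

Calling SavePage once per URL in a loop gives the basic precache behavior
the original question describes; the file system stands in for whichever
persistent medium is chosen.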
In addition, here are some tech articles on using the HttpWebRequest to
request web resources:

#Accessing Web Sites Using Desktop Applications
http://www.devsource.ziffdavis.com/p...=119849,00.asp

#Crawl Web Sites and Catalog Info to Any Data Store with ADO.NET and Visual
Basic .NET
http://msdn.microsoft.com/msdnmag/is...0/spiderinnet/

Hope this helps. Thanks.

Regards,

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)

Get Preview at ASP.NET whidbey
http://msdn.microsoft.com/asp.net/whidbey/default.aspx
Nov 16 '05 #2

Thanks Steven,

I need to make sure that I am remotely caching all of the HTML including all
pictures; hence I figured a simple WebRequest won't do.
So I am trying to get the GetResponseStream() result into an HTMLDocument
object to ensure that the entire site loads.
But:

StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);

string tmp = readStream.ReadLine();

HTMLDocument htmlDoc = new HTMLDocumentClass();

htmlDoc = (HTMLDocument) tmp; // ??? this cast fails -- how do I get the
// response stream into/as an HTMLDocument?

Any ideas?



///--------------- Full example

HttpWebRequest request =
    (HttpWebRequest)WebRequest.Create("http://www.microsoft.com");

request.MaximumAutomaticRedirections = 4;
request.MaximumResponseHeadersLength = 4;

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

Console.WriteLine("Content length is {0}", response.ContentLength);
Console.WriteLine("Content type is {0}", response.ContentType);

Stream receiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);

string tmp = readStream.ReadLine();

HTMLDocument htmlDoc = new HTMLDocumentClass();
htmlDoc = (HTMLDocument) tmp; // does not compile -- see question above

response.Close();
readStream.Close();
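On the cast question: a string cannot simply be cast to HTMLDocument.
Assuming a COM reference to MSHTML (the Microsoft.mshtml interop assembly,
Windows only), the usual route is to create an empty document and write the
markup into it, roughly:

```csharp
using mshtml; // COM interop assembly: Microsoft.mshtml (Windows only)

class HtmlLoader
{
    // Loads an HTML string into an MSHTML document so the DOM can be walked.
    public static IHTMLDocument2 Load(string html)
    {
        HTMLDocumentClass doc = new HTMLDocumentClass();
        IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
        doc2.write(new object[] { html }); // parse the markup
        doc2.close();
        return doc2; // links and images collections are now populated
    }
}
```

So the missing step is feeding the string read from the response stream to
IHTMLDocument2.write, rather than casting the string itself.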
Nov 16 '05 #3

It now appears that I cannot use HttpWebRequest, because I need to be able
to specify the Host header, and the HOST value in HttpWebRequest.Headers is
set by the system to the current host information with no way for me to
modify it.

I need to retrieve web pages from the remote server to cache them... any
ideas?


Nov 16 '05 #4

Hi Thomas,

Thanks for your follow-up. Since you want to request the page, retrieve its
response stream, and load it into an HTMLDocument to process it, you could
consider using the WebBrowser control for this task. You can use the
WebBrowser control to navigate to a web resource, and when the page is
loaded it will automatically be available as a Document object.

Regards,

Steven Cheng
Microsoft Online Support

Nov 16 '05 #5

Can't use the WebBrowser control because the application must be a web
application.

I dropped the HttpWebRequest/Response approach and opted for MSXML2, but
does ServerXMLHTTP support opening different ports?

MSXML2.ServerXMLHTTPClass();
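On the port question: ServerXMLHTTP takes its target port from the URL
itself, so a nonstandard port can be requested inline. A hedged sketch
(assumes a COM reference to MSXML2; the host and port are placeholders):

```csharp
// COM reference: "Microsoft XML" (MSXML2), Windows only.
MSXML2.ServerXMLHTTP xhr = new MSXML2.ServerXMLHTTPClass();

// The port rides along in the URL -- :8080 here is a placeholder.
xhr.open("GET", "http://www.example.com:8080/", false, null, null);
xhr.send(null);

string html = xhr.responseText; // response body as text
```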

Nov 16 '05 #6

Hi,

Why do you need to change the Host header?

Sunny
Nov 16 '05 #7

An example: different websites sharing the same IP.

Say microsoft.com and abc.com are both hosted on server 207.71.34.12;
a request then requires a Host header to specify the desired site.
Nov 16 '05 #8

So, are you saying that:

HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://microsoft.com/");

and

HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://abc.com/");

both create one and the same HttpWebRequest object, and you need to fix
the Host header?

In my tests the correct header is created, so I'm still wondering why you
cannot use HttpWebRequest for your task.

In the past I created a very basic web spider which uses HttpWebRequest,
then creates an MSHTML document from the fetched content; I was then able
to iterate over it and download all links and pictures.

Sunny

Nov 16 '05 #9

Sunny,

I am saying that

HttpWebRequest myReq =
    (HttpWebRequest)WebRequest.Create("http://microsoft.com/");

works great if you have a domain name. What about

(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

for microsoft.com and

(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

for abc.com? It is quite common for multiple sites to share one IP address;
going through DNS it's usually no problem, but I need to be able to access
a site directly.
In the example above, in order to get the correct site I must also supply
the microsoft.com or abc.com host header value.

It appears that one cannot modify certain headers in HttpWebRequest, and
Host is one of them.
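Given that constraint (the framework exposes no way to set Host on an
HttpWebRequest here), one workaround is to speak HTTP directly over a
socket: connect to the shared IP, but send whichever Host line selects the
desired virtual host. A rough sketch -- the IP and host names are the
thread's examples, and real use would also need redirect and
chunked-encoding handling:

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class RawHostRequest
{
    // Builds a minimal HTTP/1.1 request with an explicit Host header.
    public static string BuildRequest(string path, string hostHeader)
    {
        return "GET " + path + " HTTP/1.1\r\n" +
               "Host: " + hostHeader + "\r\n" +
               "Connection: close\r\n\r\n";
    }

    // Connects to the shared IP but asks for a specific virtual host.
    public static string Fetch(string ip, string hostHeader)
    {
        using (TcpClient client = new TcpClient(ip, 80))
        using (NetworkStream stream = client.GetStream())
        {
            byte[] req = Encoding.ASCII.GetBytes(BuildRequest("/", hostHeader));
            stream.Write(req, 0, req.Length);

            using (StreamReader reader = new StreamReader(stream, Encoding.ASCII))
                return reader.ReadToEnd(); // headers + body, unparsed
        }
    }
}
```

For example, Fetch("207.71.34.12", "abc.com") would reach abc.com even
though the socket is opened against the raw shared IP.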

Be a hero and share your spider code ;0) I am working on something
similar...

Nov 16 '05 #10

Hi Thomas,
(inline)

In article <uS**************@TK2MSFTNGP09.phx.gbl>, al*******@K.com
says...
> example above: in order for me to get the correct site i must also supply
> the microsoft.com host header value or abc.com host header value.
I was confused that you had rejected HttpWebRequest based only on the fact
that you cannot modify the Host header; that's why I asked the question :)
I cannot see a reason why you would want to do this. If you already know
what you want to put in that header, you just have to create the right
HttpWebRequest object. Or am I missing something?

> It appears that one cannot modify certain Headers in HttpWebRequest and
> Host is one of them.
There are a lot of things in the framework which are done on purpose, and
a lot that are not :). But especially for that header, I do not see a
reason for it to be exposed, as I said before.

> Be a hero and share your spider code ;0) i am working on something
> similar...


Unfortunately, I can share only a small part of the code. I'll post it
later.
Sunny

Nov 16 '05 #11

Sunny,

You got my attention (inline)

> I can not see a reason why you would like to do this. If you already
> know what you want to put in that header, you just have to create the
> right HttpWebRequest object. Or I'm missing something?
Am I missing something? How do I do this? I know what I want to put in that
header... I just want to specify the Host header value, but I don't think
that's possible.


> Unfortunately, I can share only a small part of the code. I'll post it
> later.

I will do the same



Nov 16 '05 #12

Hi Thomas,

As Sunny has mentioned, when we request sites distinguished by host header,
we can just create the HttpWebRequest object from the specified URL (with
the host name), and the server side can correctly route the request
according to the host header derived from that URL.
In addition, as for the WebBrowser control: it can be used in a web
application. For example, we can create a WinForms control which uses the
WebBrowser control and then embed that WinForms control in a web page (IE
supports embedded WinForms controls, which run in the client side's CLR).

Anyway, I think we can first have a look at Sunny's suggestion. Thanks.

Regards,

Steven Cheng
Microsoft Online Support

Nov 16 '05 #13

Hi Thomas,

Please read inline:

> Am I missing something? How do I do this? I know what I want to put in
> that header... I just want to specify the Host header value, but I don't
> think that's possible.

And from your previous post:

> I am saying that
>
> HttpWebRequest myReq =
>     (HttpWebRequest)WebRequest.Create("http://microsoft.com/");
>
> works great if you have a domain name... what about
>
> (HttpWebRequest)WebRequest.Create("http://207.71.134.23");
>
> for microsoft.com and for abc.com


So, if you know the domain name (microsoft.com or abc.com) and want to put
it in the header, then why not just create

HttpWebRequest myReq =
    (HttpWebRequest)WebRequest.Create("http://microsoft.com/");

This way the Host header will be set correctly.

That was my point, if you know with what you want to change the HOST
header, I.e. you know the domain, you can easily just create a new
HttpWebRequest with that domain.

I do not understand why someone would like to do this (pseudocode):

HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

myReq.HostHeader = "microsoft.com"; // this does not work
Why not directly create the webrequest against the known domain?
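Sunny's point can be sketched in a few lines: HttpWebRequest derives the
Host header from the URI it was created with, so creating the request
against the domain name is the supported way to control it. The class name
below is mine, and the domain names are just placeholders; on the 1.x
framework, assigning Host through the Headers collection is rejected as a
restricted header.

```csharp
using System;
using System.Net;

class HostFromUri
{
    static void Main()
    {
        // The framework takes the Host header straight from the request
        // URI, so creating the request with the domain name sets Host.
        HttpWebRequest req =
            (HttpWebRequest)WebRequest.Create("http://microsoft.com/");
        Console.WriteLine(req.RequestUri.Host); // prints "microsoft.com"

        // Host is one of the restricted headers -- setting it through the
        // Headers collection is rejected by the framework.
        try
        {
            req.Headers["Host"] = "abc.com";
            Console.WriteLine("Host header was set");
        }
        catch (ArgumentException)
        {
            Console.WriteLine("Host is a restricted header");
        }
    }
}
```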

Sunny
Nov 16 '05 #14

Hmm, something happened with the attachment.
I copy/paste it here, watch for line wraps.

public void GetItem()
{
    if (this.link.IsImage)
        this.GetImage();
    else
        this.GetPage();
}

private void GetImage()
{
    System.Net.WebClient source = new System.Net.WebClient();
    Stream myData = null;
    FileStream myFile = null;
    FileInfo filename = new FileInfo(@"c:\myworkfolder" + @"\" + this.link.Subs);

    try
    {
        byte[] buffer = new byte[4096];

        myData = source.OpenRead(this.link.Orig);
        myFile = new FileStream(filename.FullName, FileMode.Create);

        int br;
        do
        {
            br = myData.Read(buffer, 0, buffer.Length);
            if (br > 0)
                myFile.Write(buffer, 0, br);
        }
        while (br > 0);
        myFile.Close();

        myData.Close();
        this.link.IsRead = true;
    }
    finally
    {
        if (myData != null)
            myData.Close();
        if (myFile != null)
            myFile.Close();
        // clean up a partially written file if the download failed
        if (!this.link.IsRead && filename.Exists)
        {
            try { filename.Delete(); }
            catch {}
        }
    }
}

private void GetPage()
{
    System.Net.WebClient source = new System.Net.WebClient();
    StreamReader mr = null;
    string sWebPage = String.Empty;

    try
    {
        mr = new StreamReader(source.OpenRead(this.link.Orig));
        sWebPage = mr.ReadToEnd();
    }
    finally
    {
        if (mr != null)
            mr.Close();
    }

    HTMLDocumentClass myDoc;

    try
    {
        object[] oPageText = {sWebPage};
        myDoc = new HTMLDocumentClass();
        IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
        oMyDoc.write(oPageText);
    }
    catch
    {
        // page is not well formed, skip it
        return;
    }

    // if we are here, we have read the page and we are ready to parse it

    IHTMLElementCollection cMyLinks = (IHTMLElementCollection)myDoc.links;

    foreach (IHTMLAnchorElement oLink in cMyLinks)
        oLink.href = this.SubstituteTags(true, this.link.Orig, oLink.href, false);
    // SubstituteTags changes the href attribute to the filename in which
    // we'll save the link, so the page is ready for offline viewing,
    // and it also adds the link to the queue of pages to be processed

    cMyLinks = (IHTMLElementCollection)myDoc.images;
    foreach (IHTMLImgElement oImage in cMyLinks)
        oImage.src = this.SubstituteTags(false, this.link.Orig, oImage.src, false);

    StreamWriter myFile = null;
    sWebPage = myDoc.documentElement.outerHTML;
    this.link.IsRead = true;
    try
    {
        myFile = new StreamWriter(oParent.WriteFolder + @"\" + this.link.Subs, false);
        myFile.Write(sWebPage);
    }
    finally
    {
        if (myFile != null)
            myFile.Close();
    }
}

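As an aside to the MSHTML parsing above: when the COM interop is not
available, a cruder alternative is to pull the href values out with a
regular expression. The class and method names below are mine, and a regex
will not handle every HTML corner case the DOM does; this is only a sketch
with no COM dependency.

```csharp
using System;
using System.Collections;
using System.Text.RegularExpressions;

class LinkExtractor
{
    // Collects the href attribute values from a page's anchor tags.
    public static ArrayList ExtractLinks(string html)
    {
        ArrayList links = new ArrayList();
        // Matches href="..." or href='...' attribute values.
        Regex re = new Regex("href\\s*=\\s*[\"']([^\"']+)[\"']",
                             RegexOptions.IgnoreCase);
        foreach (Match m in re.Matches(html))
            links.Add(m.Groups[1].Value);
        return links;
    }

    static void Main()
    {
        ArrayList found =
            ExtractLinks("<a href=\"http://abc.com/\">abc</a>");
        Console.WriteLine(found[0]); // prints "http://abc.com/"
    }
}
```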
Nov 16 '05 #15

Thomas Peter wrote:
Sunny,

I am saying that HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://microsoft.com/");

works great if you have a domain name... what about

(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

for microsoft.com and

(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

for abc.com; it is quite common for multiple sites to share one IP
address. Going through DNS it's usually no problem... but I need to
be able to directly access a site...
In the example above, in order for me to get the correct site I must
also supply the microsoft.com host header value or the abc.com host
header value.

It appears that one cannot modify certain headers in HttpWebRequest,
and Host is one of them.

Be a hero and share your spider code ;0) i am working on something
similar...


Let's not confuse things here. A spider is just a special-purpose web
*client*. It does not relay requests like a proxy. I'm not sure what you're
trying to build -- a true caching proxy or simply some sort of spider or web
leech? If it's a proxy, you will need to be able to set "Host" independently
of the destination address -- HttpWebRequest won't work here. As a web
client that should never be the case -- unless you've got some nasty user
who prefers to address multihomed servers by IP address and Host header ;-)
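For the case Joerg describes -- connecting to an IP while sending an
arbitrary Host header -- one workaround on the 1.x framework is to skip
HttpWebRequest and write the HTTP request by hand over a TcpClient. This is
only a hedged sketch: the class and method names are mine, the IP, host
name and path are placeholders, and real code would need proper HTTP error
handling.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

class RawHostRequest
{
    // Builds a minimal HTTP/1.1 request whose Host header we choose freely.
    public static string BuildRequest(string hostHeader, string path)
    {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + hostHeader + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Connects to an address (possibly a bare IP) and sends the request.
    public static string Fetch(string address, string hostHeader, string path)
    {
        TcpClient client = new TcpClient(address, 80);
        try
        {
            NetworkStream ns = client.GetStream();
            StreamWriter w = new StreamWriter(ns);
            w.Write(BuildRequest(hostHeader, path));
            w.Flush();

            StreamReader r = new StreamReader(ns);
            return r.ReadToEnd(); // status line, headers and body
        }
        finally
        {
            client.Close();
        }
    }

    static void Main()
    {
        // e.g.: string page = Fetch("207.71.134.23", "abc.com", "/");
        Console.WriteLine(BuildRequest("abc.com", "/"));
    }
}
```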

Cheers,

--
Joerg Jooss
jo*********@gmx.net

Nov 16 '05 #16

Thanks Sunny,

I'll look over your code and will touch base with you shortly...

thanks again... I am really excited about this now

~Thomas

"Sunny" <su******@icebergwireless.com> wrote in message
news:OL**************@TK2MSFTNGP10.phx.gbl...
Hmm, something happened with the attachement.
I copy/paste it here, watch for line wraps.

public void GetItem()
{
if (this.link.IsImage)
this.GetImage();
else
this.GetPage();
}

private void GetImage()
{
System.Net.WebClient source = new System.Net.WebClient();
Stream myData = null;
FileStream myFile = null;
FileInfo filename = new FileInfo("c:\myworkfolder" + @"\" +
this.link.Subs);

try
{
byte[] buffer = new byte[4096];

myData = source.OpenRead(this.link.Orig);
myFile = new FileStream(filename.FullName,
FileMode.Create);

int br;
do
{
br = myData.Read(buffer, 0, buffer.Length);
if (br > 0)
myFile.Write(buffer, 0, br);
}
while (br > 0);
myFile.Close();

myData.Close();
this.link.IsRead = true;
}
finally
{
if (myData != null)
myData.Close();
if (myFile != null)
myFile.Close();
if (filename.Exists)
{
try {filename.Delete()};
catch{}
}
}
}

private void GetPage()
{
System.Net.WebClient source = new System.Net.WebClient();
StreamReader mr = null;
string sWebPage = String.Empty;

try
{
mr = new StreamReader(source.OpenRead(this.link.Orig));
sWebPage = mr.ReadToEnd();
}
finally
{
if (mr != null)
mr.Close();
}

HTMLDocumentClass myDoc;

try
{
object[] oPageText = {sWebPage};
myDoc = new HTMLDocumentClass();
IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
oMyDoc.write(oPageText);
}
catch
{
//page is not well formated, skip it
return;
}

// if we are here, we have readed the page and we are ready to
parse it

IHTMLElementCollection cMyLinks = (IHTMLElementCollection)
myDoc.links;

foreach (IHTMLAnchorElement oLink in cMyLinks)
oLink.href = this.SubstituteTags(true, this.link.Orig,
oLink.href, false);
//SubstituteTags method changes the <href> tag to the
filename
//in which we'll save the link, so page is ready for
offline viewing
//and it also adds the link in the queue of the pages to be
//processed

cMyLinks = (IHTMLElementCollection)myDoc.images;
foreach (IHTMLImgElement oImage in cMyLinks)
oImage.src = this.SubstituteTags(false, this.link.Orig,
oImage.href, false);

StreamWriter myFile = null;
sWebPage = myDoc.documentElement.outerHTML;
this.link.IsRead = true;
try
{
myFile = new StreamWriter(oParent.WriteFolder + @"\" +
this.link.Subs, false);
myFile.Write(sWebPage);
}
finally
{
if (myFile != null)
myFile.Close();
}
}

Nov 16 '05 #17
