Screen Scraping a Password Protected Site

I'm trying to screen scrape a site that requires a password. If I
access the site's login page in my browser and view the source, I
see that it does not contain a viewstate.

When my program posts the login information, the response I get
is the same page as if I had logged in using my browser. In the
page it says "Welcome" followed by my name. The cookie collection
returned doesn't contain any cookies (response.cookies.count =
0).

When I access other pages, the login screen is returned instead
of the desired page.

Obviously, I need to somehow maintain the session in subsequent
calls, but how do I do that when there are no cookies and there
is no viewstate?

If I use Fiddler to see what happens when I access the site from
my browser, I can see that the first line for the site (where the
result is 200 and the host says "CONNECT") says "SessionID:
empty" under Session Inspector - Textview for the request. For
the response it says "SessionID: " then several bytes of data.
Subsequent 200/CONNECT lines have that same data for both the
request and the response. This must be what I need to maintain my
session. If anyone can help me figure out how to get this
information and use it, I'll be very grateful.

(I'm using VB in VS2003.)

Thanks.
--
Greg
----
http://www.spencerbooksellers.com
greg00 -at- spencersoft -dot- com
Dec 16 '06 #1
Check out the System.Net.CookieContainer class.

You can derive a class from System.Net.WebClient that stores and
retrieves cookies in a singleton CookieContainer; then, once you have
logged in to the website, you will stay logged in.

Something like this (untested):

'============================================
Imports System.Net

Public Class CookieWebClient
    Inherits WebClient

    ' Overridden to attach the shared cookie container to HTTP requests.
    Protected Overrides Function GetWebRequest(ByVal address As System.Uri) As System.Net.WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        If TypeOf request Is HttpWebRequest Then
            DirectCast(request, HttpWebRequest).CookieContainer = _cookies
        End If
        Return request
    End Function

    ' Overridden to save cookies to the container for HTTP requests.
    Protected Overrides Function GetWebResponse(ByVal request As System.Net.WebRequest) As System.Net.WebResponse
        Dim response As WebResponse = MyBase.GetWebResponse(request)
        If TypeOf response Is HttpWebResponse Then
            _cookies.Add(response.ResponseUri, DirectCast(response, HttpWebResponse).Cookies)
        End If
        Return response
    End Function

    ' Overridden to save cookies to the container for async HTTP requests.
    Protected Overrides Function GetWebResponse(ByVal request As System.Net.WebRequest, ByVal result As System.IAsyncResult) As System.Net.WebResponse
        Dim response As WebResponse = MyBase.GetWebResponse(request, result)
        If TypeOf response Is HttpWebResponse Then
            _cookies.Add(response.ResponseUri, DirectCast(response, HttpWebResponse).Cookies)
        End If
        Return response
    End Function

    ' Shared, so every CookieWebClient instance uses the same cookies.
    Private Shared _cookies As CookieContainer = New CookieContainer

End Class
'============================================

Then just use the CookieWebClient class to make your requests:

Dim c As New CookieWebClient
Dim s As String = c.DownloadString("http://www.somesite.com")

Works for me :-)

-Blake
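
Since the question mentions VS2003, here is a minimal sketch of the
same shared-CookieContainer idea using HttpWebRequest directly. As far
as I know, overriding WebClient.GetWebRequest and calling
DownloadString both need .NET 2.0 or later, whereas
HttpWebRequest.CookieContainer is also available in .NET 1.1. Note
that HttpWebResponse.Cookies stays empty unless the request had a
CookieContainer assigned, which may be why response.cookies.count was
0 in the question. The module name, the Fetch helper, the URLs and the
form field names are all assumptions made up for illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module SharedCookieJarSketch

    ' One container reused for every request, so session cookies persist.
    Private ReadOnly Jar As New CookieContainer()

    ' Hypothetical helper: sends a GET when postData is Nothing,
    ' otherwise a form POST, and returns the response body as text.
    Function Fetch(ByVal url As String, ByVal postData As String) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)

        ' Without this assignment no cookies are sent or collected.
        request.CookieContainer = Jar

        If Not postData Is Nothing Then
            Dim body As Byte() = Encoding.ASCII.GetBytes(postData)
            request.Method = "POST"
            request.ContentType = "application/x-www-form-urlencoded"
            request.ContentLength = body.Length
            Dim output As Stream = request.GetRequestStream()
            output.Write(body, 0, body.Length)
            output.Close()
        End If

        Dim response As WebResponse = request.GetResponse()
        Dim reader As New StreamReader(response.GetResponseStream())
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close()
            response.Close()
        End Try
    End Function

    Sub Main()
        ' Hypothetical login URL and form fields; copy the real ones
        ' from the site's login form.
        Fetch("http://www.somesite.com/login", "username=greg&password=secret")

        ' Hypothetical protected page, requested with the same cookies.
        Console.WriteLine(Fetch("http://www.somesite.com/members", Nothing))
    End Sub

End Module
```

This only helps if the site really tracks the session with a cookie.
If the session ID travels some other way (in the URL, for example),
the container will stay empty and a different approach is needed.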

Dec 17 '06 #2

I should have stated earlier that, to log in, you first need to call
CookieWebClient.UploadValues() to post to the site's login form.

-Blake
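
A minimal sketch of that login step followed by a normal request,
assuming .NET 2.0 or later. The CookieWebClient here is a trimmed copy
of the class from earlier in the thread, repeated so the sketch
compiles on its own. The URLs and the form field names ("username",
"password") are assumptions; use whatever the real login form posts.

```vbnet
Imports System
Imports System.Collections.Specialized
Imports System.Net
Imports System.Text

' Trimmed copy of the CookieWebClient idea. Once CookieContainer is
' set, HttpWebRequest stores the cookies from each response in the
' container by itself, so this sketch skips the GetWebResponse overrides.
Public Class CookieWebClient
    Inherits WebClient

    Private Shared _cookies As CookieContainer = New CookieContainer

    Protected Overrides Function GetWebRequest(ByVal address As Uri) As WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        If TypeOf request Is HttpWebRequest Then
            DirectCast(request, HttpWebRequest).CookieContainer = _cookies
        End If
        Return request
    End Function
End Class

Module LoginSketch
    Sub Main()
        Dim client As New CookieWebClient

        ' Hypothetical field names and values for the login form.
        Dim form As New NameValueCollection
        form.Add("username", "greg")
        form.Add("password", "secret")

        ' Hypothetical login URL. UploadValues sends the fields as a
        ' form-encoded POST and returns the response body as bytes.
        Dim reply As Byte() = client.UploadValues("http://www.somesite.com/login", "POST", form)
        Console.WriteLine(Encoding.UTF8.GetString(reply))

        ' Later requests go through the same shared container, so the
        ' session cookie from the login is sent along automatically.
        Dim page As String = client.DownloadString("http://www.somesite.com/members")
        Console.WriteLine(page)
    End Sub
End Module
```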

Dec 17 '06 #3
Triple post. Yay!

It's also worth noting that you don't need to use Fiddler to see the
HTTP traffic.

The System.Net classes have been compiled with TRACE turned on, so you
can add a .config file like this:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.diagnostics>
    <sources>
      <source name="System.Net" switchValue="Information"/>
    </sources>
  </system.diagnostics>
</configuration>

...and you will see the HTTP headers going back and forth in the output
window. If you set switchValue to "Verbose", you can also see the data.

-Blake

Dec 17 '06 #4
