HTML Page Scraping

Hi Guys,

I want to write an app in C# that signs in to a website and grabs some
information. Do you know how complicated that can get (with security
tokens, etc)?

Thanks,
James

Oct 8 '07 #1
James,

It can get pretty difficult, depending on the technologies that are
utilized in the page. For example, authentication can be in the form of
HTTP authentication, or forms-based, or maybe it is done through an AJAX
call. Needless to say, you are almost definitely going to have to
specialize your code depending on the site and the security it uses.
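
For the HTTP-authentication case, here is a minimal sketch of attaching credentials to a request with `CredentialCache`. The class name `AuthExamples`, the method name, and the URL, user, and domain values in the usage note are assumptions made up for illustration. It does not cover forms-based or AJAX logins, which need a different approach.

```csharp
using System;
using System.Net;

public static class AuthExamples
{
    // Builds a request that can answer an HTTP-level authentication challenge.
    // authType is the scheme name, for example "Basic", "Digest" or "NTLM".
    public static HttpWebRequest CreateAuthenticatedRequest(
        string url, string authType, string user, string password, string domain)
    {
        var uri = new Uri(url);

        // The cache maps a URI prefix plus scheme to a set of credentials.
        var cache = new CredentialCache();
        cache.Add(uri, authType, new NetworkCredential(user, password, domain));

        // Creating the request does not touch the network yet;
        // that only happens when GetResponse() is called.
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.Credentials = cache;
        return request;
    }
}
```

A call might look like `AuthExamples.CreateAuthenticatedRequest("http://intranet.example/reports", "NTLM", "james", "secret", "CORP")`, followed by `GetResponse()` on the result. On current .NET versions `HttpClient` is the usual choice and `WebRequest.Create` is marked obsolete, so treat this as a sketch in the style of the APIs available when this thread was written.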
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"james" <ja********@gmail.com> wrote in message
news:11**********************@50g2000hsm.googlegroups.com...
> Hi Guys,
>
> I want to write an app in C# that signs in to a website and grabs some
> information. Do you know how complicated that can get (with security
> tokens, etc)?
>
> Thanks,
> James
Oct 8 '07 #2
Carl,

Would an embedded browser be able to handle security tokens for me?

Also, do you know if I can access the DOM in C# or would I have to use
javascript or regex?

Thanks,
James

Oct 9 '07 #3
james wrote:
> Carl,
>
> Would an embedded browser be able to handle security tokens for me?

Maybe. It's a better place to start than from scratch, for sure.

> Also, do you know if I can access the DOM in C# or would I have to use
> javascript or regex?

You should be able to access it from C#. It's a COM/OLE Automation
component, so you should be able to access the whole API from C# via COM
interop. I believe that .NET 2.0+ provides a pre-built wrapper for you.

-cd
Oct 9 '07 #5
James,

You will for sure need MSHTML. It is a very extensive library, and therefore
the information is hard to find on MSDN. Be aware that you have to cast
almost everything, often twice in a row, to get the right types as you use
it.

Be aware that referencing it can slow down your IDE.

Cor

Oct 9 '07 #6
On Oct 9, 12:27 am, james <james.w...@gmail.com> wrote:
> Hi Guys,
>
> I want to write an app in C# that signs in to a website and grabs some
> information. Do you know how complicated that can get (with security
> tokens, etc)?
>
> Thanks,
> James
Like some people stated here, it's site-dependent,
but here's a list of starting points:

- IE Developer Toolbar: to inspect the commands/HTML/HTTP POST
  requests between the website and the browser (works on IE; for Firefox
  you have Firebug).
- HttpWebRequest and HttpWebResponse: for using the HTTP protocol.
- CookieContainer (used with the above): for cookie-based authentication.
- CredentialCache: for NTLM-based authentication (perhaps it's used for
  other things, but this is what I have used it for so far).
- HtmlAgilityPack: for parsing HTML. Perhaps there's a better
  component, so do your own research.
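
To make the HttpWebRequest and CookieContainer items concrete, here is a rough sketch of a forms-based login followed by a fetch of a protected page. The class and method names, the URLs, and the form field names are all hypothetical; the real field names have to be read from the site's login form (for example with one of the browser tools mentioned above). Many sites also put a hidden security token in the login form, which would have to be downloaded and added to the fields first; this sketch does not do that.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class LoginScraper
{
    // Encodes form fields as an application/x-www-form-urlencoded body.
    // Spaces become %20 here rather than '+'.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts.ToArray());
    }

    // Posts the login form, then requests pageUrl with the same cookies.
    public static string LoginAndFetch(string loginUrl, string pageUrl,
        IEnumerable<KeyValuePair<string, string>> fields)
    {
        // One container shared by both requests carries the session cookie.
        var cookies = new CookieContainer();

        var login = (HttpWebRequest)WebRequest.Create(loginUrl);
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes(BuildFormBody(fields));
        login.ContentLength = body.Length;
        using (Stream stream = login.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }

        // Cookies set by the login response are stored in the container.
        using (login.GetResponse()) { }

        var page = (HttpWebRequest)WebRequest.Create(pageUrl);
        page.CookieContainer = cookies;
        using (var response = (HttpWebResponse)page.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The returned HTML string can then be handed to whatever parser you settle on. Whether the login actually succeeded is site-specific, so checking the returned page for a known marker is usually necessary.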

Hope it's helpful :-)

Oct 9 '07 #7
Thanks for the advice.

Regards,
James
Oct 9 '07 #8
On Oct 8, 6:27 pm, james <james.w...@gmail.com> wrote:
> Hi Guys,
>
> I want to write an app in C# that signs in to a website and grabs some
> information. Do you know how complicated that can get (with security
> tokens, etc)?
>
> Thanks,
> James
You can also try SWExplorerAutomation (SWEA) from http://webius.net.
SWEA records and replays automation scripts and generates VB.NET or C#
code.

Oct 18 '07 #9