Parsing HTML in .Net (Document Object Model Parsing)

I've tried the WebBrowser control in the System.Windows.Forms namespace, but
it doesn't work when you instantiate it from a plain class. It needs a Form
to live in to work.

My application already has the ability to parse HTML. It sends web requests,
gets the HTML, extracts links, and detects <base href=""> tags, <iframe> and
<frame> tags, and so on. The only problem is when the HTML contains script
that uses loops, document.write() calls, and other things defined in the
Document Object Model (DOM) standard to write links. My code only traverses
the HTML as it is received from the web server.
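
To make the limitation concrete, here is a minimal sketch of that kind of static link extraction. It is written in Python for brevity rather than .NET, and the names `StaticLinkExtractor`, `extract_links`, `SAMPLE_HTML` and the example.com URLs are hypothetical illustrations, not the poster's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page: two static links plus links written by a script loop.
SAMPLE_HTML = r"""
<html><head><base href="http://example.com/dir/"></head><body>
<a href="page.html">static link</a>
<iframe src="/frame.html"></iframe>
<script>
for (var i = 0; i < 2; i++) {
    document.write('<a href="dyn' + i + '.html">x<\/a>');
}
</script>
</body></html>
"""


class StaticLinkExtractor(HTMLParser):
    """Collects links from the markup exactly as the server sent it."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url   # replaced if a <base href> tag is found
        self.raw_links = []        # href/src values, not yet made absolute

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base_url = urljoin(self.base_url, attrs["href"])
        elif tag == "a" and attrs.get("href"):
            self.raw_links.append(attrs["href"])
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.raw_links.append(attrs["src"])

    def links(self):
        # Resolve every collected link against the (possibly overridden) base.
        return [urljoin(self.base_url, link) for link in self.raw_links]


def extract_links(html, page_url):
    parser = StaticLinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    return parser.links()


if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "http://example.com/index.html"):
        print(link)
```

On the sample page this finds only the two links present in the markup. The `dyn0.html` and `dyn1.html` links never show up, because the script body is treated as plain text and is never executed. That is the gap a DOM-aware component fills.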

The WebBrowser object fixed all this and returned the finished absolute
links. It was easy to get it to traverse iframes/frames as well, but it has
its problems: cookies and the User-Agent, Accept, and authentication headers
can't be set. Popups, ActiveX, and script errors are also issues with it. My
app only extracts links (it's a crawler).

What I need is a class like the HtmlDocument class that I can give the HTML
received from the web server and then get the DOM-processed result out of,
but without the WebBrowser class.

Does anything like that exist?
Does anyone have any idea which keywords to use when googling for this kind
of stuff?
Apr 4 '06 #1