Hi fbrewster,
From your description, you're looking for a component or some other means
to parse an HTML string, correct?
What is the input format of the HTML? Are you programmatically capturing
the HTML content from the web and parsing it, or do you have existing HTML
files on the local disk?
Yes, in .NET you can still use the MSHTML component (the IE DOM parser) to
parse HTML. It is a COM component, so you need to call it via COM interop.
Here are some web articles demonstrating how to use it in .NET:
#Parsing html markup text using MSHTML
http://www.eggheadcafe.com/articles/parsinghtml.asp
#Parsing HTML without Using the Browser Control
http://www.codeguru.com/vb/vb_intern...cle.php/c4815/
The MSHTML component loads the HTML into its in-memory DOM model, and you
can access HTML elements in the DOM structure just as you would when using
JavaScript to access the client-side DOM collections.
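As a rough illustration of the COM interop approach, here is a minimal
sketch. It assumes a project reference to the "Microsoft HTML Object
Library" (the Microsoft.mshtml interop assembly) and a Windows machine with
IE installed; the sample markup and the class name are made up for this
example. Note that MSHTML may run scripts or fetch resources referenced by
the markup, and it may not behave well outside an STA thread, so treat this
as a starting point rather than a finished solution.

```csharp
// Sketch only: needs a reference to the Microsoft.mshtml interop assembly
// ("Microsoft HTML Object Library"); Windows only.
using System;
using mshtml;

class MshtmlSketch
{
    [STAThread] // MSHTML is an apartment-threaded COM component
    static void Main()
    {
        // Hypothetical sample markup, not taken from the original question
        string markup =
            "<html><body><div id='q1'>First</div>" +
            "<a href='http://example.com/'>link</a></body></html>";

        // Create the COM document object and feed it the markup
        HTMLDocument doc = new HTMLDocument();
        IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
        doc2.write(new object[] { markup });
        doc2.close();

        // Walk every element in the parsed DOM, much like document.all
        // in client-side script
        foreach (IHTMLElement el in doc2.all)
        {
            Console.WriteLine(el.tagName);
        }
    }
}
```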
Also, among .NET-specific components, I have used the "Html Agility Pack",
which is a good one for parsing HTML:
#.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML
http://blogs.msdn.com/smourier/archi...6/04/8265.aspx
#Html Agility Pack
http://www.codeplex.com/htmlagilitypack
It also provides a DOM-based model, and it supports XPath-based queries,
which is quite convenient and powerful.
========sample code using Html Agility Pack============
// Assumes these directives at the top of the file:
//   using System.IO;
//   using System.Net;
//   using html = HtmlAgilityPack;
private void Parse_Questions()
{
    // get html content from web
    HttpWebRequest req = WebRequest.Create(txtUrl.Text) as HttpWebRequest;
    HttpWebResponse rep = req.GetResponse() as HttpWebResponse;
    StreamReader sr = new StreamReader(rep.GetResponseStream());
    // construct html document object and load the html stream
    html.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
    hd.Load(sr);
    sr.Close();
    rep.Close();
    // use xpath to query the expected nodes in the html document
    html.HtmlNode doc = hd.DocumentNode;
    html.HtmlNodeCollection divs =
        doc.SelectNodes("//div[@class='questionbody']");
    StreamWriter sw = new StreamWriter(@"e:\temp\htmloutput.htm");
    int i = 0;
    sw.WriteLine("<html><body>");
    // SelectNodes returns null (not an empty collection) when nothing matches
    if (divs != null)
    {
        foreach (html.HtmlNode node in divs)
        {
            //....processing code
        }
    }
    sw.WriteLine("</body></html>");
    sw.Close();
}
=======================================================
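If you want to experiment with the XPath query without going to the network
first, a small self-contained variant might look like the sketch below. The
markup string and the class name AgilityPackSketch are invented for the
example, and what you do with each node is only one possibility; it is not
meant to stand in for the processing step left out of the sample above.

```csharp
// Sketch only: needs a reference to the Html Agility Pack assembly.
using System;
using HtmlAgilityPack;

class AgilityPackSketch
{
    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page
        string markup =
            "<div class='questionbody'>How do I parse <b>HTML</b>?</div>" +
            "<div class='other'>ignored</div>";

        // Load the markup from a string instead of a stream
        HtmlDocument hd = new HtmlDocument();
        hd.LoadHtml(markup);

        // Select only the div elements with the class we care about
        HtmlNodeCollection divs =
            hd.DocumentNode.SelectNodes("//div[@class='questionbody']");

        // SelectNodes returns null when there are no matches
        if (divs == null)
        {
            Console.WriteLine("no matching nodes");
            return;
        }

        foreach (HtmlNode node in divs)
        {
            // InnerText drops the tags; OuterHtml keeps the full element
            Console.WriteLine(node.InnerText.Trim());
            Console.WriteLine(node.OuterHtml);
        }
    }
}
```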
Hope this helps.
Sincerely,
Steven Cheng
Microsoft MSDN Online Support Lead
--------------------
From: "fbrewster" <fb*******@newsgroup.nospam>
Subject: Can I use Internet explorers DOM parser?
Date: Mon, 16 Jun 2008 12:53:22 -0500
I'm writing an HTML parser and would like to use Internet Explorer's DOM
parser.
Can I use Internet Explorer's DOM parser through a web service?
thanks for the help