Parsing HTML

I have a web page that I call and I need to get the body text out of the HTML.

<html>
<body>
Hi.
How are you?
</body>
</html>

What is the best way to do this in C# and .NET?

Thanks,

Yosh
Nov 17 '05 #1
In message <eK**************@TK2MSFTNGP12.phx.gbl>, Yosh
<Yo**@nospam.com> writes
> I have a web page that I call and I need to get the body text out of
> the HTML.
>
> <html>
> <body>
> Hi.
> How are you?
> </body>
> </html>
>
> What is the best way to do this in C# and .NET?


#1 Treat it as a string and parse it using regular expressions.

#2 Use the Microsoft HTML Object Library (mshtml, add reference from COM
tab) to load and parse it, and access it through the document object
model:

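using System;
using mshtml; // COM reference: Microsoft HTML Object Library

namespace HTMParse
{
    /// <summary>
    /// Loads an HTML string into an MSHTML document and prints the body.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread] // MSHTML is an apartment-threaded COM component
        static void Main(string[] args)
        {
            string s = "<html><body>Hi.How are you?</body></html>";

            // Create an empty document and feed it the markup.
            IHTMLDocument2 doc = new HTMLDocumentClass();
            doc.write(new object[]{s});
            doc.close();

            // innerHTML keeps any markup inside <body>;
            // doc.body.innerText returns the text alone.
            Console.Write(doc.body.innerHTML);
            Console.Read(); // keep the console window open
        }
    }
}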
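
For option #1, a minimal sketch of the regular-expression route might look like the following. The class name BodyTextSketch, the method name Extract, and both patterns are hypothetical and only for illustration; the sketch also assumes a .NET version that ships System.Net.WebUtility (4.0 or later), and it is not a general HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class BodyTextSketch
{
    // Captures everything between <body ...> and </body>.
    // IgnoreCase handles <BODY>; Singleline lets '.' match newlines.
    private static readonly Regex BodyPattern = new Regex(
        @"<body[^>]*>(?<content>.*?)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag, such as <b> or <br />.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // Returns the body text, or an empty string if no <body> element is found.
    public static string Extract(string html)
    {
        Match match = BodyPattern.Match(html);
        if (!match.Success)
        {
            return string.Empty;
        }

        string inner = match.Groups["content"].Value;
        string withoutTags = TagPattern.Replace(inner, string.Empty);

        // Turn entities such as &amp; back into characters, then trim.
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

Calling BodyTextSketch.Extract on the page from the question should give the two lines of text without the surrounding markup. Patterns like these are brittle on real-world pages (comments, scripts, unclosed tags), which is why the DOM approach in #2 is usually the more robust choice.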

--
Steve Walker
Nov 17 '05 #2
Or, there is a nice parser on Code Project
(http://www.codeproject.com/dotnet/apmilhtml.asp)

BTW: Nice one Steve! I like the IHTMLDocument2 idea. I had not thought of
this one.

Frisky
Nov 17 '05 #3


Yosh wrote:
> I have a web page that I call and I need to get the body text out of the
> HTML.
>
> <html>
> <body>
> Hi.
> How are you?
> </body>
> </html>
>
> What is the best way to do this in C# and .NET?


One way is SGMLReader

<http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>
--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Nov 17 '05 #4
