Parsing HTML document to get at plain text between the <P> tags

Rob
I'm writing an application without a user interface and I have a requirement
to find the plain ASCII text between <P> tags in an HTML document which
happens to have been obtained via POP3 and parsed out of a MIME message
body.

If I had a user interface, I could drop an IE web control onto a form, load
the HTML into that and then use the document parser. I've always hated that
route as it's so clunky and anyway, this app doesn't have a user interface
(it's processing emails in a service).

Any suggestions how to go about this?

I've come across SgmlReader from here:

http://www.gotdotnet.com/Community/U...4-C3BD760564BC

This could convert the HTML document into a clean XML document which I could
then parse with the dotnet XML objects. Does this sound a sensible route?

I thought implementing a POP3 reader (easy) and internet email MIME parser
(mindblowing) was difficult enough :-)

Thanks, Rob.
Jan 3 '07 #1
Could you just use a regular expression to pick out the pieces of text you
want?

Andrew
Jan 4 '07 #2
Rob wrote:
> I've come across SgmlReader from here:
>
> http://www.gotdotnet.com/Community/U...4-C3BD760564BC
>
> This could convert the HTML document into a clean XML document which I could
> then parse with the dotnet XML objects. Does this sound a sensible route?
Yes, use that route: it parses the HTML in a way that lets you chain it to
the existing XML/XPath APIs in .NET. You can use either SgmlReader or the
HTML Agility Pack:
<http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack>
--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Jan 4 '07 #3
Rob wrote:
> I'm writing an application without a user interface and I have a requirement
> to find the plain ASCII text between <P> tags in an HTML document which
> happens to have been obtained via POP3 and parsed out of a MIME message
> body.
Use Tidy to make it XHTML and then run XSLT to extract the paragraph text.

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Jan 4 '07 #4
Rob
> Could you just use a regular expression to pick out the pieces of text you
> want?
Maybe... I'm not an expert on regular expressions - any references for this
particular use of regular expressions? I could just about formulate an
expression to get the text between <P> and </P> but I also want to strip out
other tags like:

<p>Some <b>text</b> string</p>

I'm just interested in the "Some text string" bit.

Thanks, Rob.
Jan 6 '07 #5
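
For the example above, a forgiving HTML parser can do the job without any regular expression. The following is a minimal sketch in Python using the standard library's html.parser module, standing in for the .NET libraries discussed in this thread; the class name ParagraphText and the helper extract_paragraphs are hypothetical names chosen for the illustration.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element, ignoring nested tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraphs
        self._buf = None      # text pieces while inside a <p>, otherwise None

    def _flush(self):
        if self._buf is not None:
            self.paragraphs.append("".join(self._buf).strip())
            self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # HTML allows an unclosed <p>; a new <p> ends the previous one.
            self._flush()
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # pick up a final paragraph that was never closed

def extract_paragraphs(html_text):
    parser = ParagraphText()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs

print(extract_paragraphs("<P>Some <b>text</b> string</P><p>Tom &amp; Jerry"))
# ['Some text string', 'Tom & Jerry']
```

The parser lowercases tag names and decodes character references itself, so <P> and <p> are treated alike and &amp; comes out as &.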
Rob
> <http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack>

Thanks for the reference.

Cheers, Rob.
Jan 6 '07 #6
Rob
> Thanks for the reference.

Later... the HTMLAgilityPack looks, at first glance, just what I need.

Cheers, Rob.
Jan 6 '07 #7
Hehe, OK. Thinking about this a little more, that would mean a fairly
complicated regular expression, especially if there are tags nested inside
:)

Andrew
Jan 8 '07 #8
It's kind of crude, but you can remove anything that starts with < and ends
with >, along with everything in between. You might also have to take care of
any HTML characters that are escaped, e.g. &gt; for >, etc.
Jan 8 '07 #9
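
A minimal sketch of that crude approach, written in Python for illustration (the thread itself concerns .NET). The names TAG_RE and strip_tags are invented for this example.

```python
import html
import re

# Anything from '<' up to the next '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(markup):
    """Delete every <...> run, then decode escaped characters such as &gt;."""
    without_tags = TAG_RE.sub("", markup)
    # Decode only after stripping, so an escaped "&lt;b&gt;" survives as text
    # instead of being mistaken for a real tag.
    return html.unescape(without_tags)

print(strip_tags("<p>Some <b>text</b> string, 1 &lt; 2 &amp; 3 &gt; 2</p>"))
# Some text string, 1 < 2 & 3 > 2
```

It is crude in the ways one would expect: a literal > inside an attribute value ends the match early, and the contents of <script> or <style> elements are left in the output as if they were text.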
Rob
> hehe ok - maybe thinking about this a little more would mean a fairly
> complicated regular expression, especially if there are tags nested inside
Actually, maybe not. I can just start to envisage an expression that finds
any <??>text</??> tag and then keep calling it recursively. Might work.

Then again, there are various HTML -> XML libraries which might work even
better.

Cheers, Rob.
Jan 8 '07 #10
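
One way the repeated-expression idea might look, sketched in Python and using a loop in place of recursion. The pattern and the name unwrap_all are assumptions made for this illustration, not something posted in the thread.

```python
import re

# Matches an element that contains no further tags: <name ...>text</name>
# The \b stops "b" from matching the start of a longer name such as "br".
INNERMOST = re.compile(r"<(\w+)\b[^>]*>([^<]*)</\1\s*>", re.IGNORECASE)

def unwrap_all(fragment):
    """Replace innermost elements with their text until nothing changes."""
    while True:
        unwrapped = INNERMOST.sub(r"\2", fragment)
        if unwrapped == fragment:
            return fragment
        fragment = unwrapped

print(unwrap_all("<p>Some <b>bold and <i>italic</i></b> string</p>"))
# Some bold and italic string
```

Each pass peels off one level of nesting, so the loop runs once per level plus a final pass that finds nothing to do. Tags without a matching close, such as <br>, are left behind, which is one reason the library route may still be the better choice.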
