Parsing HTML to remove pictures and stylesheets

Seb
#1: Oct 21 '06
Hello,

I am trying to find an object or function that takes an HTML page
(code) as input, strips out all images, stylesheets and other
external references, and returns either "cleaned" HTML (without
external references) or a text-only version of the page.

Any ideas?

Thanks,
Seb


John Timney (MVP)
#2: Oct 21 '06

I would start with the HTML Agility Pack and see if it helps you.

http://www.codeplex.com/Wiki/View.as...tmlagilitypack

If that fails, then I expect a few well-targeted regular expressions would
do the job of finding the offending parts. Regex.Replace (in
System.Text.RegularExpressions) takes regular expressions; String.Replace
does not.
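
To illustrate the parser-based route, here is a minimal sketch. It uses Python's standard-library html.parser rather than the HTML Agility Pack (which is a .NET library), so the Cleaner class, the clean_html function and the choice of which tags count as "external references" are all assumptions made for this example, not part of the pack.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these are the elements treated as unwanted in this sketch.
DROP_WITH_CONTENT = {"script", "style"}  # removed together with their body
DROP_SINGLE = {"img"}                    # removed as a single tag


class Cleaner(HTMLParser):
    """Hypothetical helper: re-emits markup, skipping unwanted elements.

    Comments and the doctype are not re-emitted by this sketch.
    """

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities in text
        self.out = []
        self.skipping = False  # True while inside <script> or <style>

    def _dropped(self, tag, attrs):
        if tag in DROP_SINGLE:
            return True
        if tag == "link":
            rel = (dict(attrs).get("rel") or "").lower().split()
            return "stylesheet" in rel
        return False

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skipping = True
            return
        if self.skipping or self._dropped(tag, attrs):
            return
        parts = []
        for name, value in attrs:
            if name == "style":  # inline styling is removed as well
                continue
            parts.append(f" {name}" if value is None
                         else f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are re-emitted as plain start tags.
        if tag not in DROP_WITH_CONTENT:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skipping = False
        elif not self.skipping and tag not in DROP_SINGLE and tag != "link":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))  # re-encode &, <, >


def clean_html(html):
    """Return the markup without images, stylesheets, scripts or style attributes."""
    parser = Cleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling clean_html(page) on a page string returns the remaining markup. A real parser copes with odd attribute order and quoting, which is the main reason to try this route before regular expressions.
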
--
Regards

John Timney (MVP)
VISIT MY WEBSITE:
http://www.johntimney.com

Eric Theil
#3: Oct 21 '06

Hey Seb,

I don't know of a program or function that already does this kind of thing,
but you could implement it using regular expressions
(System.Text.RegularExpressions).

Check out:
http://msdn.microsoft.com/library/de...classtopic.asp
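
As a rough sketch of the regular-expression route, the patterns below remove script and style blocks, img tags, and stylesheet link tags. The example is written with Python's re module as a stand-in; in .NET the equivalent call would be Regex.Replace from System.Text.RegularExpressions. The patterns and the strip_with_regex name are assumptions for illustration, and they assume reasonably well-formed markup (a ">" inside an attribute value, for instance, would defeat them).

```python
import re

# Deliberately simple patterns; each match is replaced with an empty string.
PATTERNS = [
    # <script>...</script> and <style>...</style>, including their content
    re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL),
    # <img ...> tags
    re.compile(r"<img\b[^>]*>", re.IGNORECASE),
    # <link ... rel="stylesheet" ...> tags
    re.compile(r"<link\b[^>]*\brel\s*=\s*[\"']?stylesheet\b[^>]*>", re.IGNORECASE),
]


def strip_with_regex(html):
    """Apply each pattern in turn and return what is left."""
    for pattern in PATTERNS:
        html = pattern.sub("", html)
    return html
```

This is quick to write but brittle, which is why it is better suited to markup you control than to arbitrary pages.
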

Eric

Chris Fulstow
#4: Oct 21 '06

I'd second that: the Html Agility Pack is ideal for this sort of thing.
In fact, it includes sample code for converting HTML into plain text.

More info here:
http://chrisfulstow.blogspot.com/200...ml-in-net.html

--
Chris Fulstow
MCP, MCTS
http://chrisfulstow.blogspot.com/
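
For the text-only variant, the idea can be sketched as follows. This is not the Html Agility Pack sample mentioned above; it is a small stand-in built on Python's standard-library html.parser, and the TextExtractor class, the html_to_text function and the list of block-level tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: tags that should start a new line in the text output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
HIDDEN_TAGS = {"script", "style"}  # their content is never shown as text


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text and ignores all markup."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True turns &amp; into &
        self.parts = []
        self.hidden = False

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden = True
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self.hidden = False
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.hidden:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Images and stylesheet links carry no text, so they disappear without any special handling; only script and style bodies need to be suppressed explicitly.
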

