"keith" <ke********@gmail.com> wrote in
news:11*********************@f14g2000cwb.googlegroups.com:
Hi,
I'm using WebClient to retrieve the contents of a particular
page. I would like to get a string containing only the page's
text and no html markup.
How can I do this? Is there a class to take care of this?
Many thanks!
Keith
Keith,
Here's a method that does that (it needs "using System.Text.RegularExpressions;" at the top of the file):
/// <summary>
/// Given a string containing HTML/XML/SGML tags, this method strips
/// out all of the tags and returns the remaining text.
/// </summary>
/// <param name="html">
/// The HTML/XML/SGML to search.
/// </param>
/// <returns>
/// The <c>html</c> text stripped of all tags. If <c>html</c> is null or empty,
/// then <c>html</c> is returned.
/// </returns>
public static string GetTextStrippedOfHtml(string html)
{
    if ((html == null) || (html.Trim().Length == 0))
        return html;

    return Regex.Replace(html, @"
        <        # Tag's opening less-than sign.
        [^>]+?   # One or more characters that aren't a tag's closing greater-than sign (non-greedy).
        >        # Tag's closing greater-than sign.",
        string.Empty,
        RegexOptions.Singleline |
        RegexOptions.IgnoreCase |
        RegexOptions.IgnorePatternWhitespace);
}
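
For what it's worth, here is a rough sketch of how the result might be tidied up further. Stripping tags leaves character entities such as &amp; in the text, so this version also runs the output through WebUtility.HtmlDecode. The class name HtmlTextDemo and the helper ToPlainText are made up for illustration, and a literal string stands in for whatever WebClient returns. It assumes a framework version that has System.Net.WebUtility.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextDemo
{
    // Same tag-stripping pattern as the method above, written on one line.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
            return html;

        return Regex.Replace(html, @"<[^>]+?>", string.Empty, RegexOptions.Singleline);
    }

    // Hypothetical helper: strip the tags, then decode entities such as &amp;.
    public static string ToPlainText(string html)
    {
        string stripped = StripTags(html);
        return stripped == null ? null : WebUtility.HtmlDecode(stripped);
    }

    public static void Main()
    {
        // In the original scenario this markup would come from WebClient
        // (for example via DownloadString); a literal is used here instead.
        string html = "<p>Fish &amp; <b>chips</b></p>";
        Console.WriteLine(ToPlainText(html)); // prints: Fish & chips
    }
}
```

Keep in mind that a pattern like this only removes the tags themselves. The contents of script and style elements stay in the output, and a ">" inside a comment or attribute value can end a match early, so treat it as a quick approximation rather than a full HTML parser.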
--
Hope this helps.
Chris.
-------------
C.R. Timmons Consulting, Inc.
http://www.crtimmonsinc.com/