Stripping html tags from text

Hi,

I'm looking for help with a regular expression and C#.

I want to remove all tags from a piece of HTML except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

Help would be appreciated, along with an explanation of the regular
expression created.

Thanks.

Mar 6 '06 #1
HTML is complex. Rather than removing every tag except these, it is easier
to say that you want to *retrieve* *only* the following tags. That way, they
are the only tags the Regular Expression has to look for.

The following will do this:

(?i)<\s*(a|b|h1|h2|h3)\b[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?

Note: Grouping is used in this Regular Expression. It groups the tag names
into Group 1, and the InnerText into Group 2, in case you need either of
these.
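
To show how the two groups can be read from C#, here is a minimal sketch. It uses a pattern in the same spirit, restricted to the tag names from the original question (a, b, h1, h2, h3) and with a \b after the tag name so that longer names such as <body> are not picked up. The class name TagExtractor and the method name Extract are made up for this illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Group 1 = tag name, Group 2 = text up to the matching end tag or a line break.
    // Assumption: the tag list follows the original question (a, b, h1, h2, h3).
    private static readonly Regex AllowedTag = new Regex(
        @"(?i)<\s*(a|b|h1|h2|h3)\b[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?");

    // Returns one (tag name, inner text) pair per matching start tag.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in AllowedTag.Matches(html))
        {
            string tag = m.Groups[1].Value;
            // Value is an empty string when the optional text group did not take part.
            string text = m.Groups[2].Value;
            results.Add(new KeyValuePair<string, string>(tag, text));
        }
        return results;
    }
}
```

For the input <h1>Title</h1><p>skip</p><a href="aa">aaa</a> this should yield two pairs: h1 with "Title" and a with "aaa". Text that itself contains nested tags is not captured by Group 2, so treat this as a starting point rather than a complete solution.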

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer

Presuming that God is "only an idea" -
Ideas exist.
Therefore, God exists.

Mar 6 '06 #2


I use this in VB:

Private Function stripHTML(ByVal strHTML As String) As String
    ' Match "<", then any characters (including newlines), up to the next ">".
    Dim objRegExp As New System.Text.RegularExpressions.Regex("<(.|\n)+?>")
    ' Replace every tag with an empty string, leaving only the text.
    Return objRegExp.Replace(strHTML, "")
End Function

So the regex "<(.|\n)+?>" does the trick.

In C# it would be (I am a VB coder, so don't shoot me):

private string stripHTML(string strHTML)
{
    System.Text.RegularExpressions.Regex objRegExp =
        new System.Text.RegularExpressions.Regex("<(.|\n)+?>");
    return objRegExp.Replace(strHTML, "");
}

regards

Michel Posseth [MCP]

Mar 6 '06 #3
The problem with that Regular Expression (in this case) is that it simply
matches every tag in the page. It doesn't match InnerText, as he requested,
and it matches end tags as separate matches. It is excellent for stripping
all HTML tags from a page, for example, but not for his requirements.
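
For the original requirement (remove every tag except a, b, h1, h2 and h3), one possible approach is to keep the strip-and-replace idea but add a negative lookahead so that the allowed tags are skipped. This is only a sketch: the names HtmlFilter and StripExceptAllowed are invented here, and the pattern assumes reasonably tidy markup. A ">" inside an attribute value, an HTML comment, or a stray "<" in ordinary text will confuse it, so a real HTML parser is the safer choice for untrusted input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlFilter
{
    // "<" followed by anything up to ">", unless the tag name (with an optional
    // leading "/") is one of the allowed names. The \b stops "a" from also
    // allowing <abbr> and "b" from also allowing <body> or <br>.
    private static readonly Regex DisallowedTag = new Regex(
        @"<(?!\/?(?:a|b|h1|h2|h3)\b)[^>]*>",
        RegexOptions.IgnoreCase);

    // Removes every tag that is not in the allowed list; text is left untouched.
    public static string StripExceptAllowed(string html)
    {
        return DisallowedTag.Replace(html, "");
    }
}
```

The optional slash sits inside the lookahead on purpose. If it were written as "</?" in front of the lookahead, the engine could backtrack, treat the "/" of </a> as part of the tag body, and strip the end tag anyway. Calling StripExceptAllowed on <div><h1>Title</h1><p>See <a href="aa">aaa</a> and <i>bla</i><br></p></div> should leave <h1>Title</h1>See <a href="aa">aaa</a> and bla.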

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer

Presuming that God is "only an idea" -
Ideas exist.
Therefore, God exists.

Mar 6 '06 #4
Oops :-)

I just read "Stripping html tags from text" and missed the exclusion part:
except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

My code will convert
<html>
<head>
<body>
<table>
<tr><td>bla bla </td></tr>
</table>
</body>
</head>
</html>

into

bla bla

regards

Michel

Mar 7 '06 #5
