Stripping html tags from text

Hi,

I'm looking for help with a regular expression in C#.

I want to remove all tags from a piece of HTML except the following:

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> may carry attributes, e.g. <a href="aa">aaa</a>.

Help would be appreciated, along with an explanation of the regular
expression used.

Thanks.

Mar 6 '06 #1
HTML is complex, so it is easier to turn the problem around: rather than
removing everything else, *retrieve* *only* the tags you listed. That way,
they are the only tags the Regular Expression has to look for.

The following will do this:

(?i)<\s*(a|b|h1|h2|h3)\b[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?

Note: Grouping is used in this Regular Expression. It groups the tag names
into Group 1, and the InnerText into Group 2, in case you need either of
these.
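
To make the two groups concrete, here is a minimal C# sketch of reading them back out of each match. The class and method names (TagFinder, Find) are made up for illustration. The pattern lists "b" and puts a \b word boundary after the tag name so that tags such as <body> are not picked up; treat both as my assumptions rather than a tested production pattern.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagFinder
{
    // Group 1 captures the tag name. Group 2 captures the inner text when it
    // is plain text on a single line; otherwise Group 2 is empty.
    // Assumption: listing "b" and adding \b after the name are my choices.
    private static readonly Regex AllowedTag = new Regex(
        @"(?i)<\s*(a|b|h1|h2|h3)\b[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?");

    // Returns one "name=text" entry per matching opening tag.
    public static List<string> Find(string html)
    {
        var found = new List<string>();
        foreach (Match m in AllowedTag.Matches(html))
        {
            found.Add(m.Groups[1].Value + "=" + m.Groups[2].Value);
        }
        return found;
    }
}
```

Calling TagFinder.Find on <p>intro</p><h1>Title</h1><a href="aa">aaa</a> should give two entries, h1=Title and a=aaa. Nested markup inside a tag leaves Group 2 empty, so this only suits simple pages.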

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer

Presuming that God is "only an idea" -
Ideas exist.
Therefore, God exists.

"Spondishy" <sp*******@tiscali.co.uk> wrote in message
news:11*********************@z34g2000cwc.googlegro ups.com...
Hi,

I'm looking for help with a regular expression and c#.

I want to remove all tags from a piece of html except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

Help would be appreciated, along with an explanation of the reg
expression created.

Thanks.

Mar 6 '06 #2


I use this in VB:

Private Function stripHTML(ByVal strHTML As String) As String
    ' Lazily match "<", any characters (including newlines), then ">".
    Dim objRegExp As New System.Text.RegularExpressions.Regex("<(.|\n)+?>")
    Return objRegExp.Replace(strHTML, "")
End Function

So the regex "<(.|\n)+?>" does the trick.

In C# it would be (I am a VB coder, so don't shoot me):

private string stripHTML(string strHTML)
{
    // Lazily match "<", any characters (including newlines), then ">".
    System.Text.RegularExpressions.Regex objRegExp =
        new System.Text.RegularExpressions.Regex("<(.|\n)+?>");
    return objRegExp.Replace(strHTML, "");
}

regards

Michel Posseth [MCP]

"Spondishy" <sp*******@tiscali.co.uk> wrote in message
news:11*********************@z34g2000cwc.googlegro ups.com...
Hi,

I'm looking for help with a regular expression and c#.

I want to remove all tags from a piece of html except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

Help would be appreciated, along with an explanation of the reg
expression created.

Thanks.

Mar 6 '06 #3
The problem with that Regular Expression (in this case) is that it simply
matches every tag in the page. It doesn't match the InnerText, as he
requested, and it matches end tags as separate matches. It is excellent for
stripping all HTML tags from a page, but it doesn't meet his requirements.
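
One way to bend the strip-everything approach towards the original requirement is a negative lookahead that refuses to match the tags that should stay. This is only a sketch: the names TagStripper and StripExcept are hypothetical, and I am assuming reasonably tidy markup. Comments, doctype declarations, and attribute values that contain ">" are not handled.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches any opening or closing tag whose name is NOT a, b, h1, h2 or h3.
    // The negative lookahead (?!...) rejects the allowed names, and \b stops
    // "b" from also protecting names such as "body" or "br".
    private static readonly Regex DisallowedTag = new Regex(
        @"</?(?!(?:a|b|h1|h2|h3)\b)[a-z][^>]*>",
        RegexOptions.IgnoreCase);

    // Removes every tag except the allowed ones; text content is kept.
    public static string StripExcept(string html)
    {
        return DisallowedTag.Replace(html, string.Empty);
    }
}
```

With this, <p>Some <b>bold</b> and <i>italic</i> text</p> should come back as Some <b>bold</b> and italic text. For untrusted input, an HTML parser is the safer route, since a regex cannot follow every quirk of real-world markup.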

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer

Presuming that God is "only an idea" -
Ideas exist.
Therefore, God exists.

"m.posseth" <mi*****@nohausystems.nl> wrote in message
news:%2****************@TK2MSFTNGP11.phx.gbl...


i use this in VB

Private Function stripHTML(ByVal strHTML) As String

Dim objRegExp As New System.Text.RegularExpressions.Regex("<(.|\n)+?>")

Return objRegExp.Replace(strHTML, "")

End Function

so the regex System.Text.RegularExpressions.Regex("<(.|\n)+?>")

does the trick

so in C# it would be ( i am a VB coder so don`t shoot me )

private string stripHTML(object strHTML)

{

System.Text.RegularExpressions.Regex objRegExp = new
System.Text.RegularExpressions.Regex("<(.|\n)+?>") ;

return objRegExp.Replace(strHTML, "");

}

regards

Michel Posseth [MCP]

"Spondishy" <sp*******@tiscali.co.uk> wrote in message
news:11*********************@z34g2000cwc.googlegro ups.com...
Hi,

I'm looking for help with a regular expression and c#.

I want to remove all tags from a piece of html except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

Help would be appreciated, along with an explanation of the reg
expression created.

Thanks.


Mar 6 '06 #4
Oops :-)

I only read "Stripping html tags from text" and missed the exclusion part:
all tags except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

My code strips every tag, so it will convert

<html>
<head>
</head>
<body>
<table>
<tr><td>bla bla </td></tr>
</table>
</body>
</html>

into

bla bla

(plus the whitespace and line breaks that sat between the tags).
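
Stripping every tag leaves behind the line breaks and spaces that sat between them, so a second pass is usually wanted to get clean text. A minimal sketch, assuming plain text is the goal; the names PlainText and FromHtml are hypothetical, and the whitespace collapsing is my addition to the strip-everything regex:

```csharp
using System.Text.RegularExpressions;

public static class PlainText
{
    // The strip-everything pattern: "<", any characters, then the first ">".
    private static readonly Regex AnyTag = new Regex("<(.|\n)+?>");

    // Runs of whitespace left behind once the tags are gone.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string FromHtml(string html)
    {
        string withoutTags = AnyTag.Replace(html, "");
        // Collapse each whitespace run to one space and trim both ends.
        return Whitespace.Replace(withoutTags, " ").Trim();
    }
}
```

Run on a small table such as the one above, this should return just bla bla with no surrounding blank lines.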
regards

Michel


"Kevin Spencer" <ke***@DIESPAMMERSDIEtakempis.com> wrote in message
news:eI**************@TK2MSFTNGP09.phx.gbl... The problem with that Regular Expression (in this case) is that it simply
matches all tags in the page. It doesn't match InnerText, as he requested,
and it matches end tags as separate matches. It is excellent for, for
example, stripping HTML tags from a page, but not for his requirements.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer

Presuming that God is "only an idea" -
Ideas exist.
Therefore, God exists.

"m.posseth" <mi*****@nohausystems.nl> wrote in message
news:%2****************@TK2MSFTNGP11.phx.gbl...


i use this in VB

Private Function stripHTML(ByVal strHTML) As String

Dim objRegExp As New System.Text.RegularExpressions.Regex("<(.|\n)+?>")

Return objRegExp.Replace(strHTML, "")

End Function

so the regex System.Text.RegularExpressions.Regex("<(.|\n)+?>")

does the trick

so in C# it would be ( i am a VB coder so don`t shoot me )

private string stripHTML(object strHTML)

{

System.Text.RegularExpressions.Regex objRegExp = new
System.Text.RegularExpressions.Regex("<(.|\n)+?>") ;

return objRegExp.Replace(strHTML, "");

}

regards

Michel Posseth [MCP]

"Spondishy" <sp*******@tiscali.co.uk> wrote in message
news:11*********************@z34g2000cwc.googlegro ups.com...
Hi,

I'm looking for help with a regular expression and c#.

I want to remove all tags from a piece of html except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

Help would be appreciated, along with an explanation of the reg
expression created.

Thanks.



Mar 7 '06 #5
