How to strip HTML markup from string?

Hello,

I want to transform text with HTML markup to plain text. Is there a
simple way to do it?

I can surely write my own function, which would simply strip everything between
< and >. But if someone has already written something similar for .NET, I
would prefer a more clever solution, which would try to retain the original
layout, at least paragraphs, hyperlinks etc. - something like Outlook does
when changing HTML to plain text.
--
Michal A. Valasek, Altair Communications, http://www.altaircom.net
Please do not reply to this e-mail, for contact see http://www.rider.cz
Keeping Freedom safe from Democracy
Nov 17 '05 #1
Function stripHTML(strHTML)
    'Strips the HTML tags from strHTML

    Dim objRegExp, strOutput
    Set objRegExp = New RegExp

    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    'Lazily match "<", then any characters (including line feeds), up to the next ">"
    objRegExp.Pattern = "<(.|\n)+?>"

    'Replace all HTML tag matches with the empty string
    strOutput = objRegExp.Replace(strHTML, "")

    'Encode any < and > that are left over as &lt; and &gt;
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")

    stripHTML = strOutput 'Return the value of strOutput

    Set objRegExp = Nothing
End Function
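
For readers who want to try the same idea outside classic ASP, here is a minimal sketch of the function ported to Python. It is only an illustration, not the original author's code: the name strip_html is made up for this example, and the pattern is copied from the VBScript version above.

```python
import re

# Same pattern as the VBScript version: "<", then one or more characters
# (including newlines), matched lazily up to the next ">".
TAG_PATTERN = re.compile(r"<(.|\n)+?>", re.IGNORECASE)


def strip_html(text: str) -> str:
    """Remove anything that looks like a tag, then encode stray brackets.

    strip_html is a hypothetical name chosen for this sketch.
    """
    without_tags = TAG_PATTERN.sub("", text)
    # Any "<" or ">" still present was not part of a complete tag.
    return without_tags.replace("<", "&lt;").replace(">", "&gt;")


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p>"))
```

Note that the pattern cannot tell markup from ordinary text: in a string like "1 < 2 and 3 > 2" everything between the two brackets is treated as a tag and removed.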
"Michal A. Valasek" <ne**@altaircom.net> wrote in message
news:u5****************@TK2MSFTNGP10.phx.gbl...
Hello,

I want to transform text with HTML markup to plain text. Is there some
simple way how to do it?

I can surely write my own function, which would simply strip everything with < and >. But if someonew has already written something similar for .NET, I
would prefer more clever solution, which would try to retain original
layout, at least paragraphs, hyperlinks etc - something like Outlook does
when changing HTML to plain text.
--
Michal A. Valasek, Altair Communications, http://www.altaircom.net
Please do not reply to this e-mail, for contact see http://www.rider.cz
Keeping Freedom safe from Democracy

Nov 17 '05 #2
Hello Michal,

The page on 4guysfromrolla.com (suggested by Ravikanth) and the RegEx approach (suggested by another developer) could work for you.

However, there are some other issues. Even after you entirely strip out all the <html tags> you may be left with HTML-encoded
strings (character entities such as &nbsp; or &amp;) which you will also want to decode. These are easily handled with

System.Web.HttpUtility.HtmlDecode()
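
To make the two steps concrete, here is a rough sketch in Python rather than .NET, so that it stays self-contained. Python's html.unescape stands in for the .NET decode call named above. The function name strip_and_decode and the simplified tag pattern are assumptions made for this illustration.

```python
import html
import re

# Simplified tag pattern (an assumption for this sketch): "<", then
# anything that is not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_and_decode(markup: str) -> str:
    """Strip tags first, then decode character entities.

    strip_and_decode is a hypothetical helper name.
    """
    text = TAG_PATTERN.sub("", markup)
    # Decode only after stripping; see the note below on ordering.
    return html.unescape(text)


if __name__ == "__main__":
    print(strip_and_decode("<p>Fish &amp; chips</p>"))
```

The order matters. If the entities were decoded first, an encoded literal such as &lt;b&gt; would turn into a real-looking tag and be stripped along with the markup.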

And now, the long explanation of why this won't be good enough. There are still many unresolved issues (these were posted by
others before):

1) Rendered line feeds versus actual line feeds. In any HTML source the line feeds that are in there are generally NOT the
ones that are rendered. BR, P and others are the elements that determine the position on the rendered page.

2) What you're going to do with any elements outside the BODY tag, and what you are going to do with text that is left over
between elements such as OBJECT or SCRIPT?

3) Complex pages that have multiple DIV, LAYER or SPAN elements - some of which are only displayed conditionally
based on things such as browser version or client-side events.

4) TABLEs. Even though the HTML source for a table is entered in a left-to-right fashion, rows and columns can be spanned
so you may not find two words which are rendered together on the page to be next to each other in the source code.
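
As a rough illustration of points 1 and 2, a parser-based converter can decide per element what to do instead of deleting every tag blindly. The sketch below uses Python's standard html.parser module. The class name TextExtractor, the helper html_to_text and the two tag sets are choices made for this example only. It does not attempt to solve points 3 and 4.

```python
import re
from html.parser import HTMLParser

# Tag sets chosen for this sketch; a real converter would need more.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style", "head", "object"}


class TextExtractor(HTMLParser):
    """Collects rendered-looking text; a hypothetical helper class."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities are decoded for us
        self.parts = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # rendered line feed
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif self.skip_depth:
            return
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "a" and self.href:
            self.parts.append(f" <{self.href}>")  # keep the link target
            self.href = None

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Line feeds in the source are not rendered: collapse whitespace.
            self.parts.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        lines = "".join(self.parts).split("\n")
        cleaned = [re.sub(r" {2,}", " ", line).strip() for line in lines]
        return "\n".join(line for line in cleaned if line)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.get_text()
```

Here BR and P produce the line breaks, line feeds in the source are collapsed to spaces, and anything inside HEAD, SCRIPT, STYLE or OBJECT is dropped. Link targets are appended in angle brackets after the link text, which is one possible convention among several.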

Basically, you need to decide, in advance, what you are looking for and what your end result is going to be. If you're just
trying to parse a simple HTML page with a reasonably predictable format then a simple regex will do the trick. If you are
looking for specific elements with some important text then a regex and running a for...next loop through the matches would
be in order.
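
The "regex plus a loop through the matches" approach could look roughly like the sketch below, again in Python rather than VB so that it can be run as is. It pulls out hyperlinks as one example of "specific elements". The pattern, the name extract_links and the sample markup are all assumptions for this illustration, and the pattern only copes with quoted href values.

```python
import re

# Matches <a ... href="..."> ... </a>; group 1 is the URL, group 2 the
# inner markup. Only quoted href values are handled in this sketch.
LINK_PATTERN = re.compile(
    r"""<a\s[^>]*?href\s*=\s*["']([^"']*)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(markup: str):
    """Return (label, url) pairs; extract_links is a hypothetical name."""
    links = []
    # The counterpart of a For...Next loop over the match collection.
    for match in LINK_PATTERN.finditer(markup):
        url = match.group(1)
        # Drop any tags nested inside the link text.
        label = re.sub(r"<[^>]+>", "", match.group(2)).strip()
        links.append((label, url))
    return links


if __name__ == "__main__":
    page = '<p><a href="http://a.example/">First</a> and <a href="/b"><b>Second</b></a></p>'
    for label, url in extract_links(page):
        print(label, url)
```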

Best regards,
Yanhong Huang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.

--------------------
!From: "Michal A. Valasek" <ne**@altaircom.net>
!Subject: How to strip HTML markup from string?
!Date: Sat, 9 Aug 2003 04:48:20 +0200
!Newsgroups: microsoft.public.dotnet.framework.aspnet
Nov 17 '05 #4
