How to strip HTML markup from string?

Hello,

I want to convert text with HTML markup to plain text. Is there a
simple way to do it?

I can certainly write my own function that simply strips everything between
< and >. But if someone has already written something similar for .NET, I
would prefer a smarter solution that tries to retain the original
layout (at least paragraphs, hyperlinks, etc.), something like what Outlook
does when converting HTML to plain text.
--
Michal A. Valasek, Altair Communications, http://www.altaircom.net
Please do not reply to this e-mail, for contact see http://www.rider.cz
Keeping Freedom safe from Democracy
Nov 17 '05 #1
Function stripHTML(strHTML)
    'Strips the HTML tags from strHTML

    Dim objRegExp, strOutput
    Set objRegExp = New RegExp

    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    objRegExp.Pattern = "<(.|\n)+?>"

    'Replace all HTML tag matches with the empty string
    strOutput = objRegExp.Replace(strHTML, "")

    'Replace all < and > with &lt; and &gt;
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")

    stripHTML = strOutput 'Return the value of strOutput

    Set objRegExp = Nothing
End Function
Nov 17 '05 #2
Hello Michal,

The page on 4guysfromrolla.com (suggested by Ravikanth) and the RegEx approach (suggested by another developer) could work for you.

However, there are some other issues. Even after you entirely strip out all the <htmltags>, you may be left with HTML-encoded strings such as &nbsp;, which you will also want to decode. These are easily handled with

System.Web.HttpUtility.HtmlDecode()
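
To illustrate why the decode step belongs after the tag stripping, here is a minimal sketch. It is written in Python rather than .NET purely for brevity, which is my choice and not something from the posts above. It reuses the same non-greedy tag pattern as the stripHTML function earlier in the thread, with Python's html.unescape standing in for HtmlDecode. The name strip_and_decode is made up for this example.

```python
import html
import re

# Same idea as the stripHTML pattern "<(.|\n)+?>":
# re.DOTALL lets "." match newlines inside a tag as well.
TAG_PATTERN = re.compile(r"<.+?>", re.DOTALL)


def strip_and_decode(markup: str) -> str:
    """Remove tags first, then decode entities such as &amp; and &nbsp;."""
    without_tags = TAG_PATTERN.sub("", markup)
    # html.unescape turns &nbsp; into U+00A0 (non-breaking space),
    # so map it to an ordinary space afterwards.
    return html.unescape(without_tags).replace("\u00a0", " ")


print(strip_and_decode("<p>Fish &amp; chips,&nbsp;&lt;hot&gt;</p>"))
```

The order matters: decoding first would turn &lt;hot&gt; into <hot>, and the tag pattern would then delete it as if it were markup.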

And now, the long explanation of why this won't be good enough. There are still many unresolved issues (these have been posted by others before):

1) Rendered line feeds versus actual line feeds. In any HTML source, the line feeds in the markup are generally NOT the ones that are rendered. BR, P and other elements determine the position of text on the rendered page.

2) What are you going to do with any elements outside the BODY tag, and what are you going to do with text that is left over between elements such as OBJECT or SCRIPT?

3) Complex pages that have multiple DIV, LAYER or SPAN elements, some of which are only displayed conditionally based on things such as browser version or client-side events.

4) TABLEs. Even though the HTML source for a table is entered in a left-to-right fashion, rows and columns can be spanned, so two words that are rendered together on the page may not be next to each other in the source code.
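
Points 1 and 2 can be handled with a real parser instead of a regex. The following is only a rough sketch in Python (used here as a stand-in, not a .NET solution), built on the standard html.parser module. The BLOCK_TAGS and SKIPPED_TAGS lists and the names TextExtractor and html_to_text are assumptions made for this example. It does nothing about conditionally displayed content (point 3) or spanned table cells (point 4).

```python
import re
from html.parser import HTMLParser

# Assumed tag lists for this sketch; extend them for real pages.
BLOCK_TAGS = {"p", "br", "div", "tr", "li", "h1", "h2", "h3"}
SKIPPED_TAGS = {"head", "script", "style", "object"}


class TextExtractor(HTMLParser):
    """Collects visible text, turning block-level tags into line breaks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.href = None     # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a" and self.href:
            # Keep the hyperlink target next to its label
            self.parts.append(f" <{self.href}>")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Line feeds in the source are not rendered, so fold them into spaces
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = (re.sub(r" +", " ", line).strip()
             for line in "".join(parser.parts).split("\n"))
    # Drop the empty lines produced by adjacent block tags
    return "\n".join(line for line in lines if line)
```

With this sketch, a link such as <a href="http://www.rider.cz">contact</a> comes out as "contact <http://www.rider.cz>", which is roughly what the original question asks for when it mentions retaining hyperlinks.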

Basically, you need to decide, in advance, what you are looking for and what your end result is going to be. If you're just trying to parse a simple HTML page with a reasonably predictable format, then a simple regex will do the trick. If you are looking for specific elements that contain some important text, then a regex and a for...next loop through the matches would be in order.
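
As a hedged illustration of the "regex plus a loop through the matches" idea, here is a small Python sketch (Python is my stand-in here, not .NET) that pulls hyperlinks out of predictable markup. The pattern is an assumption that only covers double-quoted href values, and extract_links is a name invented for this example.

```python
import re

# Assumed pattern: anchors with a double-quoted href attribute only.
LINK_PATTERN = re.compile(
    r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(markup):
    """Return (label, url) pairs for every matching anchor element."""
    links = []
    for match in LINK_PATTERN.finditer(markup):
        url, label = match.group(1), match.group(2)
        # Strip any tags nested inside the link label
        label = re.sub(r"<[^>]+>", "", label).strip()
        links.append((label, url))
    return links
```

This only holds up on pages with a reasonably predictable format; single-quoted or unquoted attributes would need a broader pattern or a proper parser.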

Best regards,
Yanhong Huang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.

--------------------
!From: "Michal A. Valasek" <ne**@altaircom.net>
!Subject: How to strip HTML markup from string?
!Date: Sat, 9 Aug 2003 04:48:20 +0200
!Newsgroups: microsoft.public.dotnet.framework.aspnet
Nov 17 '05 #4
