Regex question: removing class and style attributes from HTML

I have a little database-driven content management system. People can
upload HTML docs. Some of them use MS Word as their HTML editor,
which results in loads of "class" and "style" attributes - like this:

<p class="MsoNormal">Some text</p>

Now I'd like to remove them (the attributes, not the people, that is).
I know regular expressions are the way, but somehow the solution eludes me. Just
pointing me to some more advanced tutorial (preferably in German) would
help a lot.

Any help appreciated, micha
Jul 17 '05 #1
Something like this may help you:

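// match class="..." or style='...'; \2 refers back to the opening quote,
// U makes .* ungreedy, i ignores case
$pattern='`(class|style)=(\\\'|\\").*\\2`Ui';
$str=preg_replace($pattern,'',$str);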

Here are some of the things I've been using:
http://www.regular-expressions.info/
http://www.comp.leeds.ac.uk/Perl/matching.html
http://www.anaesthetist.com/mnm/perl/regex.htm

And there are also tools like these:
http://www.weitz.de/regex-coach/
http://laurent.riesterer.free.fr/regexp/

Sorry, none are German, but you may be able to translate with Babblefish:
http://babblefish.com/babblefish/language_webt.htm

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
Jul 17 '05 #2
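
For readers who want to try the pattern from #2 in a browser console, here is a rough JavaScript counterpart. It is only a sketch, not Justin's code: the function name `stripClassAndStyle` is made up for illustration, and two details differ from the PHP version. A leading `\s+` is added so no stray space is left behind, and `[\s\S]*?` stands in for `.*` with the `U` flag so that values wrapped across lines still match.

```javascript
// Hypothetical helper; the thread itself only gives the PHP pattern.
function stripClassAndStyle(html) {
  // \s+               whitespace in front of the attribute
  // (?:class|style)=  attribute name
  // (['"])            opening quote, remembered as group 1
  // [\s\S]*?          shortest possible value, newlines included
  // \1                the same quote character that opened the value
  return html.replace(/\s+(?:class|style)=(['"])[\s\S]*?\1/gi, '');
}

console.log(stripClassAndStyle('<p class="MsoNormal" style=\'margin:0\'>Some text</p>'));
// prints: <p>Some text</p>
```

Like the PHP original, this only handles quoted values; an unquoted `class=MsoNormal` is left alone.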
On 12 Jul 2004 09:07:23 -0700, ch*********@web.de (chotiwallah) wrote:
I have a little database-driven content management system. People can
upload HTML docs. Some of them use MS Word as their HTML editor,
which results in loads of "class" and "style" attributes - like this:


This is written in JavaScript... It works great!!!!

// D holds the HTML string to clean; only run when Word markup is present
if ( (D.indexOf('class=Mso') >= 0) || (D.indexOf('class="Mso') >= 0) ) {

    // make one line
    D = D.replace(/\r\n/g, ' ').
        replace(/\n/g, ' ').
        replace(/\r/g, ' ').
        replace(/&nbsp;/g, ' ');

    // keep tags, strip attributes
    D = D.replace(/ class=[^\s>]*/gi, '').
        replace(/ style="[^>]*"/gi, '').
        replace(/ align=[^\s>]*/gi, '');

    // clean up tags
    D = D.replace(/<b [^>]*>/gi, '<b>').
        replace(/<br [^>]*>/gi, '<br />').
        replace(/<i [^>]*>/gi, '<i>').
        replace(/<li [^>]*>/gi, '<li>').
        replace(/<ul [^>]*>/gi, '<ul>');

    // replace outdated tags
    D = D.replace(/<b>/gi, '<strong>').
        replace(/<\/b>/gi, '</strong>');

    // mozilla doesn't like <em> tags
    D = D.replace(/<em>/gi, '<i>').
        replace(/<\/em>/gi, '</i>');

    // kill unwanted tags
    D = D.replace(/<\?xml:[^>]*>/g, '').     // Word xml
        replace(/<\/?st1:[^>]*>/g, '').      // Word SmartTags
        replace(/<\/?[a-z]:[^>]*>/g, '').    // all other funny Word non-HTML stuff
        replace(/<\/?font[^>]*>/gi, '').     // disable if you want to keep font formatting
        replace(/<\/?span[^>]*>/gi, ' ').
        replace(/<\/?div[^>]*>/gi, ' ').
        replace(/<\/?pre[^>]*>/gi, ' ').
        replace(/<\/?h[1-6][^>]*>/gi, ' ');

    // remove empty tags (left disabled)
    //D = D.replace(/<strong><\/strong>/gi,'').
    //replace(/<i><\/i>/gi,'').
    //replace(/<P[^>]*><\/P>/gi,'');

    // nuke double tags: repeat until the string stops shrinking
    var oldlen = D.length + 1;
    while (oldlen > D.length) {
        oldlen = D.length;
        // join us now and free the tags, we'll be free hackers,
        // we'll be free... ;-)
        D = D.replace(/<([a-z][a-z]*)> *<\/\1>/gi, ' ').
            replace(/<([a-z][a-z]*)> *<([a-z][^>]*)> *<\/\1>/gi, '<$2>');
    }
    D = D.replace(/<([a-z][a-z]*)><\1>/gi, '<$1>').
        replace(/<\/([a-z][a-z]*)><\/\1>/gi, '</$1>');

    // nuke double spaces (runs of two or more spaces become one)
    D = D.replace(/ {2,}/g, ' ');
}

Jul 17 '05 #3
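
The script in #3 works on a global variable `D`. To try one step in isolation, the sketch below wraps only the "keep tags, strip attributes" step in a function. The name `stripWordAttributes` is hypothetical, and the `|` inside the character classes is left out here because it is a literal pipe there rather than an alternation. The two extra calls illustrate limits of these particular patterns that are worth knowing before reusing them.

```javascript
// Hypothetical wrapper around the attribute-stripping step of the script in #3.
function stripWordAttributes(html) {
  return html
    .replace(/ class=[^\s>]*/gi, '')   // value ends at the first space or '>'
    .replace(/ style="[^>]*"/gi, '')   // greedy: runs to the last '"' before '>'
    .replace(/ align=[^\s>]*/gi, '');
}

console.log(stripWordAttributes('<p class="MsoNormal" align=center>Some text</p>'));
// prints: <p>Some text</p>

// A class value containing a space is only partly removed:
console.log(stripWordAttributes('<p class="intro lead">x</p>'));
// prints: <p lead">x</p>

// The greedy style pattern also swallows attributes that follow it:
console.log(stripWordAttributes('<p style="color:red" id="x">text</p>'));
// prints: <p>text</p>
```

For typical Word output (single-word `Mso...` classes) the first case is the common one, which is probably why the script works well in practice.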
As soon as you get to deal with nested tags etc, regular expressions
aren't that handy anymore. You could have a look at Tidy (a websearch
will help you). You buffer all the output, clean it up with tidy, and
output the cleaned up code.

--
Tim Van Wassenhove <http://home.mysth.be/~timvw>
Jul 17 '05 #4
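
One middle ground between a bare regex and a full clean-up tool such as Tidy is to apply the attribute pattern only inside opening tags, so that body text which happens to contain `class="..."` is left alone. The sketch below is not Tidy and not a parser; the name `stripAttrsInsideTags` is invented for illustration, and an attribute value that itself contains `>` would still confuse it, which is where a real parser earns its keep.

```javascript
// Hypothetical helper: limit the attribute regex to the inside of opening tags.
function stripAttrsInsideTags(html) {
  // Outer pass: find each opening tag ('<' followed by a letter, up to the next '>').
  return html.replace(/<[a-z][^>]*>/gi, function (tag) {
    // Inner pass: drop class/style with double-quoted, single-quoted or unquoted values.
    return tag.replace(/\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, '');
  });
}

console.log(stripAttrsInsideTags('<p class="MsoNormal">Use class="x" in your markup</p>'));
// prints: <p>Use class="x" in your markup</p>
```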
Josip wrote:
This is written in JavaScript... It works great!!!!


The problem is that you cannot (and should not) rely on JavaScript:
users may have it turned off because of its inherent security risks.

Jul 17 '05 #5
chotiwallah wrote:
I know regular expressions are the way, but somehow the solution eludes me. Just
pointing me to some more advanced tutorial (preferably in German) would
help a lot.


http://www.regenechsen.de/regex_de/

Any more questions? Visit the German-speaking newsgroup
de.comp.lang.php.misc.

Regards,
Matthias
Jul 17 '05 #6
Josip wrote:
This is written in JavaScript... It works great!!!!


Brilliant little script, works great, exactly what I needed :-).
I hadn't even thought of doing the whole cleaning client-side.

thanks a bundle to everyone, micha
Jul 17 '05 #7
chotiwallah wrote:
Brilliant little script, works great, exactly what I needed :-).
I hadn't even thought of doing the whole cleaning client-side.

thanks a bundle to everyone, micha


While client-side processing/validation is a good idea, you should not
rely on it, since not all users have JavaScript or ActiveX turned on. There are too many
idiots out there writing malicious code that is making a good thing bad.
--
Michael Austin.
Consultant - Available.
Donations welcomed. Http://www.firstdbasource.com/donations.html
:)
Jul 17 '05 #8
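
The point made in #5 and #8 can be pictured as a server-side safety net: treat the client-side clean-up as a convenience and check the submitted HTML again on arrival. The sketch below assumes the server-side code can be written in JavaScript, which the thread does not say (the CMS appears to be PHP). Both function names are hypothetical, and the detection regex is a slightly looser, case-insensitive version of the `indexOf` test at the top of the script in #3.

```javascript
// Hypothetical check: does the submitted HTML still carry Word class names?
function hasWordMarkup(html) {
  return /class=["']?Mso/i.test(html);
}

// cleanFn stands for whatever clean-up routine the server uses (an assumption).
function acceptSubmission(html, cleanFn) {
  // If the browser already cleaned the markup, pass it through untouched.
  return hasWordMarkup(html) ? cleanFn(html) : html;
}

// Usage with a deliberately tiny stand-in cleaner:
function dropClass(html) {
  return html.replace(/\s+class=(['"])[\s\S]*?\1/gi, '');
}

console.log(acceptSubmission('<p class="MsoNormal">Some text</p>', dropClass));
// prints: <p>Some text</p>
```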
