reg exp question: removing class and style from html

i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes, like this:

<p class="MsoNormal">Some text</p>

now i'd like to remove them (the attributes, not the people, that is).
i know regular expressions are the way, but somehow the solution eludes
me. just pointing me to a more advanced tutorial (preferably in german)
would help a lot.

any help appreciated, micha
Jul 17 '05 #1
Something like this may help you:

// same delimiter (/) at both ends; \2 matches whichever quote opened the value
$pattern = '/\s+(class|style)=(["\']).*?\2/is';
$str = preg_replace($pattern, '', $str);
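
For comparison, the same idea can be sketched in JavaScript. This is only an illustration under assumptions: the function name stripClassAndStyle is made up here, attribute values are assumed to be quoted, and the pattern does not check that a match actually sits inside a tag.

```javascript
// Hypothetical helper, not part of the original post.
// Removes quoted class="..." and style="..." attributes from an HTML string.
function stripClassAndStyle(html) {
  // \s+       whitespace in front of the attribute (removed with it)
  // (["'])    the opening quote, captured as group 2
  // [\s\S]*?  the shortest possible run of any characters ...
  // \2        ... up to the same kind of quote that opened the value
  return html.replace(/\s+(class|style)\s*=\s*(["'])[\s\S]*?\2/gi, '');
}

console.log(stripClassAndStyle('<p class="MsoNormal">Some text</p>'));
// prints: <p>Some text</p>
```

The backreference is what lets one pattern handle both single-quoted and double-quoted values without stopping at the wrong kind of quote.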

Here are some of the things I've been using:
http://www.regular-expressions.info/
http://www.comp.leeds.ac.uk/Perl/matching.html
http://www.anaesthetist.com/mnm/perl/regex.htm

And there are also tools like these:
http://www.weitz.de/regex-coach/
http://laurent.riesterer.free.fr/regexp/

Sorry, none are German, but you may be able to translate with Babblefish:
http://babblefish.com/babblefish/language_webt.htm

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
Jul 17 '05 #2
On 12 Jul 2004 09:07:23 -0700, ch*********@web.de (chotiwallah) wrote:
i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes, like this:


This is written in JavaScript... It works great!!!!

// D is assumed to be a string holding the HTML to clean
if (D.indexOf('class=Mso') >= 0 || D.indexOf('class="Mso') >= 0) {

  // make one line
  D = D.replace(/\r\n/g, ' ')
       .replace(/\n/g, ' ')
       .replace(/\r/g, ' ')
       .replace(/&nbsp;/g, ' ');

  // keep tags, strip attributes (values may be quoted or bare)
  D = D.replace(/ class=("[^"]*"|[^\s>]*)/gi, '')
       .replace(/ style="[^"]*"/gi, '')
       .replace(/ align=("[^"]*"|[^\s>]*)/gi, '');

  // clean up tags
  D = D.replace(/<b [^>]*>/gi, '<b>')
       .replace(/<br [^>]*>/gi, '<br />')
       .replace(/<i [^>]*>/gi, '<i>')
       .replace(/<li [^>]*>/gi, '<li>')
       .replace(/<ul [^>]*>/gi, '<ul>');

  // replace outdated tags
  D = D.replace(/<b>/gi, '<strong>')
       .replace(/<\/b>/gi, '</strong>');

  // mozilla doesn't like <em> tags
  D = D.replace(/<em>/gi, '<i>')
       .replace(/<\/em>/gi, '</i>');

  // kill unwanted tags
  D = D.replace(/<\?xml:[^>]*>/g, '')      // Word xml
       .replace(/<\/?st1:[^>]*>/g, '')     // Word SmartTags
       .replace(/<\/?[a-z]:[^>]*>/g, '')   // all other funny Word non-HTML stuff
       .replace(/<\/?font[^>]*>/gi, '')    // disable if you want to keep font formatting
       .replace(/<\/?span[^>]*>/gi, ' ')
       .replace(/<\/?div[^>]*>/gi, ' ')
       .replace(/<\/?pre[^>]*>/gi, ' ')
       .replace(/<\/?h[1-6][^>]*>/gi, ' ');

  // remove empty tags
  //D = D.replace(/<strong><\/strong>/gi, '')
  //     .replace(/<i><\/i>/gi, '')
  //     .replace(/<P[^>]*><\/P>/gi, '');

  // nuke double tags: repeat until the string stops shrinking
  var oldlen = D.length + 1;
  while (oldlen > D.length) {
    oldlen = D.length;
    // join us now and free the tags, we'll be free hackers,
    // we'll be free... ;-)
    D = D.replace(/<([a-z][a-z]*)> *<\/\1>/gi, ' ')
         .replace(/<([a-z][a-z]*)> *<([a-z][^>]*)> *<\/\1>/gi, '<$2>');
  }
  D = D.replace(/<([a-z][a-z]*)><\1>/gi, '<$1>')
       .replace(/<\/([a-z][a-z]*)><\/\1>/gi, '</$1>');

  // nuke double spaces (two or more spaces become one)
  D = D.replace(/ {2,}/g, ' ');
}
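
The "nuke double tags" step relies on repeating a replacement until nothing changes any more, because removing one empty pair can expose another one around it. A minimal sketch of just that loop, assuming a made-up helper name removeEmptyPairs and, unlike the script above, replacing with an empty string and comparing strings instead of lengths:

```javascript
// Hypothetical helper illustrating the "repeat until nothing changes" loop.
function removeEmptyPairs(html) {
  var previous;
  do {
    previous = html;
    // An attribute-free opening tag, optional spaces, then its own closing tag.
    html = html.replace(/<([a-z][a-z0-9]*)> *<\/\1>/gi, '');
  } while (html !== previous);
  return html;
}

console.log(removeEmptyPairs('<p><strong><i></i></strong></p><p>Text</p>'));
// prints: <p>Text</p>
```

A single pass would only remove the innermost pair; here it takes three passes before the outer paragraph is gone, plus one more to detect that nothing changed.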

Jul 17 '05 #3
As soon as you have to deal with nested tags etc., regular expressions
aren't that handy anymore. You could have a look at Tidy (a web search
will help you). You buffer all the output, clean it up with Tidy, and
output the cleaned-up code.
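
One weakness of a single flat regex is that it also rewrites ordinary text that merely looks like an attribute. The sketch below is not Tidy and not a parser; it is a middle ground that first matches each opening tag and then strips attributes only inside it. The helper name stripAttributesInTags is made up, the attribute names are assumed to be plain words, and attribute values are assumed not to contain ">".

```javascript
// Hypothetical helper: only touches text between "<" and ">".
function stripAttributesInTags(html, names) {
  // Builds e.g. /\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi
  var attr = new RegExp(
    '\\s+(?:' + names.join('|') + ')\\s*=\\s*(?:"[^"]*"|\'[^\']*\'|[^\\s>]+)',
    'gi'
  );
  // First find whole opening tags, then clean each tag on its own.
  return html.replace(/<[a-z][^>]*>/gi, function (tag) {
    return tag.replace(attr, '');
  });
}

var input = '<p class="MsoNormal" style=\'margin:0\'>Write class="x" in text</p>';
console.log(stripAttributesInTags(input, ['class', 'style']));
// prints: <p>Write class="x" in text</p>
```

The text content keeps its literal class="x" because the inner replacement never sees anything outside a tag. For really broken markup a dedicated cleaner such as Tidy remains the safer choice.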

--
Tim Van Wassenhove <http://home.mysth.be/~timvw>
Jul 17 '05 #4
The problem is that you cannot, and should not, ever "rely" on JavaScript,
as users may turn it off because of its inherent security risks.

Jul 17 '05 #5
chotiwallah wrote:
i know reg exp is the way, but somehow the solution avoids me. just
pointing me to some more advanced tutorial (pref. in german) would
help a lot.


http://www.regenechsen.de/regex_de/

Any more questions? Visit the German-speaking newsgroup
de.comp.lang.php.misc.

Regards,
Matthias
Jul 17 '05 #6
brilliant little script, works great, exactly what i needed :-).
i hadn't even thought of doing the whole cleaning clientside.

thanks a bundle to everyone, micha
Jul 17 '05 #7
While client-side processing/validation is a good idea, you should no longer
rely on it alone, as not all users have JavaScript or ActiveX turned on. There
are too many idiots out there writing malicious code, which is making a good
thing bad.
--
Michael Austin.
Consultant - Available.
Donations welcomed. Http://www.firstdbasource.com/donations.html
:)
Jul 17 '05 #8
