
reg exp question: removing class and style from html

P: n/a
i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes - like this:

<p class="MsoNormal">Some text</p>

now i'd like to remove them (the attributes, not the people, that is).
i know reg exp is the way, but somehow the solution eludes me. just
pointing me to some more advanced tutorial (pref. in german) would
help a lot.

any help appreciated, micha
Jul 17 '05 #1


P: n/a
chotiwallah wrote:
i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes - like this:

<p class="MsoNormal">Some text</p>

now i'd like to remove them (the attributes, not the people, that is).
i know reg exp is the way, but somehow the solution eludes me. just
pointing me to some more advanced tutorial (pref. in german) would
help a lot.

any help appreciated, micha


Something like this may help you:

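// Opening and closing delimiters must match (backticks here).
// U = ungreedy, i = case-insensitive, s = let .* span line breaks.
$pattern = '`\s+(class|style)=(\'|").*\\2`Uis';
$str = preg_replace($pattern, '', $str);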
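
For comparison, here is a minimal JavaScript sketch of the same idea. A backreference ties the closing quote to the opening one, and a lazy quantifier plays the role of PCRE's U modifier. The function name stripClassAndStyle is made up for this example, and unquoted values such as class=MsoNormal are deliberately not handled.

```javascript
// Hypothetical helper (not from the thread): remove quoted class/style
// attributes from a string of HTML.
function stripClassAndStyle(html) {
  // \s+            whitespace in front of the attribute
  // (class|style)  attribute name, case-insensitive because of the "i" flag
  // (["'])         opening quote, captured as group 2
  // [\s\S]*?       the value, matched lazily and allowed to span line breaks
  // \2             the same quote character that opened the value
  return html.replace(/\s+(class|style)\s*=\s*(["'])[\s\S]*?\2/gi, '');
}

console.log(stripClassAndStyle('<p class="MsoNormal" style=\'margin:0\'>Some text</p>'));
// prints: <p>Some text</p>
```

Because the pattern is applied to the whole string, it would also remove something like class="x" appearing in visible text, so it is only a starting point.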

Here are some of the things I've been using:
http://www.regular-expressions.info/
http://www.comp.leeds.ac.uk/Perl/matching.html
http://www.anaesthetist.com/mnm/perl/regex.htm

And there are also tools like these:
http://www.weitz.de/regex-coach/
http://laurent.riesterer.free.fr/regexp/

Sorry, none are German, but you may be able to translate with Babblefish:
http://babblefish.com/babblefish/language_webt.htm

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
Jul 17 '05 #2

P: n/a
On 12 Jul 2004 09:07:23 -0700, ch*********@web.de (chotiwallah) wrote:
i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes - like this:


This is written in JavaScript... It works great!!!!

// D holds the HTML string to clean
if ( (D.indexOf('class=Mso') >= 0) || (D.indexOf('class="Mso') >= 0) ) {

  // make one line
  D = D.replace(/\r\n/g, ' ').
    replace(/\n/g, ' ').
    replace(/\r/g, ' ').
    replace(/&nbsp;/g, ' ');

  // keep tags, strip attributes
  // (inside [...] a "|" is a literal pipe, not "or", so it is left out)
  D = D.replace(/ class=[^\s>]*/gi, '').
    replace(/ style="[^"]*"/gi, '').     // stop at the closing quote
    replace(/ align=[^\s>]*/gi, '');

  // clean up tags
  D = D.replace(/<b [^>]*>/gi, '<b>').
    replace(/<br [^>]*>/gi, '<br />').
    replace(/<i [^>]*>/gi, '<i>').
    replace(/<li [^>]*>/gi, '<li>').
    replace(/<ul [^>]*>/gi, '<ul>');

  // replace outdated tags
  D = D.replace(/<b>/gi, '<strong>').
    replace(/<\/b>/gi, '</strong>');

  // mozilla doesn't like <em> tags
  D = D.replace(/<em>/gi, '<i>').
    replace(/<\/em>/gi, '</i>');

  // kill unwanted tags
  D = D.replace(/<\?xml:[^>]*>/g, '').   // Word xml
    replace(/<\/?st1:[^>]*>/g, '').      // Word SmartTags
    replace(/<\/?[a-z]:[^>]*>/g, '').    // All other funny Word non-HTML stuff
    replace(/<\/?font[^>]*>/gi, '').     // Disable if you want to keep font formatting
    replace(/<\/?span[^>]*>/gi, ' ').
    replace(/<\/?div[^>]*>/gi, ' ').
    replace(/<\/?pre[^>]*>/gi, ' ').
    replace(/<\/?h[1-6][^>]*>/gi, ' ');

  // remove empty tags
  //D = D.replace(/<strong><\/strong>/gi,'').
  //replace(/<i><\/i>/gi,'').
  //replace(/<P[^>]*><\/P>/gi,'');

  // nuke double tags: repeat until the string stops shrinking
  var oldlen = D.length + 1;
  while (oldlen > D.length) {
    oldlen = D.length;
    // join us now and free the tags, we'll be free hackers,
    // we'll be free... ;-)
    D = D.replace(/<([a-z][a-z]*)> *<\/\1>/gi, ' ').
      replace(/<([a-z][a-z]*)> *<([a-z][^>]*)> *<\/\1>/gi, '<$2>');
  }
  D = D.replace(/<([a-z][a-z]*)><\1>/gi, '<$1>').
    replace(/<\/([a-z][a-z]*)><\/\1>/gi, '</$1>');

  // nuke double spaces (two or more spaces become one)
  D = D.replace(/ {2,}/g, ' ');
}
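
The "nuke double tags" step above repeats a replacement until the string stops shrinking, because removing one empty tag can leave its parent empty. A minimal standalone sketch of that repeat-until-stable idea follows. The name removeEmptyTags is hypothetical, and unlike the script above it compares the whole string rather than its length and replaces with nothing instead of a space.

```javascript
// Hypothetical helper: strip empty elements, including ones that only
// become empty after their children have been removed.
function removeEmptyTags(html) {
  var previous;
  do {
    previous = html;
    // <tag>, optional spaces, then </tag> with the same name (backreference \1)
    html = html.replace(/<([a-z][a-z0-9]*)> *<\/\1>/gi, '');
  } while (html !== previous); // stop once a pass changes nothing
  return html;
}

console.log(removeEmptyTags('<p><strong><i> </i></strong></p><p>kept</p>'));
// prints: <p>kept</p>
```

A single pass would only remove the innermost pair here; it takes three passes to unwind the nesting and a fourth to confirm that nothing is left to do.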

Jul 17 '05 #3

P: n/a
In article <78*************************@posting.google.com>, chotiwallah wrote:
i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes - like this:

<p class="MsoNormal">Some text</p>

now i'd like to remove them (the attributes, not the people, that is).
i know reg exp is the way, but somehow the solution eludes me. just
pointing me to some more advanced tutorial (pref. in german) would
help a lot.


As soon as you have to deal with nested tags etc., regular expressions
aren't that handy anymore. You could have a look at Tidy (a web search
will find it). You buffer all the output, clean it up with Tidy, and
output the cleaned-up code.
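
To illustrate where a single global regex starts to hurt, here is a hedged JavaScript sketch that first finds each start tag and only then removes attributes inside it, so that text content and quoted values containing ">" are left alone. This is not Tidy and not a real HTML parser. The function name stripAttributesInTags and its parameters are made up for the example, and malformed markup can still defeat it.

```javascript
// Hypothetical helper: remove the named attributes, but only inside start tags.
function stripAttributesInTags(html, bannedNames) {
  // A start tag: "<name", then any mix of quoted strings and other
  // characters that are neither quotes nor ">", then the closing ">".
  var startTag = /<[a-z][a-z0-9:]*(?:"[^"]*"|'[^']*'|[^'">])*>/gi;
  // One attribute: whitespace, a name, and optionally "=" plus a
  // double-quoted, single-quoted, or unquoted value.
  var attribute = /\s+([^\s"'=<>\/]+)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?/g;

  return html.replace(startTag, function (tag) {
    // Every attribute is consumed as a whole, so text inside another
    // attribute's quoted value is never mistaken for an attribute.
    return tag.replace(attribute, function (whole, name) {
      return bannedNames.indexOf(name.toLowerCase()) >= 0 ? '' : whole;
    });
  });
}

console.log(stripAttributesInTags(
  '<p class="MsoNormal" title="a > b">Set class="x" in CSS</p>',
  ['class', 'style']
));
// prints: <p title="a > b">Set class="x" in CSS</p>
```

The two-stage approach costs a little more code than one replace call, but the visible text keeps its class="x" and the title value with ">" survives intact. For uploads that are badly nested or broken, a dedicated cleaner such as Tidy remains the safer route.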

--
Tim Van Wassenhove <http://home.mysth.be/~timvw>
Jul 17 '05 #4

P: n/a
Josip wrote:

This is written in JavaScript... It works great!!!!


The problem is that you cannot/should not ever "rely" on JavaScript, as
it may be turned off due to its inherent security risks.
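
One way to act on this is to treat the browser-side cleanup as a convenience and run the cleanup again wherever the upload is stored. The sketch below is only an illustration of that arrangement: it is written in JavaScript to match the script quoted above, the names cleanWordHtml and handleUpload are hypothetical, the array stands in for the database, and the cleanup is much simplified. On a PHP-backed site the same role would presumably fall to preg_replace or Tidy.

```javascript
// Hypothetical, simplified cleanup written as a pure function so that it
// can run on the receiving side as well as in the browser.
function cleanWordHtml(html) {
  return html
    // class/style attributes with quoted or unquoted values
    .replace(/\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, '')
    // namespaced Word tags such as <o:p> and </o:p>
    .replace(/<\/?[a-z][a-z0-9]*:[^>]*>/gi, '')
    // runs of two or more spaces
    .replace(/ {2,}/g, ' ');
}

// Hypothetical save handler: it never assumes the browser already cleaned
// the input, because JavaScript may have been switched off.
function handleUpload(submittedHtml, store) {
  var cleaned = cleanWordHtml(submittedHtml);
  store.push(cleaned);
  return cleaned;
}

var saved = []; // stand-in for the real database
console.log(handleUpload('<p class=MsoNormal>Hi</p>', saved));
// prints: <p>Hi</p>
```

Cleaning input that has already been cleaned returns it unchanged, so running the function on both sides does no harm.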

Jul 17 '05 #5

P: n/a
chotiwallah schrieb:
i know reg exp is the way, but somehow the solution eludes me. just
pointing me to some more advanced tutorial (pref. in german) would
help a lot.


http://www.regenechsen.de/regex_de/

Any more questions? Visit the German-speaking newsgroup
de.comp.lang.php.misc.

Regards,
Matthias
Jul 17 '05 #6

P: n/a
Josip <jo***@sdfs.sd> wrote in message news:<dv********************************@4ax.com>...


This is written in JavaScript... It works great!!!!


brilliant little script, works great, exactly what i needed :-).
i hadn't even thought of doing the whole cleaning clientside.

thanks a bundle to everyone, micha
Jul 17 '05 #7

P: n/a
chotiwallah wrote:
Josip <jo***@sdfs.sd> wrote in message news:<dv********************************@4ax.com>...


This is written in JavaScript... It works great!!!!


brilliant little script, works great, exactly what i needed :-).
i hadn't even thought of doing the whole cleaning clientside.

thanks a bundle to everyone, micha


While client-side processing/validation is a good idea, you should not
rely on it alone, as not all users have JavaScript or ActiveX turned on. There
are too many idiots out there writing malicious code, which is making a good thing bad.
--
Michael Austin.
Consultant - Available.
Donations welcomed. Http://www.firstdbasource.com/donations.html
:)
Jul 17 '05 #8
