reg exp question: removing class and style from html

chotiwallah

i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes - like this:

Some text

now i'd like to remove them (the attributes, not the people, that is).
i know reg exp is the way, but somehow the solution eludes me. just
pointing me to some more advanced tutorial (pref. in german) would
help a lot.

any help appreciated, micha

Jul 17 '05 #1

Justin Koivisto

Something like this may help you:

$pattern = '`(class|style)=(\\\'|\\").*\\2`Ui';
$str = preg_replace($pattern, '', $str);
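
For illustration, here is a minimal sketch of the same idea in JavaScript. It is not Justin's code: the function name stripClassAndStyle is made up, and it assumes reasonably well-formed markup. Unlike the pattern above, it only touches attributes inside tags and also handles unquoted values such as class=MsoNormal.

```javascript
// Hypothetical helper (not from the thread): removes class and style
// attributes from every opening tag, leaving text content untouched.
function stripClassAndStyle(html) {
  // One whole opening tag; quoted attribute values may contain '>'.
  var tagPattern = /<[a-z][^\s>\/]*(?:"[^"]*"|'[^']*'|[^'">])*>/gi;
  // One class/style attribute with a double-quoted, single-quoted,
  // or unquoted value, including the whitespace in front of it.
  var attrPattern = /\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+)/gi;
  return html.replace(tagPattern, function (tag) {
    return tag.replace(attrPattern, '');
  });
}
```

For example, stripClassAndStyle('<p class="MsoNormal" style="margin:0">Some text</p>') returns '<p>Some text</p>'. A value that itself contains something like ' class=x' inside another attribute would still be mangled, so this is a convenience, not a parser.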

Here are some of the things I've been using:
http://www.regular-expressions.info/
http://www.comp.leeds.ac.uk/Perl/matching.html
http://www.anaesthetist.com/mnm/perl/regex.htm

And there are also tools like these:
http://www.weitz.de/regex-coach/
http://laurent.riesterer.free.fr/regexp/

Sorry, none are German, but you may be able to translate with Babblefish:
http://babblefish.com/babblefish/language_webt.htm

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.

Jul 17 '05 #2

Josip

On 12 Jul 2004 09:07:23 -0700, ch*********@web.de (chotiwallah) wrote:

i have a little database-driven content management system. people can
upload html docs. some of them use ms word as their html editor,
which results in loads of "class" and "style" attributes - like this:

This is written in JavaScript... It works great!!!!

// D holds the HTML string to clean; only Word output is touched.
// NOTE: several tag names inside the patterns and replacement strings were
// eaten when this post was rendered as HTML. Lines marked "assumed" are
// reconstructed from the surrounding comments.
if ( (D.indexOf('class=Mso') >= 0) || (D.indexOf('class="Mso') >= 0) ) {

    // make one line
    D = D.replace(/\r\n/g, ' ').
        replace(/\n/g, ' ').
        replace(/\r/g, ' ').
        replace(/&nbsp;/g, ' ');

    // keep tags, strip attributes
    D = D.replace(/ class=[^\s>]*/gi, '').
        replace(/ style="[^>]*"/gi, '').
        replace(/ align=[^\s>]*/gi, '');

    // clean up tags
    D = D.replace(/<p [^>]*>/gi, '<p>').   // tag name assumed
        replace(/<b [^>]*>/gi, '<b>').     // tag name assumed
        replace(/<i [^>]*>/gi, '<i>').     // tag name assumed
        replace(/<li [^>]*>/gi, '<li>').
        replace(/<ul [^>]*>/gi, '<ul>');

    // replace outdated tags (<b> to <strong> is assumed)
    D = D.replace(/<b>/gi, '<strong>').
        replace(/<\/b>/gi, '</strong>');

    // mozilla doesn't like <em> tags (<em> to <i> is assumed)
    D = D.replace(/<em>/gi, '<i>').
        replace(/<\/em>/gi, '</i>');

    // kill unwanted tags
    D = D.replace(/<\?xml:[^>]*>/g, '').     // Word xml
        replace(/<\/?st1:[^>]*>/g, '').      // Word SmartTags
        replace(/<\/?[a-z]\:[^>]*>/g, '').   // All other funny Word non-HTML stuff
        replace(/<\/?font[^>]*>/gi, '').     // Disable if you want to keep font formatting
        replace(/<\/?span[^>]*>/gi, ' ').
        replace(/<\/?div[^>]*>/gi, ' ').
        replace(/<\/?pre[^>]*>/gi, ' ').
        replace(/<\/?h[1-6][^>]*>/gi, ' ');

    // remove empty tags (opening tags assumed)
    //D = D.replace(/<strong><\/strong>/gi,'').
    //replace(/<i><\/i>/gi,'').
    //replace(/<P[^>]*><\/P>/gi,'');

    // nuke double tags
    var oldlen = D.length + 1;
    while (oldlen > D.length) {
        oldlen = D.length;
        // join us now and free the tags, we'll be free hackers,
        // we'll be free... ;-)
        D = D.replace(/<([a-z][a-z]*)> *<\/\1>/gi, ' ').
            replace(/<([a-z][a-z]*)> *<([a-z][^>]*)> *<\/\1>/gi, '<$2>');
    }
    D = D.replace(/<([a-z][a-z]*)><\1>/gi, '<$1>').
        replace(/<\/([a-z][a-z]*)><\/\1>/gi, '<\/$1>');

    // nuke double spaces (runs of two or more spaces become one)
    D = D.replace(/  +/g, ' ');
}
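
The "nuke double tags" loop above is a fixed-point iteration: it repeats until a pass no longer shortens the string, because removing one empty pair can expose another. A stripped-down sketch of just that idea follows. The helper name removeEmptyPairs is made up, and as a simplification it deletes empty pairs outright instead of replacing them with a space.

```javascript
// Hypothetical helper: repeatedly removes empty element pairs such as
// <b></b> or <i> </i> until a pass no longer changes the length.
function removeEmptyPairs(html) {
  var previousLength = html.length + 1;
  // Each pass may expose new empty pairs, so loop until nothing shrinks.
  while (previousLength > html.length) {
    previousLength = html.length;
    html = html.replace(/<([a-z][a-z0-9]*)>\s*<\/\1>/gi, '');
  }
  return html;
}
```

With '<p><b><i> </i></b>text</p>' the first pass removes the inner pair, the second removes the now-empty outer pair, and the third changes nothing and ends the loop, leaving '<p>text</p>'.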

Jul 17 '05 #3

Tim Van Wassenhove

As soon as you have to deal with nested tags etc., regular expressions
aren't that handy anymore. You could have a look at Tidy (a web search
will help you). You buffer all the output, clean it up with Tidy, and
output the cleaned-up code.
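
The buffer, clean, output flow described above can be sketched like this. Nothing here is Tidy's actual API: createBufferedOutput is a made-up name, and the clean callback is only a stand-in for whatever call hands the markup to Tidy.

```javascript
// Hypothetical sketch of "buffer everything, clean once, then output".
// `clean` is a placeholder for the real cleanup step (e.g. a call to Tidy).
function createBufferedOutput(clean) {
  var chunks = [];
  return {
    // Collect output instead of sending it straight away.
    write: function (chunk) {
      chunks.push(String(chunk));
    },
    // Clean the whole document in one go and hand it back for output.
    flush: function () {
      var cleaned = clean(chunks.join(''));
      chunks = [];
      return cleaned;
    }
  };
}
```

Cleaning the complete document in one go matters because a tag opened in one chunk may only be closed in a later one.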

--
Tim Van Wassenhove <http://home.mysth.be/~timvw>

Jul 17 '05 #4

Michael Austin

The problem is that you cannot/should not ever "rely" on JavaScript, as
it may be turned off due to its inherent security risks.

Jul 17 '05 #5

Matthias Esken

chotiwallah schrieb:

i know reg exp is the way, but somehow the solution avoids me. just
pointing me to some more advanced tutorial (pref. in german) would
help a lot.

http://www.regenechsen.de/regex_de/

Any more questions? Visit the German-speaking newsgroup
de.comp.lang.php.misc.

Regards,
Matthias

Jul 17 '05 #6

chotiwallah

brilliant little script, works great, exactly what i needed :-).
i hadn't even thought of doing the whole cleaning clientside.

thanks a bundle to everyone, micha

Jul 17 '05 #7

Michael Austin

While client-side processing/validation is a good idea, you should not
rely on it alone, as not all users have JavaScript or ActiveX turned on. There
are too many idiots out there writing malicious code, which is turning a good
thing bad.
--
Michael Austin.
Consultant - Available.
Donations welcomed. Http://www.firstdbasource.com/donations.html
:)

Jul 17 '05 #8
