Remove Empty Tags on page

Hi All,

I am working on a script that is theoretically simple, but I cannot get it to
work completely. I am dealing with a page spit out by .NET that leaves empty
tags in the markup. I need a JavaScript solution to go behind it and clean up
after the page loads.

The .NET code will leave behind any combination of nested tags. Here is an
example below. Even though the <span> tags are not strictly empty, as they
contain <em> tags, they also need to be removed.

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

Here is a simple test page of what I have done so far. It does remove some
of the tags but always leaves behind some empty tags...
http://mysite.verizon.net/res8xvny/removeTags.html

Any ideas of the best way to approach this?

David

Jun 27 '08 #1

"David" <no**@none.com> wrote in message
news:HeA_j.8105$9H6.7786@trnddc04...
> Hi All,
>
> I am working on a script that is theoretically simple but I cannot get it
> to work completely. I am dealing with a page spit out by .NET that leaves
> empty tags in the markup. I need a javascript solution to go behind and do
> a clean up after the page loads.
> David

For anyone who looks at the page, you will see the script only loops
through a fixed set of tags...

var tagArray = ["em", "span", "p", "a", "li", "ul"];

Using all tags, el = document.getElementsByTagName("*"), would be the
preferred method, but I found myself needing several loops and several
node.parentNode.removeChild() calls, and it still didn't work correctly.

David
Jun 27 '08 #2
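
One likely reason a loop over getElementsByTagName("*") seems to need several passes is that the collection is live: removing an element shifts the remaining entries down, so a forward loop skips whatever slides into the vacated slot. The test page itself is not shown here, so this is only a guess at the cause. The sketch below uses a plain array as a stand-in for the live collection (an assumption, so that it runs outside a browser), and removeForward, removeBackward, and isEmpty are hypothetical names.

```javascript
// A plain array stands in for a live HTMLCollection (assumption for
// illustration): splice() shifts later entries down, much as removeChild()
// does to a live list.

// Buggy: after removing index i, the next item slides into i and is skipped.
function removeForward(list, shouldRemove) {
  for (let i = 0; i < list.length; i++) {
    if (shouldRemove(list[i])) {
      list.splice(i, 1);
    }
  }
  return list;
}

// Safe: walking from the end means removals never disturb unvisited indexes.
function removeBackward(list, shouldRemove) {
  for (let i = list.length - 1; i >= 0; i--) {
    if (shouldRemove(list[i])) {
      list.splice(i, 1);
    }
  }
  return list;
}

const isEmpty = (item) => item.text === "";
const makeList = () => [
  { tag: "em", text: "" },
  { tag: "em", text: "" },
  { tag: "p", text: "kept" },
  { tag: "span", text: "" },
];

const forwardResult = removeForward(makeList(), isEmpty);
const backwardResult = removeBackward(makeList(), isEmpty);

console.log(forwardResult.map((n) => n.tag));  // [ 'em', 'p' ]  (one empty item survives)
console.log(backwardResult.map((n) => n.tag)); // [ 'p' ]
```

Stepping the index back by one after each removal is an equivalent fix for the forward loop.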
In article <HeA_j.8105$9H6.7786@trnddc04>, "David" <no**@none.com> wrote:
> Hi All,

> Any ideas of the best way to approach this?

Any reason you can't just use the search-and-replace function in your favorite
text editor? If you have shell access to a Unix machine, this is pretty
trivial.
Jun 27 '08 #3

"Doug Miller" <sp******@milmac.com> wrote in message
news:Z%*****************@nlpi065.nbdc.sbc.com...
> Any reason you can't just use the search-and-replace function in your
> favorite text editor? If you have shell access to a Unix machine, this is
> pretty trivial.

Yes, the reason is that .NET is rendering this HTML live. This has to
be done to the actual rendered page on the fly, after it has been loaded.

David


Jun 27 '08 #4
In article <75B_j.10498$3j.2456@trnddc05>, "David" <no**@none.com> wrote:
> Yes, the reason is that .NET is rendering this HTML live. This has to
> be done to the actual rendered page on the fly, after it has been loaded.

Well, you could still do it with a Unix shell script... might be easier.
Jun 27 '08 #5
On May 27, 12:58 am, "David" <n...@none.com> wrote:
> Here is a simple test page of what I have done so far. It does remove some
> of the tags but always leaves behind some empty tags...
> http://mysite.verizon.net/res8xvny/removeTags.html

Have you considered going down the DOM and removing any in-line element
whose textContent or innerText is empty? That way you don't have to
go down nested empty nodes; they will be removed as soon as you reach
the highest ancestor.

function getText(el)
{
  if (typeof el == 'string') el = document.getElementById(el);

  // Try DOM 3 textContent property first
  if (typeof el.textContent == 'string') {return el.textContent;}

  // Try MS innerText property
  if (typeof el.innerText == 'string') {return el.innerText;}

  // Fall back to collecting the text by hand (rec is hoisted)
  return rec(el);

  // Recurse over child nodes
  function rec(el) {
    var n, x = el.childNodes;
    var txt = [];
    for (var i=0, len=x.length; i<len; ++i){
      n = x[i];

      // Use TEXT_NODE and ELEMENT_NODE as apparently IE 8 will
      // "not support enumeration of nodeType constant values"
      // G. Talbert clj
      if (n.TEXT_NODE == n.nodeType) {
        txt.push(n.data);
      } else if (n.ELEMENT_NODE == n.nodeType) {
        txt.push(rec(n));
      }
    }
    return txt.join('').replace(/\s+/g,' ');
  }
}

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');

  // These nodes are allowed to be empty
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re;

  // The collection is live, so its length shrinks as nodes are removed
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re = new RegExp('\\b'+node.tagName+'\\b','i');

    // Only removes nodes where textContent is '', but could be extended
    // to remove any node whose textContent matches ^\s*$
    if (!re.test(allowedEmpty) && getText(node) == '') {
      node.parentNode.removeChild(node);

      // Node i was removed, so step back to avoid skipping the next one
      --i;
    }
  }
}

--
Rob

Jun 27 '08 #6
* RobG wrote in comp.lang.javascript:
> Have you considered going down the DOM and remove any in-line element
> whose textContent or innerText is empty? That way you don't have to
> go down nested empty nodes, they will be removed as soon as you reach
> the highest ancestor.

But then you'll remove e.g. <span><img/></span>.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 27 '08 #7
"RobG" <rg***@iinet.net.au> wrote in message
news:2a**********************************@y22g2000prd.googlegroups.com...
> Have you considered going down the DOM and remove any in-line element
> whose textContent or innerText is empty? That way you don't have to
> go down nested empty nodes, they will be removed as soon as you reach
> the highest ancestor.

I tried it and it does work, but it leaves the <p></p> in the page in
this scenario...

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

David
Jun 27 '08 #8
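
The leftover <p> is consistent with how text content works: once the spans are gone, the paragraph still holds the line breaks that sat between them, so its text is whitespace rather than the empty string and a comparison with '' fails. A minimal sketch, where isBlank is a hypothetical helper and the string literal is only an assumed approximation of what textContent would return for the markup above:

```javascript
// Hypothetical helper: true when a string is empty or only whitespace.
function isBlank(text) {
  return /^\s*$/.test(text);
}

// Assumed approximation of the <p>'s text content after its empty
// <span>s are removed: only the line breaks between them remain.
const leftoverText = "\n\n\n\n";

console.log(leftoverText === "");   // false
console.log(isBlank(leftoverText)); // true
console.log(isBlank(""));           // true
```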
On May 27, 10:00 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
> * RobG wrote in comp.lang.javascript:
>> Have you considered going down the DOM and remove any in-line element
>> whose textContent or innerText is empty? That way you don't have to
>> go down nested empty nodes, they will be removed as soon as you reach
>> the highest ancestor.
>
> But then you'll remove e.g. <span><img/></span>.

Ooops. That can be fixed by going over the child nodes to see if any
are "allowed to be empty" nodes, though the approach then loses its appeal. I
think Lasse's recursive DOM walk is best, as it also allows empty
#text nodes to be removed along the way.

FWIW, here's the fixed function (also removes nodes where the content
is only whitespace):

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');
  var kids, skip = false;

  // These nodes are allowed to be empty
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re0 = /^\s*$/;
  var re1, re2;

  // The collection is live, so its length shrinks as nodes are removed
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re1 = new RegExp('\\b'+node.tagName+'\\b','i');

    // Candidate: not an allowed-empty tag, and text is empty or whitespace
    if (!re1.test(allowedEmpty) && re0.test(getText(node))) {
      kids = node.getElementsByTagName('*');

      // Keep the node if any descendant is an allowed-empty element
      for (var j=0, jlen=kids.length; j<jlen; j++) {
        re2 = new RegExp('\\b'+kids[j].tagName+'\\b','i');

        if (re2.test(allowedEmpty)) {
          skip = true;
          break;
        }
      }

      if (!skip) {
        node.parentNode.removeChild(node);

        // Node i was removed, so step back to avoid skipping the next one
        --i;
      }
      skip = false;
    }
  }
}
--
Rob
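
The recursive walk mentioned above does not appear in this thread, so what follows is not Lasse's code, only a minimal sketch of the general idea: visit children first, drop whitespace-only text nodes, and remove any element that ends up with no children unless it is on the allowed-empty list. The names pruneEmpty, ALLOWED_EMPTY, el, and text are hypothetical, and el/text are bare stand-ins for DOM nodes (an assumption, so the sketch runs outside a browser).

```javascript
// Elements that are legitimately empty (same list as the function above).
const ALLOWED_EMPTY = new Set([
  "base", "basefont", "body", "br", "col", "hr", "html", "img",
  "input", "isindex", "link", "meta", "param", "title",
]);

const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Post-order walk: clean the children first, then report whether this node
// is itself empty so that the caller can remove it.
function pruneEmpty(node) {
  if (node.nodeType === TEXT_NODE) {
    return /^\s*$/.test(node.data);
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return false; // leave comments and other node types alone
  }
  // Copy childNodes first, because the real collection is live.
  for (const child of Array.from(node.childNodes)) {
    if (pruneEmpty(child)) {
      node.removeChild(child);
    }
  }
  if (ALLOWED_EMPTY.has(node.tagName.toLowerCase())) {
    return false;
  }
  return node.childNodes.length === 0;
}

// Minimal stand-ins for DOM nodes (hypothetical, for running outside a browser).
function text(data) {
  return { nodeType: TEXT_NODE, data };
}
function el(tagName, ...children) {
  return {
    nodeType: ELEMENT_NODE,
    tagName: tagName.toUpperCase(),
    childNodes: children,
    removeChild(child) {
      this.childNodes.splice(this.childNodes.indexOf(child), 1);
    },
  };
}

const body = el("body",
  el("p", text("\n"), el("span", el("em")), text("\n"), el("span", el("em")), text("\n")),
  el("p", el("span", el("img"))),
  el("p", text("Hello"))
);
pruneEmpty(body);
console.log(body.childNodes.length); // 2: the first <p> is gone, the other two stay
```

In a browser the equivalent call would be pruneEmpty(document.body). Two caveats: whitespace-only text nodes between in-line elements are removed as well, which can change spacing, and tags such as script, iframe, or textarea are not on the list, so empty ones would be removed unless the list is extended.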
Jun 27 '08 #9

"RobG" <rg***@iinet.net.au> wrote in message
news:07**********************************@d19g2000prm.googlegroups.com...
> FWIW, here's the fixed function (also removes nodes where the content
> is only whitespace):

Yep, that works as well. I really appreciate your help on this.

David
Jun 27 '08 #10
On May 26, 3:58 pm, David wrote:
> I am working on a script that is theoretically simple but I
> cannot get it to work completely. I am dealing with a page
> spit out by .NET that leaves empty tags in the markup.

No matter how bad .NET may be, it is not so bad that it would be
randomly inserting mark-up into its output. If there are empty
elements in the mark-up then it is almost certain that they are there
because .NET has been instructed to put them there. So the obvious
solution is to fix the server-side code so that it does not output
anything but what you want it to output (i.e. take control of what you
are doing).

> I need a javascript solution to go behind and do a clean
> up after the page loads.

That would be the worst possible approach to the problem.
Jun 27 '08 #11

"Henry" <rc*******@raindrop.co.uk> wrote in message
news:3c**********************************@56g2000hsm.googlegroups.com...
> That would be the worst possible approach to the problem.

Henry,

Completely agree with you, absolutely, and I told our developers and the
powers that be just this, but I do not make the decisions and have to deal
with them.

David
Jun 27 '08 #12
