Bytes | Software Development & Data Engineering Community

Remove Empty Tags on page

Hi All,

I am working on a script that is theoretically simple, but I cannot get it to
work completely. I am dealing with a page generated by .NET that leaves empty
tags in the markup. I need a JavaScript solution to go in afterwards and clean
up once the page has loaded.

.NET will leave behind any combination of nested tags. Here is an
example below. Even though the <span> tags are not strictly empty, since they
contain <em> tags, they also need to be removed.

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

Here is a simple test page of what I have done so far. It does remove some
of the tags but always leaves behind some empty tags...
http://mysite.verizon.net/res8xvny/removeTags.html

Any ideas of the best way to approach this?

David

Jun 27 '08 #1

For anyone who looks at the page: you will see that the script only loops
through a fixed set of tags...

var tagArray = ["em", "span", "p", "a", "li", "ul"];

Looping over all tags with el = document.getElementsByTagName("*") would be the
preferred method, but I found myself needing several loops and several
node.parentNode.removeChild() calls, and it still didn't work correctly.

David
Jun 27 '08 #2
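
On the getElementsByTagName("*") approach described above: one likely reason a single forward loop needs several passes is that the collection is in document order, so a parent is tested while its empty children are still present. A minimal sketch of one way around that, assuming a snapshot of the collection that is walked backwards so descendants are handled before their ancestors. The names removeEmptyBackward, hasContent and ALLOWED_EMPTY, and the tag list itself, are illustrative assumptions and not taken from the original test page.

```javascript
// Elements that may legitimately be empty (assumed list; extend as needed).
var ALLOWED_EMPTY = {
  BASE: true, BODY: true, BR: true, COL: true, HR: true, HTML: true,
  IMG: true, INPUT: true, LINK: true, META: true, PARAM: true, TITLE: true
};

// True if the element still has an element child or non-blank text.
function hasContent(element) {
  for (var i = 0; i < element.childNodes.length; i++) {
    var child = element.childNodes[i];
    if (child.nodeType === 1) { return true; }                       // element
    if (child.nodeType === 3 && /\S/.test(child.data)) { return true; } // text
  }
  return false;
}

// Takes any array-like list of elements in document order and removes
// the empty ones. Returns the number of elements removed.
function removeEmptyBackward(collection) {
  // Copy first: a live collection changes while nodes are removed.
  var snapshot = [];
  for (var i = 0; i < collection.length; i++) {
    snapshot.push(collection[i]);
  }
  var removed = 0;
  // Backwards through document order: children come before their parents.
  for (var j = snapshot.length - 1; j >= 0; j--) {
    var element = snapshot[j];
    if (ALLOWED_EMPTY[element.tagName.toUpperCase()] === true) { continue; }
    if (!hasContent(element) && element.parentNode) {
      element.parentNode.removeChild(element);
      removed++;
    }
  }
  return removed;
}

// In a browser, after the page has loaded:
// removeEmptyBackward(document.getElementsByTagName('*'));
```

Because each element is checked only after everything inside it has been dealt with, one pass is enough for the nested <p><span><em> case, and an element that contains an allowed-empty child such as <img> is kept.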
In article <HeA_j.8105$9H6.7786@trnddc04>, "David" <no**@none.com> wrote:
> Any ideas of the best way to approach this?

Any reason you can't just use the search-and-replace function in your favorite
text editor? If you have shell access to a Unix machine, this is pretty
trivial.
Jun 27 '08 #3

"Doug Miller" <sp******@milmac.com> wrote in message
news:Z%*****************@nlpi065.nbdc.sbc.com...
> Any reason you can't just use the search-and-replace function in your
> favorite text editor? If you have shell access to a Unix machine, this is
> pretty trivial.

Yes, the reason is that .NET is rendering this HTML live. This has to
be done to the actual rendered page on the fly, after it has been loaded.

David


Jun 27 '08 #4
In article <75B_j.10498$3j.2456@trnddc05>, "David" <no**@none.com> wrote:
> Yes, the reason is that .NET is rendering this HTML live. This has to
> be done to the actual rendered page on the fly, after it has been loaded.

Well, you could still do it with a Unix shell script... might be easier.
Jun 27 '08 #5
On May 27, 12:58 am, "David" <n...@none.com> wrote:
> Here is a simple test page of what I have done so far. It does remove some
> of the tags but always leaves behind some empty tags...
> http://mysite.verizon.net/res8xvny/removeTags.html

Have you considered going down the DOM and removing any in-line element
whose textContent or innerText is empty? That way you don't have to
descend into nested empty nodes; they are removed as soon as you reach
the highest empty ancestor.

function getText(el)
{
  if (typeof el == 'string') el = document.getElementById(el);

  // Try DOM 3 textContent property first
  if (typeof el.textContent == 'string') {return el.textContent;}

  // Try MS innerText property
  if (typeof el.innerText == 'string') {return el.innerText;}

  // Neither property is available, so collect the text by hand
  // (the function declaration below is hoisted, so this call works)
  return rec(el);

  // Recurse over child nodes
  function rec(el) {
    var n, x = el.childNodes;
    var txt = [];
    for (var i=0, len=x.length; i<len; ++i){
      n = x[i];

      // Use TEXT_NODE and ELEMENT_NODE as apparently IE 8 will
      // "not support enumeration of nodeType constant values"
      // G. Talbert clj
      if (n.TEXT_NODE == n.nodeType) {
        txt.push(n.data);
      } else if (n.ELEMENT_NODE == n.nodeType) {
        txt.push(rec(n));
      }
    }
    return txt.join('').replace(/\s+/g,' ');
  }
}

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');

  // These nodes are allowed to be empty
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re;

  // Collection is live, so as nodes are removed, length gets shorter
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re = new RegExp('\\b'+node.tagName+'\\b','i');

    // Only removes nodes where textContent is '', but could be extended
    // to remove any node where textContent matches \s*
    if (!re.test(allowedEmpty) && getText(node) == '') {
      node.parentNode.removeChild(node);

      // Node i removed, so back up one
      --i;
    }
  }
}

--
Rob

Jun 27 '08 #6
* RobG wrote in comp.lang.javascript:
> Have you considered going down the DOM and remove any in-line element
> whose textContent or innerText is empty? That way you don't have to
> go down nested empty nodes, they will be removed as soon as you reach
> the highest ancestor.

But then you'll remove e.g. <span><img/></span>.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 27 '08 #7

"RobG" <rg***@iinet.net.au> wrote in message
news:2a**********************************@y22g2000prd.googlegroups.com...
> Have you considered going down the DOM and remove any in-line element
> whose textContent or innerText is empty?
> [code snipped]

I tried it and it does work, but it leaves the <p></p> in the page in
this scenario...

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

David
Jun 27 '08 #8
On May 27, 10:00 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
> But then you'll remove e.g. <span><img/></span>.

Ooops. That can be fixed by going over the descendant nodes to see if any
are "allowed to be empty" elements, at which point the approach loses its
appeal. I think Lasse's recursive DOM walk is best, as it also allows empty
#text nodes to be removed along the way.

FWIW, here's the fixed function (also removes nodes where the content
is only whitespace):

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');
  var kids, skip = false;

  // These elements are allowed to be empty
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re0 = /^\s*$/;
  var re1, re2;

  // Collection is live, so it shrinks as nodes are removed
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re1 = new RegExp('\\b'+node.tagName+'\\b','i');

    if (!re1.test(allowedEmpty) && re0.test(getText(node))) {
      kids = node.getElementsByTagName('*');

      // Keep the node if any descendant is allowed to be empty
      for (var j=0, jlen=kids.length; j<jlen; j++) {
        re2 = new RegExp('\\b'+kids[j].tagName+'\\b','i');

        if (re2.test(allowedEmpty)) {
          skip = true;
          break;
        }
      }

      if (!skip) {
        node.parentNode.removeChild(node);

        // Node i removed, so back up one
        --i;
      }
      skip = false;
    }
  }
}
--
Rob
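
For reference, a recursive walk of the kind mentioned above could look roughly like this. It is a minimal sketch and not Lasse's actual code, which is not quoted in this thread: the names pruneEmpty and KEEP_EMPTY and the contents of the keep-list are assumptions made for illustration. Children are visited before their parent, so a <p> whose only children are empty <span>s goes in the same pass, while <span><img/></span> survives. Whitespace-only text nodes count as empty but are only removed together with their parent. It assumes plain HTML content; anything else that is legitimately empty (embedded SVG shapes, for example) would need to be added to the keep-list.

```javascript
// Elements that are meaningful even without content.
// Assumed list for illustration; adjust it to the page being cleaned.
var KEEP_EMPTY = {
  AREA: true, BASE: true, BODY: true, BR: true, COL: true, HR: true,
  HTML: true, IFRAME: true, IMG: true, INPUT: true, LINK: true,
  META: true, PARAM: true, SCRIPT: true, SELECT: true, TD: true,
  TEXTAREA: true, TH: true, TITLE: true
};

// Returns true when `node` holds nothing worth keeping.
// Empty element children are removed on the way back up.
function pruneEmpty(node) {
  if (node.nodeType === 3) {               // text node
    return !/\S/.test(node.data);          // blank text counts as empty
  }
  if (node.nodeType !== 1) {               // comments etc. are kept
    return false;                          // and count as content
  }
  var empty = true;
  // Walk backwards so removing child i never shifts an unvisited index.
  for (var i = node.childNodes.length - 1; i >= 0; i--) {
    var child = node.childNodes[i];
    if (!pruneEmpty(child)) {
      empty = false;
    } else if (child.nodeType === 1) {
      node.removeChild(child);
    }
  }
  return empty && KEEP_EMPTY[node.tagName.toUpperCase()] !== true;
}

// In a browser, after the page has loaded:
// pruneEmpty(document.body);
```

The return value tells the caller whether the node itself is empty, so the decision to remove always stays with the parent and no second loop over the document is needed.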
Jun 27 '08 #9

"RobG" <rg***@iinet.net.au> wrote in message
news:07**********************************@d19g2000prm.googlegroups.com...
> FWIW, here's the fixed function (also removes nodes where the content
> is only whitespace):
> [code snipped]

Yep, that works as well. I really appreciate your help on this.

David
Jun 27 '08 #10
On May 26, 3:58 pm, David wrote:
> I am working on a script that is theoretically simple but I
> cannot get it to work completely. I am dealing with a page
> spit out by .NET that leaves empty tags in the markup.

No matter how bad .NET may be, it is not so bad that it would be
randomly inserting mark-up into its output. If there are empty
elements in the mark-up then it is almost certain that they are there
because .NET has been instructed to put them there. So the obvious
solution is to fix the server-side code so that it does not output
anything but what you want it to output (i.e. take control of what you
are doing).

> I need a javascript solution to go behind and do a clean
> up after the page loads.

That would be the worst possible approach to the problem.
Jun 27 '08 #11

"Henry" <rc*******@raindrop.co.uk> wrote in message
news:3c**********************************@56g2000hsm.googlegroups.com...
> That would be the worst possible approach to the problem.

Henry,

I completely agree with you, and I told our developers and the powers
that be just this, but I do not make the decisions and have to deal with them.

David
Jun 27 '08 #12
