Remove Empty Tags on page

Hi All,

I am working on a script that is theoretically simple but I cannot get it to
work completely. I am dealing with a page spit out by .NET that leaves empty
tags in the markup. I need a JavaScript solution to go behind and do a clean
up after the page loads.

The .NET will leave behind any combination of nested tags. Here is an
example below. Even though the <span> tags are not empty, as they contain
<em> tags, they also need to be removed.

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

Here is a simple test page of what I have done so far. It does remove some
of the tags but always leaves behind some empty tags...
http://mysite.verizon.net/res8xvny/removeTags.html

Any ideas of the best way to approach this?

David

Jun 27 '08 #1


"David" <no**@none.com> wrote in message
news:HeA_j.8105$9H6.7786@trnddc04...
> Hi All,
>
> I am working on a script that is theoretically simple but I cannot get it
> to work completely. I am dealing with a page spit out by .NET that leaves
> empty tags in the markup. I need a JavaScript solution to go behind and do
> a clean up after the page loads.
> David

Anyone who looks at the page will see that the script only loops
through a fixed set of tags:

var tagArray = ["em", "span", "p", "a", "li", "ul"];

Using all tags, el = document.getElementsByTagName("*"), would be the
preferred method, but I found myself needing several loops and several
node.parentNode.removeChild() calls, and it still didn't work correctly.

David
Jun 27 '08 #2
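
One way to avoid the repeated loops described above is to walk the "*" collection backwards. Descendants come after their ancestors in document order, so a single backward pass empties the innermost tags first and then finds their parents empty too. What follows is only a sketch under stated assumptions: removeEmptyElements, isBlank and KEEP_TAGS are made-up names, and the list of tags to keep is a guess that would need adjusting for a real page.

```javascript
// Sketch only. KEEP_TAGS is an assumed list of elements that are
// legitimately empty (or carry no text) and must never be removed.
var KEEP_TAGS = /^(?:area|base|br|col|hr|iframe|img|input|link|meta|param|script|textarea)$/i;

// True when el has no element children and every text child is
// empty or whitespace only. Comments and other node types are ignored.
function isBlank(el) {
  for (var i = 0; i < el.childNodes.length; i++) {
    var n = el.childNodes[i];
    if (n.nodeType == 1) return false;                       // element child
    if (n.nodeType == 3 && /\S/.test(n.data)) return false;  // real text
  }
  return true;
}

// Remove blank elements below root in one backward pass.
function removeEmptyElements(root) {
  var els = root.getElementsByTagName('*');
  // Last to first: children are handled before their parents, and
  // removing els[i] never shifts the indexes still to be visited.
  for (var i = els.length - 1; i >= 0; i--) {
    var el = els[i];
    if (!KEEP_TAGS.test(el.tagName) && isBlank(el)) {
      el.parentNode.removeChild(el);
    }
  }
}
```

On a real page this could be called once the document has loaded, for example from window.onload with document.body as the root.
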
In article <HeA_j.8105$9H6.7786@trnddc04>, "David" <no**@none.com> wrote:
> [...]
> Any ideas of the best way to approach this?

Any reason you can't just use the search-and-replace function in your favorite
text editor? If you have shell access to a Unix machine, this is pretty
trivial.
Jun 27 '08 #3

"Doug Miller" <sp******@milmac.com> wrote in message
news:Z%*****************@nlpi065.nbdc.sbc.com...
> Any reason you can't just use the search-and-replace function in your
> favorite text editor? If you have shell access to a Unix machine, this is
> pretty trivial.

Yes, the reason is that .NET is rendering this HTML live. The cleanup has to
be done on the actual rendered page on the fly, after it has been loaded.

David


Jun 27 '08 #4
In article <75B_j.10498$3j.2456@trnddc05>, "David" <no**@none.com> wrote:
> Yes, the reason is because the .NET is rendering this HTML live. This has to
> be done to the actual rendered page on the fly, after it has been loaded.

Well, you could still do it with a Unix shell script... might be easier.
Jun 27 '08 #5
On May 27, 12:58 am, "David" <n...@none.com> wrote:
> [...]
> Here is a simple test page of what I have done so far. It does remove some
> of the tags but always leaves behind some empty tags...
> http://mysite.verizon.net/res8xvny/removeTags.html

Have you considered going down the DOM and removing any in-line element
whose textContent or innerText is empty? That way you don't have to
go down nested empty nodes; they will be removed as soon as you reach
the highest ancestor.

function getText(el)
{
  if (typeof el == 'string') el = document.getElementById(el);

  // Try DOM 3 textContent property first
  if (typeof el.textContent == 'string') {return el.textContent;}

  // Try MS innerText property
  if (typeof el.innerText == 'string') {return el.innerText;}
  return rec(el);

  // Recurse over child nodes (the declaration is hoisted, so the
  // call above works)
  function rec(el) {
    var n, x = el.childNodes;
    var txt = [];
    for (var i=0, len=x.length; i<len; ++i){
      n = x[i];

      // Use TEXT_NODE and ELEMENT_NODE as apparently IE 8 will
      // "not support enumeration of nodeType constant values"
      // G. Talbert clj
      if (n.TEXT_NODE == n.nodeType) {
        txt.push(n.data);
      } else if (n.ELEMENT_NODE == n.nodeType) {
        txt.push(rec(n));
      }
    }
    return txt.join('').replace(/\s+/g,' ');
  }
}

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');

  // These nodes are allowed to be empty
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re;

  // Collection is live, so as nodes are removed, length gets shorter
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re = new RegExp('\\b'+node.tagName+'\\b','i');

    // Only removes nodes where textContent is '', but could extend
    // to remove any node where textContent matches /^\s*$/
    if (!re.test(allowedEmpty) && getText(node) == '') {
      node.parentNode.removeChild(node);

      // Node i was removed, so back up one
      --i;
    }
  }
}

--
Rob

Jun 27 '08 #6
* RobG wrote in comp.lang.javascript:
> Have you considered going down the DOM and removing any in-line element
> whose textContent or innerText is empty? That way you don't have to
> go down nested empty nodes; they will be removed as soon as you reach
> the highest ancestor.

But then you'll remove e.g. <span><img/></span>.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 27 '08 #7
"RobG" <rg***@iinet.net.au> wrote in message
news:2a**********************************@y22g2000prd.googlegroups.com...
> Have you considered going down the DOM and removing any in-line element
> whose textContent or innerText is empty? That way you don't have to
> go down nested empty nodes; they will be removed as soon as you reach
> the highest ancestor.
>
> [...]

I tried it and it does work, but it leaves the <p></p> in the page in
this scenario...

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

David
Jun 27 '08 #8
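
A likely reason the <p> survives: with the markup as shown, its text is not the empty string but the line breaks between the tags, and the posted function compares against '' exactly. A minimal sketch of the difference, assuming textContent returns just those four newlines (isBlankText is an illustrative name, not part of the posted code):

```javascript
// Assumed textContent of the <p> above: one newline before each <span>
// and one before </p>, nothing else.
var pText = '\n\n\n\n';

// Illustrative helper: true for '' and for whitespace-only strings.
function isBlankText(s) {
  return /^\s*$/.test(s);
}

var exactMatch = (pText == '');       // false, so the <p> is kept
var blankMatch = isBlankText(pText);  // true, so a whitespace-aware test would remove it
```

The inner <span> and <em> elements really do have '' as their text, which is why they are removed while the surrounding <p> stays.
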
On May 27, 10:00 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
> * RobG wrote in comp.lang.javascript:
>> Have you considered going down the DOM and removing any in-line element
>> whose textContent or innerText is empty? That way you don't have to
>> go down nested empty nodes; they will be removed as soon as you reach
>> the highest ancestor.
>
> But then you'll remove e.g. <span><img/></span>.

Oops. That can be fixed by going over the descendant nodes to see whether any
are "allowed to be empty" elements, at which point the approach loses its
appeal. I think Lasse's recursive DOM walk is best, as it also allows empty
#text nodes to be removed along the way.

FWIW, here's the fixed function (also removes nodes where the content
is only whitespace):

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');
  var kids, skip = false;
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re0 = /^\s*$/;
  var re1, re2;

  // nodes is a live collection, so its length is re-read on every pass
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re1 = new RegExp('\\b'+node.tagName+'\\b','i');

    // Candidate for removal: not an allowed-empty tag, and its text
    // (from getText(), posted earlier) is empty or whitespace only
    if (!re1.test(allowedEmpty) && re0.test(getText(node))) {
      kids = node.getElementsByTagName('*');

      // Keep the node if any descendant is an allowed-empty element,
      // e.g. <span><img/></span>
      for (var j=0, jlen=kids.length; j<jlen; j++) {
        re2 = new RegExp('\\b'+kids[j].tagName+'\\b','i');

        if (re2.test(allowedEmpty)) {
          skip = true;
          break;
        }
      }

      if (!skip) {
        node.parentNode.removeChild(node);

        // Node i was removed, so back up one
        --i;
      }
      skip = false;
    }
  }
}
--
Rob
Jun 27 '08 #9
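
The recursive DOM walk mentioned in the post above does not appear in this thread. As a rough idea of what such a walk could look like, here is a sketch; it is an assumption, not Lasse's code, and pruneEmpty and ALLOWED_EMPTY are made-up names. Each call cleans a node's children first and then reports whether anything worth keeping is left, so the caller can decide whether to drop that child. Whitespace-only #text nodes are not removed one by one here; they simply do not count as content and disappear together with the element that holds them.

```javascript
// Sketch only. ALLOWED_EMPTY is an assumed list of elements that may
// legitimately have no content and are always kept.
var ALLOWED_EMPTY = /^(?:area|base|br|col|hr|iframe|img|input|link|meta|param|script|textarea)$/i;

// Post-order walk: prune node's subtree, then return true if node still
// holds real text or a kept element. The node passed in is never removed.
function pruneEmpty(node) {
  var hasContent = false;
  // Last to first, so removing a child does not shift the indexes
  // that are still to be visited.
  for (var i = node.childNodes.length - 1; i >= 0; i--) {
    var child = node.childNodes[i];
    if (child.nodeType == 3) {
      // Text node: counts only if it has a non-whitespace character
      if (/\S/.test(child.data)) hasContent = true;
    } else if (child.nodeType == 1) {
      if (ALLOWED_EMPTY.test(child.tagName) || pruneEmpty(child)) {
        hasContent = true;
      } else {
        node.removeChild(child);
      }
    }
    // Comments and other node types are left alone and not counted
  }
  return hasContent;
}
```

Called as pruneEmpty(document.body) after the page has loaded, this would leave <span><img/></span> in place and remove the empty <p> from the example together with its spans.
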


"RobG" <rg***@iinet.net.au> wrote in message
news:07**********************************@d19g2000prm.googlegroups.com...
> FWIW, here's the fixed function (also removes nodes where the content
> is only whitespace):
>
> [...]

Yep, that works as well. I really appreciate your help on this.

David
Jun 27 '08 #10
On May 26, 3:58 pm, David wrote:
> I am working on a script that is theoretically simple but I
> cannot get it to work completely. I am dealing with a page
> spit out by .NET that leaves empty tags in the markup.

No matter how bad .NET may be, it is not so bad that it would be
randomly inserting mark-up into its output. If there are empty
elements in the mark-up then it is almost certain that they are there
because .NET has been instructed to put them there. So the obvious
solution is to fix the server-side code so that it does not output
anything but what you want it to output (i.e. take control of what you
are doing).

> I need a javascript solution to go behind and do a clean
> up after the page loads.

That would be the worst possible approach to the problem.
Jun 27 '08 #11

"Henry" <rc*******@raindrop.co.uk> wrote in message
news:3c**********************************@56g2000hsm.googlegroups.com...
> [...]
> That would be the worst possible approach to the problem.

Henry,

I completely agree with you, and I told our developers and the powers that
be just this, but I do not make the decisions and have to deal with them.

David
Jun 27 '08 #12
