Remove Empty Tags on page

David

Hi All,

I am working on a script that is theoretically simple, but I cannot get it to
work completely. I am dealing with a page spit out by .NET that leaves empty
tags in the markup. I need a JavaScript solution to go behind and do a clean
up after the page loads.

.NET will leave behind any combination of nested tags. Here is an
example below. Even though the <span> tags are not empty, as they contain
<em> tags, they also need to be removed.







Here is a simple test page of what I have done so far. It does remove some
of the tags but always leaves behind some empty tags...
http://mysite.verizon.net/res8xvny/removeTags.html

Any ideas of the best way to approach this?

David

Jun 27 '08 #1

David

"David" <no**@none.com> wrote in message
news:HeA_j.8105$9H6.7786@trnddc04...

> Hi All,
>
> I am working on a script that is theoretically simple, but I cannot get it
> to work completely. I am dealing with a page spit out by .NET that leaves
> empty tags in the markup. I need a JavaScript solution to go behind and do
> a clean up after the page loads.
> David

For anyone who looks at the page, you will see the script only loops
through a certain set of tags:

var tagArray = ["em", "span", "p", "a", "li", "ul"];

Using all tags, i.e. el = document.getElementsByTagName("*"), would be the
preferred method, but I found myself needing several loops and several
node.parentNode.removeChild() calls, and it still didn't work correctly.
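
One likely reason several loops were needed: getElementsByTagName returns elements in document order, so a <span> is tested before the <em> inside it and is not yet empty at that point. A minimal sketch of a single-pass alternative, assuming "empty" means no element children and only whitespace text: walking the collection backwards visits descendants before their ancestors. The names isEmptyElement and removeEmptyTags are made up for illustration, and root would normally be document.

```javascript
// Hypothetical helper: true when the element has no element children left
// and no text other than whitespace. Comment nodes are ignored.
function isEmptyElement(el) {
  for (var i = 0; i < el.childNodes.length; i++) {
    var child = el.childNodes[i];
    if (child.nodeType === 1) return false;                           // element child
    if (child.nodeType === 3 && /\S/.test(child.data)) return false;  // real text
  }
  return true;
}

// Remove empty elements whose tag is listed in tagNames (lower case).
// Iterating from the end means an <em> is removed before its parent <span>
// is tested, so nested empties collapse in one pass.
function removeEmptyTags(root, tagNames) {
  var all = root.getElementsByTagName('*');
  var removed = 0;
  for (var i = all.length - 1; i >= 0; i--) {
    var el = all[i];
    if (tagNames.indexOf(el.tagName.toLowerCase()) !== -1 && isEmptyElement(el)) {
      el.parentNode.removeChild(el);
      removed++;
    }
  }
  return removed;
}

// In a page this might be called as:
// removeEmptyTags(document, ["em", "span", "p", "a", "li", "ul"]);
```

Only elements that have no element children are ever removed, so each removal drops exactly one entry from the live collection and the indexes still to be visited are unaffected.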

David

Jun 27 '08 #2

Doug Miller

In article <HeA_j.8105$9H6.7786@trnddc04>, "David" <no**@none.com> wrote:

> Hi All,
>
> I am working on a script that is theoretically simple, but I cannot get it to
> work completely. I am dealing with a page spit out by .NET that leaves empty
> tags in the markup. I need a JavaScript solution to go behind and do a clean
> up after the page loads.

Any reason you can't just use the search-and-replace function in your favorite
text editor? If you have shell access to a Unix machine, this is pretty
trivial.

Jun 27 '08 #3

David

"Doug Miller" <sp******@milmac.com> wrote in message
news:Z%*****************@nlpi065.nbdc.sbc.com...

> Any reason you can't just use the search-and-replace function in your
> favorite text editor? If you have shell access to a Unix machine, this is
> pretty trivial.

Yes, the reason is that .NET is rendering this HTML live. This has to
be done to the actual rendered page on the fly, after it has been loaded.
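
As a rough sketch of the "after it has been loaded" part, assuming a current browser that supports addEventListener and document.readyState: the helper below defers a cleanup function until the load event, or runs it at once if the page has already finished loading. The name runAfterLoad is hypothetical, and removeEmptyNodes in the usage comment stands for whichever cleanup function ends up being used.

```javascript
// Hypothetical helper: run `cleanup` once the page has finished loading.
// `win` is passed in (normally the global window) so the function has no
// hidden dependency on globals.
function runAfterLoad(win, cleanup) {
  if (win.document.readyState === 'complete') {
    cleanup();                              // page already loaded
  } else {
    win.addEventListener('load', cleanup);  // wait for the load event
  }
}

// In a page this might be called as:
// runAfterLoad(window, removeEmptyNodes);
```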

David

Jun 27 '08 #4

Doug Miller

In article <75B_j.10498$3j.2456@trnddc05>, "David" <no**@none.com> wrote:

> Yes, the reason is that .NET is rendering this HTML live. This has to
> be done to the actual rendered page on the fly, after it has been loaded.

Well, you could still do it with a Unix shell script... might be easier.

Jun 27 '08 #5

RobG

On May 27, 12:58 am, "David" <n...@none.com> wrote:

> Hi All,
>
> I am working on a script that is theoretically simple, but I cannot get it to
> work completely. I am dealing with a page spit out by .NET that leaves empty
> tags in the markup. I need a JavaScript solution to go behind and do a clean
> up after the page loads.

Have you considered going down the DOM and removing any in-line element
whose textContent or innerText is empty? That way you don't have to
go down nested empty nodes; they will be removed as soon as you reach
the highest ancestor.

function getText(el)
{
  if (typeof el == 'string') el = document.getElementById(el);

  // Try DOM 3 textContent property first
  if (typeof el.textContent == 'string') {return el.textContent;}

  // Try MS innerText property
  if (typeof el.innerText == 'string') {return el.innerText;}
  return rec(el);

  // Recurse over child nodes
  function rec(el) {
    var n, x = el.childNodes;
    var txt = [];
    for (var i=0, len=x.length; i<len; ++i){
      n = x[i];

      // Use TEXT_NODE and ELEMENT_NODE as apparently IE 8 will
      // "not support enumeration of nodeType constant values"
      // G. Talbert clj
      if (n.TEXT_NODE == n.nodeType) {
        txt.push(n.data);
      } else if (n.ELEMENT_NODE == n.nodeType) {
        txt.push(rec(n));
      }
    }
    return txt.join('').replace(/\s+/g,' ');
  }
}

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');

  // These nodes are allowed to be empty
  var allowedEmpty = 'base basefont body br col hr html image '
                   + 'input isindex link meta param title';
  var re;

  // Collection is live, so as nodes are removed, length gets shorter
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re = new RegExp('\\b'+node.tagName+'\\b','i');

    // Only removes nodes where textContent is '', but could extend
    // to remove any node where textContent matches \s*
    if (!re.test(allowedEmpty) && getText(node) == '') {
      node.parentNode.removeChild(node);

      // Node i was removed, so back up one
      --i;
    }
  }
}

--
Rob

Jun 27 '08 #6

Bjoern Hoehrmann

* RobG wrote in comp.lang.javascript:

> Have you considered going down the DOM and removing any in-line element
> whose textContent or innerText is empty? That way you don't have to
> go down nested empty nodes; they will be removed as soon as you reach
> the highest ancestor.

But then you'll remove e.g. <img/>.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Jun 27 '08 #7

David

"RobG" <rg***@iinet.net.au> wrote in message
news:2a**********************************@y22g2000prd.googlegroups.com...

> Have you considered going down the DOM and removing any in-line element
> whose textContent or innerText is empty? That way you don't have to
> go down nested empty nodes; they will be removed as soon as you reach
> the highest ancestor.

I tried it and it does work, but it leaves the </p> in the page in
this scenario...







David

Jun 27 '08 #8

RobG

On May 27, 10:00 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:

> * RobG wrote in comp.lang.javascript:
>
> > Have you considered going down the DOM and removing any in-line element
> > whose textContent or innerText is empty? That way you don't have to
> > go down nested empty nodes; they will be removed as soon as you reach
> > the highest ancestor.
>
> But then you'll remove e.g. <img/>.

Ooops. That can be fixed by going over the descendants to see if any
are "allowed to be empty" nodes, at which point the approach loses its
appeal. I think Lasse's recursive DOM walk is best, as it also allows
empty #text nodes to be removed along the way.

FWIW, here's the fixed function (also removes nodes where the content
is only whitespace):

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');
  var kids, skip = false;
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re0 = /^\s*$/;
  var re1, re2;

  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re1 = new RegExp('\\b'+node.tagName+'\\b','i');

    if (!re1.test(allowedEmpty) && re0.test(getText(node))) {
      kids = node.getElementsByTagName('*');

      for (var j=0, jlen=kids.length; j<jlen; j++) {
        re2 = new RegExp('\\b'+kids[j].tagName+'\\b','i');

        if (re2.test(allowedEmpty)) {
          skip = true;
          break;
        }
      }

      if (!skip) {
        node.parentNode.removeChild(node);
        --i;
      }
      skip = false;
    }
  }
}
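
The recursive DOM walk mentioned above is not shown in this thread. As a rough sketch of what such a walk could look like (not Lasse's actual code): children are cleaned before their parent is tested, so nested empties such as span > em collapse in one call, and an element that still holds an img is kept because it still has an element child. The names pruneEmpty, isBlank and KEEP_EMPTY are illustrative; the list reuses the allowedEmpty tags from the function above and would likely need extending (for example script, iframe, textarea, td) on real pages.

```javascript
// Tags that may legitimately be empty (same list as allowedEmpty above).
var KEEP_EMPTY = {
  base: 1, basefont: 1, body: 1, br: 1, col: 1, hr: 1, html: 1, img: 1,
  input: 1, isindex: 1, link: 1, meta: 1, param: 1, title: 1
};

// True when the element has no element children and only whitespace text.
function isBlank(el) {
  for (var i = 0; i < el.childNodes.length; i++) {
    var c = el.childNodes[i];
    if (c.nodeType === 1) return false;
    if (c.nodeType === 3 && /\S/.test(c.data)) return false;
  }
  return true;
}

// Depth-first walk: prune each child's subtree, then remove the child
// itself if nothing meaningful is left inside it.
function pruneEmpty(parent) {
  // Iterate backwards so a removal does not shift unvisited indexes.
  for (var i = parent.childNodes.length - 1; i >= 0; i--) {
    var child = parent.childNodes[i];
    if (child.nodeType !== 1) continue;   // only elements are candidates
    pruneEmpty(child);
    var name = child.tagName.toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(KEEP_EMPTY, name) && isBlank(child)) {
      parent.removeChild(child);
    }
  }
}

// In a page this might be called as:
// pruneEmpty(document.body);
```

Unlike the textContent test, this never needs a second scan of the descendants, because the decision for each element is made only after its subtree has been pruned.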
--
Rob

Jun 27 '08 #9

David

"RobG" <rg***@iinet.net.au> wrote in message
news:07**********************************@d19g2000prm.googlegroups.com...

> FWIW, here's the fixed function (also removes nodes where the content
> is only whitespace):

Yep, that works as well. I really appreciate your help on this.

David

Jun 27 '08 #10

Henry

On May 26, 3:58 pm, David wrote:

> I am working on a script that is theoretically simple, but I
> cannot get it to work completely. I am dealing with a page
> spit out by .NET that leaves empty tags in the markup.

No matter how bad .NET may be, it is not so bad that it would be
randomly inserting mark-up into its output. If there are empty
elements in the mark-up then it is almost certain that they are there
because .NET has been instructed to put them there. So the obvious
solution is to fix the server-side code so that it does not output
anything but what you want it to output (i.e. take control of what you
are doing).

> I need a JavaScript solution to go behind and do a clean
> up after the page loads.

That would be the worst possible approach to the problem.

Jun 27 '08 #11

David

"Henry" <rc*******@raindrop.co.uk> wrote in message
news:3c**********************************@56g2000hsm.googlegroups.com...

> That would be the worst possible approach to the problem.

Henry,

I completely agree with you, absolutely. I told our developers and the
powers that be just this, but I do not make the decisions and have to deal
with them.

David

Jun 27 '08 #12
