Debbie wrote:
Is there a standard way to extract text from a web page, without using
innerText/innerHTML?
It's an academic exercise, and we've been advised that we can't use
Internet Explorer DOM extensions that are not part of the W3C DOM.
Well, then use the W3C DOM. Text sits in text nodes, which are leaf
nodes of the DOM tree, and each text node has a property named nodeValue
that gives you the text in that node. You could also use the data
property for that.
If you want the text in an element, you have two options. You can go
through the child nodes and concatenate their text, recursing down the
tree until you reach the text nodes. Or, depending on your needs and
requirements, you can use the W3C DOM Level 3 property named
textContent, which Mozilla has supported for quite some time and which
at least Opera now supports too.
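
The recursive approach can be sketched roughly as follows. This is only
a minimal sketch: getText is a hypothetical helper name, not part of any
DOM API, and it relies solely on the W3C DOM Core properties nodeType,
nodeValue and childNodes.

```javascript
// Node type constants as defined by the W3C DOM Core specification.
var TEXT_NODE = 3;
var CDATA_SECTION_NODE = 4;

// getText is a hypothetical helper name, not a DOM method.
function getText(node) {
  // Text and CDATA section nodes carry their text directly in nodeValue.
  if (node.nodeType === TEXT_NODE || node.nodeType === CDATA_SECTION_NODE) {
    return node.nodeValue;
  }
  // For any other node, concatenate the text of its children in
  // document order. Comment nodes have no children, so they add nothing.
  var text = '';
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    text += getText(children[i]);
  }
  return text;
}
```

In a browser you would call it as
getText(document.getElementById('content')), where 'content' is an
assumed id used only for illustration.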
Then there is the W3C DOM Level 2 Range API, which also allows you to
get the text in a range, so you could position the range on an element
node and call toString on the range, e.g.
// someNode is the element node whose text you want
var range = document.createRange();
range.selectNodeContents(someNode);
var text = range.toString();
Mozilla, and Opera from version 8 on, support the Range API.
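
Since not every browser supports the Range API, it may be worth testing
for it before use. A minimal sketch, assuming you pass in the document
object yourself: getTextViaRange is a hypothetical helper name, and
returning null for "not supported" is just one possible convention.

```javascript
// getTextViaRange is a hypothetical helper name; doc is a Document object.
function getTextViaRange(doc, node) {
  // Test for the feature itself rather than for a browser name.
  if (!doc || !doc.createRange) {
    return null; // Range API not available in this browser
  }
  var range = doc.createRange();
  range.selectNodeContents(node);
  return range.toString();
}
```

A call would look like getTextViaRange(document, someNode). If the
result is null you can fall back to walking the child nodes.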
--
Martin Honnen
http://JavaScript.FAQTs.com/