Bytes IT Community

Navigating text string that contains HTML of a page as DOM object?

Hello.

First, with AJAX I will fetch a remote web page into a string, so the
string will contain HTML tags and such. I need to extract the inner
text of one <span> whose ID I know.

Is it possible to call something like "string variable".getElementById()
on it somehow?

Thank you.

PS: I am just looking for a proper/efficient way to extract the
information from such a string, and I am open to other ideas. I could
load that page in an IFRAME and get access to its DOM that way, but that
is probably not an elegant solution.

Thank you again.

Mar 21 '06 #1
3 Replies


Alex wrote:
First, with AJAX I will fetch a remote web page into a string, so the
string will contain HTML tags and such. I need to extract the inner
text of one <span> whose ID I know.

Is it possible to call something like "string variable".getElementById()
on it somehow?

Provided that the `span' element contains no other elements:

var m = x.responseText.match(
  /<span [^>]*\bid="yourID"[^>]*>([^<]+)<\/span>/i);
if (m)
{
  var spanText = m[1];
}
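A minimal sketch of the same idea wrapped in a reusable function, in case the ID is only known at run time. The names `escapeRegExp` and `spanTextById` are hypothetical and not part of the reply above. It assumes, as above, that the `span` contains no child elements, and additionally that the `id` value is quoted with single or double quotes.

```javascript
// Hypothetical helper: escape characters that are special in a RegExp,
// so an ID such as "a.b" is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return the text of <span id="..."> in `html`,
// or null if no such span is found. Assumes the span has no child
// elements and that the id value is quoted.
function spanTextById(html, id) {
  var re = new RegExp(
    '<span\\b[^>]*\\sid\\s*=\\s*(["\'])' + escapeRegExp(id) +
    '\\1[^>]*>([^<]*)<\\/span>',
    'i');
  var m = html.match(re);
  return m ? m[2] : null;
}
```

With `x.responseText` as the input, a call would look like `spanTextById(x.responseText, "yourID")`. Requiring whitespace before `id` keeps attributes such as `data-id` from matching by accident.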

This is the second time you have asked about something that can be
solved with Regular Expressions. Please RTFineM on that:

<URL:http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Reference:Global_Objects:RegExp>
<URL:http://msdn.microsoft.com/library/en-us/jscript7/html/jsobjregexpression.asp>
<URL:http://oreilly.com/catalog/regex/> (you don't have to buy the book[s];
the sample chapters are helpful already)
PS: I am just looking for a proper/efficient way to extract the
information from such a string, and I am open to other ideas. I could
load that page in an IFRAME and get access to its DOM that way, but that
is probably not an elegant solution.


You could also serve XML (text/xml) and use getElementById() on the
Document object that XMLHttpRequest::responseXML refers to.
Quick hack:

var d = x.responseXML;
if (d)
{
  var span = d.getElementById("yourID"), spanText;
  if (span)
  {
    if (span.textContent)
    {
      // W3C DOM Level 3 Core
      spanText = span.textContent;
    }
    else if (span.innerText)
    {
      // proprietary
      spanText = span.innerText;
    }
    else if (span.innerHTML)
    {
      // proprietary
      spanText = span.innerHTML;
    }
  }
}
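One caveat worth sketching: in some XML DOM implementations, particularly older ones, getElementById() may return null for an XML document unless the `id` attribute is declared as type ID. The following is a hedged fallback, not part of the quick hack above. The helper names `spanTextFromXml` and `collectText` are hypothetical, and `doc` is assumed to be the Document obtained from responseXML.

```javascript
// Hypothetical helper: concatenate the text and CDATA descendants
// (nodeType 3 and 4) of a node, for DOMs without textContent.
function collectText(node) {
  var out = '';
  for (var c = node.firstChild; c; c = c.nextSibling) {
    out += (c.nodeType === 3 || c.nodeType === 4)
      ? c.nodeValue
      : collectText(c);
  }
  return out;
}

// Hypothetical helper: find <span id="..."> in an XML Document and
// return its text, or null if there is no such element.
function spanTextFromXml(doc, id) {
  var el = doc.getElementById ? doc.getElementById(id) : null;
  if (!el) {
    // Fallback: scan the span elements and compare the attribute value.
    var spans = doc.getElementsByTagName('span');
    for (var i = 0; i < spans.length; i++) {
      if (spans[i].getAttribute('id') === id) {
        el = spans[i];
        break;
      }
    }
  }
  if (!el) return null;
  if (typeof el.textContent === 'string') return el.textContent;
  return collectText(el);
}
```

A call would look like `spanTextFromXml(x.responseXML, "yourID")`. The scan is linear in the number of `span` elements, which should be acceptable for a single lookup per response.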
PointedEars
Mar 21 '06 #2

Alex said on 21/03/2006 1:49 PM AEST:
Hello.

First, with AJAX I will fetch a remote web page into a string, so the
string will contain HTML tags and such. I need to extract the inner
text of one <span> whose ID I know.

Is it possible to call something like "string variable".getElementById()
on it somehow?


The first question is why are you sending more HTML to the client than
is necessary? But supposing you have a good reason for that...

There are two ways, one is to use a regular expression only, the other
to use a RegExp in concert with innerHTML and getElementById. Whichever
is best is up to you.

The first method is kinda quick 'n dirty but may suit. The second
method is a bit more general and may be better where you want to access
multiple elements, but it still has its failings. It processes the
HTML, creates a new div element, sets its style.display property to
'none', injects the processed HTML as the div's innerHTML, then uses
getElementById to get the element and its content.

You may not need all of it, depending on where you sourced the HTML
from:

1. Strip stuff outside body - necessary; otherwise the injected
   HTML will be invalid.

2. Remove script tags & content - necessary to cut down on bulk
   and to stop scripts executing when added to the document or
   interfering with other scripts.

3. Remove img tags - no point in downloading images.

4. Replace the onload attribute with onclick to stop script
   executing onload - may not be necessary; note that if the text
   'onload' appears in the document text, it will be altered too.

Watch for wrapping, I've tried to avoid it.
<script type="text/javascript">

var HTMLstring = [
'<html><head><title>The title</title></head><body>',
'<script type="text/javascript">function b(){ }<\/script>',
'<p onload="blah">A para<span id="xx"><i><b>Content of </b>',
'xx</i></span> more para</p>',
'<img src="reallyBigImg.jpg" alt="ha ha">',
'<img src="reallyBigImg.jpg" alt="ha ha">',
'<p onload = "blah" id="b">A para<span id="yy">Content <b><i>',
'of</i></b> yy</span> more para</p>',
'<script type="text/javascript">function c(){ }<\/script>',
'</body></html>'].join('');

// Straight RegExp and replace
function getInnerTextRE(id)
{
  var reS = new RegExp('.*<span[^>]*\\b' + id + '\\b[^>]*>','i');
  var reE = new RegExp('<\/span>.*','i');
  alert( id + ': ' +
    HTMLstring.replace(reS,'').replace(reE,'').replace(/<[^>]*>/g,'')
  );
}

// RegExp, innerHTML and getElementById
function getInnerText(id)
{

  // Remove everything outside body tags, including the body tags
  HTMLstring = HTMLstring.replace(/.*<body[^>]*>/i,'');
  HTMLstring = HTMLstring.replace(/<\/body>.*/i,'');

  // Remove script tags & content (wrapped for posting)
  HTMLstring =
    HTMLstring.replace(/<script[^>]*>[^<>]*<\/script>/ig,'');

  // Remove image tags
  HTMLstring = HTMLstring.replace(/<img[^>]*>/ig,'');

  // Replace onload attribute with onclick to stop them executing
  HTMLstring = HTMLstring.replace(/onload/g,'onclick');

  var d = document.createElement('div');
  d.style.display = 'none';
  d.innerHTML = HTMLstring;
  document.body.appendChild(d);

  alert( id + ': ' + getText(id));

  document.body.removeChild(d);
}

function getText(id)
{
  var el;
  if ( document.getElementById
      && (el = document.getElementById(id))){
    if (el.textContent) return el.textContent;
    if (el.innerText) return el.innerText;
    return el.innerHTML.replace(/<[^>]*>/g,'');
  }
}

</script>

<button onclick="getInnerText('xx');getInnerText('yy');">
Get text using RegExp & getElementById</button>

<button onclick="getInnerTextRE('xx');getInnerTextRE('yy');">
Get text using regular expression only</button>
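The four cleanup steps can also be sketched as one pure string function that touches `onload` only when it appears as an attribute inside a tag, which avoids altering the word in ordinary document text. This is a variation under stated assumptions, not the code posted above. The name `stripForInjection` is hypothetical, the `onload` attribute is dropped rather than renamed to `onclick`, and scripts are matched with `[\s\S]*?` so that bodies containing `<`, `>` or line breaks are removed too. Like any regex-based cleanup, it can still be fooled by unusual markup (for example a `>` inside an attribute value).

```javascript
// Hypothetical helper: reduce a full HTML page to body markup that is
// safer to assign to innerHTML of a hidden element.
function stripForInjection(html) {
  // 1. Keep only what lies between <body ...> and </body>, if present.
  var body = html.match(/<body\b[^>]*>([\s\S]*?)<\/body\s*>/i);
  if (body) html = body[1];

  // 2. Remove script elements together with their content.
  html = html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '');

  // 3. Remove img tags so that no images are requested.
  html = html.replace(/<img\b[^>]*>/gi, '');

  // 4. Drop onload attributes, but only inside a tag. At most one
  //    onload attribute per tag is handled by this pattern.
  html = html.replace(
    /(<[^>]*?)\s+onload\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '$1');

  return html;
}
```

The result can then be assigned to the hidden div's innerHTML in place of the step-by-step replacements, leaving the original string unmodified so that repeated lookups start from the same input.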
--
Rob
Mar 21 '06 #3

I think I will go with responseXML. Regular expressions are hard for me
to debug since I have not learned them yet. Plus, I think responseXML
will be a less CPU-intensive task.

Mar 21 '06 #4
