Bytes IT Community

Problem loading HTML containing scripts using DOM loadHTML

Posted by loretta
This code just reads HTML and prints it; eventually I want to
modify the HTML. However, the original HTML contains JavaScript, and
the output HTML contains tags that are not in the original.

$url = "http://www.something.com";
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
print $doc->saveHTML();

Original html snippet:
function exampleFunction() {
var doc = '<html><head>';
doc += '<title>Title</title>';
doc += '</head>';
doc += '<body onload="self.focus();">';
doc += '</body></html>';
}

Html after saveHTML:
function exampleFunction() {
('about:blank','imagemanagerpopup',settings);
var doc = '<html><head>';
doc += '<title>Title</title>';
doc += '</script>
</head>
<body>
<p>';
doc += '</body>
</html><html><body>
<p>';
}

Extra tags that end the script and head and begin a new body are being
added before the </body> tag and after the <body onload="self.focus();">
tag in the JS variable. Is there a way for the DOM to leave the
JavaScript as is, without trying to 'fix' the HTML? The changes being
made are causing a JavaScript error.
Thanks

May 14 '07 #1
9 Replies


Posted by shimmyshack
Start off with XHTML so it can be loaded with no errors. Search Google
for how to add JavaScript in a way that is compliant with XML standards.

May 14 '07 #2

Posted by loretta
The HTML I am retrieving has an XHTML doctype. I also have no control
over the original webpage. The original webpage loads with no errors
in both IE and FF.

May 14 '07 #3

Posted by Jerry Stuckle
But does it validate (http://validator.w3.org)? Pages can load in
browsers without error and still not validate. The browsers are very
forgiving, and make a "best guess" as to what the page creator wanted.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
May 14 '07 #4
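
To see what the parser itself objects to, as opposed to what a browser tolerates, the libxml errors raised during loading can be collected and inspected. A minimal sketch, assuming PHP 7 or later with the DOM extension enabled; the helper name collectParseErrors and the sample markup are made up for illustration, and the exact messages depend on the installed libxml2 version:

```php
<?php
// Hypothetical helper: returns the messages libxml reports while parsing $html.
function collectParseErrors(string $html): array
{
    // Buffer errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Empty the buffer and restore the caller's error-handling setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Illustrative input only: an unordered list that is closed twice.
$messages = collectParseErrors('<html><body><ul><li>a</li></ul></ul></body></html>');
print implode("\n", $messages) . "\n";
```

Running this against the fetched page, rather than the sample string, would show where the parser starts to disagree with the page author.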

Posted by shimmyshack
This is what I found on Google:
http://developer.mozilla.org/en/docs...HTML_Documents
Use <![CDATA[ ... ]]>, or the "XHTML" document is no such thing. By the
way, it should not just claim to be XHTML but should properly validate
as such, including the content type text/xml+xhtml (served as .xhtml).
Once you have obtained the webpage and parsed it, adding the right
instructions for the XML parser, all should work, if indeed the rest
of the doc is valid XML.

May 14 '07 #5

Posted by shimmyshack
Oops, the content type should be application/xhtml+xml, of course.

May 14 '07 #6

Posted by Toby A Inkster
Jerry Stuckle wrote:
> But does it validate (http://validator.w3.org)? Pages can load in
> browsers without error and still not validate. The browsers are very
> forgiving, and make a "best guess" as to what the page creator wanted.

From the excerpts posted, no. Javascript blocks in XHTML must be entity
encoded -- that is:

'&' => '&amp;'
'<' => '&lt;'

at a minimum. If not, then the document is not valid.

If a document is not valid, then DOMDocument might not be able to load it
correctly. Or rather, "correctly" is not defined, so DOMDocument is free
to interpret it however it likes!

--
Toby A Inkster BSc (Hons) ARCS
http://tobyinkster.co.uk/
Geek of ~ HTML/SQL/Perl/PHP/Python/Apache/Linux
May 15 '07 #7
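
The entity encoding described above can be sketched in PHP with htmlspecialchars(). This is only an illustration under stated assumptions: the script text is invented, not taken from the page in question, and the round trip through loadXML() is there to show that an XML parser hands back the original characters.

```php
<?php
// Illustrative script text; not taken from the page under discussion.
$script = 'if (a < b && c) { run(); }';

// ENT_NOQUOTES leaves ' and " alone; only &, < and > are encoded.
$encoded = htmlspecialchars($script, ENT_NOQUOTES, 'UTF-8');
print $encoded . "\n"; // if (a &lt; b &amp;&amp; c) { run(); }

// An XML parser decodes the entities again, so the script text survives intact.
$doc = new DOMDocument();
$doc->loadXML('<script>' . $encoded . '</script>');
print $doc->documentElement->textContent . "\n"; // if (a < b && c) { run(); }
```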

Posted by shimmyshack
Using a CDATA block means that the parser won't be tripped up by < and
so forth.

May 15 '07 #8
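
As a rough illustration of the CDATA point, the sketch below feeds a small, well-formed XHTML document to DOMDocument::loadXML(), which uses the XML parser rather than the HTML parser behind loadHTML(). The document is invented for the example (its script body echoes the snippet from the question), and the approach only applies if the whole page is well-formed XML.

```php
<?php
// Illustrative XHTML; the script body echoes the snippet from the question.
$xhtml = <<<'XHTML'
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title</title>
<script type="text/javascript">
//<![CDATA[
var doc = '<html><head>';
doc += '</head>';
//]]>
</script>
</head>
<body></body>
</html>
XHTML;

$doc = new DOMDocument();
$doc->loadXML($xhtml); // XML parser: markup inside the CDATA section is plain text

// The script element's text still contains the literal '</head>' string.
$script = $doc->getElementsByTagName('script')->item(0);
print $script->textContent;

// Serialising with saveXML() writes the CDATA section back out unchanged.
print $doc->saveXML();
```

If loadXML() fails on the real page, the document is not well-formed XML and this route is closed without first repairing the markup.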

Posted by loretta
The webpage does not validate; however, the errors are nowhere near the
extra tags being inserted into the JavaScript at the head tag, i.e.
there is an unordered list somewhere in the HTML that is closed twice,
and an incorrect checkbox attribute. The page validates in Tidy, with
warnings only. There is this CDATA block around all the JavaScript
functions, in a comment:
//<![CDATA[
//]]>
It seems to me that the parser is seeing the '</head>' tag in the
JavaScript variable and putting in the end script tag and body tags.
May 16 '07 #9
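
If the parser really is treating '</head>' inside the script as markup, one possible workaround is to rewrite "</" as "<\/" inside script bodies before calling loadHTML(); in a JavaScript string literal both spellings produce the same text. The following is a sketch only: the helper name escapeClosingTagsInScripts is hypothetical, it assumes "</" occurs only inside string literals in the scripts, and whether the parser then leaves the script untouched depends on the installed libxml2 version.

```php
<?php
// Hypothetical helper: rewrites "</" as "<\/" inside <script> bodies so the
// HTML parser has no closing-tag-like text to react to.
function escapeClosingTagsInScripts(string $html): string
{
    $result = preg_replace_callback(
        // 1: opening <script ...> tag, 2: script body, 3: closing </script> tag
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        function (array $m): string {
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result ?? $html;
}

// Illustrative input modelled on the snippet from the question.
$html = "<html><head><script>var doc = '</head>';</script></head><body></body></html>";
$safe = escapeClosingTagsInScripts($html);
print $safe . "\n";

// Parse the pre-processed markup; warnings are buffered rather than displayed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($safe);
print $doc->saveHTML();
```

The first print shows the script body as var doc = '<\/head>'; with the real closing script tag left alone.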

Posted by Jerry Stuckle
Since you haven't told us the page you're trying to load, we can't see
what the problem is.

And BTW - instead of using "something.com", which is a valid domain, you
should use "example.com" - which is reserved just for such use.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
May 16 '07 #10
