Problem loading html containing scripts using Dom LoadHTML

loretta

This code just reads HTML and prints it; eventually I want to
modify the HTML. However, the original HTML contains JavaScript, and
the output HTML contains tags not in the original.

$url = "http://www.something.com";
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
print $doc->saveHTML();

Original html snippet:
function exampleFunction() {
var doc = '<html><head>';
doc += '<title>Title</title>';
doc += '</head>';
doc += '<body onload="self.focus();">';
doc += '</body></html>';
}

Html after saveHTML:
function exampleFunction() {
('about:blank','imagemanagerpopup',settings);
var doc = '<html><head>';
doc += '<title>Title</title>';
doc += '</script>
</head>
<body>
<p>';
doc += '</body>
</html><html><body>
<p>';
}

Extra tags that end the script and head and begin a new body are being
added before the </body> tag and after the <body onload="self.focus();">
tag in the JS variable. Is there a way for the DOM to leave the
JavaScript as is without trying to 'fix' the HTML? The changes being
made are causing a JavaScript error.
Thanks

May 14 '07 #1

shimmyshack

Start off with XHTML, so it can be loaded with no errors; search Google
for how to add JavaScript in a way that complies with XML standards.

May 14 '07 #2

loretta

The HTML I am retrieving has an XHTML doctype. I also have no control
over the original web page. The original web page loads with no errors
in both IE and FF.

May 14 '07 #3

Jerry Stuckle

But does it validate (http://validator.w3.org)? Pages can load in
browsers without error and still not validate. The browsers are very
forgiving, and make a "best guess" as to what the page creator wanted.
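
Alongside the validator, one way to see what the parser itself objects to is to collect libxml's diagnostics while loading. This is only a minimal sketch, assuming the page has already been fetched into a string; the helper name collectHtmlParseErrors and the sample markup are made up for illustration, and which messages appear depends on the libxml2 version in use.

```php
<?php
// Hypothetical helper: load an HTML string and return libxml's
// diagnostics as "line N: message" strings instead of PHP warnings.
function collectHtmlParseErrors(string $html): array
{
    // Buffer libxml errors internally so they can be read back.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Empty the buffer and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Made-up sample input, not the page from this thread.
$sample = '<p>text <b>bold</p><script>var s = "</head>";</script>';
foreach (collectHtmlParseErrors($sample) as $message) {
    echo $message, "\n";
}
```

If a message points at a line inside a script block, that is a hint the parser is reading the script body as markup.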

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

May 14 '07 #4

shimmyshack

This is what I found on Google:
http://developer.mozilla.org/en/docs...HTML_Documents
Use <![CDATA[ ... ]]>, or the "XHTML" document is no such thing. By the way,
it should not just claim to be XHTML but should properly validate as such,
including the content type text/xml+xhtml (served as .xhtml).
Once you have obtained the web page and parsed it, adding the right
instructions for the XML parser, all should work, if indeed the rest
of the doc is valid XML.

May 14 '07 #5

shimmyshack

Oops, application/xhtml+xml of course.

May 14 '07 #6

Toby A Inkster

Jerry Stuckle wrote:

> But does it validate (http://validator.w3.org)? Pages can load in
> browsers without error and still not validate. The browsers are very
> forgiving, and make a "best guess" as to what the page creator wanted.

From the excerpts posted, no. Javascript blocks in XHTML must be entity
encoded -- that is:

'&' => '&amp;'
'<' => '&lt;'

at a minimum. If not, then the document is not valid.

If a document is not valid, then DOMDocument might not be able to load it
correctly. Or rather, "correctly" is not defined, so DOMDocument is free
to interpret it however it likes!
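
As a rough illustration of the encoding described above, PHP's htmlspecialchars() produces exactly these replacements. The script body below is made up, not taken from the page under discussion. Note that this only suits documents that really are parsed as XML: an HTML parser does not decode entities inside a script element.

```php
<?php
// Illustrative script body; not from the page being discussed.
$script = 'if (a < b && c) { doc += "</head>"; }';

// ENT_NOQUOTES leaves quotes alone; &, < and > become entities.
$encoded = htmlspecialchars($script, ENT_NOQUOTES, 'UTF-8');

echo $encoded, "\n";
// prints: if (a &lt; b &amp;&amp; c) { doc += "&lt;/head&gt;"; }
```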

--
Toby A Inkster BSc (Hons) ARCS
http://tobyinkster.co.uk/
Geek of ~ HTML/SQL/Perl/PHP/Python/Apache/Linux

May 15 '07 #7

shimmyshack

Using a CDATA block means that the parser won't be tripped up by < and
so forth.
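
A small sketch of that idea, assuming the fragment is well-formed and is loaded with the XML parser (loadXML) rather than the HTML one. The fragment itself is invented for the example:

```php
<?php
// Made-up XHTML-style fragment: the script body sits in a CDATA section,
// with the CDATA markers hidden from JavaScript behind // comments.
$xml = <<<'XML'
<script type="text/javascript">
//<![CDATA[
var doc = '<html><head>';
doc += '</head>';
//]]>
</script>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml); // XML parser, not the HTML one

// The markup-like text comes back as plain character data,
// and no <head> element is created from it.
$body = $dom->documentElement->textContent;
echo $body;
```

Whether the HTML parser behind loadHTML() honours the same markers is a separate question; this only shows the XML side.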

May 15 '07 #8

loretta

The web page does not validate; however, the errors are nowhere near the
extra tags being inserted into the JavaScript at the head tag. The errors
are an unordered list somewhere in the HTML that is closed twice
and an incorrect checkbox attribute. The page validates in Tidy, with
warnings only. There is this CDATA block around all the JavaScript
functions, in a comment:
//<![CDATA[
//]]>
It seems to me that the parser is seeing the '</head>' tag in the
JavaScript variable and putting in the end script tag and body tags.
May 16 '07 #9

Jerry Stuckle

Since you haven't told us the page you're trying to load, we can't see
what the problem is.

And BTW - instead of using "something.com", which is a valid domain, you
should use "example.com" - which is reserved just for such use.

May 16 '07 #10
