
HELP: strange php behavior downloading html

Please help!

This MIGHT even be a bug in PHP!

I'll provide version numbers and site specific information (browser, OS,
and kernel versions) if others cannot reproduce this problem.

I'm running into some PHP behavior that I do not understand in PHP 5.1.2.

I need to parse the HTML from the following carefully constructed URI:
http://crenner.smugmug.com/homepage/...allery/1960121

The problem is that when PHP downloads the HTML using file_get_contents,
or any other method of opening a remote file in PHP that I have tried,
it gives me the wrong page!

This URI is supposed to yield the HTML from the page at
http://crenner.smugmug.com/gallery/1960121 , but with the "allthumbs"
version of the page, selectable from the dropdown box at the top of the
page.

The correct page is downloaded in IE, SeaMonkey, and in wget!

But when downloading in PHP, I get the HTML from the page at
http://crenner.smugmug.com/gallery/1960121 , but with the "smugmug
small" version of the page, selectable from the dropdown box at the top
of the page.

Please note that the templatechange.mg page is merely a server-side
script that takes the arguments passed to it (TemplateID and origin),
and redirects the browser to the correct version of the page at
"origin", based on the "TemplateID".

Here is how to reproduce the problem:
* Download the page with wget so that you have a copy of the correct
results:

--commandline start here--
wget "http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121" -O correct.html
--commandline end here--

* Download the same page with php 5.1.2:

--file incorrect.php start here--
<?php
print(file_get_contents("http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121"));
?>
--file incorrect.php end here--

--commandline start here--
php incorrect.php > incorrect.html
--commandline end here--

* You should now have two very different HTML files (correct.html and
incorrect.html), even though both were downloaded using the same URI!

* Open correct.html in a web browser. You will see a thumbnails-only
("allthumbs") version of a smugmug.com picture gallery.

* Open incorrect.html in a web browser. You will see a paginated
version of the same smugmug.com picture gallery ("smugmug small"), with
a larger image on the right.

I know that I could make a workaround by having my PHP scripts call wget
instead of using intrinsic functions to download the HTML. This is not
practical for me for a number of reasons, including code portability and
streamlining.

Can anyone help me with this? Based on the output I get from wget, I
know that templatechange.mg uses a 302 to redirect the browser. I also
know that PHP is following the redirect (even if it is doing so
incorrectly), because I'm not getting the contents of the
templatechange.mg file, but a different version of the gallery itself.
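
One way to see what PHP did during the redirect is to look at the raw
response headers. After an http:// call to file_get_contents(), PHP fills
the local variable $http_response_header with the header lines of every
response in the chain. The sketch below groups those lines per response so
that the 302 and the headers sent with it can be inspected. The helper name
split_redirect_hops is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper: group the flat header list that PHP exposes in
// $http_response_header into one entry per HTTP response (hop).
function split_redirect_hops(array $headerLines) {
    $hops = array();
    foreach ($headerLines as $line) {
        // each response in the chain starts with a status line,
        // for example "HTTP/1.1 302 Found"
        if (preg_match('#^HTTP/\d(?:\.\d)?\s+(\d{3})#', $line, $m)) {
            $hops[] = array('status' => (int) $m[1], 'headers' => array());
            continue;
        }
        // any other line belongs to the most recent response
        if (count($hops) > 0) {
            $hops[count($hops) - 1]['headers'][] = $line;
        }
    }
    return $hops;
}
```

Calling split_redirect_hops($http_response_header) right after the
file_get_contents() call should give one entry for the redirect and one for
the final page.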

This is driving me crazy. I can find no logical reason why PHP would
yield different results for the same URI than I get from three other
clients (SeaMonkey, IE, and wget).

I have also attached the results pages and the php script (correct.html,
incorrect.html, and incorrect.php) in php_download_strangeness.tar.bz2
(a bzip2 compressed tar archive)

- Chuck Renner

Oct 27 '06 #1
3 Replies


Rik
Chuck Renner wrote:
<snip file_get_contents() behaves unexpectedly for the OP>
First of all: no binaries with your post please.

Second:
HTTP/1.1 302 Found
Date: Fri, 27 Oct 2006 10:17:31 GMT
Server: Apache
X-Powered-By: smugmug/1.2.0
Set-Cookie: SMSESS=879c1d8a0378b8304671becdf6ff28c8; path=/; domain=.smugmug.com
Cache-Control: private, max-age=1, must-revalidate
Pragma:
Set-Cookie: Template=7; expires=Sun, 26-Nov-2006 10:17:31 GMT; path=/; domain=.smugmug.com

file_get_contents() will NOT honour these Set-Cookie headers at all: it
does not store them, and it does not send them back when it follows the
redirect. It isn't meant to do that.

If you want to do this, use cURL.
This is not a bug; it is a documented limitation.
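
To illustrate: the Template=7 cookie is set on the 302 response, and the
gallery page presumably reads it to pick the template, so a client that
follows the redirect without sending the cookie back gets the default view.
A rough sketch of doing by hand what a cookie-aware client does, turning
Set-Cookie lines into a Cookie request header, could look like this. The
helper name cookie_header_from is hypothetical, and the sketch ignores path,
domain and expiry matching.

```php
<?php
// Hypothetical helper: build the value of a "Cookie:" request header from
// the Set-Cookie lines of a response. Attributes such as path, domain and
// expires are dropped, so this only suits a redirect within one site.
function cookie_header_from(array $headerLines) {
    $pairs = array();
    foreach ($headerLines as $line) {
        // skip everything that is not a Set-Cookie header
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // keep only "name=value", the part before the first ';'
        $value = trim(substr($line, strlen('Set-Cookie:')));
        $semi = strpos($value, ';');
        $nameValue = ($semi === false) ? $value : substr($value, 0, $semi);
        $eq = strpos($nameValue, '=');
        if ($eq === false || $eq === 0) {
            continue; // malformed cookie, no usable name
        }
        $name = trim(substr($nameValue, 0, $eq));
        // a later cookie with the same name replaces the earlier one
        $pairs[$name] = $name . '=' . trim(substr($nameValue, $eq + 1));
    }
    return implode('; ', $pairs);
}
```

On PHP versions whose http stream wrapper supports the follow_location
context option, the result could be sent as a "Cookie: ..." line in the
'header' option of stream_context_create() when requesting the Location
target manually. cURL does all of this itself.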
--
Rik Wasmus
Oct 27 '06 #2

Rik wrote:
First of all: no binaries with your post please.
sorry...
Thanks. I had already spent hours in google and php documentation
before posting, and had not found that. I did not find any php
documentation on file_get_contents limitations and Set-Cookie. I'll
start looking for cURL documentation now.

Thanks again for pointing me in the right direction.

- Chuck Renner
Oct 28 '06 #3

Thanks Rik for pointing out that the HTTP headers on that redirected
page were setting and using cookies and for pointing me in the right
direction with cURL.

I had a working solution to my HTML downloading problem in less than an
hour, using cURL with PHP.

With the function I have below, I just call tempnam() to give me a
temporary filename, call my function with the uri and the results from
tempnam(), and then read the file with file_get_contents(). I then can
delete the file with unlink().

Here is the function I wrote to download a uri into a file (following
all redirects, ignoring old cookies, and passing set cookies to redirects):
<?php
// Download $uri into $fileName with cURL, following redirects and passing
// cookies set along the way to later requests. Returns true on success.
function uri_download($uri, $fileName) {
    // make a file resource and create/empty the file for writing
    $hFile = fopen($fileName, "w+");
    if ($hFile === false) {
        return false;
    }
    // make a curl resource, setting the uri as its target to open
    $curl = curl_init($uri);
    // set the file resource that curl will write the body to
    curl_setopt($curl, CURLOPT_FILE, $hFile);
    // do not let curl output the HTTP headers
    curl_setopt($curl, CURLOPT_HEADER, false);
    // let curl follow redirects
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    // enable cookie handling in memory; "" means no cookie file is read
    // (CURLOPT_COOKIEJAR needs a writable file name, not a directory like /tmp)
    curl_setopt($curl, CURLOPT_COOKIEFILE, "");
    // tell curl to mark this as a new cookie session
    curl_setopt($curl, CURLOPT_COOKIESESSION, true);
    // execute curl (download the uri to the file)
    $ok = curl_exec($curl);
    // close the curl resource before the file it writes to
    curl_close($curl);
    fclose($hFile);
    return $ok !== false;
}
?>
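
A variant that skips the temporary file is sketched below. It assumes the
curl extension is loaded and uses CURLOPT_RETURNTRANSFER so that curl_exec()
returns the body as a string. The name uri_fetch and the redirect limit of
10 are made up for this example.

```php
<?php
// Hypothetical variant of uri_download(): return the body as a string
// instead of writing it to a file. Returns false on failure.
function uri_fetch($uri) {
    $curl = curl_init($uri);
    if ($curl === false) {
        return false;
    }
    // hand the body back from curl_exec() rather than printing it
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    // follow redirects, but give up after an arbitrary limit
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 10);
    // empty string: read no cookie file, but keep cookies in memory so
    // that cookies set by a redirect are sent with the next request
    curl_setopt($curl, CURLOPT_COOKIEFILE, "");
    // string on success, false on failure
    return curl_exec($curl);
}
```

With this, something like $html = uri_fetch($uri); would replace the
tempnam(), file_get_contents() and unlink() steps, at the cost of holding
the whole page in memory.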

Oct 28 '06 #4
