HELP: strange php behavior downloading html

Please help!

This MIGHT even be a bug in PHP!

I'll provide version numbers and site specific information (browser, OS,
and kernel versions) if others cannot reproduce this problem.

I'm running into some PHP behavior that I do not understand in PHP 5.1.2.

I need to parse the HTML from the following carefully constructed URI:
http://crenner.smugmug.com/homepage/...allery/1960121

The problem is that when PHP downloads the HTML using file_get_contents,
or any other method of opening a remote file in PHP that I have tried,
it gives me the wrong page!

This URI is supposed to yield the HTML from the page at
http://crenner.smugmug.com/gallery/1960121 , but with the "allthumbs"
version of the page, selectable from the dropdown box at the top of the
page.

The correct page is downloaded in IE, SeaMonkey, and in wget!

But when downloading in PHP, I get the HTML from the page at
http://crenner.smugmug.com/gallery/1960121 , but with the "smugmug
small" version of the page, selectable from the dropdown box at the top
of the page.

Please note that the templatechange.mg page is merely a server-side
script that takes the arguments passed to it (TemplateID and origin),
and redirects the browser to the correct version of the page at
"origin", based on the "TemplateID".

Here is how to reproduce the problem:
* Download the page with wget so that you have a copy of the correct
results:

--commandline start here--
wget -O correct.html "http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121"
--commandline end here--

* Download the same page with php 5.1.2:

--file incorrect.php start here--
<?php
print(file_get_contents("http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121"));
?>
--file incorrect.php end here--

--commandline start here--
php incorrect.php > incorrect.html
--commandline end here--

* You should now have two very different HTML files (correct.html and
incorrect.html), even though both were downloaded using the same URI!

* Open correct.html in a web browser. You will see a thumbnails
("allthumbs") only version of a smugmug.com picture gallery.

* Open incorrect.html in a web browser. You will see a paginated
version of the same smugmug.com picture gallery ("smugmug small"), with
a larger image on the right.

I know that I could make a workaround by having my PHP scripts call wget
instead of using intrinsic functions to download the HTML. This is not
practical for me for a number of reasons, including code portability and
streamlining.

Can anyone help me with this? I know that the templatechange.mg uses a
302 to redirect the browser, based on the output I get from wget. I
also know that the redirect is happening in PHP (even if it is happening
incorrectly), because I'm not getting the contents of the
templatechange.mg file, but a different version of the gallery itself.

This is driving me crazy. I can find no logical reason why PHP would
yield different results for the same URI than I get in three other
clients (SeaMonkey, IE, and wget).

I have also attached the results pages and the php script (correct.html,
incorrect.html, and incorrect.php) in php_download_strangeness.tar.bz2
(a bzip2 compressed tar archive)

- Chuck Renner

Oct 27 '06 #1
Rik
Chuck Renner wrote:
<snip file_get_contents() behaves unexpected for the OP>
First of all: no binaries with your post please.

Second:
HTTP/1.1 302 Found
Date: Fri, 27 Oct 2006 10:17:31 GMT
Server: Apache
X-Powered-By: smugmug/1.2.0
Set-Cookie: SMSESS=879c1d8a0378b8304671becdf6ff28c8; path=/;
domain=.smugmug.com
Cache-Control: private, max-age=1, must-revalidate
Pragma:
Set-Cookie: Template=7; expires=Sun, 26-Nov-2006 10:17:31 GMT; path=/;
domain=.smugmug.com

file_get_contents() will NOT honour these Set-Cookie headers at all: it
follows the 302, but it does not send the cookies back on the redirected
request. It isn't meant to do that.

If you want to do this, use cURL.
This is not a bug, this is a documented limitation.
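
To illustrate what honouring those headers would involve, here is a minimal sketch that turns Set-Cookie response lines into a Cookie request header, which can then be passed to file_get_contents() through a stream context. The helper name cookie_header_from_response() is hypothetical (it is not part of PHP), and whether the gallery page picks the right template from these cookies alone is an assumption.

```php
<?php
// Hypothetical helper (not a PHP built-in): collect the name=value pair
// from every Set-Cookie header line and join the pairs into a single
// Cookie request header. Attributes such as path, domain and expires
// are ignored in this sketch.
function cookie_header_from_response(array $headers) {
    $pairs = array();
    foreach ($headers as $line) {
        // match "Set-Cookie: name=value" and stop at the first ";"
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            // a later cookie with the same name replaces an earlier one
            $pairs[$m[1]] = $m[1] . '=' . trim($m[2]);
        }
    }
    return $pairs ? 'Cookie: ' . implode('; ', $pairs) : '';
}

// Usage (commented out because it needs network access and
// allow_url_fopen; the gallery URL is the one from the original post):
// $cookie  = cookie_header_from_response($responseHeaders);
// $context = stream_context_create(array(
//     'http' => array('header' => $cookie . "\r\n"),
// ));
// $html = file_get_contents(
//     'http://crenner.smugmug.com/gallery/1960121', false, $context);
```

Doing this by hand for every redirect hop is what cURL's cookie handling automates.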
--
Rik Wasmus
Oct 27 '06 #2
Rik wrote:
> First of all: no binaries with your post please.
sorry...
> Second:
> HTTP/1.1 302 Found
> Date: Fri, 27 Oct 2006 10:17:31 GMT
> Server: Apache
> X-Powered-By: smugmug/1.2.0
> Set-Cookie: SMSESS=879c1d8a0378b8304671becdf6ff28c8; path=/;
> domain=.smugmug.com
> Cache-Control: private, max-age=1, must-revalidate
> Pragma:
> Set-Cookie: Template=7; expires=Sun, 26-Nov-2006 10:17:31 GMT; path=/;
> domain=.smugmug.com
>
> file_get_contents() will NOT honour these Set-Cookie's whatsoever. It isn't
> meant to do that.
>
> If you want to do this, use cURL.
> This is not a bug, this is a documented limitation.
Thanks. I had already spent hours in google and php documentation
before posting, and had not found that. I did not find any php
documentation on file_get_contents limitations and Set-Cookie. I'll
start looking for cURL documentation now.

Thanks again for pointing me in the right direction.

- Chuck Renner
Oct 28 '06 #3
Thanks Rik for pointing out that the HTTP headers on that redirected
page were setting and using cookies and for pointing me in the right
direction with cURL.

I had a working solution to my HTML downloading problem in less than an
hour, using cURL with PHP.

With the function I have below, I just call tempnam() to give me a
temporary filename, call my function with the uri and the results from
tempnam(), and then read the file with file_get_contents(). I then can
delete the file with unlink().
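
The steps in the paragraph above could be wrapped up roughly as follows. This is only a sketch: uri_get_contents() is a hypothetical name, and the downloader is passed in as a callable taking ($uri, $fileName), which is assumed to be the uri_download() function from this post.

```php
<?php
// Hypothetical wrapper: download $uri into a temporary file through
// $download, return the file's contents, and remove the temporary file.
// $download is any callable taking ($uri, $fileName); the default name
// 'uri_download' assumes the function from this post has been loaded.
function uri_get_contents($uri, $download = 'uri_download') {
    // reserve a uniquely named file in the system temp directory
    $tmp = tempnam(sys_get_temp_dir(), 'uri');
    if ($tmp === false) {
        return false;
    }
    // let the downloader fill the file
    call_user_func($download, $uri, $tmp);
    // read the result back, then delete the file
    $html = file_get_contents($tmp);
    unlink($tmp);
    return $html;
}

// Usage (commented out because it needs network access):
// $html = uri_get_contents('http://crenner.smugmug.com/gallery/1960121');
```

Passing the downloader as a parameter keeps the wrapper usable with any function that writes a URI to a file.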

Here is the function I wrote to download a uri into a file (following
all redirects, ignoring old cookies, and passing set cookies to redirects):
<?php
function uri_download($uri, $fileName) {
    // use cURL to download uri
    // make a curl resource, setting the uri as its target to open
    $curl = curl_init($uri);
    if ($curl === false) {
        return false;
    }
    // make a file resource and create/empty the file for writing
    $hFile = fopen($fileName, "w+");
    if ($hFile === false) {
        curl_close($curl);
        return false;
    }
    // set curl options
    // set the file resource that curl will write to
    curl_setopt($curl, CURLOPT_FILE, $hFile);
    // do not let curl output the HTTP headers
    curl_setopt($curl, CURLOPT_HEADER, false);
    // let curl follow redirects
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    // enable curl's cookie handling in memory only: an empty file name
    // loads no old cookies and writes no cookie file, but cookies set
    // by one response are still sent along with the redirected request
    curl_setopt($curl, CURLOPT_COOKIEFILE, "");
    // tell curl to mark this as a new cookie session
    curl_setopt($curl, CURLOPT_COOKIESESSION, true);
    // execute curl (download the uri to the file)
    $ok = curl_exec($curl);
    // close the curl resource
    curl_close($curl);
    // close the file resource
    fclose($hFile);
    // true if the download succeeded, false otherwise
    return $ok;
}
?>
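
If the temporary file is not needed, a variant that returns the page as a string may be simpler. The sketch below is an assumption about how that could look, not the function from this post: uri_fetch() is a hypothetical name, and it relies on CURLOPT_RETURNTRANSFER instead of CURLOPT_FILE.

```php
<?php
// Hypothetical variant of uri_download(): return the body as a string
// (or false on failure) instead of writing it to a file.
function uri_fetch($uri) {
    $curl = curl_init($uri);
    if ($curl === false) {
        return false;
    }
    // make curl_exec() return the body instead of printing it
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    // follow redirects such as the 302 from templatechange.mg
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    // an empty cookie file name turns on in-memory cookie handling,
    // so cookies set by one response are sent on the next request
    curl_setopt($curl, CURLOPT_COOKIEFILE, '');
    // the handle is released when $curl goes out of scope
    return curl_exec($curl);
}

// Usage (commented out because it needs network access):
// $html = uri_fetch('http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121');
```

This avoids the tempnam()/unlink() bookkeeping, at the cost of holding the whole page in memory.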

Oct 28 '06 #4
