Bytes IT Community

Can you prevent Googlebot from indexing PHPSESSID pages?

CAH
Hi

Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID, which makes it think my site has an
infinite number of pages. How can one avoid this?

Here is an example of a URL that Google registers, which might make it
clearer what is happening:

http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...

I do use session-registered IDs, but if I visit my site I never see those
kinds of URLs. So how does Google get hold of them?

Best regards
Mads

Apr 3 '06 #1
29 Replies


CAH wrote:
Hi

Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID, which makes it think my site has an
infinite number of pages. How can one avoid this?

Here is an example of a URL that Google registers, which might make it
clearer what is happening:

http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...

I do use session-registered IDs, but if I visit my site I never see those
kinds of URLs. So how does Google get hold of them?

Best regards
Mads


http://www.php.net/manual/en/ref.session.php

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Apr 3 '06 #2

CAH wrote:
Hi

Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID, which makes it think my site has an
infinite number of pages. How can one avoid this?

Well, one way to handle this is to check the User-Agent header to see
if the client is Googlebot and, if so, not enable sessions. Obviously, if a
page depends on session state, it ceases to be indexable.
Here is an example of a URL that Google registers, which might make it
clearer what is happening:

http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...

I do use session-registered IDs, but if I visit my site I never see those
kinds of URLs. So how does Google get hold of them?


If session.use_trans_sid is enabled, PHP tries to compensate for
the lack of a cookie by inserting the session id into any link it can.

I think you have quite a problem on your hands. Once those links are in
Google's database, the bot will keep returning to them. You'll need to
detect the condition and tell Googlebot to buzz off so it doesn't eat
up your bandwidth quota.
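One way to do that detection and steer crawlers back to clean URLs (a minimal, untested sketch, not from this thread; the `strip_sid()` helper is invented for this example) is to 301-redirect any request that still carries PHPSESSID in its query string:

```php
<?php
// Minimal sketch: if the request URL carries a PHPSESSID parameter,
// permanently redirect to the same URL without it, so crawlers
// eventually drop the stale links.
// strip_sid() is a helper made up for this example.
function strip_sid($query, $sid_name = 'PHPSESSID')
{
    parse_str($query, $params);   // query string -> associative array
    unset($params[$sid_name]);    // drop the session id parameter
    return http_build_query($params);
}

if (isset($_GET['PHPSESSID'])) {
    $clean  = strip_sid($_SERVER['QUERY_STRING']);
    $target = $_SERVER['PHP_SELF'] . ($clean !== '' ? '?' . $clean : '');
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $target);
    exit;
}
```

A 301 (rather than 302) matters here: it tells the crawler to replace the stored URL with the clean one.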

Apr 3 '06 #3

CAH

CAH wrote:
Hi

Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID, which makes it think my site has an
infinite number of pages. How can one avoid this?

Here is an example of a URL that Google registers, which might make it
clearer what is happening:

http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...

I do use session-registered IDs, but if I visit my site I never see those
kinds of URLs. So how does Google get hold of them?

Best regards
Mads


I am now testing this as a solution:

"Using .htaccess, you often need to put the following two lines in the
.htaccess file, if your host runs PHP as an Apache module:

php_value session.use_only_cookies 1
php_value session.use_trans_sid 0"

The downside is that my site now only works when the user has cookies
enabled, and I am still not sure whether this will do the trick.
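A side note on those two lines: `php_value` in .htaccess only takes effect when PHP runs as an Apache module; where PHP runs as CGI/FastCGI, the same two settings can usually be applied at runtime instead (a sketch, assuming the host allows these ini values to be changed from a script), before `session_start()`:

```php
<?php
// Runtime equivalent of the two .htaccess lines.
// Must run before session_start() to have any effect.
ini_set('session.use_only_cookies', '1'); // never accept a session id from the URL
ini_set('session.use_trans_sid', '0');    // never rewrite links to append PHPSESSID
session_start();
```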

Apr 3 '06 #4

CAH wrote:

CAH wrote:
Hi

Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID, which makes it think my site has an
infinite number of pages. How can one avoid this?

Here is an example of a URL that Google registers, which might make it
clearer what is happening:

http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...

I do use session-registered IDs, but if I visit my site I never see those
kinds of URLs. So how does Google get hold of them?

Best regards
Mads

I am now testing this as a solution:

"Using .htaccess, you often need to put the following two lines in the
.htaccess file, if your host runs PHP as an Apache module:

php_value session.use_only_cookies 1
php_value session.use_trans_sid 0"

The downside is that my site now only works when the user has cookies
enabled, and I am still not sure whether this will do the trick.


IIRC, Google and other search engines look for a file called robots.txt that
gives directives on what they can and cannot index. Do a Google search for
robots.txt to see... (to verify, look in your web server log files; it
shows up as a request in my Apache log files...)

If your robots.txt includes the following directive, it will skip the
entire site.

User-agent: *
Disallow: *

or, to limit the scope of its search:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: *.php

Apr 3 '06 #5

CAH
> IIRC, Google and other search engines look for a file called robots.txt that
gives directives on what they can and cannot index. Do a Google search for
robots.txt to see... (to verify, look in your web server log files; it
shows up as a request in my Apache log files...)

If your robots.txt includes the following directive - it will skip the
entire site.

User-agent: *
Disallow: *

or, to limit the scope of its search:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: *.php


I was testing this robots.txt:

User-agent: Googlebot
Disallow: /*PHPSESSID

And that might solve it; I just do not know whether it works or not.

Mads

Apr 3 '06 #6

On Mon, 2006-04-03 at 01:20 -0700, CAH wrote:
Hi

Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID, which makes it think my site has an
infinite number of pages. How can one avoid this?

Here is an example of a URL that Google registers, which might make it
clearer what is happening:

http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...

I do use session-registered IDs, but if I visit my site I never see those
kinds of URLs. So how does Google get hold of them?

Best regards
Mads


There was some discussion of forcing cookies, but the author didn't want
to limit his users, so...

How about doing something like this:

// See if the user agent is Googlebot
$isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');

// If it is, use ini_set to only allow cookies for the session variable
if ($isGoogle !== false) {
    ini_set('session.use_only_cookies', '1');
}

Apr 4 '06 #7

CAH
> There was some discussion of forcing cookies, but the author didn't want
to limit his users, so...

How about doing something like this:

// See if the user agent is Googlebot
$isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');

// If it is, use ini_set to only allow cookies for the session variable
if ($isGoogle !== false) {
    ini_set('session.use_only_cookies', '1');
}


That is a cool solution, but can one be sure of recognizing
Googlebot? And what about all the other robots? Could one make an "is not a
robot" test?

Thanks for the help
Mads

Apr 4 '06 #8

On Mon, 2006-04-03 at 23:57 -0700, CAH wrote:
There was some discussion of forcing cookies, but the author didn't want
to limit his users, so...

How about doing something like this:

// See if the user agent is Googlebot
$isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');

// If it is, use ini_set to only allow cookies for the session variable
if ($isGoogle !== false) {
    ini_set('session.use_only_cookies', '1');
}


That is a cool solution, but can one be sure of recognizing
Googlebot? And what about all the other robots? Could one make an "is not a
robot" test?

Thanks for the help
Mads


I wouldn't expect all (or even most) robots to be easily identified by
the user-agent. Maybe you could make an array of the most common ones
(Googlebot, Inktomi, etc) and loop through it with the logic I
suggested. I also don't think you could check to see if it's a browser,
because firewalls & proxy servers may not send that information through.

Sorry! (It's not my internet. I just work here!)

Scott

Apr 4 '06 #9

CAH
> I wouldn't expect all (or even most) robots to be easily identified by
the user-agent. Maybe you could make an array of the most common ones
(Googlebot, Inktomi, etc) and loop through it with the logic I
suggested. I also don't think you could check to see if it's a browser,
because firewalls & proxy servers may not send that information through.


I see what you mean.

Do you think this solution will work?

"Using .htaccess, you often need to put the following two lines in the
.htaccess file, if your host runs PHP as an Apache module:

php_value session.use_only_cookies 1
php_value session.use_trans_sid 0"

I think it does, and even though you then have to rely on cookies, I
think it is the better solution, because today that is a small minus
compared to search-engine problems.

If this solution works:

User-agent: Googlebot
Disallow: /*PHPSESSID

it would be by far the simplest; I do, however, not feel too sure that it
works, and I have no opportunity to check it at this time.

Regards
Mads

Apr 4 '06 #10

CAH wrote:
Hi

Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID, which makes it think my site has an
infinite number of pages. How can one avoid this?

Here is an example of a URL that Google registers, which might make it
clearer what is happening:

http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...


Such a change in session id shouldn't happen on a normal site. Also,
AFAIK Google will remove the PHPSESSID from the URL (after crawling?).

FWIW, <news:11**********************@g43g2000cwa.googlegroups.com> (
http://groups.google.com/group/comp....7bb41576afe16d )

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Apr 4 '06 #11

CAH
> > http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...
Such a change in session id shouldn't happen on a normal site.


Why not? I would think a session ID should be unique. If you think I am
doing something wrong, what could that be then?

Also, AFAIK Google will remove the PHPSESSID from URL (after crawling(?)).


You can try this search in Google: site:www.winches.dk

or click here:

http://www.google.com/search?q=site:...e=off&filter=0

Look at the last 100 entries or so.

Best regards
mads

Apr 4 '06 #12

CAH
> There was some discussion of forcing cookies, but the author didn't want
to limit his users, so...

How about doing something like this:

// See if the user agent is Googlebot
$isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');

// If it is, use ini_set to only allow cookies for the session variable
if ($isGoogle !== false) {
ini_set('session.use_only_cookies', '1');
}


Hi Scott

The solution you have come up with is the cool one. How can I test whether
my host allows me to set ini_set('session.use_only_cookies', '1')
the way you suggest in your code? Can this be done on any host?
Any ideas on how to check if the user agent is Googlebot?

Thanks for the suggestions. I must say I have had my first encounter
with cookie problems, so I would like to get the PHPSESSID back into the
URL.

Best regards
Mads

Apr 6 '06 #13

CAH
> If this solution works:

User-agent: Googlebot
Disallow: /*PHPSESSID

it would be by far the simplest; I do, however, not feel too sure that it
works, and I have no opportunity to check it at this time.


PHPSESSID URLs restricted by robots.txt:

In Google Sitemaps (beta) I can see 10 URLs restricted by robots.txt, and
that is with the above robots text. So I guess that might do the trick.
Do you think this is an indication that the above robots text is
enough?

Cah

Apr 6 '06 #14

CAH wrote:
http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...
Such a change in session id shouldn't happen on a normal site.


Why not? I would think a session ID should be unique. If you think I am
doing something wrong, what could that be then?


It shouldn't happen within a single session: the session id remains the
same for a single session unless:
1. The crawler is returning and caching over multiple runs
2. You have used session_regenerate_id()
3. There are random absolute links pointing from your site to your
site (instead of relative links)
Also,
AFAIK Google will remove the PHPSESSID from URL (after crawling(?)).


You can try this search in Google: site:www.winches.dk

or click here:

http://www.google.com/search?q=site:...e=off&filter=0

Look at the last 100 entries or so.


It doesn't seem to strip the session id as I thought. If your site's
contents don't rely on sessions (for non-members), you may safely turn
off trans sid
<news:11**********************@l41g2000cwc.googlegroups.com> (
http://groups.google.com/group/comp....24f27f2b7ac610 )
-- you can even selectively turn it off only for the crawler by sniffing
the user-agent string and/or IP.

But if your site depends on sessions (for non-members, and hence for the
crawler) and you'd like to enable sessions for the crawler but don't want
the trans sid, you need to go for some other hack. If that is your
situation, I may help you with the hack.

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Apr 7 '06 #15

CAH

R. Rajesh Jeba Anbiah wrote:
CAH wrote:
> http://www.winches.dk/winches.php?ar...6f0d46334659ff...
> http://www.winches.dk/winches.php?ar...b6aed41fc142ea...

Such a change in session id shouldn't happen on a normal site.
Why not? I would think a session ID should be unique. If you think I am
doing something wrong, what could that be then?


It shouldn't happen within a single session: the session id remains the
same for a single session unless:
1. The crawler is returning and caching over multiple runs


I would think this is what happens.
2. You have used session_regenerate_id()
3. There are random absolute links pointing from your site to your
site (instead of relative links)
Also,
AFAIK Google will remove the PHPSESSID from URL (after crawling(?)).
You can try this search in Google: site:www.winches.dk

or click here:

http://www.google.com/search?q=site:...e=off&filter=0

Look at the last 100 entries or so.


It doesn't seem to strip the session id as I thought. If your site's
contents don't rely on sessions (for non-members), you may safely turn
off trans sid
<news:11**********************@l41g2000cwc.googlegroups.com> (
http://groups.google.com/group/comp....24f27f2b7ac610 )
-- you can even selectively turn it off only for the crawler by sniffing
the user-agent string and/or IP.

But, if your site depends on sessions (for non-members and hence the
crawler)


it does depend on sessions for non-members

and you'd like to enable sessions for the crawler but don't want the
trans sid, you need to go for some other hack. If that is your
situation, I may help you with the hack.
Thanks, that is very kind of you. I think the robots text might do
the trick, and then no further tricks or hacks should be needed. But I
am following Google closely. Now, session.use_trans_sid: what does that
do? Does it not turn off session ids in URLs and force cookies on the users?
I found this at another site:

if (strpos($_SERVER['HTTP_USER_AGENT'], "google") !== false or
    strpos($_SERVER['HTTP_USER_AGENT'], "MSIECrawler") !== false)
{
    ini_set("url_rewriter.tags", "");
}

http://www.mtdev.com/2002/06/why-you...use_trans_sid/

But I have not tested it.



Apr 7 '06 #16

CAH wrote:
R. Rajesh Jeba Anbiah wrote: <snip>
It doesn't seem to strip the session id as I thought. If your site's
contents don't rely on sessions (for non-members), you may safely turn
off trans sid
<news:11**********************@l41g2000cwc.googlegroups.com> (
http://groups.google.com/group/comp....24f27f2b7ac610 )
-- you can even selectively turn it off only for the crawler by sniffing
the user-agent string and/or IP.

But, if your site depends on sessions (for non-members and hence the
crawler)


it does depend on sessions for non-members


I doubt that. Anyway, you had better check again whether it depends on
sessions.
and you'd like to enable sessions for the crawler but don't want the
trans sid, you need to go for some other hack. If that is your
situation, I may help you with the hack.
<snip>
I found this at another site:

if (strpos($_SERVER['HTTP_USER_AGENT'], "google") !== false or
    strpos($_SERVER['HTTP_USER_AGENT'], "MSIECrawler") !== false)
{
    ini_set("url_rewriter.tags", "");
}

http://www.mtdev.com/2002/06/why-you...use_trans_sid/


Actually, you're turning off trans sid (see my link above) and thereby
you're turning off sessions for the crawler. But you said your site
needs sessions for the crawler. So here goes my untested, to-be-improved,
quick and dirty hack:

<?php
/* Crawler SID removal hack: begin --------*/
/* Hack code should be placed at the top of every accessible script,
 * or in a global common file, say header.php.
 * Important assumption: the crawler indexes the final redirected URI. */

/**
 * Test if the request is from the crawler
 *
 * @return boolean
 * @todo implement it, or google for hundreds of examples
 **/
function IsCrawler()
{
    return true;
}

if (IsCrawler())
{
    define('CRAWLER_SID_FILE', 'crawler_sid.txt');

    if (isset($_GET[session_name()])) // Is the session id in the query string?
    {
        $tmp_get = $_GET;
        unset($tmp_get[session_name()]); // remove the session id
        // now rebuild the query string...
        $new_get = http_build_query($tmp_get);
        $default_ports = array('https' => 443, 'http' => 80);
        $prefix = (!empty($_SERVER['HTTPS']) ? 'https' : 'http');
        // note: any non-default port number belongs after the host name
        $current_url = $prefix . '://' . $_SERVER['HTTP_HOST']
            . (($_SERVER['SERVER_PORT'] != $default_ports[$prefix])
                ? ':' . $_SERVER['SERVER_PORT'] : '')
            . $_SERVER['PHP_SELF'];
        // redirect to self, but with the SID removed
        header('Location: ' . $current_url . '?' . $new_get);
        exit;
    }
    else // SID not found (page got redirected), so set/load the crawler's session id
    {
        // the crawler's session id is actually kept in CRAWLER_SID_FILE
        if (($crawler_sid = @file_get_contents(CRAWLER_SID_FILE)) !== false)
            session_id($crawler_sid);
    }
}
/*---- break: Crawler SID removal hack */
// normal code...
session_start();
/* Crawler SID removal hack: continue ----*/
if (IsCrawler())
{
    // safely store the crawler's session id in CRAWLER_SID_FILE
    file_put_contents(CRAWLER_SID_FILE, session_id());
}
/*---- end: Crawler SID removal hack */

// now, again normal code...
// ...
// testing a link...
echo '<p><a href="' . $_SERVER['PHP_SELF'] . '?a=100&b=50&c=5">Test link</a></p>';
$_SESSION['test'] = !isset($_SESSION['test']) ? 0 : ($_SESSION['test'] + 1);
echo '<p>' . $_SESSION['test'] . '</p>';
?>
--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Apr 11 '06 #17

This is not the way to do it. This will prevent Google from seeing any
page with a session id. Bad, bad, bad. You may as well just disallow
Google altogether.

Apr 11 '06 #18

I think the best solution to this problem is to detect Googlebot (plus any
other bots: Yahoo Slurp, msnbot, ...) and then set the session id to a
known value using session_id(). That way you always have the session
when you expect it, regardless of the current user-agent. Works a treat
for me.
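Roughly, the idea could be sketched like this (untested; the bot list, the `bot_session_id()` helper name, and the fixed id value are all illustrative assumptions, not fletch's actual code, which appears later in the thread):

```php
<?php
// Rough sketch: known crawlers get one fixed, well-formed session id,
// set *before* session_start(), so the SID never varies between crawls
// and no new session files pile up per crawler request.
// The bot list and the id value are assumptions for illustration.
function bot_session_id($user_agent)
{
    $bots = array('googlebot', 'slurp', 'msnbot');
    $ua = strtolower((string) $user_agent);
    foreach ($bots as $bot) {
        if (strpos($ua, $bot) !== false) {
            return str_repeat('0', 24) . 'deadbeef'; // 32 chars, like a real SID
        }
    }
    return null; // not a known bot: let PHP generate an id as usual
}

$ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$sid = bot_session_id($ua);
if ($sid !== null) {
    session_id($sid); // must be called before session_start()
}
session_start();
```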

Apr 11 '06 #19

fletch wrote:
I think the best solution to this problem is to detect Googlebot (plus any
other bots: Yahoo Slurp, msnbot, ...) and then set the session id to a
known value using session_id(). That way you always have the session
when you expect it, regardless of the current user-agent. Works a treat
for me.


This is cool. I was thinking about a similar hack after posting mine,
as the previous one did add the SID to the links on the page (which will
be a problem if users visit via cached pages).

So, I'm thinking of turning off trans sid for the crawler, keeping
the session id on the server, and setting it via session_id().

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Apr 12 '06 #20

CAH
Thanks a lot for all the comments and code; it is great that you take
an interest. I believe session IDs are also a problem with regard to
validation of a site, so there are good reasons to find a solid and
simple solution.
But, if your site depends on sessions (for non-members and hence the
crawler)

it does depend on sessions for non-members


I doubt that. Anyway, you had better check again whether it depends on
sessions.


Well, there is no login on my site; there are no members. The session id
is used to transfer information from one page to the next in an
electronic guide that is made with forms.
Actually, you're turning off trans sid (see my link above) and thereby
you're turning off sessions for the crawler.

I still do not understand the difference between trans sid and turning
sid off.

What is the difference between these two?

php_value session.use_only_cookies
php_value session.use_trans_sid
But, you said your site needs sessions for the crawler.

I am sorry to have given that impression; I do not need sessions
for the crawler. But I would not like to turn off session ids in the URL,
because there are still users who do not like getting cookies on their
computers.

And here goes my untested, to-be-improved, quick and dirty hack:

<?php
/* Crawler SID removal hack: begin --------*/
/* Hack code should be placed at the top of every accessible script,
 * or in a global common file, say header.php.
 * Important assumption: the crawler indexes the final redirected URI. */

/**
 * Test if the request is from the crawler
 *
 * @return boolean
 * @todo implement it, or google for hundreds of examples
 **/
function IsCrawler()
{
    return true;
}
Does the above code test whether it is a crawler?

if (IsCrawler())
{
    define('CRAWLER_SID_FILE', 'crawler_sid.txt');

    if (isset($_GET[session_name()])) // Is the session id in the query string?
    {
        $tmp_get = $_GET;
        unset($tmp_get[session_name()]); // remove the session id
        // now rebuild the query string...
        $new_get = http_build_query($tmp_get);
        $default_ports = array('https' => 443, 'http' => 80);
        $prefix = (!empty($_SERVER['HTTPS']) ? 'https' : 'http');
        // note: any non-default port number belongs after the host name
        $current_url = $prefix . '://' . $_SERVER['HTTP_HOST']
            . (($_SERVER['SERVER_PORT'] != $default_ports[$prefix])
                ? ':' . $_SERVER['SERVER_PORT'] : '')
            . $_SERVER['PHP_SELF'];
        // redirect to self, but with the SID removed
        header('Location: ' . $current_url . '?' . $new_get);
        exit;
    }
    else // SID not found (page got redirected), so set/load the crawler's session id
    {
        // the crawler's session id is actually kept in CRAWLER_SID_FILE
        if (($crawler_sid = @file_get_contents(CRAWLER_SID_FILE)) !== false)
            session_id($crawler_sid);
    }
}
/*---- break: Crawler SID removal hack */
// normal code...
session_start();
/* Crawler SID removal hack: continue ----*/
if (IsCrawler())
{
    // safely store the crawler's session id in CRAWLER_SID_FILE
    file_put_contents(CRAWLER_SID_FILE, session_id());
}
/*---- end: Crawler SID removal hack */

// now, again normal code...
// ...
// testing a link...
echo '<p><a href="' . $_SERVER['PHP_SELF'] . '?a=100&b=50&c=5">Test link</a></p>';
$_SESSION['test'] = !isset($_SESSION['test']) ? 0 : ($_SESSION['test'] + 1);
echo '<p>' . $_SESSION['test'] . '</p>';
?>


I am going to have to read this a few times to get it; I am in the
beginner class. But I will look more closely into it.

But how about the robots.txt solution? It seems simple, and I think it
works.

User-agent: Googlebot
Disallow: /*PHPSESSID

But I cannot say for sure; I have to wait and see if Google removes
the old listings.

Once again, thanks for the help and code.

Best regards
Mads Larsen

Apr 12 '06 #21

CAH

fletch wrote:
I think the best solution to this problem is to detect Googlebot (plus any
other bots: Yahoo Slurp, msnbot, ...) and then set the session id to a
known value using session_id(). That way you always have the session
when you expect it, regardless of the current user-agent. Works a treat
for me.


Hi Fletch

Could you post the code showing how you do this? And have you tried a
robots.txt solution?

Best regards
Mads

Apr 12 '06 #22

I'll give you a bit, but not all, because it belongs to the company I
work for. I know he won't mind if I give a bit, but only if I plug the
company as well: contact Zedcore Systems Ltd for PHP development!

if (isset($_SERVER['HTTP_USER_AGENT']))
{
    $aSpiderUAs = array(
        16 => 'abachobot', 11 => 'abcdatos', 3 => 'altavista', 9 => 'antibot',
        29 => 'arach', 6 => 'archiver', 7 => 'asterias', 8 => 'atomz',
        4 => 'crawler', 10 => 'ezresult', 2 => 'fast', 12 => 'ferret',
        13 => 'find', 14 => 'fireball', 15 => 'geckobot', 1 => 'google',
        17 => 'gulliver', 18 => 'hubater', 19 => 'incywincy', 20 => 'infoseek',
        21 => 'mercator', 38 => 'msnbot', 22 => 'nazilla', 23 => 'roach',
        24 => 'robot', 25 => 'scooter', 26 => 'scrubby', 27 => 'sightquest',
        28 => 'slurp', 5 => 'spider', 30 => 'spyder', 31 => 'teoma',
        32 => 'toutatis', 33 => 'ultraseek', 34 => 'webrefiner', 35 => 'wscbot',
        36 => 'yandex', 37 => 'zyborg');
    $sHTTPUserAgent = strtolower($_SERVER['HTTP_USER_AGENT']);
    foreach ($aSpiderUAs as $iPos => $sSpiderUA)
    {
        if (strpos($sHTTPUserAgent, $sSpiderUA) !== false)
        {
            $iSpiderID = $iPos;
            $sSID = $iSpiderID . '00000000';
            $bCaughtASpider = true;
        }
    }
}

if (!$bCaughtASpider)
{
    do {
        // An MD5 hash is 32 chars, the same as a PHP-generated SID
        $sNewSID = md5(mt_rand() . mt_rand());
    } while (file_exists(ini_get('session.save_path') . '/sess_' . $sNewSID));
    $sSID = $sNewSID;
}

The rest is left as an exercise for the reader!

Apr 12 '06 #23

I don't see robots.txt as a solution. You are telling Google not
to index pages; you don't want that. The more Google indexes, the better.

Apr 12 '06 #24

On 12 Apr 2006 03:13:01 -0700,
fletch wrote:

[..]
foreach($aSpiderUAs as $iPos=>$sSpiderUA) {
if (strpos($sHTTPUserAgent,$sSpiderUA)!==false)
{
$iSpiderID = $iPos;
$sSID = $iSpiderID.'00000000';
$bCaughtASpider = true;
}
}
}


Just an off-topic note:

You really want to break out of the foreach loop once you have found a
spider, so that you don't loop through the whole array after a
match (that array may become really big with time).

Also, you are overwriting $iSpiderID, so if, for example, the AltaVista user
agent were "altavista spider", you'd get the $iSpiderID of "spider", NOT
of "altavista".

it should be like this:

foreach ($aSpiderUAs as $iPos => $sSpiderUA)
{
    if (strpos($sHTTPUserAgent, $sSpiderUA) !== false)
    {
        $iSpiderID = $iPos;
        $sSID = $iSpiderID . '00000000';
        $bCaughtASpider = true;
        break;
    }
}

You should also initialize $bCaughtASpider to false (you may already be
doing it, but you haven't pasted that bit), as it's always good to
initialize variables.
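Putting the two notes together, the corrected detection could look like this self-contained sketch (the user-agent list is a shortened, illustrative subset of fletch's, and the fallback UA string is only there so the snippet runs standalone):

```php
<?php
// Corrected detection loop: flag initialized up front, break on first match.
// The UA fragments below are an illustrative subset, not the full list.
$aSpiderUAs = array(1 => 'google', 28 => 'slurp', 38 => 'msnbot');

$bCaughtASpider = false; // always initialize the flag
$iSpiderID = 0;
$sSID = '';
$sHTTPUserAgent = strtolower(isset($_SERVER['HTTP_USER_AGENT'])
    ? $_SERVER['HTTP_USER_AGENT']
    : 'Mozilla/5.0 (compatible; Googlebot/2.1)'); // fallback for CLI runs

foreach ($aSpiderUAs as $iPos => $sSpiderUA) {
    if (strpos($sHTTPUserAgent, $sSpiderUA) !== false) {
        $iSpiderID = $iPos;
        $sSID = $iSpiderID . '00000000';
        $bCaughtASpider = true;
        break; // stop at the first match
    }
}
```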
--
Juan José Gutiérrez de Quevedo
Director Técnico (ju****@iteisa.com)
ITEISA (http://www.iteisa.com)
942544036 - 637447953
Apr 12 '06 #25

Cheers! I will fix this. Very useful.

Apr 12 '06 #26

CAH

fletch wrote:
I don't see robots.txt as a solution. You are telling Google not
to index pages; you don't want that. The more Google indexes, the better.


Well, the suggested text for robots.txt should not remove any pages,
but only the session ID from the URL. I have, however, only found one
place where it is claimed to work; but if it works as claimed, then
I cannot find any downside to this suggestion.

User-agent: Googlebot
Disallow: /*PHPSESSID

Apr 12 '06 #27

CAH
fletch wrote:
I'll give you a bit, but not all, because it belongs to the company I
work for. I know he won't mind if I give a bit, but only if I plug the
company as well: contact Zedcore Systems Ltd for PHP development!


Hi Fletch

Thanks a lot; it is kind of you to share this.

I have put a noindex on my troublesome pages, since I am currently engaged
in another project. But I will return to the solutions suggested here
later on and see if I can put something usable together.

Thanks to all for the help and suggestions

Best regards
Mads

Apr 12 '06 #28

Honestly, this is designed to disallow Google from indexing pages
containing PHPSESSID, but it won't work anyway, because robots.txt doesn't
support globbing.

From http://www.robotstxt.org/wc/exclusion-admin.html:
Note also that regular expression are not supported in either
the User-agent or Disallow lines. The '*' in the User-agent field
is a special value meaning "any robot". Specifically, you cannot
have lines like "Disallow: /tmp/*" or "Disallow: *.gif".


Apr 12 '06 #29

CAH

fletch wrote:
Honestly, this is designed to disallow Google from indexing pages
containing PHPSESSID, but it won't work anyway, because robots.txt doesn't
support globbing.


I found this

Additionally, Google has introduced increased flexibility to the
robots.txt file standard through the use of asterisks. Disallow patterns
may include "*" to match any sequence of characters, and patterns may
end in "$" to indicate the end of a name.

here

http://www.google.com/webmasters/rem...#exclude_pages

But this only works for Google; however, session ids seem to be mainly a
problem with Google. And then there is still the problem of
validating a site that uses session ids.

Apr 12 '06 #30

This discussion thread is closed

Replies have been disabled for this discussion.