Can you avoid that googlebot indexes PHPSESSID pages?
Hi
Can you keep Googlebot from indexing PHPSESSID pages? Googlebot is
indexing pages with PHPSESSID in the URL, which makes it think my site has an
infinite number of pages. How can one avoid this?
Here is an example of the URLs that Google registers, which might make it
clearer what is happening:
http://www.winches.dk/winches.php?ar...6f0d46334659ff...
http://www.winches.dk/winches.php?ar...b6aed41fc142ea...
I do use sessions with a registered ID, but when I visit my site I never see those
kinds of URLs. So how does Google get hold of them?
Best regards
Mads
re: Can you avoid that googlebot indexes PHPSESSID pages?
CAH wrote:[color=blue]
> Hi
>
> Can you avoid that googlebot indexes PHPSESSID pages? Googlebot is
> indexing pages with PHPSESSID, which makes it think my page has a
> infinite number of pages. How can one avoid this?[/color]
Well, one way to handle this is to check the User-Agent header to see
if the client is Googlebot and, if so, not start a session. Obviously, if a page
depends on the session, it then ceases to be indexable.
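A minimal sketch of that idea, assuming a hypothetical helper named is_googlebot() and a plain substring check on the User-Agent header (which any client can fake, so treat it as a hint rather than proof):

```php
<?php
// Hypothetical helper: true if the User-Agent string mentions Googlebot.
function is_googlebot(?string $userAgent): bool
{
    return $userAgent !== null && stripos($userAgent, 'Googlebot') !== false;
}

// The header may be absent, so fall back to null.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? null;

// Start a session only for web requests that do not look like Googlebot.
// The CLI check simply keeps this sketch quiet when run from a shell.
if (PHP_SAPI !== 'cli' && !is_googlebot($userAgent) && session_status() === PHP_SESSION_NONE) {
    session_start();
}
```

Pages that read $_SESSION would still need to cope with the session being absent for the crawler.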
[color=blue]
> Here is an exsample of url that google register, that might make is
> more clear what is happening
>
> http://www.winches.dk/winches.php?ar...6f0d46334659ff...
> http://www.winches.dk/winches.php?ar...b6aed41fc142ea...
>
> I do use session registred ID, but if I visit my site I never see those
> kind of urls? So how come google gets a hold of them?[/color]
If session.use_trans_sid is enabled, PHP tries to compensate for
the lack of a cookie by inserting the session ID into any links it can rewrite.
Your own browser accepts the session cookie, so you never see the ID in the URL;
a crawler that does not send cookies back gets the rewritten links instead.
I think you have quite a problem on your hands. Once those links are in
Google's database, the bot will keep returning to them. You'll need to
detect the condition and tell Googlebot to buzz off so it doesn't eat
up your bandwidth quota.
re: Can you avoid that googlebot indexes PHPSESSID pages?
I am now testing this as a solution:
"Using .htaccess: often, you need to put the following two lines in the
.htaccess file, if your host is using PHP as an Apache module:
php_value session.use_only_cookies 1
php_value session.use_trans_sid 0"
The downside is that my site now only works when the user has cookies
enabled, and I am still not sure whether this will do the trick.
re: Can you avoid that googlebot indexes PHPSESSID pages?
CAH wrote:
[color=blue]
> I am now testing this as a solution[/color]
[color=blue]
> "Using .htaccess often, you need to put the following two lines in the
> ..htaccess file, if your host is using PHP as an Apache module:[/color]
[color=blue]
> php_value session.use_only_cookies 1
> php_value session.use_trans_sid 0 "[/color]
[color=blue]
> The downside is my site now only functions when user has cookies
> enabled, and I am still not sure whethers this will do the trick.[/color]
IIRC, Google and other search engines look for a file called robots.txt that gives
directives on what they can and cannot crawl. Do a Google search for
robots.txt to see... (to verify, look in your web server log files; it
does show up as a request in my Apache log files...)
If your robots.txt includes the following directives, crawlers will skip the
entire site:
User-agent: *
Disallow: /
Or, to limit the scope of the crawl:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /*.php
Note that the wildcard in the last line is not part of the original robots.txt
standard, so only crawlers that support that extension will honour it.
re: Can you avoid that googlebot indexes PHPSESSID pages?
> IIRC, google and other sites search for a file called robots.txt that give[color=blue]
> directives on what it can and cannot index. Do a google search for
> robots.txt to see... (to verify, look in your webserver log files - it
> does show up as a request in my apache log files...)
>
> If your robots.txt includes the following directive - it will skip the
> entire site.
>
> User-agent: *
> Disallow: *
>
> or to limit the scope of it's search:
> User-agent: *
> Disallow: /cgi-bin/
> Disallow: /images/
> Disallow: *.php[/color]
I was testing this robots.txt:
User-agent: Googlebot
Disallow: /*PHPSESSID
And that might solve it; I just do not know whether it works or not.
Mads
re: Can you avoid that googlebot indexes PHPSESSID pages?
On Mon, 2006-04-03 at 01:20 -0700, CAH wrote:[color=blue]
> Hi
>
> Can you avoid that googlebot indexes PHPSESSID pages? Googlebot is
> indexing pages with PHPSESSID, which makes it think my page has a
> infinite number of pages. How can one avoid this?
>
> Here is an exsample of url that google register, that might make is
> more clear what is happening
>
> http://www.winches.dk/winches.php?ar...6f0d46334659ff...
> http://www.winches.dk/winches.php?ar...b6aed41fc142ea...
>
> I do use session registred ID, but if I visit my site I never see those
> kind of urls? So how come google gets a hold of them?
>
> Best regards
> Mads
>[/color]
There was some discussion of forcing cookies, but the author didn't want
to limit his users, so...
How about doing something like this:
// See if the user agent is Googlebot (the header may be missing entirely)
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isGoogle = stripos($userAgent, 'Googlebot');
// If it is, use ini_set to only allow cookies for the session ID.
// This has to run before session_start().
if ($isGoogle !== false) {
    ini_set('session.use_only_cookies', '1');
}
re: Can you avoid that googlebot indexes PHPSESSID pages?
> There was some discussion of forcing cookies, but the author didn't want[color=blue]
> to limit his users, so...
>
> How about doing something like this:
>
> // See if the user agent is Googlebot
> $isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');
>
> // If it is, use ini_set to only allow cookies for the session variable
> if ($isGoogle !== false) {
> ini_set('session.use_only_cookies', '1');
> }[/color]
That is a cool solution, but can one be sure of recognizing
Googlebot? And what about all the other robots? Could one make an "is not
a robot" test?
Thanks for the help
Mads
re: Can you avoid that googlebot indexes PHPSESSID pages?
On Mon, 2006-04-03 at 23:57 -0700, CAH wrote:[color=blue][color=green]
> > There was some discussion of forcing cookies, but the author didn't want
> > to limit his users, so...
> >
> > How about doing something like this:
> >
> > // See if the user agent is Googlebot
> > $isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');
> >
> > // If it is, use ini_set to only allow cookies for the session variable
> > if ($isGoogle !== false) {
> > ini_set('session.use_only_cookies', '1');
> > }[/color]
>
> That is a cool solution, but can one be sure that one can reconize
> googlebot? And how about all the other robots? Could one make a "is not
> robot test"?
>
> Thanks for the help
> Mads
>[/color]
I wouldn't expect all (or even most) robots to be easily identified by
the user-agent. Maybe you could make an array of the most common ones
(Googlebot, Inktomi, etc) and loop through it with the logic I
suggested. I also don't think you could check to see if it's a browser,
because firewalls & proxy servers may not send that information through.
Sorry! (It's not my internet. I just work here!)
Scott
re: Can you avoid that googlebot indexes PHPSESSID pages?
> I wouldn't expect all (or even most) robots to be easily identified
by[color=blue]
> the user-agent. Maybe you could make an array of the most common ones
> (Googlebot, Inktomi, etc) and loop through it with the logic I
> suggested. I also don't think you could check to see if it's a browser,
> because firewalls & proxy servers may not send that information through.[/color]
I see what you mean.
Do you think this solution will work?
"Using .htaccess: often, you need to put the following two lines in the
.htaccess file, if your host is using PHP as an Apache module:
php_value session.use_only_cookies 1
php_value session.use_trans_sid 0"
I think it does, and even though you then have to rely on cookies, I
think it is the better solution, because today that is a small drawback
compared to the search engine problems.
If this solution works
User-agent: Googlebot
Disallow: /*PHPSESSID
it would be by far the simplest. However, I do not feel too sure that it
does work, and I have no opportunity to check it at this time.
Regards
Mads
re: Can you avoid that googlebot indexes PHPSESSID pages?
> There was some discussion of forcing cookies, but the author didn't want[color=blue]
> to limit his users, so...
>
> How about doing something like this:
>
> // See if the user agent is Googlebot
> $isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');
>
> // If it is, use ini_set to only allow cookies for the session variable
> if ($isGoogle !== false) {
> ini_set('session.use_only_cookies', '1');
> }[/color]
Hi Scott
The solution you have come up with is the cool one. How can I test whether
my host allows me to call ini_set('session.use_only_cookies', '1')
the way you suggest in your code? Can everyone do this on any host?
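One way to find out, sketched here with a hypothetical helper named can_set_ini(): ini_set() returns false when a setting cannot be changed, and ini_get() shows the value actually in effect. The sketch assumes it runs before session_start() and before any output, since PHP may refuse to change session settings after that.

```php
<?php
// Hypothetical helper: try to change a setting and report whether it took effect.
function can_set_ini(string $name, string $value): bool
{
    $previous = @ini_set($name, $value); // false means the change was refused
    return $previous !== false && ini_get($name) === $value;
}

if (can_set_ini('session.use_only_cookies', '1')) {
    // The change was accepted, so the setting is active for this request.
} else {
    // The host (or the current state of the request) does not allow the change.
}
```

Running this once on the host in question and logging the result should be enough to tell whether the runtime approach is available there.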
Any ideas on how to check whether the user agent is Googlebot?
Thanks for the suggestions. I must say I have had my first encounter
with cookie problems, so I would like to get the PHPSESSID back in the
URL.
Best regards
Mads
re: Can you avoid that googlebot indexes PHPSESSID pages?
> If this solutions works[color=blue]
>
> User-agent: Googlebot
> Disallow: /*PHPSESSID
>
> it would be by far the simplest, I do however not feel to sure that it
> does work, and have no opportunity to check it at this time.[/color]
PHPSESSID URLs restricted by robots.txt
In Google Sitemaps BETA I can see 10 URLs restricted by robots.txt, and
that is with the robots.txt above. So I guess that might do the trick.
What do you think: is this an indication that the robots.txt above is
enough?
Cah
re: Can you avoid that googlebot indexes PHPSESSID pages?
CAH wrote:[color=blue][color=green][color=darkred]
> > > http://www.winches.dk/winches.php?ar...6f0d46334659ff...
> > > http://www.winches.dk/winches.php?ar...b6aed41fc142ea...[/color]
> >
> > Such a change in session id shouldn't happen in a normal site.[/color]
>
> Why not? I would think a session ID should be unique. If you think I am
> doing something wrong, what could that be then?[/color]
It shouldn't happen in a single session. The session ID remains the same for
a single session unless:
1. The crawler is returning and caching over multiple runs
2. You have used session_regenerate_id()
3. There are stray absolute links pointing from your site to your own
site instead of relative links (PHP adds the session ID only to relative links)
[color=blue]
> Also,[color=green]
> > AFAIK Google will remove the PHPSESSID from URL (after crawling(?)).[/color]
>
> you can try this seach in google site: www.winches.dk
>
> or click her
>
> http://www.google.com/search?q=site:...e=off&filter=0
>
> Look at the last 100 entries or so.[/color]
It doesn't seem to strip the session ID as I thought. If your site's
content doesn't rely on sessions (for non-members), you may safely turn
off trans sid
<news:1111603962.594721.154710@l41g2000cwc.googlegroups.com> ( http://groups.google.com/group/comp....24f27f2b7ac610 ).
You can even turn it off selectively, only for the crawler, by sniffing the
user agent string and/or IP.
But if your site depends on sessions (for non-members and hence for the
crawler) and you'd like to enable sessions for the crawler but don't want
the trans sid, you need to go for some other hack. If that is your
situation, I may be able to help you with the hack.
--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/
re: Can you avoid that googlebot indexes PHPSESSID pages?
R. Rajesh Jeba Anbiah wrote:
[color=blue]
> CAH wrote:[color=green][color=darkred]
> > > > http://www.winches.dk/winches.php?ar...6f0d46334659ff...
> > > > http://www.winches.dk/winches.php?ar...b6aed41fc142ea...
> > >
> > > Such a change in session id shouldn't happen in a normal site.[/color]
> >
> > Why not? I would think a session ID should be unique. If you think I am
> > doing something wrong, what could that be then?[/color]
>
> It shouldn't happen in a single session--session id remains same for
> the single session unless:
> 1. Crawler is returning and caching in multiple run[/color]
I would think this is what happens.
[color=blue]
> 2. You have used session_regenerate_id()
> 3. There are random absoulte links poining in from your site to your
> site (instead of relative links)
>[color=green]
> > Also,[color=darkred]
> > > AFAIK Google will remove the PHPSESSID from URL (after crawling(?)).[/color]
> >
> > you can try this seach in google site: www.winches.dk
> >
> > or click her
> >
> > http://www.google.com/search?q=site:...e=off&filter=0
> >
> > Look at the last 100 entries or so.[/color]
>
> It doesn't seem to strip session id as I thought. If your site
> contents doesn't rely on session (for non-members), you may safely turn
> off trans sid> <news:1111603962.594721.154710@l41g2000cwc.googleg roups.com> (
> http://groups.google.com/group/comp....24f27f2b7ac610 )
> --even you can selectively turn off only for the crawler by sniffing
> user agent string and or IP.
>
> But, if your site depends on session (for non-members and hence
> crawler)[/color]
It does depend on sessions for non-members.
and you'd like to enable session for crawler, but doesn't want[color=blue]
> the trans sid, you need to go for some other hack. If that is your
> situation, I may help you with the hack.[/color]
Thanks, that is very kind of you. I think the robots.txt might be doing
the trick, and then no further tricks or hacks should be needed. But I
am following Google closely. Now, session.use_trans_sid: what does that
do? Does it not turn off session IDs in the URL and force cookies on the users?
I found this at another site:
if(strpos($_SERVER['HTTP_USER_AGENT'],"google")!==false or
strpos($_SERVER['HTTP_USER_AGENT'],"MSIECrawler")!==false)
{
ini_set("url_rewriter.tags","");
}
http://www.mtdev.com/2002/06/why-you...use_trans_sid/
But I have not tested it.
[color=blue]
>
> --
> <?php echo 'Just another PHP saint'; ?>
> Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/[/color]
re: Can you avoid that googlebot indexes PHPSESSID pages?
CAH wrote:[color=blue]
> R. Rajesh Jeba Anbiah skrev:[/color]
<snip>[color=blue][color=green]
> > It doesn't seem to strip session id as I thought. If your site
> > contents doesn't rely on session (for non-members), you may safely turn
> > off trans sid> <news:1111603962.594721.154710@l41g2000cwc.googleg roups.com> (
> > http://groups.google.com/group/comp....24f27f2b7ac610 )
> > --even you can selectively turn off only for the crawler by sniffing
> > user agent string and or IP.
> >
> > But, if your site depends on session (for non-members and hence
> > crawler)[/color]
>
> it does denpend on sessions for non-members[/color]
I doubt that. Anyway, you had better check again whether it really
depends on sessions.
[color=blue]
> and you'd like to enable session for crawler, but doesn't want[color=green]
> > the trans sid, you need to go for some other hack. If that is your
> > situation, I may help you with the hack.[/color][/color]
<snip>
[color=blue]
> I found this at another site
>
> if(strpos($_SERVER['HTTP_USER_AGENT'],"google")!==false or
> strpos($_SERVER['HTTP_USER_AGENT'],"MSIECrawler")!==false)
> {
> ini_set("url_rewriter.tags","");
> }
>
> http://www.mtdev.com/2002/06/why-you...use_trans_sid/[/color]
Actually, you're turning off trans sid (see my link above), and
thereby you're turning off the session for the crawler. But you said your site
needs sessions for the crawler. So here goes my untested, to-be-improved,
quick and dirty hack:
<?php
/* Crawler SID removal hack: begin--------*/
/* Hack code should be placed at the top of every accessible script,
 * or in a global common file, say header.php.
 * Important assumption: the crawler indexes the final redirected URI. */
/**
 * Test if the request is from the crawler
 *
 * @return boolean
 * @todo implement it or google for hundreds of codes
 **/
function IsCrawler()
{
    return true; // placeholder: treats every visitor as a crawler
}
if (IsCrawler())
{
    define('CRAWLER_SID_FILE', 'crawler_sid.txt');
    if (isset($_GET[session_name()])) // Is session id found in query string?
    {
        $tmp_get = $_GET;
        unset($tmp_get[session_name()]); // remove session id
        // now rebuild query string...
        $new_get = http_build_query($tmp_get);
        $prefix = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
        // HTTP_HOST already carries any non-default port sent by the client
        $current_url = $prefix . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['PHP_SELF'];
        // redirect to self, but with SID removed
        header('Location: ' . $current_url . ($new_get !== '' ? '?' . $new_get : ''));
        exit;
    }
    else // SID is not found (page got redirected); so we need to set/load the crawler's session id
    {
        // the crawler's session id is kept in CRAWLER_SID_FILE
        if (($crawler_sid = @file_get_contents(CRAWLER_SID_FILE)) !== false)
            session_id($crawler_sid);
    }
}
/*----break: Crawler SID removal hack*/
// normal code...
session_start();
/* Crawler SID removal hack: continue ----*/
if (IsCrawler())
{
    file_put_contents(CRAWLER_SID_FILE, session_id()); // store the crawler's session id in CRAWLER_SID_FILE
}
/*----end: Crawler SID removal hack*/
// now, again normal code...
// ...
// testing a link...
echo '<p><a href="' . $_SERVER['PHP_SELF'] . '?a=100&b=50&c=5">Test link</a></p>';
$_SESSION['test'] = !isset($_SESSION['test']) ? 0 : ($_SESSION['test'] + 1);
echo '<p>' . $_SESSION['test'] . '</p>';
?>
--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/
re: Can you avoid that googlebot indexes PHPSESSID pages?
This is not the way to do it. This will prevent Google from seeing any
page with a session ID. Bad, bad, bad. You may as well just disallow
Google altogether.
re: Can you avoid that googlebot indexes PHPSESSID pages?
I think the best solution to this problem is to detect Googlebot (plus any
other bots: Yahoo Slurp, msnbot, ...) and then set the session ID to a
known value using session_id(). That way you always have the session
when you expect it, regardless of the current user agent. Works a treat
for me.
re: Can you avoid that googlebot indexes PHPSESSID pages?
fletch wrote:[color=blue]
> Ithink the best solutioin to this problem is to detect googlebot (+ any
> other bots, yahoo slurp, msnbot..) and then set the session id to a
> known value using session_id() that way you always have the session
> when you expect it regardless of the current user-agent. Works a treat
> for me.[/color]
This is cool. I was thinking about a similar hack after posting,
because my previous code still added the SID to the links on the page (which
will be a problem if users visit via cached pages).
So I'm thinking of turning off trans sid for the crawler, keeping
the session ID on the server, and setting it via session_id().
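A rough sketch of how that combination could look. The helper names crawler_session_id() and start_crawler_session() are made up for illustration, and the crawler is assumed to have been detected already by some other check:

```php
<?php
// Hypothetical helper: derive a fixed, valid session ID for a known crawler.
function crawler_session_id(string $crawlerName): string
{
    // md5() yields 32 hex characters, all of which are legal in a session ID.
    return md5('crawler-session-' . strtolower($crawlerName));
}

// Hypothetical helper: start the shared crawler session without URL rewriting.
function start_crawler_session(string $crawlerName): bool
{
    ini_set('session.use_trans_sid', '0');        // keep the SID out of generated links
    session_id(crawler_session_id($crawlerName)); // must be set before session_start()
    return session_start();
}

// Usage inside a page, once the visitor has been identified as "googlebot":
// start_crawler_session('googlebot');
```

Because the ID is predictable, anyone who guesses it shares that session, so nothing sensitive should be stored in it.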
--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/
re: Can you avoid that googlebot indexes PHPSESSID pages?
Thanks a lot for all the comments and code; it is great that you take
an interest. I believe the session ID is also a problem with regard to
validation of the site, so there are good reasons to find a solid and
simple solution.
[color=blue][color=green][color=darkred]
> > > But, if your site depends on session (for non-members and hence
> > > crawler)[/color]
> >
> > it does denpend on sessions for non-members[/color]
>
> I doubt that. Anyway, it's better you check if it depends on
> session again.[/color]
Well, there is no login on my site, and there are no members. The session ID
is used to transfer information from one page to the next in an
electronic guide that is built with forms.
[color=blue]
> Actually you're turning off trans sid (see my link above) and there
> by you're turning off the session for crawler.[/color]
I still do not understand the difference between trans sid and turning
the SID off.
What is the difference between these two?
php_value session.use_only_cookies
php_value session.use_trans_sid
[color=blue]
>But, you said your site
> needs session for crawler.[/color]
I am sorry that I have given that impression; I do not need sessions
for the crawler. But I do not like to turn off session IDs in the URL,
because there are still users who do not like to get cookies on their
computers.
And here goes my untested--to be improved--a[color=blue]
> quick dirty hack:
>
> <?php
> /* Crawler SID removal hack: begin--------*/
> /* Hack code should be placed on the top of every accessible script.
> * or place it in a global common file say header.php or so.
> * Important Assumption: Crawler indexes the final redirected URI */
>
> /**
> * Test if the request is from the crawler
> *
> * @return boolean
> * @todo implement it or google for hundreds of codes
> **/
> function IsCrawler()
> {
> return true;
> }[/color]
Does the code above test whether it is a crawler?
[color=blue]
> if (IsCrawler())
> {
> define('CRAWLER_SID_FILE', 'crawler_sid.txt');
>
> if (isset($_GET[session_name()])) // Is session id found in query
> string?
> {
> $tmp_get = $_GET;
> unset($tmp_get[session_name()]); //remove session id
> // now rebuild query string...
> $new_get = http_build_query($tmp_get);
> $default_ports = array('https' => 443, 'http' => 80);
> $prefix = (!empty($_SERVER['HTTPS']) ? 'https' : 'http');
> $current_url = $prefix .
> (($_SERVER['SERVER_PORT'] != $default_ports[$prefix]) ?
> ':' . $_SERVER['SERVER_PORT'] : '') . '://'
> . $_SERVER['HTTP_HOST']
> . $_SERVER['PHP_SELF'];
> // redirect to self, but with SID removed
> header('Location: '.$current_url . '?' . $new_get);
> exit;
> }
> else // SID is not found (page got redirected); so we need to
> set/load the crawler's session id
> {
> // the session id is been actually found in
> CRAWLER_SID_FILE for the crawler
> if (($crawler_sid=@file_get_contents(CRAWLER_SID_FILE ))!==false)
> session_id($crawler_sid);
> }
> }
> /*----break: Crawler SID removal hack*/
> // normal code...
> session_start();
> /* Crawler SID removal hack: continue ----*/
> if (IsCrawler())
> {
> file_put_contents(CRAWLER_SID_FILE, session_id()); //safely store
> the crawler's session id in CRAWLER_SID_FILE
> }
> /*----end: Crawler SID removal hack*/
>
> // now, again normal code...
> // ...
> // testing a link...
> echo '<p><a href="' . $_SERVER['PHP_SELF'] . '?a=100&b=50&c=5">Test
> link</a></p>';
> $_SESSION['test'] = !isset($_SESSION['test']) ? 0 :
> ($_SESSION['test']+1);
> echo '<p>'.$_SESSION['test'].'</p>';
> ?>
>
>
> --
> <?php echo 'Just another PHP saint'; ?>
> Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/[/color]
I am going to have to read this a few times to get it; I am in the
beginner class. But I will look more closely into it.
But how about the robots.txt solution? It seems simple and I think it
works.
User-agent: Googlebot
Disallow: /*PHPSESSID
But I cannot say for sure; I have to wait and see whether Google removes
the old listings.
Once again, thanks for the help and code.
Best regards
Mads Larsen
re: Can you avoid that googlebot indexes PHPSESSID pages?
fletch wrote:
[color=blue]
> Ithink the best solutioin to this problem is to detect googlebot (+ any
> other bots, yahoo slurp, msnbot..) and then set the session id to a
> known value using session_id() that way you always have the session
> when you expect it regardless of the current user-agent. Works a treat
> for me.[/color]
Hi Fletch
Could you post the code showing how you do this, and have you tried a
robots.txt solution?
Best regards
Mads
re: Can you avoid that googlebot indexes PHPSESSID pages?
I'll give you a bit, but not all, because it belongs to the company I
work for. I know he won't mind if I give a bit, but only if I plug the
company as well. Contact Zedcore Systems Ltd for PHP development!
if (isset($_SERVER['HTTP_USER_AGENT']))
{
    $aSpiderUAs = array(
        16=>'abachobot',11=>'abcdatos',3=>'altavista',9=>'antibot',29=>'arach',
        6=>'archiver',7=>'asterias',8=>'atomz',4=>'crawler',10=>'ezresult',
        2=>'fast',12=>'ferret',13=>'find',14=>'fireball',15=>'geckobot',
        1=>'google',17=>'gulliver',18=>'hubater',19=>'incywincy',20=>'infoseek',
        21=>'mercator',38=>'msnbot',22=>'nazilla',23=>'roach',24=>'robot',
        25=>'scooter',26=>'scrubby',27=>'sightquest',28=>'slurp',5=>'spider',
        30=>'spyder',31=>'teoma',32=>'toutatis',33=>'ultraseek',34=>'webrefiner',
        35=>'wscbot',36=>'yandex',37=>'zyborg');
    $sHTTPUserAgent = strtolower($_SERVER['HTTP_USER_AGENT']);
    foreach($aSpiderUAs as $iPos=>$sSpiderUA)
    {
        if (strpos($sHTTPUserAgent,$sSpiderUA)!==false)
        {
            $iSpiderID = $iPos;
            $sSID = $iSpiderID.'00000000';
            $bCaughtASpider = true;
        }
    }
}
if (!$bCaughtASpider)
{
    do {
        // An MD5 hash is 32 chars, the same as a PHP-generated SID
        $sNewSID = md5(mt_rand().mt_rand());
    } while (file_exists(ini_get('session.save_path').'/sess_'.$sNewSID));
    $sSID = $sNewSID;
}
The rest is left as an exercise for the reader!
re: Can you avoid that googlebot indexes PHPSESSID pages?
I don't see robots.txt as a solution. You are telling Google not
to index pages, and you don't want that. The more Google indexes, the better.
re: Can you avoid that googlebot indexes PHPSESSID pages?
On 12 Apr 2006 03:13:01 -0700
fletch wrote:
[..][color=blue]
> foreach($aSpiderUAs as $iPos=>$sSpiderUA) {
> if (strpos($sHTTPUserAgent,$sSpiderUA)!==false)
> {
> $iSpiderID = $iPos;
> $sSID = $iSpiderID.'00000000';
> $bCaughtASpider = true;
> }
> }
> }[/color]
Just an off-topic note:
You really want to break out of the foreach loop once you have found a
spider, so that you don't loop through the rest of the array
(that array may become really big with time).
Also, because the loop keeps overwriting $iSpiderID, if for example the AltaVista
user agent is "altavista spider", you'll get the $iSpiderID of spider, NOT
of altavista.
It should be like this:
foreach($aSpiderUAs as $iPos=>$sSpiderUA)
    if(strpos($sHTTPUserAgent,$sSpiderUA)!==false)
    {
        $iSpiderID=$iPos;
        $sSID=$iSpiderID.'00000000';
        $bCaughtASpider=true;
        break;
    }
You should also initialize $bCaughtASpider to false (you may already be
doing it, but you haven't pasted it in this bit), as it's always good to
initialize variables.
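Both points can be folded into one small function. This is only a sketch with a hypothetical name, find_spider_id(), and a shortened signature list. Returning from inside the loop has the same effect as the break, and the null return takes the place of an uninitialized flag:

```php
<?php
// Hypothetical helper: return the ID of the first matching spider, or null.
function find_spider_id(string $userAgent, array $spiderSignatures): ?int
{
    $userAgent = strtolower($userAgent);
    foreach ($spiderSignatures as $id => $signature) {
        if (strpos($userAgent, $signature) !== false) {
            return $id; // returning here stops the loop at the first match
        }
    }
    return null; // no spider recognised
}

// A shortened signature list, in the same id => substring form as the original.
$signatures = array(3 => 'altavista', 1 => 'google', 38 => 'msnbot', 5 => 'spider');

$spiderId = find_spider_id('AltaVista Spider', $signatures);
$caughtASpider = ($spiderId !== null);
```

With this ordering, "altavista spider" resolves to the altavista entry because it comes first in the array; generic signatures such as 'spider' are best kept at the end of the list.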
--
Juan José Gutiérrez de Quevedo
Director Técnico (juanjo@iteisa.com)
ITEISA ( http://www.iteisa.com)
942544036 - 637447953
re: Can you avoid that googlebot indexes PHPSESSID pages?
Cheers! I will fix this. Very useful.
re: Can you avoid that googlebot indexes PHPSESSID pages?
fletch wrote:
[color=blue]
> I don't see the robots.txt to be a solution. You are telling google not
> to index pages, you don't want that. The more google indexes the better.[/color]
Well, the suggested robots.txt should not remove any pages,
only the URLs that contain the session ID. However, I have found only one
place where it is claimed to work, but if it works as claimed, then
I cannot find any downside to this suggestion.
User-agent: Googlebot
Disallow: /*PHPSESSID
re: Can you avoid that googlebot indexes PHPSESSID pages?
fletch wrote:
[color=blue]
> I'll gve you a bit, but not all becasue it belongs to the company I
> work for, I know he won't mind if I give a bit, but only if I plug the
> company as well. Contact Zedcore Systems Ltd for php development!
>[/color]
Hi Fletch
Thanks a lot, it is kind of you to share this.
I have put a noindex on my trouble pages, since I am currently engaged
in another project. But I will return to the solutions suggested here
later on, and see if I can put something usable together.
Thanks to all for the help and suggestions
Best regards
Mads
re: Can you avoid that googlebot indexes PHPSESSID pages?
Honestly, this is designed to disallow Google from indexing pages
containing PHPSESSID, but it won't work anyway, because robots.txt doesn't
support globbing.
[color=blue]
>From http://www.robotstxt.org/wc/exclusion-admin.html
> Note also that regular expression are not supported in either
> the User-agent or Disallow lines. The '*' in the User-agent field
> is a special value meaning "any robot". Specifically, you cannot
> have lines like "Disallow: /tmp/*" or "Disallow: *.gif".[/color]
re: Can you avoid that googlebot indexes PHPSESSID pages?
fletch wrote:
[color=blue]
> Honestly, this is designed to disallow google from indexing pages
> ending with PHPSESSID, but wont work anyway because robots.txt doesn't
> support globbing[/color]
I found this
Additionally, Google has introduced increased flexibility to the
robots.txt file standard through the use asterisks. Disallow patterns
may include "*" to match any sequence of characters, and patterns may
end in "$" to indicate the end of a name.
here http://www.google.com/webmasters/rem...#exclude_pages
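To make that rule concrete, here is a rough sketch of the matching as the quote describes it, using a hypothetical function robots_pattern_matches(). It only illustrates the described behaviour ("*" matches any run of characters, a trailing "$" anchors the end, everything else is a prefix match) and is not Google's actual implementation:

```php
<?php
// Hypothetical helper: does a Disallow pattern with "*" and "$" match a URL path?
function robots_pattern_matches(string $pattern, string $path): bool
{
    // A trailing "$" means the pattern must match up to the end of the path.
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Escape the literal pieces, then join them with ".*" where each "*" was.
    $pieces = array_map(function ($piece) {
        return preg_quote($piece, '#');
    }, explode('*', $pattern));
    $regex = '#^' . implode('.*', $pieces) . ($anchored ? '$' : '') . '#';
    return preg_match($regex, $path) === 1;
}

// The rule discussed in this thread, applied to a made-up path:
$blocked = robots_pattern_matches('/*PHPSESSID', '/winches.php?PHPSESSID=abc123');
```

Under this reading, the rule blocks only URLs that carry the session ID and leaves the same page without the ID crawlable.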
But this only works for Google; however, session IDs seem to be mainly a
problem with Google. Then there is still the problem of
validating a site that uses session IDs.