
How to keep a site out of the search engines?

Hi;

I have a site that I do not want the search engines to pick up
on.....attracts people and problems I do not want.

Is there a tag ( or some other means ) of preventing this?

Thanks

Steve

May 24 '06 #1
11 Replies


Steve wrote:
> I have a site that I do not want the search engines to pick up
> on

http://www.robotstxt.org/wc/exclusion-admin.html

> .....attracts people and problems I do not want.


If you want a private website, then use some form of password
protection on it.

May 24 '06 #2

David Dorward wrote:
> Steve wrote:
>> I have a site that I do not want the search engines to pick up
>> on
>
> http://www.robotstxt.org/wc/exclusion-admin.html


Does this actually work? Or is it like <meta name="robots"
content="noindex,nofollow">, which Google still crawls but doesn't index?
Though I guess that's the same difference.

--
Brian O'Connor (ironcorona)
May 24 '06 #3

ironcorona wrote:
> David Dorward wrote:
>> Steve wrote:
>>> I have a site that I do not want the search engines to pick up
>>> on
>>
>> http://www.robotstxt.org/wc/exclusion-admin.html
>
> Does this actually work?


Reputable bots obey it.
--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
May 24 '06 #4

On 24 May 2006 06:52:41 -0700, "Steve" <st**********@yahoo.com>
wrote:
> Hi;
>
> I have a site that I do not want the search engines to pick up
> on.....attracts people and problems I do not want.
>
> Is there a tag ( or some other means ) of preventing this?

Don't put clickable links to it in a publicly accessible web
page.
May 24 '06 #5

> Is there a tag ( or some other means ) of preventing this?

You could use robots.txt. However, a better solution
may be to use .htaccess, perhaps with a user/password system.

Do you have access to the web server? Is it Apache?
What about PHP (or ASP if on Windows)?
--
best regards
Thomas Schulz
http://www.micro-sys.dk/products/sitemap-generator/
http://www.micro-sys.dk/products/website-analyzer/
May 26 '06 #6


Si Ballenger wrote:
> Don't put clickable links to it in a publicly accessible web
> page.


Not reliable. Google Toolbar (for just one) is a backchannel that
feeds the URLs of "hidden" web sites back to Google, where they then
get spidered.

You also have no control over other people linking to your site.
If you want to "avoid indexing", then just use robots.txt (Maybe your
site isn't released yet).

If you want to keep your content hidden, then disallow anyone and
everyone from accessing it (by web server config, such as .htaccess).
Then specifically _allow_ content to be visible to a small set of
permitted users, such as by password access.
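
As a rough illustration of the "deny everyone, then allow a few" idea, here is a minimal Python sketch of an HTTP Basic authentication check. It is only a sketch under stated assumptions: the names ALLOWED_USERS, is_authorized and basic_header are hypothetical, the credentials are made up, and a real site would more likely let the web server (for example Apache via .htaccess) do this and would store password hashes rather than plain text.

```python
import base64
import hmac

# Hypothetical credential store for illustration only; a real site would
# keep salted password hashes outside the code, not plain-text passwords.
ALLOWED_USERS = {"steve": "correct horse battery staple"}


def is_authorized(authorization_header):
    """Return True only for a valid HTTP Basic 'Authorization' header value."""
    if not authorization_header:
        return False  # default: nobody gets in, spider or human
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False  # malformed base64 or non-UTF-8 credentials
    username, separator, password = decoded.partition(":")
    if not separator:
        return False
    expected = ALLOWED_USERS.get(username)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking how much of the password matched.
    return hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8"))


def basic_header(username, password):
    """Build the header value a browser would send for these credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

A request without credentials is rejected by default, so an unknown spider sees the same refusal as an unknown person; only requests carrying a known user name and password are let through.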

There is no practical way to identify "a spider" as distinct from "a
user". So any attempt to make content generally available and
_disallow_ spiders is always doomed to be unreliable and susceptible to
some level of leakage.
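
To make the point concrete, here is a small Python sketch of the usual attempt to spot a spider by its User-Agent header. The marker list and both sample header strings are illustrative assumptions, not an authoritative list. The header is self-reported by the client, so a crawler that sends a browser-like string passes as a user.

```python
# Illustrative substrings only; not a complete or authoritative list.
KNOWN_SPIDER_MARKERS = ("googlebot", "msnbot", "slurp")


def looks_like_spider(user_agent):
    """Guess from the self-reported User-Agent header whether a client is a spider."""
    text = (user_agent or "").lower()
    return any(marker in text for marker in KNOWN_SPIDER_MARKERS)


# A spider that announces itself honestly is caught by the check...
honest_spider = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# ...but the same crawler sending a browser-like string is not.
disguised_spider = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0"
```

The check only works for clients that choose to identify themselves, which is why it cannot be relied on to keep content away from spiders.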

May 26 '06 #7

di*****@codesmiths.com <di*****@codesmiths.com> wrote:
> There is no practical way to identify "a spider" as distinct from "a
> user".

There is. At least search engine spiders obey the Robots Exclusion Standard
in practice. Occasionally, there might be a misbehaving spider, but such
spiders are rare and they serve odd purposes. You can't beat them, or
separate them from users in any reasonable way, but there's no need to do
that either. They don't make your page findable using commonly used search
engines.

> So any attempt to make content generally available and
> _disallow_ spiders is always doomed to be unreliable and susceptible
> to some level of leakage.


That's correct if you mean complete control. Generally, complete control
does not work on the WWW.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

May 26 '06 #8


Jukka K. Korpela wrote:
> di*****@codesmiths.com <di*****@codesmiths.com> wrote:
>> There is no practical way to identify "a spider" as distinct from "a
>> user".
>
> There is. At least search engine spiders obey the Robots Exclusion Standard
> in practice.

That is requesting a behaviour from a spider, not identifying it.

> Occasionally, there might be a misbehaving spider,

Such as Google. Google's spiders have Been Evil of late, hammering on
some sites excessively and also ignoring the robot exclusion protocol
in the case of deep URLs obtained through the Google Toolbar.

> Generally, complete control does not work on the WWW.


It does, provided you begin by forbidding _everything_ to _everyone_,
and then relax this rule only for the very small domain where you can
control things (generally a crypto- or password-based subset of accepted
user agents). The "lack of control on the web" is a result of the web
being broadly accessible to a broad range of agents (and generally a
good thing too).

May 26 '06 #9

In article <11**********************@i40g2000cwc.googlegroups.com>,
"Andy Dingley <di*****@codesmiths.com>" <di*****@codesmiths.com> writes
> Google Toolbar (for just one) is a backchannel that feeds the URLs of
> "hidden" web sites back to Google, where they then get spidered.


Also, Google has taken to looking at newly registered domain names to
see if there is a web site there. This means that even if your site
doesn't have any links to it and you don't use the Google toolbar,
Google could still find it!!

--
Alan Silver
(anything added below this line is nothing to do with me)
May 29 '06 #10

Alan Silver wrote:
> In article <11**********************@i40g2000cwc.googlegroups.com>,
> "Andy Dingley <di*****@codesmiths.com>" <di*****@codesmiths.com> writes
>> Google Toolbar (for just one) is a backchannel that feeds the URLs of
>> "hidden" web sites back to Google, where they then get spidered.
>
> Also, Google has taken to looking at newly registered domain names to
> see if there is a web site there. This means that even if your site
> doesn't have any links to it and you don't use the Google toolbar,
> Google could still find it!!

Google will always be able to "find" your page. You can just tell Google
not to list your page.
You can use either robots.txt or <meta> tags to keep a site out of the
indexes of most search engines. Although there are spiders that do not
follow your rules in robots.txt, most search bots do.

If you just want a single page not to be listed by search engines, insert
the following tag into the <head> of your HTML:
<meta name="robots" content="noindex">
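
For illustration, this is roughly how a cooperating indexer could read that tag. It is a minimal Python sketch using the standard library's html.parser; the names RobotsMetaParser and may_index are hypothetical, and real search engines use their own parsers and may honour additional directives.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values may be None (for example <meta name>), hence "or".
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def may_index(html):
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives
```

Note that the spider has to fetch and parse the page before it can see the tag, so this controls listing, not crawling.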

If you want a whole directory not to be listed, create a text
file called "robots.txt" in the root directory of your domain. In this
file, write:
User-agent: *
Disallow: /DIRECTORY/
(Replace DIRECTORY with the name of the directory you want to disallow.)
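
To see how a well-behaved spider would interpret such a file, here is a short Python sketch using the standard library's urllib.robotparser. The directory name /private/, the example.com URLs and the agent name ExampleBot are placeholders chosen for the example, and the helper name allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example rules in the same form as above, with "private" as the directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, allowed(ROBOTS_TXT, "ExampleBot", url))
```

This only shows what a compliant crawler is asked to do; nothing in robots.txt prevents a client that ignores the file from fetching the directory.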

I hope this helps.
Jun 19 '06 #11

>> Google Toolbar (for just one) is a backchannel that feeds the URLs of
>> "hidden" web sites back to Google, where they then get spidered.
>
> Also, Google has taken to looking at newly registered domain names to see if
> there is a web site there. This means that even if your site doesn't have any
> links to it and you don't use the Google toolbar, Google could still find
> it!!
>
> Google will always be able to "find" your page. You can just tell Google not
> to list your page.
> You can use either robots.txt or <meta>-Tags to keep out a site from the
> indexes of most search-engines. Although there are Spiders who do not follow
> your rules in robots.txt, most of the searchbots do.
>
> If you just want a single page not to be listed by search-engines insert the
> following tag into your HTML-head:
> <meta name="robots" content="noindex">
>
> If you want a whole directory not to be listed you have to create a text file
> called "robots.txt" in the main directory of your domain. In this file you
> write:
> User-agent: *
> Disallow: /DIRECTORY/
> (Replace DIRECTORY with the name of the directory you want to disallow.)

You can request that your URL be removed from Google's index at
http://services.google.com:8882/urlc...&lastcmd=login
and read more about Google's webmaster's guidelines at
http://www.google.com/support/webmas...y?answer=35769

Google generally plays by the rules, so a Disallow instruction in robots.txt
should work for Google's bot. But don't expect all bots to heed your
instructions (many will ignore robots.txt entirely). It's like you are on a
crowded public street telling people not to look at you. As long as you're in
sight, there's nothing preventing people (good, bad, and indifferent) from
looking.

Jul 6 '06 #12
