How to keep a site out of the search engines?

Hi;

I have a site that I do not want the search engines to pick up
on.....attracts people and problems I do not want.

Is there a tag ( or some other means ) of preventing this?

Thanks

Steve

May 24 '06 #1
Steve wrote:
> I have a site that I do not want the search engines to pick up
> on

http://www.robotstxt.org/wc/exclusion-admin.html

> .....attracts people and problems I do not want.


If you want a private website, then use some form of password
protection on it.

May 24 '06 #2
David Dorward wrote:
> Steve wrote:
>> I have a site that I do not want the search engines to pick up
>> on
>
> http://www.robotstxt.org/wc/exclusion-admin.html


Does this actually work? Or is it like <meta name="robots"
content="noindex,nofollow">, which Google still crawls but doesn't index?
Though I guess that amounts to the same thing.

--
Brian O'Connor (ironcorona)
May 24 '06 #3
ironcorona wrote:
> David Dorward wrote:
>> Steve wrote:
>>> I have a site that I do not want the search engines to pick up
>>> on
>>
>> http://www.robotstxt.org/wc/exclusion-admin.html
>
> Does this actually work?


Reputable bots obey it.
--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
May 24 '06 #4
On 24 May 2006 06:52:41 -0700, "Steve" <st**********@yahoo.com>
wrote:
> Hi;
>
> I have a site that I do not want the search engines to pick up
> on.....attracts people and problems I do not want.
>
> Is there a tag ( or some other means ) of preventing this?


Don't put clickable links to it in a publicly accessible web
page.
May 24 '06 #5
> Is there a tag ( or some other means ) of preventing this?

You could use robots.txt... However, a better solution
may be to use .htaccess... Perhaps a user/password system?

Do you have access to the webserver? Is it Apache?
What about PHP (or ASP if Windows)?
--
best regards
Thomas Schulz
http://www.micro-sys.dk/products/sitemap-generator/
http://www.micro-sys.dk/products/website-analyzer/
May 26 '06 #6

Si Ballenger wrote:
> Don't put clickable links to it in a publicly accessible web
> page.


Not reliable. Google Toolbar (for just one) is a backchannel that
feeds the URLs of "hidden" web sites back to Google, where they then
get spidered.

You also have no control over other people linking to your site.
If you want to "avoid indexing", then just use robots.txt (Maybe your
site isn't released yet).

If you want to keep your content hidden, then disallow anyone and
everyone from accessing it (by web server config, such as .htaccess).
Then specifically _allow_ content to be visible to a small set of
permitted users, such as by password access.

There is no practical way to identify "a spider" as distinct from "a
user". So any attempt to make content generally available and
_disallow_ spiders is always doomed to be unreliable and susceptible to
some level of leakage.
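
The deny-by-default approach described above can be sketched roughly as follows. This is only a minimal illustration in Python using the standard library's http.server, not an .htaccess configuration. The USERS table, the credentials, the realm name, and the port are all hypothetical, and a real deployment would run over HTTPS and keep hashed passwords out of the source.

```python
import base64
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials, for illustration only.
USERS = {"steve": "s3cret"}


def is_authorized(auth_header):
    """Default deny: return True only for a valid Basic Authorization header."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:], validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    user, sep, password = decoded.partition(":")
    expected = USERS.get(user)
    if not sep or expected is None:
        return False
    # Constant-time comparison of the supplied and expected passwords.
    return hmac.compare_digest(password.encode(), expected.encode())


class PrivateHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Everyone is refused unless they present accepted credentials.
        if not is_authorized(self.headers.get("Authorization")):
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="private"')
            self.end_headers()
            return
        body = b"Members only\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the sketch server on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), PrivateHandler).serve_forever()
```

Calling serve() starts the server locally. A spider that arrives without credentials gets the same 401 response as any other anonymous visitor, so nothing depends on telling spiders and users apart.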

May 26 '06 #7
di*****@codesmiths.com <di*****@codesmiths.com> wrote:
> There is no practical way to identify "a spider" as distinct from "a
> user".

There is. At least search engine spiders obey the Robots Exclusion Standard
in practice. Occasionally, there might be a misbehaving spider, but such
spiders are rare and they serve odd purposes. You can't beat them, or
separate them from users in any reasonable way, but there's no need to do
that either. They don't make your page findable using commonly used search
engines.

> So any attempt to make content generally available and
> _disallow_ spiders is always doomed to be unreliable and susceptible
> to some level of leakage.


That's correct if you mean complete control. Generally, complete control
does not work on the WWW.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

May 26 '06 #8

Jukka K. Korpela wrote:
> di*****@codesmiths.com <di*****@codesmiths.com> wrote:
>> There is no practical way to identify "a spider" as distinct from "a
>> user".
> There is. At least search engine spiders obey the Robots Exclusion Standard
> in practice.

That is requesting a behaviour from a spider, not identifying it.

> Occasionally, there might be a misbehaving spider,

Such as Google. Google's spiders have Been Evil of late, hammering on
some sites excessively and also ignoring the robot exclusion protocol
in the case of deep URLs obtained through the Google Toolbar.

> Generally, complete control does not work on the WWW.


It does, provided you begin by forbidding _everything_ to _everyone_,
then only relaxing this rule for the very small domain where you can
control things (generally a crypto or password based subset of accepted
user agents). The "lack of control on the web" is a result of the web
being broadly accessible to a broad range of agents (and generally a
good thing too).

May 26 '06 #9
In article <11**********************@i40g2000cwc.googlegroups .com>,
"Andy Dingley <di*****@codesmiths.com>" <di*****@codesmiths.com> writes
> Google Toolbar (for just one) is a backchannel that feeds the URLs of
> "hidden" web sites back to Google, where they then get spidered.


Also, Google has taken to looking at newly registered domain names to
see if there is a web site there. This means that even if your site
doesn't have any links to it and you don't use the Google toolbar,
Google could still find it!!

--
Alan Silver
(anything added below this line is nothing to do with me)
May 29 '06 #10
Alan Silver wrote:
> In article <11**********************@i40g2000cwc.googlegroups .com>,
> "Andy Dingley <di*****@codesmiths.com>" <di*****@codesmiths.com> writes
>> Google Toolbar (for just one) is a backchannel that feeds the URLs of
>> "hidden" web sites back to Google, where they then get spidered.
>
> Also, Google has taken to looking at newly registered domain names to
> see if there is a web site there. This means that even if your site
> doesn't have any links to it and you don't use the Google toolbar,
> Google could still find it!!

Google will always be able to "find" your page. You can only tell Google
not to list it.
You can use either robots.txt or <meta> tags to keep a site out of the
indexes of most search engines. Although some spiders do not follow the
rules in robots.txt, most search bots do.

If you just want a single page not to be listed by search engines, insert
the following tag into your HTML head:
<meta name="robots" content="noindex">
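
As a rough illustration of how an indexer might read that tag, here is a small Python sketch using the standard html.parser module. The class and function names are made up for this example, and real search engines apply more rules than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values may be None, so fall back to an empty string.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def may_index(html):
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives
```

Note that the tag only takes effect once a bot has fetched the page and chooses to honour it. It is a request to the bot, not access control.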

If you want a whole directory not to be listed, you have to create a text
file called "robots.txt" in the root directory of your domain. In this
file you write:
User-agent: *
Disallow: /DIRECTORY/
(Replace DIRECTORY with the name of the directory you want to disallow.)
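
To see how a well-behaved crawler would interpret those two lines, here is a short sketch using Python's standard urllib.robotparser module. The directory name /private/, the bot name ExampleBot, and the example.com URLs are placeholders chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from above, with a hypothetical directory name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url):
    """Return True if a compliant crawler may fetch the URL under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, allowed("ExampleBot", "http://example.com/private/page.html") is False, while a URL outside /private/ is allowed. The check runs on the crawler's side, so it only restrains bots that choose to perform it.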

Hope I could help you.
Jun 19 '06 #11
>>> Google Toolbar (for just one) is a backchannel that feeds the URLs of
>>> "hidden" web sites back to Google, where they then get spidered.
>>
>> Also, Google has taken to looking at newly registered domain names to see if
>> there is a web site there. This means that even if your site doesn't have any
>> links to it and you don't use the Google toolbar, Google could still find
>> it!!
>
> Google will always be able to "find" your page. You can just tell Google not
> to list your page.
> You can use either robots.txt or <meta>-Tags to keep out a site from the
> indexes of most search-engines. Although there are Spiders who do not follow
> your rules in robots.txt, most of the searchbots do.
>
> If you just want a single page not to be listed by search-engines insert the
> following tag into your HTML-head:
> <meta name="robots" content="noindex">
>
> If you want a whole directory not to be listed you have to create a text file
> called "robots.txt" in the main directory of your domain. In this file you
> write:
> User-agent: *
> Disallow: /DIRECTORY/
> (Replace DIRECTORY with the name of the directory you want to disallow.)

You can request that your URL be removed from Google's index at
http://services.google.com:8882/urlc...&lastcmd=login
and read more about Google's webmaster guidelines at
http://www.google.com/support/webmas...y?answer=35769

Google generally plays by the rules, so a Disallow instruction in robots.txt
should work for Google's bot. But don't expect all bots to heed your
instructions (many will ignore robots.txt entirely). It's like you are on a
crowded public street telling people not to look at you. As long as you're in
sight, there's nothing preventing people (good, bad, and indifferent) from
looking.

Jul 6 '06 #12
