Bytes IT Community

html source to prevent web bot searching

P: n/a
I read that there are some tags that can be entered in a web page's
meta tags in order to prevent web bot searching and indexing of the
web page for search engines.

What is the tagging that I would need to use?
Jul 20 '05 #1
14 Replies


P: n/a
*Ludwig77* wrote:
> I read that there are some tags that can be entered in a web page's
> meta tags in order to prevent web bot searching and indexing of the
> web page for search engines.
>
> What is the tagging that I would need to use?


http://www.robotstxt.org/wc/faq.html#noindex
--
Andrew Urquhart
- FAQ: www.htmlhelp.org/faq/html/
- Archive: www.tinyurl.com/2zw7m (Google Groups)
- My reply address is invalid, use: www.andrewu.co.uk/contact/
Jul 20 '05 #2

P: n/a
Ludwig77 <gr********@yahoo.com> wrote:
> I read that there are some tags that can be entered in a web page's
> meta tags in order to prevent web bot searching and indexing of the
> web page for search engines.
>
> What is the tagging that I would need to use?


There's always robots.txt
http://www.searchengineworld.com/rob...s_tutorial.htm
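
For anyone wondering what robots.txt looks like from the crawler's side, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the bot names (GoodBot, BadBot) and the example.com URLs are made up for illustration. Note that the check runs inside the crawler: nothing on the server enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this file
# from the root of the site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("GoodBot", "http://example.com/index.html"),
        ("GoodBot", "http://example.com/private/page.html"),
        ("BadBot", "http://example.com/index.html"),
    ]:
        print(agent, url, may_fetch(agent, url))
```

A crawler that never calls anything like may_fetch() simply ignores the file, which is why robots.txt is a request rather than a barrier.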

--
_Deirdre http://deirdre.net
"Memes are a hoax! Pass it on!"
Jul 20 '05 #3

P: n/a
----- Original Message -----
From: "Ludwig77" <>
Newsgroups: comp.infosystems.www.authoring.html
Sent: Monday, June 14, 2004 2:48 PM
Subject: html source to prevent web bot searching

> I read that there are some tags that can be entered in a web page's
> meta tags in order to prevent web bot searching and indexing of the
> web page for search engines.
>
> What is the tagging that I would need to use?

In the HEAD of the page, insert the following four lines:

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<META NAME="robots" CONTENT="noarchive">
<meta name="robots" content="noimageindex, nomediaindex" />
<META HTTP-EQUIV="pragma" CONTENT="no-cache">

PLEASE note: The key word in your inquiry is "prevent."
Neither the use of the aforementioned four lines nor the use of Disallow in
robots.txt PREVENTS ANYTHING.
Rather, the interpretation is that honorable bots will abide by your wishes.
On the other hand, there are many, many dishonorable bots.
The solution to those bots is the implementation of an effective
".htaccess" file. Htaccess is a control rather than a request and, properly
used, enables the PREVENT you inquired about.
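
To illustrate the difference between a request and a control, here is a rough sketch of server-side refusal. It is NOT .htaccess syntax; it is a small Python WSGI stand-in for the same idea, and the blocked agent names are hypothetical. Keep in mind that a bot can send any User-Agent it likes, so matching on that header alone is only as good as the bot's honesty about its name.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical User-Agent substrings to refuse; illustrative only.
BLOCKED_AGENTS = ("badbot", "emailharvester")


def is_blocked(user_agent: str) -> bool:
    """Case-insensitive substring match against the block list."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENTS)


def app(environ, start_response):
    """Minimal WSGI app: refuse blocked agents, serve everyone else."""
    if is_blocked(environ.get("HTTP_USER_AGENT", "")):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Hello</body></html>\n"]


def status_for(user_agent: str) -> str:
    """Call the app once with a fake request and return the status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    app(environ, lambda status, headers: seen.append(status))
    return seen[0]


if __name__ == "__main__":
    print(status_for("BadBot/1.0"))
    print(status_for("Mozilla/5.0"))
```

Here the server decides, so the page is never handed over to a matching client, whereas a meta tag can only be read after the page has already been served.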
Jul 20 '05 #4

P: n/a
lostinspace wrote:
> Original Message From: "Ludwig77"
>> I read that there are some tags that can be entered in a web
>> page's meta tags in order to prevent web bot searching and
>> indexing of the web page for search engines.
>
> <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
> <META NAME="robots" CONTENT="noarchive">

Ok, though it'd be worth mentioning robots.txt as well.

> <meta name="robots" content="noimageindex, nomediaindex" />

Why did you switch to xhtml syntax for this one line?

> <META HTTP-EQUIV="pragma" CONTENT="no-cache">

Pardon? What does caching have to do with search engine indexing?

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #5

P: n/a
----- Original Message -----
From: "Brian" <>
Newsgroups: comp.infosystems.www.authoring.html
Sent: Monday, June 14, 2004 11:07 PM
Subject: Re: html source to prevent web bot searching

> lostinspace wrote:
>> Original Message From: "Ludwig77"
>>> I read that there are some tags that can be entered in a web
>>> page's meta tags in order to prevent web bot searching and
>>> indexing of the web page for search engines.
>>
>> <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
>> <META NAME="robots" CONTENT="noarchive">
>
> Ok, though it'd be worth mentioning robots.txt as well.
>
>> <meta name="robots" content="noimageindex, nomediaindex" />
>
> Why did you switch to xhtml syntax for this one line?
>
>> <META HTTP-EQUIV="pragma" CONTENT="no-cache">
>
> Pardon? What does caching have to do with search engine indexing?
>
> --
> Brian (remove ".invalid" to email me)

Brian,
I've been using those lines for some time.
Google and some other SE's "each" have their individual preference for page
exclusions.
http://google.netscape.com/webmasters/faq.html#cached
Now isn't it absurd that Google requires something different from the
industry norm?
They aren't alone.
I'm unable to recall which bot the xhtml syntax addresses; however, it is
specific.

Caching vs. indexing?
If they don't see it they can't read it.

Over a period of "time" controlling ALL the cache (by whatever means
possible) will provide your logs with the majority of your visitors :-))
This is "contrary" to what many folks will tell you.

BTW the inquiry was specific to "prevent."
If he's unable to find any mention of robots.txt, htaccess or INDEX-NOINDEX
or even do a simple google on "prevent+web+bot+searching" what in your
opinion is going to be his understanding and experience in these regards?
Jul 20 '05 #6

P: n/a
lostinspace wrote:
> From: "Brian" <>
>> lostinspace wrote:
>>> Original Message From: "Ludwig77"
>>>> I read that there are some tags that can be entered in a web
>>>> page's meta tags in order to prevent web bot searching and
>>>> indexing
>>>
>>> <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
>>> <META NAME="robots" CONTENT="noarchive">
>>> <meta name="robots" content="noimageindex, nomediaindex" />
>>
>> Why did you switch to xhtml syntax for this one line?
>>
>>> <META HTTP-EQUIV="pragma" CONTENT="no-cache">
>>
>> Pardon? What does caching have to do with search engine indexing?
>
> I've been using those lines for some time.

That may be, but it sheds no light on *why* you use them, and why
you're telling others to use them.

> Google and some other SE's "each" have their individual preference
> for page exclusions.

Which search engines ignore the robots exclusion policy?

> http://google.netscape.com/webmasters/faq.html#cached

This link is a Netscape page; strange that you didn't reference a
google.com page.

> Now isn't that absurd that Google requires something different from
> the industry norm?

What is absurd is that you are offering advice about something which
you have badly misunderstood.

Google cache is a service by which Google offers users a view of the
page as it was last indexed by Googlebot. It is completely unrelated
to http caching (more on that below). Googlebot does respect the
robots policy. If you don't want the page indexed, editing the page
with the meta robots element or the site's robots.txt file will suffice.
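
As a rough illustration of how a cooperating robot might act on the meta robots element, here is a sketch built on Python's standard html.parser. The class and function names are invented for this example, and real search engines' parsers are certainly more involved. The sample markup reuses the tags quoted earlier in the thread.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so both the
        # upper-case and the XHTML-style spellings end up here.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of lower-cased robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_index(html: str) -> bool:
    """A cooperating robot skips indexing when it sees noindex."""
    return "noindex" not in robots_directives(html)


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">'
        '<meta name="robots" content="noimageindex, nomediaindex" />'
        '</head><body></body></html>'
    )
    print(sorted(robots_directives(page)))
    print(may_index(page))
```

The HTML and XHTML spellings parse to the same directives here, and a pragma http-equiv element is never consulted because it is not a robots directive at all.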

> I'm unable to recall which bot the xhtml syntax addresses however
> it is specific.

Then I can only assume you're mistaken.

> Caching Vs indexing? If they don't see it they can't read it.

Exactly. So including

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

is entirely unnecessary if the robot has been blocked from indexing
the site in the first place.

> Over a period of "time" controlling ALL the cache (by whatever
> means possible) will provide your logs with the majority of your
> visitors :-)) This is "contrary" to what many folks will tell you.

No, this is not contrary to what many folks will tell me. In fact, a
search of web forums will turn up many people who are as clueless
about caching as you are.

First, there is no way to ensure that your document is not cached
using pragma; you have a better chance with cache-control. Impeding
caches is an incredibly stupid thing to do in most situations, since
it slows down your site with no appreciable gain. It should only be
done if there is a genuine reason to block caching (security, privacy,
etc.). Vain attempts to "improve" your server logs is one of the
silliest reasons to block caches.
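
To make the pragma vs. cache-control point concrete: proxy and browser caches act on the HTTP response headers the server sends. A minimal sketch of choosing a Cache-Control value follows. The function name and its parameter are invented for illustration; no-store and max-age are standard Cache-Control directives.

```python
def cache_headers(max_age=None):
    """Build HTTP response headers that express a caching policy.

    max_age=None asks caches not to store the response at all, which is
    appropriate for private or sensitive pages. Otherwise max_age is the
    number of seconds a cache may reuse the response.
    """
    if max_age is None:
        return [("Cache-Control", "no-store")]
    return [("Cache-Control", f"public, max-age={max_age}")]


if __name__ == "__main__":
    print(cache_headers())       # for a page that must not be cached
    print(cache_headers(3600))   # for an ordinary page, cacheable for an hour
```

For ordinary pages the second form is usually the better choice, since letting caches do their job is what keeps the site fast.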

You have muddied the waters further by confusing Google's cache on one
hand with proxy and browser caching on the other. They have *nothing*
to do with each other.

> If he's unable to find any mention of robots.txt, htaccess or
> INDEX-NOINDEX or even do a simple google on
> "prevent+web+bot+searching" what in your opinion is going to be his
> understanding and experience in these regards?


It couldn't possibly be more misleading than your post. You -- and the
op -- can start learning about robots exclusion:

http://www.robotstxt.org/wc/robots.html

Google additions can be found on their site:

http://www.google.com/bot.html

And for pete's sake, please stop giving advice on caching until you
understand it better. Start here:

http://www.web-caching.com/mnot_tutorial/

P.S. Please follow the norms for posting in this group: trim your
quotes, and insert your replies after the relevant quoted parts. See

http://www.xs4all.nl/%7ewijnands/nnq/nquote.html

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #7

P: n/a
----- Original Message -----
From: "Brian" <>
Newsgroups: comp.infosystems.www.authoring.html
Sent: Tuesday, June 15, 2004 1:18 AM
Subject: Re: html source to prevent web bot searching

lostinspace wrote:
From: "Brian" <>
lostinspace wrote:

Original Message From: "Ludwig77"

> I read that there are some tags that can be entered in a web
> page's meta tags in order to prevent web bot searching and
> indexing

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<META NAME="robots" CONTENT="noarchive">
<meta name="robots" content="noimageindex, nomediaindex" />

Why did you switch to xhtml syntax for this one line?

<META HTTP-EQUIV="pragma" CONTENT="no-cache">

Pardon? What does caching have to do with search engine indexing?

I've been using those lines for some time.


That may be, but it sheds no light on *why* you use them, and why
you're telling others to use them.
Google and some other SE's "each" have their individual preference
for page exclusions.


Which search engines ignore the robots exclusion policy?
http://google.netscape.com/webmasters/faq.html#cached


This link is a Netscape page; strange that you didn't reference a
google.com page.
Now isn't that absurd that Google requires something different from
the industry norm?


What is absurd is that you are offering advice about something which
you have badly misunderstood.

Google cache is a service by which Google offers users a view of the
page as it was last indexed by Googlebot. It is completely unrelated
to http caching (more on that below). Googlebot does respect the
robots policy. If you don't want the page indexed, editing the page
with the meta robots element or the site's robots.txt file will suffice.
I'm unable to recall which bot the xhtml syntax addresses however
it is specific.


Then I can only assume you're mistaken.
Caching Vs indexing? If they don't see it they can't read it.


Exactly. So including

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

is entirely unnecessary if the robot has been blocked from indexing
the site in the first place.
Over a period of "time" controlling ALL the cache (with what ever
mean possible) will provide your logs with the majority of your
visitors :-)) This is "contrary" to what many folks will tell you.


No, this is not contrary to what many folks will tell me. In fact, a
search of web forums will turn up many people who as clueless
about caching as you are.

First, there is no way to ensure that your document is not cached
using pragma; you have a better chance with cache-control. Impeding
caches is an incredibly stupid thing to do in most situations, since
it slows down your site with no appreciable gain. It should only be
done if there is a genuine reason to block caching (security, privacy,
etc.). Vain attempts to "improve" your server logs is one of the
silliest reasons to block caches.

You have muddied the waters further by confusing Google's cache on one
hand with proxy and browser caching on the other. They have *nothing*
to do with each other.
If he's unable to find any mention of robot's txt, htaccess or
INDEX-NOINDEX or even do a simple google on
"prevent+web+bot+searching" what in your opinion is going to be his
understanding and experience in these regards?


It couldn't possibly be more misleading than your post. You -- and the
op -- can start learning about robots exclusion:

http://www.robotstxt.org/wc/robots.html

Google additions can be found on their site:

http://www.google.com/bot.html

And for pete's sake, please stop giving advice on caching until you
understand it better. Start here:

http://www.web-caching.com/mnot_tutorial/

P.S. Please follow the norms for posting in this group: trim your
quotes, and insert your replies after the relevant quoted parts. See

http://www.xs4all.nl/%7ewijnands/nnq/nquote.html

--
Brian (remove ".invalid" to email me)


Brian,
If you are as knowledgeable in regard to these issues as you
believe you are, then WHY didn't you advise the OP before I did?
It's quite easy for you to sit on your backside and tear apart emails
"after the fact."
You really have less of a clue than you understand. I've been using these
methods and htaccess methods for nearly six years on my websites, and today
you're attempting to convey that in that period I learned nothing of traffic
patterns :-))

If my submissions are upsetting you, then filter me out.

P: n/a
On Tue, 15 Jun 2004 13:25:52 GMT, lostinspace
<lo*********@123-universe.com> wrote:
> If you are as knowledgeable in regard to these issues as you
> believe you are, then WHY didn't you advise the OP before I did?

Posts propagate, and people check in, differently. Possibly he never saw
the post before you did.

> It's quite easy for you to sit on your backside and tear apart emails
> "after the fact."
> You really have less of a clue than you understand. I've been using these
> methods and htaccess methods for nearly six years on my websites, and
> today you're attempting to convey that in that period I learned nothing
> of traffic patterns :-))

I think he'd like some evidence, beyond "I heard this works". I would as
well. Google's methods are well known. Where did you learn of the other
methods?
Jul 20 '05 #9

P: n/a
----- Original Message -----
From: "Neal" <>
Newsgroups: comp.infosystems.www.authoring.html
Sent: Tuesday, June 15, 2004 9:43 AM
Subject: Re: html source to prevent web bot searching

> On Tue, 15 Jun 2004 13:25:52 GMT, lostinspace
> <> wrote:
>> If you are as knowledgeable in regard to these issues as you
>> believe you are, then WHY didn't you advise the OP before I did?
>
> Posts propagate, and people check in, differently. Possibly he never saw
> the post before you did.
>
>> It's quite easy for you to sit on your backside and tear apart emails
>> "after the fact."
>> You really have less of a clue than you understand. I've been using these
>> methods and htaccess methods for nearly six years on my websites, and
>> today you're attempting to convey that in that period I learned nothing
>> of traffic patterns :-))
>
> I think he'd like some evidence, beyond "I heard this works". I would as
> well. Google's methods are well known. Where did you learn of the other
> methods?


Hello Neal,
When I began with my websites, I also started following
alt.html and alt.www.webmaster. From following threads of interest, I began
doing internet searches on lead words which had been supplied in
conversation. Later I participated in a Webmaster World forum surrounding
identifying robots (that forum has since been deactivated.)

There is so much more to this than was previously conveyed, however I felt
no reason to overwhelm the original inquiry.

Brian's concerns and interest are of no relevance to me. I provided the OP
with some lines as he requested which will possibly lead him to some
expanded insights, provided he learns how to use SE's :-)))

htaccess? Just do a google.
Proxies? Google or anybody else will be no help here. Most of the
proxy-server cache bots don't even identify themselves when spidering. AOL
is easy, they are using a UA ( "Mozilla/3.01 (compatible;)" ) .

This very extensive thread will provide you with a wealth of information:
http://www.webmasterworld.com/forum1...ht=perfect+ban

I'm not sure if the search capability still exists in that forum:
http://www.webmasterworld.com/forum11/index.htm
The entire defunct forum surrounded what has been touched on here.

In the end, each webmaster does what he/she determines to best enhance their
websites. Personally, I've in effect created an intranet on the open
internet by denying countries and regions. To take the time to explain and
"debate" over issues that some folks believe they understand to be so and
what I actually see occur is IMO not worth any time spent convincing
them otherwise. Nor is it my desire to chase URLs that I long ago chased to
solve issues, ONLY to support an effective solution which I implemented long
ago, merely to support a mail submission I provided an insight to.
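
For readers curious what "denying countries and regions" can amount to in practice, here is a minimal sketch of checking a client address against a list of refused network ranges, using Python's standard ipaddress module. This is an assumption about the general technique, not a description of the setup mentioned above. The CIDR blocks are documentation-only ranges, not real country allocations, and is_denied is a hypothetical helper.

```python
import ipaddress

# Hypothetical CIDR blocks to refuse. These are reserved documentation
# ranges (RFC 5737), not real country or region allocations.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_denied(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_denied("192.0.2.55"))
    print(is_denied("203.0.113.9"))
```

A web server would run a check of this kind before serving the request and answer a denied address with a 403.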

I rarely post in this forum and this required detail for assisting somebody
explains why. :-(((
I provided the simple solution that the OP was asking for. In his original
mail, he inquired about something to include in the <head></head> although
he didn't realize that. He made NO inquiry about robots.txt, htaccess,
proxies, cache or all this other nonsense (at least in regard to his
inquiry.) In effect, I answered his question and I'm required to defend
myself. BS!
Jul 20 '05 #10

P: n/a
lostinspace wrote:
> From: "Brian"
>> lostinspace wrote:
>>> From: "Brian"
>>>> lostinspace wrote:
>>>>> <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
>>>>> <META NAME="robots" CONTENT="noarchive">
>>>>> <meta name="robots" content="noimageindex, nomediaindex" />
>>>>> <META HTTP-EQUIV="pragma" CONTENT="no-cache">
>>>> What does caching have to do with search engine indexing?
>>> Google and some other SE's "each" have their individual
>>> preference for page exclusions.
>>> Now isn't that absurd that Google requires something different
>>> from the industry norm?
>>> I'm unable to recall which bot the xhtml syntax addresses
>>> however it is specific.
>> Please follow the norms for posting in this group: trim your
>> quotes, and insert your replies after the relevant quoted parts.
>> See
>>
>> http://www.xs4all.nl/%7ewijnands/nnq/nquote.html

Please actually *read* this document.

> Brian, If you are as knowledgeable in regard to these issues as
> you believe you are, then WHY didn't you advise the OP before I did?

I was away from my computer. I do apologize.

> It's quite easy for you to sit on your backside and tear apart
> emails "after the fact."

It was easy for you to spew a bunch of nonsense, too.

> You really have less of a clue than you understand.

Oh? Then please point out where I was wrong instead of merely writing
an ad hominem attack.

> I've been using these methods and htaccess methods for nearly six
> years on my websites, and today you're attempting to convey that in
> that period I learned nothing of traffic patterns :-))

What I read in your posts suggests that you do not understand the
difference between proxy/browser caching and Google's cache. Until you
do, you have no business giving bogus advice to unsuspecting newcomers.

> If my submissions are upsetting you, then filter me out.

If you continue to ignore the posting styles -- you again refused to
snip a single line of quotes -- then you'll indeed enter my killfile.

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #11

P: n/a
lostinspace wrote:
> I provided the OP with some lines as he requested which will
> possibly lead him to some expanded insights, provided he learns how
> to use SE's :-)))

You still haven't grasped the concept: pragma has nothing to do with
search engines.

> Proxies? Google or anybody else will be no help here.

http://www.google.com/search?q=proxies

It looks pretty helpful to me.

> Most of the proxy-server cache bots don't even identify themselves
> when spidering.

Naturally, since they do not spider. Why do you refuse to simply read
what a proxy cache is, and how it differs from a bot?

> I rarely post in this forum and this required detail for assisting
> somebody explains why. :-(((

This isn't about minutiae. Your advice, if followed, would almost
certainly have a serious negative impact on the poor op.

> I provided the simple solution that the OP was asking for.

Not quite. A look at your first posting in this thread shows that
you provided a solution, *with additional information* that was
unrelated and likely harmful.

> In his original mail, he inquired about something to include in the
> <head></head> although he didn't realize that. He made NO inquiry
> about robots.txt, htaccess, proxies, cache

Right. So, if he didn't ask about proxy and browser caching, why did
you advise him to tell proxies (and browsers) not to cache his documents?

I get the impression that you're just stubbornly refusing to admit error.

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #12

P: n/a
----- Original Message -----
From: "Brian" <>
Newsgroups: comp.infosystems.www.authoring.html
Sent: Tuesday, June 15, 2004 11:04 AM
Subject: Re: html source to prevent web bot searching

lostinspace wrote:
From: "Brian"
lostinspace wrote:

From: "Brian"

> lostinspace wrote: <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
>> <META NAME="robots" CONTENT="noarchive">
>> <meta name="robots" content="noimageindex, nomediaindex" />
>> <META HTTP-EQUIV="pragma" CONTENT="no-cache"> What does caching have to do with search engine indexing? Google and some other SE's "each" have their individual
preference for page exclusions.
Now isn't that absurd that Google requires something different
from the industry norm? I'm unable to recall which bot the xhtml syntax addresses
however it is specific. Please follow the norms for posting in this group: trim your
quotes, and insert your replies after the relevant quoted parts.
See

http://www.xs4all.nl/%7ewijnands/nnq/nquote.html


Please actually *read* this document.
Brian, If you are as knowledgable in regards to these as issues as
you believe you are than WHY didn't you advise the op before I?


I was away from my computer. I do apologize.
It's quite easy for you to sit on your backside and tear apart
emails "after the fact."


It was easy for you to spew a bunch of nonsense, too.
You really have less of a clue than you understand.


Oh? Then please point out where I was wrong instead of merely writing
an ad hominem attack.
I've been using these methods and htaccess methods for nearly six
years on my websites and today your attempting to convey that in
that period I learned nothing of traffic patterne :-))


What I read in your posts suggests that you do not understand the
difference between proxy/browser caching and Google's cache. Until you
do, you have no business giving bogus advice to unsuspecting newcomers.
If my sumbissions are upsetting you than filter me out.


If you continue to ignore the posting styles -- you again refused to
snip a single line of quotes -- then you'll indeed enter my killfile.

--
Brian (remove ".invalid" to email me)


I've no interest in reading any links provided by you, regardless of where
they may lead.
Please add me, then you'll spare me your so-called rhetoric.
Jul 20 '05 #13

P: n/a
lostinspace wrote:
> Brian wrote:
>> If you continue to ignore the posting styles -- you again refused
>> to snip a single line of quotes -- then you'll indeed enter my
>> killfile.
>
> I've no interest in reading any links provided by you, regardless
> of where they may lead. Please add me, then you'll spare me your
> so-called rhetoric.

Ask and ye shall receive. *plonk*

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #14

P: n/a

"lostinspace" <lo*********@123-universe.com> wrote in message
news:kE***************@newssvr19.news.prodigy.com. ..
> ----- Original Message -----
> From: "Brian" <>
> Newsgroups: comp.infosystems.www.authoring.html
> Sent: Monday, June 14, 2004 11:07 PM
> Subject: Re: html source to prevent web bot searching
>
>> lostinspace wrote:
>>> Original Message From: "Ludwig77"
>>>> I read that there are some tags that can be entered in a web
>>>> page's meta tags in order to prevent web bot searching and
>>>> indexing of the web page for search engines.
>>>
>>> <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
>>> <META NAME="robots" CONTENT="noarchive">
>>
>> Ok, though it'd be worth mentioning robots.txt as well.
>>
>>> <meta name="robots" content="noimageindex, nomediaindex" />
>>
>> Why did you switch to xhtml syntax for this one line?
>>
>>> <META HTTP-EQUIV="pragma" CONTENT="no-cache">
>>
>> Pardon? What does caching have to do with search engine indexing?
>>
>> --
>> Brian (remove ".invalid" to email me)
>
> Brian,
> I've been using those lines for some time.

You may have been using the line

<META HTTP-EQUIV="pragma" CONTENT="no-cache">

for some time, but no matter how long you use it, it will continue to have
nothing to do with preventing indexing by search engines.

> Google and some other SE's "each" have their individual preference for
> page exclusions.
> http://google.netscape.com/webmasters/faq.html#cached

There's no mention on this page of the no-cache pragma.

> Now isn't that absurd that Google requires something different from the
> industry norm?
> They aren't alone.
> I'm unable to recall which bot the xhtml syntax addresses however it is
> specific.
>
> Caching Vs indexing?
> If they don't see it they can't read it.

What do you think the no-cache pragma has to do with whether the robot can
see or read the page? If it's reading the pragma, then it's already reading
the page!

> Over a period of "time" controlling ALL the cache (by whatever means
> possible) will provide your logs with the majority of your visitors :-))
> This is "contrary" to what many folks will tell you.

Huh?

Jul 20 '05 #15

This discussion thread is closed
