Bytes IT Community

How does the Google spider access my web site?

Does anyone know how Google's spiders access a web site? How do they manage to
get the href information? Do they have special access rights or
something? Any help is appreciated.

Jan 26 '06 #1
7 Replies


bb
Look up web robots and robots.txt:

http://www.robotstxt.org/wc/faq.html
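The FAQ above covers the robots.txt convention that well-behaved spiders honour. As a rough sketch, here is how a crawler checks those rules using Python's standard-library parser; the rules shown are a hypothetical example, not taken from any real site:

```python
# Minimal sketch: checking whether a crawler may fetch a URL, using
# Python's stdlib robots.txt parser. Hypothetical example rules.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot may fetch anything except /private/...
print(rp.can_fetch("Googlebot", "/products.aspx"))        # True
print(rp.can_fetch("Googlebot", "/private/admin.aspx"))   # False
# Every other bot is shut out entirely by the "*" group.
print(rp.can_fetch("SomeOtherBot", "/products.aspx"))     # False
```

Note that robots.txt only asks spiders to stay out; it is not an access-control mechanism.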

Jan 26 '06 #2

"baroque Chou" <ba********@126.com> wrote in message
news:11**********************@g44g2000cwa.googlegroups.com...
Does anyone know how Google's spiders access a web site? How do they manage to
get the href information? Do they have special access rights or
something? Any help is appreciated.


No, Google doesn't have any special access rights; it accesses your website
the same way anyone else does. This means that if you have a login screen
which visitors must get past to view your site, then the Google spider won't
get past it either. Some sites explicitly grant the Google bot (or other bots)
access, but that's the exception, not the rule.

In summary, what you can see in your browser (or better still, what I could
see in my browser if you gave me the URL) is what the Google spider can see.
The only caveat is that the Google spider is a little fussier about
correct HTML than most browsers are, so it's worth checking that your code
validates and your links are correct.
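To make the "no special access" point concrete: a spider just parses hrefs out of the same HTML bytes a browser receives. A minimal sketch with Python's stdlib HTML parser (the sample page below is made up for illustration):

```python
# Sketch: extracting hrefs from served HTML, the way any client could.
# Uses only Python's stdlib; the sample page is hypothetical.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = """<html><body>
<a href="/about.aspx">About</a>
<a href="Middlelayer_Top10.aspx?id=105">Top 10</a>
</body></html>"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/about.aspx', 'Middlelayer_Top10.aspx?id=105']
```

Badly malformed HTML can confuse this kind of parsing, which is one reason validating your markup helps crawlers.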
--
Brian Cryer
www.cryer.co.uk/brian
Jan 26 '06 #3

Thanks. It seems the Google spider has some of the attributes a browser has. But
if I am using a dynamic page, say .aspx, which doesn't produce an output
page before the web server executes it, how does Google know the hrefs in that
page? And most of the time, even in the executed page, the href has a
form like
<a href='Middlelayer_Top10.aspx?id=105'>
How will the spider make a deeper crawl if it can neither access my source
code nor make any request?

Jan 26 '06 #4

KMA
Generally it goes like this:

You send Google a reference to your homepage. Obviously this page shouldn't
require logging in or a password.
The Google bot downloads this page and strips out all the links. It computes a
"score" of the page for the Google index, then downloads every page from the
link list and repeats the same procedure until all links are processed.

The exact details of the scoring mechanism are not published, to prevent
people artificially pushing their pages up the rankings.

Some say that parameterised links (like gfdg.aspx?productID=1234) are not
followed.

To get more of an idea, create an aspx page with links, run your program, then
in the browser right-click and choose View Source. This is exactly what the
googlebot gets.
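The download-extract-repeat procedure above can be sketched as a simple crawl loop. The "site" here is a hypothetical in-memory dict standing in for real HTTP fetches, and the href extraction is a naive regex (a real bot would use a proper HTML parser):

```python
# Sketch of the crawl loop: start from one submitted page, extract its
# links, queue every unseen one, repeat until the queue is empty.
import re

# Hypothetical site: URL -> served HTML, standing in for HTTP GETs.
site = {
    "/index.aspx": '<a href="/products.aspx">p</a> <a href="/about.aspx">a</a>',
    "/products.aspx": '<a href="/index.aspx">home</a>',
    "/about.aspx": "no links here",
}

def crawl(start):
    seen, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in site:
            continue  # already indexed, or fetch would 404
        seen.add(url)
        # naive href extraction for the sketch
        for link in re.findall(r'href="([^"]+)"', site[url]):
            queue.append(link)
    return sorted(seen)

print(crawl("/index.aspx"))  # ['/about.aspx', '/index.aspx', '/products.aspx']
```

Note the `seen` set: without it, the index.aspx / products.aspx link cycle would loop forever.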

"baroque Chou" <ba********@126.com> wrote in message
news:11**********************@g47g2000cwa.googlegroups.com...
Thanks. It seems the Google spider has some of the attributes a browser has. But
if I am using a dynamic page, say .aspx, which doesn't produce an output
page before the web server executes it, how does Google know the hrefs in that
page? And most of the time, even in the executed page, the href has a
form like
<a href='Middlelayer_Top10.aspx?id=105'>
How will the spider make a deeper crawl if it can neither access my source
code nor make any request?

Jan 26 '06 #5

Thank you very much. Someone suggested using a rewrite
rule to make the URL more search-engine friendly,
e.g. rewriting gfdg.aspx?productID=1234 to gfdg.aspx/productID/1234.
But this page doesn't actually exist on my web server;
what exists is just the source page, and the "instance" of that page is
created every time by an individual request. So do I need to archive the
instance of that page to some location (with the directory hierarchy
arranged to follow the URL pattern, so that the spider can
crawl it better)?

Jan 27 '06 #6

KMA
OK, basically it goes like this.

On your web pages you write bot-friendly URLs, like
gfds/product/toasters/toastomatic5000.aspx.

But as you say, this page doesn't really exist. When the bot requests the
page, IIS will not be able to find it, but if you implement your own
404 handler then IIS will call that instead. A normal 404 handler just gives back a
page saying "Sorry, page not found", but your special 404 handler will be
passed the URL of the requested page. You can then strip the product ID
out of the URL and build the page for that product. This page is then sent
back to the bot.

In a way you are fooling the bot into thinking you have lots of web pages, but really
you just have one page handler plus a database of product data. Bot writers
expect this, because they know that it's very difficult to maintain a large
site in any other way.
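The 404-handler trick above boils down to parsing the product ID back out of the friendly URL and building the page from the database. A language-neutral sketch (the URL shape and function names are hypothetical, following the gfdg.aspx example in this thread):

```python
# Sketch of the custom-404 approach: the requested path doesn't exist on
# disk, so the handler recovers the product id from the friendly URL and
# builds the response dynamically. Names here are hypothetical.
import re

def handle_404(path):
    # e.g. "/gfdg.aspx/productID/1234" -> build the page for product 1234
    m = re.search(r"/productID/(\d+)", path)
    if m:
        product_id = int(m.group(1))
        # in a real handler: look product_id up in the database
        return f"<html>Product page for #{product_id}</html>"
    # genuinely unknown URL: fall back to an ordinary 404 page
    return "<html>Sorry, page not found</html>"

print(handle_404("/gfdg.aspx/productID/1234"))
print(handle_404("/no/such/page"))
```

One caveat with this approach: unless the handler also resets the status code, the response may still go out as a 404, which crawlers treat as a missing page.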

"baroque Chou" <ba********@126.com> wrote in message
news:11**********************@g44g2000cwa.googlegroups.com...
Thank you very much. Someone suggested using a rewrite
rule to make the URL more search-engine friendly,
e.g. rewriting gfdg.aspx?productID=1234 to gfdg.aspx/productID/1234.
But this page doesn't actually exist on my web server;
what exists is just the source page, and the "instance" of that page is
created every time by an individual request. So do I need to archive the
instance of that page to some location (with the directory hierarchy
arranged to follow the URL pattern, so that the spider can
crawl it better)?

Jan 27 '06 #7

Why not just URL rewriting? Much cleaner.
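URL rewriting means mapping the friendly URL back to the real parameterised page before it reaches the application, so no 404 handler is involved. A single-rule sketch (regex and page names are hypothetical, following the gfdg.aspx example earlier in the thread):

```python
# Sketch of a rewrite rule: translate the bot-friendly path into the
# real query-string URL the application actually serves. Hypothetical
# names; real servers do this in IIS/ASP.NET rewrite configuration.
import re

def rewrite(path):
    # /gfdg.aspx/productID/1234  ->  /gfdg.aspx?productID=1234
    return re.sub(r"^/gfdg\.aspx/productID/(\d+)$",
                  r"/gfdg.aspx?productID=\1", path)

print(rewrite("/gfdg.aspx/productID/1234"))  # /gfdg.aspx?productID=1234
print(rewrite("/other.aspx"))                # unchanged
```

Unlike the 404-handler trick, the application serves a normal 200 response, so there is no risk of the page being treated as missing.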



--
Alan Silver
(anything added below this line is nothing to do with me)
Feb 2 '06 #8

This discussion thread is closed