Teaching a Crawler to Identify a Blog

Hello All,

I am currently trying to teach a web crawler how to identify blogs;
that is, I am trying to determine a fairly inclusive set of criteria
that will help my crawler recognize them.

I have noticed that many blogs include

div class=blogsomething (a formatting class conveniently named "blog...")

XML tags

and/or PHP code.

I do know that a CMS (content management system) is used for several
blogs. Does anyone else have any suggestions to help me determine
criteria?
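
For what it's worth, here is a minimal sketch of how the class-name idea could be checked, using only Python's standard html.parser. The generator meta tag check is an assumption added for illustration (many CMSs emit one, but not all), and the names BlogMarkupHints and markup_hints are hypothetical. Both results are weak hints, not proof that a page is a blog.

```python
from html.parser import HTMLParser


class BlogMarkupHints(HTMLParser):
    """Collects two weak hints: 'blog' in a class name, and a generator meta tag."""

    def __init__(self):
        super().__init__()
        self.blog_classes = []  # class names that contain "blog"
        self.generator = None   # content of <meta name="generator">, if present

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several space-separated names.
        for name in (attrs.get("class") or "").split():
            if "blog" in name.lower():
                self.blog_classes.append(name)
        # Assumption: a CMS may announce itself in a generator meta tag.
        if tag == "meta" and (attrs.get("name") or "").lower() == "generator":
            self.generator = attrs.get("content")


def markup_hints(html):
    """Return the hints found in one HTML document as a dict."""
    parser = BlogMarkupHints()
    parser.feed(html)
    return {"blog_classes": parser.blog_classes, "generator": parser.generator}
```

A crawler would probably treat these as one input among several rather than as a yes/no answer.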

I am aware that any criteria are subjective, especially when
considering sites such as Slashdot, which has been around longer than
blogs...

thanks,
David
Jul 17 '05 #1
Metropolis wrote:
> I am currently trying to teach a web crawler how to identify blogs,
> that is I am trying to determine a fairly inclusive set of criteria
> that will help my crawler to identify them.
>
> I have noticed that many Blogs include
>
> div class=blogsomething (A format class conveniently named blog)

Maybe *some* blogs contain this tag, but I'm betting most don't.

> xml tags

So do lots of other websites, and I'm betting other websites have 'em more
than blogs do.

> and/or php code.


How can you tell it's a PHP document? You can't see any PHP code, because
what you are served is the HTML it generated. The only hint you usually get
is a URL that ends in .php, and not all PHP pages end in .php.

In any case, just because it's PHP doesn't make it a blog.
I think you'll need to do a lot more than your suggestions here to determine
whether it's a blog or not.
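
To illustrate how little a crawler can actually learn about PHP from the outside, here is a rough sketch. It assumes the crawler already has the URL and the response headers as a plain dict; php_hints is a hypothetical name. The X-Powered-By header is an extra hint not mentioned above: some servers send it, many are configured not to, so its absence means nothing.

```python
from urllib.parse import urlparse


def php_hints(url, headers):
    """Return a list of weak signs that a page was generated by PHP.

    `headers` is a plain dict of response headers (an assumption of this
    sketch; fetching the page is left to the crawler).
    """
    hints = []
    # Look only at the path, so a query string like ?p=3 does not interfere.
    if urlparse(url).path.lower().endswith(".php"):
        hints.append("url-extension")
    # Header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    if "php" in lowered.get("x-powered-by", "").lower():
        hints.append("x-powered-by")
    return hints
```

Even when both hints fire, all this says is "probably PHP", which says nothing about whether the page is a blog.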

A lot of them do have date boxes on the page somewhere so you can navigate
back to previous days' postings. Things like this, and other elements that
are common to blogs, are what you should be looking for, not stuff like
whether the page contains XML-style tags or has a PHP file extension.
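
As a sketch of what looking for such elements might involve, the code below counts links whose paths look like date archives and notes whether the page advertises an RSS or Atom feed. Both signals are assumptions chosen for illustration: the /YYYY/MM/ URL pattern and the feed link are common on blogs but are not required by anything, and the names here are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: a path segment such as /2005/07/ or /2005/07/17/...
DATE_PATH = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])(/|$)")
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class BlogStructureHints(HTMLParser):
    """Counts date-style archive links and detects an advertised feed."""

    def __init__(self):
        super().__init__()
        self.dated_links = 0
        self.has_feed = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and DATE_PATH.search(attrs.get("href") or ""):
            self.dated_links += 1
        if tag == "link" and (attrs.get("type") or "").lower() in FEED_TYPES:
            self.has_feed = True


def structure_hints(html):
    """Return (number of date-style links, whether a feed link was found)."""
    parser = BlogStructureHints()
    parser.feed(html)
    return parser.dated_links, parser.has_feed
```

News sites and forums will trip these checks too, so they are better used to score a page than to classify it outright.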

--
Chris Hope - The Electric Toolbox - http://www.electrictoolbox.com/
Jul 17 '05 #2
Chris Hope <bl*******@electrictoolbox.com> wrote in message news:<11*************@216.128.74.129>...
> Metropolis wrote:
> > I am currently trying to teach a web crawler how to identify blogs,
> > that is I am trying to determine a fairly inclusive set of criteria
> > that will help my crawler to identify them.
> >
> > I have noticed that many Blogs include
> >
> > div class=blogsomething (A format class conveniently named blog)
>
> Maybe *some* blogs contain this tag, but I'm betting most don't.
>
> > xml tags
>
> So do lots of other websites, and I'm betting other websites have 'em more
> than blogs do.
>
> > and/or php code.
>
> How can you tell it's a PHP document? You can't see any PHP code because
> what you are served up is a static HTML page. The only hint you can have is
> that the file extension ends with .php but not all PHP pages end in .php
>
> In any case just because it's PHP doesn't make it a blog.


All true.

Start by thinking about how -you- identify a blog. That ain't easy,
if my attempts at explaining what a blog is to other people are any
indication.

Look for references to time and self. E.g. "yesterday, I"
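
One rough way to try the "time and self" idea is a regular expression that looks for a time word within a few words of a first-person word, in either order. The word lists and the three-word window below are guesses made up for this sketch, and time_and_self_hits is a hypothetical name; a real crawler would need to tune all of it against pages it has already labelled.

```python
import re

# Assumed word lists; extend or replace them to suit the pages being crawled.
TIME_WORDS = r"(?:yesterday|today|tonight|this morning|last night|last week)"
SELF_WORDS = r"(?:I|me|my|we|our)"

# A time word followed within three words by a first-person word, or the reverse.
PATTERN = re.compile(
    rf"\b{TIME_WORDS}\b\W+(?:\w+\W+){{0,3}}?{SELF_WORDS}\b"
    rf"|\b{SELF_WORDS}\b\W+(?:\w+\W+){{0,3}}?{TIME_WORDS}\b",
    re.IGNORECASE,
)


def time_and_self_hits(text):
    """Count non-overlapping places where time and self references sit close together."""
    return len(PATTERN.findall(text))
```

The count should be run on visible page text with the markup stripped, and it only suggests a diary-like voice; plenty of blogs are not written that way.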

What IS a blog, anyway?

Not duck soup, or a piece of cake, this problem.
Jul 17 '05 #3
