Teaching a Crawler to Identify a Blog

Metropolis

Hello All,

I am currently trying to teach a web crawler how to identify blogs;
that is, I am trying to determine a fairly inclusive set of criteria
that will help my crawler identify them.

I have noticed that many blogs include

div class=blogsomething (a formatting class conveniently named blog)

xml tags

and/or php code.

I do know that a CMS (content management system) is used for several
blogs. Does anyone else have any suggestions to help me determine
criteria?

I am aware that any criteria are subjective, especially when
considering sites such as Slashdot, which has been around longer than
blogs...

thanks,
David

Jul 17 '05 #1


Chris Hope

Metropolis wrote:

> I am currently trying to teach a web crawler how to identify blogs,
> that is I am trying to determine a fairly inclusive set of criteria
> that will help my crawler to identify them.
>
> I have noticed that many Blogs include
>
> div class=blogsomething (A format class conveniently named blog)

Maybe *some* blogs contain this tag, but I'm betting most don't.

> xml tags

So do lots of other websites, and I'm betting other websites have 'em more
than blogs do.

> and/or php code.

How can you tell it's a PHP document? You can't see any PHP code, because
what you are served is the generated HTML. The only hint you have is a URL
ending in .php, and not all PHP pages end in .php.

In any case just because it's PHP doesn't make it a blog.
I think you'll need to do a lot more than your suggestions here to determine
if it's a blog or not.

A lot of them do have date boxes on the page somewhere so you can navigate
back to previous days' postings. Things like this, and other elements that
are common to blogs, are what you should be looking for, and not stuff like
whether it contains XML-style tags or PHP file extensions.
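
As a rough illustration of the kind of check discussed above, here is a minimal Python sketch that counts two page-level hints: elements whose class attribute mentions "blog" (the original poster's idea) and links that look like date-based archives (the navigation described here). The names `BlogSignalParser`, `blog_signals`, and the `ARCHIVE_HREF` URL pattern are assumptions made for this example, not an established rule; neither count proves that a page is a blog.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for date-based archive URLs such as /2005/07/ or /2005/07/17/.
ARCHIVE_HREF = re.compile(r"/(19|20)\d{2}/(0?[1-9]|1[0-2])(/|$)")


class BlogSignalParser(HTMLParser):
    """Counts two hypothetical blog hints in an HTML page."""

    def __init__(self):
        super().__init__()
        self.blog_classes = 0   # elements whose class attribute mentions "blog"
        self.archive_links = 0  # <a> links that look like date-based archives

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Attribute values can be None (e.g. <div class>), so fall back to "".
        css_class = attrs.get("class") or ""
        if "blog" in css_class.lower():
            self.blog_classes += 1
        href = attrs.get("href") or ""
        if tag == "a" and ARCHIVE_HREF.search(href):
            self.archive_links += 1


def blog_signals(html):
    """Return the raw hint counts; weighting them is left to the caller."""
    parser = BlogSignalParser()
    parser.feed(html)
    parser.close()
    return {"blog_classes": parser.blog_classes,
            "archive_links": parser.archive_links}
```

A crawler could treat these counts as weak evidence and combine them with other hints before deciding, since plenty of non-blog sites use dated URLs too.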

--
Chris Hope - The Electric Toolbox - http://www.electrictoolbox.com/

Jul 17 '05 #2

Razzbar

Chris Hope <bl*******@electrictoolbox.com> wrote in message news:<11*************@216.128.74.129>...

> Metropolis wrote:
>> I am currently trying to teach a web crawler how to identify blogs,
>> that is I am trying to determine a fairly inclusive set of criteria
>> that will help my crawler to identify them.
>>
>> I have noticed that many Blogs include
>>
>> div class=blogsomething (A format class conveniently named blog)
>
> Maybe *some* blogs contain this tag, but I'm betting most don't.
>
>> xml tags
>
> So do lots of other websites, and I'm betting other websites have 'em more
> than blogs do.
>
>> and/or php code.
>
> How can you tell it's a PHP document? You can't see any PHP code because
> what you are served up is a static HTML page. The only hint you can have is
> that the file extension ends with .php but not all PHP pages end in .php
>
> In any case just because it's PHP doesn't make it a blog.

All true.

Start by thinking about how -you- identify a blog. That ain't easy,
if my attempts at explaining what a blog is to other people are any
indication.

Look for references to time and self. E.g. "yesterday, I"
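
To make the "time and self" idea concrete, here is a small Python sketch that counts phrases where a time word is followed directly by a first-person word, as in "yesterday, I". The word lists and the name `count_time_self_phrases` are illustrative assumptions only; real blog text would need much broader lists, and a high count is a hint rather than proof.

```python
import re

# Illustrative word lists only; these are assumptions, not a vetted vocabulary.
TIME_WORDS = r"(?:today|yesterday|tonight|this morning|last night|last week)"
SELF_WORDS = r"(?:I|I'm|I've|we|my|our)"

# A time word, an optional comma, whitespace, then a first-person word.
TIME_THEN_SELF = re.compile(
    rf"\b{TIME_WORDS}\b,?\s+{SELF_WORDS}\b", re.IGNORECASE
)


def count_time_self_phrases(text):
    """Count non-overlapping 'time word + first-person word' phrases."""
    return len(TIME_THEN_SELF.findall(text))


if __name__ == "__main__":
    sample = "Yesterday, I went to the park. Last night we watched a film."
    print(count_time_self_phrases(sample))
```

The count would normally be taken over the visible text of a page (tags stripped) and scaled by page length, so that long pages are not favoured just for being long.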

What IS a blog, anyway?

Not duck soup, or a piece of cake, this problem.

Jul 17 '05 #3
