Teaching a Crawler to Identify a Blog

Hello All,

I am currently trying to teach a web crawler how to identify blogs,
that is, I am trying to determine a fairly inclusive set of criteria
that will help my crawler to identify them.

I have noticed that many blogs include

div class=blogsomething (a format class conveniently named blog)

xml tags

and/or php code.

I do know that a CMS (content management system) is used for several
blogs. Does anyone else have any suggestions to help me determine
criteria?

I am aware that any criteria are subjective, especially when
considering sites such as Slashdot, which has been around longer than
blogs...
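
As a rough illustration of the first criterion only, here is a minimal sketch that scans a page for class attributes containing "blog". It assumes Python and its standard html.parser module; the names BlogClassFinder and has_blog_class are made up for this example, and a match is only a weak hint, not proof that the page is a blog.

```python
from html.parser import HTMLParser


class BlogClassFinder(HTMLParser):
    """Collects (tag, class value) pairs whose class contains 'blog'."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value and "blog" in value.lower():
                self.matches.append((tag, value))


def has_blog_class(html):
    """Return True if any element has a class attribute containing 'blog'."""
    finder = BlogClassFinder()
    finder.feed(html)
    return bool(finder.matches)
```

Calling has_blog_class(page_source) on a fetched page gives one yes/no signal that would still need to be combined with other criteria.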

thanks,
David
Jul 17 '05 #1
Metropolis wrote:
> I am currently trying to teach a web crawler how to identify blogs,
> that is, I am trying to determine a fairly inclusive set of criteria
> that will help my crawler to identify them.
>
> I have noticed that many blogs include
>
> div class=blogsomething (a format class conveniently named blog)

Maybe *some* blogs contain this tag, but I'm betting most don't.

> xml tags

So do lots of other websites, and I'm betting other websites have 'em more
than blogs do.

> and/or php code.


How can you tell it's a PHP document? You can't see any PHP code because
what you are served up is a static HTML page. The only hint you can have is
that the URL ends with .php, but not all PHP pages end in .php.

In any case just because it's PHP doesn't make it a blog.
I think you'll need to do a lot more than your suggestions here to determine
if it's a blog or not.

A lot of them do have date boxes on the page somewhere so you can navigate
back to previous days' postings. Things like this, and other elements that
are common to blogs, are what you should be looking for, and not stuff like
whether it contains XML-style tags or PHP file extensions.
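
To make the date idea a bit more concrete, here is a minimal sketch in Python that counts date-like strings and archive-style links in a page's HTML. The patterns and the name date_signals are assumptions made up for illustration; real pages use many more date formats, so treat the counts as rough signals only.

```python
import re

# Month names, abbreviated or in full (English only; an assumption).
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|"
         r"July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|"
         r"Nov(?:ember)?|Dec(?:ember)?)")

DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),               # 2005-07-17
    re.compile(rf"\b{MONTH}\.? \d{{1,2}},? \d{{4}}\b"),  # July 17, 2005
    re.compile(rf"\b\d{{1,2}} {MONTH}\.? \d{{4}}\b"),    # 17 Jul 2005
]

# Links whose path contains /YYYY/MM, a common archive URL layout.
ARCHIVE_HREF = re.compile(r'href=["\']?[^"\'>\s]*/(?:19|20)\d{2}/\d{1,2}\b')


def date_signals(html):
    """Count date-like strings and archive-style links in raw HTML."""
    dates = sum(len(p.findall(html)) for p in DATE_PATTERNS)
    archives = len(ARCHIVE_HREF.findall(html))
    return {"dates": dates, "archive_links": archives}
```

A page with many dated entries and archive links scores higher than a static page, but news sites will score high too, so this cannot separate blogs from other dated content on its own.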

--
Chris Hope - The Electric Toolbox - http://www.electrictoolbox.com/
Jul 17 '05 #2
Chris Hope <bl*******@electrictoolbox.com> wrote in message news:<11*************@216.128.74.129>...
> Metropolis wrote:
>> I am currently trying to teach a web crawler how to identify blogs,
>> that is, I am trying to determine a fairly inclusive set of criteria
>> that will help my crawler to identify them.
>>
>> I have noticed that many blogs include
>>
>> div class=blogsomething (a format class conveniently named blog)
>
> Maybe *some* blogs contain this tag, but I'm betting most don't.
>
>> xml tags
>
> So do lots of other websites, and I'm betting other websites have 'em more
> than blogs do.
>
>> and/or php code.
>
> How can you tell it's a PHP document? You can't see any PHP code because
> what you are served up is a static HTML page. The only hint you can have is
> that the URL ends with .php, but not all PHP pages end in .php.
>
> In any case just because it's PHP doesn't make it a blog.


All true.

Start by thinking about how -you- identify a blog. That ain't easy,
if my attempts at explaining what a blog is to other people are any
indication.

Look for references to time and self. E.g. "yesterday, I"
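
One rough way to try this, sketched in Python: measure what share of the words in a page's visible text are first-person or relative-time words. The word lists and the name time_self_ratio are assumptions picked for illustration, and the input is assumed to be plain text with the HTML already stripped.

```python
import re

# Small, hand-picked word lists; both are assumptions, not a tested lexicon.
TIME_WORDS = {"today", "yesterday", "tomorrow", "tonight", "lately", "recently"}
SELF_WORDS = {"i", "me", "my", "mine", "myself", "i'm", "i've", "i'll", "i'd"}

WORD = re.compile(r"[a-z']+")


def time_self_ratio(text):
    """Return the fraction of words that refer to time or to the writer."""
    words = WORD.findall(text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in TIME_WORDS or w in SELF_WORDS)
    return hits / len(words)
```

For "Yesterday, I went to the store" two of the six words count, so the ratio is one third. Where to put a cutoff would have to come from comparing known blogs against known non-blogs.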

What IS a blog, anyway?

Not duck soup, or a piece of cake, this problem.
Jul 17 '05 #3
