Spidering the web to find RDF

Last year, I ran an experiment: I let a very polite
web spider run for a few days, trying to find RDF markup
embedded in web pages. I found close to zero RDF - not
encouraging!

In a recent post, I complained about not being able to
embed RDF in XHTML (at least, there is no standard way to do it
and still pass the W3C XHTML validator). Another poster
(Jeen Broekstra) provided a good example of simply
linking to an RDF file at the same site.

I was concerned about spiders being able to find
links to RDF, because there is no standard for this.
Then a few minutes ago I had one of those "Duh!" experiences:

A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf". If such a link is found, assume that
it describes the page that links to it.
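
As a rough illustration of the idea above, here is a minimal Python sketch that collects same-site links whose path ends in ".rdf". The names `LinkCollector` and `same_site_rdf_links` are hypothetical, "same site" is assumed to mean "same host", and only the standard library is used. It is a sketch of the heuristic, not a complete spider.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> and <link> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_rdf_links(page_url, html):
    """Return absolute same-host links whose path ends in '.rdf'."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc  # assumption: same site == same host
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.netloc == host and parts.path.lower().endswith(".rdf"):
            found.append(absolute)
    return found
```

Checking `parts.path` rather than the whole URL means a query string or fragment after ".rdf" does not hide the extension.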

Anyway, I will try my experiment again (when I have
time to set it up) and report the results. I hope that
lots of people link to separate RDF files on their sites
and my results will be better than last year when I
only looked for embedded RDF.

-Mark
Jul 20 '05 #1
In article <a7**************************@posting.google.com >, one of infinite monkeys
at the keyboard of ma***@markwatson.com (Mark Watson) wrote:
A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf".
Ahem ... the last few characters of a URL have absolutely no significance
except by convention. A spider that did that would be broken.

It could, however, look for links with the type="application/rdf+xml"
attribute. It would find a couple in my pages, for instance.
If such a link is found, assume that
it decribes to the page linking it.
Wouldn't it be better to believe the RDF concerning its own subject?
only looked for embedded RDF.


I played with embedding RDF (for automatically-generated reports),
but abandoned the idea as a nonstarter.

--
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
Jul 20 '05 #2
Nick Kew wrote:
In article <a7**************************@posting.google.com >,
one of infinite monkeys at the keyboard of
ma***@markwatson.com (Mark Watson) wrote:
A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf".


Ahem ... the last few characters of a URL have absolutely no
significance except by convention. A spider that did that
would be broken.

It could, however, look for links with the
type="application/rdf+xml" attribute. It would find a couple
in my pages, for instance.


That would, however, only work if the web server from which the
file is hosted is aware of this mime type. I don't know if Apache
comes preconfigured with it these days but I'll bet that older
versions won't spot it (for example, my rdf file would not be
found since the department web server serves it as text/plain).

You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.
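
One way to "try to parse" loosely typed responses is to check whether the document's root element is `rdf:RDF`. The sketch below assumes the RDF is serialized as RDF/XML with that root element; the function name `sniff_rdf` and the set of media types worth sniffing (including text/plain, as in the misconfigured-server case above) are illustrative choices, not an established rule.

```python
import xml.etree.ElementTree as ET

# Qualified name of the rdf:RDF root element in RDF/XML.
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"

# Assumption: media types that may hide RDF behind a generic label.
SNIFFABLE_TYPES = {"text/xml", "application/xml", "text/plain"}


def sniff_rdf(content_type, body):
    """Guess whether a fetched response is RDF/XML.

    content_type is the Content-Type header value; body is the raw bytes.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/rdf+xml":
        return True  # correctly labelled, no need to sniff
    if media_type not in SNIFFABLE_TYPES:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # not well-formed XML at all
    return root.tag == RDF_ROOT
```

This only recognizes documents whose root is `rdf:RDF`; RDF/XML that omits that wrapper element would be missed.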

Jeen
--
Jeen Broekstra http://www.cs.vu.nl/~jbroeks/

New York is real. The rest is done with mirrors.
Jul 20 '05 #3
In article <sl********************@flits.cs.vu.nl>, one of infinite monkeys
at the keyboard of Jeen Broekstra <jb*****@not4mail.cs.vu.nl> wrote:
It could, however, look for links with the
type="application/rdf+xml" attribute. It would find a couple
in my pages, for instance.
That would, however, only work if the web server from which the
file is hosted is aware of this mime type.

Nope. I said attribute.
<link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">
I don't know if Apache
comes preconfigured with it these days but I'll bet that older
Neither do I; in any case it wouldn't do anything for the above example
which I deliberately (and perfectly legitimately) ended with .html
The server should of course serve it with the correct MIME type,
but that's another issue.
You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.


Even if .rdf gets something, it'll miss out on lots of .cgi, .php,
.xml and other things. It's simply broken.

Relying on the attribute will also miss out on many instances.
It is simply a more correct thing to look for in (X)HTML links
than ".rdf".
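
For illustration, a minimal Python sketch of looking for the type attribute rather than the URL ending might look like this. The names `RdfLinkFinder` and `find_declared_rdf` are hypothetical, and checking `<a>` as well as `<link>` elements is an assumption about where authors put the attribute.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RDF_TYPE = "application/rdf+xml"


class RdfLinkFinder(HTMLParser):
    """Collect links whose type attribute declares RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        declared = (attr.get("type") or "").strip().lower()
        href = attr.get("href")
        if href and declared == RDF_TYPE:
            # The URL ending is deliberately ignored here.
            self.rdf_urls.append(urljoin(self.base_url, href))


def find_declared_rdf(base_url, html):
    """Return absolute URLs of links declared as application/rdf+xml."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_urls
```

With the `<link>` example quoted above, this picks up `metadata-for-page.html` even though nothing in the URL suggests RDF.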

--
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
Jul 20 '05 #4
Jeen Broekstra <jb*****@not4mail.cs.vu.nl> wrote in message news:<sl********************@flits.cs.vu.nl>...
You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.


It sounds like what I need to do is to roll all the ideas for spidering
RDF together and be as opportunistic as possible in collecting RDF.

So, I will use both Nick's and Jeen's ideas.

Thanks,
Mark
Jul 20 '05 #5
Nick Kew wrote:
In article <sl********************@flits.cs.vu.nl>, one of
infinite monkeys at the keyboard of Jeen Broekstra
<jb*****@not4mail.cs.vu.nl> wrote:
It could, however, look for links with the
type="application/rdf+xml" attribute. It would find a
couple in my pages, for instance.


That would, however, only work if the web server from which the
file is hosted is aware of this mime type.

Nope. I said attribute.
<link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">


Blimey. My bad, I completely misread your post.

Jeen
--
Jeen Broekstra http://www.cs.vu.nl/~jbroeks/

Write a wise saying and your name will live forever.
-- Anonymous
Jul 20 '05 #6
In article <a7**************************@posting.google.com >, one of infinite monkeys
at the keyboard of ma***@markwatson.com (Mark Watson) wrote:
It sounds like what I need to do is to roll all the ideas for spidering
RDF together and be as opportunistic as possible in collecting RDF.


My previous post was just a correction to something you said, which I
felt called for correction because it so often leads to confusion.

My *practical* suggestion would be to send HEAD requests from the spider
to ascertain the type of any URL before actually fetching it. Then fetch
HTML and XHTML pages to spider for more links, and RDF pages for your
collection.
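
A minimal sketch of that HEAD-first approach in Python, using only the standard library, could look like the following. The function names, the "collect"/"spider"/"skip" labels, and the exact sets of media types are assumptions made for illustration; this is not the spidering software mentioned below.

```python
import urllib.request

HTML_TYPES = {"text/html", "application/xhtml+xml"}
RDF_TYPES = {"application/rdf+xml"}


def classify(content_type):
    """Decide what to do with a URL based on its Content-Type header."""
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type in RDF_TYPES:
        return "collect"  # fetch and add to the RDF collection
    if media_type in HTML_TYPES:
        return "spider"   # fetch and scan for more links
    return "skip"


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def plan_fetch(url):
    """HEAD the URL first, then decide whether a full GET is worthwhile."""
    return classify(head_content_type(url))
```

This relies on the server reporting the type accurately, so RDF served as text/plain would be classified as "skip".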

I happen to have spidering software that'll do all that - among other
things:-) Though I have the feeling you may not have the budget for it,
given the experimental nature of your task.

--
Nick Kew
Jul 20 '05 #7
