Convert ODP RDF to Static HTML Pages?

I would like to download the RDF dump and generate static HTML pages (with
customizable headers and footers). I have only found one program called
iHierarchy that claims to do this ( http://simiax.com/ihier.html ) however
it is $199 and a demo to test is not available. Plus they're not answering
emails :(

Are there any other applications that will do this? Also, does anyone know or
care to guesstimate the size of the final static HTML output? I have a
dedicated server with about 60 GB of free space so hopefully there would be
room for the output.

I would need the program to maintain the exact ODP directory structure:

i.e. www.domain.com/ODP/Recreation/Outdoors/
Jul 23 '05 #1
Me

"DesignGuy" <do********@nowhere.com> wrote in message
news:vuW8d.316151$mD.193055@attbi_s02...
I would like to download the RDF dump and generate static HTML pages (with
customizable headers and footers). I have only found one program called
iHierarchy that claims to do this ( http://simiax.com/ihier.html ) however
it is $199 and a demo to test is not available. Plus they're not answering
emails :(

Are there any other applications that will do this? Also, does anyone know or
care to guesstimate the size of the final static HTML output? I have a
dedicated server with about 60 GB of free space so hopefully there would be
room for the output.

I would need the program to maintain the exact ODP directory structure:

i.e. www.domain.com/ODP/Recreation/Outdoors/

Powerseek from http://www.focalmedia.net/ will do the job

Regards

Eddie

http://www.englandsportal.com

Jul 23 '05 #2
On Wed, 06 Oct 2004 18:06:19 +0000, DesignGuy wrote:
I would like to download the RDF dump and generate static HTML pages
(with customizable headers and footers).


As long as you are complying with the proper attribution portion
of the "NOD" license that the ODP content is covered under, there are
dozens of applications that can convert the RDF to HTML on
POSIX-compliant platforms (which excludes Microsoft Windows, of course).
I've been happily converting RDF to HTML for at least two years with some
Perl glue.
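
For anyone wondering what that kind of glue looks like, here is a minimal sketch in Python rather than Perl. It is not the poster's script. It assumes the content dump is XML whose ExternalPage elements carry an "about" attribute plus Title, Description and topic children, and that topic paths start with "Top/". The function names (read_listings, write_pages), the sample record and the index.html naming are all made up for illustration.

```python
import html
import os
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample record; the element layout is an assumption about the dump.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/hiking">
    <d:Title>Example Hiking Club</d:Title>
    <d:Description>Trail reports &amp; maps.</d:Description>
    <topic>Top/Recreation/Outdoors</topic>
  </ExternalPage>
</RDF>
"""


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_listings(source):
    """Stream (topic, url, title, description) tuples from an RDF dump."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        fields = {local(child.tag): (child.text or "") for child in elem}
        yield (fields.get("topic", ""), elem.get("about", ""),
               fields.get("Title", ""), fields.get("Description", ""))
        elem.clear()  # release memory once the record has been consumed


def write_pages(listings, out_root, header="", footer=""):
    """Write one index.html per topic, mirroring the category path on disk."""
    by_topic = defaultdict(list)
    for topic, url, title, desc in listings:
        by_topic[topic].append((url, title, desc))
    written = []
    for topic, links in by_topic.items():
        # "Top/Recreation/Outdoors" -> <out_root>/Recreation/Outdoors/index.html
        parts = [p for p in topic.split("/") if p and p != "Top"]
        folder = os.path.join(out_root, *parts)
        os.makedirs(folder, exist_ok=True)
        items = "\n".join(
            '<li><a href="{}">{}</a> - {}</li>'.format(
                html.escape(url, quote=True), html.escape(title), html.escape(desc))
            for url, title, desc in links)
        path = os.path.join(folder, "index.html")
        with open(path, "w", encoding="utf-8") as fh:
            # Custom header and footer are simply wrapped around the link list.
            fh.write(header + "<ul>\n" + items + "\n</ul>\n" + footer)
        written.append(path)
    return written
```

Grouping everything in memory keeps the sketch short. For the full dump you would probably want to write each category out as soon as it is complete, and put the required attribution into the footer.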
Jul 23 '05 #3
On Sun, 10 Oct 2004 15:06:51 -0400, John Doe wrote:
As long as you are complying with the proper attribution portion
of the "NOD" license that the ODP content is covered under


Oops, I should have clarified... if you're repurposing the content
from ODP, with your custom headers and footers, you have to include the
following attribution in every page:

<p><table border="0" bgcolor="#336600" cellpadding="3" cellspacing="0">
<tr>
<td>
<table width="100%" cellpadding="2" cellspacing="0" border="0">
<tr align="center">
<td><font face="sans-serif, Arial, Helvetica" size="2"
color="#FFFFFF">Help build the largest human-edited
directory on the web.</font></td></tr>
<tr bgcolor="#CCCCCC" align="center">
<td><font face="sans-serif, Arial, Helvetica" size="2"> <a
href="http://dmoz.org/cgi-bin/add.cgi?where=$cat">Submit
a Site</a> - <a href="http://dmoz.org/about.html"><b>Open Directory
Project</b></a> -
<a href="http://dmoz.org/cgi-bin/apply.cgi?where=$cat">Become
an Editor</a> </font>
</td></tr>
</table>
</td>
</tr>
</table>

This is documented here:

http://dmoz.org/become_an_editor/
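
Filling in the $cat placeholder for each generated page is easy to automate. Here is a hedged sketch using Python's string.Template. The template below is an abbreviated stand-in for the table quoted above (only the three links), and the assumption that $cat is the URL-quoted category path without the leading "Top/" is mine, not something stated in this thread. The name attribution_for is made up.

```python
from string import Template
from urllib.parse import quote

# Abbreviated stand-in for the attribution table quoted above; only the
# three links are reproduced, two of which contain the $cat placeholder.
ATTRIBUTION = Template(
    '<a href="http://dmoz.org/cgi-bin/add.cgi?where=$cat">Submit a Site</a> - '
    '<a href="http://dmoz.org/about.html"><b>Open Directory Project</b></a> - '
    '<a href="http://dmoz.org/cgi-bin/apply.cgi?where=$cat">Become an Editor</a>'
)


def attribution_for(category):
    """Return the attribution links with $cat filled in for one category."""
    # Assumption: $cat is the category path without "Top/", URL-quoted.
    return ATTRIBUTION.substitute(cat=quote(category, safe="/"))
```

Calling attribution_for("Recreation/Outdoors") once per generated page and appending the result to your custom footer would cover every page in one pass.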

Jul 23 '05 #4
DesignGuy wrote:
I would like to download the RDF dump and generate static HTML pages
(with customizable headers and footers). I have only found one
program called iHierarchy that claims to do this (
http://simiax.com/ihier.html ) however it is $199 and a demo to test
is not available. Plus they're not answering emails :(

Are there any other applications that will do this? Also, does anyone
know or care to guesstimate the size of the final static HTML output?
I have a dedicated server with about 60 GB of free space so hopefully
there would be room for the output.

I would need the program to maintain the exact ODP directory
structure:

i.e. www.domain.com/ODP/Recreation/Outdoors/


Just a Plan B suggestion here: screenscrape the content dynamically. No
RDF dump, no storage.

--
Google Blogoscoped
http://blog.outer-court.com
Jul 23 '05 #5
On Mon, 18 Oct 2004 08:34:10 +0000, Philipp Lenssen wrote:
Just a Plan B suggestion here: screenscrape the content dynamically. No
RDF dump, no storage.


And risk getting your IP banned? Not a smart move. Google, Slashdot
and DMOZ all ban IPs from time to time to keep abusers and abusive spiders
at bay.

One word: Don't.
Jul 23 '05 #6
John Doe wrote:
On Mon, 18 Oct 2004 08:34:10 +0000, Philipp Lenssen wrote:
Just a Plan B suggestion here: screenscrape the content
dynamically. No RDF dump, no storage.


And risk getting your IP banned? Not a smart move. Google, Slashdot
and DMOZ all do this from time to time to keep abusers and abusive
spiders at bay.


I'm using a mixture of the Google API and screen-scraping and it has worked
well so far (you can search, or browse):
http://findforward.com/?t=directory

After all it's called "open" directory, so I'm taking their word for it.

--
Google Blogoscoped
http://blog.outer-court.com
Jul 23 '05 #7
Us

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:2t*************@uni-berlin.de...
John Doe wrote:
On Mon, 18 Oct 2004 08:34:10 +0000, Philipp Lenssen wrote:
> Just a Plan B suggestion here: screenscrape the content
> dynamically. No RDF dump, no storage.


And risk getting your IP banned? Not a smart move. Google, Slashdot
and DMOZ all do this from time to time to keep abusers and abusive
spiders at bay.


I'm using a mixture of Google API and screen-scraping and it worked
well so far (you can search, or browse):
http://findforward.com/?t=directory

After all it's called "open" directory, so I'm taking their word for it.

http://dmoz.org/license.html

Jul 23 '05 #8
Us wrote:

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:2t*************@uni-berlin.de...


I'm using a mixture of Google API and screen-scraping and it worked
well so far (you can search, or browse):
http://findforward.com/?t=directory

http://dmoz.org/license.html


OK, where does it say that screen-scraping is not allowed? I'm
actually asking, because it sounds like the kind of legalese written
for lawyers, not webmasters (all upper-case?).

Let's face it, what Google and other search engines do all day is
screen-scraping. They extract information and present it on their site.
Like the Google cache. Oh wait, there's not even a "noarchive" thingie
in DMOZ.org.
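
Whether a page opts out of caching can be checked mechanically. Here is a minimal sketch, assuming the opt-out is expressed as a <meta name="robots" content="..."> tag. The class and function names are mine, and it makes no claim about what dmoz.org actually serves.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(page_html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(page_html)
    return parser.directives
```

A page that wants to stay out of caches would show "noarchive" in the returned set. An empty set means the page states no restriction of this kind, which says nothing about what its license permits.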

--
Google Blogoscoped
http://blog.outer-court.com
Jul 23 '05 #9
Us

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:2t*************@uni-berlin.de...
Us wrote:

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:2t*************@uni-berlin.de...

>
> I'm using a mixture of Google API and screen-scraping and it worked
> well so far (you can search, or browse):
> http://findforward.com/?t=directory
> >

http://dmoz.org/license.html


OK, where does it read that screen-scraping is not allowed? I'm
actually asking, because it sounds like the kind of legalese written
for lawyers, not webmasters (all upper-case?).

Let's face it, what Google and other search engines do all day is
screen-scraping. They extract information and present it on their site.
Like the Google cache. Oh wait, there's not even a "noarchive" thingie
in DMOZ.org.


Your ignorance is matched by your poor eyesight. In fact only the last 2
paragraphs are in upper case and the terms and conditions should be clear
enough for anyone who is intending to use the data.

Incidentally, Dmoz is a directory, unlike Google, which is a search engine
and does spider sites. Dmoz does not.

Damnant quod non intelligunt (they condemn what they do not understand).


Jul 23 '05 #10
Us wrote:

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:2t*************@uni-berlin.de...
Us wrote:
>
http://dmoz.org/license.html
OK, where does it read that screen-scraping is not allowed? I'm
actually asking, because it sounds like the kind of legalese written
for lawyers, not webmasters (all upper-case?).


Your ignorance is matched by your poor eye sight. In fact only the
last 2 paragraphs are in upper case and the terms and conditions
should be clear enough for anyone who is intending to use the data.

So why don't you help me and point out the part where it says I can't
do with FindForward.com what I'm doing now? Then I could adjust my site.

Incidently, Dmoz is a directory, unlike google that is a search
engine and does spider sites. Dmoz does not.


So? My point was that Dmoz lets itself be spidered by search engines
because it doesn't use the "noarchive" tag. FindForward.com is a search
engine (or meta engine, if you want). I'm adding functionality on top
of the Google API, and other web services.

[Follow-up to alt.internet.search-engines only]
Jul 23 '05 #11
On 19 Oct 2004 14:22:57 GMT, Philipp Lenssen in alt.internet.search-engines wrote:
Us wrote:

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:2t*************@uni-berlin.de...
> Us wrote:
>> http://dmoz.org/license.html
>
> OK, where does it read that screen-scraping is not allowed? I'm
> actually asking, because it sounds like the kind of legalese written
> for lawyers, not webmasters (all upper-case?).

Your ignorance is matched by your poor eye sight. In fact only the
last 2 paragraphs are in upper case and the terms and conditions
should be clear enough for anyone who is intending to use the data.

So why don't you help me and point out the part where it reads I can't
do with FindForward.com what I'm doing now? Then I could adjust my site.

Incidently, Dmoz is a directory, unlike google that is a search
engine and does spider sites. Dmoz does not.

So? My point was that Dmoz lets itself be spidered by search engines
because it doesn't use the "noarchive" tag. FindForward.com is a search
engine (or meta engine, if you want). I'm adding functionality on top
of the Google API, and other web services.

[Follow-up to alt.internet.search-engines only]


Too bad, for some reason my newsreader isn't honouring this.

Very cute, you could have changed the followups at any time upthread,
but instead chose to do it now. Seems like someone is fearful of a
rebuttal...

--
marathon
"America is a stronger nation for the ACLU's uncompromising effort."
-- President John F. Kennedy
Jul 23 '05 #12
marathon wrote:
On 19 Oct 2004 14:22:57 GMT, Philipp Lenssen in
alt.internet.search-engines wrote:

So? My point was that Dmoz lets itself be spidered by search engines
because it doesn't use the "noarchive" tag. FindForward.com is a
search engine (or meta engine, if you want). I'm adding
functionality on top of the Google API, and other web services.

[Follow-up to alt.internet.search-engines only]


Too bad, for some reason my newsreader isn't honouring this.

Very cute, you could have changed the followups at any time upthread,
but instead chose to do it now. Seems like someone is fearful of a
rebuttal...


Actually, I typed lt.internet.search-engines so that caused some
confusion in my newsreader. In any case, when I restrict the follow-up,
it's because I don't like crossposts. I happen to read along in both
newsgroups. If you want to stay on topic, then maybe you can give the
rebuttal you are talking about -- I'll be happy to hear it, as I'm here
to learn.

--
Google Blogoscoped
http://blog.outer-court.com
Jul 23 '05 #13
On 20 Oct 2004 07:43:22 GMT, Philipp Lenssen in
alt.internet.search-engines wrote:
marathon wrote:
On 19 Oct 2004 14:22:57 GMT, Philipp Lenssen in
alt.internet.search-engines wrote:
> So? My point was that Dmoz lets itself be spidered by search engines
> because it doesn't use the "noarchive" tag. FindForward.com is a
> search engine (or meta engine, if you want). I'm adding
> functionality on top of the Google API, and other web services.

> [Follow-up to alt.internet.search-engines only]


Too bad, for some reason my newsreader isn't honouring this.

Very cute, you could have changed the followups at any time upthread,
but instead chose to do it now. Seems like someone is fearful of a
rebuttal...

Actually, I typed lt.internet.search-engines so that caused some
confusion in my newsreader. In any case, when I restrict the follow-up,
then because I don't like crossposts. I happen to read along in both
newsgroups. If you want to stay on topic, then maybe you can give the
rebuttal you are talking about -- I'll be happy to hear it, as I'm here
to learn.


Nope, not from me. I want to read the rebuttal as much as you appear to,
as I can always learn something new, too.

BTW cross posting isn't necessarily bad -- it's allowable when the post is
on topic in both newsgroups. In any decent newsreader,
cross posts can be marked read in the other groups. ;)

--
marathon
I came, I saw, I deleted all your files.
Jul 23 '05 #14
Us

"marathon" <M@linux.ca> wrote in message
news:qn************@barnyard.sweetpig.dyndns.org...
On 20 Oct 2004 07:43:22 GMT, Philipp Lenssen in
alt.internet.search-engines wrote:
marathon wrote:

On 19 Oct 2004 14:22:57 GMT, Philipp Lenssen in
alt.internet.search-engines wrote:
> So? My point was that Dmoz lets itself be spidered by search engines
> because it doesn't use the "noarchive" tag. FindForward.com is a
> search engine (or meta engine, if you want). I'm adding
> functionality on top of the Google API, and other web services.

> [Follow-up to alt.internet.search-engines only]

Too bad, for some reason my newsreader isn't honouring this.

Very cute, you could have changed the followups at any time upthread,
but instead chose to do it now. Seems like someone is fearful of a
rebuttal...

Actually, I typed lt.internet.search-engines so that caused some
confusion in my newsreader. In any case, when I restrict the follow-up,
then because I don't like crossposts. I happen to read along in both
newsgroups. If you want to stay on topic, then maybe you can give the
rebuttal you are talking about -- I'll be happy to hear it, as I'm here
to learn.


Nope not from me, I want to read the rebuttal as much as you appear to,
as I can always learn something new, to.

BTW cross posting isn't necessarily bad -- it's allowable when both are
in context for either/or newsgroups. In any decent newsreader,
cross posts can be marked read in the other groups. ;)

--
marathon
I came, I saw, I deleted all your files.


"
2. Attribution Requirement. As a material condition of this Open Directory
License, you must provide the below applicable attribution statements on (1)
all copies of the Open Directory, in whole or in part, and derivative works
thereof which are either distributed (internally or otherwise) or published
(made available on the Internet and/or internally over any internal
network/intranet or otherwise), whether distributed or published
electronically, on hard copy media or by any other means, and (2) on any
program/web page from which you directly link to/access any information
contained within the Open Directory, in whole or in part, or any derivative
work thereof: "

Or, in plain English, you are welcome to use ODP data so long as you place the
correct attribution on each page (as Google does in its directory).

You also mentioned the "Open" part. It is a common misconception that open
source software is free. It may be but it doesn't have to be.

Eddie


Jul 23 '05 #15
marathon wrote:
BTW cross posting isn't necessarily bad -- it's allowable when both
are in context for either/or newsgroups. In any decent newsreader,
cross posts can be marked read in the other groups. ;)


Hmm, that sounds very useful. I'm using XanaNews.
Jul 23 '05 #16
Us wrote:
Or in plain English you are welcome to use ODP data so long as you
place the correct attribution on each page (as google does in its
directory)


Just did, thanks.
<http://www.findforward.com>
Jul 23 '05 #17
"Philipp Lenssen" <in**@outer-court.com> wrote in message news:<2t*************@uni-berlin.de>...
marathon wrote:
BTW cross posting isn't necessarily bad -- it's allowable when both
are in context for either/or newsgroups. In any decent newsreader,
cross posts can be marked read in the other groups. ;)


Hmm, that sounds very useful. I'm using XanaNews.


BTW, you can turn this feature on/off in XanaNews: Tools/Options,
General tab, 'Automatic Crosspost Detection' tab.

--
Colin - author of XanaNews
Jul 23 '05 #18
