
Import existing web pages into database?

I've inherited a site that has 1000+ product pages that follow a similar
layout. I would like to import these pages into a database for ease of
future updating. The pages have the usual banners, navigation menus, etc.
and other extraneous code that must be removed.

Does software exist to import existing HTML pages into a database - even a
rough import would work as I can massage the data later.

Ideas/suggestions will be appreciated.

Jul 20 '05 #1


"DesignGuy" <do********@nowhere.com> wrote in message news:8cTrc.43561$gr.4348411@attbi_s52...
: I've inherited a site that has 1000+ product pages that follow a similar
: layout. I would like to import these pages into a database for ease of
: future updating. The pages have the usual banners, navigation menus, etc.
: and other extraneous code that must be removed.
:
: Does software exist to import existing HTML pages into a database - even a
: rough import would work as I can massage the data later.
:
: Ideas/suggestions will be appreciated.
:
Storing content in a database alone won't give you any flexibility for
massaging/managing content. What about serving the content via
a web server?

You'd be better off using a CMS that offers these higher-level functions,
including site import.

--
Long
www.webcharm.ca - Integrated content management web hosting
Jul 20 '05 #2


"Long - CM web hosting" <ro******@rogers.com> wrote in message
news:lJ*******************@news04.bloor.is.net.cable.rogers.com...
"DesignGuy" <do********@nowhere.com> wrote in message news:8cTrc.43561$gr.4348411@attbi_s52... : I've inherited a site that has 1000+ product pages that follow a similar
: layout. I would like to import these pages into a database for ease of
: future updating. The pages have the usual banners, navigation menus, etc. : and other extraneous code that must be removed.
:
: Does software exist to import existing HTML pages into a database - even a : rough import would work as I can massage the data later.
:
: Ideas/suggestions will be appreciated.
:
Storing content in a database alone won't give you any flexibility for
massaging/managing content. What about serving the content via
a web server?

You'd be better off using a CMS that offers these higher-level functions,
including site import.


The product database would reside on a web server -- the problem is getting
the existing pages into a database to begin with (none was used to create the
original pages).

Importing existing pages into a CMS won't do what is needed, as the layout
will be changed completely.

Jul 20 '05 #3

"DesignGuy" <do********@nowhere.com> wrote in message news:xFWrc.13910$JC5.1307707@attbi_s54...
:
: "Long - CM web hosting" <ro******@rogers.com> wrote in message
: news:lJ*******************@news04.bloor.is.net.cable.rogers.com...
: > "DesignGuy" <do********@nowhere.com> wrote in message
: news:8cTrc.43561$gr.4348411@attbi_s52...
: > : I've inherited a site that has 1000+ product pages that follow a similar
: > : layout. I would like to import these pages into a database for ease of
: > : future updating. The pages have the usual banners, navigation menus, etc.
: > : and other extraneous code that must be removed.
: > :
: > : Does software exist to import existing HTML pages into a database - even a
: > : rough import would work as I can massage the data later.
: > :
: > : Ideas/suggestions will be appreciated.
: > :
: > Storing content in a database alone won't give you any flexibility for
: > massaging/managing content. What about serving the content via
: > a web server?
: >
: > You'd be better off using a CMS that offers these higher-level functions,
: > including site import.
:
: The product database would reside on a web server -- the problem is getting
: the existing pages into a database to begin with (none was used to create the
: original pages).
:
Yes, all CMSes that are database driven also have a database sitting behind
a web server. The CMS may have functions to import a web site from your
local disk.

: Importing existing pages into a CMS won't do what is needed, as the layout
: will be changed completely.
:
Granted, with certain CMSes you have to use a predefined layout. But it
does depend on the selected CMS and how flexible it is in terms of HTML
source control. In our case, you have full control of every HTML element
(including CSS) so you can change the layout and content as often as you like.

webcharm.ca is hosted by our CMS. It uses a specific layout that is completely
separate and independent from any other site hosted on the same system.

--
Long
www.webcharm.ca - Integrated content management web hosting
Jul 20 '05 #4

"DesignGuy" <do********@nowhere.com> wrote in message news:<8cTrc.43561$gr.4348411@attbi_s52>...
> Does software exist to import existing HTML pages into a database


No, it needs to be written as a one-off for each site. If you're a
coder you'll probably just sit down and write it yourself (try Perl
and one of the HTML parsing modules). If you're not naturally inclined
to see everything as an excuse to cut some code, then you might find a
semi-automatic tool that writes this import script fragment for you
and wraps it up inside its standard page-reading / record-adding loop.
Note that however you do it, "software" is being created to map HTML
fragments onto database fields and this is just an inherently awkward
task, no matter how you approach it.
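
To make that concrete, here is a rough sketch of such a one-off script in
Python (rather than Perl), using only the standard library. The "site"
directory, the table layout, and the choice of <title> and first <h1> as the
fields to extract are assumptions for illustration; a real script would
target whatever markup the inherited pages actually use.

# One-off import sketch: walk a directory of product pages, pull a couple
# of fields out of each, and load them into SQLite. The "site" directory,
# the table layout, and the <title>/<h1> selectors are all assumptions.
import sqlite3
from html.parser import HTMLParser
from pathlib import Path

class FieldGrabber(HTMLParser):
    """Collects the text of the <title> and the first <h1>."""
    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "h1": ""}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "h1" and not self.fields["h1"]:
            self._current = "h1"

    def handle_endtag(self, tag):
        if tag in ("title", "h1"):
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data.strip()

db = sqlite3.connect("products.db")
db.execute("CREATE TABLE IF NOT EXISTS products (path TEXT, title TEXT, heading TEXT)")
for page in Path("site").rglob("*.html"):
    grabber = FieldGrabber()
    grabber.feed(page.read_text(errors="replace"))
    db.execute("INSERT INTO products VALUES (?, ?, ?)",
               (str(page), grabber.fields["title"], grabber.fields["h1"]))
db.commit()

The loop and the INSERT are the easy part; the FieldGrabber class is the bit
that has to be rewritten for every site, which is exactly the awkwardness
described above.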

As an example, M$oft SQL Server has an import tool called DTS and it's
quite capable of being made to read and parse web pages. However the
page -> field values parser is the hard bit, and the loop is the easy
bit, so this semi-auto scripting approach doesn't really add much and
is often almost as complicated.

The difficulty of this sort of data import varies hugely depending on
the particular site concerned. It doesn't depend on the size of the
site, and only slightly on the number of field values to extract from
each page. The big variable is the markup of each HTML page,
particularly how visible its underlying semantic structure is.
Real killers that will ruin your day are needing to do this on other
people's sites, needing to do it continuously for the foreseeable
future (i.e. scraping a daily feed), and needing to do it on a site
that keeps being redesigned under you.

If you want a further opinion, please post a URL for the existing
site.

An alternative approach is to avoid doing this from HTML. Sites with
1000's of pages were rarely built directly from hand-coded HTML
anyway. Can you grab their content at a higher level? Word docs,
database?
Jul 20 '05 #5

"Long - CM web hosting" <ro******@rogers.com> wrote in message news:<Ww*****************@news01.bloor.is.net.cabl e.rogers.com>...
> Yes, all CMSes that are database driven also have a database sitting behind
> a web server.


Not all of them. There is still scope for some low-cost hosting where
an off-line database makes pages, then these are static published to a
web server.
Jul 20 '05 #6

Andy Dingley wrote:
> There is still scope for some low-cost hosting where an off-line
> database makes pages, then these are static published to a web
> server.


Lots of sites are done this way, offline or not. It's called "baking"
the content.
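
For illustration, the baking step can be as small as a loop that renders each
database row through a template into the live directory. A rough sketch, with
the table, columns, and one-string template all invented for the example:

# "Baking" sketch: render every database row to a static HTML file that
# the web server then serves with zero per-request cost. Table name,
# columns, and template are assumptions for illustration.
import sqlite3
from pathlib import Path
from string import Template

TEMPLATE = Template("<html><head><title>$title</title></head>"
                    "<body><h1>$title</h1><p>$body</p></body></html>")

out = Path("live")
out.mkdir(exist_ok=True)
db = sqlite3.connect("products.db")
for slug, title, body in db.execute("SELECT slug, title, body FROM products"):
    (out / (slug + ".html")).write_text(TEMPLATE.substitute(title=title, body=body))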

--
Brian (remove "invalid" from my address to email me)
http://www.tsmchughs.com/
Jul 20 '05 #7

"Andy Dingley" <di*****@codesmiths.com> wrote in message
news:28**************************@posting.google.com...
: "Long - CM web hosting" <ro******@rogers.com> wrote in message
news:<Ww*****************@news01.bloor.is.net.cable.rogers.com>...
:
: > Yes, all CMSes that are database driven also have a database sitting behind
: > a web server.
:
: Not all of them. There is still scope for some low-cost hosting where
: an off-line database makes pages, then these are static published to a
: web server.

Perhaps I should have said "web-based/online/live" system instead...

Such off-line systems cost more but offer little benefit:
1. you need to manage two versions of the website, and keeping them in sync
will be more difficult, especially for larger sites.
2. more work is needed to generate dynamic content.

It is unfortunate that people do not value their time and effort more when
considering a web host. I suppose you do get what you pay for...

--
Long
www.webcharm.ca - Integrated content management web hosting
Jul 20 '05 #8

"Andy Dingley" <di*****@codesmiths.com> wrote in message
news:28**************************@posting.google.c om...
: "DesignGuy" <do********@nowhere.com> wrote in message news:<8cTrc.43561$gr.4348411@attbi_s52>...
:
: > Does software exist to import existing HTML pages into a database
:
: No, it needs to be written as a one-off for each site. If you're a
: coder you'll probably just sit down and write it yourself (try Perl
: and one of the HTML parsing modules).

Not necessarily. It depends how you define your import "chunks".
If you need to parse the content and extract specific elements, then
yes. Otherwise file-level chunks can be imported with a more general
module, if one is available with the selected CMS.

In this case a generalized module can be written to import any site
while preserving content and directory structures.
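
A sketch of that file-level approach (the paths and table layout are
assumptions): every file is stored whole, keyed by its place in the directory
tree, and nothing is parsed.

# File-level import sketch: store each file whole (HTML, images, PDFs, ...)
# keyed by its relative path, so the directory structure survives the
# import. Nothing is parsed -- that is the trade-off being described.
import sqlite3
from pathlib import Path

ROOT = Path("site")    # assumed root of the existing site on local disk
db = sqlite3.connect("import.db")
db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, body BLOB)")
for f in ROOT.rglob("*"):
    if f.is_file():
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                   (str(f.relative_to(ROOT)), f.read_bytes()))
db.commit()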

--
Long
www.webcharm.ca - Integrated content management web hosting
Jul 20 '05 #9

On Mon, 24 May 2004 21:11:14 GMT, "Long - CM web hosting"
<ro******@rogers.com> wrote:
"Andy Dingley" <di*****@codesmiths.com> wrote in message
news:28**************************@posting.google. com...
: "Long - CM web hosting" <ro******@rogers.com> wrote in message
news:<Ww*****************@news01.bloor.is.net.cab le.rogers.com>...
:
: There is still scope for some low-cost hosting where
: an off-line database makes pages, then these are static published to a
: web server.

Perhaps I should have said "web-based/online/live" system instead...

Such off-line systems cost more but offer little benefit:
1. need to manage two versions of the website and keeping them in sync will be
more difficult, especially for larger sites.
2. more work is needed to generate dynamic content

It is unfortunate people not value their time and effort more when considering
a web host. I suppose you do get what you pay for...


There seems little point in dynamically generating a site which is, on
the whole, static. I used to run such a site, and it changed
infrequently so we considered it pointless to rebuild the page for
each request.

Instead, we just used a preprocessing tool, with the output from that
tool going straight into the "live" directory on the server. Much of
the content was actually from files on disk rather than a database,
but there was some database-driven content and the principle still
applies.

If a site contains content which has long-term relevance there is
often little need for dynamism.

-Claire
Jul 20 '05 #10

Long - CM web hosting wrote:
> Such off-line systems cost more but offer little benefit:
> 1. you need to manage two versions of the website, and keeping them in
> sync will be more difficult, especially for larger sites.
> 2. more work is needed to generate dynamic content.


Yes, but they often offer excellent caching support with no extra effort
by the content producer. Generated-on-the-fly sites either must
replicate that behavior or do without, with the obvious consequences.

--
Brian (remove "invalid" from my address to email me)
http://www.tsmchughs.com/
Jul 20 '05 #11

"Claire Tucker" <fa**@invalid.com> wrote in message
news:1e********************************@4ax.com...
: On Mon, 24 May 2004 21:11:14 GMT, "Long - CM web hosting"
: <ro******@rogers.com> wrote:
:
: >"Andy Dingley" <di*****@codesmiths.com> wrote in message
: >news:28**************************@posting.google.com...
: >: "Long - CM web hosting" <ro******@rogers.com> wrote in message
: >news:<Ww*****************@news01.bloor.is.net.cable.rogers.com>...
: >:
: >: There is still scope for some low-cost hosting where
: >: an off-line database makes pages, then these are static published to a
: >: web server.
: >
: >Perhaps I should have said "web-based/online/live" system instead...
: >
: >Such off-line systems cost more but offer little benefit:
: >1. you need to manage two versions of the website, and keeping them in sync
: >will be more difficult, especially for larger sites.
: >2. more work is needed to generate dynamic content.
: >
: >It is unfortunate that people do not value their time and effort more when
: >considering a web host. I suppose you do get what you pay for...
:
: There seems little point in dynamically generating a site which is, on
: the whole, static. I used to run such a site, and it changed
: infrequently so we considered it pointless to rebuild the page for
: each request.
:
One concern with regenerating a page is the overhead compared to
serving a static file. That was particularly true of pre-1GHz processors, but it
is no longer the case (processing overhead is now negligible).

Regardless of how often a site may be updated, being able to maintain
a live site reduces the extra steps and hence effort.

: Instead, we just used a preprocessing tool, with the output from that
: tool going straight into the "live" directory on the server. Much of
: the content was actually from files on disk rather than a database,
: but there was some database-driven content and the principle still
: applies.
:
These are some of the extra steps I was alluding to. Perhaps it does
work particularly well for your site, but it may not work as well for others.

: If a site contains content which has long-term relevance there is
: often little need for dynamism.
:
Perhaps...I wonder how many sites fit this category without becoming
"dead wood" sites.

--
Long
www.webcharm.ca - Integrated content management web hosting
Jul 20 '05 #12

"Brian" <us*****@julietremblay.com.invalid> wrote in message
news:10*************@corp.supernews.com...
: Long - CM web hosting wrote:
:
: > Such off-line systems cost more but offer little benefit:
: > 1. you need to manage two versions of the website, and keeping them in
: > sync will be more difficult, especially for larger sites.
: > 2. more work is needed to generate dynamic content.
:
: Yes, but they often offer excellent caching support with no extra effort
: by the content producer. Generated-on-the-fly sites either must
: replicate that behavior or do without, with the obvious consequences.
:

In my (perhaps limited) experience, user-agents cache a page based on
the LAST_MODIFIED timestamp. For a static page, the value is
supplied automatically by the webserver. For a generated page (with
content that is really static), the value can be inserted in the response
headers at generation time. This is what our CMS does automatically.

Regardless of the method, I think the user-agent will cache the page correctly.
One situation that may be problematic is if included content was modified but
the main template was not. This means the last-modified timestamp will be incorrect,
but the solution is to touch the template and update its timestamp accordingly.
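
One way to sidestep the manual touch, sketched here under the assumption that
the generator knows which files each page is built from, is to stamp the page
with the newest modification time among the template and its includes (the
file names are invented for the example):

# Sketch: derive Last-Modified from the newest modification time among
# the template and every include, so editing an include updates the
# timestamp without touching the template. File names are assumptions.
import os
from email.utils import formatdate

SOURCES = ["template.html", "includes/nav.html", "includes/footer.html"]
newest = max(os.path.getmtime(p) for p in SOURCES)
print("Last-Modified: " + formatdate(newest, usegmt=True))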

--
Long
www.webcharm.ca - Integrated content management web hosting
Jul 20 '05 #13

"Long - CM web hosting" <ro******@rogers.com> wrote in message news:<lJ*****************@news04.bloor.is.net.cabl e.rogers.com>...

[Parsing existing HTML requires some code creation, either hand-coded
or automatic]
> Not necessarily. It depends how you define your import "chunks".
> If you need to parse the content and extract specific elements, then
> yes. Otherwise file-level chunks can be imported with a more general
> module, if one is available with the selected CMS.


What's a "file level chunk" ? We're talking about HTML pages here,
where the only assumption has to be that there are entities intended
for separate fields, and that these will be mixed around on the page
(hopefully with some conveniently identifiable structure to help find
them). This needs some sort of coding approach; maybe a simple parse
into a DOM and then retrieval by id value (although you'll be lucky to
find a site that allows this). More likely you're dealing with a
standard HTML tokeniser, then some flakey sort of hand-coded state
machine to find _which_ <td> it is in the <table> with the header
string of "Latest Prices". I've written a bunch of these things - the
only time they haven't ended up really ugly has been where the site
served XHTML and had some pretence to semantic markup.
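
For what it's worth, such a state machine is only a couple of dozen lines on
top of a stock tokeniser. A Python sketch, reusing the "Latest Prices" marker
from the example above (everything else is invented):

# Sketch of the hand-coded state machine: scan tokeniser events and
# capture the text of the <td> that follows the cell containing the
# marker string. Brittle by design -- it breaks when the layout changes.
from html.parser import HTMLParser

class PriceCellFinder(HTMLParser):
    def __init__(self, marker="Latest Prices"):
        super().__init__()
        self.marker = marker
        self.state = "hunting"   # hunting -> found_marker -> capturing -> done
        self.value = ""

    def handle_starttag(self, tag, attrs):
        if tag == "td" and self.state == "found_marker":
            self.state = "capturing"

    def handle_endtag(self, tag):
        if tag == "td" and self.state == "capturing":
            self.state = "done"

    def handle_data(self, data):
        if self.state == "hunting" and self.marker in data:
            self.state = "found_marker"
        elif self.state == "capturing":
            self.value += data

finder = PriceCellFinder()
finder.feed("<table><tr><td>Latest Prices</td><td>$9.99</td></tr></table>")
print(finder.value)    # prints: $9.99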
Jul 20 '05 #14

"Andy Dingley" <di*****@codesmiths.com> wrote in message
news:28**************************@posting.google.com...
: "Long - CM web hosting" <ro******@rogers.com> wrote in message
news:<lJ*****************@news04.bloor.is.net.cable.rogers.com>...
:
: [Parsing existing HTML requires some code creation, either hand-coded
: or automatic]
:
: > Not necessarily. It depends how you define your import "chunks".
: > If you need to parse the content and extract specific elements, then
: > yes. Otherwise file-level chunks can be imported with a more general
: > module, if one is available with the selected CMS.
:
: What's a "file level chunk" ? We're talking about HTML pages here,

I meant the entity is either a complete HTML page or an application
document (such as images/pdf/doc/...).

--
Long
www.webcharm.ca - Integrated content management web hosting
Jul 20 '05 #15

Long - CM web hosting wrote:
> In my (perhaps limited) experience, user-agents cache a page based on
> the LAST_MODIFIED timestamp.

Among other factors, such as etag.

> For a static page, the value is supplied automatically by the
> webserver. For a generated page (with content that is really static), the
> value can be inserted in the response headers at generation
> time. This is what our CMS does automatically.

And do you also provide an etag? In any case, that's only half of the
picture. It says quite a lot that a CMS provider doesn't even mention
conditional GET requests.
> Regardless of the method, I think the user-agent will cache the page
> correctly.

Well, that depends on the settings of the ua. If it does not check the
freshness of the document against the server, then you're right. And
many caches will try to make educated guesses about the freshness of a
document based on the last-modified header. But if a ua checks the
freshness with the server, e.g., with if-modified-since, then the server
will recreate and resend the page, even if it has not changed. The only
way to correct this is to parse the header, check for if-modified-since,
and if the last-modified is equal (or older, though that would be an odd
occurrence), send 304 Not Modified and end the transaction.
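
A minimal sketch of that handshake, using Python's standard-library HTTP
server (the single-file setup is an assumption for the example; production
servers and frameworks handle this for you):

# Conditional-GET sketch: compare If-Modified-Since against the page's
# modification time and answer 304 Not Modified when nothing changed,
# instead of regenerating and resending the whole page.
import os
from email.utils import formatdate, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = "index.html"    # assumed: one static page for the whole demo

class ConditionalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        mtime = int(os.path.getmtime(PAGE))   # headers have 1-second resolution
        since = self.headers.get("If-Modified-Since")
        if since:
            try:
                if mtime <= parsedate_to_datetime(since).timestamp():
                    self.send_response(304)   # not modified: send no body
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass                          # unparseable date: fall through
        body = open(PAGE, "rb").read()
        self.send_response(200)
        self.send_header("Last-Modified", formatdate(mtime, usegmt=True))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), ConditionalHandler).serve_forever()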

This is not trivial to set up. But Apache does this automatically. So
using static pages whenever possible will make your site cache-friendly
with little work on the author's part.

> One situation that may be problematic is if included content was
> modified, but the main template was not. This means the last-modified
> timestamp will be incorrect


With all due respect, that is not the only problem with your
implementation, unless you've left some parts out.

--
Brian (remove "invalid" from my address to email me)
http://www.tsmchughs.com/
Jul 20 '05 #16
