"DesignGuy" <do********@now here.com> wrote in message news:<8cTrc.435 61$gr.4348411@a ttbi_s52>...
> Does software exist to import existing HTML pages into a database?
No, it needs to be written as a one-off for each site. If you're a
coder you'll probably just sit down and write it yourself (try Perl
with one of the HTML parsing modules, e.g. HTML::TreeBuilder or
HTML::TokeParser). If you're not naturally inclined to see everything
as an excuse to cut some code, you might find a semi-automatic tool
that writes the site-specific import fragment for you and wraps it
inside its standard page-reading / record-adding loop.
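For instance, here is a minimal sketch of that loop in Perl, using
HTML::TreeBuilder and DBI. The SQLite table and the selectors in
parse_page() are hypothetical; parse_page() is the site-specific part
you'd rewrite for every new site, the rest is the generic loop:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TreeBuilder;
    use DBI;

    # Record-adding side: a throwaway SQLite table (made-up schema).
    my $dbh = DBI->connect('dbi:SQLite:dbname=site.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS pages'
           . ' (url TEXT, title TEXT, body TEXT)');
    my $ins = $dbh->prepare(
        'INSERT INTO pages (url, title, body) VALUES (?, ?, ?)');

    # Page-reading loop: generic; only parse_page() changes per site.
    for my $url (@ARGV) {
        my $html = get($url);
        unless (defined $html) { warn "couldn't fetch $url\n"; next; }
        $ins->execute($url, parse_page($html));
    }

    # The site-specific bit: map HTML fragments onto field values.
    # Assumes the title sits in <h1> and the body in
    # <div class="content"> -- adjust per site.
    sub parse_page {
        my $tree  = HTML::TreeBuilder->new_from_content(shift);
        my $title = $tree->look_down(_tag => 'h1');
        my $body  = $tree->look_down(_tag => 'div',
                                     class => 'content');
        my @vals  = map { $_ ? $_->as_text : '' } ($title, $body);
        $tree->delete;    # free the parse tree
        return @vals;
    }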
Note that however you do it, "software" is being created to map HTML
fragments onto database fields, and that is an inherently awkward
task.
As an example, M$oft SQL Server has an import tool called DTS, and
it's quite capable of being made to read and parse web pages. However,
the page -> field-values parser is the hard part and the loop is the
easy part, so this semi-automatic approach doesn't really add much and
often ends up almost as complicated as coding it yourself.
The difficulty of this sort of data import varies hugely depending on
the particular site concerned. It doesn't depend on the size of the
site, and only slightly on the number of field values to extract from
each page. The big variable is the HTML of each page, particularly
how semantically visible its underlying structure is.
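To make that concrete, compare two hypothetical fragments carrying the
same value, extracted with the $tree from the sketch above. A
semantically labelled value can be fetched by name; a purely
positional one has to be dug out by counting, and breaks the moment
someone adds a table:

    # Easy: the value is labelled in the markup.
    #   <span class="price">42.50</span>
    my $price = $tree->look_down(_tag => 'span',
                                 class => 'price')->as_text;

    # Painful: the value is only "the 3rd cell of the 2nd table".
    my @tables = $tree->look_down(_tag => 'table');
    my @cells  = $tables[1]->look_down(_tag => 'td');
    my $same   = $cells[2]->as_text;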
Real killers that will ruin your day are needing to do this on other
people's sites, needing to do it continuously for the foreseeable
future (i.e. scraping a daily feed), and needing to do it on a site
that keeps getting redesigned.
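If you're stuck doing a recurring scrape, one mitigation (a sketch,
not a cure) is to make the importer fail loudly the day the layout
changes, rather than silently loading garbage:

    # Inside the loop above: treat a missing field as a layout change.
    my ($title, $body) = parse_page($html);
    die "layout changed? no title found in $url\n"
        unless defined $title && length $title;
    $ins->execute($url, $title, $body);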
If you want a further opinion, please post a URL for the existing
site.
An alternative approach is to avoid working from the HTML at all.
Sites with thousands of pages were rarely built directly from
hand-coded HTML anyway. Can you grab their content at a higher level?
Word docs, a database?
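If the content does turn out to live in a database, the whole scraping
exercise collapses into one query (the DSN, credentials and table name
here are pure guesswork):

    my $src  = DBI->connect('dbi:ODBC:their_cms', 'user', 'pass',
                            { RaiseError => 1 });
    my $rows = $src->selectall_arrayref(
        'SELECT url, title, body FROM articles');
    $ins->execute(@$_) for @$rows;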