Hello,
Are there any utilities to help me extract content from HTML?
I'd like to store this data in a database.
The HTML consists of about 10,000 files with a total size of
about 160 MB. Each file is a thread from a message forum. Each
thread has several contributions. The threads are in linear
order of date posted with filenames such as 000125633.html. The
HTML is marked up with <table> and related tags. This HTML is very
badly formed with crucial tags missing (such as <TR>, <BODY>,
etc.). There is no coherence to this; no system - sometimes tags
are missing and sometimes they are present. Despite this, the
threads seem to render correctly; such is the forgiving nature
of modern browsers.
Fields for each post are usually identified by an attribute
(usually an attribute of a <TD> or <SPAN> tag).
Sometimes I need to actually store HTML with the content (for
instance, when a post includes a link, colored writing or text
formatted with <PRE> tags).
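
To make the kind of extraction I mean concrete, here is a rough sketch using
Python's standard html.parser, which is event-based and does not insist on
well-formed input. The attribute name ("class") and the field values
("author", "date", "body") are placeholders I made up for illustration; the
real files use whatever the forum software emitted. A field ends at its
closing tag or, if that tag is missing, at the start of the next field.

```python
from html.parser import HTMLParser

FIELD_ATTR = "class"                        # assumption: attribute marking a field
FIELD_NAMES = {"author", "date", "body"}    # assumption: hypothetical field values


class PostFieldParser(HTMLParser):
    """Collect (field_name, inner_html) pairs from tag-soup HTML."""

    def __init__(self):
        # Keep entities as written so the stored HTML stays valid markup.
        super().__init__(convert_charrefs=False)
        self.fields = []
        self._name = None    # field currently being captured, if any
        self._tag = None     # tag that opened the field ("td" or "span")
        self._depth = 0      # nesting of same-named tags inside the field
        self._buf = []

    def _flush(self):
        if self._name is not None:
            self.fields.append((self._name, "".join(self._buf).strip()))
        self._name, self._tag, self._depth, self._buf = None, None, 0, []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get(FIELD_ATTR)
        if tag in ("td", "span") and value in FIELD_NAMES:
            self._flush()    # the previous field was never closed
            self._name, self._tag = value, tag
            return
        if self._name is not None:
            if tag == self._tag:
                self._depth += 1
            self._buf.append(self.get_starttag_text())  # keep inner markup

    def handle_startendtag(self, tag, attrs):
        if self._name is not None:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._name is None:
            return
        if tag == self._tag:
            if self._depth == 0:
                self._flush()
                return
            self._depth -= 1
        self._buf.append(f"</{tag}>")

    def handle_data(self, data):
        if self._name is not None:
            self._buf.append(data)

    def handle_entityref(self, name):
        if self._name is not None:
            self._buf.append(f"&{name};")

    def handle_charref(self, name):
        if self._name is not None:
            self._buf.append(f"&#{name};")

    def close(self):
        super().close()
        self._flush()        # a field left open at end of file


def extract_fields(html_text):
    parser = PostFieldParser()
    parser.feed(html_text)
    parser.close()
    return parser.fields
```

Calling extract_fields() on the text of one file would give a flat list of
fields in document order; grouping them into posts would still have to be
done on top of this, and I have not tried it against the real data.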
My purpose in storing this in a database is to (a) make the
content easier to search and (b) use a more efficient storage
medium.
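
As an illustration of the storage side, this is a minimal sketch with
Python's built-in sqlite3 module. SQLite is just one possible choice, and the
table layout, column names and helper functions are all my own assumptions.
The thread id is taken from the numeric filename (e.g. 000125633.html), and
each post is passed in as a plain dict.

```python
import sqlite3
from pathlib import Path


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the posts table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               thread_id INTEGER NOT NULL,  -- from the filename
               position  INTEGER NOT NULL,  -- order of the post in its thread
               author    TEXT,
               posted    TEXT,
               body_html TEXT,              -- may contain <a>, <pre>, etc.
               PRIMARY KEY (thread_id, position))"""
    )
    return conn


def thread_id_from_filename(filename):
    # "000125633.html" -> 125633
    return int(Path(filename).stem)


def store_thread(conn, filename, posts):
    """posts: list of dicts with optional 'author', 'date', 'body' keys."""
    tid = thread_id_from_filename(filename)
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?)",
            [(tid, i, p.get("author"), p.get("date"), p.get("body"))
             for i, p in enumerate(posts)],
        )


def search(conn, term):
    """Naive substring search over post bodies."""
    return conn.execute(
        "SELECT thread_id, position, author FROM posts "
        "WHERE body_html LIKE ? ORDER BY thread_id, position",
        (f"%{term}%",),
    ).fetchall()
```

A LIKE query is the simplest thing that could work; a proper full-text index
would probably be the better answer to (a), but that depends on what the
chosen database offers.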
The original database from which these web-forum posts were
taken is no longer available on the web nor does it look like it
ever will be again. Nor can I contact the person who 'owns' it.
If I did contact them, they would be unlikely to release the
data.
Despite this, there are no copyright issues here. Every single
post made to the forum was made using an alias and no forum
poster wants to be identified, nor do any posters wish to claim
"ownership" of their contributions.