Bytes | Software Development & Data Engineering Community

Extract content from HTML?

Hello,

Are there any utilities to help me extract content from HTML?

I'd like to store this data in a database.

The HTML consists of about 10,000 files with a total size of
about 160 MB. Each file is a thread from a message forum, and
each thread has several contributions. The threads are in
chronological order of posting, with filenames such as
000125633.html. The HTML is marked up with <table> etc. tags.
This HTML is very badly formed, with crucial tags missing (such
as <TR>, <BODY>, etc.). There is no system to it: sometimes tags
are missing and sometimes they are present. Despite this, the
threads seem to render correctly; such is the forgiving nature
of modern browsers.

Fields for each post are usually identified by an attribute
(usually an attribute of a <TD> or <SPAN>).

Sometimes I need to store HTML along with the content (for
instance when a post includes a link, colored writing, or text
formatted with <PRE> tags).

My purpose in storing this in a database is to make the content
(a) easier to search and (b) stored more efficiently.

The original database from which these web-forum posts were
taken is no longer available on the web nor does it look like it
ever will be again. Nor can I contact the person who 'owns' it.
If I did contact them, they would be unlikely to release the
data.

Despite this, there are no copyright issues here. Every single
post made to the forum was made using an alias and no forum
poster wants to be identified, nor do any posters wish to claim
"ownership" of their contributions.

Jul 23 '05 #1
mark4 wrote:
Are there any utilities to help me extract content from HTML?
I'd like to store this data in a database.
Looks to me like you'd have to write your own customised program to
extract the data.

To do that, I recommend using Perl. Perl has a module called HTML::Parser
which is apparently pretty good at extracting information from malformed
HTML files. What's more, it is generally very good at text handling and has
decent database modules too.
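The same event-driven idea can be sketched in Python's html.parser, which is similarly tolerant of missing tags, for anyone without Perl to hand (the class="msgtext" marker attribute is invented for illustration; substitute whatever attribute the actual files use):

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collect the text inside <td>/<span> elements that carry a
    marker attribute. html.parser does not require well-formed
    input, so missing </tr>, </td>, or <body> tags are not fatal."""
    def __init__(self, marker=("class", "msgtext")):
        super().__init__()
        self.marker = marker
        self.capturing = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "span"):
            # Start capturing only when the marker attribute is present;
            # any other cell ends the current capture.
            self.capturing = self.marker in attrs
            if self.capturing:
                self.posts.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "span"):
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.posts[-1] += data

p = PostExtractor()
# Note: no closing </td> or </table> tags, as in the badly formed files.
p.feed('<table><td class="msgtext">Hello<td class="other">x')
```

After the feed, `p.posts` holds only the text from the marked cell, even though the input never closes its tags.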
Nor can I contact the person who 'owns' it. If I did contact them, they
would be unlikely to release the data.

Despite this, there are no copyright issues here. Every single post made
to the forum was made using an alias and no forum poster wants to be
identified, nor do any posters wish to claim "ownership" of their
contributions.


Sounds to me like there are *major* copyright issues!

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Jul 23 '05 #2
On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
<us**********@tobyinkster.co.uk> wrote:
mark4 wrote:
Are there any utilities to help me extract content from HTML?
I'd like to store this data in a database.
Looks to me like you'd have to write your own customised program to
extract the data.


I expected as much.
To do that, I recommend using Perl. Perl has a module called HTML::Parser
which is apparently pretty good at extracting information from malformed
HTML files. What's more, it is generally very good at text handling and has
decent database modules too.


Thanks. Being a microserf, I don't normally code in Perl but I
may look into this. It's either that or WSH JavaScript with
its regular expressions. Fortunately I already have a top-level
design and it looks pretty simple. I may look into this
Perl module, but it will probably be easier to use microserf
technology with which I'm intimate. I shall probably store
it in MSSQL.
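For the storage step, the table shape is simple enough; here's a rough sketch using SQLite as a stand-in (the intended target is MSSQL; the column names are invented for illustration):

```python
import sqlite3

def store_posts(db_path, posts):
    """posts: iterable of (thread_file, author_alias, body_html) tuples.
    The body is stored as-is, so embedded <a>/<pre> markup survives."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS post (
               id INTEGER PRIMARY KEY,
               thread_file TEXT,
               author TEXT,
               body TEXT)"""
    )
    conn.executemany(
        "INSERT INTO post (thread_file, author, body) VALUES (?, ?, ?)",
        posts,
    )
    conn.commit()
    return conn
```

Keeping the raw HTML in one text column and searching it with the database's own full-text facilities covers both goals: searchability and compact storage.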
Nor can I contact the person who 'owns' it. If I did contact them, they
would be unlikely to release the data.

Despite this, there are no copyright issues here. Every single post made
to the forum was made using an alias and no forum poster wants to be
identified, nor do any posters wish to claim "ownership" of their
contributions.


Sounds to me like there are *major* copyright issues!


I can't see what those issues are. Who owns the data? Not the
original forum provider. The data posted to a forum is copyright
of the original author, no matter what ToS may be specified in
the forum. All those original authors have an alias and don't
actually want to be identified. What I'm doing is no more a
violation of copyright than someone keeping newspaper clippings.

So long as I don't republish it.

Jul 23 '05 #3
mark4 wrote:
On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
<us**********@tobyinkster.co.uk> wrote:
To do that, I recommend using Perl. Perl has a module called HTML::Parser
which is apparently pretty good at extracting information from malformed
HTML files. What's more, it is generally very good at text handling and has
decent database modules too.

Toby's right. I don't do the whole "language cheerleader" thing - but for
this particular problem, Perl's an ideal fit.
Thanks. Being a microserf, I don't normally code in Perl but I
may look into this. It's either that or WSH JavaScript with
its regular expressions.


There's Perl for Windows, you know. It integrates nicely with WSH too.

<http://www.activestate.com>

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
Jul 23 '05 #4
Access can link to HTML (direct from the web) and will recognise tables.
You might be lucky! It would make a very quick solution. File > Get
External Data > Link... and then choose HTML. I was surprised how well it
worked when I tried it on a table I'd created in FrontPage.

--
####################
## PH, London
####################
"mark4" <mark4asp@#notthis#ntlworld.com> wrote in message
news:8e********************************@4ax.com...
Hello,

Are there any utilities to help me extract content from HTML?

I'd like to store this data in a database.

< snip >

Jul 23 '05 #5
In article <8e********************************@4ax.com>, mark4
<mark4asp@#notthis#ntlworld.com> wrote:
Are there any utilities to help me extract content from HTML?


BBEdit has a simple menu command to remove markup from an HTML page,
leaving only the content. You can then perform whatever regex
operations you need to massage the data before saving it.

To process all those files, it should be a pretty simple matter to
write an AppleScript to automate this procedure.

However, this solution is Macintosh-only.
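Off the Mac, the same strip-the-markup step is only a few lines in any scripting language; a rough Python sketch (it keeps character data only, dropping every tag):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Drop all markup, keep only the text content of the page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of character data between tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

s = TagStripper()
s.feed("<p>Hello <b>world</b></p>")
```

After the feed, `s.text()` returns the page text with all tags removed, which can then be massaged with regexes as described above.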

--
Jim Royal
"Understanding is a three-edged sword"
http://JimRoyal.com
Jul 23 '05 #6
On Mon, 28 Feb 2005 08:32:19 GMT, mark4 wrote:
Nor can I contact the person who 'owns' it. If I did contact them, they
would be unlikely to release the data.

Despite this, there are no copyright issues here. Every single post made
to the forum was made using an alias and no forum poster wants to be
identified, nor do any posters wish to claim "ownership" of their
contributions.


Sounds to me like there are *major* copyright issues!


I can't see what those issues are.


By law, those posts are copyrighted and owned by the posters.
Jul 23 '05 #7
On Mon, 28 Feb 2005 06:06:36 GMT, mark4
<mark4asp@#notthis#ntlworld.com> wrote:
Hello, Are there any utilities to help me extract content from HTML?


< snip >

NoteTab? Modify > Strip HTML tags?

http://www.notetab.com/

Not sure whether that is in the freeware version or not.

Regards, John.

Jul 23 '05 #8
mark4 wrote:
Thanks. Being a microserf, I don't normally code in Perl but I
may look into this.
I am told ActiveState's Windows port of Perl is pretty good. Alternatively
there is also a Cygwin version of Perl.
I can't see what those issues are. Who owns the data?
Its original authors, unless they explicitly signed away the copyright.
All those original authors have an alias and don't actually want to be
identified.
Publishing anonymously or under a pseudonym does not mean you forgo
copyright.
So long as I don't republish it.


If you are keeping the database for private use, then you can probably
"get away with it", but the natural assumption on alt.html is that posters
are wanting to publish their efforts to the web, unless it's explicitly
stated otherwise.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Jul 23 '05 #9
To do that, I recommend using Perl. Perl has a module called HTML::Parser
which is apparently pretty good at extracting information from malformed
HTML files. What's more, it is generally very good at text handling and has
decent database modules too.

Thanks. Being a microserf, I don't normally code in Perl but I
may look into this. It's either that or WSH JavaScript with
its regular expressions. Fortunately I already have a top-level
design and it looks pretty simple. I may look into this
Perl module, but it will probably be easier to use microserf
technology with which I'm intimate. I shall probably store
it in MSSQL.


You could use the InternetExplorer.Application COM object.
That would give you the facilities for performing HTML
parsing without regexps. It would therefore be
more robust and readily doable in your favorite language.
Try Google for examples.

Jul 23 '05 #10
mark4 wrote:
Hello,

Are there any utilities to help me extract content from HTML?


lynx -dump http://whateverTheHeck.com > temp.txt

... is the shortest program I know of for this kind of thing.
The '>' redirection to temp.txt may vary somewhat between operating systems.
--
mbstevens http://www.mbstevens.com

Jul 23 '05 #11
