Hello,
Are there any utilities to help me extract content from HTML?
I'd like to store this data in a database.
The HTML consists of about 10,000 files with a total size of
about 160 MB. Each file is a thread from a message forum. Each
thread has several contributions. The threads are in linear
order of date posted, with filenames such as 000125633.html. The
HTML is marked up with <table> etc. tags. This HTML is very
badly formed, with crucial tags missing (such as <TR>, <BODY>,
etc.). There is no coherence to this; no system - sometimes tags
are missing and sometimes they are present. Despite this, the
threads seem to render correctly; such is the forgiving nature
of modern browsers.
Fields for each post are usually identified by an attribute tag
(usually an attribute of a <TD> or <SPAN>).
Sometimes I need to actually store HTML with the content (for
instance when a post includes a link, colored writing or text
formatted with <PRE> tags).
My purpose in storing this in a database is to make the content
(a) easier to search and (b) more efficiently stored.
The original database from which these web-forum posts were
taken is no longer available on the web nor does it look like it
ever will be again. Nor can I contact the person who 'owns' it.
If I did contact them, they would be unlikely to release the
data.
Despite this, there are no copyright issues here. Every single
post made to the forum was made using an alias and no forum
poster wants to be identified, nor do any posters wish to claim
"ownership" of their contributions.
mark4 wrote: Are there any utilities to help me extract content from HTML? I'd like to store this data in a database.
Looks to me like you'd have to write your own customised program to
extract the data.
To do that, I recommend using Perl. Perl has a module called HTML::Parser
which is apparently pretty good at extracting information from malformed
HTML files. What's more, it is generally very good at text handling and has
decent database modules too.
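HTML::Parser is event-driven, and the same idea can be sketched in Python's html.parser, which is similarly lenient about missing tags. This is only an illustration of the approach, not the poster's actual code; the "postbody" class name is a made-up stand-in for whatever attribute marks each field in the real forum pages:

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collects the text of <td>/<span> elements carrying a marker attribute."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # "postbody" is a hypothetical marker class for illustration
        if tag in ("td", "span") and ("class", "postbody") in attrs:
            self.capturing = True
            self.fields.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "span"):
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.fields[-1] += data

# Note the missing </td> after the first cell -- a lenient,
# event-driven parser copes with it where a strict one would not.
p = FieldExtractor()
p.feed('<table><td class="postbody">First post'
       '<td class="postbody">Second post</td></table>')
print(p.fields)
```

Because the parser simply fires events for whatever tags it actually sees, a missing </TR> or </BODY> doesn't derail it.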
Nor can I contact the person who 'owns' it. If I did contact them, they would be unlikely to release the data.
Despite this, there are no copyright issues here. Every single post made to the forum was made using an alias and no forum poster wants to be identified, nor do any posters wish to claim "ownership" of their contributions.
Sounds to me like there are *major* copyright issues!
--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
<us**********@tobyinkster.co.uk> wrote: mark4 wrote:
Are there any utilities to help me extract content from HTML? I'd like to store this data in a database. Looks to me like you'd have to write your own customised program to extract the data.
I expected as much.
To do that, I recommend using Perl. Perl has a module called HTML::Parser which is apparently pretty good at extracting information from malformed HTML files. What's more, it is generally very good at text handling and has decent database modules too.
Thanks. Being a microserf, I don't normally code in Perl but I
may look into this. It's either that or WSH JavaScript with
its regular expressions. Fortunately I already have a top
level design and it looks pretty simple. I may look into this
Perl module but it will probably be easier to use microserf
technology with which I'm intimate. I shall probably store it
in MSSQL.
Nor can I contact the person who 'owns' it. If I did contact them, they would be unlikely to release the data.
Despite this, there are no copyright issues here. Every single post made to the forum was made using an alias and no forum poster wants to be identified, nor do any posters wish to claim "ownership" of their contributions.
Sounds to me like there are *major* copyright issues!
I can't see what those issues are. Who owns the data? Not the
original forum provider. The data posted to a forum is copyright
of the original author - no matter what ToS may be specified in
the forum. All those original authors have an alias and don't
actually want to be identified. What I'm doing is no more a
violation of copyright than someone keeping newspaper clippings.
So long as I don't republish it.
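The regex route mentioned above (WSH JScript's regular expressions) can work when the attribute markers are consistent. A rough Python illustration follows; the "author" and "posttext" class names are invented, and the usual caveat applies: regexes get brittle exactly where the event-driven parser stays robust (missing or unquoted tags):

```python
import re

# Hypothetical fragment; old forum HTML often leaves attribute values unquoted
html = ('<TD class=author>mark4</TD>'
        '<TD class=posttext>So long as I do not republish it.</TD>')

# Non-greedy capture of each field between its marker attribute
# and the closing tag; case-insensitive to match <TD> or <td>
fields = dict(re.findall(r'<TD class=(\w+)>(.*?)</TD>', html, re.IGNORECASE))
print(fields)
```

A pattern like this breaks the moment a closing </TD> is missing, which is exactly the malformation described in the original post, so it only suits the better-behaved pages.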
mark4 wrote: On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster <us**********@tobyinkster.co.uk> wrote:
To do that, I recommend using Perl. Perl has a module called HTML::Parser which is apparently pretty good at extracting information from malformed HTML files. What's more, it is generally very good at text handling and has decent database modules too.
Mark's right. I don't do the whole "language cheerleader" thing - but for
this particular problem, Perl's an ideal fit.
Thanks. Being a microserf, I don't normally code in Perl but I may look into this. It's either that or WSH JavaScript with its regular expressions.
There's Perl for Windows, you know. It integrates nicely with WSH too.
<http://www.activestate.com>
sherm--
--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
Access can link to HTML (direct from the web) and will recognise tables.
You might be lucky! It would make a very quick solution. File > Get
External Data > Link... and then choose HTML. I was surprised how well it
worked when I tried it on a table I'd created in FrontPage.
--
####################
## PH, London
####################
"mark4" <mark4asp@#notthis#ntlworld.com> wrote in message
news:8e**************************@4ax.com... Hello,
Are there any utilities to help me extract Content from HTML ?
I'd like to store this data in a database.
In article <8e********************************@4ax.com>, mark4
<mark4asp@#notthis#ntlworld.com> wrote: Are there any utilities to help me extract content from HTML?
BBEdit has a simple menu command to remove markup from an HTML page,
leaving only the content. You can then perform any regex
operations needed to massage the data before saving it.
To process all those files, it should be a pretty simple matter to
write an AppleScript to automate this procedure.
However, this solution is Macintosh-only.
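The same strip-then-massage workflow can be approximated on any platform. A crude sketch follows; deleting anything tag-shaped with a regex is lossy, but it matches what a "remove markup" menu command does:

```python
import re

def strip_tags(html):
    """Crude cross-platform analogue of a 'remove markup' command."""
    # Replace anything tag-shaped with a space, then collapse
    # the leftover whitespace into single spaces.
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_tags('<table><tr><td>Hello,</td>\n<td>world</td></tr></table>'))
```

This is fine for making the text searchable; it obviously cannot preserve the links, colors, or <PRE> formatting the original poster sometimes needs to keep.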
--
Jim Royal
"Understanding is a three-edged sword" http://JimRoyal.com
On Mon, 28 Feb 2005 08:32:19 GMT, mark4 wrote: Nor can I contact the person who 'owns' it. If I did contact them, they would be unlikely to release the data.
Despite this, there are no copyright issues here. Every single post made to the forum was made using an alias and no forum poster wants to be identified, nor do any posters wish to claim "ownership" of their contributions.
Sounds to me like there are *major* copyright issues!
I can't see what those issues are.
By law, those posts are copyrighted and owned by the posters.
On Mon, 28 Feb 2005 06:06:36 GMT, mark4
<mark4asp@#notthis#ntlworld.com> wrote: Hello,
Are there any utilities to help me extract content from HTML?
< snip >
NoteTab? Modify > Strip HTML Tags? http://www.notetab.com/
Not sure whether that is in the freeware version or not.
Regards, John.
mark4 wrote: Thanks. Being a microserf, I don't normally code in Perl but I may look into this.
I am told ActiveState's Windows port of Perl is pretty good. Alternatively
there is also a Cygwin version of Perl.
I can't see what those issues are. Who owns the data?
Its original authors, unless they explicitly signed away the copyright.
All those original authors have an alias and don't actually want to be identified.
Publishing anonymously or under a pseudonym does not mean you forgo
copyright.
So long as I don't republish it.
If you are keeping the database for private use, then you can probably
"get away with it", but the natural assumption on alt.html is that posters
are wanting to publish their efforts to the web, unless it's explicitly
stated otherwise.
--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
To do that, I recommend using Perl. Perl has a module called
HTML::Parser which is apparently pretty good at extracting information from
malformed HTML files. What's more, it is generally very good at text handling
and has decent database modules too.
Thanks. Being a microserf, I don't normally code in Perl but I may look into this. It's either that or WSH JavaScript with its regular expressions. Fortunately I already have a top level design and it looks pretty simple. I may look into this Perl module but it will probably be easier to use microserf technology with which I'm intimate. I shall probably store it in MSSQL.
You could use the InternetExplorer.Application COM object.
That would give you the facilities for performing HTML
parsing without regexps. It would therefore be
more robust and readily doable in your favorite language.
Try Google for examples.