Extracting semantic structure of HTML page

dayzman

Hi,

I'm in need of a program that extracts the semantic structure of HTML
pages -- a program that groups paragraphs with the corresponding
headings etc. I know it's not too difficult to extract well structured
documents, e.g. formal essays. However, with complicated sites like CNN
and CNet, where many tables are used to align text, to me it'd become
extremely difficult. Does anyone know of any existing applications? If
not, should it be possible to write a program that is rule-based? Is it
likely for the rules to clash?
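
For the well-structured case, the grouping step might look roughly like the sketch below. It is only an illustration using Python's standard `html.parser`; the names `OutlineParser` and `group_by_heading` and the sample markup are made up for this example, and it assumes headings are really marked up as `h1` to `h6` and paragraphs as `p`. Table-based layouts are exactly where this assumption stops holding.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Group each <p> under the most recent heading (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # Leading section catches paragraphs that appear before any heading.
        self.sections = [{"heading": None, "paragraphs": []}]
        self._open_tag = None  # heading or p currently being read
        self._buffer = []      # text fragments of the open element

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS or tag == "p":
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return  # inline tags such as </em> do not close the block
        text = " ".join("".join(self._buffer).split())
        if tag in HEADING_TAGS:
            self.sections.append({"heading": text, "paragraphs": []})
        elif text:
            self.sections[-1]["paragraphs"].append(text)
        self._open_tag = None


def group_by_heading(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    # Drop the leading placeholder section if nothing landed in it.
    return [s for s in parser.sections if s["heading"] or s["paragraphs"]]


SAMPLE_ESSAY = """
<h1>Essay</h1>
<p>Opening paragraph.</p>
<h2>Background</h2>
<p>First point.</p>
<p>Second <em>point</em>.</p>
"""

if __name__ == "__main__":
    for section in group_by_heading(SAMPLE_ESSAY):
        print(section["heading"], section["paragraphs"])
```
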

Please help.

Cheers,
Michael

Jul 23 '05 #1

Matthias Gutfeldt

da*****@hotmail.com wrote:

> I'm in need of a program that extracts the semantic structure of HTML
> pages -- a program that groups paragraphs with the corresponding
> headings etc. I know it's not too difficult to extract well structured
> documents, e.g. formal essays. However, with complicated sites like CNN
> and CNet, where many tables are used to align text, to me it'd become
> extremely difficult. Does anyone know of any existing applications? If
> not, should it be possible to write a program that is rule-based? Is it
> likely for the rules to clash?

<http://www.hotscripts.com/PHP/Scripts_and_Programs/Web_Fetching/index.html>

But if you want to grab their content, it's far easier to use one of the
many, many CNET RSS feeds: <http://www.cnet.com/4520-6022-5115113.html>.
Their usage guidelines are very sensible, IMHO.
Matthias
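
To give an idea of why the feed route is easier: in an RSS 2.0 feed each story is already an `item` with its own `title`, `link` and `description`, so there is nothing to guess. A minimal sketch with Python's standard `xml.etree.ElementTree` follows. The feed content and the function name `read_items` are invented for illustration; a real feed would be downloaded first and may carry extra elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, trimmed to the standard RSS 2.0 item fields.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return one dict per <item>, with missing fields as empty strings."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```
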

Jul 23 '05 #2

Michael Rozdoba

Matthias Gutfeldt wrote:

> da*****@hotmail.com wrote:
>> I'm in need of a program that extracts the semantic structure of
>> HTML pages --
>> [snip]
>> If not, should it be possible to write a program that is
>> rule-based? Is it likely for the rules to clash?
>
> But if you want to grab their content, it's far easier to use one of
> the many, many CNET RSS feeds:
> <http://www.cnet.com/4520-6022-5115113.html>. Their usage guidelines
> are very sensible, IMHO.

The question was phrased in a way which made me think of a course
assignment, so perhaps a practical solution won't help, bicbw ;)

As is clear, markup in use in the real world has no implicit semantic
structure, so any attempt to extract whatever structure might be present
can at best be heuristic. Hence yes, obviously: there is no logical
reason why you wouldn't come up with useful rules which clash in some
circumstances.
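
A rough illustration of such a clash, as a sketch only: the three rules below are made-up examples of the kind of heuristic one might try, not an established ruleset, and all names (`BlockCollector`, `rule_short_bold` and so on) are hypothetical. One rule says "short, all-bold text is a heading", another says "text in a table cell or paragraph is body text", and a bold label inside a layout table satisfies both.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "h3", "p", "td"}
BOLD_TAGS = {"b", "strong"}


class BlockCollector(HTMLParser):
    """Collect text blocks plus the features the rules below inspect."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None   # block being filled, or None
        self._bold_depth = 0   # > 0 while inside <b> or <strong>

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._current = {"tag": tag, "text": "", "chars": 0, "bold_chars": 0}
        elif tag in BOLD_TAGS:
            self._bold_depth += 1

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS:
            self._bold_depth = max(0, self._bold_depth - 1)
        elif self._current is not None and tag == self._current["tag"]:
            self._current["text"] = " ".join(self._current["text"].split())
            if self._current["text"]:
                self.blocks.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is None:
            return
        visible = len("".join(data.split()))  # count non-whitespace only
        self._current["text"] += data
        self._current["chars"] += visible
        if self._bold_depth:
            self._current["bold_chars"] += visible


def rule_heading_tag(block):
    return "heading" if block["tag"] in {"h1", "h2", "h3"} else None


def rule_short_bold(block):
    all_bold = block["chars"] > 0 and block["bold_chars"] == block["chars"]
    return "heading" if all_bold and len(block["text"]) <= 40 else None


def rule_container_is_body(block):
    return "body" if block["tag"] in {"p", "td"} else None


RULES = [rule_heading_tag, rule_short_bold, rule_container_is_body]


def verdicts(block):
    """Set of labels proposed by the rules that fire for this block."""
    return {v for v in (rule(block) for rule in RULES) if v is not None}


def find_clashes(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return [b["text"] for b in collector.blocks if len(verdicts(b)) > 1]


SAMPLE_PAGE = """
<h1>News</h1>
<table><tr><td><b>Top Stories</b></td></tr>
<tr><td>Markets fell sharply today.</td></tr></table>
<p>Plain paragraph.</p>
"""

if __name__ == "__main__":
    print(find_clashes(SAMPLE_PAGE))  # blocks where the rules disagree
```

Any real ruleset would need some way to settle these cases, for example rule priorities or weights, and choosing that is itself a heuristic.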

As to how likely, you'd probably need to investigate possible rulesets
and analyse existing markup before you could answer that properly.

At a guess, given the mess that seems typical at present, I'd say it's
very likely.

--
Michael
m r o z a t u k g a t e w a y d o t n e t

Jul 23 '05 #3
