Extracting semantic structure of HTML page

Hi,

I'm in need of a program that extracts the semantic structure of HTML
pages -- a program that groups paragraphs with the corresponding
headings etc. I know it's not too difficult to extract well structured
documents, e.g. formal essays. However, with complicated sites like CNN
and CNet, where many tables are used to align text, it seems to become
extremely difficult. Does anyone know of any existing applications? If
not, would it be possible to write a rule-based program? Are the rules
likely to clash?

Please help.

Cheers,
Michael
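
For the well-structured case mentioned above, grouping paragraphs under the nearest preceding heading can be sketched with Python's standard `html.parser` module. This is only an illustration under strong assumptions: the page uses real `<h1>`..`<h6>` and `<p>` elements, and the names `OutlineParser` and `extract_outline` are made up for this example. Table-based layouts like the ones described would need additional heuristics.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Group each <p> under the most recent <h1>..<h6> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # The first section collects paragraphs that appear before any heading.
        self.sections = [{"heading": None, "paragraphs": []}]
        self._tag = None   # tag currently being captured, or None
        self._buf = []     # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self.flush()   # tolerate an unclosed <p> before this tag
            self._tag = tag
            self._buf = []

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.flush()

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def flush(self):
        """Store the captured text as a new heading or as a paragraph."""
        if self._tag is None:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if self._tag in HEADINGS:
            self.sections.append({"heading": text, "paragraphs": []})
        elif text:
            self.sections[-1]["paragraphs"].append(text)
        self._tag = None
        self._buf = []


def extract_outline(html):
    """Return a list of {"heading", "paragraphs"} dicts in document order."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    # Drop sections that ended up with neither a heading nor any paragraphs.
    return [s for s in parser.sections if s["heading"] or s["paragraphs"]]


if __name__ == "__main__":
    page = "<h1>Title</h1><p>Intro.</p><h2>Part</h2><p>One</p><p>Two</p>"
    for section in extract_outline(page):
        print(section["heading"], section["paragraphs"])
```

The sketch deliberately ignores heading levels and everything outside `<p>`; a fuller version would build a tree from the `h1`..`h6` nesting.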

Jul 23 '05 #1
da*****@hotmail.com wrote:

I'm in need of a program that extracts the semantic structure of HTML
pages -- a program that groups paragraphs with the corresponding
headings etc. I know it's not too difficult to extract well structured
documents, e.g. formal essays. However, with complicated sites like CNN
and CNet, where many tables are used to align text, it seems to become
extremely difficult. Does anyone know of any existing applications? If
not, would it be possible to write a rule-based program? Are the rules
likely to clash?


<http://www.hotscripts.com/PHP/Scripts_and_Programs/Web_Fetching/index.html>

But if you want to grab their content, it's far easier to use one of the
many, many CNET RSS feeds: <http://www.cnet.com/4520-6022-5115113.html>.
Their usage guidelines are very sensible, IMHO.
Matthias
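
To show why a feed is easier than scraping, here is a minimal sketch that reads items from an RSS 2.0 document with Python's standard `xml.etree.ElementTree`. The sample feed, its URLs, and the function name `read_items` are invented for illustration; it assumes the usual RSS 2.0 `item` children (`title`, `link`, `description`) and does not fetch anything from CNET.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real feed would be downloaded first.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return one dict per <item>, with missing fields as empty strings."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "summary": item.findtext("description", default="").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

The structure (title, link, summary) is already explicit in the feed, so no layout heuristics are needed.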

Jul 23 '05 #2
Matthias Gutfeldt wrote:
da*****@hotmail.com wrote:
I'm in need of a program that extracts the semantic structure of
HTML pages --
[snip]
If not, would it be possible to write a rule-based program? Are the
rules likely to clash?


But if you want to grab their content, it's far easier to use one of
the many, many CNET RSS feeds:
<http://www.cnet.com/4520-6022-5115113.html>. Their usage guidelines
are very sensible, IMHO.


The question was phrased in a way which made me think of a course
assignment, so perhaps a practical solution won't help, bicbw ;)

As is clear, markup in use in the real world has no implicit semantic
structure; therefore any attempt to extract whatever such information
might be present can at best be heuristic. So yes: there is no logical
reason why you wouldn't discover useful rules that clash in some
circumstances.
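
A toy illustration of such a clash, as a sketch only: the two rules, the `Block` type, and the `classify` function below are all invented for this example and are not taken from any existing tool. Each rule is plausible on its own, yet both fire on a short bold sentence and disagree.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical, already-extracted run of text with one style flag."""
    text: str
    bold: bool = False


def rule_bold_short(block):
    """Heuristic: short, fully bold text is probably a heading."""
    if block.bold and len(block.text.split()) <= 6:
        return "heading"
    return None


def rule_sentence(block):
    """Heuristic: text ending in sentence punctuation is probably body text."""
    if block.text.rstrip().endswith((".", "!", "?")):
        return "paragraph"
    return None


RULES = [rule_bold_short, rule_sentence]


def classify(block):
    """Return (label, votes); label is "clash" when the rules disagree."""
    votes = {}
    for rule in RULES:
        label = rule(block)
        if label is not None:
            votes[rule.__name__] = label
    labels = set(votes.values())
    if len(labels) > 1:
        return "clash", votes
    if labels:
        return labels.pop(), votes
    return "unknown", votes


if __name__ == "__main__":
    print(classify(Block("World News", bold=True)))      # only the heading rule fires
    print(classify(Block("Read this now.", bold=True)))  # both rules fire and disagree
```

Resolving the clash needs some extra policy (rule priorities, weights, or more context), and that policy is itself another heuristic.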

As to how likely, you'd probably need to investigate possible rulesets
and analyse existing markup before you could answer that properly.

At a guess, given the mess that seems typical at present, I'd say it's
very likely.

--
Michael
m r o z a t u k g a t e w a y d o t n e t
Jul 23 '05 #3
