Matthias Gutfeldt wrote:
da*****@hotmail.com wrote:
I'm in need of a program that extracts the semantic structure of
HTML pages --
[snip]
If not, should it be possible to write a program that is
rule-based? Is it likely for the rules to clash?
But if you want to grab their content, it's far easier to use one of
the many, many CNET RSS feeds:
<http://www.cnet.com/4520-6022-5115113.html>. Their usage guidelines
are very sensible, IMHO.
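
For what it's worth, here is a minimal sketch of why a feed is easier to consume than the page markup. It assumes a plain RSS 2.0 layout (channel, item, title, link); the inline SAMPLE_FEED and the feed_items helper are made up for illustration and are not taken from any actual CNET feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a downloaded feed; a real feed would be fetched
# over HTTP and may carry more elements than shown here.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def feed_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


items = feed_items(SAMPLE_FEED)
for title, link in items:
    print(title, link)
```

The structure is already explicit in the feed, so no guessing about headings or layout is needed.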
The question was phrased in a way which made me think of a course
assignment, so perhaps a practical solution won't help, bicbw ;)
Markup as used in the real world clearly carries no reliable semantic
structure, so any attempt to extract whatever such information might be
present can at best be heuristic. So yes, obviously: there is no logical
reason why you wouldn't discover useful rules that clash in some
circumstances.
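
To make the clash concrete, here is a rough sketch using Python's standard html.parser. The three rules (rule_heading_tag, rule_bold_short, rule_paragraph_tag), the 40-character cutoff and the sample markup are all invented for illustration; a real ruleset would look different and would need tuning against real pages.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


# Each hypothetical rule looks at the currently open tags and the text,
# and returns a label or None if it has no opinion.
def rule_heading_tag(tags, text):
    return "heading" if HEADING_TAGS & set(tags) else None


def rule_bold_short(tags, text):
    # Short bold text is often a heading in disguise (assumed cutoff: 40 chars).
    return "heading" if {"b", "strong"} & set(tags) and len(text) < 40 else None


def rule_paragraph_tag(tags, text):
    return "paragraph" if "p" in tags else None


RULES = [rule_heading_tag, rule_bold_short, rule_paragraph_tag]


class RuleRunner(HTMLParser):
    """Applies every rule to every non-empty text chunk."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.results = []  # list of (text, set of labels)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the most recent matching tag and anything opened after it,
            # which tolerates sloppy nesting.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        text = data.strip()
        if text:
            labels = {rule(self.open_tags, text) for rule in RULES} - {None}
            self.results.append((text, labels))


def classify(html):
    runner = RuleRunner()
    runner.feed(html)
    runner.close()
    return runner.results


sample = (
    "<h1>News</h1>"
    "<p><b>Latest reviews</b></p>"
    "<p>Some longer body text goes here.</p>"
)
results = classify(sample)
# A clash is any chunk for which the rules produced more than one label.
clashes = [text for text, labels in results if len(labels) > 1]
print(clashes)
```

With this sample, the bold text inside the paragraph is the one chunk that two rules label differently. Each rule is reasonable on its own, which is the point: deciding which one wins is itself another heuristic.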
As to how likely, you'd probably need to investigate possible rulesets
and analyse existing markup before you could answer that properly.
At a guess, given the mess that seems typical at present, I'd say it's
very likely.
--
Michael
m r o z a t u k g a t e w a y d o t n e t