
Extracting Semantic Structure of HTML Doc

Hi,

I'm interested in projects that revolve around extracting the semantic
structure of HTML documents. By extracting semantic structure I mean
analysing an HTML doc and outputting a model (perhaps a tree structure) that
relates paragraphs to headings/subheadings. It should be a difficult
problem, since HTML is a structured language. Does anyone know of any
existing research?

Cheers,
Michael

[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to <co*****@moderators.isc.org>, and ]
[ ask your news administrator to fix the problems with your system. ]
Jul 23 '05 #1
On Tue, 25 Jan 2005 22:31:27 GMT, <da*****@hotmail.com> wrote:
I'm interested in projects that revolve around extracting the semantic
structure of HTML documents. By extracting semantic structure I mean
analysing an HTML doc and outputting a model (perhaps a tree structure) that
relates paragraphs to headings/subheadings.


Isn't this what the DOM-inspector in Mozilla does?

--
,-- --<--@ -- PretLetters: 'woest wyf', met vele interesses: ----------.
| weblog | http://home.wanadoo.nl/b.de.zoete/_private/weblog.html |
| webontwerp | http://home.wanadoo.nl/b.de.zoete/html/webontwerp.html |
|zweefvliegen | http://home.wanadoo.nl/b.de.zoete/html/vliegen.html |
`-------------------------------------------------- --<--@ ------------'

Jul 23 '05 #2
Hi,

I thought the DOM inspector only shows the parse tree. No?
Cheers,
Michael

Jul 23 '05 #3
da*****@hotmail.com wrote:
Hi,

I thought the DOM inspector only shows the parse tree. No?


Your original used the word "tree", and sounded as if the DOM
is what you meant. Semantic structure is rarely strong in webpages,
whereas a DOM can always be constructed.

AccessValet is related to this task, in that its page analysis
includes a heuristic evaluation of semantic structure, and it will
issue warnings if it appears wrong. For example,
<p><big><b>My Homepage</b></big></p> will suggest that this should
be a heading, while <h4>a great long passage of text and
<img ...>other things<br> that really don't look like a heading</h4>
will generate an opposite warning. Of course, there's a big grey
area between obviously-right and obviously-wrong uses, where only
human evaluation will serve. AccessValet also offers a range of
options for presentation of results, including annotated tree views.
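
As a rough illustration of the kind of heuristic described above (this is
not AccessValet's actual implementation, which isn't shown in this thread),
a minimal Python sketch might look like the following. The class name, the
`check` helper, the warning labels and the 12-word threshold are all
assumptions made up for the example.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
EMPHASIS = {"b", "strong", "big"}
MAX_HEADING_WORDS = 12  # arbitrary cut-off, an assumption for this sketch


class HeadingHeuristics(HTMLParser):
    """Flag <p> blocks that look like headings and headings that do not."""

    def __init__(self):
        super().__init__()
        self.warnings = []       # list of (kind, text) tuples
        self._block = None       # tag of the <p>/<hN> currently being read
        self._emph_depth = 0     # open emphasis tags inside the block
        self._text = []          # all text seen in the block
        self._emph_text = []     # text seen inside emphasis tags
        self._has_media = False  # block contains <img> or <br>

    def handle_starttag(self, tag, attrs):
        if tag == "p" or tag in HEADINGS:
            self._block = tag
            self._emph_depth = 0
            self._text, self._emph_text = [], []
            self._has_media = False
        elif self._block:
            if tag in EMPHASIS:
                self._emph_depth += 1
            elif tag in ("img", "br"):
                self._has_media = True

    def handle_data(self, data):
        if self._block:
            self._text.append(data)
            if self._emph_depth > 0:
                self._emph_text.append(data)

    def handle_endtag(self, tag):
        if self._block and tag in EMPHASIS and self._emph_depth > 0:
            self._emph_depth -= 1
        elif tag == self._block:
            self._evaluate()
            self._block = None

    def _evaluate(self):
        text = " ".join("".join(self._text).split())
        emph = " ".join("".join(self._emph_text).split())
        words = len(text.split())
        if self._block == "p":
            # A short paragraph that is emphasised in its entirety.
            if text and text == emph and words <= MAX_HEADING_WORDS:
                self.warnings.append(("should-be-heading", text))
        elif words > MAX_HEADING_WORDS or self._has_media:
            # A heading that is long or contains images/line breaks.
            self.warnings.append(("doubtful-heading", text))


def check(html):
    """Return the heuristic warnings for an HTML fragment."""
    parser = HeadingHeuristics()
    parser.feed(html)
    parser.close()
    return parser.warnings
```

Calling `check('<p><big><b>My Homepage</b></big></p>')` flags the paragraph
as a likely heading. Anything in the grey area still needs a human to look
at it.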

A simple summary is offered by page outliners in tools such as the
W3C validator, mod_accessibility, and some assistive browsers.
Check the archives of the w3c-wai-er (Evaluation and Repair tools)
list for relevant discussion and various experimental software.
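
For well-formed pages that already use <h1>..<h6>, a page outline that
relates paragraphs to their headings can be sketched with Python's standard
`html.parser`. This is only a minimal sketch and not how any of the tools
named above work internally. The `Outliner` class, the `outline` helper and
the dict layout are hypothetical names chosen for the example, and it
assumes every <p> and heading has an explicit closing tag.

```python
from html.parser import HTMLParser


def _node(level, title):
    return {"level": level, "title": title, "paragraphs": [], "children": []}


class Outliner(HTMLParser):
    """Build a tree of headings, each holding the paragraphs that follow it."""

    def __init__(self):
        super().__init__()
        self.root = _node(0, None)  # level 0 holds text before any heading
        self._stack = [self.root]   # path from the root to the current section
        self._tag = None            # <p>/<hN> currently being read
        self._buf = []

    @staticmethod
    def _is_heading(tag):
        return len(tag) == 2 and tag[0] == "h" and tag[1] in "123456"

    def handle_starttag(self, tag, attrs):
        if tag == "p" or self._is_heading(tag):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())
        self._tag = None
        if tag == "p":
            # Attach the paragraph to the most recently opened section.
            self._stack[-1]["paragraphs"].append(text)
            return
        level = int(tag[1])
        # Close every section at the same or a deeper level.
        while self._stack[-1]["level"] >= level:
            self._stack.pop()
        node = _node(level, text)
        self._stack[-1]["children"].append(node)
        self._stack.append(node)


def outline(html):
    """Return the root of the heading/paragraph tree for an HTML string."""
    parser = Outliner()
    parser.feed(html)
    parser.close()
    return parser.root
```

Pages that fake their headings with <p><b>...</b></p> would come out as one
flat section here, which is exactly where the heuristic evaluation becomes
necessary.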

--
Nick Kew

Jul 23 '05 #4
