lexical analysis of html

Hi,

I am trying to do lexical analysis of some simplified HTML code with
the flex tool. However, this kind of work is new to me and I don't know
where to start. I have searched the web but didn't find anything
useful. I found tools like the LEXHTML.CXX library, but I have no need for
that.

What I need is a simple overview of how the most common HTML
lexical analyzers work, like the ones inside IE or Gecko. Something like a good
article or post that describes how lexical atoms, operators, and
similar particles are recognized in HTML (what is what and where it
goes).

Kudos for your help.

Jul 20 '05 #1
4 Replies


Hi, bariole,

Generally speaking, it is impossible to create a scanner and parser for SGML
(HTML and XML) using flex/bison and the like.
These are special languages, as they require building the parser dynamically
on the fly, based on a Document Type Declaration (DTD).

Moreover, a practical HTML scanner and parser must also include a parser for
JavaScript (or a subset of it), as you may bump into the following:
<SCRIPT>
function foo() { a.write("</SCRIPT>"); }
</SCRIPT>
(not recommended by the spec, but it happens)

In short, I think you will not find ready-to-use lex/y files for HTML.

A "finite state automaton" for an HTML parser is not so hard to write,
and there are plenty of examples on the Net.
For example: http://www.do.org/products/parser/
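
A minimal sketch of such a hand-written finite-state scanner, in Python, is given here only to illustrate the idea. The function name `tokenize` and the token kinds ("text", "start", "end", "script") are made up for this sketch and do not come from any particular library. It assumes simplified HTML: attributes are skipped, comments and entities are not handled, and a ">" inside a quoted attribute value would confuse it. The scanner ends a script at the first "</script" it sees, which is simpler than parsing the JavaScript, so the example above gets cut inside the string literal.

```python
import re

# Hypothetical names for this sketch; not taken from any real library.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")
SCRIPT_END = re.compile(r"</script", re.IGNORECASE)

def tokenize(html):
    """Yield (kind, value) pairs from a simplified HTML string."""
    pos, n = 0, len(html)
    while pos < n:
        if html[pos] != "<":
            # TEXT state: everything up to the next "<" is character data.
            end = html.find("<", pos)
            if end == -1:
                end = n
            yield ("text", html[pos:end])
            pos = end
            continue
        # TAG state: optional "/", a name, then skip ahead to the closing ">".
        closing = html.startswith("</", pos)
        m = TAG_NAME.match(html, pos + (2 if closing else 1))
        gt = html.find(">", pos)
        if m is None or gt == -1:
            # Not a usable tag: emit the "<" as literal text and move on.
            yield ("text", "<")
            pos += 1
            continue
        name = m.group().lower()
        yield ("end" if closing else "start", name)
        pos = gt + 1
        if name == "script" and not closing:
            # SCRIPT state: raw text up to the first "</script", if any.
            m_end = SCRIPT_END.search(html, pos)
            end = m_end.start() if m_end else n
            if end > pos:
                yield ("script", html[pos:end])
            pos = end

sample = '<p>Hi</p><SCRIPT>a.write("</SCRIPT>");</SCRIPT>'
for token in tokenize(sample):
    print(token)
```

Running it on the sample shows the script token ending at `a.write("`, followed by an end tag and a stray text token. That is the behaviour the `</SCRIPT>` example warns about.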

"Per aspera ad astra!" :)

Andrew Fedoniouk.
http://terrainformatica.com

Jul 20 '05 #2

As Andrew wrote, arbitrary valid HTML is too "fluffy" to be easily
defined in the way flex requires.

If you are writing the HTML yourself, you can make it a bit stricter by
doing things like always closing elements, which will help you. By
writing Appendix C-compliant XHTML 1.0, you can make your code
valid XML as well, which will make it a lot easier to parse.
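
To illustrate the "valid XML" route, here is a small sketch that hands a strict fragment to Python's standard `xml.etree.ElementTree` parser. The sample markup and the helper name `walk` are hypothetical, and the fragment leaves out the XHTML namespace and DOCTYPE for brevity. Markup with an unclosed element is rejected outright, so there is nothing to guess at.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: every element is closed and attributes are quoted.
XHTML = '<div><p class="intro">Hello<br/>world</p><p>Bye</p></div>'

def walk(elem, depth=0):
    """Return (depth, tag, leading text) for elem and its descendants, in document order."""
    text = (elem.text or "").strip()
    rows = [(depth, elem.tag, text)]
    for child in elem:
        rows.extend(walk(child, depth + 1))
    return rows

rows = walk(ET.fromstring(XHTML))
for depth, tag, text in rows:
    print("  " * depth + tag, repr(text))

# The same content with an unclosed <br> is not well-formed XML and is rejected.
try:
    ET.fromstring("<p>Hello<br>world</p>")
    rejected = False
except ET.ParseError:
    rejected = True
print("unclosed <br> rejected:", rejected)
```

Note that `walk` reports only the text before an element's first child. Text that follows a child, such as "world" after the `<br/>`, lives in that child's `tail` attribute and is ignored here.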

--
Mark.
Jul 20 '05 #3

On Wed, 19 May 2004 04:05:32 GMT, "Andrew Fedoniouk"
<ne**@terrainformatica.com> wrote:
> Generally speaking, it is impossible to create a scanner and parser for SGML
> (HTML and XML) using flex/bison and the like.
> These are special languages, as they require building the parser dynamically
> on the fly, based on a Document Type Declaration (DTD).
>
> Moreover, a practical HTML scanner and parser must also include a parser for
> JavaScript (or a subset of it), as you may bump into the following:
> <SCRIPT>
> function foo() { a.write("</SCRIPT>"); }
> </SCRIPT>
> (not recommended by the spec, but it happens)

I was thinking about similar problems. To me, lex seems like a very
"strict" tool, actually too strict to be used for a language like HTML.

> In short, I think you will not find ready-to-use lex/y files for HTML.

I didn't ask for that, although it would be very nice to see
those files if they existed somewhere. Lex and lexical analysis are
new to me, and I needed ideas about how those things are done.

However, I am pleased to say that after two days of thinking I found a
solution to my problem.

Thank you for your reply. Your post was more helpful than you can
imagine.

Jul 20 '05 #4

On Wed, 19 May 2004 08:06:35 +0100, Mark Tranchant
<ma**@tranchant.plus.com> wrote:
> As Andrew wrote, arbitrary valid HTML is too "fluffy" to be easily
> defined as flex requires.
>
> If you are writing the HTML yourself, you can make it a bit stricter by
> doing things like always closing elements which will help you. By
> writing Appendix C-compliant XHTML-1.0, you can make your code into
> valid XML as well, which will make it a lot easier to parse.

Thanks, Mark. I will do something like that.
Jul 20 '05 #5
