lexical analysis of html

bariole
#1: Jul 20 '05
Hi

I am trying to do lexical analysis of some simplified HTML code with
the flex tool. However, this kind of work is new to me and I don't know
where to start. I have searched the web but didn't find anything
useful. I found tools like the LEXHTML.CXX library, but I have no need
for that.

What I need is a simple overview of how the most common HTML lexical
analyzers work, like the ones inside IE or Gecko. Something like a good
article or post that describes how lexical atoms, operators, and
similar particles are recognized in HTML (what is what and where it
goes).

Thanks for your help.

Andrew Fedoniouk
#2: Jul 20 '05

re: lexical analysis of html


Hi, bariole,

Generally speaking, it is not possible to build a complete scanner and parser
for SGML-family languages (HTML and XML among them) using flex/bison and the
like. These are special languages: in the general case the parser has to be
built dynamically, on the fly, from the Document Type Definition (DTD).

Moreover, a practical HTML scanner and parser also has to include a parser
for JavaScript (or a subset of it), because you may bump into the following:
<SCRIPT>
function foo() { a.write("</SCRIPT>"); }
</SCRIPT>
(not recommended by spec but happens)
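
One way to sidestep a full JavaScript parser is to switch the scanner into a raw-text mode after the opening tag and stop at the first "</script", whatever the JavaScript around it means. The Python sketch below illustrates that simplified rule only; `split_script` and `SCRIPT_END` are hypothetical names, and real browsers apply further rules that are not shown. Under this rule the snippet above ends at the `</SCRIPT>` inside the string literal, which is why such scripts are usually written with `"<\/SCRIPT>"` instead.

```python
import re

# Assumed simplified rule: raw script text ends at the first "</script"
# (case-insensitive) that is followed by whitespace, "/" or ">".
SCRIPT_END = re.compile(r"</script(?=[\s/>])", re.IGNORECASE)

def split_script(source, start):
    """Return (script_text, index_after_end_tag) for raw text starting at `start`.

    If no end tag is found, the rest of the input is treated as script text.
    """
    match = SCRIPT_END.search(source, start)
    if match is None:
        return source[start:], len(source)
    close = source.find(">", match.end())          # skip to the end of the end tag
    end = len(source) if close == -1 else close + 1
    return source[start:match.start()], end

html = '<SCRIPT>function foo() { a.write("</SCRIPT>"); }</SCRIPT>'
body, rest = split_script(html, len("<SCRIPT>"))
print(repr(body))        # the script text as this rule sees it
print(repr(html[rest:])) # whatever is left over for the HTML scanner
```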

In short, I think you will not find ready-to-use lex/yacc files for HTML.

A hand-written "finite state automaton" for an HTML parser is not so hard to
write, and there are plenty of examples on the Net.
For example: http://www.do.org/products/parser/
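
To give an idea of what such an automaton looks like, here is a minimal Python sketch for simplified HTML. It is not taken from the linked parser or from any browser; the state names, the token kinds, and the function name `tokenize` are all assumptions made for illustration. It recognizes text, start tags, end tags, and comments, and it keeps attributes as part of the raw tag text.

```python
def tokenize(html):
    """Yield (kind, value) pairs; kind is 'text', 'start', 'end' or 'comment'."""
    TEXT, TAG, QUOTED, COMMENT = range(4)      # scanner states
    state, buf, quote = TEXT, [], ""
    i, n = 0, len(html)
    while i < n:
        ch = html[i]
        if state == TEXT:
            is_comment = html.startswith("<!--", i)
            is_tag = (ch == "<" and i + 1 < n
                      and (html[i + 1].isalpha() or html[i + 1] == "/"))
            if is_comment or is_tag:
                if buf:                        # flush pending character data
                    yield ("text", "".join(buf))
                    buf = []
                state = COMMENT if is_comment else TAG
                i += 4 if is_comment else 1    # skip "<!--" or "<"
                continue
            buf.append(ch)                     # a stray "<" is kept as text
        elif state == TAG:
            if ch == ">":                      # end of tag: emit, back to text
                raw = "".join(buf)
                kind = "end" if raw.startswith("/") else "start"
                yield (kind, raw.lstrip("/").strip())
                buf, state = [], TEXT
            else:
                if ch in "\"'":                # quoted value may contain ">"
                    state, quote = QUOTED, ch
                buf.append(ch)
        elif state == QUOTED:
            buf.append(ch)
            if ch == quote:                    # closing quote: back to tag
                state = TAG
        else:                                  # COMMENT
            if html.startswith("-->", i):
                yield ("comment", "".join(buf))
                buf, state = [], TEXT
                i += 3
                continue
            buf.append(ch)
        i += 1
    if buf and state == TEXT:                  # unfinished tags are dropped
        yield ("text", "".join(buf))

tokens = list(tokenize('<p class="a>b">Hi <b>there</b><!-- note --></p>'))
for token in tokens:
    print(token)
```

The quoted-value state is there so that a ">" inside an attribute value does not end the tag. Splitting the raw tag text into a name and attributes would be a second, similar little automaton.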

"Per aspera ad astra!" :)

Andrew Fedoniouk.
http://terrainformatica.com


Mark Tranchant
#3: Jul 20 '05

re: lexical analysis of html


bariole wrote:

> Hi
>
> I am trying to make lexical analysis of some simplified html code with
> flex tool. However that kind of work is new to me and I don't know
> where to start. I have searched a web but I didn't find anything
> useful. I found tools like LEXHTML.CXX library but I have no need for
> that.
>
> What I need is simple overview of working ideas of most usual html
> lexical analysators like ones inside IE or Gecko. Something like good
> article or post where is described how lexical atoms or operators and
> similar particles are recognized in HTML (what is what and where it
> goes).

As Andrew wrote, arbitrary valid HTML is too "fluffy" to be easily
defined in the way flex requires.

If you are writing the HTML yourself, you can make it a bit stricter by
doing things like always closing elements, which will help you. By
writing XHTML 1.0 that follows the Appendix C guidelines, you also make
your markup well-formed XML, which will make it a lot easier to parse.
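
As a rough illustration of the XML route, the Python sketch below feeds a small XHTML fragment to the standard library's `xml.etree.ElementTree`. The sample document and the variable names are made up for the example. It also shows that the same parser rejects markup with an unclosed element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"   # ElementTree prefixes tags with this

# Hypothetical sample document: every element is closed, attributes are quoted.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>First line<br />second line</p>
    <img src="logo.png" alt="Logo" />
  </body>
</html>"""

root = ET.fromstring(doc)
tags = [el.tag.replace(XHTML_NS, "") for el in root.iter()]
print(tags)

# The same markup with an unclosed <br> is not well-formed XML.
try:
    ET.fromstring("<p>one<br>two</p>")
    strict = False
except ET.ParseError:
    strict = True
print("unclosed <br> rejected:", strict)
```

One caveat with this approach: named entities such as `&nbsp;` are not known to a plain XML parser unless the DTD is made available to it, so they may need special handling.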

--
Mark.

bariole
#4: Jul 20 '05

re: lexical analysis of html


On Wed, 19 May 2004 04:05:32 GMT, "Andrew Fedoniouk"
<news@terrainformatica.com> wrote:
>Generally speaking it is impossible to create scanner and parser of SGML
>(HTML and XML) using flex/bison and the like.
>These are special languages as they require building the parser dynamically
>on the fly, based on a Document Type Declaration (DTD).
>
>Moreover, practical HTML scanner and parser shall include also parser for
>JavaScript (or its subset) as you may bump into following:
><SCRIPT>
>function foo() { a.write("</SCRIPT>"); }
></SCRIPT>
>(not recommended by spec but happens)

I was thinking about similar problems. To me lex seems like a very
"strict" tool, actually too strict to be used for a language like HTML.

>To be short I think you will not find ready to use lex/y files for HTML.

I didn't ask for that, although it would be very nice to see those
files if they existed somewhere. Lex and lexical analysis are
new to me, and I needed ideas about how those things are done.

However, I am pleased to say that after two days of thinking I found a
solution to my problem.

Thank you for your reply. Your post was more helpful than you can
imagine.



bariole
#5: Jul 20 '05

re: lexical analysis of html


On Wed, 19 May 2004 08:06:35 +0100, Mark Tranchant
<mark@tranchant.plus.com> wrote:
>bariole wrote:
>
>As Andrew wrote, arbitrary valid HTML is too "fluffy" to be easily
>defined as flex requires.
>
>If you are writing the HTML yourself, you can make it a bit stricter by
>doing things like always closing elements which will help you. By
>writing Appendix C-compliant XHTML-1.0, you can make your code into
>valid XML as well, which will make it a lot easier to parse.

Thanks Mark. I will do something like that.