Compilers - 3: Grammars

11,448 Expert 8TB

Greetings,

this week we discuss the design of the syntactic aspects of our little language;
it helps with the design for the parser(s) that recognize such syntax. Last week
we saw the tokenizer: it just groups sequences of bytes into tokens; it doesn't
know anything about any syntax of any language at all. That's the job of the
parser; a parser on the other hand doesn't care where these tokens come from,
nor how they were formed. Parsers and tokenizers work closely together in order
to understand and check the correctness of a certain input stream. But first,
let's design our little language.

A small programming language

I want this language to be as small as possible because I don't want to write
an entire novel on this topic. Our little language will be a small expression
language. It will translate sequences of expressions which will be interpreted
by an interpreter that produces the values of these expressions.

The language must know about variables, functions, special functions as well
as the 'atomic' parts, i.e. floating point numbers and simple lists. There must
be built-in functions as well as user defined functions. The list of built-in
functions must be easily extendable in case you want to extend this language.

Syntactically speaking a program in our little language looks like this:

Expand|Select|Wrap|Line Numbers

program: ((definition | expression) ';')*

This reads: a program is a list of zero or more of the following: either a
definition or an expression followed by a ';' character. The trailing star
indicates that the previous sequence in parentheses occurs zero or more times.
The vertical bar indicates a choice: either the part on the left or the part
on the right side of the vertical bar.

The word 'program' is named a 'rule' of the grammar/syntax. 'definition' and
'expression' are also rules; we just haven't defined them yet; anything in
quotes anywhere in a rule definition is a token delivered by the tokenizer.
The ';' is a simple token. The parentheses are for grouping parts of a rule
together while the | and * have a special meaning (see above). The + plus
sign also has a special meaning, much like the * star: a star indicates zero
or more of the previous rule part, a plus indicates one or more of the previous
rule part. If I had written this instead:

Expand|Select|Wrap|Line Numbers

program: ((definition | expression) ';')+

A program would have been syntactically correct if it contains at least one
definition or expression. I like it when a program can be empty, so I used
the star (zero or more definitions or expressions). There's yet another special
symbol: the question mark; a ? indicates that the previous part should occur
at most once; so if I had written this:

Expand|Select|Wrap|Line Numbers

program: ((definition | expression) ';')?

A syntactically valid program would either be empty or just one definition or
expression were allowed. For our purposes that would be too restrictive.

EBNF notation

The syntax description above is called 'EBNF': (Extended Backus-Naur Format).
John Backus invented this notation and Peter Naur simplified it somewhat later.
Both were inspired by the work of Noam Chomsky who came up with a totally new
way of thinking about languages. Somewhere in the late 1950s he published his
work and was immediately hated for it by classical linguist who argued that
there would be no place anymore for 'feeling' and 'intuition' for languages.
Noam Chomsky was a stubborn scientist and didn't care at all about the furious
critiques he recieved for his work and argued that linguistics had never been
a proper science before he had published his work.

Chomsky went as far as stating that every human being was born with a natural
universal grammar and that all languages were derivatives of that universal
grammar. All a little toddler had to do was figure out what the peculiarities
of his mother's language are in order to learn how to speak and form sentences
that are syntactically correct. This imprinted universal grammar fades away
when a human grows up. That phenomon may explain why little children learn a
second language so quickly (they simply match it against the universal syntax),
while grown ups have to study hard when they want to master another language.

The Extended Backus-Naur Format formalizes this notion of a syntax.

Definition

Back to our programming language again: a definition mentioned in that grammar
rule actually can be either a function definition or a function declaration.

We need a notion of a declaration because our parser is a 'one pass' parser,
i.e. it reads the token stream once and after that all the work must be done.
When the parser 'sees' a name delivered by the tokenizer and it doesn't know
yet that this name represents a user function it will throw an error: it
doesn't recognize the name as a user function. If two functions call each
other recursively both functions must be declared. But one function is used
in the body of the other function and vice versa. One function body definition
must come first and that's why we need declarations; consider this:

Expand|Select|Wrap|Line Numbers

 
function foo(x) = bar(x-1);

function bar(x) = foo(x/2);

When the function body of function foo is parsed, nothing is known about bar
yet. But we can't define function bar first because it refers to function foo
again. Using declarations, we can solve this little catch 22:

Expand|Select|Wrap|Line Numbers

 
function bar(x);

function foo(x) = bar(x-1);

function bar = foo(x/2);

The first line declares the existence of function bar but doesn't define it yet.
The second line declares and defines a function foo that uses function bar and
the third line finally defines function bar by supplying the body of it. Note
that when a function has been declared (in line one) it is not allowed to repeat
the formal parameter list again, i.e. that formal parameter list was already
defined in the declaration of the function.

This allows for a simpler parser too because it doesn't have to check whether
or not the two formal parameter lists are equal, which they must be of course.

Also note that I introduced the word 'function'; it's a reserved word; another
reserved word will be sneakily introduced in a moment. No user defined function
nor variable can be named 'function'. It's sole purpose is to introduce the
start of a function declaration or definition.

Note that line two where function foo was declared and defined in one go is
sometimes called a 'tentative' definition. Both the function is declared and
its formal parameter list is defined as well as the body of the function itself.

Also note that other programming languages use a different syntax, but we are
defining our own little language and we're free to do what we want.

What would all that stuff above look like in EBNF notation? Have a look at this:

Expand|Select|Wrap|Line Numbers

 
definition: ( 'function' | 'listfunc' ) 'name' defordecl

defordecl: ('(' paramlist ')' ( '=' expression )?) | ( '=' expression ) 

paramlist ( 'name' ( ',' 'name' )* )?

The first line is easy: it just reads that the keyword 'function' or 'listfunc'
needs to be followed by a 'name', followed by another rule defordecl. A defordecl
rule defines the function declaration or the (tentative) definition of a function.
We either parse a parameter list optionally followed by an '=' and an expression
or we skip the parameter list and expect an '=' and an expression right away.

A parameter list is either a 'name' followed by the rest of the parameter list
or the list is completely empty. The rest of the parameter list is a sequence
of a ',' followed by another 'name' zero or more times.

For now we ignore the 'listfunc' reserved word a bit; it indicates that the
user defined function takes entire lists as its argument(s). We'll get back to it.
We have designed the syntax for most of our little language; there's one part
still missing: the syntax for an expression:

Expression

We want to parse ordinary, quite readable, expressions. The expressions include
both binary and unary operators as well as variable names, functions, either
user defined (see above) or built-in functions. We also want to be able to
parse lists.

Lets start with the binary operator expressions. Some binary operators bind more
tightly to their operands than others, e.g. we expect 4+2*19 to be equal to 42,
not 114. We want the multiplication to be done first and the addition last.
The multiplication operator has a higher precedence than the addition operator.

Our little language knows about six different precedence levels. from lowest to
highest precedence the operators are:

Expand|Select|Wrap|Line Numbers

 
:

== !=

<= < > >=

+ -

* /

^

From a bird's eye view an expression is just one or more other expressions
with ':' tokens in between. When there's just one other expression there's
none of the ':' token present in the expression. The 'other expressions'
don't contain the token ':'. Here's the EBNF form of this rule:

Expand|Select|Wrap|Line Numbers

 
expression: comparison ( operator0 comparison )*

operator0: ':'

The ':' operator is just a way to express a sequence of expressions instead
of just one; the value of the expression is the expression on the right of
it; the value of the expression on the left of the ':' token is simply forgotten.

An equality expression tests two other expression of (in)equality:

Expand|Select|Wrap|Line Numbers

 
expression: comparison ( operator1 comparison )*

operator1: '==' | '!='

A comparison doesn't look much different from a top-level expression:

Expand|Select|Wrap|Line Numbers

 
comparison: addition ( operator2 addition )*

operator2: '<=' | '<' | '>' | '>='

And the addition expression looks similar too:

Expand|Select|Wrap|Line Numbers

 
addition: multiplication ( operator3 multiplication )*

operator3: '+' | '-'

It's getting a bit boring already; here's the EBNF for a multiplication:

Expand|Select|Wrap|Line Numbers

 
multiplication: power ( operator4 power )*

operator4: '*' | '/'

Finally here's the expression that handles the binary expression that has the
highest precedence:

Expand|Select|Wrap|Line Numbers

 
power: unary ( operator5 unary )*

operator5: '^'

If I had named the expression, comparison, addition, multiplication and power
rules differently we would have gotten this:

Expand|Select|Wrap|Line Numbers

 
expression0: expression1 ( operator0 expression1 )*

expression1: expression2 ( operator1 expression2 )*

expression2: expression3 ( operator2 expression3 )*

expression3: expression4 ( operator3 expression4 )*

expression4: expression5 ( operator4 expression4 )*

expression5: unary       ( operator5 unary       )*

operator0: ':'

operator1: '==' | '!='

operator2: '<=' | '<' | '>' | '>='

operator3: '+' | '-'

operator4: '*' | '/'

operator5: '^'

Remember this renaming trick when we are going to develop actual code for our
binary expression parser. Next are the 'unary' expressions. Unary expressions
are not so regular as the binary expressions. A unary expression is a unary
operator followed by a unary expression or it is an expression without any
leading unary operator:

Expand|Select|Wrap|Line Numbers

 
unary: ( unaryop unary ) | atomic

unaryop: '+' | '-' | '!'

An atomic expression is one of the following:

- an expression in parentheses (nested) or
- a function call (function) or
- a name optionally followed by an assignment (nameexpr) or
- a constant (constant) or
- a list of expressions (list)

An expression in parentheses is simply this, written in EBNF notation:

Expand|Select|Wrap|Line Numbers

nested: '(' expression ')'

Here's the EBNF notation for a function call:

Expand|Select|Wrap|Line Numbers

 
function: 'name' '(' params ')'

params: ( expression ( ',' expression )* )?

Note the similarity for a params rule and the rule for a formal parameter list
(see the definition section above for that rule).

An assignment (or just a variable name) looks like this:

Expand|Select|Wrap|Line Numbers

 
nameexpr: 'name' ( ( assignop expression ) | '++' | '--' )?

assignop: '=' | '+=' | '-=' | '*=' | '/=' | '^='

If a name is not followed by one of the assignment operators we just want to
use the value of that variable denoted by that 'name'; otherwise we want to
assign another value to the variable using the expression mentioned in the
first rule. Finally, a name can be followed by a unary ++ or -- postfix expression.

A constant is just what it is: it's a floating point constant recognized by
the tokenizer; there is no interesting EBNF notation for it; just this one:

Expand|Select|Wrap|Line Numbers

constant: 'constant'

Finally a list consists of zero or more expressions, separated by commas and
enclosed in curly brackets:

Expand|Select|Wrap|Line Numbers

list: '{' ( expression ( ',' expression )? )* '}'

This is our entire grammar for our little programming language. The atomic
expression thing is the greatest mess and the parser has to check several
things when it parses a 'name', i.e. it could be a simple variable name but
it could also be the name of a built-in function or it could be the name of
a user defined function. It could also be the first part of an assignment.

In order to figure that all out the parser keeps track of a list of built-in
functions as well as a map of user defined functions. But, as they say, that
is just an 'implementation detail' and we leave that for when we actually
start implementing our parser.

Parser generators

We are handcrafting our parser; that isn't needed nowadays: there are quite a
few tools available that can do that nasty job for you. You feed it a simple
text file containing the EBNF notation of a grammar and the tool generates the
entire parser within moments. The same is true for lexical analyzer generators:
they generate an entire lexical analyzer when you feed it a bunch of regular
expressions. We are not going to take that path; we didn't take that path for
our tokenizer (lexical analyzer) and we're not going to take that path either
for our parser. We are going to do it the hard way just to show what the
complexity of those beasts is all about.

Concluding remarks

Just to wet your appetite a bit after all this theoretical stuff in this part
of the compiler article, have a look what our language can do:

Expand|Select|Wrap|Line Numbers

 
function fac(n) =

   if ((n < 2)

      , 1

      , n*fac(n)

   );

This defines a function 'fac' that calculates the faculty of parameter 'n'.
We can even do it with a list of numbers:

Expand|Select|Wrap|Line Numbers

fac( { 1, 2, 3, 4, 5, } );

The result will be { 1, 2, 6, 24, 120 }. Lists as well as a couple of special
built-in functions are going to make our little programming language quite some
fun.

To tell you the truth, I already implemented most of it all: the tokenizer, the
parser(s), the code generator and the interpreter and all the paraphernalia
needed by the system and it is indeed quite a bit of fun to play with. Maybe
this little language finds its use somewhere; who knows? Maybe I'll come up with
a nice application for it.

Next week we have to go through a bit of bookkeeping: the tables for the parsers
and the tokenizer are being explained then. It shows some interesting use of a
Resource class: it handles the raw file resources and supplies valuable data for
the tables to be filled in. The raw file resources are just Properties objects,
but together with a bit of String fiddling they can do quite advanced things,
conducted by that Resource class.

Stay tuned, see you next week and

kind regards,

Jos

Jun 1 '07 #1

Subscribe Post Reply

3874

by: Bill Davidson | last post by:

Hi there, Please forgive me for posting this article on multiple groups. Being new in the newsgroups, I was not sure which group would have been appropriate for my question. Sorry. My...

C / C++

Grammars for C++

by: Jon Slaughter | last post by:

Anyone know where I can get a (E)BNF like grammar for C++ to possibly use either for analysis or for yacc/bison(++)? On a side note I was thinking about writing my own lexer and syntaxer and was...

C / C++

Link compatibility among C compilers?

by: Derek | last post by:

As I understand it there is a good amount of link compatibility among C compilers. For example, I can compile main.c with GCC and func.c with Sun One and link the objects using either linker (GNU...

C / C++

New and enhanced low cost C compilers

by: Chris Stephens | last post by:

Low Cost C Compilers ------------------------------------ HI-TECH Software's C compilers are now available to support the ARM, dsPIC, msp430, 8051, PIC 10 to 17, PIC 18 as well as many other...

C / C++

The element 'compilation' has invalid child element 'compilers'.

by: Robert | last post by:

I have a number of web projects converted from 1.1 to 2.0 in VS2005. I am methodically seeing the error below: The element 'compilation' has invalid child element 'compilers'. List of...

ASP.NET

Any advantage in LISPs having simpler grammars than Python?

by: seberino | last post by:

Is there any advantage to a language having a nice mathematically compact grammar like LISP does? (or at least used to?) Many have admired the mathematically simple grammar of LISP in which much...

Python

Compilers, parsing, etc...

by: Jon Slaughter | last post by:

Is there anything like yacc or spirit for C#? What is the "standard" method for parsing grammars in C#? The last few days I have been looking at boost spirit and really like how it works. What I...

C# / C Sharp

designing of the compilers

by: pransri2006 | last post by:

Hi guys! I think all of u know about the designing of compilers. Can any body tell me about the designing of the compilers. And also tell me the difference between the compilers and Interpreter...

C / C++

What's going on with C Compilers and C99??

by: albert.neu | last post by:

Hello! What is a good compiler to use? (for MS Windows, for Linux) Any recommendations?? What's the point of having a C99 standard, if different compilers still produce differing results? ...

C / C++

Compilers - 7: Instructions

by: JosAH | last post by:

Greetings, Introduction This part of the article is one week late; I apologize for that; my excuse is: bizzy, bizzy, bizzy; I attended a nice course and I had to lecture a bit and there...

Java

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

Compilers - 3: Grammars

Similar topics