Bytes IT Community

Compilers - 3: Grammars


This week we discuss the design of the syntactic aspects of our little language;
that design guides the parser(s) that recognize such syntax. Last week we saw
the tokenizer: it just groups sequences of bytes into tokens; it doesn't know
anything about the syntax of any language at all. That's the job of the parser;
a parser, on the other hand, doesn't care where these tokens come from, nor how
they were formed. Parsers and tokenizers work closely together to understand
and check the correctness of a certain input stream. But first, let's design
our little language.

A small programming language

I want this language to be as small as possible because I don't want to write
an entire novel on this topic. Our little language will be a small expression
language: it accepts sequences of expressions, which are evaluated by an
interpreter that produces the values of these expressions.

The language must know about variables, functions, special functions as well
as the 'atomic' parts, i.e. floating point numbers and simple lists. There must
be built-in functions as well as user defined functions. The list of built-in
functions must be easily extendable in case you want to extend this language.

Syntactically speaking a program in our little language looks like this:

  program: ((definition | expression) ';')*
This reads: a program is a list of zero or more of the following: either a
definition or an expression followed by a ';' character. The trailing star
indicates that the previous sequence in parentheses occurs zero or more times.
The vertical bar indicates a choice: either the part on the left or the part
on the right side of the vertical bar.

The word 'program' names a 'rule' of the grammar/syntax. 'definition' and
'expression' are also rules; we just haven't defined them yet. Anything in
quotes anywhere in a rule definition is a token delivered by the tokenizer.
The ';' is a simple token. The parentheses are for grouping parts of a rule
together while the | and * have a special meaning (see above). The + plus
sign also has a special meaning, much like the * star: a star indicates zero
or more of the previous rule part, a plus indicates one or more of the previous
rule part. If I had written this instead:

  program: ((definition | expression) ';')+
A program would only have been syntactically correct if it contained at least
one definition or expression. I like it when a program can be empty, so I used
the star (zero or more definitions or expressions). There's yet another special
symbol: the question mark; a ? indicates that the previous part should occur
at most once; so if I had written this:

  program: ((definition | expression) ';')?
A syntactically valid program would then be either empty or exactly one
definition or expression. For our purposes that would be too restrictive.

EBNF notation

The syntax description above is called 'EBNF' (Extended Backus-Naur Form).
John Backus invented this notation and Peter Naur simplified it somewhat later.
Both were inspired by the work of Noam Chomsky, who came up with a totally new
way of thinking about languages. Somewhere in the late 1950s he published his
work and was immediately hated for it by classical linguists, who argued that
there would be no place anymore for 'feeling' and 'intuition' in languages.
Noam Chomsky was a stubborn scientist who didn't care at all about the furious
critiques he received for his work, and argued that linguistics had never been
a proper science before he published his work.

Chomsky went as far as stating that every human being was born with a natural
universal grammar and that all languages were derivatives of that universal
grammar. All a little toddler had to do was figure out what the peculiarities
of his mother's language are in order to learn how to speak and form sentences
that are syntactically correct. This imprinted universal grammar fades away
when a human grows up. That phenomenon may explain why little children learn a
second language so quickly (they simply match it against the universal syntax),
while grown-ups have to study hard when they want to master another language.

The Extended Backus-Naur Form formalizes this notion of a syntax.


Back to our programming language again: a definition mentioned in that grammar
rule can actually be either a function definition or a function declaration.

We need a notion of a declaration because our parser is a 'one pass' parser,
i.e. it reads the token stream once and after that all the work must be done.
When the parser 'sees' a name delivered by the tokenizer and doesn't yet know
that this name represents a user function, it will throw an error: it doesn't
recognize the name as a user function. If two functions call each other
recursively, both functions must be known before use: each function is used
in the body of the other and vice versa. One function body definition must
come first, and that's why we need declarations; consider this:

  function foo(x) = bar(x-1);
  function bar(x) = foo(x/2);
When the function body of function foo is parsed, nothing is known about bar
yet. But we can't define function bar first either, because it refers to
function foo again. Using declarations, we can solve this little catch-22:

  function bar(x);
  function foo(x) = bar(x-1);
  function bar = foo(x/2);
The first line declares the existence of function bar but doesn't define it yet.
The second line declares and defines a function foo that uses function bar, and
the third line finally defines function bar by supplying its body. Note that
when a function has already been declared (as in the first line), it is not
allowed to repeat the formal parameter list, because that formal parameter
list was already given in the declaration of the function.

This allows for a simpler parser too because it doesn't have to check whether
or not the two formal parameter lists are equal, which they must be of course.

Also note that I introduced the word 'function'; it's a reserved word; another
reserved word will be sneakily introduced in a moment. No user defined function
nor variable can be named 'function'. Its sole purpose is to introduce the
start of a function declaration or definition.

Note that the second line, where function foo was declared and defined in one
go, is sometimes called a 'tentative' definition: the function is declared, and
both its formal parameter list and the body of the function itself are defined.

Also note that other programming languages use a different syntax, but we are
defining our own little language and we're free to do what we want.

What would all that stuff above look like in EBNF notation? Have a look at this:

  definition: ( 'function' | 'listfunc' ) 'name' defordecl
  defordecl: ( '(' paramlist ')' ( '=' expression )? ) | ( '=' expression )
  paramlist: ( 'name' ( ',' 'name' )* )?
The first line is easy: it just reads that the keyword 'function' or 'listfunc'
needs to be followed by a 'name', followed by another rule defordecl. A defordecl
rule defines the function declaration or the (tentative) definition of a function.
We either parse a parameter list optionally followed by an '=' and an expression
or we skip the parameter list and expect an '=' and an expression right away.

A parameter list is either a 'name' followed by the rest of the parameter list
or the list is completely empty. The rest of the parameter list is a sequence
of a ',' followed by another 'name' zero or more times.
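In a hand-written parser the '?' and '*' in the paramlist rule turn into an if
and a while loop. Here's a minimal sketch of that idea over a pre-tokenized
list of strings; the class and method names are hypothetical, not part of the
actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: parsing  paramlist: ( 'name' ( ',' 'name' )* )?
// The '?' becomes an if, the '*' becomes a while loop.
public class ParamList {
    public static List<String> parse(List<String> tokens) {
        List<String> names = new ArrayList<>();
        int pos = 0;
        if (pos < tokens.size() && !tokens.get(pos).equals(")")) { // the '?' part
            names.add(tokens.get(pos++));                          // first 'name'
            while (pos < tokens.size() && tokens.get(pos).equals(",")) { // the '*'
                pos++;                                             // consume the ','
                names.add(tokens.get(pos++));                      // the next 'name'
            }
        }
        return names;  // an empty parameter list is perfectly fine
    }

    public static void main(String[] args) {
        System.out.println(parse(java.util.Arrays.asList("x", ",", "y"))); // [x, y]
        System.out.println(parse(java.util.Collections.emptyList()));      // []
    }
}
```

The real parser of course reads from the tokenizer instead of a list, but the
loop structure stays the same.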

For now we mostly ignore the 'listfunc' reserved word; it indicates that the
user defined function takes entire lists as its argument(s). We'll get back to it.
We have now designed the syntax for most of our little language; there's one
part still missing: the syntax for an expression.


We want to parse ordinary, quite readable, expressions. The expressions include
both binary and unary operators as well as variable names, functions, either
user defined (see above) or built-in functions. We also want to be able to
parse lists.

Let's start with the binary operator expressions. Some binary operators bind
more tightly to their operands than others, e.g. we expect 4+2*19 to be equal
to 42, not 114. We want the multiplication to be done first and the addition last.
The multiplication operator has a higher precedence than the addition operator.

Our little language knows about six different precedence levels. From lowest to
highest precedence the operators are:

  :
  == !=
  <= < > >=
  + -
  * /
  ^
From a bird's eye view an expression is just one or more other expressions
with ':' tokens in between. When there's just one other expression, no ':'
token is present in the expression at all. The 'other expressions' themselves
don't contain the token ':'. Here's the EBNF form of this rule:

  expression: equality ( operator0 equality )*
  operator0: ':'
The ':' operator is just a way to express a sequence of expressions instead
of just one; the value of the whole is the value of the expression on the right
of the ':' token; the value of the expression on its left is simply forgotten.

An equality expression tests two other expressions for (in)equality:

  equality: comparison ( operator1 comparison )*
  operator1: '==' | '!='
A comparison doesn't look much different from a top-level expression:

  comparison: addition ( operator2 addition )*
  operator2: '<=' | '<' | '>' | '>='
And the addition expression looks similar too:

  addition: multiplication ( operator3 multiplication )*
  operator3: '+' | '-'
It's getting a bit boring already; here's the EBNF for a multiplication:

  multiplication: power ( operator4 power )*
  operator4: '*' | '/'
Finally, here's the rule that handles the binary operator with the highest
precedence:

  power: unary ( operator5 unary )*
  operator5: '^'
If I had named the expression, equality, comparison, addition, multiplication
and power rules differently we would have gotten this:

  expression0: expression1 ( operator0 expression1 )*
  expression1: expression2 ( operator1 expression2 )*
  expression2: expression3 ( operator2 expression3 )*
  expression3: expression4 ( operator3 expression4 )*
  expression4: expression5 ( operator4 expression5 )*
  expression5: unary       ( operator5 unary       )*
  operator0: ':'
  operator1: '==' | '!='
  operator2: '<=' | '<' | '>' | '>='
  operator3: '+' | '-'
  operator4: '*' | '/'
  operator5: '^'
Remember this renaming trick for when we develop actual code for our binary
expression parser. Next are the 'unary' expressions. Unary expressions are
not as regular as the binary expressions. A unary expression is a unary
operator followed by a unary expression, or it is an expression without any
leading unary operator:

  unary: ( unaryop unary ) | atomic
  unaryop: '+' | '-' | '!'
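As a preview of how these layered rules translate into code, here is a minimal
sketch with one parse method per precedence level, restricted to a numeric
subset (addition down to atomic; the ':', (in)equality and comparison layers
are left out, and it scans a raw string instead of real tokens). All class and
method names here are hypothetical, not the actual implementation:

```java
// Sketch only: a tiny evaluator for the numeric subset of the grammar,
// with one mutually recursive method per precedence level.
public class MiniExpr {
    private final String src;
    private int pos = 0;

    private MiniExpr(String s) { this.src = s.replaceAll("\\s+", ""); }

    public static double eval(String s) { return new MiniExpr(s).addition(); }

    // addition: multiplication ( ('+' | '-') multiplication )*
    private double addition() {
        double v = multiplication();
        while (peek('+') || peek('-')) {
            char op = next();
            double r = multiplication();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    // multiplication: power ( ('*' | '/') power )*
    private double multiplication() {
        double v = power();
        while (peek('*') || peek('/')) {
            char op = next();
            double r = power();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    // power: unary ( '^' unary )*  (iterated left to right, as the rule reads)
    private double power() {
        double v = unary();
        while (peek('^')) { next(); v = Math.pow(v, unary()); }
        return v;
    }

    // unary: ( ('+' | '-') unary ) | atomic
    private double unary() {
        if (peek('-')) { next(); return -unary(); }
        if (peek('+')) { next(); return unary(); }
        return atomic();
    }

    // atomic: '(' expression ')' | constant  (only these two alternatives here)
    private double atomic() {
        if (peek('(')) {
            next();                    // consume '('
            double v = addition();
            next();                    // consume ')'
            return v;
        }
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) pos++;
        return Double.parseDouble(src.substring(start, pos));
    }

    private boolean peek(char c) { return pos < src.length() && src.charAt(pos) == c; }
    private char next() { return src.charAt(pos++); }

    public static void main(String[] args) {
        System.out.println(MiniExpr.eval("4+2*19"));   // 42.0, as promised above
        System.out.println(MiniExpr.eval("(4+2)*19")); // 114.0
    }
}
```

Note how each method first calls the next-higher level and then loops on its
own operators: that loop is exactly the `( operatorN ... )*` part of each rule.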
An atomic expression is one of the following:

- an expression in parentheses (nested) or
- a function call (function) or
- a name optionally followed by an assignment (nameexpr) or
- a constant (constant) or
- a list of expressions (list)

An expression in parentheses is simply this, written in EBNF notation:

  nested: '(' expression ')'
Here's the EBNF notation for a function call:

  function: 'name' '(' params ')'
  params: ( expression ( ',' expression )* )?
Note the similarity between the params rule and the rule for a formal parameter
list (see the definition section above for that rule).

An assignment (or just a variable name) looks like this:

  nameexpr: 'name' ( ( assignop expression ) | '++' | '--' )?
  assignop: '=' | '+=' | '-=' | '*=' | '/=' | '^='
If a name is not followed by one of the assignment operators we just want to
use the value of the variable denoted by that 'name'; otherwise we want to
assign another value to the variable using the expression mentioned in the
first rule. Finally, a name can be followed by a ++ or -- postfix operator.
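To illustrate the intended semantics of the assignop alternatives, here is a
small sketch of how an interpreter could apply them to a variable store. The
class and method names are assumptions for illustration, not the actual
implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: each compound assignment operator reads the old value,
// applies the corresponding binary operation, and stores the result back.
public class Assign {
    private final Map<String, Double> variables = new HashMap<>();

    public double apply(String name, String op, double value) {
        double old = variables.getOrDefault(name, 0.0); // assumed default for new names
        double result;
        switch (op) {
            case "=":  result = value;                break;
            case "+=": result = old + value;          break;
            case "-=": result = old - value;          break;
            case "*=": result = old * value;          break;
            case "/=": result = old / value;          break;
            case "^=": result = Math.pow(old, value); break;
            default: throw new IllegalArgumentException("unknown assignop: " + op);
        }
        variables.put(name, result);
        return result;  // an assignment is an expression: it has a value itself
    }

    public static void main(String[] args) {
        Assign a = new Assign();
        a.apply("x", "=", 6);
        System.out.println(a.apply("x", "*=", 7)); // 42.0
    }
}
```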

A constant is just what it is: a floating point constant recognized by the
tokenizer; there is no interesting EBNF notation for it; just this one:

  constant: 'constant'
Finally a list consists of zero or more expressions, separated by commas and
enclosed in curly brackets:

  list: '{' ( expression ( ',' expression )* )? '}'
This is our entire grammar for our little programming language. The atomic
expression is the messiest part: the parser has to check several things when
it parses a 'name', i.e. it could be a simple variable name, but it could also
be the name of a built-in function or the name of a user defined function. It
could also be the first part of an assignment.

In order to figure that all out the parser keeps track of a list of built-in
functions as well as a map of user defined functions. But, as they say, that
is just an 'implementation detail' and we leave that for when we actually
start implementing our parser.
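As a rough sketch of that bookkeeping (the names and the exact record kept per
user function are assumptions), the parser could classify a 'name' like this:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: built-ins live in a set, user defined functions in a map
// from name to its number of parameters (a richer record in a real parser).
public class NameTable {
    private final Set<String> builtins = new HashSet<>();
    private final Map<String, Integer> userFunctions = new HashMap<>();

    public NameTable() {
        builtins.add("if");   // 'if' appears in the fac example below
        builtins.add("sqrt"); // an assumed built-in, for illustration
    }

    public void declare(String name, int params) { userFunctions.put(name, params); }

    // Classify a 'name' token; anything unknown is treated as a plain variable.
    public String classify(String name) {
        if (builtins.contains(name)) return "builtin";
        if (userFunctions.containsKey(name)) return "user function";
        return "variable";
    }

    public static void main(String[] args) {
        NameTable t = new NameTable();
        t.declare("fac", 1);
        System.out.println(t.classify("fac")); // user function
        System.out.println(t.classify("x"));   // variable
    }
}
```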

Parser generators

We are handcrafting our parser; that isn't strictly needed nowadays: there are
quite a few tools available that can do that nasty job for you. You feed such a
tool a simple text file containing the EBNF notation of a grammar and it
generates the entire parser within moments. The same is true for lexical
analyzer generators: they generate an entire lexical analyzer when you feed
them a bunch of regular expressions. We are not going to take that path; we
didn't take it for our tokenizer (lexical analyzer) and we're not going to take
it for our parser either. We are going to do it the hard way, just to show what
the complexity of those beasts is all about.

Concluding remarks

Just to whet your appetite a bit after all the theoretical stuff in this part
of the compiler article, have a look at what our language can do:

  function fac(n) =
     if ((n < 2)
        , 1
        , n*fac(n-1)
     );
This defines a function 'fac' that calculates the factorial of parameter 'n'.
We can even do it with a list of numbers:

  fac( { 1, 2, 3, 4, 5 } );
The result will be { 1, 2, 6, 24, 120 }. Lists, as well as a couple of special
built-in functions, are going to make our little programming language quite
powerful.
To tell you the truth, I have already implemented most of it: the tokenizer,
the parser(s), the code generator, the interpreter and all the paraphernalia
needed by the system, and it is indeed quite a bit of fun to play with. Maybe
this little language finds its use somewhere; who knows? Maybe I'll come up
with a nice application for it.

Next week we have to go through a bit of bookkeeping: the tables for the
parsers and the tokenizer will be explained then. That part shows some
interesting use of a Resource class: it handles the raw file resources and
supplies valuable data for the tables to be filled in. The raw file resources
are just Properties objects, but together with a bit of String fiddling they
can do quite advanced things, conducted by that Resource class.

Stay tuned, see you next week and

kind regards,

Jun 1 '07 #1