Last week's tip was a bit of playtime in which we built a Sudoku solver. This
week we're going to build something more complicated: a compiler. Compiler
construction is a difficult branch of computer science and I don't want the
article(s) to be the size of a book, so we have to keep things a bit simple.
On the other hand the entire thing should have some practical use, so the
code we develop must be extensible enough that you can play and experiment a
bit with it if you like.
We also like to see some results, so we can't just develop a compiler; we must
also develop a sort of interpreter to see the actual thing running. But first
we must plough through a bit of theory and definitions:
Introduction
A compiler typically translates a source language to a target language. Both
languages can be anything: the source language can be a high level programming
language, an assembly language (possibly for a virtual machine) or even a
natural language such as English, Dutch or Swahili. Most (but not all) of the
time the target language will be a lower level language than the source
language.
A compiler reads a source language and writes a target language; it is
basically a translator. Human languages and their usage as a means of
communication are awfully difficult, especially when speech is involved. For
simplicity we're not going to deal with speech recognition, nor with the
difficult aspects of speech itself as a coherent sequence of phonemes,
lexemes, diphthong-, labial-, dental-, guttural-, plosive-, voiced-, silent-
and other funny sounds.
But even when it comes to programming languages things are quite complicated.
At the simplest level a programming language is a sequence of characters where
certain combinations of characters make up a 'token'. A token is an atomic
entity of the programming language. Think of reserved words and symbols such
as parentheses, mathematical operators, numbers etc. But white space can be an
important token too: most programming languages consider white space to be the
characters separating the more meaningful tokens, yet white space is extremely
important in string literals, e.g. "helloworld" is different from
"hello world", although for human beings the difference is trivial thanks to
our heuristic and fascinating ability to match patterns, even though we can't
really explain *how* we do it.
A compiler is much more stupid than humans. It needs fixed rules to determine
whether or not a sequence of characters makes up a valid token. This is the
task of the first part of the compiler:
The Tokenizer
A tokenizer (or 'lexical analyzer') attempts to form tokens from a sequence
of characters. Have a look at "x+++++y" for example; it could represent the
sequence of tokens [x] [++] [+] [++] [y] which is syntactically valid for
Java, but that's not how Java's tokenizer works (try it).
The notation '[ ... ]' represents a single token, formed from the character
sequence '...'.
The Java tokenizer treats that character sequence as a series of tokens like
this: [x] [++] [++] [+] [y]
Why? Because Java's tokenizer is 'greedy', i.e. it always scans the character
sequence from left to right, attempting to 'cut off' the longest possible
valid token. Why? Because it needs a simple rule for cutting tokens off a
character sequence. A tokenizer doesn't know much about the actual language
that needs to be compiled; it just knows about valid tokens and it tries
to produce a series of tokens as efficiently as possible. So the human user
has to put a space somewhere to help the tokenizer: "x++ + ++y". Java
considers spaces to be token separators when all else fails.
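To make that greedy rule concrete, here is a minimal sketch of a
'maximal munch' tokenizer; the class name and the tiny token set (just
identifiers, [++] and [+]) are made up for this illustration:

    import java.util.ArrayList;
    import java.util.List;

    public class GreedyTokenizer {
        // Fixed tokens, longest first, so the greedy rule picks "++" over "+".
        private static final String[] SYMBOLS = { "++", "+" };

        public static List<String> tokenize(String input) {
            List<String> tokens = new ArrayList<>();
            int pos = 0;
            while (pos < input.length()) {
                char c = input.charAt(pos);
                if (Character.isWhitespace(c)) { // white space separates tokens
                    pos++;
                    continue;
                }
                if (Character.isJavaIdentifierStart(c)) { // identifiers: x, y ...
                    int start = pos;
                    while (pos < input.length()
                            && Character.isJavaIdentifierPart(input.charAt(pos)))
                        pos++;
                    tokens.add(input.substring(start, pos));
                    continue;
                }
                String symbol = null; // find the longest symbol that fits here
                for (String s : SYMBOLS)
                    if (input.startsWith(s, pos)) { symbol = s; break; }
                if (symbol == null)
                    throw new IllegalArgumentException("lexical error at " + pos);
                tokens.add(symbol);
                pos += symbol.length();
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("x+++++y"));   // [x, ++, ++, +, y]
            System.out.println(tokenize("x++ + ++y")); // [x, ++, +, ++, y]
        }
    }

Because "++" is tried before "+", the longest possible token always wins;
that is all 'greedy' means here.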
In other circumstances a tokenizer works correctly without spaces being
inserted all over the place: "if(true)x=1;". The '(' character cannot be part
of a token [if(] or [if(t] or [if(tr] etc. because such tokens don't exist in
Java. The longest possible first token here is [if] and no space is needed. Of
course it helps humans to put spaces here and there for readability reasons.
Things are much worse in C and C++, where even the parentheses around a
condition are needed to keep things unambiguous.
Have a look at this C example:
    if (x) * (y=z);
Now look at the same example without the parentheses:
    if x * (y=z);
"x*(y=z)" followed by an empty statement or should it be a condition "x"
followed by a (ugly) statement "*(y=z);" Both variants don't make much sense
semantically speaking but what does the compiler know? It is supposed to
translate those character sequences and produce another (machine code) langage.
If I wrote "<S-F6><S-F7>fsggf*&^&^*&kjgf(&" you wouldn't understand me,
because the above doesn't make sense *lexically*, a subject the tokenizer
has to deal with: it has to chop up the character sequence and form tokens
out of it.
You may never have realized it, but a lot of those parentheses, semicolons
and curly brackets are just there to help another part of the compiler:
The Parser
A parser expects a stream of tokens, produced by the tokenizer. For example,
a Java tokenizer treats the following sequence of characters: ")if(x; )(y"
as a perfectly valid sequence of tokens: [)] [if] [(] [x] [;] [)] [(] [y]
but it doesn't make much sense to the parser. The sequence of tokens may be
valid to the tokenizer, but *syntactically* it doesn't make sense.
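To give you a taste of what such a syntax check looks like, here is a
minimal sketch of a recursive-descent parser for an invented two-rule
grammar; everything here is illustrative, not part of any real compiler:

    import java.util.List;

    public class MiniParser {
        private final List<String> tokens;
        private int pos = 0;

        public MiniParser(List<String> tokens) { this.tokens = tokens; }

        // statement := "if" "(" identifier ")" statement | identifier ";"
        public void statement() {
            if ("if".equals(peek())) {
                expect("if"); expect("("); identifier(); expect(")");
                statement();               // the controlled statement
            } else {
                identifier(); expect(";"); // a bare statement
            }
        }

        private void identifier() { // for brevity: any all-letter token will do
            String t = peek();
            if (t == null || !t.chars().allMatch(Character::isLetter))
                throw new RuntimeException("expected identifier, found: " + t);
            pos++;
        }

        private void expect(String token) {
            if (!token.equals(peek()))
                throw new RuntimeException("expected " + token + ", found: " + peek());
            pos++;
        }

        private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

        public static void main(String[] args) {
            new MiniParser(List.of("if", "(", "x", ")", "y", ";")).statement(); // fine
            new MiniParser(List.of(")", "if", "(", "x", ";")).statement(); // error
        }
    }

The second token list is perfectly fine lexically, but the parser throws an
exception at the very first token because no rule of the grammar starts
with [)].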
Human beings are extremely good at understanding syntactically incorrect
sentences; we all write/speak and listen to them every day.
If I write: "me beer now drink want!" you most likely understand what I'm
all about although the sentence is syntactically incorrect. You wouldn't
understand me when I wrote: "house boat now defenestrate to wants the the"
and rightly so. You wouldn't even understand what I'm talking about if I
wrote: "the house wants to defenestrate the boat now".
Houses don't defenestrate boats; although this sentence is syntactically
correct, it doesn't make sense. It doesn't make sense in a *semantic* way.
Humans are good at 'understanding' syntactically incorrect sentences, but we
aren't any good at it when it comes to sentences that don't make any sense;
we find them humorous at best, and then we don't really care about the
syntactical correctness: we need correct semantics. Compilers do too:
The Semantic Analyzer
Compilers can hardly deal with semantics: they can deal with type checking
but that's about it. Most of the semantics move on to the interpreter or
the virtual machine or to the real machine. After compilation (translation)
is over and done with the runtime environment (in whatever form) has to deal
with the semantics: it checks for division by zero, it checks whether or not
files exist, if sockets can be opened etc. etc.
The compiler can check whether or not every used method is declared, if every
used member variable is defined and it can check whether or not classes that
are used are defined somewhere. It can also do simple things like this:
    int x= "hello world";
Something is wrong here: we can't assign a string to an int; the program
fragment is semantically incorrect.
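Such a check boils down to comparing two types; here is a minimal sketch,
with an invented little Type enum, of what a semantic analyzer could do for
assignments:

    public class TypeChecker {
        enum Type { INT, DOUBLE, STRING }

        // Allow assignment of equal types and the widening int-to-double
        // case; everything else is a semantic error.
        static void checkAssignment(Type variable, Type expression) {
            boolean ok = variable == expression
                      || (variable == Type.DOUBLE && expression == Type.INT);
            if (!ok)
                throw new RuntimeException("semantic error: cannot assign "
                        + expression + " to " + variable);
        }

        public static void main(String[] args) {
            checkAssignment(Type.DOUBLE, Type.INT);  // fine: double d= 42;
            checkAssignment(Type.INT, Type.STRING);  // throws: int x= "hello world";
        }
    }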
But before the runtime system can do anything, that "anything" first has to
be generated:
The Code Generator
Code generation can be as simple as emitting bytes to somewhere (machine code
generation), but it can be complex too. Have a look at this little example:
    double x= 1/42.0;
    x= Math.sin(x)/Math.cos(x)+Math.log(x)-3.14159;
    x= 0;
Was it really necessary to generate code for the first two statements?
Obviously it wasn't, because at the end variable x will be set to 0 anyway.
So why generate code for all of it? Maybe the user wants to debug this code,
so all the code should be generated anyway. Maybe the user wants optimized
code, so almost no code should be generated at all: a simple "x=0"
initialization would be enough. It's the job of the code generator to
generate efficient code in the context of what it knows. Code generators
don't know anything about character or token sequences; the tokenizer and
parser took care of that. Normally, code generators are invoked by parsers:
the parser knows what code needs to be generated, redundant or not, and the
code generator is supposed to figure out whether or not all that code is
really needed.
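To illustrate that calling pattern, here is a minimal sketch with an invented
two-instruction 'machine': the parser would talk to a small CodeGenerator
interface, and this naive implementation simply emits everything, redundant
or not:

    import java.util.ArrayList;
    import java.util.List;

    public class CodeGenDemo {
        // The parser calls these methods while it recognizes the source text.
        interface CodeGenerator {
            void push(double constant);
            void store(String variable);
        }

        // A naive back end: emit every instruction it is asked for.
        static class StackCodeGenerator implements CodeGenerator {
            final List<String> code = new ArrayList<>();
            public void push(double c)  { code.add("PUSH " + c); }
            public void store(String v) { code.add("STORE " + v); }
        }

        public static void main(String[] args) {
            StackCodeGenerator gen = new StackCodeGenerator();
            // What a parser might emit for "x= 1/42.0; x= 0;":
            gen.push(1 / 42.0); gen.store("x"); // dead code: overwritten below
            gen.push(0);        gen.store("x");
            System.out.println(gen.code);
        }
    }

A smarter implementation of the interface could notice that the first
PUSH/STORE pair is dead and drop it, without the parser ever knowing.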
Concluding remarks
The first part of this article described the several phases (or modules) of
a compiler. Strictly speaking the interpreter or runtime system isn't part
of a compiler. First the character stream is converted to a token stream.
The token stream is then checked to see whether it makes sense syntactically.
The parser knows when to invoke the code generator. The code generator must
generate efficient code. When all that is done, the interpreter has its work
to do: perform what the user originally intended with the character stream.
In practice the separation between the phases can be a bit blurred, e.g. the
parser changes the behaviour of the tokenizer, the code generator consults the
parser for some semantic details, the tokenizer talks back to the parser, etc.
But the distinction between the modules helps the development of a new
compiler.
This part of the article contains no code for our compiler yet; that is left
for the next parts. The second part of this article deals with tokenizers and
following parts dig deeper into parsers, code generators and interpreters.
Our tokenizer uses regular expressions for chopping up the character sequence
into a token sequence.
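As a little preview, here is a minimal sketch of that regex-driven approach;
the token patterns are invented for this example and the alternatives are
ordered longest first, so the greedy rule from the tokenizer section still
holds:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexTokenizer {
        // Identifiers, "++" before "+" (longest first) and numbers.
        private static final Pattern TOKEN =
            Pattern.compile("\\s*([A-Za-z_]\\w*|\\+\\+|\\+|\\d+)");

        public static void main(String[] args) {
            Matcher m = TOKEN.matcher("x+++++y");
            while (m.lookingAt()) {
                System.out.print("[" + m.group(1) + "] ");
                m.region(m.end(), m.regionEnd()); // continue after this token
            }
            System.out.println(); // prints: [x] [++] [++] [+] [y]
        }
    }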
The source code deliberately doesn't use any external tools for the tokenizer
part nor for the parser part; this is purely for educational purposes. There
are quite a few very good tools available that generate tokenizers and parsers
for us, but this is all about building it ourselves from the bottom up.
Now I have to figure out what that language is going to be like and what it
should be capable of; keep your fingers crossed ;-)
See you next week in the second part of this article.
kind regards,
Jos