Compilers - 1: Introduction

11,448 Expert 8TB

Greetings,

last week's tip was a bit of playtime where we've built a Sudoku solver. This
week we're going to build some complicated stuff: a compiler. Compiler
construction is a difficult branch of CS and I don't want the article(s) to be the
size of a book, so we have to keep things a bit simple.

On the other hand the entire thing must have practical use more or less so the
code to be developed must be extensible so that you can play and experiment a
bit with it if you like.

We also like to see some results so we can't just develop a compiler, we must
also develop a sort of interpreter to see the actual thing running. But first
we must plough through a bit of theory and definitions:

Introduction

A compiler typically translates a source language to a target language. Both the
source and target languages can be anything: the source language can be a
high level programming language, assenbly code language possibly for a virtual
machine or even a natural language such as English or Dutch or Swahili. Most
(but not all) of the time the target language will be a lower level language than
the source language.

A compiler reads a source language and writes a target language. It's a
translator basically. Human languages and their usage as a means of communication
are awefully difficult, especially when speech is involved. For simplicity we're
not going to deal with speech recognition nor the difficult aspects of speech
itself as a coherent sequence of phonemes, lexemes, diphtong- labial- dental-
guttural- plosive- voice- silent- and other funny sounds.

But even when it comes to programming languages things are quite complicated.
At the simplest level a programming language is a sequence of characters where
certain combinations of characters make up a 'token'. A token is an atomic
entity of the programming language. Think of reserved words and symbols such
as parentheses, mathematical operators, numbers etc. etc. But white space can
be an important token too; most programming languages consider white space as
the characters separating the more meaningful tokens, but they are extremely
important in string literals, e.g. "helloworld" is different from "hello world"
although for human beings the difference is futile based on our heuristic and
fascinating ability to match patterns although we can't really explain *how*
we do it.

A compiler is much more stupid than humans. It needs fixed rules to determine
whether or not a sequence of characters makes up a valid token. This is the
task of the first part of the compiler:

The Tokenizer

A tokenizer (or 'lexical analyzer') attempts to form tokens from a sequence
of characters. Have a look at "x+++++y" for example; it could represent the
sequence of tokens [x] [++] [+] [++] [y] which is syntactically valid for
Java, but that's not how Java's tokenizer works (try it).

The notation '[ ... ]' represents a single token, formed from the character
sequence '...'.

The Java tokenizer treats that character sequence as a series of tokens like
this: [x] [++] [++] [+] [x]

Why? because Java's tokenizer is 'greedy', i.e. it always scans the character
sequence from left to right, attempting to 'cut off' the longest possible
valid token. Why? because it needs a simple rule how to cut off tokens from
a character sequence. A tokenizer doesn't know much about the actual language
that needs to be compiled; it just knows about valid tokens and it tries
to produce a series of tokens as efficiently as possible. So the human user
has to put a space somewhere to help the tokenizer: "x++ + ++y". Java considers
spaces as token separators when all else fails.

In other circumstances a tokenizer works correctly without spaces being inserted
all over the place: "if(true)x=1;". the '(' character can not be part of a token
[if(] or [if(t] or [if(tr] etc. because such tokens don't exist in Java. The
longest possible first token here is [if] and no space is needed here. Of
course it helps humans to put spaces here and there for readability reasons.

Things are much worse in C and C++ where even the parentheses are needed.
Have a look at this C example:

Expand|Select|Wrap|Line Numbers

if (x) * (y=z);

For a suitable pointer value for y this makes sense in C and C++. Leaving out
the parentheses:

Expand|Select|Wrap|Line Numbers

if x * (y=z);

This doesn't make sense anymore: is it supposed to be an (ugly) condition like:
"x*(y=z)" followed by an empty statement or should it be a condition "x"
followed by a (ugly) statement "*(y=z);" Both variants don't make much sense
semantically speaking but what does the compiler know? It is supposed to
translate those character sequences and produce another (machine code) langage.

If I wrote: "<S-F6><S-F7>fsggf*&^&^*&kjgf(&" you wouldn't understand me
because the above doesn't make sense *lexically*, a subject the tokenizer
has to deal with. The tokenizer has to chop of a character sequence and
it must form tokens out of it.

You may have never realized it but a lot of those parentheses, semi-colons
and curly brackets are just there to help another part of the compiler:

The Parser

A parser expects a stream of tokens, produced by the tokenizer. For example,
a Java tokenizer treats the following sequence of characters: ")if(x; )(y"
as a perfectly valid sequence of tokens: [)] [if] [(] [x] [;] [)] [(] [y]
but it doesn't make much sense to the parser. The sequence of tokens may be
valid to the tokenizer, *syntactically* the token sequence doesn't make sense.

Human beings are extremely good at understanding syntactically incorrect
sentences; we all write/speak and listen to them every day.

If I write: "me beer now drink want!" you most likely understand what I'm
all about although the sentence is syntactically incorrect. You wouldn't
understand me when I wrote: "house boat now defenestrate to wants the the"
and rightly so. You wouldn't even understand what I'm talking about if I
wrote: "the house wants to defenestrate the boat now".

Houses don't defenestrate boats. Although this sentence is syntactically correct
is doesn't make sense. It doesn't make sense in a *semantical* way.

Humans are good at 'understanding' syntactically incorrect sentences, we aren't
any good at it when it comes to sentences that don't make any sense; we find
them humorous at best but we don't really care for the syntactical correctness.

We need correct semantics then. Compilers do too:

The Semantic Analyzer

Compilers can hardly deal with semantics: they can deal with type checking
but that's about it. Most of the semantics move on to the interpreter or
the virtual machine or to the real machine. After compilation (translation)
is over and done with the runtime environment (in whatever form) has to deal
with the semantics: it checks for division by zero, it checks whether or not
files exist, if sockets can be opened etc. etc.

The compiler can check whether or not every used method is declared, if every
used member variable is defined and it can check whether or not classes that
are used are defined somewhere. It can also do simple things like this:

Expand|Select|Wrap|Line Numbers

int x= "hello world";

It's a perfect stream of tokens, it's syntactically all fine but the type of the
two things are incorrect: we can't assign a string to an int; the program
fragment is semantically incorrect. But before the runtime system can do
anything first that "anything" has to be generated:

The Code Generator

Code generation can be as simple as emiting bytes to somewhere (machine code
generation), but it can be complex too; Have a look at this little example:

Expand|Select|Wrap|Line Numbers

 
double x= 1/42.0;

x= Math.sin(x)/Math.cos(x)+Math.log(x)-3.14159;

x= 0;

Was that initialization really necessary? How about that complicated expression?
Obviously it wasn't because at the end variable x will be set to 0 anyway.
So why generate code for all that? Maybe the user wants to debug this code so
all the code should be generated anyway. Maybe the user wants optimized code so
almost no code should be generated at all: a simple "x=0" initialization should
be enough. It's the job of the code generator to generate efficient code in the
context to what it knows. Code generators don't know anything about character
or token sequences. The tokenizer and parser took care of that. The parser does
invoke the code generator when needed but the parser doesn't know whether or
not the code would be really needed.

Normally, code generators are invoked by parsers; they know what code needs to
be generated, redundant or not. The code generators are supposed to figure out
whether or not all that code is really needed.

Concluding remarks

The first part of this article described the several phases (or modules) of
a compiler. Strictly speaking the interpreter or runtime system isn't part
of a compiler. First the character stream is converted to a token stream.
The token stream is checked whether or not it makes sense syntactically.

The parser knows when to invoke the code generator. The code generator must
generate efficient code. When all that is done, the interpreter has its work
to do: perform what the user originally intended with the character stream.

In practice the separation between the phases can be a bit blurred, i.e. the
parser changes the behaviour of the tokenizer, the code generator consults the
parser for some semantic details, the tokenizer talks back again to the parser
etc. etc. But the distinction between the modules helps the development of a
new compiler.

This part of the article contains no useful code whatsoever; that is left for
the next parts. The second part of this article deals with tokenizers and
following parts dig deeper into parsers, code generators and interpreters.
Our tokenizer uses regular expressions for chopping up the character sequence
into token sequences.

The source code of this all deliberately doesn't use any external tools for
the tokenizer part nor for the parser part, just for educational purposes.
There are quite a few very good tools available that generate tokenizers and
parsers for us but this is all about building it ourself from the bottom up.

Now I have to figure out what that language is going to be like and what it
should be capable of; keep your fingers crossed ;-)

See you next week in the second part of this article.

kind regards,

Jos

May 20 '07 #1

Subscribe Post Reply

4768

Similar topics

Installer Introduction

by: Scott Carter | last post by:

Can anyone point me to an introduction of creating installers with VS.Net? I currently need to create one for a windows service, but I'd also like to find some info about creating installers for...

.NET Framework

Introduction to Mono - Your first Mono app

by: Cletis Tout | last post by:

http://www.codeproject.com/cpnet/introtomono1.asp Introduction to Mono - Your first Mono app By Brian Delahunty The first in a series of articles about Mono. This article explains how to...

.NET Framework

Seeking a good C introduction book

by: Xiaoshen Li | last post by:

Dear All, I am relatively new to C(But I have computer science background and know some other programming languages). I am wondering if anyboby can recommend me some C books or websites. I...

C / C++

Finally, a release candidate of the "Pointers" document (introduction to pointers)

by: Alf P. Steinbach | last post by:

Not yet perfect, but: http://home.no.net/dubjai/win32cpptut/special/pointers/ch_01.pdf http://home.no.net/dubjai/win32cpptut/special/pointers/ch_01_examples.zip To access the table of...

C / C++

designing of the compilers

by: pransri2006 | last post by:

Hi guys! I think all of u know about the designing of compilers. Can any body tell me about the designing of the compilers. And also tell me the difference between the compilers and Interpreter...

C / C++

Xah's Edu Corner: Introduction to 3D Graphics Programing

by: Xah Lee | last post by:

Of Interest: Introduction to 3D Graphics Programing http://xahlee.org/3d/index.html Currently, this introduction introduces you to the graphics format of Mathematica, and two Java Applet...

Python

Write your C++ compilers in a modern high-performance functional language

by: Jon Harrop | last post by:

Flying Frog just launched the OCaml Journal: http://www.ffconsultancy.com/products/ocaml_journal/free/introduction.html?clcpp The free first article covers the basics of the OCaml programming...

C / C++

Compilers - 7: Instructions

by: JosAH | last post by:

Greetings, Introduction This part of the article is one week late; I apologize for that; my excuse is: bizzy, bizzy, bizzy; I attended a nice course and I had to lecture a bit and there...

Java

An Introduction to Exceptions - Ch. 2

by: RedSon | last post by:

Chapter 2: How to handle Exceptions When you call a program, sometimes you get a message about an Unhandled exception type. This means, that something throws (or might throw) an Exception, but the...

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

C# / C Sharp