473,411 Members | 2,080 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes and contribute your articles to a community of 473,411 developers and data experts.

Compilers - 1: Introduction

11,448 Expert 8TB
Greetings,

last week's tip was a bit of playtime where we've built a Sudoku solver. This
week we're going to build some complicated stuff: a compiler. Compiler
construction is a difficult branch of CS and I don't want the article(s) to be the
size of a book, so we have to keep things a bit simple.

On the other hand the entire thing must have practical use more or less so the
code to be developed must be extensible so that you can play and experiment a
bit with it if you like.

We also like to see some results so we can't just develop a compiler, we must
also develop a sort of interpreter to see the actual thing running. But first
we must plough through a bit of theory and definitions:

Introduction

A compiler typically translates a source language to a target language. Both the
source and target languages can be anything: the source language can be a
high level programming language, assenbly code language possibly for a virtual
machine or even a natural language such as English or Dutch or Swahili. Most
(but not all) of the time the target language will be a lower level language than
the source language.

A compiler reads a source language and writes a target language. It's a
translator basically. Human languages and their usage as a means of communication
are awefully difficult, especially when speech is involved. For simplicity we're
not going to deal with speech recognition nor the difficult aspects of speech
itself as a coherent sequence of phonemes, lexemes, diphtong- labial- dental-
guttural- plosive- voice- silent- and other funny sounds.

But even when it comes to programming languages things are quite complicated.
At the simplest level a programming language is a sequence of characters where
certain combinations of characters make up a 'token'. A token is an atomic
entity of the programming language. Think of reserved words and symbols such
as parentheses, mathematical operators, numbers etc. etc. But white space can
be an important token too; most programming languages consider white space as
the characters separating the more meaningful tokens, but they are extremely
important in string literals, e.g. "helloworld" is different from "hello world"
although for human beings the difference is futile based on our heuristic and
fascinating ability to match patterns although we can't really explain *how*
we do it.

A compiler is much more stupid than humans. It needs fixed rules to determine
whether or not a sequence of characters makes up a valid token. This is the
task of the first part of the compiler:

The Tokenizer

A tokenizer (or 'lexical analyzer') attempts to form tokens from a sequence
of characters. Have a look at "x+++++y" for example; it could represent the
sequence of tokens [x] [++] [+] [++] [y] which is syntactically valid for
Java, but that's not how Java's tokenizer works (try it).

The notation '[ ... ]' represents a single token, formed from the character
sequence '...'.

The Java tokenizer treats that character sequence as a series of tokens like
this: [x] [++] [++] [+] [x]

Why? because Java's tokenizer is 'greedy', i.e. it always scans the character
sequence from left to right, attempting to 'cut off' the longest possible
valid token. Why? because it needs a simple rule how to cut off tokens from
a character sequence. A tokenizer doesn't know much about the actual language
that needs to be compiled; it just knows about valid tokens and it tries
to produce a series of tokens as efficiently as possible. So the human user
has to put a space somewhere to help the tokenizer: "x++ + ++y". Java considers
spaces as token separators when all else fails.

In other circumstances a tokenizer works correctly without spaces being inserted
all over the place: "if(true)x=1;". the '(' character can not be part of a token
[if(] or [if(t] or [if(tr] etc. because such tokens don't exist in Java. The
longest possible first token here is [if] and no space is needed here. Of
course it helps humans to put spaces here and there for readability reasons.

Things are much worse in C and C++ where even the parentheses are needed.
Have a look at this C example:
Expand|Select|Wrap|Line Numbers
  1. if (x) * (y=z);
For a suitable pointer value for y this makes sense in C and C++. Leaving out
the parentheses:
Expand|Select|Wrap|Line Numbers
  1. if x * (y=z);
This doesn't make sense anymore: is it supposed to be an (ugly) condition like:
"x*(y=z)" followed by an empty statement or should it be a condition "x"
followed by a (ugly) statement "*(y=z);" Both variants don't make much sense
semantically speaking but what does the compiler know? It is supposed to
translate those character sequences and produce another (machine code) langage.

If I wrote: "<S-F6><S-F7>fsggf*&^&^*&kjgf(&" you wouldn't understand me
because the above doesn't make sense *lexically*, a subject the tokenizer
has to deal with. The tokenizer has to chop of a character sequence and
it must form tokens out of it.

You may have never realized it but a lot of those parentheses, semi-colons
and curly brackets are just there to help another part of the compiler:

The Parser

A parser expects a stream of tokens, produced by the tokenizer. For example,
a Java tokenizer treats the following sequence of characters: ")if(x; )(y"
as a perfectly valid sequence of tokens: [)] [if] [(] [x] [;] [)] [(] [y]
but it doesn't make much sense to the parser. The sequence of tokens may be
valid to the tokenizer, *syntactically* the token sequence doesn't make sense.

Human beings are extremely good at understanding syntactically incorrect
sentences; we all write/speak and listen to them every day.

If I write: "me beer now drink want!" you most likely understand what I'm
all about although the sentence is syntactically incorrect. You wouldn't
understand me when I wrote: "house boat now defenestrate to wants the the"
and rightly so. You wouldn't even understand what I'm talking about if I
wrote: "the house wants to defenestrate the boat now".

Houses don't defenestrate boats. Although this sentence is syntactically correct
is doesn't make sense. It doesn't make sense in a *semantical* way.

Humans are good at 'understanding' syntactically incorrect sentences, we aren't
any good at it when it comes to sentences that don't make any sense; we find
them humorous at best but we don't really care for the syntactical correctness.

We need correct semantics then. Compilers do too:

The Semantic Analyzer

Compilers can hardly deal with semantics: they can deal with type checking
but that's about it. Most of the semantics move on to the interpreter or
the virtual machine or to the real machine. After compilation (translation)
is over and done with the runtime environment (in whatever form) has to deal
with the semantics: it checks for division by zero, it checks whether or not
files exist, if sockets can be opened etc. etc.

The compiler can check whether or not every used method is declared, if every
used member variable is defined and it can check whether or not classes that
are used are defined somewhere. It can also do simple things like this:
Expand|Select|Wrap|Line Numbers
  1. int x= "hello world";
It's a perfect stream of tokens, it's syntactically all fine but the type of the
two things are incorrect: we can't assign a string to an int; the program
fragment is semantically incorrect. But before the runtime system can do
anything first that "anything" has to be generated:

The Code Generator

Code generation can be as simple as emiting bytes to somewhere (machine code
generation), but it can be complex too; Have a look at this little example:
Expand|Select|Wrap|Line Numbers
  1. double x= 1/42.0;
  2. x= Math.sin(x)/Math.cos(x)+Math.log(x)-3.14159;
  3. x= 0;
Was that initialization really necessary? How about that complicated expression?
Obviously it wasn't because at the end variable x will be set to 0 anyway.
So why generate code for all that? Maybe the user wants to debug this code so
all the code should be generated anyway. Maybe the user wants optimized code so
almost no code should be generated at all: a simple "x=0" initialization should
be enough. It's the job of the code generator to generate efficient code in the
context to what it knows. Code generators don't know anything about character
or token sequences. The tokenizer and parser took care of that. The parser does
invoke the code generator when needed but the parser doesn't know whether or
not the code would be really needed.

Normally, code generators are invoked by parsers; they know what code needs to
be generated, redundant or not. The code generators are supposed to figure out
whether or not all that code is really needed.

Concluding remarks

The first part of this article described the several phases (or modules) of
a compiler. Strictly speaking the interpreter or runtime system isn't part
of a compiler. First the character stream is converted to a token stream.
The token stream is checked whether or not it makes sense syntactically.

The parser knows when to invoke the code generator. The code generator must
generate efficient code. When all that is done, the interpreter has its work
to do: perform what the user originally intended with the character stream.

In practice the separation between the phases can be a bit blurred, i.e. the
parser changes the behaviour of the tokenizer, the code generator consults the
parser for some semantic details, the tokenizer talks back again to the parser
etc. etc. But the distinction between the modules helps the development of a
new compiler.

This part of the article contains no useful code whatsoever; that is left for
the next parts. The second part of this article deals with tokenizers and
following parts dig deeper into parsers, code generators and interpreters.
Our tokenizer uses regular expressions for chopping up the character sequence
into token sequences.

The source code of this all deliberately doesn't use any external tools for
the tokenizer part nor for the parser part, just for educational purposes.
There are quite a few very good tools available that generate tokenizers and
parsers for us but this is all about building it ourself from the bottom up.

Now I have to figure out what that language is going to be like and what it
should be capable of; keep your fingers crossed ;-)

See you next week in the second part of this article.

kind regards,

Jos
May 20 '07 #1
0 4768

Sign in to post your reply or Sign up for a free account.

Similar topics

1
by: Scott Carter | last post by:
Can anyone point me to an introduction of creating installers with VS.Net? I currently need to create one for a windows service, but I'd also like to find some info about creating installers for...
14
by: Cletis Tout | last post by:
http://www.codeproject.com/cpnet/introtomono1.asp Introduction to Mono - Your first Mono app By Brian Delahunty The first in a series of articles about Mono. This article explains how to...
1
by: Xiaoshen Li | last post by:
Dear All, I am relatively new to C(But I have computer science background and know some other programming languages). I am wondering if anyboby can recommend me some C books or websites. I...
14
by: Alf P. Steinbach | last post by:
Not yet perfect, but: http://home.no.net/dubjai/win32cpptut/special/pointers/ch_01.pdf http://home.no.net/dubjai/win32cpptut/special/pointers/ch_01_examples.zip To access the table of...
8
by: pransri2006 | last post by:
Hi guys! I think all of u know about the designing of compilers. Can any body tell me about the designing of the compilers. And also tell me the difference between the compilers and Interpreter...
12
by: Xah Lee | last post by:
Of Interest: Introduction to 3D Graphics Programing http://xahlee.org/3d/index.html Currently, this introduction introduces you to the graphics format of Mathematica, and two Java Applet...
22
by: Jon Harrop | last post by:
Flying Frog just launched the OCaml Journal: http://www.ffconsultancy.com/products/ocaml_journal/free/introduction.html?clcpp The free first article covers the basics of the OCaml programming...
0
by: JosAH | last post by:
Greetings, Introduction This part of the article is one week late; I apologize for that; my excuse is: bizzy, bizzy, bizzy; I attended a nice course and I had to lecture a bit and there...
0
RedSon
by: RedSon | last post by:
Chapter 2: How to handle Exceptions When you call a program, sometimes you get a message about an Unhandled exception type. This means, that something throws (or might throw) an Exception, but the...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.