473,322 Members | 1,431 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes and contribute your articles to a community of 473,322 developers and data experts.

Compilers - 1: Introduction

11,448 Expert 8TB
Greetings,

last week's tip was a bit of playtime where we've built a Sudoku solver. This
week we're going to build some complicated stuff: a compiler. Compiler
construction is a difficult branch of CS and I don't want the article(s) to be the
size of a book, so we have to keep things a bit simple.

On the other hand the entire thing must have practical use more or less so the
code to be developed must be extensible so that you can play and experiment a
bit with it if you like.

We also like to see some results so we can't just develop a compiler, we must
also develop a sort of interpreter to see the actual thing running. But first
we must plough through a bit of theory and definitions:

Introduction

A compiler typically translates a source language to a target language. Both the
source and target languages can be anything: the source language can be a
high level programming language, assenbly code language possibly for a virtual
machine or even a natural language such as English or Dutch or Swahili. Most
(but not all) of the time the target language will be a lower level language than
the source language.

A compiler reads a source language and writes a target language. It's a
translator basically. Human languages and their usage as a means of communication
are awefully difficult, especially when speech is involved. For simplicity we're
not going to deal with speech recognition nor the difficult aspects of speech
itself as a coherent sequence of phonemes, lexemes, diphtong- labial- dental-
guttural- plosive- voice- silent- and other funny sounds.

But even when it comes to programming languages things are quite complicated.
At the simplest level a programming language is a sequence of characters where
certain combinations of characters make up a 'token'. A token is an atomic
entity of the programming language. Think of reserved words and symbols such
as parentheses, mathematical operators, numbers etc. etc. But white space can
be an important token too; most programming languages consider white space as
the characters separating the more meaningful tokens, but they are extremely
important in string literals, e.g. "helloworld" is different from "hello world"
although for human beings the difference is futile based on our heuristic and
fascinating ability to match patterns although we can't really explain *how*
we do it.

A compiler is much more stupid than humans. It needs fixed rules to determine
whether or not a sequence of characters makes up a valid token. This is the
task of the first part of the compiler:

The Tokenizer

A tokenizer (or 'lexical analyzer') attempts to form tokens from a sequence
of characters. Have a look at "x+++++y" for example; it could represent the
sequence of tokens [x] [++] [+] [++] [y] which is syntactically valid for
Java, but that's not how Java's tokenizer works (try it).

The notation '[ ... ]' represents a single token, formed from the character
sequence '...'.

The Java tokenizer treats that character sequence as a series of tokens like
this: [x] [++] [++] [+] [x]

Why? because Java's tokenizer is 'greedy', i.e. it always scans the character
sequence from left to right, attempting to 'cut off' the longest possible
valid token. Why? because it needs a simple rule how to cut off tokens from
a character sequence. A tokenizer doesn't know much about the actual language
that needs to be compiled; it just knows about valid tokens and it tries
to produce a series of tokens as efficiently as possible. So the human user
has to put a space somewhere to help the tokenizer: "x++ + ++y". Java considers
spaces as token separators when all else fails.

In other circumstances a tokenizer works correctly without spaces being inserted
all over the place: "if(true)x=1;". the '(' character can not be part of a token
[if(] or [if(t] or [if(tr] etc. because such tokens don't exist in Java. The
longest possible first token here is [if] and no space is needed here. Of
course it helps humans to put spaces here and there for readability reasons.

Things are much worse in C and C++ where even the parentheses are needed.
Have a look at this C example:
Expand|Select|Wrap|Line Numbers
  1. if (x) * (y=z);
For a suitable pointer value for y this makes sense in C and C++. Leaving out
the parentheses:
Expand|Select|Wrap|Line Numbers
  1. if x * (y=z);
This doesn't make sense anymore: is it supposed to be an (ugly) condition like:
"x*(y=z)" followed by an empty statement or should it be a condition "x"
followed by a (ugly) statement "*(y=z);" Both variants don't make much sense
semantically speaking but what does the compiler know? It is supposed to
translate those character sequences and produce another (machine code) langage.

If I wrote: "<S-F6><S-F7>fsggf*&^&^*&kjgf(&" you wouldn't understand me
because the above doesn't make sense *lexically*, a subject the tokenizer
has to deal with. The tokenizer has to chop of a character sequence and
it must form tokens out of it.

You may have never realized it but a lot of those parentheses, semi-colons
and curly brackets are just there to help another part of the compiler:

The Parser

A parser expects a stream of tokens, produced by the tokenizer. For example,
a Java tokenizer treats the following sequence of characters: ")if(x; )(y"
as a perfectly valid sequence of tokens: [)] [if] [(] [x] [;] [)] [(] [y]
but it doesn't make much sense to the parser. The sequence of tokens may be
valid to the tokenizer, *syntactically* the token sequence doesn't make sense.

Human beings are extremely good at understanding syntactically incorrect
sentences; we all write/speak and listen to them every day.

If I write: "me beer now drink want!" you most likely understand what I'm
all about although the sentence is syntactically incorrect. You wouldn't
understand me when I wrote: "house boat now defenestrate to wants the the"
and rightly so. You wouldn't even understand what I'm talking about if I
wrote: "the house wants to defenestrate the boat now".

Houses don't defenestrate boats. Although this sentence is syntactically correct
is doesn't make sense. It doesn't make sense in a *semantical* way.

Humans are good at 'understanding' syntactically incorrect sentences, we aren't
any good at it when it comes to sentences that don't make any sense; we find
them humorous at best but we don't really care for the syntactical correctness.

We need correct semantics then. Compilers do too:

The Semantic Analyzer

Compilers can hardly deal with semantics: they can deal with type checking
but that's about it. Most of the semantics move on to the interpreter or
the virtual machine or to the real machine. After compilation (translation)
is over and done with the runtime environment (in whatever form) has to deal
with the semantics: it checks for division by zero, it checks whether or not
files exist, if sockets can be opened etc. etc.

The compiler can check whether or not every used method is declared, if every
used member variable is defined and it can check whether or not classes that
are used are defined somewhere. It can also do simple things like this:
Expand|Select|Wrap|Line Numbers
  1. int x= "hello world";
It's a perfect stream of tokens, it's syntactically all fine but the type of the
two things are incorrect: we can't assign a string to an int; the program
fragment is semantically incorrect. But before the runtime system can do
anything first that "anything" has to be generated:

The Code Generator

Code generation can be as simple as emiting bytes to somewhere (machine code
generation), but it can be complex too; Have a look at this little example:
Expand|Select|Wrap|Line Numbers
  1. double x= 1/42.0;
  2. x= Math.sin(x)/Math.cos(x)+Math.log(x)-3.14159;
  3. x= 0;
Was that initialization really necessary? How about that complicated expression?
Obviously it wasn't because at the end variable x will be set to 0 anyway.
So why generate code for all that? Maybe the user wants to debug this code so
all the code should be generated anyway. Maybe the user wants optimized code so
almost no code should be generated at all: a simple "x=0" initialization should
be enough. It's the job of the code generator to generate efficient code in the
context to what it knows. Code generators don't know anything about character
or token sequences. The tokenizer and parser took care of that. The parser does
invoke the code generator when needed but the parser doesn't know whether or
not the code would be really needed.

Normally, code generators are invoked by parsers; they know what code needs to
be generated, redundant or not. The code generators are supposed to figure out
whether or not all that code is really needed.

Concluding remarks

The first part of this article described the several phases (or modules) of
a compiler. Strictly speaking the interpreter or runtime system isn't part
of a compiler. First the character stream is converted to a token stream.
The token stream is checked whether or not it makes sense syntactically.

The parser knows when to invoke the code generator. The code generator must
generate efficient code. When all that is done, the interpreter has its work
to do: perform what the user originally intended with the character stream.

In practice the separation between the phases can be a bit blurred, i.e. the
parser changes the behaviour of the tokenizer, the code generator consults the
parser for some semantic details, the tokenizer talks back again to the parser
etc. etc. But the distinction between the modules helps the development of a
new compiler.

This part of the article contains no useful code whatsoever; that is left for
the next parts. The second part of this article deals with tokenizers and
following parts dig deeper into parsers, code generators and interpreters.
Our tokenizer uses regular expressions for chopping up the character sequence
into token sequences.

The source code of this all deliberately doesn't use any external tools for
the tokenizer part nor for the parser part, just for educational purposes.
There are quite a few very good tools available that generate tokenizers and
parsers for us but this is all about building it ourself from the bottom up.

Now I have to figure out what that language is going to be like and what it
should be capable of; keep your fingers crossed ;-)

See you next week in the second part of this article.

kind regards,

Jos
May 20 '07 #1
0 4763

Sign in to post your reply or Sign up for a free account.

Similar topics

1
by: Scott Carter | last post by:
Can anyone point me to an introduction of creating installers with VS.Net? I currently need to create one for a windows service, but I'd also like to find some info about creating installers for...
14
by: Cletis Tout | last post by:
http://www.codeproject.com/cpnet/introtomono1.asp Introduction to Mono - Your first Mono app By Brian Delahunty The first in a series of articles about Mono. This article explains how to...
1
by: Xiaoshen Li | last post by:
Dear All, I am relatively new to C(But I have computer science background and know some other programming languages). I am wondering if anyboby can recommend me some C books or websites. I...
14
by: Alf P. Steinbach | last post by:
Not yet perfect, but: http://home.no.net/dubjai/win32cpptut/special/pointers/ch_01.pdf http://home.no.net/dubjai/win32cpptut/special/pointers/ch_01_examples.zip To access the table of...
8
by: pransri2006 | last post by:
Hi guys! I think all of u know about the designing of compilers. Can any body tell me about the designing of the compilers. And also tell me the difference between the compilers and Interpreter...
12
by: Xah Lee | last post by:
Of Interest: Introduction to 3D Graphics Programing http://xahlee.org/3d/index.html Currently, this introduction introduces you to the graphics format of Mathematica, and two Java Applet...
22
by: Jon Harrop | last post by:
Flying Frog just launched the OCaml Journal: http://www.ffconsultancy.com/products/ocaml_journal/free/introduction.html?clcpp The free first article covers the basics of the OCaml programming...
0
by: JosAH | last post by:
Greetings, Introduction This part of the article is one week late; I apologize for that; my excuse is: bizzy, bizzy, bizzy; I attended a nice course and I had to lecture a bit and there...
0
RedSon
by: RedSon | last post by:
Chapter 2: How to handle Exceptions When you call a program, sometimes you get a message about an Unhandled exception type. This means, that something throws (or might throw) an Exception, but the...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.