Compilers - 6A: Interpreters

11,448 Expert 8TB

Greetings,

last week we completed an important part of our compiler. We can parse and
compile source text. Source texts contain a list of expressions and function
definitions. Either compilation succeeds and we get a Code object back from
our parser(s) or an InterpreterException is thrown indicating the cause of
the error.

Strictly speaking this Compiler article is finished now: all compilation details
have been discussed in the previous parts of this article. The Tokenizer that
transforms an input sequence of characters to an input sequence of tokens.
The parser(s) check for syntactical correctness and instructs a Generator to
generate instructions in the Code object accordingly.

Not so strictly speaking: I want to see results; it's no fun to have a Code
object that contains a sequence of instructions and there's nothing I can do
with it. For now, let's skip the Generator, it's a dumb thing that generates
the correct instructions for our parser(s) and nothing more.

Let's concentrate on an Interpreter that is able to interpret the instructions
in the Code object so we can see what the results are.

Interpreter

Our interpreter will be a simple interpreter, i.e. all it does is execute a
list of Intructions which is abstracted away a bit by a Code object (see the
previous article part).

Basically all it does is this:

Expand|Select|Wrap|Line Numbers

 
public Stack execute(Code code) throws InterpreterException {
 
    for (Instruction instruction : code) {

        preExecute(code, instruction);

        instruction.execute(this);

        preExecute(code, instruction);

    }
 
    return stack;

}

All the interpreter does is iterate over the Code object (a List<Instruction>)
and invokes the 'execute' method on the Instruction. The interpreter doesn't
even know *what* is executed, all it knows is that an Instruction is a simple
interface:

Expand|Select|Wrap|Line Numbers

 
public interface Instruction extends Serializable {
 
    public void execute(Interpreter interpreter) 

                throws InterpreterException;
 
    public void initialize(String token, int arity);

}

We ignore the second method for now (it is used by the code Generator). The
first method declaration promises that an Instruction can execute as long as
we pass it an Interpreter object. It can throw an InterpreterException in case
anything goes wrong.

That fits the bill for the Interpreter: it calls the 'execute' method on every
Instruction in the Code list and passes itself as a parameter. Note the two
pre- and postExecute methods. I don't know what to do with them but I'm sure
they will come in handy. I've implemented them both as two empty methods in
the Interpreter class:

Expand|Select|Wrap|Line Numbers

 
protected void preExecute(Code code, Instruction instruction) 

                throws InterpreterException { }
 
protected void postExecute(Code code, Instruction instruction) 

                throws InterpreterException { }

They can be of use to other objects that extend from the Interpreter class.
If they turn out not to be of any use I simply kick them out again.

That 'execute' method of the Interpreter class returns a 'stack'. As you might
have noticed, or will notice in a short time, our parsers basically generate
postfix code: e.g. when the expression '1+2' is compiled first two Constant
Instructions are generated for the numbers 1 and 2 and next an Instruction is
generated for the '+' operator. Symbolically the generated code looks like '1 2 +'
which is a postfix form of the original infix '1+2' expression.

Postfix code almost begs for a stack on which the operands of a function or
operator are pushed and popped. The postfix expression '1 2 +' reads as:

1) push 1 on the stack
2) push 2 on the stack
3) pop the two operands, add them and push the result back on the stack.

But who or what pushes those constants 1 and 2 on the stack then? The Interpreter
itself is too simple and stupid for that; the answer is that the Constant
Instruction does it; it pushes the constant on the stack that was passed to it
when it was created by the Generator. Check the ExpressionParser and try to
figure out where all that happened.

Our Interpreter is a stack based machine. The Stack object is a member of the
Interpreter object itself:

Expand|Select|Wrap|Line Numbers

 
protected Stack stack= new Stack();
 
public Stack stack() { return stack; }

The Stack object itself looks like this:

Expand|Select|Wrap|Line Numbers

 
package compiler;
 
import java.util.ArrayList;
 
import compiler.value.DblValue;

import compiler.value.Value;
 
public class Stack extends ArrayList<Value> {
 
    private static final long serialVersionUID = -2663291338225528177L;
 
    public void add(double value) { add(new DblValue(value)); }

}

A Stack is a List<Value> which is used as if it were a real stack. The class
implements a little convenience method that can push a 'double' on the stack
after it has been wrapped in a DblValue.

The next paragraph explains what a Value is all about.

Value

All our Instructions and our Interpreter work with Values. We don't have many
different data types in our little language (yet?) It can handle doubles and
lists. Both of them are Values. Here's a little scenario that encapsulates both
types; as an example I've included a string value too which is not being used
for now. We start with an abstract class Value:

Expand|Select|Wrap|Line Numbers

 
public abstract class Value implements Serializable {
 
    public static final Value NUL= new DblValue(0.0);

    public static final Value ONE= new DblValue(1.0);

This abstract class defines a few constants for us (read: the Instructions):
zero and one. Note that the class is Serializable so we can write a Value to
a stream or read it from a stream if we want.

Here are a few methods from the abstract Value class:

Expand|Select|Wrap|Line Numbers

 
public double getDbl() throws InterpreterException {
 
    throw new InterpreterException("not a double: "+toString());

}
 
public String getStr() throws InterpreterException {
 
    throw new InterpreterException("not a string: "+toString());

}
 
public List<Value> getLst() throws InterpreterException {
 
    throw new InterpreterException("not a list: "+toString());

}

Those are the 'cast' methods which all throw an InterpreterException in this
abstract base class. The DblValue class and its compadres are supposed to override
the applicable method so a caller can easily find the Java type it needs.

The next methods are intended to be overridden too where applicable:

Expand|Select|Wrap|Line Numbers

 
public boolean isDbl() { return false; }

public boolean isStr() { return false; }

public boolean isLst() { return false; }

Every other type (there's only double and list for now) is supposed to extend
this base class. Here's the DblValue class:

Expand|Select|Wrap|Line Numbers

 
public class DblValue extends Value {
 
    private static final long serialVersionUID = -5067625407505210072L;
 
    private double value;
 
    public DblValue(double value) { this.value= value; }
 
    public double getDbl() { return value; }
 
    public boolean isDbl() { return true; }
 
    public boolean equals(Object obj) {
 
        if (obj == null || !(obj instanceof DblValue)) return false;
 
        return value == ((DblValue)obj).value;

    }
 
    public int hashCode() { return (int)value; }
 
    public String toString() { return ""+value; }

}

This little class is a nice data class: it can be compared for equality and
it encapsulates a primitive double type. Exactly what we want.

Although we don't use it; have a look at one of the other types: the StrValue class:

Expand|Select|Wrap|Line Numbers

 
public class StrValue extends Value {
 
    private static final long serialVersionUID = -5355082990666947635L;
 
    private String value;
 
    public StrValue(String value) { this.value= value; }
 
    public String getStr() { return value; }
 
    public boolean isStr() { return true; }
 
    public boolean equals(Object obj) {
 
        if (obj == null || !(obj instanceof StrValue)) return false;
 
        return value == ((StrValue)obj).value;

    }
 
    public int hashCode() { return value.hashCode(); }
 
    public String toString() { return value; }

}

It's almost the same as the DblClass but it encapsulates a String class. The
LstValue class is our little language notion of a list. Here it is:

Expand|Select|Wrap|Line Numbers

 
public class LstValue extends Value {
 
    private static final long serialVersionUID = 8665468454551339099L;
 
    private List<Value> value;
 
    public LstValue(List<Value> value) { this.value= value; }
 
    public List<Value> getLst() { return value; }
 
    public boolean isLst() { return true; }
 
    public boolean isSimple() {
 
        for (Value elem : value)

            if (!elem.isDbl()) return false;

        return true;

    }
 
    public int size() { return value.size(); }
 
    public boolean equals(Object obj) {
 
        if (obj == null || !(obj instanceof LstValue)) return false;
 
        return value.equals(((LstValue)obj).value);

    }
 
    public int hashCode() { return value.hashCode(); }
 
    public String toString() { return value.toString(); }

}

Again, this class is almost identical to its siblings (check it out) but it
has one little additional method; the 'isSimple' method. It is needed by our
Instructions and it checks whether or not all elements of the list are doubles.

Observe that its toString() method delegates to the value member variable which
knows best how to produce a String representation of the encapsulated
List<Value> object.

All three classes (DblValue, StrValue and LstValue) override the equals() method
and the hashCode() method: they're fine candidates to stick them in Maps, Sets
and whatever collection that requires an object to have those methods for proper
value semantics equality.

All three classed are also Serializable which comes in handy when we want to
save or load them to/from somewhere that can be represented by a stream.

These little classes also know how to produce a readable String from themselves
which comes in handy when we need to read what these values actually are.

Intermezzo

Our little language basically looks like this:

Expand|Select|Wrap|Line Numbers

D1;D2; ... Dm;E1;E2;E3; ... En;

Where the D's and E's can be in any order, even mixed. The D's are the definitions
and declarations of the user functions (if any), while the E's represent the
expressions that might use those functions or use built-in functions.

The definitions, declarations and expressions are translated to lists of
instructions by the parser(s) and code generator. The semicolons do nothing
actually: they serve as syntactic sugar for the parsers and indicate when an
expression of definition is complete and possibly a next one starts after it.

Normal expressions have a value, e.g. the expression 1+2 has a value 3 which
is left on the stack of values. A bunch of normal expressions leave a bunch
of values on that stack. We need special 'functions' (mind the quotes) that
can manipulate that stack (owned by the Interpreter) directly, otherwise we
might end up with a bunch of values on the stack when the program finishes
that we don't need or want. Such special functions may not return a value.
They're the equivalent to procedures in other languages.

Procedures can not be part of an expression of course, just because they don't
result in a (return) value. As an example there is the 'clear()' procedure or
special function. It clears the stack where all Value results are kept.

The expression 'clear()+42' doesn't make sense of course. Have a look at the
postfix representation of this non-expression: 'clear() 42 +'. This is what
happens internally:

1) clear() clears the Value stack in the interpreter;
2) 42 is pushed on the Value stack;
3) the '+' instruction attempts to pop two values from the stack.

Step 3) is an error: there are no two operands on the stack to add. The Interpreter
guards against stack underflow. But where can these procedures or special functions
be used then? They cannot be part of an expression as we just saw, but they
can be expressions by themselves followed by a semicolon. The 'expression'
(those quotes again) simply didn't result in a (return) value but it doesn't
matter because that value wasn't needed as an operand for a bigger expression.

As we can see later, those special functions can manipulate the Value stack in a
powerful way; a way that makes our compiler/interpreter classes fun to play with
and it can have educational purposes too. The 'oldies' (excusez le mot) who still
remember those Hewlett Packard reverse polish notation calculators sure know
what I'm talking about. Our little language can be a mixture of 'normal' infix
expressions and postfix explicit stack manipulation. Pick the one you like.

Variable binding

Our little language knows about variables too, both local and global. Where are
these variables stored then? The answer is: the Interpreter knows how to handle
them. The Interpreter manages a stack of 'scopes'. A scope is a Map that
associates the name of a variable with its value.

There's always a scope available. Even at the outermost level. Have a look at
this example:

Expand|Select|Wrap|Line Numbers

 
x= 1;

x+= 41;

The first expression associates the value 1 with the name 'x'. The second
expression looks up the current value of 'x' (which is 1) adds 41 to it and
re-associates the variable named 'x' with the value 42. If a variable is not
yet associated with a value, the value 0 is assumed.

A user defined function pushes another scope on the stack of scopes. Have
a look at this:

Expand|Select|Wrap|Line Numbers

 
function f(x) = x+= 41;

x= 54;

f(1);

When function f starts executing it first pushes a new empty scope on the scope
stack and it associates its parameter 'x' with the value 1 because that is what
the parameter value was when f(1) was called.

Next the expression 'x+= 41' is evaluated which means that variable 'x' is updated
and gets the value 42. Which 'x' then? Answer: the variable 'x found in the scope
found nearest to the top of the scope stack.

Before f(1) was evaluated the scope stack just contains one scope:

Expand|Select|Wrap|Line Numbers

{ x=54 }

When the user defined function is called a new scope is pushed:

Expand|Select|Wrap|Line Numbers

 
{ }

{ x=54 }

and the parameter 'x' is associated with the value 1:

Expand|Select|Wrap|Line Numbers

 
{ x=1 }

{ x=54 }

The body of the function find the value associated with 'x' starting at the top
of the stack; it finds the value 1 and adds 41 to it and re-associates 'x' with
that new value:

Expand|Select|Wrap|Line Numbers

 
{ x=42 }

{ x=54 }

and returns that same value (42) as the return value of the function expression.
The scope is popped again and the value 42 is pushed on the Value Stack of the
Interpreter.

As you can see, the old association for the 'global' value of 'x' is unaltered.
Just as we expected with other programming languages: local variables 'hide'
more global variables with the same name.

When we search for an association for a variable we could just search for it in
the top level scope on the scope stack. But why should we? We could search all
associations on the scope stack from top to bottom until we've found one. It we
don't find an association for a variable we simply assume that value is 0.
Have a look at this example:

Expand|Select|Wrap|Line Numbers

 
function f(x);

function g(x, y)= f(x);

function f= x+y;

x= 41;

y= 1;

f(x);

g(x, 13);

The tricky function is function 'f': it has one parameter 'x' but it adds its
value to whatever 'y' may be. First f(x) is called from a global scope in which
'y' has the value 1 so the result will be 41+1=42. Next function 'g' is called
which has two parameters 'x' and 'y'. In the next scope on the scope stack
variable 'y' will be bound (or associated) with the value 13. The function 'f'
is called again; it adds the value of 'x', which is local to function 'g' and
has the value 41 again, to the value of 'y' which is associated to the value
13 in the function call of function 'g' so the result will be 54.

Read this again and again until you understand 'deep binding'.

Deep binding (in contrast to shallow binding) is an old parameter passing and
scoping mechanism that is sort of lost in the modern C/C++/Java like languages.
Those languages use shallow binding which is also known as lexical scoping:
if a variable can not be found in the topmost scope on the scope stack it is
considered to be an error. Deep binding is a lot more fun.

Deep binding is used in variants of Lisp, a functional programming language much
more powerful than our little language, but we can do it too and we'll make
use of it in a later article, promised.

Here's the code that is needed in our Interpreter that implements the stack of
scopes for us; it's not much. We simply use a List for our scope stack again and
a scope simply is a Map<String, Value) that associates the name of a variable
(a String) with a Value:

Expand|Select|Wrap|Line Numbers

 
package compiler;
 
import java.util.ArrayList;

import java.util.HashMap;

import java.util.Map;
 
import compiler.exception.InterpreterException;

import compiler.value.Value;
 
public class Scope extends ArrayList<Map<String, Value>> {
 
    private static final long serialVersionUID = 3520767277469443623L;
 
    public Scope() { push(); }
 
    private Map<String, Value> find (String name) {
 
        for (int i= size(); --i >= 0;) {

            Map<String, Value> map= get(i);

            if (map.containsKey(name)) return map;

        }
 
        return get(size()-1);

    }
 
    public void push() { add(new HashMap<String, Value>()); }
 
    public void pop() { remove(size()-1); }
 
    public void put(String name, Value value) {
 
        find(name).put(name, value);

    }
 
    public Value get(String name) throws InterpreterException {
 
        Value value= find(name).get(name);
 
        return (value == null)?Value.NUL:value;

    }

}

The push method pushes a newly created empty scope on the stack and the pop method
removes it again. It's not very interesting.

If a variable value is needed, the entire stack of associations is searched;
if an association is found (by the private find method) the association is
returned, otherwise the top association is returned by the find method.
If the association Map doesn't contain the particular value, 0 is returned.

Putting an association of a variable and its value simply stores the association
in the topmost Map. If we had limited this search to the topmost scope on the
stack we would've implemented shallow binding; the current implementation implements
deep binding (the entire stack is searched for an association) which is more fun.
Note that if a variable needs to be assigned to and if it is not found in the
stack of associations, the top association Map is used; this implies that a
variable unknown at that time always ends up either in the global scope if the
assignment was performed as just an expression, or as a local variable if the
assignment was performed inside a user defined function. In other words: a user
defined function can not create global variables.

The Interpreter itself encapsulates such a Scope object and gives us access to it:

Expand|Select|Wrap|Line Numbers

 
protected Scope scope= new Scope();
 
public Scope scope() { return scope; }

Memoizing or Shallow binding with a speedup

As you saw in the previous paragraph deep binding is more fun (and more powerful
too) than shallow binding or lexical scoping. Our little interpreter supports
the latter form of binding too with a little speedup twist.

Suppose a user defined function f(x) takes quite some time to produce a return
value. Also suppose that function call f(x) always produces the same return
value for the same value of its parameter x all the time. Why should we call that
function again then if we already know the return value from a previous function
call f(x)?

We can only 'know' that return value if that function 'f' produces the same
return value for identical parameter value(s) 'n' of course. Deep binding makes
this impossible because variables that are not part of the parameter list may
change behind our backs.

We've seen how deep binding behaves in the previous paragraph. What happens if
we jot down every return value of every user defined function given this binding
mechanism? We 'memoize' the return values that 'belong' to the function calls
given certain parameter values and the values of variable that were not bound
or associated by the function itself. This is almost identical to shallow binding
with the difference that even the first function call would have failed.

Let's go for it: the Interpreter manages a simple table that memoizes function
calls given a certain parameter (list of) value(s). Given the name of a user
defined function we need a list of parameter values associated with the return
value. We need two Maps: the top level map associates the name of the user
defined function with a Map of all its invocations. Every invocation is a list
of parameter values associated with the return value of that function.

I crafted a little 'Memo' class for this purpose; here it is:

Expand|Select|Wrap|Line Numbers

 
public class Memo extends HashMap<String, Map<List<Value>, Value>> {
 
    private static final long serialVersionUID = 4800064570213212044L;
 
    private boolean memo= false;
 
    public void put(String name, List<Value> args, Value value) {
 
        if (!memo) return;
 
        Map<List<Value>, Value> values= get(name);
 
        if (values == null) 

            put(name, values= new HashMap<List<Value>, Value>());
 
        values.put(new ArrayList<Value>(args), value);

    }
 
    public Value get(String name, List<Value> args) {
 
        if (!memo) return null;
 
        Map<List<Value>, Value> values= get(name);
 
        return (values != null)?values.get(args):null;

    }
 
    public void setMemo(boolean memo) { this.memo= memo; }
 
    public boolean isMemo() { return memo; }

}

It has a private boolean flag 'memo' that tells whether or not it is allowed
to 'remember' the function call and its parameter list associated with the
return value.

This class implements the methods needed: a 'put' and a 'get' method. The 'put'
method adds the association and the 'get' method retrieves the association again
(if any). This little class is a nice example of what one can do with the
Collections framework.

See you in the next part of this article part below.

kind regards,

Jos

Nov 21 '07 #1

Subscribe Post Reply

3356

Compilers - 6A: Interpreters

Similar topics