By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
440,823 Members | 756 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

Compilers - 5B: Parsers

Expert 10K+
P: 11,448
Greetings,

welcome back at the sequel of the parsers article chapter. This part is dedicated
to the ExpressionParser, the largest parser class for our little language.
This class parses a complete expression and just like the other parser classes
calls the Generator on the fly. When the parse has completed successfully, the
generated code is returned. Otherwise the parse is aborted and an exception is
thrown telling the reason of the abort.

The ExpressionParser

The obligatory preamble of the ExpressionParser class is boring again:

Expand|Select|Wrap|Line Numbers
  1. public class ExpressionParser extends AbstractParser {
  2.  
  3.     public ExpressionParser(Tokenizer tz, User user) { super(tz, user); }
  4.  
Just like the Parser and DefinitionParser classes a Tokenizer and User object
are passed to the constructor of this class. Here's the main entry point for
this object:

Expand|Select|Wrap|Line Numbers
  1. public Code parse(Generator gen) throws InterpreterException {
  2.  
  3.     binaryExpression(gen, 0);
  4.  
  5.     return gen.getCode();
  6. }
  7.  
You've seen this first method being used already: the DefinitionParser calls
it when it needs an entire expression to be parsed for the user defined function
body. The method calls another private method that does the job; it'll come a
bit later (see below). The method returns the compiled code that has been
generated on the fly while parsing the expression.

Binary expressions

Here are the grammar rules again for a binary expression:

Expand|Select|Wrap|Line Numbers
  1. expression0: expression1 ( operator0 expression1 )*
  2. expression1: expression2 ( operator1 expression2 )*
  3. expression2: expression3 ( operator2 expression3 )*
  4. expression3: expression4 ( operator3 expression4 )*
  5. expression4: expression5 ( operator4 expression5 )*
  6. expression4: unary       ( operator5 unary       )*
  7. operaor 0: ':'
  8. operator1: '==' | '!='
  9. operator2: '<=' | '<' | '>' | '>='
  10. operator3: '+' | '-'
  11. operator4: '*' | '/'
  12. operator5: '^'
  13.  
Note the numbers 0, 1, 2, 3, 4 and 5; we're going to use these numbers as index
numbers for the binary expression. In general rule 'i' looks like this:

Expand|Select|Wrap|Line Numbers
  1. expression(i): expression(i+1) ( operator(i) expression(i+1) )*
  2.  
... except when 'i' == 5 where the 'unary' rule is used instead. It translates
to Java code like this:

Expand|Select|Wrap|Line Numbers
  1. private void binaryExpression(Generator gen, int i) 
  2.                         throws InterpreterException {
  3.  
  4.     Token token;
  5.  
  6.     if (i == ParserTable.binops.size())
  7.         unaryExpression(gen);
  8.     else {
  9.         for (binaryExpression(gen, i+1); 
  10.             ParserTable.binops.get(i).
  11.                 contains((token= tz.getToken()).getStr()); ) {
  12.             tz.skip();
  13.             binaryExpression(gen, i+1);
  14.             gen.makeFunctionInstruction(token, 2);
  15.         }
  16.     }
  17. }
  18.  
The index value 'i' represents the precedence value and the operators of that
precedence that are concerned in this binary expression. At the highest precedence
level a binary expression is just a unary expression; otherwise at precedence
level 'i' just binary expressions that take higher precedence level operands
are parsed, separated by precedence level 'i' operators.

The 'makeFunctionInstruction' method in the Generator class generates a binary
operator instruction and adds it to the compiled code list. Also see the
explanation in the 'Grammar' part of this article. In a bird eye view this
recursive (!) method parses binary expressions like described in the indexed
grammar rule above. If you understand this method the rest of the ExpressionParser's
method will be a piece of cake. This method takes care of the binary operator
precedence rules, i.e. recursively it parses expressions that take higher and
higher precedence operators until unary expressions are the only thing that's
left. Unary expressions are a bit of a mess syntactically speaking because there
can be some many forms of them. We'll split things up a bit.

Unary expressions

Here are the grammar rules for unary expressions again:

Expand|Select|Wrap|Line Numbers
  1. unary: ( unaryop unary ) | atomic
  2. unaryop: '+' | '-' | '!' | '++' | '--'
  3.  
Basically a unary expression is prepended by zero or more unary operators
followed by an atomic expression. The operators are applied (or evaluated) from
right to left although they are parsed from left to right. Here's what the
unaryExpression looks like:

Expand|Select|Wrap|Line Numbers
  1. private void unaryExpression(Generator gen) throws InterpreterException {
  2.  
  3.     Token token= tz.getToken();
  4.  
  5.     if (ParserTable.unaops.contains(token.getStr())) {
  6.         tz.skip();
  7.         unaryExpression(gen);
  8.         gen.makeFunctionInstruction(token.mark(), 1);
  9.     }
  10.     else
  11.         atomicExpression(gen);
  12. }
  13.  
Again this is a highly recursive method: as long as a unary operator is present
in the token stream this method calls itself recursively and generated compiled
code for the unary operator afterwards. If no unary expression is present in the
token stream (anymore), the work is delegated to the atomicExpression method.

Here's an example: suppose the input stream is '-!x', first the '-' operator
is parsed and the method calls itself again. The input stream now contains '!x'.
The '!' operator is parsed and the method calls itself again. The input stream
now is 'x' which is handled by the atomicExpression method. In the second call
of the unaryExpression method code is generated for the '!' operator; the method
returns and in the first call of this method code is generated for the '-'
operator. The code can symbolically be represented as 'x ! -'.

When we can play a bit with the parsers we'll see that compiled code resembles
a postfix representation of the infix expressions to be compiled.

When we've reached this deep in our recursive descent parse there are no more
binary nor unary operators to parse in this part of the expressions. All that's
left are 'atomic' expressions.

Atomic expressions

Atomic expressions are a mess syntactically speaking. Basically an atomic expression
start with a constant, a left parenthesis, a left curly bracket or a name.

The first options are relatively easy; it the 'name' that's the cumbersome part.
All the Tokenizer knows is that it has scanned a name; but a name can be
anything:

1) a built-in function;
2) a user defined function or listfunc;
3) a quoted object (see the first part of this chapter);
4) a simple variable, or
5) an assignment to a variable.

Let's dig into this mess; here's the atomicExpression method:

Expand|Select|Wrap|Line Numbers
  1. private void atomicExpression(Generator gen) throws InterpreterException {
  2.  
  3.     Token token= tz.getToken();
  4.  
  5.     switch (type(token)) {
  6.  
  7.         case TokenTable.T_NUMB: constantExpression(token, gen); break;            
  8.         case TokenTable.T_NAME: nameExpression(token, gen); break;
  9.  
  10.         case ParserTable.T_USER: userExpression(token, gen); break;
  11.         case ParserTable.T_FUNC: functionExpression(token, gen); break;
  12.         case ParserTable.T_QUOT: quoteExpression(token, gen); break;
  13.  
  14.         default: 
  15.             if (expect("(")) 
  16.                 nestedExpression(gen);
  17.             else if (expect("{"))
  18.                 listExpression(gen);
  19.             else
  20.                 throw new ParserException(tz, 
  21.                     "expression expected");
  22.     }
  23. }
  24.  
This method tries to predict what to do next given the type of the current token.
It delegates all the work to yet other methods that are supposed to do the work.

Constant expressions

Constant expressions are easy: simply generate the Instruction that handles the
current constant. In our little language the only constants are doubles. Here's
the method that handles a constant:

Expand|Select|Wrap|Line Numbers
  1. private void constantExpression(Token token, Generator gen) {
  2.  
  3.     gen.makeConstantInstruction(token); 
  4.     tz.skip();        
  5. }
  6.  
This method asks the code Generator to produce a ConstantInstruction (whatever
that may be) and it considers the current token as processed so it skips it.

Parenthesized expressions

Parenthesized expressions are relatively easy to parse too: a left parenthesis
has already been processed (see the atomicExpression method), so just a binary
expression followed by a right parenthesis needs to be parsed:

Expand|Select|Wrap|Line Numbers
  1. private void nestedExpression(Generator gen) throws InterpreterException {
  2.  
  3.     binaryExpression(gen, 0);
  4.     demand(")");
  5. }
  6.  
Observe that this deeply nested method calls one of the outermost methods again:
an entire binary expression with all its precedence levels is parsed again using
just one recursive method call. We've reached the heart of recursive descent
parsing and we will reach it again a couple of times.

For an expression as simple as '(1+1)' the following methods are recursively
called:

- binaryExpression(0)
- binaryExpression(1)
- binaryExpression(2)
- binaryExpression(3)
- binaryExpression(4)
- unaryExpression
- atomicExpression
- nestedExpression

and then only that left parenthesis has been recognized and parsed. For that
first '1' we have to do the whole trip again:

- binaryExpression(0)
- binaryExpression(1)
- binaryExpression(2)
- binaryExpression(3)
- binaryExpression(4)
- unaryExpression
- atomicExpression
- constantExpression

Then recursion unwinds to the level of the binaryExpression(2) that recognizes
the '+' sign in the token stream and the entire circus starts all over again.

There's no need to feel sad about it or to be afraid of it: recursion is fast
nowadays and the depth of stacks is no problem either anymore. Let's go for the
nasty final parts:

Name expressions

When a name is parsed that is not some sort of a function or a quoted object
it can be either just a name or an assignment. Here's how it's parsed:

Expand|Select|Wrap|Line Numbers
  1. private void nameExpression(Token token, Generator gen) 
  2.                         throws InterpreterException {
  3.  
  4.     tz.skip();
  5.     String assign= tz.getToken().getStr();
  6.     if (ParserTable.asgns.contains(assign)) {
  7.         tz.skip();
  8.         gen.preAssignmentInstruction(token, assign);
  9.         binaryExpression(gen, 0);
  10.         gen.makeAssignmentInstruction(token, assign);
  11.     }
  12.     else {
  13.         gen.makeNameInstruction(token);
  14.         if (ParserTable.pstops.contains(assign)) {
  15.             tz.skip();
  16.             gen.makeConstantInstruction(Token.ONE);
  17.             gen.makeAssignmentInstruction(token, assign.charAt(0)+"=");
  18.  
  19.         }
  20.     }
  21. }
  22.  
This method skips the name (it still has the Token as its first parameter) and
checks whether or not the name is followed by an assignment operator. If so,
it skips the token and invokes that topmost binaryExpression method again for
the right hand side value of the assignment. Next it lets the code Generator
generate an AssignmentInstruction for it.

If no assignment operator was present it makes the code Generator generate a
simple NameInstruction for the name that has just been parsed unless a postfix
++ or -- token follows the name: in this case the appropriate expression and
assignment is generated.

As you might see later, our postfix decrement and increment operators behave
identical to C/C++/Java's prefix increment and decrement operators. Confusing,
admitted but it takes less parsing this way for our simple, little programming
language; and we're free to define and implement whatever we want after all ;-)

User function expressions

User function calls are quite simple to parse; here's how it's done:

Expand|Select|Wrap|Line Numbers
  1. private void userExpression(Token token, Generator gen) 
  2.                         throws InterpreterException {
  3.  
  4.     parseArguments(gen, getUserArity(token));
  5.  
  6.     gen.makeInstruction(user.get(token.getStr()));
  7. }
  8.  
This method parses the arguments of the user function call given the arity of
the user function. Then it makes the code Generator create an instruction which
it retrieves from the User object (see the first part of this chapter).

The 'getUserArity' method looks like this:

[code=java]
private int getUserArity(Token token) throws InterpreterException {

skipDemand("(");

return user.get(token.getStr()).getArity();
}
[code]

First it demands a left parenthesis after having skipped the name of the user
defined function. Then it consults that User object again to get the arity of
the user defined function.

Parsing the actual arguments of the user defined function is done like this:

Expand|Select|Wrap|Line Numbers
  1. private int parseArguments(Generator gen, int n) throws InterpreterException {
  2.  
  3.     for (int i= 0; i < n; i++) {
  4.         binaryExpression(gen, 0);
  5.         if (i < n-1) demand(",");
  6.     }
  7.     demand(")");
  8.  
  9.     return n;
  10. }
  11.  
The 'n' arguments are parsed; every argument can be an entire binary expression
again. Following every argument expression needs to be a comma except for the
last argument in which case a right parenthesis is expected.

Built-in function expressions

Parsing built-in functions, no matter listfuncs or functions is done in a
similar way as we parsed user defined functions. Have a look:

Expand|Select|Wrap|Line Numbers
  1. private void functionExpression(Token token, Generator gen) 
  2.                         throws InterpreterException {
  3.  
  4.     gen.makeFunctionInstruction(token, 
  5.                     parseArguments(gen, getArity(token)));
  6. }
  7.  
Again the arguments are parsed (see above) and a FunctionIstruction is created
by the code Generator. The only difference is that the parse knows already how
many arguments that built-in function needs, i.e. it has to retrieve it from
the ParserTable where the arities for built-in tokens are predefined:

Expand|Select|Wrap|Line Numbers
  1. private int getArity(Token token) throws InterpreterException {
  2.  
  3.     skipDemand("(");
  4.  
  5.     return ParserTable.arity.get(token.getStr());
  6. }
  7.  
Compare this method with the getUserArity method (see above). They're similar,
the only difference is the way they find the arity for the function.

Quoted object expressions

The mysterious quoted objects (think of the 'if' object) look similar to
ordinary function calls but they aren't: the code generated for their arguments
is not being generated by the normal code Generator but another Generator is
used for the purpose and the compiled code for the arguments is passed to the
compiled quoted object. Here's how it's done:

Expand|Select|Wrap|Line Numbers
  1. private void quoteExpression(Token token, Generator gen) 
  2.                         throws InterpreterException {
  3.  
  4.     List<Code> args= new ArrayList<Code>();
  5.  
  6.     for (int i= 0, n= getArity(token); i < n; i++) {
  7.         Generator arg= new Generator();
  8.         binaryExpression(arg, 0);
  9.         if (i < n-1) demand(",");
  10.         args.add(arg.getCode());
  11.     }
  12.     demand(")");
  13.  
  14.     gen.makeQuoteInstruction(token, args);
  15. }
  16.  
For every single argument a new code Generator is used; all the code is collected
in a list of compiled code. The syntax checks are similar to the syntax checks
for the user defined or built-in functions. Finally a QuoteInstruction is created
by the 'normal' main code Generator. The use of this mysterious behaviour has
been partly explained in the previous part of this chapter and we'll get back to
it when we discuss the Instructions.

List expressions

There just one more part to explain. Our little language can handle lists.
A list is a bunch of expressions in curly brackets. Note that an expression
is a list so a list can have lists as their elements which can have lists as
their elements etc. etc.

This is how a list is parsed:

Expand|Select|Wrap|Line Numbers
  1. private void listExpression(Generator gen) throws InterpreterException {
  2.  
  3.     int arity;
  4.  
  5.     for (arity= 0;; ) {
  6.         if(expect("}")) 
  7.             if (arity == 0) break;
  8.             else throw new ParserException(tz, 
  9.                         "list expression error");
  10.         binaryExpression(gen, 0);
  11.         arity++;
  12.         if (expect(",")) continue;
  13.         demand("}"); 
  14.         break;
  15.     }
  16.  
  17.     gen.makeListInstruction(arity);
  18. }
  19.  
Basically this method keeps on calling the outer binaryExpression method to
collect all the elements of the list. It takes care that the syntax is correct
by calling the expect and demand methods at the right places (check it) and
finally the code Generator is asked to create a ListInstruction given the number
of elements in the list that has just been parsed. Note that the list can contain
other lists as its elements just because of all the recursion that is used.

Code

We've seen this mysterious Code object a couple of times. A Code object is
managed by the code Generator. A Code object is just this:

Expand|Select|Wrap|Line Numbers
  1. public class Code extends ArrayList<Instruction> {
  2.  
  3.     private static final long serialVersionUID = 3728486712655477743L;
  4.  
  5.     public static Code read(InputStream is) throws InterpreterException {
  6.  
  7.         try {
  8.             return (Code)new ObjectInputStream(is).readObject();
  9.         }
  10.         catch (Exception e) {
  11.             throw new InterpreterException("can't read code", e);
  12.         }
  13.     }
  14.  
  15.     public static Code read(String name) throws InterpreterException {
  16.  
  17.         FileInputStream fis= null;
  18.  
  19.         try {
  20.             return read(fis= new FileInputStream(name));
  21.         }
  22.         catch (IOException ioe) {
  23.             throw new InterpreterException("can't read code from: "+name, ioe);
  24.         }
  25.         finally {
  26.             try { fis.close(); } catch (Exception e) { }
  27.         }
  28.     }
  29.  
  30.     public void write(OutputStream os) throws InterpreterException {
  31.  
  32.         try {
  33.             new ObjectOutputStream(os).writeObject(this);
  34.         }
  35.         catch (Exception e) {
  36.             throw new InterpreterException("can't write code");
  37.         }
  38.     }
  39.  
  40.     public void write(String name) throws InterpreterException {
  41.  
  42.         FileOutputStream fos= null;
  43.  
  44.         try {
  45.             write(fos= new FileOutputStream(name));
  46.         }
  47.         catch (IOException ioe) {
  48.             throw new InterpreterException("can't write code to: "+name, ioe);
  49.         }
  50.         finally {
  51.             try { fos.close(); } catch (Exception e) { }
  52.         }
  53.     }
  54. }
  55.  
  56.  
As you can see a Code object is a List that contains Instructions. The class
contains a couple of convenience methods, i.e. it can serialize itself to an
OutputStream or a file and two static methods can read a serialized Code object
from an InputStream or a file again. Besides these methods the Code object just
is a List of Instructions. We'll talk about Instructions in a next part of this
article.

Note that whenever reading or writing fails, the convenience methods nicely wrapthe thrown exception in another InterpreterException and let that one bubble up
to the callers of these methods. If the methods are passed a String it is supposed
to be the name of a file; otherwise the streams are used for reading or writing
but they're not closed. It's up to the caller to take care of that.

That's normal behaviour: if an object 'owns' something, it's responsible for it,
otherwise some other object (the owner) should take responsibility. When a file
name was passed in the Code object is the owner of the stream it creates and it
should close it when done with it.

Of course the read methods are static methods (there is no Code object yet) but
the write methods are not static: a Code object can serialize itself. Last, note
that funny number: it is needed for the Serialization framework in case of
different versions of the Code class.

Concluding remarks

As you might have noticed, Parsers can be reused, i.e. they can parse several
token streams given a Reader; they reuse the same Tokenizer over and over again.
We need this behaviour because a parser stores the information w.r.t. user
defined function and we want to use our parsers in an interactive environment
too where the user enters pieces of the text line by line. Parsers construct
the Tokenizer they use themselve; only classes (object thereof) in the same
package as where the Tokenizer class lives can instantiate a Tokenizer.

There is a nice convenience method available that runs in a one shot mode:
given the name of a file, it opens a stream and parses the entire text. We
also need that behaviour when we want to use the parser as a regular compiler
that compiles files and possibly saves the generated code to a file again.

That was a lot of code in this part of the Compilers article, but it was worth
it. We have implemented the entire parser for our compiler. We still can't do
anything usefull with this little language because there's the Generator to be
implemeented and finally there's an Interpreter; the thing that shows us whether
or not we've done things correctly in the previous stages of the construction of
the compiler system.

I hope you're still with me here after that entire avalanche of code for the
parsers. If you still are: congratulations; you have seen how parsers can be
constructed from formal grammar rules on a one-to-one basis with a bit of
detail sprinkled in. We already made our tokenizer so we can parse source text
and check for syntactical correctness now. The next section of this article
will include the source code we have so far and a bit more. We'll see how those
expressions are compiled to something equivalent to postfix form expressions and
we're going to start working on the instructions that actually perform what we
want them to perform, i.e. evaluate the expressions that make up our program.

See you next week and

kind regards,

Jos
Jun 10 '07 #1
Share this Article
Share on Google+