Bytes IT Community

Compilers - 5A: Parsers

Greetings,

This week's article part discusses the parsers used for our little language.
We will implement the parsers according to the grammar rules we defined in
the second part of this article. As you will see shortly, the implementation
of the parsers is an almost one-to-one translation of those grammar rules.

Recursive descent parsing

The grammar rules are highly recursive, i.e. one rule mentions another rule,
which mentions yet another rule, and so on, until a rule mentions the first rule
again (look at the definition of a nested expression for an example).

As the implementation follows those grammar rules quite closely, the methods
that make up the parsers are highly recursive too. This way of parsing a
(programming) language is called 'recursive descent parsing'. The methods
figure out what to do according to the current token and 'predict' which
(other) methods to call, until the entire token stream has been parsed.

Note that when using a recursive descent parsing strategy where method A calls
B, B calls C, C calls D and so on, until Z calls method A again, then no matter
how long or short that chain of recursive method calls is, somewhere in that
chain, before the cycle closes, at least one token from the token input stream
has to be 'processed', i.e. the Tokenizer's skip() method has to be invoked at
least once. Otherwise the recursion can go deeper and deeper without anything
happening, i.e. the parser would 'chew' on the same token over and over again.

For the grammar rules this boils down to the fact that no rule name may occur
as the leftmost rule name in a cycle of rules without at least one token
preceding it, e.g.

A : B <rest of rule A>
B : C <rest of rule B>
C : D <rest of rule C>
...
Z : A <rest of rule Z>

... this is not allowed; in more formal terms: no left-recursive grammar rules
are allowed in a recursive descent parsing scenario. I don't want to dig into
the more theoretical aspects of this for now, but disallowing left-recursive
grammar rules was long considered a major drawback of recursive descent parsers,
and that's exactly why other parsing techniques, applied by parser generators
such as Yacc and Bison, were favoured until recently. Just a couple of years ago
a new parsing technique was discovered where left-recursive rules are still not
allowed, but the benefits of that new technique strongly outweigh those of the
formerly most popular techniques applied by Yacc, Bison et al.

If you pay close attention to the grammar rules for our little language and
check all the methods of our parsers, you'll see that at least one token is
consumed before that deadly recursive cycle closes itself.
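As a minimal, self-contained sketch (hypothetical class names, not part of our little language's parser), here is how a non-left-recursive rule like expr : term ( '+' term )* is turned into methods that are guaranteed to consume at least one character before every recursive call:

```java
// Hypothetical mini parser for: expr : term ( '+' term )*
//                               term : digit | '(' expr ')'
// The left-recursive form "expr : expr '+' term" would never consume a
// token before calling itself; the iterative form below always does.
class MiniParser {

    private final String src;
    private int pos;

    MiniParser(String src) { this.src = src; }

    int expr() {
        int value = term();                   // term consumes >= 1 char
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++;                            // consume '+'
            value += term();
        }
        return value;
    }

    private int term() {
        if (src.charAt(pos) == '(') {
            pos++;                            // consume '('
            int value = expr();               // safe: '(' was consumed first
            pos++;                            // consume ')'
            return value;
        }
        return src.charAt(pos++) - '0';       // consume a digit
    }
}
```

Every cycle expr, term, expr passes through the '(' branch, which consumes the parenthesis before recursing; the recursion therefore always makes progress.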

There's no need to be afraid of recursion: it comes quite naturally with those
grammar rules. The implementation of the parser consists of three sub-parsers
and a base class thereof:

1) Parser: implements the top level rules of the grammar;
2) DefinitionParser: implements the definition rules of the grammar;
3) ExpressionParser: implements the rules that define an expression.

The first parser class does the divide and conquer job: it checks whether the
current token introduces a user function definition or declaration, or whether
it introduces the start of an expression. When the entire token stream has been
parsed without errors, it checks whether all user functions have actually been
defined.

The second parser class parses a user function definition or declaration.
For the body of a function it uses an instance of the third parser:

The third parser class will be the largest part of our parser; it parses
an entire expression, no matter how complex that expression is.

All parser objects use a Generator object that is able to generate code for the
parsers; a sequence of instructions makes up the compiled code. When the parsing
and code generation is done, this code can be passed to an Interpreter that
interprets the instructions.

The three parsers are derived classes of the following abstract class:

The base parser class

public abstract class AbstractParser {

    protected Tokenizer tz;
    protected User user;

    public AbstractParser(Tokenizer tz, User user) {

        this.tz= tz;
        this.user= user;
    }
The abstract class takes a Tokenizer and a User object in its constructor. The
Tokenizer delivers a stream of tokens (see part two of this article). The User
object basically is a Map that maps the names of user defined functions to the
instructions that represent the user defined functions.

This abstract class implements a few methods used by the derived classes. When
the Tokenizer scans a name, all it can tell is that it has scanned a name, i.e.
a Tokenizer doesn't know what that name represents; it could be a variable name,
a built-in function name of some sort, a user defined function or a keyword,
but what does the Tokenizer know? A method in the abstract base class takes
care of that:

protected int type(Token token) {

    int type= token.getTyp();

    if (type != TokenTable.T_NAME) return type;

    String str= token.getStr();

    if (ParserTable.funcs.contains(str)) return ParserTable.T_FUNC;
    if (ParserTable.quots.contains(str)) return ParserTable.T_QUOT;
    if (ParserTable.rword.contains(str)) return ParserTable.T_WORD;

    if (user.containsKey(str)) return ParserTable.T_USER;

    return type;
}
If the token type isn't a name, this method simply returns that type unchanged;
otherwise one of the tables in the ParserTable or the User map can tell what
that name actually is: a built-in function, a quoted object, a reserved word or
a user defined function.
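The dispatch can be sketched in isolation; the set contents below are made-up stand-ins for the ParserTable tables, only the lookup order mirrors the type() method above:

```java
import java.util.Set;

// Hypothetical stand-in for the name classification in type(): the
// tokenizer only reports "a name"; these lookup tables decide what kind
// of name it is. The set contents are assumptions for this sketch.
class NameType {

    static final Set<String> FUNCS = Set.of("sin", "cos", "sqrt");   // built-ins
    static final Set<String> QUOTS = Set.of("if");                   // quoted objects
    static final Set<String> WORDS = Set.of("function", "listfunc"); // reserved words

    static String classify(String name, Set<String> userFuncs) {
        if (FUNCS.contains(name)) return "T_FUNC";
        if (QUOTS.contains(name)) return "T_QUOT";
        if (WORDS.contains(name)) return "T_WORD";
        if (userFuncs.contains(name)) return "T_USER";
        return "T_NAME";                     // an ordinary (variable) name
    }
}
```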

Two 'special' objects

Quoted objects are syntactically equivalent to built-in functions, but they
evaluate their arguments themselves. This mechanism is used by one quoted
object, the 'if' quoted object. We have already seen an example of the 'if'
quoted object at the end of the grammar part of this article:

function fac(n) =
   if (n < 2
      , 1
      , n*fac(n-1)
   );
The 'if' object evaluates its first argument, 'n < 2'. If the expression is
true, the 'if' object evaluates its second argument; otherwise it evaluates its
third argument. A more detailed explanation of these objects will be given
when we discuss code generation and the Instruction set of classes. Maybe later
we'll develop other quoted objects, such as a 'while' object perhaps?
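To see why 'if' has to evaluate its arguments itself, here is a hypothetical plain-Java sketch (the real interpreter uses the Instruction classes discussed later). If both branches were evaluated eagerly, like ordinary function arguments, the recursive branch of fac would recurse before 'if' could stop it; passing the branches unevaluated defers that choice:

```java
import java.util.function.Supplier;

// Hypothetical sketch: a quoted 'if' receives its branches unevaluated
// and only evaluates the branch that is actually taken.
class QuotedIf {

    static long ifQuoted(boolean cond, Supplier<Long> thenB, Supplier<Long> elseB) {
        return cond ? thenB.get() : elseB.get();   // evaluate one branch only
    }

    static long fac(long n) {
        // With eager arguments, n * fac(n - 1) would be computed even
        // when n < 2, and the recursion would never reach its base case.
        return ifQuoted(n < 2, () -> 1L, () -> n * fac(n - 1));
    }
}
```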

The base parser class again

The abstract base class implements a few more small handy methods that are
used by the derived parser classes:

protected boolean expect(String token) throws InterpreterException {

    if (token.equals(tz.getToken().getStr())) {
        tz.skip();
        return true;
    }

    return false;
}

protected void skipDemand(String str) throws InterpreterException {

    tz.skip();
    demand(str);
}

protected void demand(String str) throws InterpreterException {

    if (!expect(str))
        throw new ParserException(tz, "expected: "+str);
}
The parsers use the 'expect' method to figure out what to do next; this method
returns true if the current token equals the expected token (and skips it);
otherwise it returns false. The second and third methods are called when the
parser already knows what it wants to parse. Suppose the name of a built-in
function has just been scanned by the tokenizer; the parser 'knows' that a left
parenthesis should follow that name. For such situations the second or third
method is called: when a previous token should be skipped first, the second
method is called, otherwise the third method is called.
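The difference between probing ('expect') and requiring ('demand') can be sketched with a hypothetical cursor over a plain list of token strings (not the article's Tokenizer):

```java
import java.util.List;

// Hypothetical token cursor: expect() probes and consumes on a match,
// demand() consumes the token or fails with an error.
class Cursor {

    private final List<String> tokens;
    private int pos;

    Cursor(List<String> tokens) { this.tokens = tokens; }

    String current() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }

    boolean expect(String t) {          // optional token: skip it if present
        if (t.equals(current())) { pos++; return true; }
        return false;
    }

    void demand(String t) {             // mandatory token: skip it or fail
        if (!expect(t))
            throw new IllegalStateException("expected: " + t);
    }
}
```

After a function name, for example, the parser can demand("(") because the grammar leaves no alternative, while expect(",") merely checks whether more arguments follow.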

The other derived parser objects base their decisions on the methods defined
in this base class. The next paragraphs describe the derived parser classes.

The main parser

Here's that first grammar rule again:

program: ((definition | expression) ';')*
The main parser object checks what to parse: a definition or an expression.
Here's the class:

public class Parser extends AbstractParser {

    public static Code parse(String name) throws InterpreterException {

        FileReader fr= null;

        try {
            Parser parser= new Parser();
            return parser.parse(fr= new FileReader(name));
        }
        catch (InterpreterException ie) {
            throw ie;
        }
        catch (Exception e) {
            throw new InterpreterException("can't compile: "+name, e);
        }
        finally {
            try { fr.close(); } catch (Exception e) { }
        }
    }


    private ExpressionParser expressionParser;
    private DefinitionParser definitionParser;

    public Parser() {

        super(new Tokenizer(), new User());
        expressionParser= new ExpressionParser(tz, user);
        definitionParser= new DefinitionParser(tz, user);
    }
The first (static) method is a convenience method: it takes a file name as its
parameter and tries to open the file for reading; it feeds the reader to a new
Parser and returns the result. This method comes in handy when you want to
compile source texts from files.

The constructor of this class takes no parameters; a Tokenizer is constructed
and two more parsers are constructed: an ExpressionParser that can parse an
expression (what a surprise!) and a DefinitionParser that can parse a definition
(also a surprise?).

It passes the freshly created Tokenizer and User object around to the other
two parsers because they will need them.

Have a look at the second grammar rule:

definition: ( 'function' | 'listfunc' ) 'name' defordecl
A definition starts with a reserved word: either 'function' or 'listfunc'.
If the current token equals neither of these, the parser 'predicts' that an
expression must be parsed; here's how it does it:

public Code parse(Reader r) throws InterpreterException {

    tz.initialize(r);
    Generator gen= new Generator();

    for (Token token= tz.getToken(); 
         token.getTyp() != TokenTable.T_ENDT; 
         token= tz.getToken()){

        if (expect("function")) 
            definitionParser.parse(gen, false);
        else if (expect("listfunc"))
            definitionParser.parse(gen, true);
        else
            expressionParser.parse(gen);

        demand(";");
    }
    user.checkDefinitions();

    return gen.getCode();
}
The 'parse' method returns the code that has been generated on the fly. First
this method creates a new Generator and initializes the Tokenizer with the
Reader that was passed to it.

The for loop looks complicated but it isn't: it simply keeps on looping until
an end of file token has been read. Note that the token stream can be completely
empty in which case the loop stops immediately; compare this behaviour with the
first grammar rule.

In the body of the loop it is determined whether one of the two reserved words
has been scanned. If so, the DefinitionParser is used with the appropriate
arguments: the code Generator and a flag telling whether the 'function' or the
'listfunc' reserved word was scanned.

If neither of these two reserved words was scanned, it is assumed that an
expression should be parsed and the ExpressionParser is used. At the end the
parser demands a semicolon in the token stream; see the first grammar rule
above and compare the body of this method with that grammar rule: they are
almost the same. The grammar rule describes the syntax of a token stream and
this method does exactly the same.

When an end of file token has been scanned the User object is asked to check
whether or not all user defined functions are defined. It'll throw an exception
if at least one of the user defined functions has just been declared but not
defined. If all is well, this method returns the code that has been generated
during this successful parse.

This method can throw an InterpreterException; it doesn't throw such an
exception itself, but the 'demand' method (see above) may throw one, or one
of the other parsers can throw one when something syntactically incorrect
happens. The parsers actually throw a ParserException, which is a class
derived from InterpreterException.

I added one more convenience method to this class; we're going to use it
later. It parses just one single expression; here it is:

public Code expression(Reader r) throws InterpreterException {

    tz.initialize(r);
    Code code= expressionParser.binaryExpression(new Generator());

    if (tz.getToken().getTyp() != TokenTable.T_ENDT)
        throw new ParserException(tz, "expected: <eof>");

    return code;
}
This method expects a single expression; the token input stream must be empty
when the entire expression has been parsed. The compiled code for the expression
is returned at the end. Later, when we're going to build a simple expression
evaluator we'll need this method.

The definition parser

When one of the reserved words 'function' or 'listfunc' has been scanned,
the DefinitionParser must parse the declaration or definition. Here are the
grammar rules again that describe the syntax of such a definition or declaration:

definition: ( 'function' | 'listfunc' ) 'name' defordecl
defordecl: ('(' paramlist ')' ( '=' expression )?) | ( '=' expression )
paramlist: ( 'name' ( ',' 'name' )* )?
Note that the reserved word has already been scanned and skipped by the Parser
object (see above). A boolean parameter passed to the DefinitionParser tells
which kind of function needs to be parsed; not that they differ syntactically,
though.

Here's the first part of the DefinitionParser:

public class DefinitionParser extends AbstractParser {

    public DefinitionParser(Tokenizer tz, User user) { super(tz, user); }
This class takes a Tokenizer as well as a User object as its constructor parameters.
This class also extends the AbstractParser abstract class. Here is the 'parse'
method that gets called by the Parser object:

public void parse(Generator gen, boolean asList) throws InterpreterException { 

    Token token= tz.getToken();
    String name= token.getStr();
    int type= type(token);

    if (type != TokenTable.T_NAME  && type != ParserTable.T_USER)
        throw new ParserException(tz, 
            "user function name expected: "+name);

    tz.skip();
The first token that follows the keyword scanned by the Parser class must be
a T_NAME or a T_USER token. If the user defined function wasn't declared yet,
the token type will be T_NAME, otherwise it's T_USER. An exception is thrown
if another type of token is scanned; otherwise the token is considered
'processed' and skipped.

    UserInstruction func= user.get(name);

    if (func != null) {
        if (func.getBody() != null)
            throw new ParserException(tz, 
                "user function already defined: "+name);
        else {
            demand("=");
            func.setBody(parseBody());
        }
    }
    else {
        List<String> params= parseArgs();
        user.put(name, func= 
            new UserInstruction(name, params, null, asList));
        if (expect("=")) func.setBody(parseBody());
    }
}
The User object is consulted to see whether a user function was already
declared with that name. The outermost 'if' statement decides what to do:
either it was declared already, and if it was defined too it's an error and an
exception is thrown; otherwise the DefinitionParser demands to see an equal
sign followed by the body of the function.

If the user defined function wasn't even declared before, the outermost 'else'
part is executed: the parameters of the function are parsed and the declaration
is registered in the User object. Next, if the expectation of an equals sign
was met, the body of the function is parsed as well, in which case a function
definition was just parsed; otherwise, if no equals sign was seen in the token
stream, a function declaration was just parsed.
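The declare-then-define bookkeeping can be sketched as follows (hypothetical names; the real code stores UserInstruction objects, not strings): a declared-but-undefined function has a null body, defining it twice is an error, and checkDefinitions() corresponds to verifying that no null bodies remain:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the User object's bookkeeping: a name maps to
// a body; a declaration stores null until the definition fills it in.
class UserFuncs {

    private final Map<String, String> bodies = new HashMap<>();

    void declare(String name) { bodies.putIfAbsent(name, null); }

    void define(String name, String body) {
        if (bodies.get(name) != null)        // this name already has a body
            throw new IllegalStateException("user function already defined: " + name);
        bodies.put(name, body);              // declare-and-define in one go works too
    }

    boolean allDefined() { return !bodies.containsValue(null); }
}
```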

Observe that in both cases (a tentative definition or a normal definition) the
generated code for the user defined function body is passed to that mysterious
UserInstruction; it knows how to handle it. The UserInstruction and the name
of the user defined function are passed to the User object, which stores them
in a Map<String, Instruction> for later retrieval. The formal parameter list
is also passed to the UserInstruction, as you can see. We'll discuss those
Instructions in a later article part.

Also observe that when the declaration and the definition of a user defined
function are supplied in two parts, as in:

listfunc foo(x);
function foo = x+42;
it's the declaration part that tells the parser that foo is a listfunc; the
definition can start with either reserved word, 'function' or 'listfunc'; it
doesn't matter which one is used.

Parsing a formal parameter list is quite a bit of work actually; the following
method produces a List<String> which represents the names of the formal
parameters. According to the grammar rules (see above) the names of the formal
parameters are separated by commas and the entire list is enclosed in
parentheses. Here's how the formal parameter list is parsed:

private List<String> parseArgs() throws InterpreterException {

    demand("(");

    List<String> params= new ArrayList<String>();

    for (Token param; 
         this.type(param= tz.getToken()) == TokenTable.T_NAME; ) {

        tz.skip();
        params.add(param.getStr());

        if (!expect(",")) break;
        if (tz.getToken().getTyp() != TokenTable.T_NAME)
            throw new ParserException(tz, "parameter name expected");
    }

    demand(")"); 
    return params;
}
First the method demands a left parenthesis. The for loop keeps on looping as
long as names are scanned by the Tokenizer; each name token is skipped and its
string is added to the List<String>. Two things can follow a name: either a
comma followed by another name, or something else. If no comma follows the
parameter name, the loop stops.

After the loop has finished this method demands a right parenthesis in the token
stream. Finally if all went fine, the List<String> representation of the formal
parameters is returned.
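The same comma-separated-list pattern can be sketched over a plain token list (a hypothetical helper, mirroring parseArgs' behaviour, including the trailing-comma error):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the parameter list pattern:
// '(' ( name ( ',' name )* )? ')' -- a trailing comma is an error.
class ParamList {

    static List<String> parse(List<String> tokens) {
        int pos = 0;
        if (!"(".equals(tokens.get(pos++)))
            throw new IllegalStateException("expected: (");

        List<String> params = new ArrayList<>();
        while (!")".equals(tokens.get(pos))) {
            params.add(tokens.get(pos++));        // a parameter name
            if (!",".equals(tokens.get(pos)))     // no comma: the list ends here
                break;
            pos++;                                // consume the comma
            if (")".equals(tokens.get(pos)))      // a comma needs a name after it
                throw new IllegalStateException("parameter name expected");
        }
        if (!")".equals(tokens.get(pos)))
            throw new IllegalStateException("expected: )");
        return params;
    }
}
```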

We've seen most of the DefinitionParser; there's one method left to be
explained: here's how the body of the function is parsed. The body of a user
defined function is an expression, and that's exactly how it is parsed:

private Code parseBody() throws InterpreterException {

    return new ExpressionParser(tz, user).parse(new Generator());
}
This method simply instantiates another ExpressionParser and asks it to parse
an expression. Whatever happens in that other method, either a compiled
expression is returned or, when something went wrong, an exception is thrown.
This method is a fine example of recursive descent parsing.

An alternative would have been that the Parser object had passed its own
ExpressionParser object to this object where the parseBody() method could have
used it for parsing the body of the user defined function. It would have needed
a bit more parameter passing but it would have avoided the creation of another
instance of the ExpressionParser class when a function body needs to be parsed.
I don't know which of the alternatives would be better.

Intermezzo

We're almost ready; there's just the ExpressionParser to be implemented and
explained. The ExpressionParser is the largest class of all the parser classes
and it deserves its own article part (see below). There are still a few loose
ends to be explained: how does that code Generator work? How does that User
object do its job? What are these Instruction things?

But first try to understand this article part and the recursive descent technique
that is used for the implementation of these parser classes. When you're ready
I'll see you back in the next part of the article where the ExpressionParser
will be explained in detail.

kind regards,

Jos
Nov 21 '07 #1