We have work to do today: we are going to design and implement the first parts
of our little compiler.
Tokens
We start with the tokenization part; the tokens recognized by our little
language are a bit like C or Java tokens: we want to recognize the following
binary operators: + - * / ^ (the last one stands for 'raised to the power'),
and
comparison operators: < <= == != >= > as well as some unary operators:
! + - ++ --
There are some more characters we want to recognize: = ( and )
On top of that we want to recognize identifiers (or 'names') just as in
Java or C and we want to recognize floating point numbers.
All other characters are not actually 'recognized' but simply passed on 'as is'
to the parser. There's one exception: we want to recognize the fact that all
the text has been read and tokenized: the end of text condition.
White spaces are ignored in our simple language. The tokenizer also has to
keep track of the current line number and column number because we want to
synthesize nice error messages just as the javac compiler does.
Basically the tokenizer groups tokens in the following way:
// different types of tokens:
static final int T_ENDT= -1; // end of stream reached
static final int T_CHAR=  0; // an ordinary character
static final int T_NUMB=  1; // a number
static final int T_TEXT=  2; // a recognized token
static final int T_NAME=  3; // an identifier
The standard java.util.Scanner class might seem a natural fit here; alas we
can't use it because a Scanner works based on delimiters, i.e. tokens are
separated by delimiters (which you can set yourself if you like). Our
language doesn't have delimiters, i.e. we don't want to write 'a + b'
all the time when we simply want to write 'a+b'. Our tokens are delimited by
changing groups of characters and we don't know in advance what our tokenizer
is going to read from the input stream, so we don't know which delimiters to use.
We take a different approach: we simply read the input line by line and use
regular expressions to figure out what the first possible token can be. If
we've found one, we'll chop it off from the line and return the token. If there
are no more tokens to read we simply read the next line and repeat the process.
There's a little trap here: suppose the first character in the stream is a '+'
or '>' or '<' or '=' sign? Can we just decide that that single character makes
up a token or should we check the next character too? If the stream contains
++ or >= or <= or == then we should indeed check the next character.
The following simple rule avoids this little trap:
We check for tokens in the following order:
- check for an end of stream state; if found we return a T_ENDT token;
- check for spaces, if found we ignore them and continue;
- check for a number, if found return a T_NUMB token;
- check for an identifier, if found return a T_NAME token;
- check for two character tokens, if found return a T_TEXT token;
- check for single character tokens, if found return a T_TEXT token;
- all else fails return a single character T_CHAR token;
When a token of length L is found, we chop it off from the current line and
increment the column count by L. We keep a backup of the
current line in case an error needs to be displayed. Note that if we've
just read spaces (see step 2) incrementing the column count is a bit more
complex: the spaces could be tabs and one tab visually represents another
number of character positions, depending on the tab-stop size.
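The steps above can be sketched in Java like this; the pattern strings and the T_* values mirror the article, but the class and method names (MatchSketch, chop, next) are placeholders of my own, not the article's final Tokenizer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A sketch of the matching order listed above; each line of input is
// consumed from the front, one token at a time.
class MatchSketch {
    static final int T_ENDT= -1, T_CHAR= 0, T_NUMB= 1, T_TEXT= 2, T_NAME= 3;

    static final Pattern SPACE  = Pattern.compile("^\\s*");
    static final Pattern NUMBER = Pattern.compile("^\\d+(\\.\\d*)?([eE][+-]?\\d+)?");
    static final Pattern WORD   = Pattern.compile("^[A-Za-z_]\\w*");
    static final Pattern SYMBOL2= Pattern.compile("^(==|<=|>=|!=|\\^=|\\+=|-=|\\*=|/=|\\+\\+|--)");
    static final Pattern SYMBOL1= Pattern.compile("^[=!<>+*/()^-]");

    private String line;
    private int column; // incremented as tokens are chopped off

    MatchSketch(String line) { this.line= line; }

    // try one pattern: on success chop the matched text off the line,
    // advance the column count and return the matched text
    private String chop(Pattern pattern) {
        Matcher m= pattern.matcher(line);
        if (!m.find() || m.group().isEmpty()) return null;
        String match= m.group();
        line= line.substring(match.length());
        column+= match.length(); // tabs would need extra care here (step 2)
        return match;
    }

    // return the type of the next token, following the order described above
    int next() {
        chop(SPACE);                          // ignore spaces and continue
        if (line.isEmpty()) return T_ENDT;    // end of this line's input
        if (chop(NUMBER)  != null) return T_NUMB;
        if (chop(WORD)    != null) return T_NAME;
        if (chop(SYMBOL2) != null) return T_TEXT;
        if (chop(SYMBOL1) != null) return T_TEXT;
        line= line.substring(1); column++;    // all else fails: a single
        return T_CHAR;                        // ordinary character
    }
}
```

Feeding it 'a+b 3.14<= $' yields a name, a text token, a name, a number, a text token, an ordinary character and finally the end-of-text marker, exactly as the rules prescribe.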
The tokenizer is invoked by a parser. When the tokenizer has read a token
that has not been processed by the parser, the tokenizer will return the
same token over and over again, i.e. the parser has to explicitly tell the
tokenizer that the last token has been processed and a new token should be
read from the character input stream.
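That contract can be sketched as follows; the class and method names here are hypothetical (the article's real Tokenizer appears in the next part), and a fixed token array stands in for the character input stream:

```java
// Sketch of the tokenizer/parser contract described above: nextToken()
// keeps returning the same token until the parser calls flush().
class PushbackSketch {
    private String[] tokens; // stand-in for the real character stream
    private int index;       // next token to hand out
    private String current;  // token handed out but not yet processed

    PushbackSketch(String... tokens) { this.tokens= tokens; }

    // returns the same token over and over again until flush() is called
    String nextToken() {
        if (current == null)
            current= (index < tokens.length)? tokens[index++] : "<EOT>";
        return current;
    }

    // the parser tells us the last token has been processed
    void flush() { current= null; }
}
```

This one-token pushback is all a recursive descent parser needs: it can peek at the current token as often as it likes and only consumes it when it has decided what to do with it.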
The Token class
First let's build a simple Token class; instances of this class are returned
by the tokenizer; here it is:
package compiler;
public class Token {
    public static final Token ONE= new Token(1);
    private double dbl; // filled if token is a number
    private String str; // String representation of a token
    private int typ;    // the type of this token
    // ctor for a double token:
    public Token(double dbl) { this(dbl, ""+dbl, TokenTable.T_NUMB); }
    // ctor for a non-double token:
    public Token(String str, int type) { this(0.0, str, type); }
    // general ctor:
    public Token(double dbl, String str, int typ) {
        this.dbl= dbl;
        this.str= str;
        this.typ= typ;
    }
    // mark this token as 'special':
    public Token mark() {
        str+= "@";
        return this;
    }
    // getters:
    public double getDbl() { return dbl; }
    public String getStr() { return str; }
    public int getTyp() { return typ; }
    // token string and type as string:
    public String toString() { return str+"["+typ+"]"; }
}
A Token stores the double value that was read by the tokenizer (if any),
the String representation of the token and the type of the token. The
String value is the first part of the line that is chopped off by the
tokenizer. The Token class implements simple accessors (the 'getters')
and it can print itself. It also implements a few convenient constructors.
It's a simple and clean little class. Note that the first constructor
refers to a 'TokenTable' class. That class just contains static data
needed by the tokenizer and the little Token class.
The Token class has one public static member which can be useful: the constant ONE.
There's one method that needs a bit more explanation, the 'mark' method.
It is used by the parser to indicate that the current token is a bit 'special'
and it marks the token as such. The parser uses it for the overloaded + and
- operators because they can represent a binary addition or subtraction as
well as a unary plus or minus operator. We'll get back to that trickery later
when we dig into the parser section of this simple compiler.
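A tiny demonstration of mark() may help; TokenTable and Token below are trimmed copies of what the article shows, just enough to make the demo compile on its own:

```java
// Trimmed stand-ins for the article's classes, for this demo only:
class TokenTable { static final int T_NUMB= 1; static final int T_TEXT= 2; }

class Token {
    private double dbl; private String str; private int typ;
    Token(double dbl) { this(dbl, ""+dbl, TokenTable.T_NUMB); }
    Token(String str, int typ) { this(0.0, str, typ); }
    Token(double dbl, String str, int typ) { this.dbl= dbl; this.str= str; this.typ= typ; }
    Token mark() { str+= "@"; return this; }
    public String toString() { return str+"["+typ+"]"; }
}

class MarkDemo {
    public static void main(String[] args) {
        Token minus= new Token("-", TokenTable.T_TEXT);
        System.out.println(minus);        // prints: -[2]
        // the parser marks the token when this '-' denotes a unary minus:
        System.out.println(minus.mark()); // prints: -@[2]
    }
}
```

The '@' suffix turns the marked minus into a different String, so the parser can later tell a unary '-@' apart from a binary '-' by the token's String value alone.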
Regular Expression resources
We'll discuss the TokenTable (and other table classes) later. The table
classes contain data needed by the more interesting parts of this project.
One notable detail is that the table classes are populated using external
properties files. This makes it easy to play a bit with the separate parts
(the tokenizer and parser(s)) and augment them at will.
One of the properties files is the 'tokens.properties' file which contains
the regular expressions needed by the tokenizer; here it is:
space   = ^\\s*
number  = ^\\d+(\\.\\d*)?([eE][+-]?\\d+)?
word    = ^[A-Za-z_]\\w*
symbol2 = ^(==|<=|>=|!=|\\^=|\\+=|-=|\\*=|/=|\\+\\+|--)
symbol1 = ^[=!<>+*/()^-]
char    = ^\\S
The double backslashes are needed because when a Properties object is read,
it translates double backslashes to single backslashes. Note that the '^'
in the ^= token must itself be escaped (\\^=), otherwise the Pattern class
treats it as a start-of-line anchor instead of a literal caret.
This gibberish is all explained in great detail in the API documentation
of the Pattern class. Here's a short outline: the regular expression on the
first line reads ^\s* which means: at the start of the line we want zero
or more \s characters. The \s character class is special to the Pattern
class and means: any space character (including tabs).
The second line is more complicated; it reads: a number is one or more digits
(the \d+ part), optionally followed by a literal dot and zero or more digits
(the (\.\d*)? part). Next, optionally, a lowercase or uppercase 'e' or 'E' is
read, followed (optionally) by a + or - sign, followed by one or more digits
(the ([eE][+-]?\d+)? part).
The third line represents a name: any letter (uppercase or lowercase) or
underscore followed by zero or more of the same including digits.
The fourth line is simply a list of choices where each choice is one of our
two character tokens.
The fifth line is a list of single character tokens and the last line ^\S
means: any non-space character at the start of the line.
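A few quick checks make these patterns concrete; the strings below are the regexes exactly as Pattern.compile receives them after the Properties loader has reduced the double backslashes, and the class and method names are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Try each pattern against a sample line and show what it would chop off.
class PatternDemo {
    // return the first match of regex in line, or null if there is none
    static String firstMatch(String regex, String line) {
        Matcher m= Pattern.compile(regex).matcher(line);
        return m.find()? m.group() : null;
    }

    public static void main(String[] args) {
        // the number pattern accepts plain, fractional and exponent forms:
        System.out.println(firstMatch("^\\d+(\\.\\d*)?([eE][+-]?\\d+)?", "3.14e-2+x")); // 3.14e-2
        // the two character pattern is tried before the single character one:
        System.out.println(firstMatch("^(==|<=|>=|!=|\\^=|\\+=|-=|\\*=|/=|\\+\\+|--)", "<=1")); // <=
        System.out.println(firstMatch("^[=!<>+*/()^-]", "<1")); // <
        // a name: a letter or underscore followed by word characters:
        System.out.println(firstMatch("^[A-Za-z_]\\w*", "foo_1+2")); // foo_1
    }
}
```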
Compare these six lines with the paragraph above which lists the order in
which possible tokens are considered.
Because we chop off the recognized beginning of the line, the rest of the
line has a new start of the line again. The '^' character in the regular
expressions designates the start of a line and greatly speeds up matching the
regular expressions against a line of text.
The next part of this week's article shows the final Tokenizer class.
See you there.
kind regards,
Jos