By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
444,018 Members | 1,204 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

Compilers - 2B: Tokenizers

Expert 10K+
P: 11,448

welcome back to the second part of this week's tip. This article part shows
you the Tokenizer class. Let's do it in small parts:

Expand|Select|Wrap|Line Numbers
  1. public class Tokenizer {
  3.     // default size of a tab character
  4.     private static final int TAB= 8;
  6.     // the current line and its backup
  7.     private String line;
  8.     private String wholeLine;
  10.     // current column and tab size
  11.     private int column;
  12.     private int tab= TAB;
  14.     // the reader used for line reading
  15.     private LineNumberReader lnr;
  17.     // last unprocessed token
  18.     private Token token;
These are the private member variable carried around by the Tokenizer; most
of them don't need any explanation. We use a LineNumberReader that keeps
track of the line numbers (sic). Here are the constructors:

Expand|Select|Wrap|Line Numbers
  1. Tokenizer() { }
  3. public void initialize(Reader r) {
  5.     lnr= new LineNumberReader(r);
  6.     line= null;
  7.     token= null;
  8. }
They're not much to talk about either: there's just one constructor and it isn't
public; this implies that only other classes in the same package can create a
Tokenizer. The parsers live in the same package ; the parsers call the
initialize method after they have instantiated a Tokenizer using the package
scope default constructor. The initialize method wraps the Reader in a
LineNumberReader. The wapper reader is used for actually reading lines from
the input and updates the line numbers.

Here are the two methods that get invoked by the parsers:

Expand|Select|Wrap|Line Numbers
  2. public Token getToken() throws InterpreterException {
  4.     if (token == null) token= read();
  5.     return token;
  6. }
  8. public void skip() { token= null; }
If there is no token read already we read a new one, otherwise we simply return
the same token that was read before. The 'skip()' method signals the Tokenizer
that the current token has been processed so the first method will read a new
token again on the next invocation.

Next are a few 'getters':

Expand|Select|Wrap|Line Numbers
  1. public String getLine() { return wholeLine; }
  3. public int getLineNumber() { return lnr.getLineNumber(); }
  5. public int getColumn() { return column; }
  7. public int getTab() { return tab; }
  9. public void setTab(int tab) { if (tab > 0) tab; }
The 'setTab()' method does a bit of sanity checking and the 'getters' simply
return what they're supposed to return. Let's get to the private parts of this
class where more interesting things happen. When a token has been read from a
line the 'column' value needs to be updated. The 'tab' variable contains the
value of the tabstop size so we have to take it into account when we update
the 'column' value:

Expand|Select|Wrap|Line Numbers
  1. private void addColumn(String str) {
  3.     for (int i= 0, n= str.length(); i < n; i++)
  4.         if (str.charAt(i) != '\t') column++;
  5.         else column= ((column+tab)/tab)*tab;
  6. }
When a character isn't a tab character we simply increment the 'column' value,
otherwise we determine the next tabstop value.

The following method checks whether or not a next line needs to be read: if the
current line is null or contains just spaces we need to read a next line. If
there's nothing more to read we set the backup of the line to '<eof>' just in
case something wants to display what has just been read. Displaying 'null' just
looks silly. This method resets the 'column' value again because the start of
the new line is to be scanned for new tokens:

Expand|Select|Wrap|Line Numbers
  2. private String readLine() throws IOException {
  4.     for (; line == null || line.trim().length() == 0; ) {
  5.         column= 0;
  6.         if ((wholeLine= line= lnr.readLine()) == null) {
  7.             wholeLine= "<eof>";
  8.             return null;
  9.         }
  10.     }
  12.     return line;
  13. }
The following method is the heart of the Tokenizer, i.e. it attempts to match
the current line against a 'Matcher'. A Matcher is the counterpart of a Pattern
object: a Pattern compiles a regular expression while a Matcher tries to match
that pattern against a String. The String is our current 'line'. If a match is
found the matched prefix is chopped from the 'line' and the 'column' value is
updated; finally the matched token (if any) is returned. Here is the method:

Expand|Select|Wrap|Line Numbers
  1. private String read(Matcher m) {
  3.     String str= null;
  5.     if (m.find()) {
  6.         str= line.substring(0, m.end());
  7.         line= line.substring(m.end());
  9.         addColumn(str);
  10.     }
  12.     return str;
  13. }
The last method of the Tokenizer class is the largest method, but it's a bit
of a boring method. All it does is trying to find tokens using different
regular expressions in the order explained in the first part of this week's tip.
Here's the method:

Expand|Select|Wrap|Line Numbers
  1. private Token read() throws InterpreterException { 
  3.     String str;
  5.     try {
  6.         if (readLine() == null)
  7.             return new Token("eof", TokenTable.T_ENDT);
  9.         read(TokenTable.spcePattern.matcher(line));
  11.         if ((str= read(TokenTable.numbPattern.matcher(line))) != null)
  12.             return new Token(Double.parseDouble(str));
  14.         if ((str= read(TokenTable.wordPattern.matcher(line))) != null)
  15.             return new Token(str, TokenTable.T_NAME);
  17.         if ((str= read(TokenTable.sym2Pattern.matcher(line))) != null)
  18.             return new Token(str, TokenTable.T_TEXT);
  20.         if ((str= read(TokenTable.sym1Pattern.matcher(line))) != null)
  21.             return new Token(str, TokenTable.T_TEXT);
  23.         return new Token(read(TokenTable.charPattern.matcher(line)), TokenTable.T_CHAR);
  24.     }
  25.     catch (IOException ioe) {
  26.         throw new TokenizerException(ioe.getMessage(), ioe);
  27.     }
  28. }
The Patterns for the regular expressions are supplied by the TokenTable class
and all this method does is try to match them in a fixed order. This is all there
is to it w.r.t. lexical analysis for our little language: according to which
pattern had a match a corresponding token is returned. The previous methods take
care that no unprocessed token is skipped and forgotten (it is simply returned
again and again until the Tokenizer is notified that it actually has been
processed). Other methods (see above) take care of reading a next line when
necessary and on the fly the column value is updated.

As you might have noticed a TokenizerException is thrown in one part of
the Tokenizer code. A TokenizerException is a small class that extends the
InterpreterException class. The latter class is more interesting and looks
like this:

Expand|Select|Wrap|Line Numbers
  1. public class InterpreterException extends Exception {
  3.     private static final long serialVersionUID = 99986468888466836L;
  5.     private String message;
  7.     public InterpreterException(String message) { 
  9.         super(message); 
  10.     }
  12.     public InterpreterException(String message, Throwable cause) { 
  14.         super(message, cause); 
  15.     }
  17.     public InterpreterException(Tokenizer tz, String message) { 
  19.         super(message); 
  20.         process(tz);
  21.     }
  23.     public InterpreterException(Tokenizer tz, String message, Throwable cause) { 
  25.         super(message, cause); 
  26.         process(tz);
  27.     }
  29.     private void process(Tokenizer tz) {
  31.         StringBuilder sb= new StringBuilder();
  32.         String nl= System.getProperty("line.separator");
  34.         sb.append("["+tz.getLineNumber()+":"+tz.getColumn()+"] "+
  35.               super.getMessage()+nl);
  36.         sb.append(tz.getLine()+nl);        
  38.         for (int i= 1, n= tz.getColumn(); i < n; i++)
  39.             sb.append('-');
  40.         sb.append('^');
  42.         message= sb.toString();
  43.     }
  45.     public String getMessage() {
  47.         return (message != null)?message:super.getMessage();
  48.     }
  49. }
It looks like most Exceptions, i.e. it has a message and a possible 'cause'
for the Exception (the so called 'root' exception). There's one interesting
method in this InterpreterException: the 'process' method. When a Tokenizer
is passed at construction time of this object it is able to construct a nice
error message when printed out; the error message looks like his:

Expand|Select|Wrap|Line Numbers
  1. [line:column] error mesage
  2. line that has a column with an error
  3. ---------------------^
A StringBuilder is used to concatenate the different parts of the three line
message; the 'nl' variable contains the 'end of line' sequence of the system
this application is running on (any combination of '\r' and '\n' are possible).
and the correct amount of '-' signs are concatenated to make the '^' caret
appear at the correct location under the current line.

The LineNumber reader used by the Tokenizer counts lines starting at 0 (zero),
humans like to start counting at 1 (one) but lucky for us, the first line (0)
is already completely read and the LineNumberReader will return the next line
number for us when we ask for it so there's no need to adjust the value.
Note that column numbers also start at 0, but at least one token on the line has
been read already and the column points to the location in the string just
following the token so no adjustment is needed here either.

When no tokenizer is supplied to this constructor just the message that was
passed to the superclass will be returned, otherwise our nicely crafted
message is returned by the getMessage() method.

This InterpreterException is extensively used by the parsers when they encounter
a syntax error in the token stream. They pass the Tokenizer itself to the new
InterpreterException so whenever possible a nicely formatted error message is
shown when you print out the message of this InterpreterException.

Concluding remarks

I showed and explained quite a bit of code in this week's part of the article.
Try to understand the code and don't hesitate to reply when/if you don't
understand something. Compiler construction is a difficult task and contains
many tricky details. This week showed the simple Token class; the long and
boring Tokenizer class and the InterpreterException class.

In a next tip I'll explain how the table classes are initialized. When that is
over and done with I'll supply some actual code as an attachment so that you
can play and experiment with this all a bit.

The following parts of this article explain the parsers, the complicated parts
of our compiler. The parsers closely collaborate with the simple code generator.
The generator generates code (how surprising!) which can be fed to the last
class: the Intepreter class itself. The generated code consists of Instructions;
they are sometimes complicated but most of the time simple and coherent pieces
of code (sequences of Java instructions) that are activated by our Intepreter.

See you next week and

kind regards,

Nov 21 '07 #1
Share this Article
Share on Google+