Bytes IT Community

Compilers - 2B: Tokenizers

Greetings,

welcome back to the second part of this week's tip. This part shows you the
Tokenizer class. Let's go through it in small pieces:

public class Tokenizer {

    // default size of a tab character
    private static final int TAB= 8;

    // the current line and its backup
    private String line;
    private String wholeLine;

    // current column and tab size
    private int column;
    private int tab= TAB;

    // the reader used for line reading
    private LineNumberReader lnr;

    // last unprocessed token
    private Token token;

These are the private member variables carried around by the Tokenizer; most
of them need no explanation. We use a LineNumberReader, which keeps track of
the line numbers for us. Here is the constructor, together with the initialize
method:

Tokenizer() { }

public void initialize(Reader r) {

    lnr= new LineNumberReader(r);
    line= null;
    token= null;
}

There's not much to talk about here either: there is just one constructor and
it isn't public, which implies that only other classes in the same package can
create a Tokenizer. The parsers live in the same package; they call the
initialize method after they have instantiated a Tokenizer using the
package-scope default constructor. The initialize method wraps the Reader in a
LineNumberReader; the wrapper reader is used for actually reading lines from
the input and keeps the line numbers up to date.
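As a standalone illustration of the wrapping that initialize() performs, here
is a minimal sketch using only the standard java.io classes (the class and
helper-method names are hypothetical, chosen just for this demonstration):

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;

// Minimal sketch of what initialize() does: wrap any Reader in a
// LineNumberReader so line numbers can be reported later.
public class WrapDemo {

    // hypothetical helper: the line number reported after n lines are read
    static int lineNumberAfter(String text, int n) {
        try {
            LineNumberReader lnr = new LineNumberReader(new StringReader(text));
            for (int i = 0; i < n; i++) lnr.readLine();
            return lnr.getLineNumber();
        } catch (IOException ioe) {
            return -1; // cannot happen for an in-memory StringReader
        }
    }

    public static void main(String[] args) {
        System.out.println(lineNumberAfter("first\nsecond\n", 0)); // 0
        System.out.println(lineNumberAfter("first\nsecond\n", 1)); // 1
    }
}
```

Note that the counter starts at 0 and is incremented as line terminators are
consumed; this detail becomes relevant later when error messages are built.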

Here are the two methods that get invoked by the parsers:

public Token getToken() throws InterpreterException {

    if (token == null) token= read();
    return token;
}

public void skip() { token= null; }

If no token has been read yet we read a new one; otherwise we simply return
the token that was read before. The 'skip()' method signals the Tokenizer that
the current token has been processed, so getToken() will read a fresh token on
its next invocation.
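This peek/consume protocol can be shown in miniature with a hypothetical
stand-in class (the names below are mine, not the article's; only the
buffering idea is the same):

```java
import java.util.Iterator;

// Miniature version of the getToken()/skip() protocol: peek() buffers at
// most one item ahead, skip() consumes the buffered item.
public class Lookahead {
    private final Iterator<String> source;
    private String buffered; // the "last unprocessed token"

    Lookahead(Iterator<String> source) { this.source = source; }

    String peek() {
        if (buffered == null && source.hasNext()) buffered = source.next();
        return buffered;
    }

    void skip() { buffered = null; }

    public static void main(String[] args) {
        Lookahead la = new Lookahead(java.util.List.of("a", "b").iterator());
        System.out.println(la.peek()); // a
        System.out.println(la.peek()); // a  (still buffered: not consumed yet)
        la.skip();
        System.out.println(la.peek()); // b
    }
}
```

A parser can thus look at the same token as often as it likes while deciding
what to do, and only commits to it by calling skip().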

Next are a few 'getters':

public String getLine() { return wholeLine; }

public int getLineNumber() { return lnr.getLineNumber(); }

public int getColumn() { return column; }

public int getTab() { return tab; }

public void setTab(int tab) { if (tab > 0) this.tab= tab; }

The 'setTab()' method does a bit of sanity checking and the 'getters' simply
return what they're supposed to return. Let's get to the private parts of this
class where more interesting things happen. When a token has been read from a
line the 'column' value needs to be updated. The 'tab' variable contains the
value of the tabstop size so we have to take it into account when we update
the 'column' value:

private void addColumn(String str) {

    for (int i= 0, n= str.length(); i < n; i++)
        if (str.charAt(i) != '\t') column++;
        else column= ((column+tab)/tab)*tab;
}

When a character isn't a tab character we simply increment the 'column' value,
otherwise we determine the next tabstop value.
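The tab-stop expression is worth a small worked example. The formula
((column+tab)/tab)*tab rounds 'column' up to the next multiple of 'tab'; the
sketch below reuses exactly that formula in a standalone helper (the class and
method names are hypothetical):

```java
public class TabDemo {
    // same arithmetic as addColumn(): advance 'column' over the characters
    // of 'str', jumping to the next multiple of 'tab' on a tab character
    static int advance(int column, String str, int tab) {
        for (int i = 0, n = str.length(); i < n; i++)
            if (str.charAt(i) != '\t') column++;
            else column = ((column + tab) / tab) * tab;
        return column;
    }

    public static void main(String[] args) {
        System.out.println(advance(3, "\t", 8)); // 8: next tab stop after column 3
        System.out.println(advance(8, "\t", 8)); // 16: already at a stop, jump a full tab
        System.out.println(advance(0, "abc", 8)); // 3: ordinary characters count one each
    }
}
```

For column 3 and tab 8 the integer division gives ((3+8)/8)*8 = 1*8 = 8; for
column 8 it gives ((8+8)/8)*8 = 2*8 = 16, so a tab always advances at least
one position.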

The following method checks whether or not a next line needs to be read: if
the current line is null or contains just whitespace we need to read the next
line. If there's nothing more to read we set the backup of the line to '<eof>',
just in case something wants to display what has just been read; displaying
'null' would just look silly. This method also resets the 'column' value
because the start of the new line is to be scanned for new tokens:

private String readLine() throws IOException {

    for (; line == null || line.trim().length() == 0; ) {
        column= 0;
        if ((wholeLine= line= lnr.readLine()) == null) {
            wholeLine= "<eof>";
            return null;
        }
    }

    return line;
}

The following method is the heart of the Tokenizer, i.e. it attempts to match
the current line against a 'Matcher'. A Matcher is the counterpart of a Pattern
object: a Pattern compiles a regular expression while a Matcher tries to match
that pattern against a String. The String is our current 'line'. If a match is
found the matched prefix is chopped from the 'line' and the 'column' value is
updated; finally the matched token (if any) is returned. Here is the method:

private String read(Matcher m) {

    String str= null;

    if (m.find()) {
        str= line.substring(0, m.end());
        line= line.substring(m.end());

        addColumn(str);
    }

    return str;
}

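Because find() would happily match anywhere in the string while the method
chops from position 0, this scheme relies on the patterns being anchored at
the start of the line. Here is a small standalone illustration of such prefix
matching (the pattern and names below are my own example, not TokenTable's
actual patterns):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Prefix matching as read(Matcher) relies on it: with a ^-anchored pattern,
// find() succeeds only at the start of the line, and end() marks where to chop.
public class PrefixDemo {
    static final Pattern NUMBER = Pattern.compile("^\\d+(\\.\\d+)?");

    // returns the matched prefix, or null when the line doesn't start with a number
    static String chop(String line) {
        Matcher m = NUMBER.matcher(line);
        return m.find() ? line.substring(0, m.end()) : null;
    }

    public static void main(String[] args) {
        System.out.println(chop("3.14+x")); // 3.14
        System.out.println(chop("x+3.14")); // null: the number is not a prefix
    }
}
```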
The last method of the Tokenizer class is the largest one, but it's a bit of a
boring method. All it does is try to find tokens using different regular
expressions, in the order explained in the first part of this week's tip.
Here's the method:

private Token read() throws InterpreterException {

    String str;

    try {
        if (readLine() == null)
            return new Token("eof", TokenTable.T_ENDT);

        read(TokenTable.spcePattern.matcher(line));

        if ((str= read(TokenTable.numbPattern.matcher(line))) != null)
            return new Token(Double.parseDouble(str));

        if ((str= read(TokenTable.wordPattern.matcher(line))) != null)
            return new Token(str, TokenTable.T_NAME);

        if ((str= read(TokenTable.sym2Pattern.matcher(line))) != null)
            return new Token(str, TokenTable.T_TEXT);

        if ((str= read(TokenTable.sym1Pattern.matcher(line))) != null)
            return new Token(str, TokenTable.T_TEXT);

        return new Token(read(TokenTable.charPattern.matcher(line)), TokenTable.T_CHAR);
    }
    catch (IOException ioe) {
        throw new TokenizerException(ioe.getMessage(), ioe);
    }
}

The Patterns for the regular expressions are supplied by the TokenTable class
and all this method does is try to match them in a fixed order. That is all
there is to lexical analysis for our little language: depending on which
pattern matched, a corresponding token is returned. The previous methods take
care that no unprocessed token is skipped and forgotten (it is simply returned
again and again until the Tokenizer is notified that it has actually been
processed). Other methods (see above) take care of reading the next line when
necessary, and the column value is updated on the fly.
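The fixed order matters: two-character symbols (sym2Pattern) are tried before
one-character symbols (sym1Pattern), otherwise "<=" would be chopped up as "<"
followed by "=". A small sketch of that ordering, with hypothetical patterns
of my own (the real ones live in TokenTable):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Why two-character symbols must be tried before one-character ones.
public class OrderDemo {
    static final Pattern SYM2 = Pattern.compile("^(<=|>=|==|!=)");
    static final Pattern SYM1 = Pattern.compile("^[<>=!+\\-*/]");

    // returns the symbol at the start of 'line', longest alternative first
    static String firstSymbol(String line) {
        Matcher m2 = SYM2.matcher(line);
        if (m2.find()) return line.substring(0, m2.end());
        Matcher m1 = SYM1.matcher(line);
        if (m1.find()) return line.substring(0, m1.end());
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstSymbol("<=1")); // <=  (not just "<")
        System.out.println(firstSymbol("<1"));  // <
    }
}
```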

As you might have noticed a TokenizerException is thrown in one part of
the Tokenizer code. A TokenizerException is a small class that extends the
InterpreterException class. The latter class is more interesting and looks
like this:

public class InterpreterException extends Exception {

    private static final long serialVersionUID = 99986468888466836L;

    private String message;

    public InterpreterException(String message) {

        super(message);
    }

    public InterpreterException(String message, Throwable cause) {

        super(message, cause);
    }

    public InterpreterException(Tokenizer tz, String message) {

        super(message);
        process(tz);
    }

    public InterpreterException(Tokenizer tz, String message, Throwable cause) {

        super(message, cause);
        process(tz);
    }

    private void process(Tokenizer tz) {

        StringBuilder sb= new StringBuilder();
        String nl= System.getProperty("line.separator");

        sb.append("["+tz.getLineNumber()+":"+tz.getColumn()+"] "+
              super.getMessage()+nl);
        sb.append(tz.getLine()+nl);

        for (int i= 1, n= tz.getColumn(); i < n; i++)
            sb.append('-');
        sb.append('^');

        message= sb.toString();
    }

    public String getMessage() {

        return (message != null)?message:super.getMessage();
    }
}

It looks like most Exceptions, i.e. it has a message and a possible 'cause'
for the Exception (the so-called 'root' exception). There's one interesting
method in this InterpreterException: the 'process' method. When a Tokenizer is
passed at construction time, this object is able to construct a nice error
message when printed out; the error message looks like this:

[line:column] error message
line that has a column with an error
---------------------^

A StringBuilder is used to concatenate the different parts of the three-line
message; the 'nl' variable contains the 'end of line' sequence of the system
this application is running on (any combination of '\r' and '\n' is possible),
and the correct number of '-' signs is appended to make the '^' caret appear
at the correct location under the current line.

The LineNumberReader used by the Tokenizer counts lines starting at 0 (zero)
while humans like to start counting at 1 (one), but luckily for us the first
line (0) has already been read completely, so the LineNumberReader returns the
next line number when we ask for it and there's no need to adjust the value.
Note that column numbers also start at 0, but at least one token on the line
has already been read and the column points to the location in the string just
after that token, so no adjustment is needed here either.

When no Tokenizer is supplied at construction time, just the message that was
passed to the superclass is returned; otherwise our nicely crafted message is
returned by the getMessage() method.

This InterpreterException is used extensively by the parsers when they
encounter a syntax error in the token stream. They pass the Tokenizer itself
to the new InterpreterException so that, whenever possible, a nicely formatted
error message is shown when the message of the InterpreterException is printed.

Concluding remarks

I showed and explained quite a bit of code in this week's part of the article.
Try to understand the code and don't hesitate to reply if you don't understand
something. Compiler construction is a difficult task and contains many tricky
details. This week showed the simple Token class, the long and boring
Tokenizer class, and the InterpreterException class.

In the next tip I'll explain how the table classes are initialized. When that
is over and done with I'll supply some actual code as an attachment so that
you can play and experiment with all of this a bit.

The following parts of this article explain the parsers, the complicated parts
of our compiler. The parsers closely collaborate with the simple code
generator. The generator generates code (how surprising!) which can be fed to
the last class: the Interpreter class itself. The generated code consists of
Instructions; they are sometimes complicated, but most of the time simple and
coherent pieces of code (sequences of Java instructions) that are activated by
our Interpreter.

See you next week and

kind regards,

Jos
Nov 21 '07