We have work to do today: we are going to design and implement the first parts
of our little compiler.
Tokens
We start with the tokenization part; the tokens recognized by our little
language are a bit like C or Java tokens: we want to recognize the following
binary operators: + - * / ^ (the last one stands for 'raised to the power'),
and
comparison operators: < <= == != >= > as well as some unary operators:
! + - ++ --
There are some more characters we want to recognize: = ( and )
On top of that we want to recognize identifiers (or 'names') just as in
Java or C and we want to recognize floating point numbers.
All other characters are not actually 'recognized' but simply passed on 'as is'
to the parser. There's one exception: we want to recognize the fact that all
the text has been read and tokenized: the end of text condition.
White spaces are ignored in our simple language. The tokenizer also has to
keep track of the current line number and column number because we want to
synthesize nice error messages just as the javac compiler does.
Basically the tokenizer groups tokens in the following way:
// different types of tokens:
static final int T_ENDT= -1; // end of stream reached
static final int T_CHAR=  0; // an ordinary character
static final int T_NUMB=  1; // a number
static final int T_TEXT=  2; // a recognized token
static final int T_NAME=  3; // an identifier
The standard java.util.Scanner class might seem a natural fit here; alas we
can't use it because a Scanner works based on delimiters, i.e. tokens are
separated by delimiters (which you can set yourself if you like). Our
language doesn't have delimiters, i.e. we don't want to write 'a + b'
all the time when we simply want to write 'a+b'. Our tokens are delimited by
changing groups of characters and we don't know in advance what our tokenizer
is going to read from the input stream, so we don't know which delimiters to use.
We take a different approach: we simply read the input line by line and use
regular expressions to figure out what the first possible token can be. If
we've found one, we'll chop it off from the line and return the token. If there
are no more tokens to read we simply read the next line and repeat the process.
There's a little trap here: suppose the first character in the stream is a '+'
or '>' or '<' or '=' sign? Can we just decide that that single character makes
up a token or should we check the next character too? If the stream contains
++ or >= or <= or == then we should indeed check the next character.
The following simple rule avoids this little trap:
We check for tokens in the following order:
- check for an end of stream state; if found we return a T_ENDT token;
- check for spaces, if found we ignore them and continue;
- check for a number, if found return a T_NUMB token;
- check for an identifier, if found return a T_NAME token;
- check for two character tokens, if found return a T_TEXT token;
- check for single character tokens, if found return a T_TEXT token;
- all else fails return a single character T_CHAR token;
When a token of length L is found, we chop it off from the current line and
increment the column count by L. We keep a backup of the
current line in case an error needs to be displayed. Note that if we've
just read spaces (see step 2) incrementing the column count is a bit more
complex: the spaces could be tabs and one tab visually represents another
number of character positions, depending on the tab-stop size.
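The steps above can be sketched in Java like this; the pattern strings and the T_* values mirror the article, but the class and method names (MatchSketch, chop, next) are placeholders of my own, not the article's final Tokenizer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A sketch of the matching order listed above; each line of input is
// consumed from the front, one token at a time.
class MatchSketch {
    static final int T_ENDT= -1, T_CHAR= 0, T_NUMB= 1, T_TEXT= 2, T_NAME= 3;

    static final Pattern SPACE  = Pattern.compile("^\\s*");
    static final Pattern NUMBER = Pattern.compile("^\\d+(\\.\\d*)?([eE][+-]?\\d+)?");
    static final Pattern WORD   = Pattern.compile("^[A-Za-z_]\\w*");
    static final Pattern SYMBOL2= Pattern.compile("^(==|<=|>=|!=|\\^=|\\+=|-=|\\*=|/=|\\+\\+|--)");
    static final Pattern SYMBOL1= Pattern.compile("^[=!<>+*/()^-]");

    private String line;
    private int column; // incremented as tokens are chopped off

    MatchSketch(String line) { this.line= line; }

    // try one pattern: on success chop the matched text off the line,
    // advance the column count and return the matched text
    private String chop(Pattern pattern) {
        Matcher m= pattern.matcher(line);
        if (!m.find() || m.group().isEmpty()) return null;
        String match= m.group();
        line= line.substring(match.length());
        column+= match.length(); // tabs would need extra care here (step 2)
        return match;
    }

    // return the type of the next token, following the order described above
    int next() {
        chop(SPACE);                          // ignore spaces and continue
        if (line.isEmpty()) return T_ENDT;    // end of this line's input
        if (chop(NUMBER)  != null) return T_NUMB;
        if (chop(WORD)    != null) return T_NAME;
        if (chop(SYMBOL2) != null) return T_TEXT;
        if (chop(SYMBOL1) != null) return T_TEXT;
        line= line.substring(1); column++;    // all else fails: a single
        return T_CHAR;                        // ordinary character
    }
}
```

Feeding it 'a+b 3.14<= $' yields a name, a text token, a name, a number, a text token, an ordinary character and finally the end-of-text marker, exactly as the rules prescribe.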
The tokenizer is invoked by a parser. When the tokenizer has read a token
that has not been processed by the parser, the tokenizer will return the
same token over and over again, i.e. the parser has to explicitly tell the
tokenizer that the last token has been processed and a new token should be
read from the character input stream.
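That contract can be sketched as follows; the class and method names here are hypothetical (the article's real Tokenizer appears in the next part), and a fixed token array stands in for the character input stream:

```java
// Sketch of the tokenizer/parser contract described above: nextToken()
// keeps returning the same token until the parser calls flush().
class PushbackSketch {
    private String[] tokens; // stand-in for the real character stream
    private int index;       // next token to hand out
    private String current;  // token handed out but not yet processed

    PushbackSketch(String... tokens) { this.tokens= tokens; }

    // returns the same token over and over again until flush() is called
    String nextToken() {
        if (current == null)
            current= (index < tokens.length)? tokens[index++] : "<EOT>";
        return current;
    }

    // the parser tells us the last token has been processed
    void flush() { current= null; }
}
```

This one-token pushback is all a recursive descent parser needs: it can peek at the current token as often as it likes and only consumes it when it has decided what to do with it.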
The Token class
First let's build a simple Token class; instances of this class are returned
by the tokenizer; here it is:
package compiler;
public class Token {
    public static final Token ONE= new Token(1);
    private double dbl; // filled if token is a number
    private String str; // String representation of a token
    private int typ;    // the type of this token
    // ctor for a double token:
    public Token(double dbl) { this(dbl, ""+dbl, TokenTable.T_NUMB); }
    // ctor for a non-double token:
    public Token(String str, int type) { this(0.0, str, type); }
    // general ctor:
    public Token(double dbl, String str, int typ) {
        this.dbl= dbl;
        this.str= str;
        this.typ= typ;
    }
    // mark this token as 'special':
    public Token mark() {
        str+= "@";
        return this;
    }
    // getters:
    public double getDbl() { return dbl; }
    public String getStr() { return str; }
    public int getTyp() { return typ; }
    // token string and type as string:
    public String toString() { return str+"["+typ+"]"; }
}
A Token stores the double value that was read by the tokenizer (if any),
the String representation of the token and the type of the token. The
String value is the first part of the line that is chopped off by the
tokenizer. The Token class implements simple accessors (the 'getters')
and it can print itself. It also implements a few convenient constructors.
It's a simple and clean little class. Note that the first constructor
refers to a 'TokenTable' class. That class just contains static data
needed by the tokenizer and the little Token class.
The Token class has one public static member which can be useful: the constant ONE.
There's one method that needs a bit more explanation, the 'mark' method.
It is used by the parser to indicate that the current token is a bit 'special'
and it marks the token as such. The parser uses it for the overloaded + and
- operators because they can represent a binary addition or subtraction as
well as a unary plus or minus operator. We'll get back to that trickery later
when we dig into the parser section of this simple compiler.
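A tiny demonstration of mark() may help; TokenTable and Token below are trimmed copies of what the article shows, just enough to make the demo compile on its own:

```java
// Trimmed stand-ins for the article's classes, for this demo only:
class TokenTable { static final int T_NUMB= 1; static final int T_TEXT= 2; }

class Token {
    private double dbl; private String str; private int typ;
    Token(double dbl) { this(dbl, ""+dbl, TokenTable.T_NUMB); }
    Token(String str, int typ) { this(0.0, str, typ); }
    Token(double dbl, String str, int typ) { this.dbl= dbl; this.str= str; this.typ= typ; }
    Token mark() { str+= "@"; return this; }
    public String toString() { return str+"["+typ+"]"; }
}

class MarkDemo {
    public static void main(String[] args) {
        Token minus= new Token("-", TokenTable.T_TEXT);
        System.out.println(minus);        // prints: -[2]
        // the parser marks the token when this '-' denotes a unary minus:
        System.out.println(minus.mark()); // prints: -@[2]
    }
}
```

The '@' suffix turns the marked minus into a different String, so the parser can later tell a unary '-@' apart from a binary '-' by the token's String value alone.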
Regular Expression resources
We'll discuss the TokenTable (and other table classes) later. The table
classes contain data needed by the more interesting parts of this project.
One notable detail is that the table classes are populated using external
properties files. This makes it easy to play a bit with the separate parts
(the tokenizer and parser(s)) and augment them at will.
One of the properties files is the 'tokens.properties' file which contains
the regular expressions needed by the tokenizer; here it is:
space   = ^\\s*
number  = ^\\d+(\\.\\d*)?([eE][+-]?\\d+)?
word    = ^[A-Za-z_]\\w*
symbol2 = ^(==|<=|>=|!=|\\^=|\\+=|-=|\\*=|/=|\\+\\+|--)
symbol1 = ^[=!<>+*/()^-]
char    = ^\\S
The double backslashes are needed because when a Properties object is read,
it translates double backslashes to single backslashes. Note that the '^'
in the ^= token must itself be escaped (\\^=), otherwise the Pattern class
treats it as a start-of-line anchor instead of a literal caret.
This gibberish is all explained in great detail in the API documentation
of the Pattern class. Here's a short outline: the regular expression on the
first line reads ^\s* which means: at the start of the line we want zero
or more \s characters. The \s character class is special to the Pattern
class and means: any space character (including tabs).
The second line is more complicated; it reads: a number is one or more digits
(the \d+ part), optionally followed by a literal dot and zero or more digits
(the (\.\d*)? part). Next, optionally, a lowercase or uppercase 'e' or 'E' is
read, followed (optionally) by a + or - sign, followed by one or more digits
(the ([eE][+-]?\d+)? part).
The third line represents a name: any letter (uppercase or lowercase) or
underscore followed by zero or more of the same including digits.
The fourth line is simply a list of choices where each choice is one of our
two character tokens.
The fifth line is a list of single character tokens and the last line ^\S
means: any non-space character at the start of the line.
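A few quick checks make these patterns concrete; the strings below are the regexes exactly as Pattern.compile receives them after the Properties loader has reduced the double backslashes, and the class and method names are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Try each pattern against a sample line and show what it would chop off.
class PatternDemo {
    // return the first match of regex in line, or null if there is none
    static String firstMatch(String regex, String line) {
        Matcher m= Pattern.compile(regex).matcher(line);
        return m.find()? m.group() : null;
    }

    public static void main(String[] args) {
        // the number pattern accepts plain, fractional and exponent forms:
        System.out.println(firstMatch("^\\d+(\\.\\d*)?([eE][+-]?\\d+)?", "3.14e-2+x")); // 3.14e-2
        // the two character pattern is tried before the single character one:
        System.out.println(firstMatch("^(==|<=|>=|!=|\\^=|\\+=|-=|\\*=|/=|\\+\\+|--)", "<=1")); // <=
        System.out.println(firstMatch("^[=!<>+*/()^-]", "<1")); // <
        // a name: a letter or underscore followed by word characters:
        System.out.println(firstMatch("^[A-Za-z_]\\w*", "foo_1+2")); // foo_1
    }
}
```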
Compare these six lines with the paragraph above which lists the order in
which possible tokens are considered.
Because we chop off the recognized beginning of the line, the rest of the
line has a new start of the line again. The '^' character in the regular
expressions designates the start of a line and greatly speeds up matching the
regular expressions against a line of text.
The next part of this week's article shows the final Tokenizer class.
See you there.
kind regards,
Jos