Bytes | Software Development & Data Engineering Community

Compilers - 2A: Tokenizers

Greetings,

We have work to do today: we are going to design and implement the first parts
of our little compiler.

Tokens

We start with the tokenization part; the tokens recognized by our little
language are a bit like C or Java tokens: we want to recognize the following
binary operators: + - * / ^ (the last one stands for 'raised to the power'),
and
comparison operators: < <= == != >= > as well as some unary operators:
! + - ++ --

There are some more characters we want to recognize: = ( and )
On top of that we want to recognize identifiers (or 'names') just as in
Java or C and we want to recognize floating point numbers.

All other characters are not actually 'recognized' but simply passed on 'as is'
to the parser. There's one exception: we want to recognize the fact that all
the text has been read and tokenized: the end of text condition.

White spaces are ignored in our simple language. The tokenizer also has to
keep track of the current line number and column number because we want to
synthesize nice error messages just as the javac compiler does.

Basically the tokenizer groups tokens in the following way:
// different types of tokens:
static final int T_ENDT= -1; // end of stream reached
static final int T_CHAR=  0; // an ordinary character
static final int T_NUMB=  1; // a number
static final int T_TEXT=  2; // a recognized token
static final int T_NAME=  3; // an identifier
The first idea that came to mind was to use a Scanner for the token recognition.
Alas, we can't use it because a Scanner works based on delimiters, i.e. tokens
are separated by delimiters (which you can set yourself if you like). Our
language doesn't have delimiters, i.e. we don't want to write 'a + b'
all the time when we simply want to write 'a+b'. Our tokens are delimited by
changing groups of characters and we don't know in advance what our tokenizer
is going to read from the input stream, so we don't know which delimiters to use.

We take a different approach: we simply read the input line by line and use
regular expressions to figure out what the first possible token can be. If
we've found one, we'll chop it off from the line and return the token. If there
are no more tokens to read we simply read the next line and repeat the process.

There's a little trap here: suppose the first character in the stream is a '+'
or '>' or '<' or '=' sign? Can we just decide that that single character makes
up a token or should we check the next character too? If the stream contains
++ or >= or <= or == then we should indeed check the next character.

The following simple rule avoids this little trap:

We check for tokens in the following order:

  1. check for an end of stream state; if found, return a T_ENDT token;
  2. check for spaces; if found, ignore them and continue;
  3. check for a number; if found, return a T_NUMB token;
  4. check for an identifier; if found, return a T_NAME token;
  5. check for two character tokens; if found, return a T_TEXT token;
  6. check for single character tokens; if found, return a T_TEXT token;
  7. if all else fails, return a single character T_CHAR token.
Each time a token of length L is found we chop off L characters from the
current line and increment the column count by L. We keep a backup of the
current line in case an error needs to be displayed. Note that if we've
just read spaces (see step 2), incrementing the column count is a bit more
complex: the spaces could be tabs, and one tab visually represents a
number of character positions that depends on the tab-stop size.
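The tab handling can be sketched like this; the tab-stop size (8 here) and the method name are assumptions for the demo, not part of the article's code:

```java
// Sketch of column bookkeeping when skipped whitespace may contain tabs.
public class Columns {
    static final int TABSTOP = 8; // assumed tab-stop size

    // given the current (zero-based) column and the run of whitespace that
    // was skipped, return the new column position
    static int advance(int col, String spaces) {
        for (char c : spaces.toCharArray()) {
            if (c == '\t') col = (col / TABSTOP + 1) * TABSTOP; // jump to next tab stop
            else           col++;                               // ordinary space
        }
        return col;
    }

    public static void main(String[] args) {
        System.out.println(advance(0, "    ")); // 4
        System.out.println(advance(3, "\t"));   // 8
        System.out.println(advance(8, "\t "));  // 17
    }
}
```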

The tokenizer is invoked by a parser. When the tokenizer has read a token
that has not been processed by the parser, the tokenizer will return the
same token over and over again, i.e. the parser has to explicitly tell the
tokenizer that the last token has been processed and a new token should be
read from the character input stream.
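That contract can be sketched with a tiny stand-in class: the same token keeps coming back until the parser explicitly skips it. The method names here are made up for the demo; the real Tokenizer class follows in the next part:

```java
// Sketch of the tokenizer/parser contract: next() keeps returning the same
// token until the parser calls skip(). A String array stands in for the
// character input stream.
public class PeekDemo {
    private final String[] tokens;
    private int pos = 0;

    PeekDemo(String... tokens) { this.tokens = tokens; }

    // return the current token over and over; "<EOT>" plays the T_ENDT role
    String next() { return pos < tokens.length ? tokens[pos] : "<EOT>"; }

    // the parser calls this to tell us the current token has been processed
    void skip() { if (pos < tokens.length) pos++; }

    public static void main(String[] args) {
        PeekDemo t = new PeekDemo("a", "+", "b");
        System.out.println(t.next()); // a
        System.out.println(t.next()); // a (still not processed)
        t.skip();
        System.out.println(t.next()); // +
    }
}
```

This makes the parser's life easy: it can look at the current token as often as it likes while deciding what to do, and only then commit to consuming it.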

The Token class

First let's build a simple Token class; instances of this class are returned
by the tokenizer; here it is:
package compiler;

public class Token {

    public static final Token ONE= new Token(1);

    private double dbl; // filled if token is a number
    private String str; // String representation of a token
    private int    typ; // the type of this token

    // ctor for a double token:
    public Token(double dbl) { this(dbl, ""+dbl, TokenTable.T_NUMB); }

    // ctor for a non-double token:
    public Token(String str, int type) { this(0.0, str, type); }

    // general ctor:
    public Token(double dbl, String str, int typ) {
        this.dbl= dbl;
        this.str= str;
        this.typ= typ;
    }

    // mark this token as 'special':
    public Token mark() {

        str+= "@";
        return this;
    }

    // getters:
    public double getDbl() { return dbl; }
    public String getStr() { return str; }
    public int getTyp() { return typ; }

    // token string and type as string:
    public String toString() { return str+"["+typ+"]"; }
}
The Token class is just a simple data class: it stores the double value that
was read by the tokenizer (if any), the String representation of the token
and the type of the token. The String value is the first part of the line
that is chopped off by the tokenizer. The Token class implements simple
accessors (the 'getters') and it can print itself. It also implements a few
convenient constructors. It's a simple and clean little class. Note that
the first constructor refers to a 'TokenTable' class. That class just
contains static data needed by the tokenizer and the little Token class.

The Token class has one public static member which can be useful: the constant ONE.

There's one method that needs a bit more explanation, the 'mark' method.
It is used by the parser to indicate that the current token is a bit 'special'
and it marks the token as such. The parser uses it for the overloaded + and
- operators because they can represent a binary addition or subtraction as
well as a unary plus or minus operator. We'll get back to that trickery later
when we dig into the parser section of this simple compiler.

Regular Expression resources

We'll discuss the TokenTable (and other table classes) later. The table
classes contain data needed by the more interesting parts of this project.
One notable detail is that the table classes are populated using external
properties files. This makes it easy to play a bit with the separate parts
(the tokenizer and parser(s)) and augment them at will.

One of the properties files is the 'tokens.properties' file which contains
the regular expressions needed by the tokenizer; here it is:
space    = ^\\s*
number   = ^\\d+(\\.\\d*)?([eE][+-]?\\d+)?
word     = ^[A-Za-z_]\\w*
symbol2  = ^(==|<=|>=|!=|\\^=|\\+=|-=|\\*=|/=|\\+\\+|--)
symbol1  = ^[=!<>+*/()^-]
char     = ^\\S
  7.  
The double \\ characters represent a single \ character: a backslash is an
escape character in a properties file, so when a Properties object is loaded
it translates double backslashes into single backslashes.
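A tiny demo makes the backslash handling concrete; it loads one property from a String (a stand-in for the real file) and shows that the loaded value is exactly what Pattern.compile() expects:

```java
import java.io.StringReader;
import java.util.Properties;

// The properties text contains ^\\s* (as in the file); after loading,
// the value is ^\s* — four characters, with a single backslash.
public class PropsDemo {
    static String load(String text, String key) throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p.getProperty(key);
    }

    public static void main(String[] args) throws Exception {
        String value = load("space=^\\\\s*", "space"); // file text: space=^\\s*
        System.out.println(value);          // ^\s*
        System.out.println(value.length()); // 4
    }
}
```

Had the file contained a single backslash, Properties would have treated `\s` as an unknown escape and silently dropped the backslash, breaking the regular expression.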

This gibberish is all explained in great detail in the API documentation
of the Pattern class. Here's a short outline: the regular expression on the
first line reads ^\s* which means: at the start of the line we want zero
or more \s characters. The \s class is special to the Pattern class and
means: any white space character (including tabs).

The second line is more complicated; it reads: a number is one or more digits
(the \d+ part) optionally followed by a literal dot and zero or more digits
(the (\.\d*)? part). Next, optionally, a lowercase or uppercase 'e' or 'E' is
read, followed (optionally) by a + or - sign and one or more digits
(the ([eE][+-]?\d+)? part).
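A quick check of the number pattern shows that it matches the numeric prefix of a line and stops at the first character that cannot belong to a number (the class name here is just for the demo):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Exercises the number pattern from tokens.properties.
public class NumberDemo {
    static final Pattern NUMBER = Pattern.compile("^\\d+(\\.\\d*)?([eE][+-]?\\d+)?");

    // return the matched numeric prefix, or null if the line doesn't start with one
    static String match(String line) {
        Matcher m = NUMBER.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(match("42+x"));     // 42
        System.out.println(match("3.5e-2*y")); // 3.5e-2
        System.out.println(match("2.e8"));     // 2.e8 (a dot with no digits is allowed)
        System.out.println(match(".5"));       // null (a number must start with a digit)
    }
}
```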

The third line represents a name: any letter (uppercase or lowercase) or
underscore followed by zero or more of the same including digits.

The fourth line is simply a list of choices where each choice is one of our
two character tokens.

The fifth line is a list of single character tokens and the last line ^\S
means: any non-space character at the start of the line.

Compare these six lines with the paragraph above which lists the order in
which possible tokens are considered.

Because we chop off the recognized beginning of the line, the rest of the
line has a new start of the line again. The '^' character in the regular
expressions designates the start of a line and greatly speeds up matching
the regular expressions against a line of text.

The next part of this week's article shows the final Tokenizer class.
See you there.

kind regards,

Jos
May 27 '07 #1