Compilers - 2B: Tokenizers

Greetings,

welcome back to the second part of this week's tip. This part of the article shows
you the Tokenizer class. Let's go through it in small parts:

public class Tokenizer {

    // default size of a tab character
    private static final int TAB= 8;

    // the current line and its backup
    private String line;
    private String wholeLine;

    // current column and tab size
    private int column;
    private int tab= TAB;

    // the reader used for line reading
    private LineNumberReader lnr;

    // last unprocessed token
    private Token token;

These are the private member variables carried around by the Tokenizer; most
of them don't need any explanation. We use a LineNumberReader that keeps
track of the line numbers (sic). Here are the constructor and the initialize method:

Tokenizer() { }

public void initialize(Reader r) {

    lnr= new LineNumberReader(r);
    line= null;
    token= null;
}

There isn't much to talk about here either: there's just one constructor and it isn't
public, which implies that only other classes in the same package can create a
Tokenizer. The parsers live in the same package; they call the initialize
method after they have instantiated a Tokenizer using the package-scope
default constructor. The initialize method wraps the Reader in a
LineNumberReader. The wrapper reader is used for actually reading lines from
the input and keeps track of the line numbers.
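
For illustration, here is a hypothetical sketch of how a parser in the same package might set up its Tokenizer. The Parser class, its constructor and the input string are assumptions of mine; only the package-scope constructor and initialize() come from the code above:

// hypothetical usage sketch; the Parser class and its constructor are
// assumptions, not part of the article's code
import java.io.Reader;
import java.io.StringReader;

public class Parser {

    private final Tokenizer tokenizer= new Tokenizer();    // package-scope constructor

    public Parser(Reader r) {
        tokenizer.initialize(r);    // wraps r in a LineNumberReader internally
    }

    public static void main(String[] args) {
        new Parser(new StringReader("1+2*3"));
    }
}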

Here are the two methods that get invoked by the parsers:

public Token getToken() throws InterpreterException {

    if (token == null) token= read();
    return token;
}

public void skip() { token= null; }

If no token has been read yet we read a new one; otherwise we simply return
the token that was read before. The 'skip()' method signals the Tokenizer
that the current token has been processed, so 'getToken()' will read a fresh
token on its next invocation.
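
To make this peek-and-consume pattern concrete, a parsing loop could look roughly like the sketch below. The getType() accessor on Token and the loop body are assumptions; only getToken(), skip() and TokenTable.T_ENDT come from the article:

// hypothetical driver loop; Token.getType() is assumed to exist and
// getToken()'s InterpreterException is handled by the caller
Token t;
while ((t= tokenizer.getToken()).getType() != TokenTable.T_ENDT) {
    // ... consume the token here; until skip() is called, getToken()
    // keeps returning this same token ...
    tokenizer.skip();    // tell the Tokenizer the token has been processed
}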

Next are a few 'getters':

public String getLine() { return wholeLine; }

public int getLineNumber() { return lnr.getLineNumber(); }

public int getColumn() { return column; }

public int getTab() { return tab; }

public void setTab(int tab) { if (tab > 0) this.tab= tab; }

The 'setTab()' method does a bit of sanity checking and the 'getters' simply
return what they're supposed to return. Let's get to the private parts of this
class, where the more interesting things happen. When a token has been read from a
line the 'column' value needs to be updated. The 'tab' variable contains the
tabstop size, so we have to take it into account when we update
the 'column' value:

private void addColumn(String str) {

    for (int i= 0, n= str.length(); i < n; i++)
        if (str.charAt(i) != '\t') column++;
        else column= ((column+tab)/tab)*tab;
}

When a character isn't a tab we simply increment the 'column' value; otherwise
we advance 'column' to the next tabstop.
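
As a quick worked example of the tabstop arithmetic (a standalone snippet, not part of the Tokenizer, using the default tab size of 8):

int tab= 8;
int column= 3;
column= ((column+tab)/tab)*tab;    // a tab at column 3 jumps to column 8
column= ((column+tab)/tab)*tab;    // a tab at column 8 jumps to column 16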

The following method checks whether or not the next line needs to be read: if the
current line is null or contains just spaces we need to read the next line. If
there's nothing more to read we set the backup of the line to '<eof>', just in
case something wants to display what has just been read; displaying 'null' just
looks silly. This method also resets the 'column' value because the start of
the new line is to be scanned for new tokens:

private String readLine() throws IOException {

    for (; line == null || line.trim().length() == 0; ) {
        column= 0;
        if ((wholeLine= line= lnr.readLine()) == null) {
            wholeLine= "<eof>";
            return null;
        }
    }

    return line;
}

The following method is the heart of the Tokenizer: it attempts to match the
current line against a 'Matcher'. A Matcher is the counterpart of a Pattern
object: a Pattern compiles a regular expression while a Matcher tries to match
that pattern against a String. The String is our current 'line'. If a match is
found, the matched prefix is chopped off the 'line' and the 'column' value is
updated; finally the matched text (if any) is returned. Here is the method:

private String read(Matcher m) {

    String str= null;

    if (m.find()) {
        str= line.substring(0, m.end());
        line= line.substring(m.end());

        addColumn(str);
    }

    return str;
}

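Note that for this prefix chopping to be safe, the regular expressions should match at the very start of 'line'; if a pattern matched somewhere in the middle, line.substring(0, m.end()) would also swallow the unmatched characters before it. Purely as a hedged illustration of what such an anchored pattern could look like (the real TokenTable patterns are presented in a later part and may well differ):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorDemo {
    public static void main(String[] args) {
        // hypothetical anchored number pattern; TokenTable.numbPattern may differ
        Pattern numb= Pattern.compile("^\\d+(\\.\\d+)?");

        Matcher m= numb.matcher("123.5+x");
        if (m.find())
            System.out.println(m.end());    // prints 5: the matched prefix is "123.5"
    }
}
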
The last method of the Tokenizer class is the largest method, but it's a bit
of a boring method. All it does is try to find tokens using different
regular expressions in the order explained in the first part of this week's tip.
Here's the method:

private Token read() throws InterpreterException {

    String str;

    try {
        if (readLine() == null)
            return new Token("eof", TokenTable.T_ENDT);

        read(TokenTable.spcePattern.matcher(line));

        if ((str= read(TokenTable.numbPattern.matcher(line))) != null)
            return new Token(Double.parseDouble(str));

        if ((str= read(TokenTable.wordPattern.matcher(line))) != null)
            return new Token(str, TokenTable.T_NAME);

        if ((str= read(TokenTable.sym2Pattern.matcher(line))) != null)
            return new Token(str, TokenTable.T_TEXT);

        if ((str= read(TokenTable.sym1Pattern.matcher(line))) != null)
            return new Token(str, TokenTable.T_TEXT);

        return new Token(read(TokenTable.charPattern.matcher(line)), TokenTable.T_CHAR);
    }
    catch (IOException ioe) {
        throw new TokenizerException(ioe.getMessage(), ioe);
    }
}

The Patterns for the regular expressions are supplied by the TokenTable class
and all this method does is try to match them in a fixed order. This is all there
is to it w.r.t. lexical analysis for our little language: depending on which
pattern matched, a corresponding token is returned. The previous methods take
care that no unprocessed token is skipped and forgotten (it is simply returned
again and again until the Tokenizer is notified that it actually has been
processed). Other methods (see above) take care of reading the next line when
necessary and update the column value on the fly.
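
The TokenTable class itself is the subject of a later part; purely as a hedged placeholder (the real regular expressions and constant values will differ), it could look roughly like this:

import java.util.regex.Pattern;

// hypothetical sketch of TokenTable; only the names are taken from the
// code above, the patterns and constant values are guesses
public class TokenTable {

    public static final int T_ENDT= 0;    // end of input
    public static final int T_NAME= 1;    // identifier
    public static final int T_TEXT= 2;    // one- or two-character symbol
    public static final int T_CHAR= 3;    // any other single character

    public static final Pattern spcePattern= Pattern.compile("^\\s+");
    public static final Pattern numbPattern= Pattern.compile("^\\d+(\\.\\d+)?");
    public static final Pattern wordPattern= Pattern.compile("^[A-Za-z_]\\w*");
    public static final Pattern sym2Pattern= Pattern.compile("^(==|!=|<=|>=)");
    public static final Pattern sym1Pattern= Pattern.compile("^[-+*/%=<>(),;]");
    public static final Pattern charPattern= Pattern.compile("^.");
}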

As you might have noticed a TokenizerException is thrown in one part of
the Tokenizer code. A TokenizerException is a small class that extends the
InterpreterException class. The latter class is more interesting and looks
like this:

public class InterpreterException extends Exception {

    private static final long serialVersionUID = 99986468888466836L;

    // nicely formatted message built by process(); null if no Tokenizer was supplied
    private String message;

    public InterpreterException(String message) {

        super(message);
    }

    public InterpreterException(String message, Throwable cause) {

        super(message, cause);
    }

    public InterpreterException(Tokenizer tz, String message) {

        super(message);
        process(tz);
    }

    public InterpreterException(Tokenizer tz, String message, Throwable cause) {

        super(message, cause);
        process(tz);
    }

    // builds a three-line message: the position and text, the offending line
    // and a caret pointing at the error column
    private void process(Tokenizer tz) {

        StringBuilder sb= new StringBuilder();
        String nl= System.getProperty("line.separator");

        sb.append("["+tz.getLineNumber()+":"+tz.getColumn()+"] "+
              super.getMessage()+nl);
        sb.append(tz.getLine()+nl);

        for (int i= 1, n= tz.getColumn(); i < n; i++)
            sb.append('-');
        sb.append('^');

        message= sb.toString();
    }

    public String getMessage() {

        return (message != null)?message:super.getMessage();
    }
}

It looks like most Exceptions, i.e. it has a message and a possible 'cause'
for the Exception (the so-called 'root' exception). There's one interesting
method in this InterpreterException: the 'process' method. When a Tokenizer
is passed at construction time, the exception can build a nicely formatted
error message; the error message looks like this:

[line:column] error message
line that has a column with an error
---------------------^

A StringBuilder is used to concatenate the different parts of the three-line
message; the 'nl' variable contains the 'end of line' sequence of the system
this application is running on (any combination of '\r' and '\n' is possible),
and the correct number of '-' characters is appended to make the '^' caret
appear at the correct location under the current line.

The LineNumberReader used by the Tokenizer counts lines starting at 0 (zero);
humans like to start counting at 1 (one), but luckily for us the first line (0)
has already been read completely and the LineNumberReader will return the next
line number when we ask for it, so there's no need to adjust the value.
Note that column numbers also start at 0, but at least one token on the line has
been read already and the column points to the location in the string just
following that token, so no adjustment is needed here either.

When no Tokenizer is supplied to the constructor, the getMessage() method just
returns the message that was passed to the superclass; otherwise it returns our
nicely crafted message.

This InterpreterException is used extensively by the parsers when they encounter
a syntax error in the token stream. They pass the Tokenizer itself to the new
InterpreterException, so whenever possible a nicely formatted error message is
shown when you print the message of this InterpreterException.
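
For example, a parser might report a missing identifier roughly like this (a hedged fragment; the surrounding parser method and Token.getType() are assumptions of mine):

// hypothetical error report inside a parser method; 'tz' is its Tokenizer
Token t= tz.getToken();
if (t.getType() != TokenTable.T_NAME)
    throw new InterpreterException(tz, "identifier expected");
tz.skip();    // the name was fine: consume it and carry on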

Concluding remarks

I showed and explained quite a bit of code in this week's part of the article.
Try to understand the code and don't hesitate to reply when/if you don't
understand something. Compiler construction is a difficult task and contains
many tricky details. This week showed the simple Token class, the long and
boring Tokenizer class, and the InterpreterException class.

In the next tip I'll explain how the table classes are initialized. When that is
over and done with I'll supply some actual code as an attachment so that you
can play and experiment with it all a bit.

The following parts of this article explain the parsers, the complicated parts
of our compiler. The parsers closely collaborate with the simple code generator.
The generator generates code (how surprising!) which can be fed to the last
class: the Interpreter class itself. The generated code consists of Instructions;
they are sometimes complicated but most of the time simple and coherent pieces
of code (sequences of Java instructions) that are activated by our Interpreter.

See you next week and

kind regards,

Jos
Nov 21 '07 #1