473,608 Members | 2,479 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Compilers - 2B: Tokenizers

11,448 Recognized Expert MVP
Greetings,

welcome back to the second part of this week's tip. This article part shows
you the Tokenizer class. Let's do it in small parts:

Expand|Select|Wrap|Line Numbers
  1. public class Tokenizer {
  2.  
  3.     // default size of a tab character
  4.     private static final int TAB= 8;
  5.  
  6.     // the current line and its backup
  7.     private String line;
  8.     private String wholeLine;
  9.  
  10.     // current column and tab size
  11.     private int column;
  12.     private int tab= TAB;
  13.  
  14.     // the reader used for line reading
  15.     private LineNumberReader lnr;
  16.  
  17.     // last unprocessed token
  18.     private Token token;
  19.  
These are the private member variable carried around by the Tokenizer; most
of them don't need any explanation. We use a LineNumberReade r that keeps
track of the line numbers (sic). Here are the constructors:

Expand|Select|Wrap|Line Numbers
  1. Tokenizer() { }
  2.  
  3. public void initialize(Reader r) {
  4.  
  5.     lnr= new LineNumberReader(r);
  6.     line= null;
  7.     token= null;
  8. }
  9.  
They're not much to talk about either: there's just one constructor and it isn't
public; this implies that only other classes in the same package can create a
Tokenizer. The parsers live in the same package ; the parsers call the
initialize method after they have instantiated a Tokenizer using the package
scope default constructor. The initialize method wraps the Reader in a
LineNumberReade r. The wapper reader is used for actually reading lines from
the input and updates the line numbers.

Here are the two methods that get invoked by the parsers:

Expand|Select|Wrap|Line Numbers
  1.  
  2. public Token getToken() throws InterpreterException {
  3.  
  4.     if (token == null) token= read();
  5.     return token;
  6. }
  7.  
  8. public void skip() { token= null; }
  9.  
If there is no token read already we read a new one, otherwise we simply return
the same token that was read before. The 'skip()' method signals the Tokenizer
that the current token has been processed so the first method will read a new
token again on the next invocation.

Next are a few 'getters':

Expand|Select|Wrap|Line Numbers
  1. public String getLine() { return wholeLine; }
  2.  
  3. public int getLineNumber() { return lnr.getLineNumber(); }
  4.  
  5. public int getColumn() { return column; }
  6.  
  7. public int getTab() { return tab; }
  8.  
  9. public void setTab(int tab) { if (tab > 0) this.tab= tab; }
  10.  
The 'setTab()' method does a bit of sanity checking and the 'getters' simply
return what they're supposed to return. Let's get to the private parts of this
class where more interesting things happen. When a token has been read from a
line the 'column' value needs to be updated. The 'tab' variable contains the
value of the tabstop size so we have to take it into account when we update
the 'column' value:

Expand|Select|Wrap|Line Numbers
  1. private void addColumn(String str) {
  2.  
  3.     for (int i= 0, n= str.length(); i < n; i++)
  4.         if (str.charAt(i) != '\t') column++;
  5.         else column= ((column+tab)/tab)*tab;
  6. }
  7.  
When a character isn't a tab character we simply increment the 'column' value,
otherwise we determine the next tabstop value.

The following method checks whether or not a next line needs to be read: if the
current line is null or contains just spaces we need to read a next line. If
there's nothing more to read we set the backup of the line to '<eof>' just in
case something wants to display what has just been read. Displaying 'null' just
looks silly. This method resets the 'column' value again because the start of
the new line is to be scanned for new tokens:

Expand|Select|Wrap|Line Numbers
  1.  
  2. private String readLine() throws IOException {
  3.  
  4.     for (; line == null || line.trim().length() == 0; ) {
  5.         column= 0;
  6.         if ((wholeLine= line= lnr.readLine()) == null) {
  7.             wholeLine= "<eof>";
  8.             return null;
  9.         }
  10.     }
  11.  
  12.     return line;
  13. }
  14.  
The following method is the heart of the Tokenizer, i.e. it attempts to match
the current line against a 'Matcher'. A Matcher is the counterpart of a Pattern
object: a Pattern compiles a regular expression while a Matcher tries to match
that pattern against a String. The String is our current 'line'. If a match is
found the matched prefix is chopped from the 'line' and the 'column' value is
updated; finally the matched token (if any) is returned. Here is the method:

Expand|Select|Wrap|Line Numbers
  1. private String read(Matcher m) {
  2.  
  3.     String str= null;
  4.  
  5.     if (m.find()) {
  6.         str= line.substring(0, m.end());
  7.         line= line.substring(m.end());
  8.  
  9.         addColumn(str);
  10.     }
  11.  
  12.     return str;
  13. }
  14.  
The last method of the Tokenizer class is the largest method, but it's a bit
of a boring method. All it does is trying to find tokens using different
regular expressions in the order explained in the first part of this week's tip.
Here's the method:

Expand|Select|Wrap|Line Numbers
  1. private Token read() throws InterpreterException { 
  2.  
  3.     String str;
  4.  
  5.     try {
  6.         if (readLine() == null)
  7.             return new Token("eof", TokenTable.T_ENDT);
  8.  
  9.         read(TokenTable.spcePattern.matcher(line));
  10.  
  11.         if ((str= read(TokenTable.numbPattern.matcher(line))) != null)
  12.             return new Token(Double.parseDouble(str));
  13.  
  14.         if ((str= read(TokenTable.wordPattern.matcher(line))) != null)
  15.             return new Token(str, TokenTable.T_NAME);
  16.  
  17.         if ((str= read(TokenTable.sym2Pattern.matcher(line))) != null)
  18.             return new Token(str, TokenTable.T_TEXT);
  19.  
  20.         if ((str= read(TokenTable.sym1Pattern.matcher(line))) != null)
  21.             return new Token(str, TokenTable.T_TEXT);
  22.  
  23.         return new Token(read(TokenTable.charPattern.matcher(line)), TokenTable.T_CHAR);
  24.     }
  25.     catch (IOException ioe) {
  26.         throw new TokenizerException(ioe.getMessage(), ioe);
  27.     }
  28. }
  29.  
The Patterns for the regular expressions are supplied by the TokenTable class
and all this method does is try to match them in a fixed order. This is all there
is to it w.r.t. lexical analysis for our little language: according to which
pattern had a match a corresponding token is returned. The previous methods take
care that no unprocessed token is skipped and forgotten (it is simply returned
again and again until the Tokenizer is notified that it actually has been
processed). Other methods (see above) take care of reading a next line when
necessary and on the fly the column value is updated.

As you might have noticed a TokenizerExcept ion is thrown in one part of
the Tokenizer code. A TokenizerExcept ion is a small class that extends the
InterpreterExce ption class. The latter class is more interesting and looks
like this:

Expand|Select|Wrap|Line Numbers
  1. public class InterpreterException extends Exception {
  2.  
  3.     private static final long serialVersionUID = 99986468888466836L;
  4.  
  5.     private String message;
  6.  
  7.     public InterpreterException(String message) { 
  8.  
  9.         super(message); 
  10.     }
  11.  
  12.     public InterpreterException(String message, Throwable cause) { 
  13.  
  14.         super(message, cause); 
  15.     }
  16.  
  17.     public InterpreterException(Tokenizer tz, String message) { 
  18.  
  19.         super(message); 
  20.         process(tz);
  21.     }
  22.  
  23.     public InterpreterException(Tokenizer tz, String message, Throwable cause) { 
  24.  
  25.         super(message, cause); 
  26.         process(tz);
  27.     }
  28.  
  29.     private void process(Tokenizer tz) {
  30.  
  31.         StringBuilder sb= new StringBuilder();
  32.         String nl= System.getProperty("line.separator");
  33.  
  34.         sb.append("["+tz.getLineNumber()+":"+tz.getColumn()+"] "+
  35.               super.getMessage()+nl);
  36.         sb.append(tz.getLine()+nl);        
  37.  
  38.         for (int i= 1, n= tz.getColumn(); i < n; i++)
  39.             sb.append('-');
  40.         sb.append('^');
  41.  
  42.         message= sb.toString();
  43.     }
  44.  
  45.     public String getMessage() {
  46.  
  47.         return (message != null)?message:super.getMessage();
  48.     }
  49. }
  50.  
It looks like most Exceptions, i.e. it has a message and a possible 'cause'
for the Exception (the so called 'root' exception). There's one interesting
method in this InterpreterExce ption: the 'process' method. When a Tokenizer
is passed at construction time of this object it is able to construct a nice
error message when printed out; the error message looks like his:

Expand|Select|Wrap|Line Numbers
  1. [line:column] error mesage
  2. line that has a column with an error
  3. ---------------------^
  4.  
A StringBuilder is used to concatenate the different parts of the three line
message; the 'nl' variable contains the 'end of line' sequence of the system
this application is running on (any combination of '\r' and '\n' are possible).
and the correct amount of '-' signs are concatenated to make the '^' caret
appear at the correct location under the current line.

The LineNumber reader used by the Tokenizer counts lines starting at 0 (zero),
humans like to start counting at 1 (one) but lucky for us, the first line (0)
is already completely read and the LineNumberReade r will return the next line
number for us when we ask for it so there's no need to adjust the value.
Note that column numbers also start at 0, but at least one token on the line has
been read already and the column points to the location in the string just
following the token so no adjustment is needed here either.

When no tokenizer is supplied to this constructor just the message that was
passed to the superclass will be returned, otherwise our nicely crafted
message is returned by the getMessage() method.

This InterpreterExce ption is extensively used by the parsers when they encounter
a syntax error in the token stream. They pass the Tokenizer itself to the new
InterpreterExce ption so whenever possible a nicely formatted error message is
shown when you print out the message of this InterpreterExce ption.

Concluding remarks

I showed and explained quite a bit of code in this week's part of the article.
Try to understand the code and don't hesitate to reply when/if you don't
understand something. Compiler construction is a difficult task and contains
many tricky details. This week showed the simple Token class; the long and
boring Tokenizer class and the InterpreterExce ption class.

In a next tip I'll explain how the table classes are initialized. When that is
over and done with I'll supply some actual code as an attachment so that you
can play and experiment with this all a bit.

The following parts of this article explain the parsers, the complicated parts
of our compiler. The parsers closely collaborate with the simple code generator.
The generator generates code (how surprising!) which can be fed to the last
class: the Intepreter class itself. The generated code consists of Instructions;
they are sometimes complicated but most of the time simple and coherent pieces
of code (sequences of Java instructions) that are activated by our Intepreter.

See you next week and

kind regards,

Jos
Nov 21 '07 #1
0 3763

Sign in to post your reply or Sign up for a free account.

Similar topics

10
3153
by: Bill Davidson | last post by:
Hi there, Please forgive me for posting this article on multiple groups. Being new in the newsgroups, I was not sure which group would have been appropriate for my question. Sorry. My Question ----------- I am looking for a list of popular compilers and/or platforms that *do not* support native C++ exceptions. Any pointers in this
13
2687
by: Derek | last post by:
As I understand it there is a good amount of link compatibility among C compilers. For example, I can compile main.c with GCC and func.c with Sun One and link the objects using either linker (GNU or Sun). What I'm curious about is why this compatibility exists in the absence of a standard C ABI? What encourages C compiler vendors to agree on implementation issues such as alignment, packing, etc., such that their object
0
1508
by: Chris Stephens | last post by:
Low Cost C Compilers ------------------------------------ HI-TECH Software's C compilers are now available to support the ARM, dsPIC, msp430, 8051, PIC 10 to 17, PIC 18 as well as many other popular 8/16/32 bit micros. Recent releases of these compilers have included significant enhancements to aid development and debugging. ARM - includes a USB based JTAG interface with high level debug
3
8693
by: Robert | last post by:
I have a number of web projects converted from 1.1 to 2.0 in VS2005. I am methodically seeing the error below: The element 'compilation' has invalid child element 'compilers'. List of possible elements expected: 'assemblies, buildProviders, codeSubDirectories, expressionBuilders'. Here's what the web config looks like. The error doesn't cause any issues and according to the MSDN documentation this is valid. So why is VS2005
12
2445
by: madhura | last post by:
Hello, I have a basic question about compilers. What are the different types of compilers, are they written by different companies, why there are different names of compilers, what is the purpose, I want to know about them. Are interpreters of C available, and if yes, how can I get them. Madhura
8
2177
by: pransri2006 | last post by:
Hi guys! I think all of u know about the designing of compilers. Can any body tell me about the designing of the compilers. And also tell me the difference between the compilers and Interpreter which weresically used by the computers. I want to know about the basic compiler software.
30
3014
by: albert.neu | last post by:
Hello! What is a good compiler to use? (for MS Windows, for Linux) Any recommendations?? What's the point of having a C99 standard, if different compilers still produce differing results? The main reason for a standard, is (in my opinion) portabilty and
0
4784
by: JosAH | last post by:
Greetings, last week's tip was a bit of playtime where we've built a Sudoku solver. This week we're going to build some complicated stuff: a compiler. Compiler construction is a difficult branch of CS and I don't want the article(s) to be the size of a book, so we have to keep things a bit simple. On the other hand the entire thing must have practical use more or less so the code to be developed must be extensible so that you can...
0
3890
by: JosAH | last post by:
Greetings, this week we discuss the design of the syntactic aspects of our little language; it helps with the design for the parser(s) that recognize such syntax. Last week we saw the tokenizer: it just groups sequences of bytes into tokens; it doesn't know anything about any syntax of any language at all. That's the job of the parser; a parser on the other hand doesn't care where these tokens come from, nor how they were formed....
0
4054
by: JosAH | last post by:
Greetings, Introduction This part of the article is one week late; I apologize for that; my excuse is: bizzy, bizzy, bizzy; I attended a nice course and I had to lecture a bit and there simply was no time left for writing. we've come a long way; the previous articles discussed the following aspects of compilers:
0
8059
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
8000
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
8495
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
8470
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
8330
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
6815
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5475
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
3960
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
1328
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.