By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,665 Members | 1,251 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

Compilers - 4: Bookkeeping

Expert 10K+
P: 11,448

this week's compiler article is all about bookkeeping; boring, I admit it, but
we need it for our Tokenizer and Parser(s). Two weeks ago I showed the Tokenizer
class code. It uses a TokenTable which contains the data needed by the Tokenizer.
The TokenTable contains the data alright, but how is this table initialized? I
could've hard coded all data in that table but I didn't. I want to be able to
play and alter that data without altering a single letter in the code itself.
I also want that table to initialize itself. Properties files are good external
resources for that purpose. A Properties file is just a list of key value pairs
separated by an equal '=' sign.

Here's the '' file:

Expand|Select|Wrap|Line Numbers
  1. space    =                      ^\\s*
  2. number     =                      ^\\d+(\\.\\d*)?([eE][+-]?\\d+)?
  3. word     =                      ^[A-Za-z_]\\w*
  4. symbol2  =                      ^(==|<=|>=|!=|^=|\\+=|-=|\\*=|/=|\\+\\+|--)
  5. symbol1  =                      ^[:,=!<>+*/(){}^-]
  6. char     =                      ^\\S
We've seen that contents before in the article part that describes our Tokenizer


We need a mechanism that reads a file from somewhere and produces a
Properties object. If the file has been read before, we don't want to read it
again but we simply want to produce the same Properties object again and again.
We need a cache for that:

Expand|Select|Wrap|Line Numbers
  1. private static Map<String, Properties> cache= 
  2.                 new HashMap<String, Properties>();
Given a String for a key (such as "tokens") we want to cache a Properties object
such as the one listed above as the value element of the map.

The following method finds a Properties object for us given a String key:

Expand|Select|Wrap|Line Numbers
  2. private static final fs= System.getProperty("file.separator");
  4. protected static synchronized Properties getProperties(String name) {
  6.     Properties properties= cache.get(name);
  8.     if (properties != null) return properties;
  10.     try {
  11.         InputStream is= Resources.class.getResourceAsStream(
  12.                     "resources"+fs+name+".properties");
  13.         properties= new Properties();
  15.         properties.load(is);
  16.         is.close();
  18.         cache.put(name, properties);
  20.         return properties;
  21.     }
  22.     catch (IOException ioe) {
  23.         return null;
  24.     }
  25. }
Given a name (such as "tokens"), this method either finds an already cached
Properties object from the cache or it searches for a file named
"resources/<name>.properties". In our example it searches for a file named
"resources/". If all is fine it creates a new Properties object,
loads the contents of the file in the object, adds the object to the cache and
finally returns the Properties object. If anything fails this method returns
a 'null' value.

Note that we didn't hardcode the file separator character (a slash or backslash
in western languages) because it can be anything; any character we don't know
in advance and we don't want to know it either: let Java deal with it.

A next time a Properties file is wanted it is either cached already or it will
be freshly loaded and put into the cache. This little scenario takes care that
every distinct Properties object will be loaded once at most only.

There's one interesting little thing going on here: this method loads a Properties
object relative to where the class is stored in which this method is defined.
Suppose the class itself is stored in '/usr/jos/java/compiler'. Then the
resource will be searched in directory '/usr/jos/java/compiler/resource'.

We stick this method and the cache itself in a class named 'Resources':

Expand|Select|Wrap|Line Numbers
  1. abstract class Resources {
  3.     protected Resources() { }
  5.     private static Map<String, Properties> cache= 
  6.                     new HashMap<String, Properties>();
  8.     protected static synchronized Properties getProperties(String name) { 
  9.         ... 
  10.     }
  11. }
I've made it an abstract class with a single protected default constructor
because I want the table classes to extend from it. Note that the method for
supplying Properties objects is synchronized, just in case two separate threads
want to get the same Properties object at the same time, i.e. we don't want to
load the thing twice or more at the same time.

Also note that this method is a static method just as the cache object itself.
We don't need a Resources object for just loading Properties objects. The class
is a utility class; all it does is supplying functionality.


Let's get on with the boring stuff; here's the TokenTable class:

Expand|Select|Wrap|Line Numbers
  1. abstract class TokenTable extends Resources {
  3.     private TokenTable() { }
  5.     // different types of tokens:
  6.     static final int T_ENDT= -1; // end of stream reached
  7.     static final int T_CHAR=  0; // an ordinary character
  8.     static final int T_NUMB=  1; // a number
  9.     static final int T_TEXT=  2; // a recognized token
  10.     static final int T_NAME=  3; // an identifier
  12.     // regexps for different token types
  13.     static final Pattern spcePattern= Pattern.compile(pat("space"));
  14.     static final Pattern numbPattern= Pattern.compile(pat("number"));
  15.     static final Pattern wordPattern= Pattern.compile(pat("word"));
  16.     static final Pattern sym2Pattern= Pattern.compile(pat("symbol2"));
  17.     static final Pattern sym1Pattern= Pattern.compile(pat("symbol1"));
  18.     static final Pattern charPattern= Pattern.compile(pat("char"));
  20.     private static String pat(String pattern) {
  22.         return getProperties("tokens").getProperty(pattern);
  23.     }
  24. }
We've extended from the Resourses (abstract) class and made the TokenTable class
an abstract class too. The default constructor of this class is private so no-one
can instantiate an object from this class. It isn't needed, i.e. only one table
is needed by any Tokenizer that gets instantiated.

The first part of this class defines a couple of int constants which we have seen
before in the part of this article that showed the Tokenizer class.

The last part initializes the Patterns needed by a Tokenizer. We make use of
the caching facility in the base class Resources by getting property values
from the same Property object over and over again for all six values.

Resources again

The Parser(s) need to know about some more or less fixed matters also:

1) the names of the built-in functions;
2) the names of 'special' built-in functions;
3) the unary operator tokens;
4) the binary operator tokens;
5) the assignment operator tokens;
6) the arity of all operators and functions.

On top of that, for every operator or function name, whether 'special' or not,
a class name is associated with the name/token. The class names are needed by
the code generator but we store that information in the ParserTable for simplicity.

The arity of an operator or function is the number of arguments needed by that
operator or function, e.g. the arity of the 'sin' function is 1, the arity of
the '*' operator is 2 etc.

All the properties files have the same structure: every line in it looks like this:

<description> = <arity> <name> <class>

The <description> is a single word unique description of the function or operator.
The <arity> is a number which is, *ahem*, the arity of the function or operator.
The <name> is the name or token of the function or operator itself and the <class>
is the fully qualified name of the class that implements the function or operator.

As an example, here is part of the contents of the '' file:

Expand|Select|Wrap|Line Numbers
  1. sine        = 1    sin    compiler.instruction.SinInstruction
  2. cosine        = 1    cos    compiler.instruction.CosInstruction
  3. tangent        = 1    tan    compiler.instruction.TanInstruction
  4. exponential    = 1    exp    compiler.instruction.ExpInstruction
  5. logarithm    = 1    log    compiler.instruction.LogInstruction
  6. abs        = 1    abs    compiler.instruction.AbsInstruction
  8. minimum        = 2    min    compiler.instruction.MinInstruction
  9. maximum        = 2    max    compiler.instruction.MaxInstruction
The first column is the descriptive single word which makes up the key of the
Properties object. The value String makes up three columns: the arity, the
name or token of the function or operator and the last column is the fully
qualified name of the class that implements the function or operator.

The Resources class uses another table for that: the GeneratorTable which defines
a Map also:

Expand|Select|Wrap|Line Numbers
  1. protected static Map<String, String> classes= new HashMap<String, String>();
This Map stores tuples of the form <name or token> - <class name> and is build
by the Resources class on the fly.

Next we add a method to the Resources class to handle a Properties object like

Expand|Select|Wrap|Line Numbers
  1. protected static Set<String> getResource(String name) {
  3.     Properties properties= getProperties(name);
  4.     Set<String> tokens;
  6.     if (properties == null) return null;
  8.     tokens= new HashSet<String>();
  10.     for(Map.Entry property: properties.entrySet()) {
  11.         String[] entries= ((String)property.getValue()).
  12.                     trim().split("\\s+");
  14.         int marked= entries[1].indexOf('@');
  16.         ParserTable.arity.put(entries[1], Integer.valueOf(entries[0]));
  18.         if (marked < 0)
  19.             tokens.add(entries[1]);
  20.         else 
  21.             tokens.add(entries[1].substring(0, marked));
  23.         GeneratorTable.classes.put(entries[1], entries[2]);
  24.     }
  26.     return tokens;
  27. }
This is quite a large method; let's see what it does: it returns a Set of names
or tokens; the Set consists of all the names in the second column of the table
(see above). It builds up the set as follows:

it reads every value (the String following the '=' character) and splits the
value into a String array 'entries'. entries[0] is the arity, entries[1] is the
name of the function or operator and entries[2] is the class name. We check if
the name contains a '@' sign and put every element in its place in the appropriate
map. The 'classes' map is updated and the 'tokens' Set is updated and returned
at the end of this method. The '@' sign is removed from the name for the 'tokens'
Set but it is used as part of the key for the 'classes' Map.

Note that this method directly refers to a member of its subclass: the 'arity'
Map. It updates that map for every line in the .properties file. Purist would
argue that we're breaking the Liskov Substitution Priniple (see another artice
for an explanation) but this base class as well as its subclasses TokenTable
and ParserTable are not really classes; they're just a bunch of static methods
grouped in abstract classes for convenience. I could've dumped those methods
and Maps in one big utility class as well but I didn't. I want to group the
functionalities in classes for a bit of clarity.

We're still not done with this Resources class. A ParserTable needs to store
information about 'special' functions and reserved words. The ParserTable needs
sets of Strings for that. Here's another utility method in our Resources class:

Expand|Select|Wrap|Line Numbers
  1. protected static Set<String> getSet(String name) {
  3.     Properties properties= getProperties(name);
  4.     Set<String> set= new HashSet<String>();
  6.     if (properties != null)
  7.         for (Object elem : properties.keySet())
  8.             set.add((String)elem);
  10.     return set;
  11. }
This methods uses a Properties object and stores the keys of that object in a Set.
Here's the file that lists the reserved words of our little language; I named it

Expand|Select|Wrap|Line Numbers
  1. function  =    a declaration or definition of a user function
  2. listfunc  =    a declaration or definition of a list user function
The return value of the method above given this file is a Set with two elements
in it: 'function' and 'listfunc', both reserved words.


Finally, after all this boring bookkeeping, here's the ParserTable class:

Expand|Select|Wrap|Line Numbers
  1. abstract public class ParserTable extends Resources {
  3.     // total number of binary operator precedences
  4.     private static final int PREC= 6;
  6.     private ParserTable() { }
  8.     public static final int T_FUNC=  TokenTable.T_NAME+1; 
  9.     public static final int T_QUOT=  TokenTable.T_NAME+2;
  10.     public static final int T_USER=  TokenTable.T_NAME+3;
  11.     public static final int T_WORD=  TokenTable.T_NAME+4;
  13.     public static final Map<String, Integer> arity= 
  14.                 new HashMap<String, Integer>();
  16.     public static final Set<String> funcs= getResource("functions");
  17.     public static final Set<String> quots= getResource("quotes");
  18.     public static final Set<String> rword= getSet("reserved");
  19.     public static final Set<String> lfncs= getSet("listfuncs");
  21.     // all unary, binary operators and assignments
  22.     static final Set<String> unaops= getResource("unaops");
  23.     static final Set<String> pstops= getResource("postops");
  24.     static final Set<String> asgns= getResource("assigns");
  26.     static final List<Set<String>> binops= new ArrayList<Set<String>>(PREC);
  28.     static {
  29.         for (int i= 0; i < PREC; i++)
  30.             binops.add(getResource("binops"+i));
  31.     }
  32. }
Similar to the TokenTable (see above), this class is nothing more than a couple
of Sets, Maps and Lists of Maps. The first part of this class defines a couple of
constants needed by the Parser(s) while the rest of the class uses the methods
in its superclass (Resources) to fill its Sets and Maps.

The 'arity' Map is filled on the fly when other maps and sets are loaded.

The List of Maps represents the binary operators; the 0th element of this List
store the binary operators with the lowest precedence ( ':' ), and so on.
See the previous article part for a description of all binary operators and their

The Sets of the ParserTable are discussed when we implement our Parser(s).
A short peek:

- the Set funcs stores the names of the built-in normal function names;
- the Set quots stores the names of special built-in function names;
- the Set rword stores the reserved words;
- the Set lfncs stores the names of built-in functions that take lists as
their parameters.


There's one more table to be discussed: the GeneratorTable. This table stores
the names of the operators or functions and associates them with their fully
qualified class name. The data for this table is picked up by the Resources
class from the .properties files and stored in a Map in this table.

A second Map in this class caches instructions:

Expand|Select|Wrap|Line Numbers
  1. private static Map<String, Instruction> instructions= 
  2.                 new HashMap<String, Instruction>();
Given the name of the instruction the real instruction can be associated with it.
This caching mechanism implements the Flyweight pattern, e.g. there's only one
'sin' instruction instantiation needed, no matter how many times the sine function
is used in expressions. This Map takes care of the mechanics needed by this pattern.

The GeneratorTable class implements two methods for just this:

Expand|Select|Wrap|Line Numbers
  1. protected static Instruction getInstruction(String name) {
  3.     try {
  4.         return (Instruction)Class.forName(classes.get(name)).
  5.                 newInstance();    
  6.     }
  7.     catch (Exception e) {
  8.         e.printStackTrace();
  9.         return null;
  10.     }
  11. }
This method instantiates a new Instruction given the name of the operator or
function. When it fails (it shouldn't) it prints the complete stack trace telling
us *why* it failed. Most likely the .properties files contain erroneous data which
should be fixed.

The second method takes care of the caching of instructions:

Expand|Select|Wrap|Line Numbers
  1. protected static Instruction cacheInstruction(String name) {
  3.     Instruction instruction= instructions.get(name);
  5.     if (instruction == null)
  6.         instructions.put(name, instruction= getInstruction(name));
  8.     return instruction;
  9. }
If the instruction was cached before, it is returned, otherwise the previous method
is invoked which instantiates a new instruction for us given the Map that stores
the fully qualified class name of the wanted instruction.

Both methods are protected because the outide world shouldn't even know that this
class exists. It is solely used by the Parser(s) and/or the Interpreter.

Concluding remarks

There's nothing interesting about the four classes described above. The Resources
class is the base class of the TokenTable, ParserTable and GeneratorTable classes
for conveniece.

The four classes are simple utitility classes, no more and no less and don't need
any instances at all. The base class Resources does all the work: it loads contents
from files into Properties objects and fills the maps of the sub-classes.

The main purpose of those four classes is that I don't want to hard-code every
single name or token or whatever in the rest of the code. I want to be able to
change simple .properties files and alter the language without having to alter
much Java code.

This was the boring chapter of the Compilers article; it had to be written for
reasons of completeness; it was quite a big chapter too; I showed how tables fill
themselves when loaded by the JVM. The data comes from .properties files which
are easy to edit. Those .properties files are not a 'user feature'; the contents
of those .properties files have to be correct in order for the entire compilation
system to behave correctly.

I do hope I haven't scared you off too much with this part of the Compilers article,
but that's life: compiler writing is 20% inspiration and 80% perspiration. This
boring article part took care of most if not all of the perspiration.

The inspriration will be back in the sequel(s) of this article. I'm sure we'll
see each other again next week when we finally can dig into Parsers.

kind regards,

Jun 6 '07 #1
Share this Article
Share on Google+