Text retrieval systems - 7: the Software and Data

Greetings,

Introduction

Last week I was a bit too busy to cook up this part of the article series; sorry
for that. This article part wraps up the Text Processing article series. The
attachment contains the complete source code which was explained in the previous
article parts. Download it, extract the sources somewhere and have a look.

The data I've been using all the time can be found here. There are
two zip files to be found:
  • kjbible.zip
  • svbible.zip
The first zip file contains all the books present in the King James bible; the
second zip file contains all the books present in the Dutch Staten Vertaling
bible. Download them and unzip them somewhere, preferably in separate directories.

Now it's time to see what this text processing article was all about.

Compiling the source files

All the source files are stored in their correct directory, corresponding to
the packages in which they are defined. I am not going to explain how you
should compile everything; you're supposed to know that by now. Simply compile
everything and you're in business.

Creating the libraries

In the class directory you can find two small, simple utilities that start
the entire library builder framework. The utilities can be found in the
default package. The utilities are named MakeKJLib.class and MakeSVLib.class.
I think you can guess what they do.

The utilities take three parameters:
  • -i the directory where the separate .txt files are stored;
  • -o the name (and directory) where the created library should be stored;
  • -n the name of the created library.
The last parameter is optional, and a sensible name is given to the library
if you don't explicitly give it a name yourself.
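
If you're curious what handling those three options could look like, here is
a minimal sketch. Mind you, it is not the code from the attachment: the class
name BuilderArgs and its parse method are made up for illustration, and the
real utilities may well do it differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the attached sources: collects the
// -i, -o and -n options described above into a map.
public class BuilderArgs {

    static Map<String, String> parse(String[] args, String defaultName) {
        Map<String, String> opts = new HashMap<>();
        opts.put("-n", defaultName);                // -n is optional
        for (int i = 0; i + 1 < args.length; i += 2) {
            opts.put(args[i], args[i + 1]);         // an option followed by its value
        }
        if (!opts.containsKey("-i") || !opts.containsKey("-o")) {
            throw new IllegalArgumentException(
                "usage: -i <text dir> -o <library file> [-n <name>]");
        }
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse(args, "King James Bible");
        System.out.println(opts.get("-n") + ": "
            + opts.get("-i") + " -> " + opts.get("-o"));
    }
}
```

The default name is put in the map first, so an explicit -n simply overwrites
it.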

As an example, suppose you have stored the .txt files for the King James bible
in c:\kj\text; suppose you want the library to be stored under the name:
c:\kj\lib\kjbible.txtlib.

This is how you build it:

    java -cp . MakeKJLib -i c:/kj/text/ -o c:/kj/lib/kjbible.txtlib

Note that the directories already have to exist and also note the trailing
slash character for the input directory.

After a short while you should see the following output:

    King James Bible [15267, 15232, 3, 82, 1352, 36133]
    build time: 7931ms

The build time doesn't need an explanation: it was the time taken on my laptop
to read all the files and build the entire library.

The default name for this library is "King James Bible". The numbers following
the name have the following meaning:
  • the number of unique words
  • the number of unique words in the wordMap index
  • the number of groups
  • the number of books
  • the number of chapters
  • the number of paragraphs
As you can see, 15267 - 15232 = 35 'noise' words were filtered out of the index.
See the previous article parts for an explanation of all this.

Now we have a compressed library in c:\kj\lib\kjbible.txtlib

If you want to create a library for the Dutch Staten Vertaling, just follow
the same steps. The default name for that bible is "Staten Vertaling".

So a library has a name and it is stored in a file with any name you like.

The exact sizes of the two libraries are:
  • King James bible: 2,987,979 bytes
  • Staten Vertaling bible: 3,149,545 bytes
The Dutch version is a bit bigger; here are its statistics as shown by the
builder utility:

    Staten Vertaling [23141, 23096, 2, 82, 1370, 37235]
    build time: 6799ms

As you can see there are more unique words but only two groups: the Old Testament
and the New Testament. The King James bible has a separate, third group for the
Apocrypha (see a previous article part for an explanation about those books).

The sizes of the non-compressed raw texts are:
  • King James bible: 5,387,972 bytes
  • Staten Vertaling bible: 5,577,169 bytes
The sizes are as reported by the 'dir' command. As you can see the created
libraries are quite a bit smaller than the original raw texts, even though the
libraries also contain the additional (and quite useful) indexes.
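
To put a number on 'quite a bit smaller', here is a small sketch that divides
the library sizes by the raw text sizes listed above. The class and method
names are just mine for this example.

```java
// Illustration only: library size as a percentage of the raw text size,
// using the byte counts listed above.
public class LibraryRatio {

    static double percentOfRaw(long librarySize, long rawSize) {
        return 100.0 * librarySize / rawSize;
    }

    public static void main(String[] args) {
        System.out.printf("King James:       %.1f%%%n",
            percentOfRaw(2_987_979L, 5_387_972L));
        System.out.printf("Staten Vertaling: %.1f%%%n",
            percentOfRaw(3_149_545L, 5_577_169L));
    }
}
```

Both libraries come out at a bit more than half the size of their raw text.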

Let's play a bit with those libraries:

The RunLib utility

A third utility present in the source code is the RunLib utility; it opens
a library and takes queries from the user. It feeds the queries to the library,
gets a Query back, asks it for the results and displays them. This is the way
to start the utility if you want to play with the kjbible.txtlib:

    java -cp . RunLib c:/kj/lib/kjbible.txtlib

The utility prints the simple statistics from the loaded library and prompts
you for a query. If you want to play with the Dutch Staten Vertaling bible
instead it probably is obvious what to do.

If you don't supply it with a library name or if you supply it with an incorrect
name this simple utility just dies after printing out a stacktrace. In no case
is any harm done to the library itself.

Queries

The utility prints out a 'query: ' prompt. Let's give it a silly query twice:

    query: computer
    query time: 90ms
    results: 0
    query: computer
    query time: 0ms
    results: 0
    query:

I asked whether the word 'computer' is present in the King James bible. Obviously
it isn't, but notice that the first query took 90 milliseconds on my laptop
while the second, identical query took less than a millisecond; that is because
the first time most of the classes still needed to be loaded. When the second
query was issued all classes were loaded already. Part of the speed increase
is because the HotSpot JIT compiler found the code and decided to compile
it to native machine code.
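
You can see the same warm-up effect without the library. The sketch below is
not taken from RunLib; it simply times the same task twice, with a regular
expression match standing in for a query. The names WarmUp and timeMillis are
made up for this example, and the numbers you get depend entirely on your
machine and JVM.

```java
// Illustration only: times the same task twice to show the warm-up effect.
public class WarmUp {

    // Runs the task once and returns the elapsed time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        // Stand-in for a query: the first run also pays for loading the
        // java.util.regex classes, the second run does not.
        Runnable query = () -> java.util.regex.Pattern
            .compile("computer").matcher("some paragraph text").find();
        System.out.println("first:  " + timeMillis(query) + "ms");
        System.out.println("second: " + timeMillis(query) + "ms");
    }
}
```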

Let's give it a sensible query:

    query: God & devil
    New Testament    Luke    4    2    And the devil said unto him, If thou be the Son of God, command this stone that it be made bread.
    New Testament    Acts    10    37    How God anointed Jesus of Nazareth with the Holy Ghost and with power: who went about doing good, and healing all that were oppressed of the devil; for God was with him.
    New Testament    Ephesians    6    10    Put on the whole armour of God, that ye may be able to stand against the wiles of the devil.
    New Testament    James    4    6    Submit yourselves therefore to God. Resist the devil, and he will flee from you.
    New Testament    I John    3    7    He that committeth sin is of the devil; for the devil sinneth from the beginning. For this purpose the Son of God was manifested, that he might destroy the works of the devil.
    New Testament    I John    3    9    In this the children of God are manifest, and the children of the devil: whosoever doeth not righteousness is not of God, neither he that loveth not his brother.
    Apocrypha    Tobit    6    16    And the devil shall smell it, and flee away, and never come again any more: but when thou shalt come to her, rise up both of you, and pray to God which is merciful, who will have pity on you, and save you: fear not, for she is appointed unto thee from the beginning; and thou shalt preserve her, and she shall go with thee. Moreover I suppose that she shall bear thee children. Now when Tobias had heard these things, he loved her, and his heart was effectually joined to her.
    query time: 441ms
    results: 7

So there are seven paragraphs in the King James bible that have both the words
'God' and 'devil' in them. Let's see what this query does:

    query: !(God & devil)
    ... lots of output here
    query time: 30ms
    results: 36126

Well, that makes sense: there are seven paragraphs with both the words 'God' and
'devil' in them and there are 36126 paragraphs without. There are 36133
paragraphs in total, so that adds up nicely. Note that the query took only thirty
milliseconds.
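
Here is why the numbers have to add up, as a toy sketch. It assumes one bit
per paragraph in a java.util.BitSet; I'm not claiming that this is how the
library in the attachment stores its results, it's just the simplest way to
show the set arithmetic.

```java
import java.util.BitSet;

// Illustration only: one bit per paragraph, set when the word occurs in it.
public class BooleanQuerySketch {

    // Paragraphs that are in both sets.
    static BitSet and(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // All paragraphs (0 .. total-1) that are not in the set.
    static BitSet not(BitSet a, int total) {
        BitSet r = (BitSet) a.clone();
        r.flip(0, total);
        return r;
    }

    public static void main(String[] args) {
        int total = 10;                     // a toy library of ten paragraphs
        BitSet god = new BitSet(total);
        BitSet devil = new BitSet(total);
        god.set(1); god.set(3); god.set(4); god.set(7);
        devil.set(3); devil.set(5); devil.set(7);

        BitSet both = and(god, devil);      // God & devil
        BitSet rest = not(both, total);     // !(God & devil)
        System.out.println(both.cardinality() + " + "
            + rest.cardinality() + " = " + total);   // prints "2 + 8 = 10"
    }
}
```

A query and its negation always split the paragraphs in two, just like
7 + 36126 = 36133 above.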

We already know that this bible doesn't contain the word 'computer'; let's see
if it contains 'device' or 'devices':

    query: =/device(s?)/
    ... quite a bit of output
    query time: 510ms
    results: 36
    query:

It took 510 milliseconds to cough up those 36 results; that is because regular
expression queries have to plough through all the text, i.e. they can't
simply consult the index for their results.
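
To make that 'ploughing' concrete, here is a minimal sketch of a full scan
with java.util.regex. It is not the query code from the attachment; in
particular, matching the expression against whole words (the \b anchors) is
my assumption, and the sample paragraphs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: a regular expression query has to visit every paragraph.
public class RegexScanSketch {

    // Returns the paragraphs in which the expression matches a whole word
    // (the whole-word rule is an assumption made for this sketch).
    static List<String> scan(List<String> paragraphs, String regex) {
        Pattern p = Pattern.compile("\\b(?:" + regex + ")\\b");
        List<String> hits = new ArrayList<>();
        for (String paragraph : paragraphs) {   // no index to consult here
            if (p.matcher(paragraph).find()) {
                hits.add(paragraph);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
            "a cunning device", "many devices", "nothing to see here");
        System.out.println(scan(paragraphs, "device(s?)").size() + " results");
    }
}
```

The cost grows with the amount of text, whereas a plain word query only has
to look the word up in the index.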

But the remarkable thing is that they did know about devices in those days ;-)

Before I programmed this little utility I checked whether or not those bibles
contained the word 'exit'; they don't, so I decided that if the user types
the word 'exit' the utility exits. Nice and convenient. Feel free to change
the source of the utilities and/or the builders and library code; you can
even build a nice GUI around it all if you feel like it.
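
If you want a starting point for your own front end, a prompt loop with that
'exit' convention can be as small as the sketch below. It is not the RunLib
source; the class name is made up and the place where a real utility would
hand the query to the library is only marked by a comment.

```java
import java.io.InputStreamReader;
import java.util.Scanner;

// Illustration only: a 'query: ' prompt loop that stops at the word "exit".
public class PromptLoopSketch {

    // Reads queries until end of input or "exit"; returns how many were handled.
    static int run(Readable in) {
        Scanner scanner = new Scanner(in);
        int handled = 0;
        System.out.print("query: ");
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine().trim();
            if (line.equals("exit")) {
                break;
            }
            // a real utility would feed the line to the library here
            System.out.println("would run: " + line);
            handled++;
            System.out.print("query: ");
        }
        return handled;
    }

    public static void main(String[] args) {
        run(new InputStreamReader(System.in));
    }
}
```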

Concluding remarks and acknowledgements

This article part completes the Text Processing article series. We have seen
the processor and builder parts that had to do the dirty work, i.e. they had
to clean up the raw (and sometimes inconsistently structured) text. The library
itself is a clean piece of software; it is even Serializable for convenience.
The queries offer quite complicated functionality; we have just seen a little
bit of it in this article; read the previous part of this article series for a
complete description of the functionality offered by the query subsystem.

Thanks again to Prometheuzz for supplying the necessary data/web space, and a
belated thanks to the late Dirk for donating his old desktop computer to me;
otherwise I'd never have developed a text processing engine, because I wouldn't
have had a substantial body of coherent text to play with.

I'll cook something up for the next article (series?). If you have read this
complete series and downloaded the code you have something fun, compact and
quite powerful stored on your computer disk. See you next week and

kind regards,

Jos
Sep 2 '07 #1