Text retrieval systems - 7: the Software and Data


Last week I was a bit too busy to cook up this part of the article series; sorry
for that. This article part wraps up the Text Processing article series. The
attachment contains the complete source code which was explained in the previous
article parts. Download it, extract the sources somewhere and have a look.

The data I've been using all the time can be found here. There are
two zip files to be found:
  • kjbible.zip
  • svbible.zip
The first zip file contains all the books present in the King James bible; the
second zip file contains all the books present in the Dutch Staten Vertaling
bible. Download them and unzip them somewhere, preferably in separate directories.

Now it's time to see what this text processing article was all about.

Compiling the source files

All the source files are stored in their correct directory, corresponding to
the packages in which they are defined. I am not going to explain how you
should compile everything; you're supposed to know that by now. Simply compile
everything and you're in business.

Creating the libraries

In the class directory you can find two small, simple utilities that start
the entire library builder framework. The utilities can be found in the
default package. The utilities are named MakeKJLib.class and MakeSVLib.class.
I think you can guess what they do.

The utilities take three parameters:
  • -i the directory where the separate .txt files are stored.
  • -o the name (and directory) where the created library should be stored.
  • -n the name of the created library.
The last parameter is optional, and a sensible name is given to the library
if you don't explicitly give it a name yourself.
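
For readers who want to see how such "-flag value" parameters could be picked apart, here is a minimal sketch. It is not the code from the attachment; the class name LibOptions and the parse method are made up for illustration, and only the flag names and the default library name come from this article.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: collects "-flag value" pairs from a command line.
// LibOptions and parse() are hypothetical names, not part of the attached sources.
class LibOptions {

    // Reads the arguments two at a time: a flag followed by its value.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            options.put(args[i], args[i + 1]);
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(args);
        String in = options.get("-i");
        String out = options.get("-o");
        // -n is optional; fall back to a default name when it is absent
        String name = options.getOrDefault("-n", "King James Bible");
        System.out.println(in + " -> " + out + " (" + name + ")");
    }
}
```

The sketch does no validation at all; a real utility would at least check that -i and -o were given.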

As an example, suppose you have stored the .txt files for the King James bible
in c:\kj\text; suppose you want the library to be stored under the name
c:\kj\lib\kjbible.txtlib.

This is how you build it:

  java -cp . MakeKJLib -i c:/kj/text/ -o c:/kj/lib/kjbible.txtlib

Note that the directories already have to exist and also note the trailing
slash character for the input directory.

After a short while you should see the following output:

  King James Bible [15267, 15232, 3, 82, 1352, 36133]
  build time: 7931ms

The build time doesn't need an explanation: it was the time taken on my laptop
to read all the files and build the entire library.

The default name for this library is "King James Bible". The numbers following
the name have the following meaning:
  • the number of unique words
  • the number of unique words in the wordMap index
  • the number of groups
  • the number of books
  • the number of chapters
  • the number of paragraphs
As you can see 15267-15232 = 35 'noise' words were filtered out from the index.
See the previous article parts for an explanation of all this.

Now we have a compressed library in c:\kj\lib\kjbible.txtlib

If you want to create a library for the Dutch Staten Vertaling, just follow
the same steps. The default name for that bible is "Staten Vertaling".

So a library has a name and it is stored in a file with any name you like.

The exact sizes of both libraries are:
  • King James bible: 2,987,979 bytes
  • Staten Vertaling bible: 3,149,545 bytes
The Dutch version is a bit bigger; here are its statistics as shown by the
builder utility:

  Staten Vertaling [23141, 23096, 2, 82, 1370, 37235]
  build time: 6799ms

As you can see there are more unique words but only two groups: the Old Testament
and the New Testament. The King James bible has a separate group for the
Apocrypha (see a previous article part for an explanation about those books).

The sizes of the non-compressed raw texts are:
  • King James bible: 5,387,972 bytes
  • Staten Vertaling bible: 5,577,169 bytes
The sizes are as reported by the 'dir' command. As you can see the created
libraries are quite a bit smaller than the original raw texts, even though the
libraries also contain some quite useful indexes.
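
To put a number on "quite a bit smaller", the ratios can be worked out from the byte counts quoted above. A small sketch; the class name SizeRatio is mine and not part of the attached sources:

```java
// Illustrative only: library size as a percentage of the raw text size,
// using the byte counts quoted in this article.
class SizeRatio {

    // Returns the library size as a percentage of the raw text size.
    static double percent(long librarySize, long rawSize) {
        return 100.0 * librarySize / rawSize;
    }

    public static void main(String[] args) {
        System.out.printf("King James:       %.1f%%%n", percent(2_987_979L, 5_387_972L));
        System.out.printf("Staten Vertaling: %.1f%%%n", percent(3_149_545L, 5_577_169L));
    }
}
```

Both libraries come out at roughly 55 to 56 percent of the size of their raw text, indexes included.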

Let's play a bit with those libraries:

The RunLib utility

A third utility present in the source code is the RunLib utility; it opens
a library and takes queries from the user. It feeds the queries to the library,
gets a Query back, asks it for the results and displays them. This is the way
to start the utility if you want to play with the kjbible.txtlib:

  java -cp . RunLib c:/kj/lib/kjbible.txtlib

The utility prints the simple statistics from the loaded library and prompts
you for a query. If you want to play with the Dutch Staten Vertaling bible
instead it probably is obvious what to do.

If you don't supply it with a library name or if you supply it with an incorrect
name this simple utility just dies after printing out a stacktrace. In no case
is any harm done to the library itself.


The utility prints out a 'query: ' prompt. Let's give it a silly query twice:

  query: computer
  query time: 90ms
  results: 0
  query: computer
  query time: 0ms
  results: 0
  query: 

I asked if the word 'computer' is present in that King James bible. Obviously
it isn't, but notice that the first query took 90 milliseconds on my laptop
while the second identical query took less than a millisecond; that is because
the first time most of the classes still needed to be loaded. When the second
query was issued all classes were loaded already. Part of the speed increase
may also come from the HotSpot JIT compiler, which finds frequently executed
code and compiles it to native machine code.
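
You can see this warm-up effect in any Java program, not just in RunLib. The following sketch has nothing to do with the library code; it simply times the same arbitrary piece of work a few times in one JVM. The class name WarmUp and the work it does are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: times the same piece of work several times in one JVM.
class WarmUp {

    // Some arbitrary work; on first use it triggers class loading.
    static int work() {
        Pattern pattern = Pattern.compile("device(s?)");
        int hits = 0;
        for (int i = 0; i < 10_000; i++) {
            if (pattern.matcher("a mischievous device " + i).find()) hits++;
        }
        return hits;
    }

    // Runs work() once and returns the elapsed time in nanoseconds.
    static long time() {
        long start = System.nanoTime();
        work();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int run = 1; run <= 5; run++) {
            System.out.println("run " + run + ": " + time() / 1_000 + " us");
        }
    }
}
```

The exact numbers depend on your machine and JVM, but the first run is typically the slowest one by a wide margin.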

Let's give it a sensible query:

  query: God & devil
  New Testament    Luke    4    2    And the devil said unto him, If thou be the Son of God, command this stone that it be made bread.
  New Testament    Acts    10    37    How God anointed Jesus of Nazareth with the Holy Ghost and with power: who went about doing good, and healing all that were oppressed of the devil; for God was with him.
  New Testament    Ephesians    6    10    Put on the whole armour of God, that ye may be able to stand against the wiles of the devil.
  New Testament    James    4    6    Submit yourselves therefore to God. Resist the devil, and he will flee from you.
  New Testament    I John    3    7    He that committeth sin is of the devil; for the devil sinneth from the beginning. For this purpose the Son of God was manifested, that he might destroy the works of the devil.
  New Testament    I John    3    9    In this the children of God are manifest, and the children of the devil: whosoever doeth not righteousness is not of God, neither he that loveth not his brother.
  Apocrypha    Tobit    6    16    And the devil shall smell it, and flee away, and never come again any more: but when thou shalt come to her, rise up both of you, and pray to God which is merciful, who will have pity on you, and save you: fear not, for she is appointed unto thee from the beginning; and thou shalt preserve her, and she shall go with thee. Moreover I suppose that she shall bear thee children. Now when Tobias had heard these things, he loved her, and his heart was effectually joined to her.
  query time: 441ms
  results: 7

So there are seven paragraphs in the King James bible that have both the words
'God' and 'devil' in them. Let's see what this query does:

  query: !(God & devil)
  ... lots of output here
  query time: 30ms
  results: 36126

Well, that makes sense: there are seven paragraphs with both the words 'God' and
'devil' in them and there are 36126 paragraphs that don't. There are 36133
paragraphs in total so that adds up nicely. Note that the query took only thirty
milliseconds.
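
The bookkeeping behind such '&' and '!' queries can be sketched with plain bit sets. This is a toy model and not the library's code: it assumes, purely for illustration, that every word has a java.util.BitSet with one bit per paragraph. The class name BitSetQuery and the ten-paragraph 'library' are made up.

```java
import java.util.BitSet;

// Illustrative only: one bit per paragraph, set when the word occurs in it.
class BitSetQuery {

    // Paragraphs containing both words: bitwise AND of the two sets.
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    // Paragraphs NOT in the set: flip every bit in [0, total).
    static BitSet not(BitSet a, int total) {
        BitSet result = (BitSet) a.clone();
        result.flip(0, total);
        return result;
    }

    public static void main(String[] args) {
        int total = 10;                       // toy library of 10 paragraphs
        BitSet god = new BitSet(total);
        BitSet devil = new BitSet(total);
        god.set(1); god.set(3); god.set(4); god.set(8);
        devil.set(3); devil.set(5); devil.set(8);

        BitSet both = and(god, devil);        // paragraphs 3 and 8
        BitSet notBoth = not(both, total);    // the other 8 paragraphs
        System.out.println(both.cardinality() + " + " + notBoth.cardinality()
                + " = " + total);             // prints "2 + 8 = 10"
    }
}
```

Whatever the set is, the number of matches and the number of non-matches always add up to the total number of paragraphs, just like 7 + 36126 = 36133 above.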

We already know that this bible doesn't contain the word 'computer'; let's see
if it contains 'device' or 'devices':

  query: =/device(s?)/
  ... quite a bit of output
  query time: 510ms
  results: 36
  query: 

It took 510 milliseconds to cough up those 36 results; that is because regular
expression queries have to plough through all the text, i.e. they can't
simply consult the index for their results.
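
Here is a minimal sketch of what 'ploughing through all the text' amounts to. It is not the library's query code; the class name RegexScan and the three sample paragraphs are made up, and I assume here that a paragraph counts as a result when the pattern matches anywhere in it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a regular expression query has to test every paragraph.
class RegexScan {

    // Returns how many paragraphs contain a match for the given regex.
    static int count(List<String> paragraphs, String regex) {
        Pattern pattern = Pattern.compile(regex);
        int results = 0;
        for (String paragraph : paragraphs) {      // no index: visit them all
            if (pattern.matcher(paragraph).find()) results++;
        }
        return results;
    }

    public static void main(String[] args) {
        // made-up sample text, not taken from either bible
        List<String> paragraphs = Arrays.asList(
                "a wicked device against the people",
                "the devices of the crafty",
                "in the beginning");
        System.out.println("results: " + count(paragraphs, "device(s?)"));  // prints "results: 2"
    }
}
```

The loop visits every paragraph no matter how few of them match, so the cost grows with the size of the text rather than with the number of results.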

But the remarkable thing is that they did know about devices in those days ;-)

Before I programmed this little utility I checked whether or not those bibles
contained the word 'exit'; they don't, so I decided that if the user types
the word 'exit' the utility exits. Nice and convenient. Feel free to change
the source of the utilities and/or the builders and library code; you can
even build a nice GUI around it all if you feel like it.
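
If you want a starting point for such changes, the prompt loop itself can be as small as the sketch below. It is not the RunLib source; the class name QueryLoop is made up and the place where a real utility would hand the query to the library is only marked with a comment.

```java
import java.io.InputStreamReader;
import java.util.Scanner;

// Illustrative only: a 'query: ' prompt loop that stops on the word 'exit'.
class QueryLoop {

    // Reads queries until 'exit' or end of input; returns how many were handled.
    static int run(Readable in) {
        Scanner scanner = new Scanner(in);
        int handled = 0;
        while (true) {
            System.out.print("query: ");
            if (!scanner.hasNextLine()) break;      // end of input
            String line = scanner.nextLine().trim();
            if (line.equals("exit")) break;         // the magic word
            long start = System.currentTimeMillis();
            // a real utility would feed 'line' to the library here
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("query time: " + elapsed + "ms");
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        run(new InputStreamReader(System.in));
    }
}
```

Because the loop reads from any Readable you can feed it a StringReader with a few canned queries, which is handy for testing.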

Concluding remarks and acknowledgements

This article part completes the Text Processing article series. We have seen
the processor and builder parts that had to do the dirty work, i.e. they had
to clean up the raw (and sometimes inconsistently structured) text. The library
itself is a clean piece of software; it is even Serializable for convenience.
The queries offer quite complicated functionality; we have just seen a little
bit of it in this article; read the previous part of this article for a complete
description of what the query subsystem offers.

Thanks again to Prometheuzz for supplying the necessary data and web space,
and a belated thanks to the late Dirk for donating his old desktop computer to
me; otherwise I'd never have developed a text processing engine, because I
wouldn't have had a substantial body of coherent text to play with.

I'll cook something up for the next article (series?). If you have read this
complete series and downloaded the code you have something fun, compact and
quite powerful stored on your computer disk. See you next week and

kind regards,

Sep 2 '07 #1
