By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
429,022 Members | 856 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

Text retrieval systems - 3A: the Library Builder

Expert 10K+
P: 11,448
Greetings,

Introduction

Before we start designing and implementing our text builder class(es), I'd like
to mention a reply by Prometheuzz: he had a Dutch version of the entire bible
available, including those apocryphical books. I downloaded the entire shebang,
hacked my King James text processor a bit and now I have two bibles available:
the English 'King James' version and the Dutch 'Staten Vertaling' version.

The Dutch version was even bigger than the King James version, so there :-P

Thanks to Prometheuzz for that valuable piece of text; I really appreciate it.
The Dutch version had less (read: no) inconsistencies while my late neighbour
Dirk's version was a mess here and there. Most of the raw parsing text trouble
happened in Dirk's version. Just a bit of rewriting and quick hacking was
enough to be able to process the Dutch version supplied by Prometheuzz.

The LibraryBuilder

We've got work to do: here's the LibraryBuilder interface again:

Expand|Select|Wrap|Line Numbers
  1. public interface LibraryBuilder {
  2.  
  3.     public void preProcess();
  4.     public void postProcess();
  5.  
  6.     public void setTitle(String title);
  7.  
  8.     public void buildGroup(String group);
  9.     public void buildParagraph(String book, String chapter, int para, String text) throws IOException;
  10.  
  11.     public Library build();
  12. }
  13.  
The AbstractBuilder class implements all of this interface and leaves a few
abstract protected methods to be implemented by the specific builders such as
the KJBuilder (for the King James bible version) and the SVBuilder (for the
Staten Vertaling version).

The AbstractBuilder doesn't do all the work itself, i.e. it delegates a lot of
the work to smaller and simpler builders.

Here's the simples builder of them all: the StringsBuilder:

Expand|Select|Wrap|Line Numbers
  1. class StringsBuilder extends ArrayList<String> {
  2.  
  3.     private static final long serialVersionUID = 4709552086518644567L;
  4.  
  5.     String[] build() { return this.toArray(new String[size()]); }
  6. }
  7.  
It is simply an implementation of the List<String> interface and it can build
a String[] for us. That's all it does. Remember that the Library itself just
uses a bunch of arrays for the entire text. Arrays can't be resized; that's
one of the reasons I prefer builders that manipulate more flexible, but larger
data structures (classes) while at the end all the arrays are produced.

Here's another simple builder; it builds a single section:

Expand|Select|Wrap|Line Numbers
  1. package text.builders;
  2.  
  3. import text.library.Section;
  4.  
  5. class SectionBuilder {
  6.  
  7.     private String name;
  8.     private int index;
  9.     private int total;
  10.  
  11.     SectionBuilder(String name, int index) {
  12.  
  13.         this.name = name;
  14.         this.index= index;
  15.     }
  16.  
  17.     void setTotal(int end) { this.total= end-index; }
  18.  
  19.     String getName() { return name; }
  20.  
  21.     Section build() { return new Section(name, index, total); }
  22. }
  23.  
A Section can be either a section in a group, a book or a chapter. The 'index'
member refers to the corresponding section in an arrray of groups, books or
chapters. The 'total' member is the total number of sections in that particular
array. Also see the previous part of this article for examples of Sections.

The above code can build a Section. The next little builder builds an array
of Sections for us; here it is:

Expand|Select|Wrap|Line Numbers
  1. class SectionsBuilder extends ArrayList<SectionBuilder> {
  2.  
  3.     private static final long serialVersionUID = 1476840322184839715L;
  4.  
  5.     boolean addSection(String name, int index, boolean add) {
  6.  
  7.         int n= size();
  8.         SectionBuilder s= (n == 0)?null:get(n-1);
  9.  
  10.         if (add= (add || s == null || !s.getName().equals(name))) {
  11.             add(new SectionBuilder(name, index));
  12.             if (s != null) s.setTotal(index);
  13.         }    
  14.  
  15.         return add;
  16.     }
  17.  
  18.     void postProcess(int total) {
  19.  
  20.         if (size() > 0)
  21.             get(size()-1).setTotal(total);
  22.     }
  23.  
  24.     Section[] build() { 
  25.  
  26.         int n= size();
  27.         Section[] sections= new Section[n];
  28.  
  29.         for (int i= 0; i < n; i++)
  30.             sections[i]= this.get(i).build();
  31.  
  32.         return sections;
  33.     }
  34. }
  35.  
The SectionsBuilder is a List of SectionBuilders (not the different plurals).
When a new Section is about to start, it figures out how many 'things' are
in the previous section. See the addSection() method for that. The boolean
'add' variable (if true) forces a new Section to be build no matter what.
When this variable is false a new Section needs to be built if the name of
the current Section differs from the previous Section, or, if the Section to
be built is the very first Section of them all.

Actually this builder doesn't keep track of Sections at all: it handles
SectionBuilders. The build() method of this builder builds the actual Sections:
it invokes the build() method on every SectionBuilder in its list which in turn
build a single Section each.

Intermezzo

Let's pause for a minute here: we can build single Sections (using the
SectionBuilder) and we can build arrays of Sections (using the SectionsBuilder).
We can also build arrays of Strings using the StringsBuilder.

But we need more: we need to know in which paragraphs a word occurs. In other
words (sic) we need a mapping between a single word and list of indexes in the
paragraphs List. A single paragraph is a String representing the indexes of
the single words and punctuation characters again. (see the previous article
part for how paragraphs are compressed).

More work to do ...

More builders

Here is a WordBuilder; it keeps track of the paragraph numbers in which a word
occurs. It doesn't care which word though; that has been figured out somewhere
else. It does know that it's the 'index'-th word in the entire text. Here's
the implementation:

Expand|Select|Wrap|Line Numbers
  1. class WordBuilder {
  2.  
  3.     private int index;
  4.     private List<Integer> paragraphs= new ArrayList<Integer>();
  5.  
  6.     WordBuilder(int paragraph) {
  7.  
  8.         this.index= paragraph;
  9.     }
  10.  
  11.     int getIndex() { return index; }
  12.  
  13.     void addParagraph(int paragraph) { paragraphs.add(paragraph); }
  14.  
  15.     int[] build() {
  16.  
  17.         int n= paragraphs.size();
  18.         int array[]= new int[n];
  19.  
  20.         for (int i= 0; i < n; i++) array[i]= paragraphs.get(i);
  21.  
  22.         return array;
  23.     }
  24. }
  25.  
It basically keeps track of a list of paragraph numbers in which this particular
word occurs and it can build an array of integers, the indexes of the paragraphs
in which the word occurs again. It's not particularly rocket science.

The last builder maps words to WordBuilders; it doesn't come as a surprise that
this builder is named WordMapBuilder; here it is:

Expand|Select|Wrap|Line Numbers
  1. class WordMapBuilder extends HashMap<String, WordBuilder> {
  2.  
  3.     private static final long serialVersionUID = -5282145470567550534L;
  4.  
  5.     WordMap build() {
  6.  
  7.         WordMap map= new WordMap();
  8.  
  9.         for (Map.Entry<String, WordBuilder> entry : entrySet()) 
  10.             map.put(entry.getKey(), entry.getValue().build());
  11.  
  12.         return map;
  13.     }
  14. }
  15.  
It maps a word (a String) to a WordBuilder (see above) and it can produce
a WordMap. We'll see that thing later but I think you can guess what it is: it
maps Strings (words) to Integer[]s which are the paragraph numbers in which the
words occur.

AbstractBuilder

We have the constituents for our LibraryBuilder implementation now, let's see
what this implementation is all about. This class is an abstract class because
it leaves a few simple methods to an extending class:

Expand|Select|Wrap|Line Numbers
  1. protected abstract String clean(String book, String chapter, 
  2.                 int paragraph, String text);
  3. protected abstract Collection<String> getNoise();
  4.  
The first method is supposed to clean up a paragraph text. Most of it has been
cleaned up already in a general way, but a specific builder may want to clean
up the text even a bit more.

In Dirk's King James version some of the words had curly braces and some other
braces around them. I don't want those curly braces, so my KJBuilder's clean()
method implementation looks like this:

Expand|Select|Wrap|Line Numbers
  1. protected String clean(String book, String chapter, 
  2.             int paragraph, String text) {
  3.  
  4.     text= text.replaceAll("[({\\[\\]})]", "");
  5.  
  6.     return text.replaceAll("[^\\x20-\\x7e]", "");
  7. }
  8.  
I clean up the text a bit more by removing all the non 7 bits ASCII characters
from the text (remember those 'y's with dots on top from the first article part?
This gets rid of them).

The second method is supposed to return a Collection of Strings that are not
supposed to be indexed at all. Think of words such as 'the', 'a', 'is' etc.
The AbstractBuilder does all the work, it just needs to know *which* words are
supposed to be excluded from the WordMap (see above).

Here is how the KJBuilder (extending the AbstractBuilder) returns its noise words:

Expand|Select|Wrap|Line Numbers
  1. private static final Set<String> noise= new HashSet<String>();
  2.  
  3. static { 
  4.     noise.addAll(Arrays.asList(new String[] {
  5.  
  6.         "O", "s",
  7.         "The", "the", 
  8.         "A", "a", "An", "an", 
  9.         "Of", "of", "By", "by", "On", "on",
  10.         "Am", "am", "Are", "are", "Is", "is", "Be", "be",
  11.         "Had", "had", "Has", "has", "Have", "have",
  12.         "At", "at", "As", "as", "In", "in", 
  13.         "Do", "do",
  14.  
  15.     }));
  16. }
  17.  
  18. protected Collection<String> getNoise() {
  19.  
  20.     return noise;
  21. }
  22.  
A HashSet is a Set and a Set is a Collection. The Arrays.asList() method can
build a List out of an array and a Set can take a List which is also a
Collection as a constructor parameter and that's what is finally returned by
the getNoise() method.

This was enough typing for today; I go get myself some coffee. I hope to see
you in the next fragment of this article part below.

kind regards,

Jos

Part B
Jul 21 '07 #1
Share this Article
Share on Google+