473,385 Members | 1,720 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes and contribute your articles to a community of 473,385 developers and data experts.

Text retrieval systems - 3A: the Library Builder

11,448 Expert 8TB
Greetings,

Introduction

Before we start designing and implementing our text builder class(es), I'd like
to mention a reply by Prometheuzz: he had a Dutch version of the entire bible
available, including those apocryphical books. I downloaded the entire shebang,
hacked my King James text processor a bit and now I have two bibles available:
the English 'King James' version and the Dutch 'Staten Vertaling' version.

The Dutch version was even bigger than the King James version, so there :-P

Thanks to Prometheuzz for that valuable piece of text; I really appreciate it.
The Dutch version had less (read: no) inconsistencies while my late neighbour
Dirk's version was a mess here and there. Most of the raw parsing text trouble
happened in Dirk's version. Just a bit of rewriting and quick hacking was
enough to be able to process the Dutch version supplied by Prometheuzz.

The LibraryBuilder

We've got work to do: here's the LibraryBuilder interface again:

Expand|Select|Wrap|Line Numbers
  1. public interface LibraryBuilder {
  2.  
  3.     public void preProcess();
  4.     public void postProcess();
  5.  
  6.     public void setTitle(String title);
  7.  
  8.     public void buildGroup(String group);
  9.     public void buildParagraph(String book, String chapter, int para, String text) throws IOException;
  10.  
  11.     public Library build();
  12. }
  13.  
The AbstractBuilder class implements all of this interface and leaves a few
abstract protected methods to be implemented by the specific builders such as
the KJBuilder (for the King James bible version) and the SVBuilder (for the
Staten Vertaling version).

The AbstractBuilder doesn't do all the work itself, i.e. it delegates a lot of
the work to smaller and simpler builders.

Here's the simples builder of them all: the StringsBuilder:

Expand|Select|Wrap|Line Numbers
  1. class StringsBuilder extends ArrayList<String> {
  2.  
  3.     private static final long serialVersionUID = 4709552086518644567L;
  4.  
  5.     String[] build() { return this.toArray(new String[size()]); }
  6. }
  7.  
It is simply an implementation of the List<String> interface and it can build
a String[] for us. That's all it does. Remember that the Library itself just
uses a bunch of arrays for the entire text. Arrays can't be resized; that's
one of the reasons I prefer builders that manipulate more flexible, but larger
data structures (classes) while at the end all the arrays are produced.

Here's another simple builder; it builds a single section:

Expand|Select|Wrap|Line Numbers
  1. package text.builders;
  2.  
  3. import text.library.Section;
  4.  
  5. class SectionBuilder {
  6.  
  7.     private String name;
  8.     private int index;
  9.     private int total;
  10.  
  11.     SectionBuilder(String name, int index) {
  12.  
  13.         this.name = name;
  14.         this.index= index;
  15.     }
  16.  
  17.     void setTotal(int end) { this.total= end-index; }
  18.  
  19.     String getName() { return name; }
  20.  
  21.     Section build() { return new Section(name, index, total); }
  22. }
  23.  
A Section can be either a section in a group, a book or a chapter. The 'index'
member refers to the corresponding section in an arrray of groups, books or
chapters. The 'total' member is the total number of sections in that particular
array. Also see the previous part of this article for examples of Sections.

The above code can build a Section. The next little builder builds an array
of Sections for us; here it is:

Expand|Select|Wrap|Line Numbers
  1. class SectionsBuilder extends ArrayList<SectionBuilder> {
  2.  
  3.     private static final long serialVersionUID = 1476840322184839715L;
  4.  
  5.     boolean addSection(String name, int index, boolean add) {
  6.  
  7.         int n= size();
  8.         SectionBuilder s= (n == 0)?null:get(n-1);
  9.  
  10.         if (add= (add || s == null || !s.getName().equals(name))) {
  11.             add(new SectionBuilder(name, index));
  12.             if (s != null) s.setTotal(index);
  13.         }    
  14.  
  15.         return add;
  16.     }
  17.  
  18.     void postProcess(int total) {
  19.  
  20.         if (size() > 0)
  21.             get(size()-1).setTotal(total);
  22.     }
  23.  
  24.     Section[] build() { 
  25.  
  26.         int n= size();
  27.         Section[] sections= new Section[n];
  28.  
  29.         for (int i= 0; i < n; i++)
  30.             sections[i]= this.get(i).build();
  31.  
  32.         return sections;
  33.     }
  34. }
  35.  
The SectionsBuilder is a List of SectionBuilders (not the different plurals).
When a new Section is about to start, it figures out how many 'things' are
in the previous section. See the addSection() method for that. The boolean
'add' variable (if true) forces a new Section to be build no matter what.
When this variable is false a new Section needs to be built if the name of
the current Section differs from the previous Section, or, if the Section to
be built is the very first Section of them all.

Actually this builder doesn't keep track of Sections at all: it handles
SectionBuilders. The build() method of this builder builds the actual Sections:
it invokes the build() method on every SectionBuilder in its list which in turn
build a single Section each.

Intermezzo

Let's pause for a minute here: we can build single Sections (using the
SectionBuilder) and we can build arrays of Sections (using the SectionsBuilder).
We can also build arrays of Strings using the StringsBuilder.

But we need more: we need to know in which paragraphs a word occurs. In other
words (sic) we need a mapping between a single word and list of indexes in the
paragraphs List. A single paragraph is a String representing the indexes of
the single words and punctuation characters again. (see the previous article
part for how paragraphs are compressed).

More work to do ...

More builders

Here is a WordBuilder; it keeps track of the paragraph numbers in which a word
occurs. It doesn't care which word though; that has been figured out somewhere
else. It does know that it's the 'index'-th word in the entire text. Here's
the implementation:

Expand|Select|Wrap|Line Numbers
  1. class WordBuilder {
  2.  
  3.     private int index;
  4.     private List<Integer> paragraphs= new ArrayList<Integer>();
  5.  
  6.     WordBuilder(int paragraph) {
  7.  
  8.         this.index= paragraph;
  9.     }
  10.  
  11.     int getIndex() { return index; }
  12.  
  13.     void addParagraph(int paragraph) { paragraphs.add(paragraph); }
  14.  
  15.     int[] build() {
  16.  
  17.         int n= paragraphs.size();
  18.         int array[]= new int[n];
  19.  
  20.         for (int i= 0; i < n; i++) array[i]= paragraphs.get(i);
  21.  
  22.         return array;
  23.     }
  24. }
  25.  
It basically keeps track of a list of paragraph numbers in which this particular
word occurs and it can build an array of integers, the indexes of the paragraphs
in which the word occurs again. It's not particularly rocket science.

The last builder maps words to WordBuilders; it doesn't come as a surprise that
this builder is named WordMapBuilder; here it is:

Expand|Select|Wrap|Line Numbers
  1. class WordMapBuilder extends HashMap<String, WordBuilder> {
  2.  
  3.     private static final long serialVersionUID = -5282145470567550534L;
  4.  
  5.     WordMap build() {
  6.  
  7.         WordMap map= new WordMap();
  8.  
  9.         for (Map.Entry<String, WordBuilder> entry : entrySet()) 
  10.             map.put(entry.getKey(), entry.getValue().build());
  11.  
  12.         return map;
  13.     }
  14. }
  15.  
It maps a word (a String) to a WordBuilder (see above) and it can produce
a WordMap. We'll see that thing later but I think you can guess what it is: it
maps Strings (words) to Integer[]s which are the paragraph numbers in which the
words occur.

AbstractBuilder

We have the constituents for our LibraryBuilder implementation now, let's see
what this implementation is all about. This class is an abstract class because
it leaves a few simple methods to an extending class:

Expand|Select|Wrap|Line Numbers
  1. protected abstract String clean(String book, String chapter, 
  2.                 int paragraph, String text);
  3. protected abstract Collection<String> getNoise();
  4.  
The first method is supposed to clean up a paragraph text. Most of it has been
cleaned up already in a general way, but a specific builder may want to clean
up the text even a bit more.

In Dirk's King James version some of the words had curly braces and some other
braces around them. I don't want those curly braces, so my KJBuilder's clean()
method implementation looks like this:

Expand|Select|Wrap|Line Numbers
  1. protected String clean(String book, String chapter, 
  2.             int paragraph, String text) {
  3.  
  4.     text= text.replaceAll("[({\\[\\]})]", "");
  5.  
  6.     return text.replaceAll("[^\\x20-\\x7e]", "");
  7. }
  8.  
I clean up the text a bit more by removing all the non 7 bits ASCII characters
from the text (remember those 'y's with dots on top from the first article part?
This gets rid of them).

The second method is supposed to return a Collection of Strings that are not
supposed to be indexed at all. Think of words such as 'the', 'a', 'is' etc.
The AbstractBuilder does all the work, it just needs to know *which* words are
supposed to be excluded from the WordMap (see above).

Here is how the KJBuilder (extending the AbstractBuilder) returns its noise words:

Expand|Select|Wrap|Line Numbers
  1. private static final Set<String> noise= new HashSet<String>();
  2.  
  3. static { 
  4.     noise.addAll(Arrays.asList(new String[] {
  5.  
  6.         "O", "s",
  7.         "The", "the", 
  8.         "A", "a", "An", "an", 
  9.         "Of", "of", "By", "by", "On", "on",
  10.         "Am", "am", "Are", "are", "Is", "is", "Be", "be",
  11.         "Had", "had", "Has", "has", "Have", "have",
  12.         "At", "at", "As", "as", "In", "in", 
  13.         "Do", "do",
  14.  
  15.     }));
  16. }
  17.  
  18. protected Collection<String> getNoise() {
  19.  
  20.     return noise;
  21. }
  22.  
A HashSet is a Set and a Set is a Collection. The Arrays.asList() method can
build a List out of an array and a Set can take a List which is also a
Collection as a constructor parameter and that's what is finally returned by
the getNoise() method.

This was enough typing for today; I go get myself some coffee. I hope to see
you in the next fragment of this article part below.

kind regards,

Jos

Part B
Jul 21 '07 #1
0 3785

Sign in to post your reply or Sign up for a free account.

Similar topics

6
by: hplloyd | last post by:
I am using some code off the web that requires a string builder in a script within a web page. In order to get the script to work I need to include the relevant library in the references section...
0
by: SoftComplete Development | last post by:
AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing...
0
by: JosAH | last post by:
Greetings, Introduction At the end of the last Compiler article part I stated that I wanted to write about text processing. I had no idea what exactly to talk about; until my wife commanded...
0
by: JosAH | last post by:
Greetings, Introduction Last week I started thinking about a text processing facility. I already found a substantial amount of text: a King James version of the bible. I'm going to use that...
0
by: JosAH | last post by:
Greetings, the last two article parts described the design and implementation of the text Processor which spoonfeeds paragraphs of text to the LibraryBuilder. The latter object organizes, cleans...
0
by: JosAH | last post by:
Greetings, Introduction At this moment we have a TextProcessor, a LibraryBuilder as well as the Library itself. As you read last week a Library is capable of producing pieces of text in a...
1
by: JosAH | last post by:
Greetings, Introduction This week we start building Query objects. A query can retrieve portions of text from a Library. I don't want users to build queries by themselves, because users make...
0
by: JosAH | last post by:
Greetings, Introduction Last week I was a bit too busy to cook up this part of the article series; sorry for that. This article part wraps up the Text Processing article series. The ...
0
by: JosAH | last post by:
Greetings, welcome back; above we discussed the peripherals of the Library class: loading and saving such an instantiation of it, the BookMark interface and then some. This part of the article...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.