By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
435,320 Members | 1,893 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

Text retrieval systems - 4B: the Library

Expert 10K+
P: 11,448
Greetings,

welcome back; above we discussed the peripherals of the Library class: loading
and saving such an instantiation of it, the BookMark interface and then some.
This part of the article discusses the internals of the Library class a bit more.

Sections again

A previous article part showed how Sections work, i.e. a group Section refers
to book Sections, a book Section refers to chapter Sections and the latter refer
to paragraphs of text.

There are a few methods implemented in the Library class that handle all the
Section manipulation. Here are the first two; they find the index of a Section
in a Section array given the name of a Section:

Expand|Select|Wrap|Line Numbers
  1.  
  2. private int getIndex(Section[] sections, String name) {
  3.  
  4.     return getIndex(sections, name, 0, sections.length);
  5. }
  6.  
  7. private int getIndex(Section[] sections, String name, int lo, int n) {
  8.  
  9.     for (int i= lo; n-- > 0; i++)
  10.         if (sections[i].getName().equals(name)) return i;
  11.  
  12.     return -1;
  13. }
  14.  
The first method finds the index of a Section in an array of Sections given
the name of the Section. It delegates the work to the second method which
takes an interval of indexes in the form of the lowest index number to be
considered and the number of adjacent sections in the array.

This is hardly rocket science, but we need these methods nevertheless. The next
method is a bit more clever: suppose we want to find the index of a book given
the absolute index number of a chapter; the following method handles it all:

Expand|Select|Wrap|Line Numbers
  1. private int getIndex(Section[] section, int index) {
  2.  
  3.     for (int lo= 0, hi= section.length-1; lo <= hi; ) {
  4.         int mid= (lo+hi)>>>1;
  5.         int ilo= section[mid].getIndex();
  6.         int ihi= ilo+section[mid].getTotal();
  7.  
  8.         if (ilo <= index && ihi > index) return mid;
  9.         if (ihi <= index) lo= mid+1;
  10.         else              hi= mid-1;
  11.     }
  12.  
  13.     return -1;
  14. }
  15.  
Given an array of Sections and an index number, this method applies a binary
search through the array and finds the index of the Section that 'embraces'
the wanted index, i.e. the Section contains the index passed in as a parameter.

For example, the King James bible contains 1352 chapters and 36133 paragraphs.
Given a paragraph number this method can find the corresponding chapter Section
in 2log(1352) operations, which is less than 11 search actions at most.

Results are even better when we're searching for the book Section given the
index of a chapter Section etc. because there are far less books than there
are chapters and there are far less groups then there are books.

Believe it or not, but the methods described above are the only 'intelligence'
needed and implemented in the Library class.

Simple statistics

The library object can produce simple statistics too and the methods that
implement them are ridicilously simple; here they are:

Expand|Select|Wrap|Line Numbers
  1. public String getTitle() { return title; }
  2.  
  3. public int getGroupSize() { return groups.length; }
  4. public int getBookSize() { return books.length; }
  5. public int getChapterSize() { return chapters.length; }
  6. public int getParagraphSize() { return paragraphs.length; }
  7. public int getWordSize() { return words.length; }
  8. public int getIndexSize() { return wordMap.size(); }
  9. public int getOccurrences(String word) {
  10.  
  11.     int[] paragraphs= wordMap.get(word);
  12.  
  13.     return (paragraphs == null)?0:paragraphs.length;
  14. }
  15.  
The first method isn't actually a statistics type of method but I include it
anyway. The other methods simple return the lengths of the various arrays and
the WordMap which plays a crucial role in the text retrieval functionality.

Here's another simple method I implemented:

Expand|Select|Wrap|Line Numbers
  1. public int[] getIndexes(String word) { 
  2.  
  3.     int[] paragraphs= wordMap.get(word);
  4.  
  5.     return (paragraphs == null)?null:Arrays.copyOf(paragraphs, paragraphs.length);
  6. }
  7.  
The method returns a copy of all the index values where a certain word occurs.
A copy of the original array is returnd because I don't want any caller to
accidentally destroy the values in the original array(s). There's one caller
I trust: me; *ahem* so especially for me I wrote this method:

Expand|Select|Wrap|Line Numbers
  1. int[] getWordIndexes(String word) { return wordMap.get(word); }
  2.  
I need that little method later when we are going to implement more fancy text
retrieval functionality. Note that this isn't a public method.

Text decoding

As the previous article part explained, the text is compressed and encoded.
The paragraph strings aren't the literal text but just a representation thereof.
The following (non public!) method retrieves the actual text again given such an
encoded paragraph String:

Expand|Select|Wrap|Line Numbers
  1. String decompress(String text) {
  2.  
  3.     StringBuffer sb= new StringBuffer(); 
  4.  
  5.     for (int i= 0, n= text.length(); i < n; i++) {
  6.  
  7.         char c= text.charAt(i);
  8.  
  9.         if (c < 0x20) 
  10.             sb.append(punctuation[0].charAt(c)).append(' ');
  11.         else if (c < 0x100)
  12.                 sb.append(punctuation[1].charAt(c&0x1f));
  13.         else {
  14.             sb.append(words[c-0x100]);
  15.             if (i < n-1 && text.charAt(i+1) >= 0x100)
  16.                 sb.append(' ');
  17.         }
  18.     }
  19.  
  20.     return sb.toString();
  21. }    
  22.  
If a character from the paragraph String represents one of the punctuation
characters it is translated back and a white space is appended if needed.
Otherwise the character represents an index in the words array; the word is
retrieved and optionally a space is appended if the current word is followed
by another word. Compare this method agaist the 'compress()' method from the
previous article part: it simply performs the reverse operations.

Simple text retrieval

Before we dig into the real retrieval, here is the simple functionality: the
Library class makes use of a very simple class for this, the SectionList class.
This is a list of sections, or in Java a: List<String>. This is how it is done:

Expand|Select|Wrap|Line Numbers
  1. class SectionList extends AbstractList<String> {
  2.  
  3.     private Section[] section;
  4.     private int lo, n;
  5.  
  6.     SectionList(Section[] section, int lo, int n) {
  7.  
  8.         this.section= section;
  9.         this.lo        = lo;
  10.         this.n        = n;
  11.     }
  12.  
  13.     public String get(int i) { return section[i+lo].getName(); }
  14.  
  15.     public int size() { return n; }
  16. }
  17.  
The constructor takes an array of Sections, a 'lo' index and the number of
adjacent Sections which is also the length of the List itself. The list has
values of type String; the Strings are retrieved from the individual Sections
themselves. Note that the AbstractList<String> superclass takes care of the
rest of the functionality that makes this object a full fledged List.

All the following methods make use of this little class:

Expand|Select|Wrap|Line Numbers
  1. public List<String> getGroups() { 
  2.  
  3.     return new SectionList(groups, 0, groups.length); 
  4. }
  5.  
  6. public List<String> getBooks() { 
  7.  
  8.     return new SectionList(books, 0, books.length); 
  9. }
  10.  
  11. public List<String> getBooks(String group) {
  12.  
  13.     int i= getIndex(groups, group);
  14.  
  15.     return (i < 0)?null:new SectionList(books, groups[i].getIndex(), groups[i].getTotal());
  16. }
  17.  
  18. public List<String> getChapters() { 
  19.  
  20.     return new SectionList(chapters, 0, chapters.length); 
  21. }
  22.  
  23. public List<String> getChapters(String book) {
  24.  
  25.     int i= getIndex(books, book);
  26.  
  27.     return (i < 0)?null:new SectionList(chapters, books[i].getIndex(), books[i].getTotal());
  28. }
  29.  
Note how most of the methods use the 'getIndex()' method described above in
order to get the range of indexes in one of the Section arrays. The StringList
takes care of the rest. It really is that simple. The last method returns a
List that contains all the unique words in the entire text but not in any
particular order.

The next little method returns an unmodifiable list of all the unique words
in no particular order, just in case you're interested in such a list:

Expand|Select|Wrap|Line Numbers
  1. public List<String> getWords() {
  2.  
  3.     return Collections.unmodifiableList(Arrays.asList(words));
  4. }
  5.  
Here are a few examples given the fact that our example Library stores the
King James bible text:

Expand|Select|Wrap|Line Numbers
  1. Library lib= Library.load("/usr/jos/tmp/kj.txtlib"); // get the library
  2.  
  3. // show all books:
  4. List<String> books= lib.getBooks();
  5. for (String book : books)
  6.     System.out.println(book);
  7.  
  8. // show all chapters in book "Genesis"
  9. List<String> chapters= lib.getChapters("Genesis");
  10. for (String chapter : chapters)
  11.     System.out.println(chapter);
  12.  
For the actual paragraph text we can't use this StringList but we just define
another little class: the BookMark class. We also need another method in the
Library class itself:

Expand|Select|Wrap|Line Numbers
  1. BookMark getBookMark(int paragraph) { return new BookMarkImpl(paragraph); }
  2.  
We don't know what a BookMarkImpl class is, but for sure it is an implementation
of the BookMark interface. I'll explain this implementation later. Here is our
BookMarkList:

Expand|Select|Wrap|Line Numbers
  1. class BookMarkList extends AbstractList<BookMark> {
  2.  
  3.     private Library lib;
  4.     private int lo, n;
  5.  
  6.     BookMarkList(Library lib, int lo, int n) {
  7.  
  8.         this.lib= lib;
  9.         this.lo    = lo;
  10.         this.n    = n;
  11.     }
  12.  
  13.     public BookMark get(int i) { return lib.getBookMark(i+lo); }
  14.  
  15.     public int size() { return n; }
  16. }
  17.  
Note that this class knows nothing about a BookMarkImpl class, i.e. it is
just interested in BookMarks (the interface). Analogous to our previous List
class it implementes an AbstractList for the hard work and it returns a
BookMark for its 'get()' method. We trust that mysterious BookMarkImpl class
to have an appropriate 'toString()' implemented.

Here are the methods in the Library class that use these classes:

Expand|Select|Wrap|Line Numbers
  1. public List<BookMark> getParagraphs() { 
  2.  
  3.     return new BookMarkList(this, 0, paragraphs.length); 
  4. }
  5.  
  6. public List<BookMark> getParagraphs(String book) {
  7.  
  8.     int i= getIndex(books, book);
  9.  
  10.     if (i < 0) return null;
  11.  
  12.     int clo= books[i].getIndex();
  13.     int chi= books[i].getTotal()+clo-1;
  14.  
  15.     int p= chapters[clo].getIndex();
  16.     int n= chapters[chi].getIndex()+chapters[chi].getTotal()-p;
  17.  
  18.     return new BookMarkList(this, p, n);
  19. }
  20.  
  21. public List<BookMark> getParagraphs(String book, String chapter) {
  22.  
  23.     int i= getIndex(books, book);
  24.  
  25.     if (i < 0) return null;
  26.  
  27.     int c= getIndex(chapters, chapter, books[i].getIndex(), books[i].getTotal());
  28.  
  29.     return (c < 0)?null:new BookMarkList(this, chapters[c].getIndex(), chapters[c].getTotal());
  30. }
  31.  
Again observe that the 'getIndex()' method is heavily used again. All that these
methods do is figure out which parts of the paragraph array must be included in
the List<BookMark>.

An example use could be:

Expand|Select|Wrap|Line Numbers
  1. Library lib= Library.load("/usr/jos/tmp/kj.txtlib"); // get the library
  2.  
  3. // show all text for the book Genesis:
  4. List<Bookmark> paragraphs= lib.getParagraphs("Genesis");
  5. for (BookMark paragraph : paragraphs)
  6.     System.out.println(paragraph);
  7.  
  8. // show all text in the entire bible"
  9. List<BookMark> paragraphs= lib.getParagraphs();
  10. for (BookMark paragraph: paragraphs)
  11.     System.out.println(paragraph);
  12.  
Concluding remarks

The methods described above allow us to do all of the following:
  • Produce a list of all group names
  • Produce a list of all book names
  • Produce a list of book names in a group
  • Produce a list of all chapter names
  • Produce a list of chapter names in book
  • Produce a list of all paragraphs
  • Produce a list of paragraphs in a book
  • Produce a list of paragraphs in a chapter of a book

This is fun but it still doesn't allow for quite advanced text retrieval; i.e.
with a bit of 'grep' fiddling we can do the same and even more. The next chapter
introduces Query objects that allow us to really dig into the text and produce
quite advanced results.

This article part was full of code again and I call it a day. I hope to see you
next time in the next part of this article. I'll attach all the code we have so
far then and I still have to find a place for those two bible texts so you
can upload them, convert them into Library objects and play with it.

For now I hope you study the code a bit, notify me if you find bugs or if you
don't understand something or just to say what a great guy I am ;-)

I hope to see you again next week and,

kind regards,

Jos
Jul 29 '07 #1
Share this Article
Share on Google+