Text retrieval systems - 4B: the Library

11,448 Expert 8TB

Greetings,

welcome back; above we discussed the peripherals of the Library class: loading
and saving such an instantiation of it, the BookMark interface and then some.
This part of the article discusses the internals of the Library class a bit more.

Sections again

A previous article part showed how Sections work, i.e. a group Section refers
to book Sections, a book Section refers to chapter Sections and the latter refer
to paragraphs of text.

There are a few methods implemented in the Library class that handle all the
Section manipulation. Here are the first two; they find the index of a Section
in a Section array given the name of a Section:

Expand|Select|Wrap|Line Numbers

  
private int getIndex(Section[] sections, String name) {
 
    return getIndex(sections, name, 0, sections.length);

}
 
private int getIndex(Section[] sections, String name, int lo, int n) {
 
    for (int i= lo; n-- > 0; i++)

        if (sections[i].getName().equals(name)) return i;
 
    return -1;

}

The first method finds the index of a Section in an array of Sections given
the name of the Section. It delegates the work to the second method which
takes an interval of indexes in the form of the lowest index number to be
considered and the number of adjacent sections in the array.

This is hardly rocket science, but we need these methods nevertheless. The next
method is a bit more clever: suppose we want to find the index of a book given
the absolute index number of a chapter; the following method handles it all:

Expand|Select|Wrap|Line Numbers

 
private int getIndex(Section[] section, int index) {
 
    for (int lo= 0, hi= section.length-1; lo <= hi; ) {

        int mid= (lo+hi)>>>1;

        int ilo= section[mid].getIndex();

        int ihi= ilo+section[mid].getTotal();
 
        if (ilo <= index && ihi > index) return mid;

        if (ihi <= index) lo= mid+1;

        else              hi= mid-1;

    }
 
    return -1;

}

Given an array of Sections and an index number, this method applies a binary
search through the array and finds the index of the Section that 'embraces'
the wanted index, i.e. the Section contains the index passed in as a parameter.

For example, the King James bible contains 1352 chapters and 36133 paragraphs.
Given a paragraph number this method can find the corresponding chapter Section
in 2log(1352) operations, which is less than 11 search actions at most.

Results are even better when we're searching for the book Section given the
index of a chapter Section etc. because there are far less books than there
are chapters and there are far less groups then there are books.

Believe it or not, but the methods described above are the only 'intelligence'
needed and implemented in the Library class.

Simple statistics

The library object can produce simple statistics too and the methods that
implement them are ridicilously simple; here they are:

Expand|Select|Wrap|Line Numbers

 
public String getTitle() { return title; }
 
public int getGroupSize() { return groups.length; }

public int getBookSize() { return books.length; }

public int getChapterSize() { return chapters.length; }

public int getParagraphSize() { return paragraphs.length; }

public int getWordSize() { return words.length; }

public int getIndexSize() { return wordMap.size(); }

public int getOccurrences(String word) {
 
    int[] paragraphs= wordMap.get(word);
 
    return (paragraphs == null)?0:paragraphs.length;

}

The first method isn't actually a statistics type of method but I include it
anyway. The other methods simple return the lengths of the various arrays and
the WordMap which plays a crucial role in the text retrieval functionality.

Here's another simple method I implemented:

Expand|Select|Wrap|Line Numbers

 
public int[] getIndexes(String word) { 
 
    int[] paragraphs= wordMap.get(word);
 
    return (paragraphs == null)?null:Arrays.copyOf(paragraphs, paragraphs.length);

}

The method returns a copy of all the index values where a certain word occurs.
A copy of the original array is returnd because I don't want any caller to
accidentally destroy the values in the original array(s). There's one caller
I trust: me; *ahem* so especially for me I wrote this method:

Expand|Select|Wrap|Line Numbers

int[] getWordIndexes(String word) { return wordMap.get(word); }

I need that little method later when we are going to implement more fancy text
retrieval functionality. Note that this isn't a public method.

Text decoding

As the previous article part explained, the text is compressed and encoded.
The paragraph strings aren't the literal text but just a representation thereof.
The following (non public!) method retrieves the actual text again given such an
encoded paragraph String:

Expand|Select|Wrap|Line Numbers

 
String decompress(String text) {
 
    StringBuffer sb= new StringBuffer(); 
 
    for (int i= 0, n= text.length(); i < n; i++) {
 
        char c= text.charAt(i);
 
        if (c < 0x20) 

            sb.append(punctuation[0].charAt(c)).append(' ');

        else if (c < 0x100)

                sb.append(punctuation[1].charAt(c&0x1f));

        else {

            sb.append(words[c-0x100]);

            if (i < n-1 && text.charAt(i+1) >= 0x100)

                sb.append(' ');

        }

    }
 
    return sb.toString();

}

If a character from the paragraph String represents one of the punctuation
characters it is translated back and a white space is appended if needed.
Otherwise the character represents an index in the words array; the word is
retrieved and optionally a space is appended if the current word is followed
by another word. Compare this method agaist the 'compress()' method from the
previous article part: it simply performs the reverse operations.

Simple text retrieval

Before we dig into the real retrieval, here is the simple functionality: the
Library class makes use of a very simple class for this, the SectionList class.
This is a list of sections, or in Java a: List<String>. This is how it is done:

Expand|Select|Wrap|Line Numbers

 
class SectionList extends AbstractList<String> {
 
    private Section[] section;

    private int lo, n;
 
    SectionList(Section[] section, int lo, int n) {
 
        this.section= section;

        this.lo        = lo;

        this.n        = n;

    }
 
    public String get(int i) { return section[i+lo].getName(); }
 
    public int size() { return n; }

}

The constructor takes an array of Sections, a 'lo' index and the number of
adjacent Sections which is also the length of the List itself. The list has
values of type String; the Strings are retrieved from the individual Sections
themselves. Note that the AbstractList<String> superclass takes care of the
rest of the functionality that makes this object a full fledged List.

All the following methods make use of this little class:

Expand|Select|Wrap|Line Numbers

 
public List<String> getGroups() { 
 
    return new SectionList(groups, 0, groups.length); 

}
 
public List<String> getBooks() { 
 
    return new SectionList(books, 0, books.length); 

}
 
public List<String> getBooks(String group) {
 
    int i= getIndex(groups, group);
 
    return (i < 0)?null:new SectionList(books, groups[i].getIndex(), groups[i].getTotal());

}
 
public List<String> getChapters() { 
 
    return new SectionList(chapters, 0, chapters.length); 

}
 
public List<String> getChapters(String book) {
 
    int i= getIndex(books, book);
 
    return (i < 0)?null:new SectionList(chapters, books[i].getIndex(), books[i].getTotal());

}

Note how most of the methods use the 'getIndex()' method described above in
order to get the range of indexes in one of the Section arrays. The StringList
takes care of the rest. It really is that simple. The last method returns a
List that contains all the unique words in the entire text but not in any
particular order.

The next little method returns an unmodifiable list of all the unique words
in no particular order, just in case you're interested in such a list:

Expand|Select|Wrap|Line Numbers

 
public List<String> getWords() {
 
    return Collections.unmodifiableList(Arrays.asList(words));

}

Here are a few examples given the fact that our example Library stores the
King James bible text:

Expand|Select|Wrap|Line Numbers

 
Library lib= Library.load("/usr/jos/tmp/kj.txtlib"); // get the library
 
// show all books:

List<String> books= lib.getBooks();

for (String book : books)

    System.out.println(book);
 
// show all chapters in book "Genesis"

List<String> chapters= lib.getChapters("Genesis");

for (String chapter : chapters)

    System.out.println(chapter);

For the actual paragraph text we can't use this StringList but we just define
another little class: the BookMark class. We also need another method in the
Library class itself:

Expand|Select|Wrap|Line Numbers

 
BookMark getBookMark(int paragraph) { return new BookMarkImpl(paragraph); }

We don't know what a BookMarkImpl class is, but for sure it is an implementation
of the BookMark interface. I'll explain this implementation later. Here is our
BookMarkList:

Expand|Select|Wrap|Line Numbers

 
class BookMarkList extends AbstractList<BookMark> {
 
    private Library lib;

    private int lo, n;
 
    BookMarkList(Library lib, int lo, int n) {
 
        this.lib= lib;

        this.lo    = lo;

        this.n    = n;

    }
 
    public BookMark get(int i) { return lib.getBookMark(i+lo); }
 
    public int size() { return n; }

}

Note that this class knows nothing about a BookMarkImpl class, i.e. it is
just interested in BookMarks (the interface). Analogous to our previous List
class it implementes an AbstractList for the hard work and it returns a
BookMark for its 'get()' method. We trust that mysterious BookMarkImpl class
to have an appropriate 'toString()' implemented.

Here are the methods in the Library class that use these classes:

Expand|Select|Wrap|Line Numbers

 
public List<BookMark> getParagraphs() { 
 
    return new BookMarkList(this, 0, paragraphs.length); 

}
 
public List<BookMark> getParagraphs(String book) {
 
    int i= getIndex(books, book);
 
    if (i < 0) return null;
 
    int clo= books[i].getIndex();

    int chi= books[i].getTotal()+clo-1;
 
    int p= chapters[clo].getIndex();

    int n= chapters[chi].getIndex()+chapters[chi].getTotal()-p;
 
    return new BookMarkList(this, p, n);

}
 
public List<BookMark> getParagraphs(String book, String chapter) {
 
    int i= getIndex(books, book);
 
    if (i < 0) return null;
 
    int c= getIndex(chapters, chapter, books[i].getIndex(), books[i].getTotal());
 
    return (c < 0)?null:new BookMarkList(this, chapters[c].getIndex(), chapters[c].getTotal());

}

Again observe that the 'getIndex()' method is heavily used again. All that these
methods do is figure out which parts of the paragraph array must be included in
the List<BookMark>.

An example use could be:

Expand|Select|Wrap|Line Numbers

 
Library lib= Library.load("/usr/jos/tmp/kj.txtlib"); // get the library
 
// show all text for the book Genesis:

List<Bookmark> paragraphs= lib.getParagraphs("Genesis");

for (BookMark paragraph : paragraphs)

    System.out.println(paragraph);
 
// show all text in the entire bible"

List<BookMark> paragraphs= lib.getParagraphs();

for (BookMark paragraph: paragraphs)

    System.out.println(paragraph);

Concluding remarks

The methods described above allow us to do all of the following:

Produce a list of all group names
Produce a list of all book names
Produce a list of book names in a group
Produce a list of all chapter names
Produce a list of chapter names in book
Produce a list of all paragraphs
Produce a list of paragraphs in a book
Produce a list of paragraphs in a chapter of a book

This is fun but it still doesn't allow for quite advanced text retrieval; i.e.
with a bit of 'grep' fiddling we can do the same and even more. The next chapter
introduces Query objects that allow us to really dig into the text and produce
quite advanced results.

This article part was full of code again and I call it a day. I hope to see you
next time in the next part of this article. I'll attach all the code we have so
far then and I still have to find a place for those two bible texts so you
can upload them, convert them into Library objects and play with it.

For now I hope you study the code a bit, notify me if you find bugs or if you
don't understand something or just to say what a great guy I am ;-)

I hope to see you again next week and,

kind regards,

Jos

Jul 29 '07 #1

Subscribe Post Reply

3473

by: SoftComplete Development | last post by:

AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing...

Microsoft SQL Server

Search for an expression in a text file

by: ritesh | last post by:

Hi, I'm working on a piece of code that - 1. has a list of text files 2. and needs to search for a particular expression in these files (this is being done on Linux using gcc 3.4.2) ...

C / C++

Text retrieval systems - 1: Introduction

by: JosAH | last post by:

Greetings, Introduction At the end of the last Compiler article part I stated that I wanted to write about text processing. I had no idea what exactly to talk about; until my wife commanded...

Java

Text retrieval systems - 2A: Text Processors

by: JosAH | last post by:

Greetings, Introduction Last week I started thinking about a text processing facility. I already found a substantial amount of text: a King James version of the bible. I'm going to use that...

Java

Text retrieval systems - 3A: the Library Builder

by: JosAH | last post by:

Greetings, Introduction Before we start designing and implementing our text builder class(es), I'd like to mention a reply by Prometheuzz: he had a Dutch version of the entire bible ...

Java

Text retrieval systems - 4A: the Library

by: JosAH | last post by:

Greetings, the last two article parts described the design and implementation of the text Processor which spoonfeeds paragraphs of text to the LibraryBuilder. The latter object organizes, cleans...

Java

Text retrieval systems - 5: the BookMark implementation

by: JosAH | last post by:

Greetings, Introduction At this moment we have a TextProcessor, a LibraryBuilder as well as the Library itself. As you read last week a Library is capable of producing pieces of text in a...

Java

Text retrieval systems - 6: Queries

by: JosAH | last post by:

Greetings, Introduction This week we start building Query objects. A query can retrieve portions of text from a Library. I don't want users to build queries by themselves, because users make...

Java

Text retrieval systems - 7: the Software and Data

by: JosAH | last post by:

Greetings, Introduction Last week I was a bit too busy to cook up this part of the article series; sorry for that. This article part wraps up the Text Processing article series. The ...

Java

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Text retrieval systems - 4B: the Library

Similar topics