473,719 Members | 2,239 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Text retrieval systems - 4B: the Library

11,448 Recognized Expert MVP
Greetings,

welcome back; above we discussed the peripherals of the Library class: loading
and saving such an instantiation of it, the BookMark interface and then some.
This part of the article discusses the internals of the Library class a bit more.

Sections again

A previous article part showed how Sections work, i.e. a group Section refers
to book Sections, a book Section refers to chapter Sections and the latter refer
to paragraphs of text.

There are a few methods implemented in the Library class that handle all the
Section manipulation. Here are the first two; they find the index of a Section
in a Section array given the name of a Section:

Expand|Select|Wrap|Line Numbers
  1.  
  2. private int getIndex(Section[] sections, String name) {
  3.  
  4.     return getIndex(sections, name, 0, sections.length);
  5. }
  6.  
  7. private int getIndex(Section[] sections, String name, int lo, int n) {
  8.  
  9.     for (int i= lo; n-- > 0; i++)
  10.         if (sections[i].getName().equals(name)) return i;
  11.  
  12.     return -1;
  13. }
  14.  
The first method finds the index of a Section in an array of Sections given
the name of the Section. It delegates the work to the second method which
takes an interval of indexes in the form of the lowest index number to be
considered and the number of adjacent sections in the array.

This is hardly rocket science, but we need these methods nevertheless. The next
method is a bit more clever: suppose we want to find the index of a book given
the absolute index number of a chapter; the following method handles it all:

Expand|Select|Wrap|Line Numbers
  1. private int getIndex(Section[] section, int index) {
  2.  
  3.     for (int lo= 0, hi= section.length-1; lo <= hi; ) {
  4.         int mid= (lo+hi)>>>1;
  5.         int ilo= section[mid].getIndex();
  6.         int ihi= ilo+section[mid].getTotal();
  7.  
  8.         if (ilo <= index && ihi > index) return mid;
  9.         if (ihi <= index) lo= mid+1;
  10.         else              hi= mid-1;
  11.     }
  12.  
  13.     return -1;
  14. }
  15.  
Given an array of Sections and an index number, this method applies a binary
search through the array and finds the index of the Section that 'embraces'
the wanted index, i.e. the Section contains the index passed in as a parameter.

For example, the King James bible contains 1352 chapters and 36133 paragraphs.
Given a paragraph number this method can find the corresponding chapter Section
in 2log(1352) operations, which is less than 11 search actions at most.

Results are even better when we're searching for the book Section given the
index of a chapter Section etc. because there are far less books than there
are chapters and there are far less groups then there are books.

Believe it or not, but the methods described above are the only 'intelligence'
needed and implemented in the Library class.

Simple statistics

The library object can produce simple statistics too and the methods that
implement them are ridicilously simple; here they are:

Expand|Select|Wrap|Line Numbers
  1. public String getTitle() { return title; }
  2.  
  3. public int getGroupSize() { return groups.length; }
  4. public int getBookSize() { return books.length; }
  5. public int getChapterSize() { return chapters.length; }
  6. public int getParagraphSize() { return paragraphs.length; }
  7. public int getWordSize() { return words.length; }
  8. public int getIndexSize() { return wordMap.size(); }
  9. public int getOccurrences(String word) {
  10.  
  11.     int[] paragraphs= wordMap.get(word);
  12.  
  13.     return (paragraphs == null)?0:paragraphs.length;
  14. }
  15.  
The first method isn't actually a statistics type of method but I include it
anyway. The other methods simple return the lengths of the various arrays and
the WordMap which plays a crucial role in the text retrieval functionality.

Here's another simple method I implemented:

Expand|Select|Wrap|Line Numbers
  1. public int[] getIndexes(String word) { 
  2.  
  3.     int[] paragraphs= wordMap.get(word);
  4.  
  5.     return (paragraphs == null)?null:Arrays.copyOf(paragraphs, paragraphs.length);
  6. }
  7.  
The method returns a copy of all the index values where a certain word occurs.
A copy of the original array is returnd because I don't want any caller to
accidentally destroy the values in the original array(s). There's one caller
I trust: me; *ahem* so especially for me I wrote this method:

Expand|Select|Wrap|Line Numbers
  1. int[] getWordIndexes(String word) { return wordMap.get(word); }
  2.  
I need that little method later when we are going to implement more fancy text
retrieval functionality. Note that this isn't a public method.

Text decoding

As the previous article part explained, the text is compressed and encoded.
The paragraph strings aren't the literal text but just a representation thereof.
The following (non public!) method retrieves the actual text again given such an
encoded paragraph String:

Expand|Select|Wrap|Line Numbers
  1. String decompress(String text) {
  2.  
  3.     StringBuffer sb= new StringBuffer(); 
  4.  
  5.     for (int i= 0, n= text.length(); i < n; i++) {
  6.  
  7.         char c= text.charAt(i);
  8.  
  9.         if (c < 0x20) 
  10.             sb.append(punctuation[0].charAt(c)).append(' ');
  11.         else if (c < 0x100)
  12.                 sb.append(punctuation[1].charAt(c&0x1f));
  13.         else {
  14.             sb.append(words[c-0x100]);
  15.             if (i < n-1 && text.charAt(i+1) >= 0x100)
  16.                 sb.append(' ');
  17.         }
  18.     }
  19.  
  20.     return sb.toString();
  21. }    
  22.  
If a character from the paragraph String represents one of the punctuation
characters it is translated back and a white space is appended if needed.
Otherwise the character represents an index in the words array; the word is
retrieved and optionally a space is appended if the current word is followed
by another word. Compare this method agaist the 'compress()' method from the
previous article part: it simply performs the reverse operations.

Simple text retrieval

Before we dig into the real retrieval, here is the simple functionality: the
Library class makes use of a very simple class for this, the SectionList class.
This is a list of sections, or in Java a: List<String>. This is how it is done:

Expand|Select|Wrap|Line Numbers
  1. class SectionList extends AbstractList<String> {
  2.  
  3.     private Section[] section;
  4.     private int lo, n;
  5.  
  6.     SectionList(Section[] section, int lo, int n) {
  7.  
  8.         this.section= section;
  9.         this.lo        = lo;
  10.         this.n        = n;
  11.     }
  12.  
  13.     public String get(int i) { return section[i+lo].getName(); }
  14.  
  15.     public int size() { return n; }
  16. }
  17.  
The constructor takes an array of Sections, a 'lo' index and the number of
adjacent Sections which is also the length of the List itself. The list has
values of type String; the Strings are retrieved from the individual Sections
themselves. Note that the AbstractList<St ring> superclass takes care of the
rest of the functionality that makes this object a full fledged List.

All the following methods make use of this little class:

Expand|Select|Wrap|Line Numbers
  1. public List<String> getGroups() { 
  2.  
  3.     return new SectionList(groups, 0, groups.length); 
  4. }
  5.  
  6. public List<String> getBooks() { 
  7.  
  8.     return new SectionList(books, 0, books.length); 
  9. }
  10.  
  11. public List<String> getBooks(String group) {
  12.  
  13.     int i= getIndex(groups, group);
  14.  
  15.     return (i < 0)?null:new SectionList(books, groups[i].getIndex(), groups[i].getTotal());
  16. }
  17.  
  18. public List<String> getChapters() { 
  19.  
  20.     return new SectionList(chapters, 0, chapters.length); 
  21. }
  22.  
  23. public List<String> getChapters(String book) {
  24.  
  25.     int i= getIndex(books, book);
  26.  
  27.     return (i < 0)?null:new SectionList(chapters, books[i].getIndex(), books[i].getTotal());
  28. }
  29.  
Note how most of the methods use the 'getIndex()' method described above in
order to get the range of indexes in one of the Section arrays. The StringList
takes care of the rest. It really is that simple. The last method returns a
List that contains all the unique words in the entire text but not in any
particular order.

The next little method returns an unmodifiable list of all the unique words
in no particular order, just in case you're interested in such a list:

Expand|Select|Wrap|Line Numbers
  1. public List<String> getWords() {
  2.  
  3.     return Collections.unmodifiableList(Arrays.asList(words));
  4. }
  5.  
Here are a few examples given the fact that our example Library stores the
King James bible text:

Expand|Select|Wrap|Line Numbers
  1. Library lib= Library.load("/usr/jos/tmp/kj.txtlib"); // get the library
  2.  
  3. // show all books:
  4. List<String> books= lib.getBooks();
  5. for (String book : books)
  6.     System.out.println(book);
  7.  
  8. // show all chapters in book "Genesis"
  9. List<String> chapters= lib.getChapters("Genesis");
  10. for (String chapter : chapters)
  11.     System.out.println(chapter);
  12.  
For the actual paragraph text we can't use this StringList but we just define
another little class: the BookMark class. We also need another method in the
Library class itself:

Expand|Select|Wrap|Line Numbers
  1. BookMark getBookMark(int paragraph) { return new BookMarkImpl(paragraph); }
  2.  
We don't know what a BookMarkImpl class is, but for sure it is an implementation
of the BookMark interface. I'll explain this implementation later. Here is our
BookMarkList:

Expand|Select|Wrap|Line Numbers
  1. class BookMarkList extends AbstractList<BookMark> {
  2.  
  3.     private Library lib;
  4.     private int lo, n;
  5.  
  6.     BookMarkList(Library lib, int lo, int n) {
  7.  
  8.         this.lib= lib;
  9.         this.lo    = lo;
  10.         this.n    = n;
  11.     }
  12.  
  13.     public BookMark get(int i) { return lib.getBookMark(i+lo); }
  14.  
  15.     public int size() { return n; }
  16. }
  17.  
Note that this class knows nothing about a BookMarkImpl class, i.e. it is
just interested in BookMarks (the interface). Analogous to our previous List
class it implementes an AbstractList for the hard work and it returns a
BookMark for its 'get()' method. We trust that mysterious BookMarkImpl class
to have an appropriate 'toString()' implemented.

Here are the methods in the Library class that use these classes:

Expand|Select|Wrap|Line Numbers
  1. public List<BookMark> getParagraphs() { 
  2.  
  3.     return new BookMarkList(this, 0, paragraphs.length); 
  4. }
  5.  
  6. public List<BookMark> getParagraphs(String book) {
  7.  
  8.     int i= getIndex(books, book);
  9.  
  10.     if (i < 0) return null;
  11.  
  12.     int clo= books[i].getIndex();
  13.     int chi= books[i].getTotal()+clo-1;
  14.  
  15.     int p= chapters[clo].getIndex();
  16.     int n= chapters[chi].getIndex()+chapters[chi].getTotal()-p;
  17.  
  18.     return new BookMarkList(this, p, n);
  19. }
  20.  
  21. public List<BookMark> getParagraphs(String book, String chapter) {
  22.  
  23.     int i= getIndex(books, book);
  24.  
  25.     if (i < 0) return null;
  26.  
  27.     int c= getIndex(chapters, chapter, books[i].getIndex(), books[i].getTotal());
  28.  
  29.     return (c < 0)?null:new BookMarkList(this, chapters[c].getIndex(), chapters[c].getTotal());
  30. }
  31.  
Again observe that the 'getIndex()' method is heavily used again. All that these
methods do is figure out which parts of the paragraph array must be included in
the List<BookMark>.

An example use could be:

Expand|Select|Wrap|Line Numbers
  1. Library lib= Library.load("/usr/jos/tmp/kj.txtlib"); // get the library
  2.  
  3. // show all text for the book Genesis:
  4. List<Bookmark> paragraphs= lib.getParagraphs("Genesis");
  5. for (BookMark paragraph : paragraphs)
  6.     System.out.println(paragraph);
  7.  
  8. // show all text in the entire bible"
  9. List<BookMark> paragraphs= lib.getParagraphs();
  10. for (BookMark paragraph: paragraphs)
  11.     System.out.println(paragraph);
  12.  
Concluding remarks

The methods described above allow us to do all of the following:
  • Produce a list of all group names
  • Produce a list of all book names
  • Produce a list of book names in a group
  • Produce a list of all chapter names
  • Produce a list of chapter names in book
  • Produce a list of all paragraphs
  • Produce a list of paragraphs in a book
  • Produce a list of paragraphs in a chapter of a book

This is fun but it still doesn't allow for quite advanced text retrieval; i.e.
with a bit of 'grep' fiddling we can do the same and even more. The next chapter
introduces Query objects that allow us to really dig into the text and produce
quite advanced results.

This article part was full of code again and I call it a day. I hope to see you
next time in the next part of this article. I'll attach all the code we have so
far then and I still have to find a place for those two bible texts so you
can upload them, convert them into Library objects and play with it.

For now I hope you study the code a bit, notify me if you find bugs or if you
don't understand something or just to say what a great guy I am ;-)

I hope to see you again next week and,

kind regards,

Jos
Jul 29 '07 #1
0 3505

Sign in to post your reply or Sign up for a free account.

Similar topics

0
1632
by: SoftComplete Development | last post by:
AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing technology provides you with highest indexing performance, possibility to index very large sets of data in minimal time even with memory constraints and unbelievable fast query processing speed. The main AlphaTIX's feature that makes it first and...
15
2141
by: ritesh | last post by:
Hi, I'm working on a piece of code that - 1. has a list of text files 2. and needs to search for a particular expression in these files (this is being done on Linux using gcc 3.4.2) Currently the search is done using the 'grep' utility on linux. This takes too much time, and kills the responsiveness of the application.
0
4417
by: JosAH | last post by:
Greetings, Introduction At the end of the last Compiler article part I stated that I wanted to write about text processing. I had no idea what exactly to talk about; until my wife commanded me to "clean up that mess you never use anyway and please dump the rest of it in the attic or simply throw that junk away". I want to make a statement here:
0
4056
by: JosAH | last post by:
Greetings, Introduction Last week I started thinking about a text processing facility. I already found a substantial amount of text: a King James version of the bible. I'm going to use that text for the examples in this article. I want to 'transform' the entire text to a Java object that allows me to search through the entire text in a quite powerful way. Of course that object must be
0
3830
by: JosAH | last post by:
Greetings, Introduction Before we start designing and implementing our text builder class(es), I'd like to mention a reply by Prometheuzz: he had a Dutch version of the entire bible available, including those apocryphical books. I downloaded the entire shebang, hacked my King James text processor a bit and now I have two bibles available: the English 'King James' version and the Dutch 'Staten Vertaling' version.
0
4089
by: JosAH | last post by:
Greetings, the last two article parts described the design and implementation of the text Processor which spoonfeeds paragraphs of text to the LibraryBuilder. The latter object organizes, cleans up and stores the text being fed to it. Finally the LibrayBuilder is able to produce a Library which is the topic of this part of the article. Introduction
0
4071
by: JosAH | last post by:
Greetings, Introduction At this moment we have a TextProcessor, a LibraryBuilder as well as the Library itself. As you read last week a Library is capable of producing pieces of text in a simple way. We also briefly mentioned the BookMark which represents a single paragraph of text. We haven't seen it's implementation yet. This is the topic of this week's article part.
1
4435
by: JosAH | last post by:
Greetings, Introduction This week we start building Query objects. A query can retrieve portions of text from a Library. I don't want users to build queries by themselves, because users make mistakes. Instead, the Library hands out queries to the user given a simple query String. This is how the library does it:
0
4459
by: JosAH | last post by:
Greetings, Introduction Last week I was a bit too busy to cook up this part of the article series; sorry for that. This article part wraps up the Text Processing article series. The attachment contains the complete source code which was explained in the previous article parts. Download it, extract the sources somewhere and have a look. The data I've been using all the time can be found here. There are
0
8829
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
8726
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
9361
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
9125
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9054
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
7997
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5972
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4482
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4746
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.