Text retrieval systems - 2B: Text Processors

Greetings,

Introduction

Welcome back. It's time to do some real design: I want to have two 'things':

1) a 'LibraryBuilder' that gradually builds the processed text and finally
builds the 'Library' itself.

2) a 'Processor' that processes the input text and spoonfeeds it to the first
object.

Processor

I want these two entities to be as general as possible. First I design the
wanted interfaces. Here is the Processor interface:

    import java.io.IOException;

    public interface Processor {

        // process the raw text that can be reached through the given prefix
        public void process(String prefix) throws IOException;

        // return the end result of the processing
        public Library getLibrary();
    }

The prefix String can be any string; the Processor knows what to do with it.
The prefix can be a uri or a directory or whatever is needed to get to the
raw text.

The process() method does all the processing. Because things can go wrong
during processing it is allowed to throw an IOException, which is the most
likely exception to be thrown. I'll design subclasses thereof when
needed.

The second method gives me the end result: the Library. A Library is an ordinary
class that can retrieve text for me.

I want a Processor implementation to be as generic as possible, i.e. I don't
want to stick any particular King James bible knowledge into my Processor.

The Processor implementation will be an abstract class that does all the
organizational or 'conducting' work and leaves the particular King James bible
knowledge to a subclass. It declares abstract methods for that purpose.

LibraryBuilder

I use the same scenario for a LibraryBuilder:

    import java.io.IOException;

    public interface LibraryBuilder {

        // called before and after all the text is fed to the builder
        public void preProcess();
        public void postProcess();

        public void setTitle(String title);

        // the text spoonfeeding methods
        public void buildGroup(String group);
        public void buildParagraph(String book, String chapter,
                       int para, String text) throws IOException;

        public Library build();
    }

The interface can't enforce it, but the intention is to call the preProcess()
method before anything else is done. After all processing is over and done
with, the postProcess() method is supposed to be called.

At the very end the LibraryBuilder is supposed to give me a Library object
when its build() method is invoked.

The Library class itself doesn't know which text it handles, i.e. it knows
nothing about King James bible texts, nor about CD collections or whatever.

The two remaining methods implement the text spoonfeeding:

1) buildGroup() builds a new group for its caller.
2) buildParagraph() builds a new paragraph given a book, chapter, paragraph
and the raw paragraph text. It may throw an IOException if needed.

As can already be seen, when the Processor.getLibrary() method is invoked it
delegates the job to the LibraryBuilder.build() method.
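
The calling convention described above (preProcess() first, then the text, then postProcess() and finally build()) can't be expressed in the interface, but an implementation could check it at run time. Here is a minimal sketch of that idea. The GuardedBuilder class, its Phase enum and the placeholder Library class are assumptions made up for this illustration; they are not the builder that is developed in this series.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Placeholder: the real Library class is not shown in this part of the article.
class Library {
    final String title;
    final List<String> paragraphs;
    Library(String title, List<String> paragraphs) {
        this.title = title;
        this.paragraphs = paragraphs;
    }
}

// The interface from the article, repeated to keep the sketch self-contained.
interface LibraryBuilder {
    void preProcess();
    void postProcess();
    void setTitle(String title);
    void buildGroup(String group);
    void buildParagraph(String book, String chapter, int para, String text) throws IOException;
    Library build();
}

// Hypothetical builder that rejects calls made in the wrong phase.
class GuardedBuilder implements LibraryBuilder {
    private enum Phase { NEW, BUILDING, DONE }

    private Phase phase = Phase.NEW;
    private String title = "";
    private final List<String> paragraphs = new ArrayList<>();

    private void require(Phase expected, String method) {
        if (phase != expected)
            throw new IllegalStateException(method + "() called in phase " + phase);
    }

    public void preProcess()  { require(Phase.NEW, "preProcess");       phase = Phase.BUILDING; }
    public void postProcess() { require(Phase.BUILDING, "postProcess"); phase = Phase.DONE; }

    public void setTitle(String title) { require(Phase.BUILDING, "setTitle"); this.title = title; }

    // What a group holds is not specified in this part; the sketch only checks the phase.
    public void buildGroup(String group) { require(Phase.BUILDING, "buildGroup"); }

    public void buildParagraph(String book, String chapter, int para, String text) throws IOException {
        require(Phase.BUILDING, "buildParagraph");
        paragraphs.add(book + " " + chapter + ":" + para + " " + text);
    }

    public Library build() { require(Phase.DONE, "build"); return new Library(title, paragraphs); }
}

class GuardedBuilderDemo {
    // Drives the builder in the intended order.
    static Library run() {
        LibraryBuilder b = new GuardedBuilder();
        try {
            b.preProcess();
            b.setTitle("Demo");
            b.buildGroup("Group 1");
            b.buildParagraph("Book", "1", 1, "some text");
            b.postProcess();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return b.build();
    }
}
```

Calling build() on a fresh GuardedBuilder throws an IllegalStateException, while the sequence in GuardedBuilderDemo.run() goes through. Whether such a check is worth the extra state is a design choice; the interface itself stays the same either way.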

Here too, I want a LibraryBuilder to be as generic as possible so I implement
an abstract class that implements the LibraryBuilder interface. This class does
all the work that doesn't need any particular knowledge about the King James
bible text.

A special subclass should implement the abstract methods declared in the abstract
superclass; that is where it can put its specific King James bible knowledge.

Class structure

This is the top level class structure:

    // interfaces:
    interface Processor { ... }
    interface LibraryBuilder { ... }
    // implementing classes:
    abstract class AbstractProcessor implements Processor { ... }
    abstract class AbstractBuilder implements LibraryBuilder { ... }

For this particular example project I have to implement two specific classes:

    class KJProcessor extends AbstractProcessor { ... }
    class KJBuilder extends AbstractBuilder { ... }

These two classes contain specific King James bible knowledge about the text
being processed and from which a Library is constructed by the builder.

AbstractProcessor

The AbstractProcessor does all the 'conducting' work for the raw text processing
job. It needs to be subclassed for the real job. Here is its first part:

    public abstract class AbstractProcessor implements Processor {

        protected LibraryBuilder builder;
        protected String title;

        public AbstractProcessor(String title, LibraryBuilder builder) {

            this.builder= builder;
            this.title= title;
        }
        ...

An AbstractProcessor can be constructed given the title of the library and
a LibraryBuilder. The KJProcessor supplies the KJBuilder for its superclass
as well as the title String.
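
A minimal sketch of how a subclass could hand the title and the builder to its superclass. The no-argument constructor and the title text are assumptions for illustration only; the real KJProcessor and KJBuilder are shown in a later article part. The supporting types are trimmed to what this sketch needs (for instance, KJBuilder implements the interface directly here instead of extending AbstractBuilder).

```java
// Placeholder: the real Library class is not shown in this part of the article.
class Library { }

// Trimmed to one method; the full interface is listed earlier in the article.
interface LibraryBuilder {
    Library build();
}

// First part of the article's AbstractProcessor, without the abstract methods.
abstract class AbstractProcessor {

    protected LibraryBuilder builder;
    protected String title;

    public AbstractProcessor(String title, LibraryBuilder builder) {
        this.builder = builder;
        this.title = title;
    }
}

// Stand-in for the real builder.
class KJBuilder implements LibraryBuilder {
    public Library build() { return new Library(); }
}

class KJProcessor extends AbstractProcessor {

    // Hypothetical constructor: the subclass picks both the title and the builder.
    public KJProcessor() {
        super("King James Bible", new KJBuilder());
    }
}
```

With this wiring a caller only needs `new KJProcessor()`; it never has to know which builder belongs to which processor.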

The abstract methods defined in this class are:

        ...
        protected abstract void preProcess();
        protected abstract void postProcess();

        protected abstract int getNofBooks();
        protected abstract String getBookTitle(String prefix, int book);

        protected abstract Reader getBookReader(String prefix, int book)
                    throws IOException;

        protected abstract void processBook(String title, BufferedReader br)
                    throws IOException;
        ...

Similar to the LibraryBuilder, this object calls the preProcess() method before
processing starts. When the processing is done the postProcess() method is
invoked. The KJProcessor implements empty methods for these two abstract methods
because it doesn't need to do any special pre- or post-processing.

The AbstractProcessor needs to know how many books are to be processed and it
needs the title of each book. That's what the next two methods are for and
they need to be implemented in a subclass of the AbstractProcessor class.

The getBookReader() method needs to return a Java Reader object that can read
from a book. The last method must process an entire book, given a Reader for
that book.

The last two methods can throw an IOException because any input/output
related action can go wrong.

Note that the subclass can invoke methods and read or alter member variables
in the builder directly, i.e. the coupling between the two is tight.
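
To make the six abstract methods a bit more concrete, here is a minimal sketch of a subclass that keeps its books in memory. The InMemoryProcessor class, its titles and its texts are made up for this illustration, and the AbstractProcessor shown is trimmed to its abstract methods only (no constructor, no builder). The real KJProcessor will look different.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Only the abstract methods from the article; everything else is left out.
abstract class AbstractProcessor {
    protected abstract void preProcess();
    protected abstract void postProcess();

    protected abstract int getNofBooks();
    protected abstract String getBookTitle(String prefix, int book);

    protected abstract Reader getBookReader(String prefix, int book) throws IOException;
    protected abstract void processBook(String title, BufferedReader br) throws IOException;
}

// Hypothetical subclass: two tiny 'books' held in memory.
class InMemoryProcessor extends AbstractProcessor {

    private final String[] titles = { "First", "Second" };
    private final String[] texts  = { "alpha\nbeta\n", "gamma\n" };

    // Stands in for the builder: collects what processBook() has seen.
    final List<String> seen = new ArrayList<>();

    protected void preProcess()  { }   // nothing to prepare
    protected void postProcess() { }   // nothing to clean up

    protected int getNofBooks() { return titles.length; }

    // The prefix is ignored here; a real subclass could use it as a directory or URI.
    protected String getBookTitle(String prefix, int book) { return titles[book]; }

    protected Reader getBookReader(String prefix, int book) throws IOException {
        if (book < 0 || book >= texts.length)
            throw new IOException("no such book: " + book);
        return new StringReader(texts[book]);
    }

    protected void processBook(String title, BufferedReader br) throws IOException {
        for (String line; (line = br.readLine()) != null; )
            seen.add(title + ": " + line);
    }

    // Processes book 0 by hand: fetch a Reader, buffer it, process it.
    static List<String> demo() {
        InMemoryProcessor p = new InMemoryProcessor();
        try (BufferedReader br = new BufferedReader(p.getBookReader("", 0))) {
            p.processBook(p.getBookTitle("", 0), br);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.seen;
    }
}
```

InMemoryProcessor.demo() returns the two lines of the first book, each prefixed with the book title.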

Here's the delegator method when a Library object is wanted:

        ...
        public Library getLibrary() { return builder.build(); }
        ...

Also see above: the AbstractProcessor simply invokes the builder.build() method
for the Library.

Now for some substantial conducting work. The next method in the AbstractProcessor
class is the implementation of the process() method defined in the Processor
interface:

        ...
        public void process(String prefix) throws IOException {

            builder.preProcess();
            builder.setTitle(title);

            this.preProcess();

            for (int i= 0, n= getNofBooks(); i < n; i++)
                processBook(prefix, i);

            this.postProcess();

            builder.postProcess();
        }
        ...

It calls the preProcess() methods on the builder and the subclass and it
passes the title to the builder.

Next it determines the number of books to be processed and processes each
book by invoking the processBook() method (see below).

When everything succeeds the postProcess() method is invoked on both the
subclass and the builder.
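
The order of these calls can be made visible with a small trace. The sketch below repeats the body of process() and plugs in a builder and a subclass that only log their method names. ProcessTrace, the log format and the trimmed LibraryBuilder are assumptions for this illustration, and processBook(prefix, book) is declared abstract here just to keep the sketch short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Trimmed to the three methods that process() invokes on the builder.
interface LibraryBuilder {
    void preProcess();
    void postProcess();
    void setTitle(String title);
}

abstract class AbstractProcessor {

    protected LibraryBuilder builder;
    protected String title;

    AbstractProcessor(String title, LibraryBuilder builder) {
        this.builder = builder;
        this.title = title;
    }

    protected abstract void preProcess();
    protected abstract void postProcess();
    protected abstract int getNofBooks();

    // Simplified stand-in for the article's processBook(prefix, book).
    protected abstract void processBook(String prefix, int book) throws IOException;

    // Same body as the process() method in the article.
    public void process(String prefix) throws IOException {
        builder.preProcess();
        builder.setTitle(title);
        this.preProcess();
        for (int i = 0, n = getNofBooks(); i < n; i++)
            processBook(prefix, i);
        this.postProcess();
        builder.postProcess();
    }
}

class ProcessTrace {

    // Runs process() once and returns the method names in call order.
    static List<String> run() {
        final List<String> log = new ArrayList<>();

        LibraryBuilder builder = new LibraryBuilder() {
            public void preProcess()  { log.add("builder.preProcess"); }
            public void postProcess() { log.add("builder.postProcess"); }
            public void setTitle(String t) { log.add("builder.setTitle(" + t + ")"); }
        };

        AbstractProcessor p = new AbstractProcessor("Demo", builder) {
            protected void preProcess()  { log.add("preProcess"); }
            protected void postProcess() { log.add("postProcess"); }
            protected int getNofBooks()  { return 2; }
            protected void processBook(String prefix, int book) {
                log.add("processBook(" + book + ")");
            }
        };

        try {
            p.process("ignored-prefix");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : run())
            System.out.println(line);
    }
}
```

The trace starts with the two builder calls, then the subclass preProcess(), one processBook() per book, and it ends with postProcess() on the subclass followed by postProcess() on the builder. Note that if a book throws an IOException, neither postProcess() method runs.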

Here's the processBook() method implementation:

        ...
        public void processBook(String prefix, int book) throws IOException {

            BufferedReader br= null;

            try {
                br= new BufferedReader(getBookReader(prefix, book));
                processBook(getBookTitle(prefix, book), br);
            }
            finally {
                // br is still null when getBookReader() failed; closing it
                // blindly would mask the IOException with a NullPointerException
                if (br != null)
                    try { br.close(); } catch (IOException ioe) { }
            }
        }

This method asks the subclass to return a Reader for a given book. It wraps
a BufferedReader around the Reader and asks the subclass again to process
the current book. Finally the buffered reader is closed, which closes
the wrapped reader as well.
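
That closing a BufferedReader also closes the reader it wraps can be checked with a few lines of code. The TrackingReader and CloseDemo classes below are made up for this check; they are not part of the article's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// A StringReader that remembers whether close() was called on it.
class TrackingReader extends StringReader {

    boolean closed = false;

    TrackingReader(String s) { super(s); }

    @Override
    public void close() {
        closed = true;
        super.close();
    }
}

class CloseDemo {

    // Returns true if closing the BufferedReader also closed the wrapped reader.
    static boolean innerClosedAfterOuterClose() {
        TrackingReader inner = new TrackingReader("some text\n");
        BufferedReader br = new BufferedReader(inner);
        try {
            br.readLine();
            br.close();   // only the outer reader is closed explicitly
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return inner.closed;
    }
}
```

This is why processBook() only has to keep track of the BufferedReader: the subclass's Reader doesn't need a separate close() call.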

I think this is enough design and implementation for this week. Next week I'll
show how the LibraryBuilder is designed and implemented. It's more work than
this Processor implementation.

After that I'll show the KJProcessor and KJBuilder classes; they handle the
nitty-gritty String processing work and are basically the implementations
of the abstract methods defined in their parent classes and a few ugly methods
that must come up with consistent text (see last week's article part).

I'll add all the code as an attachment in some of the following article parts so
you can play with it or maybe actually apply it in a useful way. It doesn't
hurt to actually read the source code. If you find bugs feel free to correct me.

See you next week and

kind regards,

Jos
Jul 13 '07 #1