Text retrieval systems - 6: Queries

11,448 Recognized Expert MVP

Greetings,

Introduction

This week we start building Query objects. A query can retrieve portions of text
from a Library. I don't want users to build queries by themselves, because users
make mistakes. Instead, the Library hands out queries to the user given a simple
query String. This is how the library does it:

Expand|Select|Wrap|Line Numbers

 
public Query getQuery(String query) throws QueryException {
 
    return new QueryTxt(this, query);

}

The user passes in a String and gets a Query back again. But what is a Query?
Read on.

The Query interface

Query is an interface again and this is what it looks like:

Expand|Select|Wrap|Line Numbers

 
public interface Query {
 
    public int size();

    public List<BookMark> result();

}

The size() method returns the number of results (paragraphs) that had a positive
match against the query text. The result() method returns a list of BookMarks
that represent all the matched paragraphs.

Internally the query subsystem uses a extension to this interface:

Expand|Select|Wrap|Line Numbers

 
interface InternalQuery extends Query {
 
    public BitSet evaluate();

}

As you can see this interface extends the previous interface with one single
method. The evaluate() method is supposed to evaluate (sic) the query. This
interface isn't a public interface so all the user knows about is the Query
interface itself.

The implementation of the result() interface method invokes the evaluate()
method only once; so no matter how many times the result() method itself is
invoked, the resuls are 'known' already and the query doesn't need to be
re-evaluated.

The Query internals

The simples query is just a simple word of text. The word is looked up in
the Library's WordMap (see the previous article section) and a BitSet is used
to indicate which of the paragraphs contain that word:

Expand|Select|Wrap|Line Numbers

 
private BitSet makeBitSet(int[] p) {
 
    BitSet bs= new BitSet();
 
    for (int i= 0; i < p.length; bs.set(p[i++]));
 
    return bs;

}

Given the array of ints that represent the indexes of the paragraphs in which
a word occurs a BitSet is constructed and the corresponding bits are set to 1.

If you look at the BitSet API documentation you'll see that BitSets are very
handy for the job: they can manipulate entire BitSets using a single method
invocation, i.e. they can 'and' and 'or' two BitSets and entire ranges of
bits in the BitSet can be set or reset.

So why didn't I use BitSets in that WordMap in the first place? The reason is,
a BitSet can be much larger than a simple array of paragraph indexes; suppose
a word only occurs in paragraph 65,000. A BitSet would take +- 65,000 bits
(which is about 8KB) while a one element int array, that stores just the index
number 65,000 would take a single 32 bit int.

Most of the work involved for Query objects is manipulating BitSets because
they're fast to manipulate.

The Query Language

The String passed to the Library has a certain syntax, i.e. the language of
the Query. It's an extremely simple language: the simplest query is a single
word as shown in the previous paragraph.

Queries can be logically 'and'ed and 'or'ed together. The 'and' operator has
higher precedence than the 'or' operator, e.g. the query:

Jesus & God | devil

lists the paragraphs in which both the words Jesus and God appear, or just the
word devil, or all three of the words.

Note that the symbols '&' and '|' represent the 'and'and 'or' operators.

Sub Queries can be grouped by embracing them with parentheses '(' and ')'.
For example the following query gives different results than the previous
query does:

Jesus & (God | devil)

Only those paragraphs are the result of the query if they contain the word Jesus
together with one or both of the words God or devil.

A Query can be negated by preceding it with an exclamation mark '!', so the
following query:

Jesus & !(God | devil)

returns those paragraphs that contain the word Jesus but not any of the words
God or devil.

There's a bit more to the 'and' operator: if it is directly followed by a number
(a non-negative integer number) that number is considered a 'proximity'; here's
an example:

Jesus &10 God

This query returns all the paragraph that contain either (or both of the words)
Jesus or God only if another paragraph within a 'proximity' of 10 paragraphs
contains the other word. Suppose paragraph n contains the word Jesus; this
paragraph is only part of the result if one of the paragraphs

n-10, n-9, ... n-1, n, n+1, ... n+9, n+10

contain the word God. The query is commutative so the query:

God &10 Jesus

returns the same results, if any. Note that no space is allowed between the '&'
symbol and the number. Spaces are optional so the previous query could have
been written as:

God&10Jesus

If a word contains leading digits you have to use a space between the number
and the second word (if it contains leading digits). If the number isn't present
the value 0 (zero) is assumed, i.e. both operands of the and query must occur
in the same paragraph only.

The last query language element is the regular expression. A regular expression
in a query takes the following form:

=<delimeter><regular expression><delimeter>

The next queries use a regular expression and all have the same result:

=/esus/
=AesusA
=%esus%

The leading and trailing delimeters must be equal and not a character used in
the regular expression itself. For the readers who also read my compiler
article series, the following Backus Naur notation describes the formal syntax
of the query language:

query = and-query { | and-query }*
and-query = unary-query { &number? unary-query }*
unary-query = ( query ) | ! unary-query | =delim regexp delim | word
number = any non-negative integer number
regexp = any regular expression
delim = any character
word = any sequence of letters or digits

The Query implementation

The Query subsystem forms a little hierarchy of classes. You have seen both
interfaces already (see above). The hierarchy looks like this:

Expand|Select|Wrap|Line Numbers

 
Query

   |

   +---- InternalQuery

                 |

                 +------- AbstractQuery

                 |                |

                 |                +------- QueryAnd

                 |                |

                 |                +------- QueryNot

                 |                |

                 |                +------- QueryOr

                 |                |

                 |                +------- QueryRex

                 |

                 +------- QueryTxt

The QueryTxt class is the query compiler; as you can see in the beginning of
this article part this is the class that is used by the Library when it has
to hand out a Query to the user.

All the other concrete classes extend from the AbstractQuery class that keeps
track of a bit of bookkeeping, e.g. the makeBitSet() method (see above) is
a method implemented in this class.

The other classes just implement their own 'evaluate()' method. When compilation
has succeeded the result will be an InternalQuery object. The object passed to
the user actually is a QueryTxt object. When the user invokes any of the methods
defined in the Query interface, the QueryTxt object delegates the invocation to
the InternalQuery object that was the result of the compilation phase.

Note that there is no special class for the simple 'word' query; this is handled
by a QueryOr object which has nothing to 'or', just build a BitSet for the single
word in the query.

Text

None of the queries, except for the QueryRex need to actually scan the actual
text: they all use the WordMap stored in the Library itself and build the
corresponding BitSets out of it.

The QueryRex is an 'expensive' query in terms of processing time: it needs to
retrieve the actual text from the bookmarks and match a regular expression
against this text. The text being used is the result of the BookMark's toString()
method.

This method not only returns the paragraph text but the name of the group,
book, chapter and relative paragraph number as well. (see the previous article
part), so a QueryRex can search and find those names as well.

Concluding Remarks

Next week I'll point you to two .zip files that contain the compressed text
of the King James bible and the Dutch Staten Vertaling bible. I'll attach all
the source code with the next article part.

You can decompress the .zip files, compile the source code and you're in business
then: build a Library and query away on the result. Next week I'll give detailed
instructions.

Prometheuzz not only supplied me the complete text for the Dutch Staten Vertaling
bible but he also offered me disk space for the two .zip files. My own ISP is a
scrooge and I can't host those two files myself so I'm very grateful to Prometheuzz
for his offer: thank you very much, I owe you a beer.

I hope to see you next week and

kind regards,

Jos

Aug 12 '07 #1

Subscribe Reply

4405

Leke Ekundayo

New Member

I was not able to compile and run the program. Is there already compiled file

Jun 21 '11 #2

by: SoftComplete Development | last post by:

AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing...

Text retrieval systems - 6: Queries

Similar topics