473,473 Members | 1,954 Online
Bytes | Software Development & Data Engineering Community
Create Post

Home Posts Topics Members FAQ

Text retrieval systems - 6: Queries

11,448 Recognized Expert MVP
Greetings,

Introduction

This week we start building Query objects. A query can retrieve portions of text
from a Library. I don't want users to build queries by themselves, because users
make mistakes. Instead, the Library hands out queries to the user given a simple
query String. This is how the library does it:

Expand|Select|Wrap|Line Numbers
  1. public Query getQuery(String query) throws QueryException {
  2.  
  3.     return new QueryTxt(this, query);
  4. }
  5.  
The user passes in a String and gets a Query back again. But what is a Query?
Read on.

The Query interface

Query is an interface again and this is what it looks like:

Expand|Select|Wrap|Line Numbers
  1. public interface Query {
  2.  
  3.     public int size();
  4.     public List<BookMark> result();
  5. }
  6.  
The size() method returns the number of results (paragraphs) that had a positive
match against the query text. The result() method returns a list of BookMarks
that represent all the matched paragraphs.

Internally the query subsystem uses a extension to this interface:

Expand|Select|Wrap|Line Numbers
  1. interface InternalQuery extends Query {
  2.  
  3.     public BitSet evaluate();
  4. }
  5.  
As you can see this interface extends the previous interface with one single
method. The evaluate() method is supposed to evaluate (sic) the query. This
interface isn't a public interface so all the user knows about is the Query
interface itself.

The implementation of the result() interface method invokes the evaluate()
method only once; so no matter how many times the result() method itself is
invoked, the resuls are 'known' already and the query doesn't need to be
re-evaluated.

The Query internals

The simples query is just a simple word of text. The word is looked up in
the Library's WordMap (see the previous article section) and a BitSet is used
to indicate which of the paragraphs contain that word:

Expand|Select|Wrap|Line Numbers
  1. private BitSet makeBitSet(int[] p) {
  2.  
  3.     BitSet bs= new BitSet();
  4.  
  5.     for (int i= 0; i < p.length; bs.set(p[i++]));
  6.  
  7.     return bs;
  8. }
  9.  
Given the array of ints that represent the indexes of the paragraphs in which
a word occurs a BitSet is constructed and the corresponding bits are set to 1.

If you look at the BitSet API documentation you'll see that BitSets are very
handy for the job: they can manipulate entire BitSets using a single method
invocation, i.e. they can 'and' and 'or' two BitSets and entire ranges of
bits in the BitSet can be set or reset.

So why didn't I use BitSets in that WordMap in the first place? The reason is,
a BitSet can be much larger than a simple array of paragraph indexes; suppose
a word only occurs in paragraph 65,000. A BitSet would take +- 65,000 bits
(which is about 8KB) while a one element int array, that stores just the index
number 65,000 would take a single 32 bit int.

Most of the work involved for Query objects is manipulating BitSets because
they're fast to manipulate.

The Query Language

The String passed to the Library has a certain syntax, i.e. the language of
the Query. It's an extremely simple language: the simplest query is a single
word as shown in the previous paragraph.

Queries can be logically 'and'ed and 'or'ed together. The 'and' operator has
higher precedence than the 'or' operator, e.g. the query:

Jesus & God | devil

lists the paragraphs in which both the words Jesus and God appear, or just the
word devil, or all three of the words.

Note that the symbols '&' and '|' represent the 'and'and 'or' operators.

Sub Queries can be grouped by embracing them with parentheses '(' and ')'.
For example the following query gives different results than the previous
query does:

Jesus & (God | devil)

Only those paragraphs are the result of the query if they contain the word Jesus
together with one or both of the words God or devil.

A Query can be negated by preceding it with an exclamation mark '!', so the
following query:

Jesus & !(God | devil)

returns those paragraphs that contain the word Jesus but not any of the words
God or devil.

There's a bit more to the 'and' operator: if it is directly followed by a number
(a non-negative integer number) that number is considered a 'proximity'; here's
an example:

Jesus &10 God

This query returns all the paragraph that contain either (or both of the words)
Jesus or God only if another paragraph within a 'proximity' of 10 paragraphs
contains the other word. Suppose paragraph n contains the word Jesus; this
paragraph is only part of the result if one of the paragraphs

n-10, n-9, ... n-1, n, n+1, ... n+9, n+10

contain the word God. The query is commutative so the query:

God &10 Jesus

returns the same results, if any. Note that no space is allowed between the '&'
symbol and the number. Spaces are optional so the previous query could have
been written as:

God&10Jesus

If a word contains leading digits you have to use a space between the number
and the second word (if it contains leading digits). If the number isn't present
the value 0 (zero) is assumed, i.e. both operands of the and query must occur
in the same paragraph only.

The last query language element is the regular expression. A regular expression
in a query takes the following form:

=<delimeter><regular expression><delimeter>

The next queries use a regular expression and all have the same result:

=/esus/
=AesusA
=%esus%

The leading and trailing delimeters must be equal and not a character used in
the regular expression itself. For the readers who also read my compiler
article series, the following Backus Naur notation describes the formal syntax
of the query language:

query = and-query { | and-query }*
and-query = unary-query { &number? unary-query }*
unary-query = ( query ) | ! unary-query | =delim regexp delim | word
number = any non-negative integer number
regexp = any regular expression
delim = any character
word = any sequence of letters or digits

The Query implementation

The Query subsystem forms a little hierarchy of classes. You have seen both
interfaces already (see above). The hierarchy looks like this:

Expand|Select|Wrap|Line Numbers
  1. Query
  2.    |
  3.    +---- InternalQuery
  4.                  |
  5.                  +------- AbstractQuery
  6.                  |                |
  7.                  |                +------- QueryAnd
  8.                  |                |
  9.                  |                +------- QueryNot
  10.                  |                |
  11.                  |                +------- QueryOr
  12.                  |                |
  13.                  |                +------- QueryRex
  14.                  |
  15.                  +------- QueryTxt
  16.  
The QueryTxt class is the query compiler; as you can see in the beginning of
this article part this is the class that is used by the Library when it has
to hand out a Query to the user.

All the other concrete classes extend from the AbstractQuery class that keeps
track of a bit of bookkeeping, e.g. the makeBitSet() method (see above) is
a method implemented in this class.

The other classes just implement their own 'evaluate()' method. When compilation
has succeeded the result will be an InternalQuery object. The object passed to
the user actually is a QueryTxt object. When the user invokes any of the methods
defined in the Query interface, the QueryTxt object delegates the invocation to
the InternalQuery object that was the result of the compilation phase.

Note that there is no special class for the simple 'word' query; this is handled
by a QueryOr object which has nothing to 'or', just build a BitSet for the single
word in the query.

Text

None of the queries, except for the QueryRex need to actually scan the actual
text: they all use the WordMap stored in the Library itself and build the
corresponding BitSets out of it.

The QueryRex is an 'expensive' query in terms of processing time: it needs to
retrieve the actual text from the bookmarks and match a regular expression
against this text. The text being used is the result of the BookMark's toString()
method.

This method not only returns the paragraph text but the name of the group,
book, chapter and relative paragraph number as well. (see the previous article
part), so a QueryRex can search and find those names as well.

Concluding Remarks

Next week I'll point you to two .zip files that contain the compressed text
of the King James bible and the Dutch Staten Vertaling bible. I'll attach all
the source code with the next article part.

You can decompress the .zip files, compile the source code and you're in business
then: build a Library and query away on the result. Next week I'll give detailed
instructions.

Prometheuzz not only supplied me the complete text for the Dutch Staten Vertaling
bible but he also offered me disk space for the two .zip files. My own ISP is a
scrooge and I can't host those two files myself so I'm very grateful to Prometheuzz
for his offer: thank you very much, I owe you a beer.

I hope to see you next week and

kind regards,

Jos
Aug 12 '07 #1
1 4405
Leke Ekundayo
5 New Member
I was not able to compile and run the program. Is there already compiled file
Jun 21 '11 #2

Sign in to post your reply or Sign up for a free account.

Similar topics

0
by: SoftComplete Development | last post by:
AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing...
12
by: grace | last post by:
i am wondering why my database retrieval becomes too slow...we set up a new server (ubuntu, breezy badger) machine where we transferred all our files from the old server.. Our new server uses Asus...
0
by: JosAH | last post by:
Greetings, Introduction At the end of the last Compiler article part I stated that I wanted to write about text processing. I had no idea what exactly to talk about; until my wife commanded...
0
by: JosAH | last post by:
Greetings, Introduction Last week I started thinking about a text processing facility. I already found a substantial amount of text: a King James version of the bible. I'm going to use that...
0
by: JosAH | last post by:
Greetings, Introduction Before we start designing and implementing our text builder class(es), I'd like to mention a reply by Prometheuzz: he had a Dutch version of the entire bible ...
0
by: JosAH | last post by:
Greetings, the last two article parts described the design and implementation of the text Processor which spoonfeeds paragraphs of text to the LibraryBuilder. The latter object organizes, cleans...
0
by: JosAH | last post by:
Greetings, Introduction At this moment we have a TextProcessor, a LibraryBuilder as well as the Library itself. As you read last week a Library is capable of producing pieces of text in a...
0
by: JosAH | last post by:
Greetings, Introduction Last week I was a bit too busy to cook up this part of the article series; sorry for that. This article part wraps up the Text Processing article series. The ...
0
by: JosAH | last post by:
Greetings, welcome back; above we discussed the peripherals of the Library class: loading and saving such an instantiation of it, the BookMark interface and then some. This part of the article...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
1
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
1
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...
0
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
0
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.