Text retrieval systems - 1: Introduction

Greetings,

Introduction

At the end of the last part of the Compiler article I stated that I wanted to
write about text processing. I had no idea what exactly to talk about until my
wife commanded me to "clean up that mess you never use anyway and please dump
the rest of it in the attic or simply throw that junk away".

I want to make a statement here:

1) I don't have any mess at all;
2) I use that mess every single day;
3) My mess is not junk.

Just because I'm a very obedient person I started to check my piles of valuable
stuff: of course I found my own notes that I couldn't read anymore and haven't
seen in years, so I decided to put them on a new valuable pile.

Then I found my old K&R1 (Kernighan and Ritchie's C programming first edition)
filled with my annotations. I haven't seen that book in years either (I keep
my latest K&R2 copy on a bookshelf somewhere) so I decided to create a new pile
for my valuable old books.

Then I found a thingie that had "Instant mushroom cream sauce" printed on it
with a date "January 2002" and decided to stick it somewhere between the last
chapters of my valuable K&R1 copy. I opened it a bit first because I was very
curious whether or not (intelligent?) life forms could grow in there. It would
be a waste to throw it away. Don't tell my wife about it; she doesn't know what
pure science is.

After I had rearranged a couple of piles into a couple of new piles I found
something interesting: an old desktop PC. I recognized that old desktop despite
the fact that I hadn't seen it in years: it belonged to my old neighbour Dirk
who died a couple of years ago.

Dirk lived a couple of houses down the road and I visited him sometimes. He was
always reading; reading things stored on his computer and he kept on mumbling
about matters and he jotted things down on paper and sometimes he typed a bit
using just one or two fingers while softly cursing and complaining about keys
he couldn't find on that very same keyboard that were still there moments
before. Dirk was a nice and funny old guy.

He had 'donated' that desktop to me a couple of months before he died, saying:
"you take that computer sonny, you know all about those things". That thing ran
a version of Microsoft Windows 98. The very moment I happen to understand
anything at all about that flaky operating system I hope to be shot in the
back mercifully by a good friend. I mean that.

There was a keyboard and the system unit. I remembered I had put the monitor in
the attic and decided to see if I could switch the thing on again. I found the monitor
up in the attic, carried it down, put everything on the dining table, hooked
everything up and switched that old computer on: it still worked, hurray!

Available text

There was a 60MB disk in it and I found a bunch of files, all neatly stored
in their directories. One directory was named 'C:\KJ' and I decided to give
it a peek.

That directory contained an entire bible text, divided up into separate files,
each containing one of those bible books. It was the King James version of the
bible. I ran upstairs to the attic again where I found a bunch of old floppies.

One of my old laptops has a floppy drive as well as a wireless card and I managed
to transfer the whole shebang through those floppies (I needed five of them),
through my old laptop all the way to a laptop I use for everyday work and on
which I type up this text.

I now was the proud owner of a substantial amount of text. One of the obstacles
I envisioned before I started writing this article was me talking about text
processing without me being able to show how stuff works with a substantial
amount of text.

I didn't want to type in the lyrics of my CDs for this (I'm way too lazy for
that) and just a little bit of text doesn't cut it either. I am going to use that
King James bible text for my examples and little experiments.

My old neighbour Dirk had been reading the King James (C:\KJ) version of the
bible. I wondered if there were any Dutch versions of that bible available for
downloading but I couldn't find any; possibly my old neighbour Dirk wasn't able
to find a Dutch text version either. I assume he resorted to that English
version simply for lack of a downloadable Dutch version.

I put all the other piles back in place and decided that that job was over and
done with. My wife disagrees but I'm not going to dig through all that again.
That stuff belongs to me and I removed Dirk's desktop from the dining table
after all, so I did my job; so there.

I am going to use that King James bible text as an example for my text processing
software. On the other hand, I don't want to tie my text processing stuff to
bible texts (neither the King James version nor any other version, including
the unavailable Dutch ones), nor to any other particular type of text.

Nevertheless that King James bible text (or rather, Dirk's version thereof)
makes a good example of what I want to talk about. English would be better in
this international forum after all.

Preparing the raw text

This is the very first part of the King James bible text:

    GENESIS 1:1 In the beginning God created the heaven and the earth.
    GENESIS 1:2 And the earth was without form, and void; and darkness {was} upon
    the face of the deep. And the Spirit of God moved upon the face of the waters.
    GENESIS 1:3 And God said, Let there be light: and there was light.

It looks good: every paragraph (or 'verse' as they call it when we're talking
about bible texts) is prefixed with the name of the book, a chapter number and
a paragraph/verse number.
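
A minimal sketch of how such well-formed lines could be picked apart, assuming
the layout is always 'BOOK chapter:verse text' and that a line without such a
header continues the previous verse. The pattern and the function name are
made up for illustration; they are not part of the engine designed later on.

```python
import re

# Hypothetical pattern: upper-case book name, a space, chapter:verse, a space, text.
VERSE_HEADER = re.compile(
    r"^(?P<book>[A-Z][A-Z ]*?) (?P<chapter>\d+):(?P<verse>\d+) (?P<text>.*)$"
)

def parse_verses(lines):
    """Group raw lines into (book, chapter, verse, text) tuples."""
    verses = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = VERSE_HEADER.match(line)
        if match:
            verses.append([match["book"], int(match["chapter"]),
                           int(match["verse"]), match["text"]])
        elif verses:
            # No header: treat the line as a continuation of the previous verse.
            verses[-1][3] += " " + line
    return [tuple(v) for v in verses]

sample = [
    "GENESIS 1:1 In the beginning God created the heaven and the earth.",
    "GENESIS 1:2 And the earth was without form, and void; and darkness {was} upon ",
    "the face of the deep. And the Spirit of God moved upon the face of the waters.",
    "GENESIS 1:3 And God said, Let there be light: and there was light.",
]
for book, chapter, verse, text in parse_verses(sample):
    print(book, chapter, verse, text[:30])
```

The lazy book-name group lets names with spaces in them (such as 'I KINGS')
match as well.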

I didn't like that '{was}' thingy in the second verse and decided to google a
bit; I suspected Dirk's work here but wasn't sure. After a bit of googling I
figured out that the King James text was edited and altered over the centuries
and this '{was}' thing must've been added somewhere at one of those editing
sessions. Old Dirk had nothing to do with it.

I decided to remove the curly brackets for no particular reason. After browsing
and reading a bit through a few other books I found this:

    I KINGS 22:23 ˙Now therefore, behold, the Lord hath put a lying spirit in the
    mouth of ˙all ˙these thy prophets, ˙and the Lord hath spoken ˙evil ˙concerning
    thee.

    24 But Zedekiah the son of Chenaanah went near, and smote Micaiah on the
    cheek, and said, ˙Which way went the spirit of the Lord from me to speak
    unto thee?

    25 And Micaiah said, Behold, thou shalt see in that day, when thou shalt
    go into an inner chamber to hide thyself.

Rats, the book name and chapter number were missing and there were funny
characters in there as well. Browsing and skimming a bit further I found
that chapter numbers were included at the start of a new chapter. Sometimes
I found ^Z characters at the end of the individual files.
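
The character-level clutter is the easy part. A small sketch, assuming the only
unwanted characters are the ^Z markers (character code 0x1A, the old DOS
end-of-file marker), the '˙' marks and the curly brackets; the function name is
hypothetical and other files may well need more entries.

```python
import re

CTRL_Z = "\x1a"      # the ^Z character sometimes found at the end of a file
STRAY_MARK = "˙"     # the funny character seen in the I KINGS sample (assumed the only one)

def clean_text(raw):
    """Remove ^Z markers, stray marks and curly brackets; keep the words."""
    text = raw.replace(CTRL_Z, "").replace(STRAY_MARK, "")
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of spaces or tabs, but leave the line breaks alone.
    return re.sub(r"[ \t]+", " ", text)

print(clean_text("and darkness {was} upon"))
print(clean_text("mouth of ˙all ˙these thy prophets" + CTRL_Z))
```

Only the brackets are dropped; the word between them stays in the text.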

I don't want to correct all that by hand; I'm not a monk and I have other
things to do. From experience I know that any text, no matter how meticulously
handled and typed in and edited and proofread, contains errors. Either
typing errors or structural errors. The examples above are structural errors.

Typing errors are almost impossible to find programmatically. Those structural
errors can be handled by a program up to a certain level.
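
As an illustration of what 'up to a certain level' could mean, here is a sketch
that restores the missing book name and chapter number. It assumes that a line
starting with a bare number is a new verse only when that number is exactly one
more than the previous verse number; everything else is treated as a
continuation line. The names are hypothetical, and the separate chapter numbers
at the start of a new chapter are not handled here.

```python
import re

FULL_HEADER = re.compile(r"^([A-Z][A-Z ]*?) (\d+):(\d+) (.*)$")
BARE_VERSE = re.compile(r"^(\d+) (.*)$")

def normalize(lines):
    """Return one line per verse, each with a full 'BOOK chapter:verse' header."""
    book = chapter = verse = None
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        full = FULL_HEADER.match(line)
        bare = BARE_VERSE.match(line)
        if full:
            book, chapter, verse = full.group(1), int(full.group(2)), int(full.group(3))
            out.append(line)
        elif bare and book is not None and int(bare.group(1)) == verse + 1:
            # Bare verse number: inherit book and chapter from the previous verse.
            verse += 1
            out.append(f"{book} {chapter}:{verse} {bare.group(2)}")
        elif out:
            out[-1] += " " + line
        else:
            raise ValueError(f"cannot place line without a preceding header: {line!r}")
    return out

raw = [
    "I KINGS 22:23 Now therefore, behold, the Lord hath put a lying spirit in the",
    "mouth of all these thy prophets, and the Lord hath spoken evil concerning",
    "thee.",
    "",
    "24 But Zedekiah the son of Chenaanah went near, and smote Micaiah on the",
    "cheek, and said, Which way went the spirit of the Lord from me to speak",
    "unto thee?",
    "",
    "25 And Micaiah said, Behold, thou shalt see in that day, when thou shalt",
    "go into an inner chamber to hide thyself.",
]
for line in normalize(raw):
    print(line[:40])
```

A program like this cannot tell whether verse 24 really belongs to chapter 22;
it only makes the structure consistent, which is exactly the limit mentioned
above.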

Work to do: I decided I want to design and implement the following:

A simple text processing engine

The previous paragraph showed that errors occur in large amounts of text.
I consider a complete bible a large amount of text. In order to be able to
read, search, query or whatever, the text needs to be in a consistent format.
The text needs to be preprocessed; if this automatic preprocessing fails I
have to correct the text manually but that will be a last resort.

If preprocessing succeeds I want to store the text in a format which allows
for fast retrieval, querying etc. I decided that the smallest unit of text
should be a paragraph. A paragraph consists of one or more sentences, every
sentence consists of one or more words.

I want to be able to locate one or more words in a paragraph quickly. A bunch
of paragraphs form a chapter and one or more chapters make up a book. A bunch
of books are grouped together in a group and a bunch of groups makes up a
library (or a bookshelf or whatever you want to call it).
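
Locating words quickly usually calls for some sort of inverted index: a table
from each word to the paragraphs it occurs in. The sketch below is only one
possible take on that, assuming paragraphs are identified by their position in
a list and that words are compared case-insensitively; the function names are
made up.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")   # \w is Unicode-aware for str patterns in Python 3

def build_index(paragraphs):
    """Map every lower-cased word to the set of paragraph numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(paragraphs):
        for word in WORD.findall(text.lower()):
            index[word].add(number)
    return index

def find_all(index, *words):
    """Return the sorted paragraph numbers that contain every given word."""
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []

paragraphs = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void; and darkness was upon the face "
    "of the deep. And the Spirit of God moved upon the face of the waters.",
    "And God said, Let there be light: and there was light.",
]
index = build_index(paragraphs)
print(find_all(index, "God", "light"))   # [2]
print(find_all(index, "earth"))          # [0, 1]
```

A lookup costs one dictionary access per word plus a set intersection, no
matter how much text there is in total.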

So basically I divide text up like this, from coarse to fine granularity:

1: group
2: book
3: chapter
4: paragraph
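
The four levels map naturally onto nested containers. A sketch with plain data
classes follows; the class and field names are illustrative only, and the group
name 'Old Testament' is just an example label, not something taken from
Dirk's files.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Paragraph:
    text: str

@dataclass
class Chapter:
    paragraphs: List[Paragraph] = field(default_factory=list)

@dataclass
class Book:
    name: str
    chapters: List[Chapter] = field(default_factory=list)

@dataclass
class Group:
    name: str
    books: List[Book] = field(default_factory=list)

def paragraph_count(group):
    """Count the paragraphs in all chapters of all books of a group."""
    return sum(len(chapter.paragraphs)
               for book in group.books
               for chapter in book.chapters)

genesis = Book("GENESIS", [Chapter([
    Paragraph("In the beginning God created the heaven and the earth."),
    Paragraph("And God said, Let there be light: and there was light."),
])])
group = Group("Old Testament", [genesis])
print(paragraph_count(group))   # 2
```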

This little scenario fits nicely for bibles as well as a collection of CDs
where the group would be all CDs grouped by artist, the book would be equivalent
to a single CD. The chapters would be the songs and the paragraphs would be
the parts of the song's lyrics. Other setups are possible as well with this
little scenario.

The ugly part would be the preprocessing of the text. A preprocessor has to be
written for a particular piece of text and could theoretically be thrown away
when the text has been preprocessed. I have to separate that part of my little
text processing and retrieval engine.

There are three additional goals I want to achieve. First, any Unicode text
should be processable, not just ASCII text. Second, the size of the resulting
structure (or object or whatever) should not be larger than the size of the
original text stored on disk. The total size of those King James bible texts is
~4.7MB, so that is the maximum size for my King James text retrieval object.
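
The size goal can be checked mechanically. The sketch below assumes the text is
stored as UTF-8 and uses zlib purely as a stand-in compressor; no compression
scheme has been chosen at this point, and 'extra_bytes' is a hypothetical
allowance for indexes and other overhead.

```python
import zlib

def utf8_size(text):
    """Size in bytes of the text when stored as UTF-8."""
    return len(text.encode("utf-8"))

def fits_budget(text, extra_bytes=0):
    """True when the compressed text plus overhead is no larger than the raw text."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)
    return len(packed) + extra_bytes <= len(raw)

verse = "And God said, Let there be light: and there was light.\n"
sample = verse * 50
print(utf8_size(sample), fits_budget(sample))
print(utf8_size("\u00e9"))   # a non-ASCII character takes more than one byte
```

Measuring bytes instead of characters matters as soon as the text is not plain
ASCII.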

And last: I want to be able to add my own notes to any paragraph I like. The
notes should be saved along with the entire thing and I want to retrieve them
whenever I want. I basically want a read only sort of database that contains
the bunch of books including my own notes.
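
One way to picture 'read only text plus my own notes' is sketched below. It
assumes paragraphs are keyed by a string such as 'GENESIS 1:1' and uses JSON as
the storage format; both choices, and all the names, are assumptions made for
this illustration only.

```python
import json
from types import MappingProxyType

class AnnotatedText:
    """Read-only paragraphs with freely editable notes attached to them."""

    def __init__(self, paragraphs):
        # Expose the text through a read-only view; only the notes can change.
        self.paragraphs = MappingProxyType(dict(paragraphs))
        self.notes = {}

    def add_note(self, key, note):
        if key not in self.paragraphs:
            raise KeyError(key)
        self.notes.setdefault(key, []).append(note)

    def dumps(self):
        """Serialize text and notes together."""
        return json.dumps({"paragraphs": dict(self.paragraphs), "notes": self.notes})

    @classmethod
    def loads(cls, data):
        raw = json.loads(data)
        restored = cls(raw["paragraphs"])
        restored.notes = raw["notes"]
        return restored

text = AnnotatedText({"GENESIS 1:3": "And God said, Let there be light: and there was light."})
text.add_note("GENESIS 1:3", "first occurrence of 'light'")
copy = AnnotatedText.loads(text.dumps())
print(copy.notes["GENESIS 1:3"])
```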

I have to add all sorts of indexes and lists and whatever so there should be
some form of compression and maybe some other redundancy removal. There should
be three main parts in this little system:

1) the PreProcessor that offers a good text structure given the raw text;
2) the Builder that creates the final text retrieval system;
3) the Library, the actual text retrieval system (including my own notes).
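
To give an idea of how the three parts could hand their work to each other,
here is a deliberately naive skeleton. Only the three class names come from the
list above; every method, the record layout and the plain dictionary index are
assumptions, and there are no notes or compression in it.

```python
import re

class PreProcessor:
    """Turns raw lines into (book, chapter, verse, text) records."""
    HEADER = re.compile(r"^([A-Z][A-Z ]*?) (\d+):(\d+) (.*)$")

    def process(self, raw_lines):
        records = []
        for line in raw_lines:
            match = self.HEADER.match(line.strip())
            if match:   # lines without a full header are simply skipped in this sketch
                book, chapter, verse, text = match.groups()
                records.append((book, int(chapter), int(verse), text))
        return records

class Library:
    """The retrieval side: look up by location or by word."""
    def __init__(self, texts, index):
        self._texts = texts
        self._index = index

    def paragraph(self, book, chapter, verse):
        return self._texts[(book, chapter, verse)]

    def search(self, word):
        return sorted(self._index.get(word.lower(), ()))

class Builder:
    """Builds a Library (texts plus word index) from preprocessed records."""
    def build(self, records):
        texts, index = {}, {}
        for book, chapter, verse, text in records:
            key = (book, chapter, verse)
            texts[key] = text
            for word in re.findall(r"\w+", text.lower()):
                index.setdefault(word, set()).add(key)
        return Library(texts, index)

raw = [
    "GENESIS 1:1 In the beginning God created the heaven and the earth.",
    "GENESIS 1:3 And God said, Let there be light: and there was light.",
]
library = Builder().build(PreProcessor().process(raw))
print(library.search("light"))
print(library.paragraph("GENESIS", 1, 1))
```

The point of the split is that the PreProcessor is the only part that knows
about the quirks of one particular text.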

The design and implementation of those three classes (or groups of classes) is
the subject of the next part of this article. Hopefully I'll see you again next
week when we have to solve quite some technicalities. We have work to do here.

Google is a spoilsport

After typing this article part I investigated this King James bible a bit more
and found out that there are quite a few more books. They have been torn out
of the official bible: either chapters from books or entire books were banned.
It wasn't just the King James XTeam that did that; it had happened centuries
before already. Popes, protestants, catholics, all had a big fight over what
was supposed to be true or not: fantasy, fiction or the Truth.

A bunch of books ended up at the junk pile but were still considered 'sort of
true'. The 'really true' books make up the 'canon'; the King James version
is such a 'canon'. After a bit more googling I figured out that the King James
XTeam sort of accepted sixteen more books as 'more or less true-ish'.

Those books are called 'apocryphal' books (Greek: 'apokrypha', 'those that
have been hidden away'). Hidden away by whom if I may ask?

Anyway, I decided not to be part of centuries' old crusades and what have you,
and I simply include those sixteen apocryphal books in my old neighbour
Dirk's version of the King James bible. I hope I haven't insulted Dirk, nor
anybody else: I just want to use that text as an example text for the software
I intend to design and implement.

After a bit more googling I found those apocryphal texts; I edited them a bit
manually (actually vi did all the dirty work, I just made a few macros) and
added them to my list of books to be processed for my text retrieval system.

I am going to tell my wife that my piles of notes, books, stuff etc. are not junk:
they are apocryphal. So there.

See you next week and

kind regards,

Jos
Jul 8 '07 #1