Text retrieval systems - 1: Introduction

Greetings,

Introduction

At the end of the last Compiler article part I stated that I wanted to write
about text processing. I had no idea what exactly to talk about; until my wife
commanded me to "clean up that mess you never use anyway and please dump the
rest of it in the attic or simply throw that junk away".

I want to make a statement here:

1) I don't have any mess at all;
2) I use that mess every single day;
3) My mess is not junk.

Just because I'm a very obedient person I started to check my piles of valuable
stuff: of course I found my own notes that I couldn't read anymore and haven't
seen in years, so I decided to put them on a new valuable pile.

Then I found my old K&R1 (Kernighan and Ritchie's C programming first edition)
filled with my annotations. I haven't seen that book in years either (I keep
my latest K&R2 copy on a bookshelf somewhere) so I decided to create a new pile
for my valuable old books.

Then I found a thingie that had "Instant mushroom cream sauce" printed on it
with a date "January 2002" and decided to stick it somewhere between the last
chapters of my valuable K&R1 copy. I opened it a bit first because I was very
curious whether or not (intelligent?) life forms could grow in there. It would
be a waste to throw it away. Don't tell my wife about it; she doesn't know what
pure science is.

After I had rearranged a couple of piles into a couple of new piles I found
something interesting: an old desktop PC. I recognized that old desktop despite
the fact that I hadn't seen it in years: it belonged to my old neighbour Dirk
who died a couple of years ago.

Dirk lived a couple of houses down the road and I visited him sometimes. He was
always reading; reading things stored on his computer and he kept on mumbling
about matters and he jotted things down on paper and sometimes he typed a bit
using just one or two fingers while softly cursing and complaining about keys
he couldn't find on that very same keyboard that were still there moments
before. Dirk was a nice and funny old guy.

He had 'donated' that desktop to me a couple of months before he died, saying:
"you take that computer sonny, you know all about those things". That thing ran
a version of Microsoft Windows '98. The very moment I happen to understand
anything at all about that flakey operating system I hope to be shot in the
back mercifully by a good friend. I mean that.

There was a keyboard and the system unit. I remembered I had put the monitor in
the attic and decided to see if I could switch it on again. I found the monitor
up in the attic, carried it down, put everything on the dining table, hooked
everything up and switched that old computer on: it still worked, hurray!

Available text

There was a 60MB disk in it and I found a bunch of files, all neatly stored
in their directories. One directory was named 'C:\KJ' and I decided to give
it a peek.

That directory contained an entire bible text, divided up in separate files,
each containing one of those bible books. It was the King James version of the
bible. I ran upstairs to the attic again where I found a bunch of old floppies.

One of my old laptops has a floppy drive as well as a wireless card and I managed
to transfer the whole shebang through those floppies (I needed five of them),
through my old laptop all the way to a laptop I use for everyday work and on
which I type up this text.

I now was the proud owner of a substantial amount of text. One of the obstacles
I envisioned before I started writing this article was me talking about text
processing without me being able to show how stuff works with a substantial
amount of text.

I didn't want to type in the lyrics of my CDs for this (I'm way too lazy for
that) and just a little bit of text doesn't cut it either. I am going to use that
King James bible text for my examples and little experiments.

My old neighbour Dirk had been reading the King James (C:\KJ) version of the
bible. I wondered if there were any Dutch versions of that bible available for
downloading but I couldn't find any; possibly my old neighbour Dirk wasn't
able to find a Dutch text version either. I assume he resorted to that
English version just because of a lack of a downloadable Dutch version.

I put all the other piles back in place and decided that that job was over and
done with. My wife disagrees but I'm not going to dig through all that again.
That stuff belongs to me and I removed Dirk's desktop from the dining table
after all, so I did my job; so there.

I am going to use that King James bible text as an example for my text processing
software. On the other hand: I don't want to tie my text processing stuff to
bible texts, neither the King James version nor any other version including
the non available Dutch version(s) of that text nor any other type of text.

Nevertheless that King James bible text makes up a good example of what I want
to talk about, or even Dirk's version thereof. English would be better in this
international forum after all.

Preparing the raw text

This is the very first part of the King James bible text:

GENESIS 1:1 In the beginning God created the heaven and the earth.
GENESIS 1:2 And the earth was without form, and void; and darkness {was} upon
the face of the deep. And the Spirit of God moved upon the face of the waters.
GENESIS 1:3 And God said, Let there be light: and there was light.

It looks good: every paragraph (or 'verse' as they call it when we're talking
about bible texts) is prepended with the name of the book, a chapter and
paragraph/verse number.
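
That prefix is regular enough to pick apart mechanically. Here is a minimal sketch in Java, assuming the layout shown above (book name in capitals, a space, chapter:verse, a space, then the text). The names VerseLine and Parsed are made up for this illustration; they don't come from any existing library.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a line such as
// "GENESIS 1:1 In the beginning ..." into book, chapter, verse and text.
final class VerseLine {
    // Assumed layout: upper-case book name (may contain spaces),
    // then "chapter:verse", then the verse text.
    private static final Pattern HEADER =
        Pattern.compile("^([A-Z][A-Z ]*?) (\\d+):(\\d+) (.*)$");

    record Parsed(String book, int chapter, int verse, String text) {}

    // Returns the parsed parts, or empty when the line carries no full
    // header (for instance a continuation line of a wrapped verse).
    static Optional<Parsed> parse(String line) {
        Matcher m = HEADER.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Parsed(
            m.group(1),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            m.group(4)));
    }
}
```

A wrapped line such as "the face of the deep. ..." simply yields an empty result, so the caller can glue it onto the previous verse.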

I didn't like that '{was}' thingy in the second verse and decided to google a
bit; I suspected Dirk's work here but wasn't sure. After a bit of googling I
figured out that the King James text was edited and altered over the centuries
and this '{was}' thing must've been added somewhere at one of those editing
sessions. Old Dirk had nothing to do with it.

I decided to remove the curly brackets for no particular reason. After browsing
and reading a bit through a few other books I found this:

I KINGS 22:23 ÿNow therefore, behold, the Lord hath put a lying spirit in the
mouth of ÿall ÿthese thy prophets, ÿand the Lord hath spoken ÿevil ÿconcerning
thee.

24 But Zedekiah the son of Chenaanah went near, and smote Micaiah on the
cheek, and said, ÿWhich way went the spirit of the Lord from me to speak
unto thee?

25 And Micaiah said, Behold, thou shalt see in that day, when thou shalt
go into an inner chamber to hide thyself.

Rats, the book name and chapter number were missing and there were funny
characters in there as well. Browsing and skimming a bit further I found
that chapter numbers were included at the start of a new chapter. Sometimes
I found ^Z characters at the end of the individual files.
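
The stray characters are the easy part. A minimal sketch in Java of such a cleanup step follows; it assumes the files are read with a single-byte encoding such as ISO-8859-1, so that the funny character arrives as 'ÿ' (U+00FF) and ^Z as U+001A. The class name RawLineCleaner is hypothetical. It also drops the curly brackets mentioned earlier while keeping the word between them.

```java
// Hypothetical cleanup step for one raw line of text.
final class RawLineCleaner {
    private static final char CTRL_Z = '\u001A';       // old DOS end-of-file marker
    private static final char Y_DIAERESIS = '\u00FF';  // the funny character in the sample

    // Removes the unwanted characters and leaves everything else untouched.
    static String clean(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == CTRL_Z || c == Y_DIAERESIS || c == '{' || c == '}') {
                continue; // skip this character only
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

Blindly dropping 'ÿ' is only safe because this English text never uses that letter on purpose; for other texts that assumption would have to be checked first.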

I don't want to correct all that by hand; I'm not a monk and I have other
things to do. From experience I know that any text, no matter how meticulously
handled and typed in and edited and proofread, contains errors. Either
typing errors or structural errors. The examples above are structural errors.

Typing errors are almost impossible to find programmatically. Those structural
errors can be handled by a program up to a certain level.
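
As an illustration of what "up to a certain level" could mean: a program can remember the last book name and chapter number it saw and put them back in front of verses that only start with a bare verse number. The sketch below is in Java and rests on assumptions: a damaged verse line looks like "24 But Zedekiah ...", and the class name HeaderRepair is made up. Chapter changes inside such files are not handled, because the exact form of those chapter markers isn't shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical repair pass: restores "BOOK chapter:" in front of bare verse numbers.
final class HeaderRepair {
    // A complete header, e.g. "I KINGS 22:23 ..." (assumed layout).
    private static final Pattern FULL =
        Pattern.compile("^([A-Z][A-Z ]*?) (\\d+):(\\d+) .*$");
    // A damaged line that starts with the verse number only, e.g. "24 But ...".
    private static final Pattern BARE = Pattern.compile("^(\\d+) (.*)$");

    static List<String> repair(List<String> lines) {
        List<String> out = new ArrayList<>();
        String book = null;     // last book name seen so far
        String chapter = null;  // last chapter number seen so far
        for (String line : lines) {
            Matcher full = FULL.matcher(line);
            Matcher bare = BARE.matcher(line);
            if (full.matches()) {
                book = full.group(1);
                chapter = full.group(2);
                out.add(line);
            } else if (book != null && bare.matches()) {
                out.add(book + " " + chapter + ":" + bare.group(1) + " " + bare.group(2));
            } else {
                out.add(line); // continuation or blank line: passed through as is
            }
        }
        return out;
    }
}
```

A continuation line that happens to start with digits would be mistaken for a verse; that is exactly the kind of case that still needs a human eye.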

Work to do: I decided I want to design and implement the following:

A simple text processing engine

The previous paragraph showed that errors occur in large amounts of text.
I consider a complete bible a large amount of text. In order to be able to
read, search, query or whatever, the text needs to be in a consistent format.
The text needs to be preprocessed; if this automatic preprocessing fails I
have to correct the text manually but that will be a last resort.

If preprocessing succeeds I want to store the text in a format which allows
for fast retrieval, querying etc. I decided that the smallest unit of text
should be a paragraph. A paragraph consists of one or more sentences, every
sentence consists of one or more words.

I want to be able to locate one or more words in a paragraph quickly. A bunch
of paragraphs form a chapter and one or more chapters make up a book. A bunch
of books are grouped together in a group and a bunch of groups makes up a
library (or a bookshelf or whatever you want to call it).
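
Locating words quickly usually comes down to some form of inverted index: a map from each word to the paragraphs that contain it. The following Java sketch shows the general idea only; it is not the design worked out in this series, and the name ParagraphIndex is hypothetical. It splits on anything that isn't a letter (using the Unicode letter category) and ignores case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical in-memory inverted index over paragraphs.
final class ParagraphIndex {
    private final List<String> paragraphs = new ArrayList<>();
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Stores a paragraph, indexes its words and returns its 0-based number.
    int add(String paragraph) {
        int id = paragraphs.size();
        paragraphs.add(paragraph);
        for (String word : paragraph.split("\\P{L}+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word.toLowerCase(Locale.ROOT),
                                         k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Numbers of the paragraphs that contain every one of the given words.
    SortedSet<Integer> find(String... words) {
        SortedSet<Integer> result = null;
        for (String word : words) {
            SortedSet<Integer> hits =
                postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(hits); // copy, so the index itself stays intact
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    String paragraph(int id) {
        return paragraphs.get(id);
    }
}
```

A plain map of sets like this is fast but not small, so it would not meet a tight size limit without some form of compression.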

So basically I divide text up like this, from coarse to fine granularity:

1: group
2: book
3: chapter
4: paragraph

This little scenario fits nicely for bibles as well as a collection of CDs
where the group would be all CDs grouped by artist, the book would be equivalent
to a single CD. The chapters would be the songs and the paragraphs would be
the parts of the song's lyrics. Other setups are possible as well with this
little scenario.
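
To make the four levels concrete, here is a small Java sketch that models them as plain value types. All names (TextModel and the records inside it) are assumptions for the sake of the example, and it stores the text naively, without any of the compactness the real thing will need.

```java
import java.util.List;

// Hypothetical value types mirroring the levels group, book, chapter, paragraph;
// the library is simply the list of groups.
final class TextModel {
    record Paragraph(String text) {}
    record Chapter(int number, List<Paragraph> paragraphs) {}
    record Book(String title, List<Chapter> chapters) {}
    record Group(String name, List<Book> books) {}

    record Library(List<Group> groups) {
        // Counts all paragraphs by walking down from groups to chapters.
        long paragraphCount() {
            return groups.stream()
                .flatMap(g -> g.books().stream())
                .flatMap(b -> b.chapters().stream())
                .mapToLong(c -> c.paragraphs().size())
                .sum();
        }
    }
}
```

For the CD collection the same types would do: a Group per artist, a Book per CD, a Chapter per song and a Paragraph per part of the lyrics.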

The ugly part would be the preprocessing of the text. A preprocessor has to be
written for a particular piece of text and could theoretically be thrown away
when the text has been preprocessed. I have to separate that part of my little
text processing and retrieval engine.

There are three additional goals I want to achieve. First, any Unicode text
should be processable, not just ASCII text. Second, the size of the resulting
structure, or object or whatever, should not be larger than the size of the
original text stored on disk. The total size of those King James bible texts is
~ 4.7MB so that would be a maximum size for my King James text retrieval object.

And last: I want to add my own notes to every paragraph I want. The notes
should be saved along with the entire thing and I want to retrieve them when
I want. I basically want a read only sort of database that contains the
bunch of books including my own notes.

I have to add all sorts of indexes and lists and whatever so there should be
some form of compression and maybe some other redundancy removal. There should
be three main parts in this little system:

1) the PreProcessor that offers a good text structure given the raw text;
2) the Builder that creates the final text retrieval system;
3) the Library, the actual text retrieval system (including my own notes).

The design and implementation of those three classes (or groups of classes) is
the subject of the next part of this article. Hopefully I'll see you again next
week when we have to solve quite some technicalities. We have work to do here.

Google is a spoilsport

After typing this article part I investigated this King James bible a bit more
and found out that there are quite some more books. They have been torn out
of the official bible. Either chapters from books or entire books were banned
from the official bible. Not just that King James XTeam did that, it had
happened centuries before already: popes, protestants, catholics, all had a big
fight over what was supposed to be true or not, fantasy, fiction or the Truth.

A bunch of books ended up at the junk pile but were still considered 'sort of
true'. The 'really true' books make up the 'canon'; the King James version
is such a 'canon'. After a bit more googling I figured out that the King James
XTeam sort of accepted sixteen more books as 'more or less true-ish'.

Those books are called 'apocryphal' books (Greek: 'apokrypha', 'those that
have been hidden away'). Hidden away by whom if I may ask?

Anyway, I decided not to be part of centuries' old crusades and what have you,
and I simply include those sixteen apocryphal books in my old neighbour
Dirk's version of the King James bible. I hope I haven't insulted Dirk, nor
anybody else: I just want to use that text as an example text for the software
I intend to design and implement.

After a bit more googling I found those apocryphal texts; I edited them a bit
manually (actually VI did all the dirty work, I just made a few macros) and
added them to my list of books to be processed for my text retrieval system.

I am going to tell my wife that my piles of notes, books, stuff etc. are not junk:
they are apocryphal. So there.

See you next week and

kind regards,

Jos

Jul 8 '07 #1
