Introduction
Last week I started thinking about a text processing facility. I already found
a substantial amount of text: a King James version of the bible. I'm going to
use that text for the examples in this article.
I want to 'transform' the entire text to a Java object that allows me to search
through the entire text in a quite powerful way. Of course that object must be
persistable (read: writable to disk and readable from disk). I'm not going to
build a view for it; I leave that as an exercise for you.
I do implement this object; I call it a 'Library' and the search facility, which
I call a 'Query'. The Library must organize text in groups, books, chapters and
paragraphs in a hierarchical way.
Groups have a title text and store books. Books also have a title and store
chapters; again, chapters have titles and store paragraphs.
A paragraph is a piece of text, usually it consists of one or more sentences.
I want to be able to find words, combinations of them or parts thereof in
paragrahs. The result should be a list of paragraphs which can be manipulated
further or shown to the user.
On top of that I want the user to be able to add 'notes' or 'annotations' to
the paragraphs. Annotations are simple pieces of text which are not searchable
themselves (otherwise they should have been part of the book text to start with).
Sections
The subdivision into groups, books and chapters are just lists of sections.
Here's a Section class:
Expand|Select|Wrap|Line Numbers
- public class Section implements Serializable {
- private static final long serialVersionUID = 2396474693631832475L;
- private String name;
- private int index;
- private int total;
- public Section(String name, int index, int total) {
- this.name = name;
- this.index= index;
- this.total= total;
- }
- public String getName() { return name; }
- public int getIndex() { return index; }
- public int getTotal() { return total; }
- }
Genesis (which is the name of the Section). The 'index' and 'total' values
refer to a subdivision of a book, in the example the index refers to a list
of chapters of which entry 'index' is the first chapter of the book represented
by this Section. The 'total' value is the total number of chapters in the book.
A Section can also be used for groups, e.g. in the King James example there are
three group Sections: the Old Testament, the New Testament and those Apocryphical
books. The 'index' value of the Old Testament Section is the index of the first
book in the Old Testament (which happens to be the book Genesis for the atheists
among us). The 'total' value would equal 39 because there are 39 books in the
Old Testament.
I'm sure you can figure out that a Section can equally well be applied to all
the chapters in a single book. So I need three lists or arrays of Sections, one
for the groups, one for the books and one for the chapters.
The 'index' and 'total' values for chapter Sections refer to the paragraphs
in all the chapters, books and groups in the library.
Here's a little example:
Suppose there are two groups with two books in each group. Also suppose that
there are 3 and 6 chapters in the first two books. There are two paragraphs in
the first chapter of the first book and three paragraphs in the second and
third chapter of the first book. The following ASCII art figure may clarify
how the Sections for the groups, books and chapters are organized: (never mind
the line numbers)
Expand|Select|Wrap|Line Numbers
- +---+---+
- |2 |2 | <-- groups
- +--|+--|+
- | |
- | +---+
- | |
- v v
- +---+---+---+---+ +---+
- |3 |6 | | | ... | b | <-- books
- +--|+---+---+---+ +---+
- |
- |
- v
- +---+---+---+---+---+---+ +---+
- |2 |3 |3 | | | | ... | c | <-- chapters
- +--|+--|+--|+---+---+---+ +---+
- | | |
- | | +-----------+
- | +---+ |
- | | |
- v v v
- +---+---+---+---+---+---+---+---+ +---+
- | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | p | <-- paragraphs
- +- -+---+---+---+---+---+---+---+ +---+
The paragraphs in the text retrieval system are all stored in a simple array
of Strings; they are referenced by the chapter Sections (see above). In the
previous part of this article I stated that I didn't want the data part of the
system to be larger (measured in bytes) than the original text. It's an arbitrary
constraint I'd like to meet.
Given those arrays of Section objects we need to compress the actual text itself.
In an ordinary conversation or article or novel no more than, say, 20,000
*different* words are used and that includes conjugations and inflections
of the stem of words, e.g. "working", "work", "works", "worked", or "devil",
'devils'. For convenience I treat them all as different words.
I even treat the genitive of a word (the "possesive form") as two separate words,
as in: "God's ways". There are three words in that fragment: "God", "s" and "ways".
A word contains letters and/or digits only; it doesn't contain any punctuation
characters or spaces. Spaces just separate words just as these punctuation
characters do.
After a bit of preprocessing, according to my sloppy definition of a word,
I found out that this King James version of the bible contains just 15267
different words; that count includes all those different names and initial
capital and lower case letters as well (e.g. "the" versus "The").
According to 'wc' (Word Count, a unix utitility) the King James bible contains
989006 words, so on average every word occurs about 989006/15267 = 63.8 times.
When you think of it, a word ends with a space or a punctuation character.
A punctuation character is followed by a space or not. So I can safely
ignore spaces (they're implicit between words and punctuation characters).
Take this part of a sentence for an example:
"Behold, I am according to thy wish in God's stead."
I put square brackets around each token I want to deal with and spaces removed:
[Behold][,][i][am][according][to][thy][wish][in][God]['][s][stead][.]
Note that that comma has "conceptually" a space to the right of it. The single
quote character and in this example the trailing dot don't have a space
character following it.
I need two types of punctuation characters: one with a following space and one
without a space character. I can ignore all spaces then. So a word is either
followed by a punctuation character (of any kind) or a space character.
I consider only one single space character, i.e. no multiple spaces, nor tab
characters or any other space-like characters.
Consider the following paragraph:
"This is a sentence. This is a sentence as well."
The method described above chops up this paragraph as:
[This][is][a][sentence][.][This][is][a][sententence][as][well][.]
Suppose a list of unique words is available:
- 0x100: This
- 0x101: is
- 0x102: a
- 0x103: sentence
- 0x104: as
- 0x105: well
Also suppose there are two lists of punctuation characters. The first list
stores punctuation characters implicitly followed by a space; the second list
stores just punctuation characters:
- 0x000: .
- 0x080: .
The above paragraph can be represented as the sequence of list index values:
0x100 0x101 0x102 0x103 0x000 0x100 0x101 0x102 0x103 0x104 0x105 0x080
What I just did is this: the code points [0x000, 0x020) are reserved for the
punctuation characters that are implicitly followed by a space. The code points
[0x080, 0xa0) are reserved for punctuation characters not followed by a
space character. All other numbers [0x0100, 0x10000) are reserved for entire
words. This implies that I can only store 2^16-256 different words. This is
not a limitation (the bible contains just 15267 different words).
In Java a String is composed of zero or more characters. A character (a 'char')
is a 16 bit unsigned number. I can (ab)use those characters for those index
numbers; the resulting paragraph can be a Java String where the String contains
just those index values.
The example paragraph above will take only 12 chars in a String instead of
the 48 characters in the original string. That sure would help to compress
the whole lot.
Resuming, I am going to (ab)use the character values as follows:
- 0x0000 - 0x001f: punctuation characters with a space to the right of it;
- 0x0080 - 0x009f: single punctuation characters;
- 0x0100 - 0xffff: an index in a list of words.
If a word is not followed by any punctuation character code, a space character
is assumed. A word is simply a non-empty sequence of what the Character.isLetter()
or Character.isDigit() methods consider to be true.
Note that the character values [0x0020, 0x007f] and [0x00a0, 0x00ff] aren't
used; so be it. Also note that I transformed the character codes of the
punctuation characters to the ranges which make up 'ISO control characters'
according to Java's Character.isISOControl() method.
I hope to see you in the next part of this article below. I'll show you how
I designed and implemented the 'processor' part of the text processing system.
kind regards,
Jos
Part B