473,385 Members | 1,409 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes and contribute your articles to a community of 473,385 developers and data experts.

Text retrieval systems - 2A: Text Processors

11,448 Expert 8TB
Greetings,

Introduction

Last week I started thinking about a text processing facility. I already found
a substantial amount of text: a King James version of the bible. I'm going to
use that text for the examples in this article.

I want to 'transform' the entire text to a Java object that allows me to search
through the entire text in a quite powerful way. Of course that object must be
persistable (read: writable to disk and readable from disk). I'm not going to
build a view for it; I leave that as an exercise for you.

I do implement this object; I call it a 'Library' and the search facility, which
I call a 'Query'. The Library must organize text in groups, books, chapters and
paragraphs in a hierarchical way.

Groups have a title text and store books. Books also have a title and store
chapters; again, chapters have titles and store paragraphs.

A paragraph is a piece of text, usually it consists of one or more sentences.
I want to be able to find words, combinations of them or parts thereof in
paragrahs. The result should be a list of paragraphs which can be manipulated
further or shown to the user.

On top of that I want the user to be able to add 'notes' or 'annotations' to
the paragraphs. Annotations are simple pieces of text which are not searchable
themselves (otherwise they should have been part of the book text to start with).

Sections

The subdivision into groups, books and chapters are just lists of sections.
Here's a Section class:

Expand|Select|Wrap|Line Numbers
  1. public class Section implements Serializable {
  2.  
  3.     private static final long serialVersionUID = 2396474693631832475L;
  4.  
  5.     private String name;
  6.     private int index;
  7.     private int total;
  8.  
  9.     public Section(String name, int index, int total) {
  10.  
  11.         this.name = name;
  12.         this.index= index;
  13.         this.total= total;
  14.     }
  15.  
  16.     public String getName() { return name; }
  17.  
  18.     public int getIndex() { return index; }
  19.  
  20.     public int getTotal() { return total; }
  21. }
  22.  
A Section can be used for, say, books, e.g. a Section can describe the book
Genesis (which is the name of the Section). The 'index' and 'total' values
refer to a subdivision of a book, in the example the index refers to a list
of chapters of which entry 'index' is the first chapter of the book represented
by this Section. The 'total' value is the total number of chapters in the book.

A Section can also be used for groups, e.g. in the King James example there are
three group Sections: the Old Testament, the New Testament and those Apocryphical
books. The 'index' value of the Old Testament Section is the index of the first
book in the Old Testament (which happens to be the book Genesis for the atheists
among us). The 'total' value would equal 39 because there are 39 books in the
Old Testament.

I'm sure you can figure out that a Section can equally well be applied to all
the chapters in a single book. So I need three lists or arrays of Sections, one
for the groups, one for the books and one for the chapters.

The 'index' and 'total' values for chapter Sections refer to the paragraphs
in all the chapters, books and groups in the library.

Here's a little example:

Suppose there are two groups with two books in each group. Also suppose that
there are 3 and 6 chapters in the first two books. There are two paragraphs in
the first chapter of the first book and three paragraphs in the second and
third chapter of the first book. The following ASCII art figure may clarify
how the Sections for the groups, books and chapters are organized: (never mind
the line numbers)

Expand|Select|Wrap|Line Numbers
  1. +---+---+
  2. |2  |2  | <-- groups
  3. +--|+--|+
  4.    |   |
  5.    |   +---+
  6.    |       |
  7.    v       v
  8. +---+---+---+---+       +---+
  9. |3  |6  |   |   |  ...  | b | <-- books
  10. +--|+---+---+---+       +---+
  11.    | 
  12.    |
  13.    v
  14. +---+---+---+---+---+---+       +---+
  15. |2  |3  |3  |   |   |   |  ...  | c | <-- chapters
  16. +--|+--|+--|+---+---+---+       +---+
  17.    |   |   |  
  18.    |   |   +-----------+  
  19.    |   +---+           |   
  20.    |       |           |  
  21.    v       v           v  
  22. +---+---+---+---+---+---+---+---+       +---+
  23. | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |  ...  | p | <-- paragraphs
  24. +- -+---+---+---+---+---+---+---+       +---+
  25.  
Paragraphs and text compression

The paragraphs in the text retrieval system are all stored in a simple array
of Strings; they are referenced by the chapter Sections (see above). In the
previous part of this article I stated that I didn't want the data part of the
system to be larger (measured in bytes) than the original text. It's an arbitrary
constraint I'd like to meet.

Given those arrays of Section objects we need to compress the actual text itself.
In an ordinary conversation or article or novel no more than, say, 20,000
*different* words are used and that includes conjugations and inflections
of the stem of words, e.g. "working", "work", "works", "worked", or "devil",
'devils'. For convenience I treat them all as different words.

I even treat the genitive of a word (the "possesive form") as two separate words,
as in: "God's ways". There are three words in that fragment: "God", "s" and "ways".

A word contains letters and/or digits only; it doesn't contain any punctuation
characters or spaces. Spaces just separate words just as these punctuation
characters do.

After a bit of preprocessing, according to my sloppy definition of a word,
I found out that this King James version of the bible contains just 15267
different words; that count includes all those different names and initial
capital and lower case letters as well (e.g. "the" versus "The").

According to 'wc' (Word Count, a unix utitility) the King James bible contains
989006 words, so on average every word occurs about 989006/15267 = 63.8 times.

When you think of it, a word ends with a space or a punctuation character.
A punctuation character is followed by a space or not. So I can safely
ignore spaces (they're implicit between words and punctuation characters).

Take this part of a sentence for an example:

"Behold, I am according to thy wish in God's stead."

I put square brackets around each token I want to deal with and spaces removed:

[Behold][,][i][am][according][to][thy][wish][in][God]['][s][stead][.]

Note that that comma has "conceptually" a space to the right of it. The single
quote character and in this example the trailing dot don't have a space
character following it.

I need two types of punctuation characters: one with a following space and one
without a space character. I can ignore all spaces then. So a word is either
followed by a punctuation character (of any kind) or a space character.

I consider only one single space character, i.e. no multiple spaces, nor tab
characters or any other space-like characters.

Consider the following paragraph:

"This is a sentence. This is a sentence as well."

The method described above chops up this paragraph as:

[This][is][a][sentence][.][This][is][a][sententence][as][well][.]

Suppose a list of unique words is available:
  • 0x100: This
  • 0x101: is
  • 0x102: a
  • 0x103: sentence
  • 0x104: as
  • 0x105: well

Also suppose there are two lists of punctuation characters. The first list
stores punctuation characters implicitly followed by a space; the second list
stores just punctuation characters:
  • 0x000: .
  • 0x080: .

The above paragraph can be represented as the sequence of list index values:

0x100 0x101 0x102 0x103 0x000 0x100 0x101 0x102 0x103 0x104 0x105 0x080

What I just did is this: the code points [0x000, 0x020) are reserved for the
punctuation characters that are implicitly followed by a space. The code points
[0x080, 0xa0) are reserved for punctuation characters not followed by a
space character. All other numbers [0x0100, 0x10000) are reserved for entire
words. This implies that I can only store 2^16-256 different words. This is
not a limitation (the bible contains just 15267 different words).

In Java a String is composed of zero or more characters. A character (a 'char')
is a 16 bit unsigned number. I can (ab)use those characters for those index
numbers; the resulting paragraph can be a Java String where the String contains
just those index values.

The example paragraph above will take only 12 chars in a String instead of
the 48 characters in the original string. That sure would help to compress
the whole lot.

Resuming, I am going to (ab)use the character values as follows:
  • 0x0000 - 0x001f: punctuation characters with a space to the right of it;
  • 0x0080 - 0x009f: single punctuation characters;
  • 0x0100 - 0xffff: an index in a list of words.

If a word is not followed by any punctuation character code, a space character
is assumed. A word is simply a non-empty sequence of what the Character.isLetter()
or Character.isDigit() methods consider to be true.

Note that the character values [0x0020, 0x007f] and [0x00a0, 0x00ff] aren't
used; so be it. Also note that I transformed the character codes of the
punctuation characters to the ranges which make up 'ISO control characters'
according to Java's Character.isISOControl() method.

I hope to see you in the next part of this article below. I'll show you how
I designed and implemented the 'processor' part of the text processing system.

kind regards,

Jos

Part B
Jul 12 '07 #1
0 4026

Sign in to post your reply or Sign up for a free account.

Similar topics

0
by: SoftComplete Development | last post by:
AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing...
16
by: Ioannis Vranos | last post by:
Since multicore processors are about to become mainstream soon, multithreading will become a main concern too. However I am thinking that perhaps for small/medium-sized applications...
0
by: JosAH | last post by:
Greetings, Introduction At the end of the last Compiler article part I stated that I wanted to write about text processing. I had no idea what exactly to talk about; until my wife commanded...
0
by: JosAH | last post by:
Greetings, Introduction Before we start designing and implementing our text builder class(es), I'd like to mention a reply by Prometheuzz: he had a Dutch version of the entire bible ...
0
by: JosAH | last post by:
Greetings, the last two article parts described the design and implementation of the text Processor which spoonfeeds paragraphs of text to the LibraryBuilder. The latter object organizes, cleans...
0
by: JosAH | last post by:
Greetings, Introduction At this moment we have a TextProcessor, a LibraryBuilder as well as the Library itself. As you read last week a Library is capable of producing pieces of text in a...
1
by: JosAH | last post by:
Greetings, Introduction This week we start building Query objects. A query can retrieve portions of text from a Library. I don't want users to build queries by themselves, because users make...
0
by: JosAH | last post by:
Greetings, welcome back; above we discussed the peripherals of the Library class: loading and saving such an instantiation of it, the BookMark interface and then some. This part of the article...
20
by: =?ISO-8859-1?Q?Tom=E1s_=D3_h=C9ilidhe?= | last post by:
There are a few guarantees I exploit in the C Standard. For instance, I might write (unsigned)-1 to get the maximum value for an unsigned integer. Also, I might rely on things such as: ...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.