Dameon wrote:
>
Any advice on how to implement Boyer-Moore in C# for the following scenario:
1. I have an automated process which needs to add a closing tag to a file
2. The close tag cannot be added until the file has X amount of occurences
of a specified string
3. The automated process will call methodX to check for X amount of
occurences in the specified file
4. When the correct # of occurences are found, the closing tag will be added.
The first question is, how many of the files are likely to need the
closing tag added?
1. Most of them
2. A few of them
This governs your solution. If the answer is "a few of them" then you
can tolerate an algorithm that first has to read the file to see if it
is a match, and then has to read it again to actually append the tag.
This is easier. If the answer is "most of them" then you have to search
and append the tag all in one operation, which is harder.
Unfortunately, there's no free lunch, either: if you choose the second,
harder algorithm, then its performance will suck if you have to append
to only a few files. If you choose the first, easier algorithm, then
its performance will suck if you have to append to most files.
In either case, in order to do the search, I would create an array of
characters in memory that is the same size as or larger than the search
string. I would then wrap it in a class that gives you "circular
buffer" semantics with indices that work backward (where index 0 is the
last character you just read, 1 is the one before that, 2 is the one
before that, etc):
public class CircularBuffer
{
private char[] _buffer;
private int _length;
private int _zeroIndex;
public CircularBuffer(int capacity)
{
this._buffer = new char[capacity];
this._length = 0;
this._zeroIndex = 0;
}
public void Add(char newChar)
{
this._zeroIndex -= 1;
if (this._zeroIndex < 0)
{
this._zeroIndex += this._buffer.Length;
}
this._buffer[this._zeroIndex] = newChar;
if (this._length < this._buffer.Length)
{
this._length += 1;
}
}
public char this[int index]
{
if (index < 0 || index >= this._length)
{
throw new IndexOutOfRangeException(...);
}
int trueIndex = this._zeroIndex + index;
// Yes, one could use a modulus operator to do this, but I
would expect
// a test followed by a subtraction to be slightly faster than
a modulus,
// which requires an integer division operation.
if (trueIndex 0)
{
trueIndex -= this._buffer.Length;
}
return this._buffer[trueIndex];
}
}
Basically what this gives you is a buffer that holds the last
"capacity" characters that you have read from a stream. As you read
each character you can "Add" it to the buffer, which makes one
character "fall off" the "start" of the array to make room for the new
character.
With this, it should be easy enough to program Boyer-Moore: you have a
rolling record of enough characters from the stream that you can move
far enough backward through the stream to make the required
comparisons.
Now the only problem is adding the closing tag.
Here the only question is whether you want to be writing an output file
as you are doing Boyer-Moore, and throw away the copied file if it
turns out that the file fails the test, or whether you want to do
Boyer-Moore first and then, if the file passes the test, read the file
again, copying it into a new file and appending the closing tag.
In the first case, you will waste time writing files you don't need if
the Boyer-Moore test eventually reveals that the file isn't a match. In
the second case, you have to read the file again if the test reveals
that the file is a match. So, if few files will be a match, then the
second algorithm (read to test / read again to copy) is obviously more
performant. On the other hand, if most files will be a match, then the
first algorithm (always write new file / delete if not needed) will be
more performant. This first algorithm is, by the way, slightly easier
to write, as you can write two completely separate bits of code, each
of which has one job to do. However, it really depends upon your mix of
files which is the best way to do it.