"James Aguilar" <jf**@cec.wustl .edu> wrote in message
news:d2******** **@newsreader.w ustl.edu...
Hi James,
I have a program that runs the Manber-Myers suffix array search algorithm
with longest common prefix values. I need an efficient way to represent a
sequential array of characters to be passed into the algorithm. Here are
the features I need for the representation:
* Must be randomly accessible in constant time
* Must be capable of efficiently storing strings of more than 10 million
characters
Here are some plusses:
* Dereferencing an iterator should be fast, not just constant time, but
quick constant time, since it will be happening often (at least (size of
pattern) + log(size of text) times)
* Incrementing an iterator should be similarly fast
It would also be useful if this representation provided an efficient
method of extracting all the characters for the problem from the standard
input (again, no less than 5 million characters, randomly distributed
among A, C, G, and T).
Now, here's the question: which is better for this problem, a string or a
rope. If a rope would be best, can you offer any guidance as to ways to
make the use of the rope as fast and painless as possible (if that is the
preferable representation) .
The implementation of rope::iterator is somewhat more complex than for
string (where it is basically a pointer), and will introduce an overhead.
The only true advantage of a rope (i.e. the non-standard rope class from
the SGI STL) is that it supports efficient *editing* of a small portion
of the string (not just replacing, but also inserting/deleting/splicing
subsequences within the original string). Unless this is relevant to
what you need to do, there is no point in using rope instead of string.
The caveat with std::string, depending on how the platform you use
implements the standard library, is to try to avoid unnecessary copies
of the data structure (be careful to pass the strings as references
whenever possible, to avoid object copies in case your implementation
does not use reference-counting).
In case top performance is critical, passing the input as a file
and mapping it into memory (with platform-specific mmap/MapViewOfFile),
will most likely be the fastest approach at run-time.
Note also that your algorithm could be templated on the iterator type,
allowing it to be used on any type of source data range.
I hope this helps,
Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Brainbench MVP for C++ <>
http://www.brainbench.com