Hello,

denis wrote:

Hi there,

I got a tough interview questions lately, and I would like to hear

your opinion:

An array of N chars is given

Write an efficient algorithm to find all the repeating substring with

a

minimal size

of 2

f.e

ABCFABHYIFAB

sunstrings are:

"AB"

"FAB"

I wonder, what purpose this interview question should have served. If

you need people with expertise in string algorithms, then it should be

clear from the CV, whether this can be expected or not.

I wonder what the connection of the question is with C++. It is a purely

algorithmic question. Since I had to deal with a similar problem a few

years ago, and implemented it in C++ I can at least give you a few

ideas.

One basic idea is to construct a suffix tree (a generalization of tries,

similar to digital trees), which is possible in time O(N), given

constant alphabet size. Look for authors McCreight or Ukkonen which

have written papers on it. They have been cited a lot, since a lot of

work has been done on this data structure in the last 20 years.

The suffix tree has at most 2 nodes per suffix, one leaf node per

suffix, and there is an inner node per maximal prefix which is a prefix

of at least two suffixes. This is a very rough explanation of what can

be explained better on a few pages with pictures. The proof that this

tree can be constructed in time O(N) is non-trivial, and needs some

extra pointers in the tree. But it should be easier to understand that

the suffix tree has size O(N) and can be traversed in time O(N). The

inner nodes which are are at least two characters down from the root

correspond to substrings of minimal size 2 with repeated occurrences.

So there is a solution in time O(N) for the overall problem of finding

the repeated substrings, i.e. algorithmically optimal. The constant

involved is not overly high, the suffix tree might pay out for N in the

thousands already, and especially if you can use the same suffix tree

many times for different questions.

I don't know of any implementation generic enough that it could be just

plugged into the C++ standard library. The system I implemented the

suffix tree for used a string class of its own with rather different

requirements than std::string. A suffix tree with the O(N) size can

only be implemented as a kind of add-on index of a string or set of

strings. So this does not fit at all anything in the C++ standard

library. It is neither a container of it's own, nor an algorithm, it's

a chimaera, and it is fat in memory consumption, up to about 10

pointers per character, if you add all usual bells and whistles. There

are more recent refinements of suffix trees, e.g. suffix arrays, suffix

cactus and suffix whatever, which try to reduce the memory consumption.

Another method that might help getting near to O(N) is

hashing/fingerprinting as proposed by Karp and Rabin. Due to hasing

O(N) worst case will not be maintainable, but, as usual with hashing,

the average will be nice. The constant and the memory consumption will

be a lot less compared to suffix trees.

Can any curriculum be expected to treat suffix trees or Rabin-Karp

matching? I doubt that there is any message other than the trivial one

from the fact that somebody has not heard about it. An interviewer will

always be able to make you either reinvent simple and inefficient

solutions or give up, since there are many simply stated problems with

non-trivial solutions.

Bernd Strieder