440,581 Members | 2,007 Online
Need help? Post your question and get tips & solutions from a community of 440,581 IT Pros & Developers. It's quick & easy.

# tough interview question

 P: n/a Hi there, I got a tough interview questions lately, and I would like to hear your opinion: An array of N chars is given Write an efficient algorithm to find all the repeating substring with a minimal size of 2 f.e ABCFABHYIFAB sunstrings are: "AB" "FAB" Mar 3 '06 #1
9 Replies

 P: n/a denis wrote: I got a tough interview questions lately, and I would like to hear your opinion: An array of N chars is given Write an efficient algorithm to find all the repeating substring with a minimal size of 2 I would agree that this is a though interview question! Did they really wanted a correct and (provably) efficient algorithm? ... or did they want to see how you get about it? I think the following would do the trick: #include #include #include #include int main() { typedef std::string::const_iterator iterator; std::string s("ABCFABHYIFAB"); std::set found; if (2 < s.size()) for (iterator oit = s.begin() + 1, oe = s.end(); oit != oe; ++oit) for (iterator iit = s.begin(); iit != oit; ++iit) { iterator tmp = std::mismatch(oit, oe, iit).second;; if (tmp - iit > 1) found.insert(std::string(iit, tmp)); } std::copy(found.begin(), found.end(), std::ostream_iterator(std::cout, "\n")); } .... but I'm not sure that it is the most efficient version. -- - Efficient Artificial Intelligence Mar 3 '06 #2

 P: n/a what abt quickly sorting the strings & then go abt.... Mar 3 '06 #3

 P: n/a dc wrote: what abt quickly sorting the strings & then go abt.... My algorithm is at worst an O(n^3) algorithm. With your algorithm you would create O(n^2) strings, with a size of up to O(n), i.e. comparing each string is not a constant but an O(n) operation. I think this effectively means that sorting the strings takes O(n^3 ln n) operations. A more thorough analysis might, however, reduce these numbers somewhat because the majority number of strings is smaller than O(n), I wouldn't expect this to fit into an interview, however, unless it is for a rather scientific position. It is definitely important to know about asymptotic performance and choose proper algorithms but it is also important to get the job done in a reasonable amount of time which also adds to the selection criteria. I think that both algorithms are in the same order with respect to implementation complexity, maybe yours is slightly simpler. Thus, I would expect both to be a reasonable answer, although it is hard to tell what the interviewer actually wanted to see. -- - Efficient Artificial Intelligence Mar 3 '06 #4

 P: n/a denis wrote: Hi there, I got a tough interview questions lately, and I would like to hear your opinion: An array of N chars is given Write an efficient algorithm to find all the repeating substring with a minimal size of 2 f.e ABCFABHYIFAB sunstrings are: "AB" "FAB" They might have been intrested if you are aware of string search algorithms. http://en.wikipedia.org/wiki/String_search Mar 3 '06 #5

 P: n/a Here my version of algorithm. Experimentally I found that number of searches it performs is: [Length of string]+[Number of unique substring] - [Some Number]. Where [Some Number] = 3 or 4 (may be some other). Don't know why. Any ideas? void foo(std::string& s) { std::set sSet; int findCount = 0; for (int firstElement = 0; firstElement < s.size(); firstElement++) { for (int stringLen = 2; firstElement+stringLen < s.size(); stringLen++) { std::string t = s.substr(firstElement,stringLen); if (sSet.find(t)!=sSet.end()) continue; findCount++; if (s.find(t,firstElement+stringLen) != std::string::npos) sSet.insert(t); else break; } } std::copy(sSet.begin(), sSet.end(), std::ostream_iterator(std::cout, "\n")); cout << "Search count " << findCount << endl; cout << "Calc search count either " << s.length() + sSet.size() - 3 << endl; cout << " or " << s.length() + sSet.size() - 4 << endl; cout << "==============" << endl; } int main( void ) { std::string s = "ABCFABHYIFAB"; foo(s); return 0; } Mar 3 '06 #6

 P: n/a Hello, denis wrote: Hi there, I got a tough interview questions lately, and I would like to hear your opinion: An array of N chars is given Write an efficient algorithm to find all the repeating substring with a minimal size of 2 f.e ABCFABHYIFAB sunstrings are: "AB" "FAB" I wonder, what purpose this interview question should have served. If you need people with expertise in string algorithms, then it should be clear from the CV, whether this can be expected or not. I wonder what the connection of the question is with C++. It is a purely algorithmic question. Since I had to deal with a similar problem a few years ago, and implemented it in C++ I can at least give you a few ideas. One basic idea is to construct a suffix tree (a generalization of tries, similar to digital trees), which is possible in time O(N), given constant alphabet size. Look for authors McCreight or Ukkonen which have written papers on it. They have been cited a lot, since a lot of work has been done on this data structure in the last 20 years. The suffix tree has at most 2 nodes per suffix, one leaf node per suffix, and there is an inner node per maximal prefix which is a prefix of at least two suffixes. This is a very rough explanation of what can be explained better on a few pages with pictures. The proof that this tree can be constructed in time O(N) is non-trivial, and needs some extra pointers in the tree. But it should be easier to understand that the suffix tree has size O(N) and can be traversed in time O(N). The inner nodes which are are at least two characters down from the root correspond to substrings of minimal size 2 with repeated occurrences. So there is a solution in time O(N) for the overall problem of finding the repeated substrings, i.e. algorithmically optimal. The constant involved is not overly high, the suffix tree might pay out for N in the thousands already, and especially if you can use the same suffix tree many times for different questions. I don't know of any implementation generic enough that it could be just plugged into the C++ standard library. The system I implemented the suffix tree for used a string class of its own with rather different requirements than std::string. A suffix tree with the O(N) size can only be implemented as a kind of add-on index of a string or set of strings. So this does not fit at all anything in the C++ standard library. It is neither a container of it's own, nor an algorithm, it's a chimaera, and it is fat in memory consumption, up to about 10 pointers per character, if you add all usual bells and whistles. There are more recent refinements of suffix trees, e.g. suffix arrays, suffix cactus and suffix whatever, which try to reduce the memory consumption. Another method that might help getting near to O(N) is hashing/fingerprinting as proposed by Karp and Rabin. Due to hasing O(N) worst case will not be maintainable, but, as usual with hashing, the average will be nice. The constant and the memory consumption will be a lot less compared to suffix trees. Can any curriculum be expected to treat suffix trees or Rabin-Karp matching? I doubt that there is any message other than the trivial one from the fact that somebody has not heard about it. An interviewer will always be able to make you either reinvent simple and inefficient solutions or give up, since there are many simply stated problems with non-trivial solutions. Bernd Strieder Mar 3 '06 #7

 P: n/a On 2006-03-03, Dietmar Kuehl wrote: denis wrote: I got a tough interview questions lately, and I would like to hear your opinion: An array of N chars is given Write an efficient algorithm to find all the repeating substring with a minimal size of 2 I would agree that this is a though interview question! Did they really wanted a correct and (provably) efficient algorithm? ... or did they want to see how you get about it? I think the following would do the trick: #include #include #include #include int main() { typedef std::string::const_iterator iterator; std::string s("ABCFABHYIFAB"); std::set found; if (2 < s.size()) for (iterator oit = s.begin() + 1, oe = s.end(); oit != oe; ++oit) You can save (N-1) fruitless calls to mismatch by bumping one off the end. Since "B" is not a sequence, it needn't be checked against "A", "B", "C", etc...: for (iterator oit = s.begin() +1, oe = s.end() -1; oit < oe; ++oit) for (iterator iit = s.begin(); iit != oit; ++iit) { iterator tmp = std::mismatch(oit, oe, iit).second;; if (tmp - iit > 1) found.insert(std::string(iit, tmp)); } std::copy(found.begin(), found.end(), std::ostream_iterator(std::cout, "\n")); } ... but I'm not sure that it is the most efficient version. This may be a problem: "ebebe" => "be" "ebe" It's hard to tell from the problem descrition. -- Neil Cerutti Mar 3 '06 #8

 P: n/a Dietmar Kuehl schrieb: dc wrote: what abt quickly sorting the strings & then go abt.... My algorithm is at worst an O(n^3) algorithm. With your algorithm you would create O(n^2) strings, you dont need n^2 strings for the same (buggy, eg "fabfab") results #include #include #include #include class substring { public: char *s; substring(char *ss) : s(ss) { } bool operator <(const substring &second) const { return strcmp(s, second.s) < 0; } int match(const substring &second) { int i = 0; while (s[i] != 0 && s[i] == second.s[i]) i++; return i; } }; int main() { char *s = "ABCFABHYIFAB"; std::vector strings; for (int i = 0; i < strlen(s); i++) strings.push_back(substring(s+i)); std::sort(strings.begin(), strings.end()); int match, lmatch = 0; std::vector::iterator it; for (it = strings.begin(); it != strings.end()-1; it++) { if (((match = (*it).match(*(it+1))) > lmatch) && (match 1)) std::cout << std::string((*it).s, match) << std::endl; lmatch = match; } return 0; } with a size of up to O(n), i.e. comparing each string is not a constant but an O(n) operation. I think this effectively means that sorting the strings takes O(n^3 ln n) operations. A more thorough analysis might, however, reduce these numbers somewhat because the majority number of strings is smaller than O(n), I wouldn't expect this to fit into an interview, however, unless it is for a rather scientific position. It is definitely important to know about asymptotic performance and choose proper algorithms but it is also important to get the job done in a reasonable amount of time which also adds to the selection criteria. I think that both algorithms are in the same order with respect to implementation complexity, maybe yours is slightly simpler. Thus, I would expect both to be a reasonable answer, although it is hard to tell what the interviewer actually wanted to see. -- - Efficient Artificial Intelligence Mar 3 '06 #9

 P: n/a denis wrote: Hi there, I got a tough interview questions lately, and I would like to hear your opinion: An array of N chars is given Write an efficient algorithm to find all the repeating substring with a eg ABCFABHYIFAB sunstrings are: "AB" "FAB" You can create an array 'p' of char*s into the given string and initialise it so that: p[0] = "ABCFABHYIFAB" p[1] = "BCFABHYIFAB" p[2] = "CFABHYIFAB" p[3] = "FABHYIFAB" : p[9] = "FAB" p[10] = "AB" p[11] = "B" Now sort this array. You can then scan the sorted array looking for adjacent pairs whose first 2 chars are the same, eg (p[0],p[10]) and (p[3],p[9]) above. This trick appeared in Dr Dobbs some time back. Mar 3 '06 #10

### This discussion thread is closed

Replies have been disabled for this discussion.