
Vector question

 P: n/a Hi, everyone. Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. Kitty Jul 22 '05 #1
16 Replies

 P: n/a In article <41**********@rain.i-cable.com>, Kitty wrote: Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. I'd construct an empty set, then walk through the vector, processing each element in turn. If the element is in the set already, stop and return 'true'; otherwise insert the element into the set and continue. If you reach the end of the vector without finding any duplicates, return 'false'. -- Jon Bell Presbyterian College Dept. of Physics and Computer Science Clinton, South Carolina USA Jul 22 '05 #2

 P: n/a "Kitty" wrote in message news:41**********@rain.i-cable.com... Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. Sounds like an interview question. There are many answers, each with tradeoffs. Jon's method uses additional memory, but is fast time-wise (time is O(N lg N), space is O(N)), and easy to write too. That matters because software that is easier to write and maintain pays off in the long term, though this rarely comes up in interviews (though it should). You can also avoid the additional memory at the cost of time (time would be O(N^2), space O(1)). And each method has its own variations, such as set or hashtable, type of hash function if a hash, set or sorted vector, vector or list, etc. Jul 22 '05 #3
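The hashtable variation mentioned above could look roughly like the sketch below. It assumes C++11 (std::unordered_set did not exist in the standard library when this thread was written) and an element type that std::hash supports; the name has_duplicates_hashed is made up for illustration. Expected time is O(N) with O(N) extra space, and it stops at the first repeat.

```cpp
#include <unordered_set>
#include <vector>

// Returns true as soon as some value is seen for the second time.
// Assumes T is hashable via std::hash and equality-comparable.
template <typename T>
bool has_duplicates_hashed(const std::vector<T>& v) {
    std::unordered_set<T> seen;
    seen.reserve(v.size());  // reduce rehashing while inserting
    for (const T& x : v) {
        // insert() returns a pair; .second is false if x was already present
        if (!seen.insert(x).second) {
            return true;
        }
    }
    return false;
}
```

The input vector is taken by const reference, so the original order is left untouched.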

 P: n/a In a similar problem, I use a different approach, because Jon's solution is not an option for me: it consumes too much memory. I use a modified qsort, and in the comparison step, if two successive elements are the same, I return true. This is slow with small lists but fast with huge datasets (30,000+ elements, as in my application, take about 100 ms) and requires no additional memory! /Casper Kitty wrote: Hi, everyone. Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. Kitty Jul 22 '05 #4

 P: n/a "Casper" wrote in message news:GFeYc.73396 In a similar problem, I use a different approach, because Jon's solution is not an option for me: it consumes too much memory. I use a modified qsort, and in the comparison step, if two successive elements are the same, I return true. This is slow with small lists but fast with huge datasets (30,000+ elements, as in my application, take about 100 ms) and requires no additional memory! A good solution, I think the fastest possible without using extra space. But it changes the original vector. Can you think of a way to do it without changing the original? Jul 22 '05 #5
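For reference, the in-place idea can be approximated with standard algorithms. This is a simplified stand-in, not Casper's early-exit qsort: it always finishes the sort before checking neighbours, and the name has_duplicates_sorting is made up for illustration. It shows the tradeoff under discussion: no second container, but the caller's vector ends up reordered.

```cpp
#include <algorithm>
#include <vector>

// Sorts v in place (the caller's element order is lost), then looks for
// two equal neighbours. No second container is allocated here.
template <typename T>
bool has_duplicates_sorting(std::vector<T>& v) {
    std::sort(v.begin(), v.end());
    // adjacent_find returns end() when no two neighbouring elements are equal
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}
```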

 P: n/a "Kitty" wrote: Hi, everyone. Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. Kitty

Here are three obvious ways of using standard algorithms. These solutions have the advantage that they are easy to understand. The second one could avoid copying if it was allowed to modify the sequence in the first place.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

template < typename ConstIter >
bool has_dublicates_a ( ConstIter from, ConstIter to ) {
  typedef typename std::iterator_traits< ConstIter >::value_type value_type;
  typedef typename std::set< value_type > set_type;
  typedef typename set_type::size_type size_type;
  size_type i = 0;
  set_type s;
  for ( ConstIter iter = from; iter != to; ++iter ) {
    ++ i;
    s.insert( *iter );
    // this test could also be done after the loop:
    if ( s.size() != i ) { return( true ); }
  }
  return( false );
}

template < typename ConstIter >
bool has_dublicates_b ( ConstIter from, ConstIter to ) {
  typedef typename std::iterator_traits< ConstIter >::value_type value_type;
  typedef typename std::vector< value_type > vector_type;
  vector_type v ( from, to );  // work on a copy, the input stays untouched
  std::sort( v.begin(), v.end() );
  return( std::adjacent_find( v.begin(), v.end() ) != v.end() );
}

int main( int argn, char ** args ){
  std::vector< int > a;
  std::vector< int > b;
  for ( int i = 0; i < 10; ++i ) {
    a.push_back( i );
    b.push_back( i );
    b.push_back( i );
  }
  std::cout << has_dublicates_a( a.begin(), a.end() ) << " " << has_dublicates_a( b.begin(), b.end() ) << "\n";
  std::cout << has_dublicates_b( a.begin(), a.end() ) << " " << has_dublicates_b( b.begin(), b.end() ) << "\n";
}

As for speed, I would suggest a method based on the "partition by exchange" idea underlying quicksort, possibly modified like introsort to avoid O(N^2) worst case runtime. But that idea has been mentioned in another posting already. Best Kai-Uwe Bux Jul 22 '05 #6

 P: n/a "Kitty" wrote in message news:41**********@rain.i-cable.com... Hi, everyone. Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. Kitty See the thread titled "If vector contains only different elements" at http://groups.google.com/groups?thre...0uni-berlin.de -- Alex Vinokur http://mathforum.org/library/view/10978.html http://sourceforge.net/users/alexvn Jul 22 '05 #7

 P: n/a Siemel Naran wrote: "Casper" wrote in message news:GFeYc.73396 In a similar problem, I use a different approach, because Jon's solution is not an option for me: it consumes too much memory. I use a modified qsort, and in the comparison step, if two successive elements are the same, I return true. This is slow with small lists but fast with huge datasets (30,000+ elements, as in my application, take about 100 ms) and requires no additional memory! A good solution, I think the fastest possible without using extra space. I agree that this is probably the fastest solution using comparisons. Given that the question was for vector, there could be a linear time solution. Here is a (messy) draft of something like that. The idea is to do partition by exchange based on the even/odd distinction (note that duplicates have the same parity). So in one pass, we put all the even elements to the left of all the odd elements. During the same pass, we shift every element down by one bit. Now, we return true if there is a duplicate in the left segment or in the right segment. Farther down in the recursion, we can sometimes return true just based on the length of the segment: if we have shifted so many times that all values are in [0..15], then any segment of length 17 is bound to contain duplicates. The run time for the worst case should be O( N * bit_length ).
#include <algorithm>
#include <iostream>
#include <limits>
#include <vector>

typedef unsigned int u_int;
typedef std::vector< unsigned int > Uvector;

// debug output only
void print_range ( u_int* from, u_int* to ) {
  std::cerr << "[ ";
  while ( from <= to ) {
    std::cerr << *from << " ";
    ++from;
  }
  std::cerr << "]";
}

bool has_dublicates_helper ( u_int* from, u_int* to, u_int max ) {
  // WARNING, range is closed: [from,to]
  print_range( from, to );
  if ( to <= from ) { return( false ); }
  // more elements than possible values: there must be a dublicate
  if ( max < to - from ) { return( true ); }
  u_int* low = from;
  u_int* high = to;
  while ( true ) {
    while ( ( low < high ) && ( *low % 2 == 0 ) ) {
      *low >>= 1;
      ++low;
    }
    while ( ( low < high ) && ( *high % 2 != 0 ) ) {
      *high >>= 1;
      --high;
    }
    // either ( low == high ) or ( *high is even )
    if ( low < high ) {
      // *low is odd and *high is even
      std::swap( *low, *high );
    } else {
      break;
    }
  }
  std::cerr << std::endl;
  print_range( from, to );
  std::cerr << std::endl;
  // low == high;
  if ( *low % 2 == 0 ) {
    *low >>= 1;
    return( has_dublicates_helper( from, low, max >> 1 ) || has_dublicates_helper( high+1, to, max >>1 ) );
  } else {
    *low >>= 1;
    return( has_dublicates_helper( from, low-1, max >> 1 ) || has_dublicates_helper( high, to, max >>1 ) );
  }
}

bool has_dublicates ( Uvector u_vect ) {
  // this makes a copy
  if ( u_vect.empty() ) { return( false ); }  // size()-1 would wrap around
  return( has_dublicates_helper( &u_vect[0], &u_vect[ u_vect.size()-1 ], std::numeric_limits< unsigned int >::max() ) );
}

int main( int argn, char ** args ){
  Uvector a;
  Uvector b;
  for ( int i = 0; i < 10; ++i ) {
    a.push_back( i );
    b.push_back( i );
    b.push_back( i );
  }
  std::cout << has_dublicates( a ) << " " << has_dublicates( b ) << "\n";
}

But it changes the original vector. Can you think of a way to do it without changing the original? Same here, I do not see how to avoid that. Best Kai-Uwe Bux Jul 22 '05 #8

 P: n/a "Siemel Naran" wrote in message news:GS*********************@bgtnsc05-news.ops.worldnet.att.net... "Casper" wrote in message news:GFeYc.73396 In a similar problem, I use a different approach, because Jon's solution is not an option for me: it consumes too much memory. I use a modified qsort, and in the comparison step, if two successive elements are the same, I return true. This is slow with small lists but fast with huge datasets (30,000+ elements, as in my application, take about 100 ms) and requires no additional memory! A good solution, I think the fastest possible without using extra space. But it changes the original vector. Can you think of a way to do it without changing the original? If time is not an issue, perhaps this will do?

bool has_duplicates(std::vector<int> const& i_Ints)
{
  for (std::vector<int>::const_iterator cit = i_Ints.begin(); cit != i_Ints.end(); )
  {
    int i = *cit++;
    // compare i against every element that follows it
    std::vector<int>::const_iterator jit = cit;
    while (jit != i_Ints.end())
      if (*jit++ == i)
        return true;
  }
  return false;
}

cheers, Conrad Weyns Jul 22 '05 #9

 P: n/a The most important thing is to keep the elements in the same positions after processing. That is, sorting-like algorithms are not allowed. "Kitty" wrote in message news:41**********@rain.i-cable.com... Hi, everyone. Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. Kitty Jul 22 '05 #10
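One way to meet this constraint while still using a sort is to sort a separate vector of indices and leave the elements where they are. The sketch below is only an illustration under assumptions: it needs C++11 (lambdas, std::iota), it costs O(N) extra memory for the indices, and the name has_duplicates_keep_order is made up here. Only operator< is required of the element type.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Leaves v untouched: sorts indices into v, then checks whether two
// neighbouring indices refer to equal elements.
template <typename T>
bool has_duplicates_keep_order(const std::vector<T>& v) {
    std::vector<std::size_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    // After sorting, v[a] <= v[b] holds for neighbours a, b,
    // so "not less" means "equal".
    return std::adjacent_find(idx.begin(), idx.end(),
                              [&v](std::size_t a, std::size_t b) {
                                  return !(v[a] < v[b]);
                              }) != idx.end();
}
```

The time is still O(N log N); what changes compared with sorting in place is that the memory goes into indices rather than a copy of the elements, which may matter when elements are large.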

 P: n/a In article <41**********@rain.i-cable.com>, "Kitty" wrote: Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. People have given great answers to the question asked, but I want to take another tack... Another way would be to not allow repeated elements in the vector in the first place. This can be done by wrapping the vector in a class that checks during insertion to make sure the element isn't already there... Jul 22 '05 #11
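A minimal sketch of such a wrapper might look like this. The class name UniqueVector and its interface are hypothetical, chosen only for illustration; the check is a plain linear search, so each insertion is O(N), and the insertion order is preserved.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: keeps insertion order and refuses repeated values.
template <typename T>
class UniqueVector {
public:
    // Appends x unless an equal element is already stored.
    // Returns true if x was appended. O(N) per call (linear search).
    bool push_back(const T& x) {
        if (std::find(items_.begin(), items_.end(), x) != items_.end()) {
            return false;
        }
        items_.push_back(x);
        return true;
    }

    std::size_t size() const { return items_.size(); }
    const T& operator[](std::size_t i) const { return items_[i]; }

private:
    std::vector<T> items_;
};
```

With this design the question "is there a repeated element?" never needs to be asked afterwards; the cost is paid up front at every insertion instead.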

 P: n/a This also sounds like a good idea, a kind of sorted vector using binary search. But then again, this comes awfully close to the functionality of a hash map, which would be the ultimate solution speed-wise: if an insert collides with an equal key already in the map, you've got yourself a duplicate. Again, maybe you could give a bit more info as to the size, content and use of the vector? /Casper Daniel T. wrote: In article <41**********@rain.i-cable.com>, "Kitty" wrote: Given a vector, what is the fastest way to find out whether there is a repeated element in it? The result is just "true" or "false". Thanks. People have given great answers to the question asked, but I want to take another tack... Another way would be to not allow repeated elements in the vector in the first place. This can be done by wrapping the vector in a class that checks during insertion to make sure the element isn't already there... Jul 22 '05 #12
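The sorted-vector-with-binary-search variant could be sketched as below. The function name insert_unique_sorted is made up for illustration, and the sketch assumes the vector is already sorted on entry. Finding the position takes O(log N) comparisons, though the insertion itself still shifts elements, so it remains O(N) per call in the worst case. Note that this keeps elements in sorted order, not in insertion order.

```cpp
#include <algorithm>
#include <vector>

// Inserts x into an already sorted vector unless an equivalent element
// is present. Returns true if x was inserted.
template <typename T>
bool insert_unique_sorted(std::vector<T>& sorted, const T& x) {
    // first position whose element is not less than x
    typename std::vector<T>::iterator pos =
        std::lower_bound(sorted.begin(), sorted.end(), x);
    if (pos != sorted.end() && !(x < *pos)) {
        return false;  // *pos is equivalent to x: reject the duplicate
    }
    sorted.insert(pos, x);
    return true;
}
```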

 P: n/a "Conrad Weyns" wrote in message If time is not an issue, perhaps this will do?

bool has_duplicates(std::vector<int> const& i_Ints)
{
  for (std::vector<int>::const_iterator cit = i_Ints.begin(); cit != i_Ints.end(); )
  {
    int i = *cit++;
    std::vector<int>::const_iterator jit = cit;
    while (jit != i_Ints.end())
      if (*jit++ == i)
        return true;
  }
  return false;
}

Very good. Remember though to handle the special case of a vector of one element. Jul 22 '05 #13

 P: n/a Kai-Uwe Bux wrote in message Siemel Naran wrote: I agree that this is probably the fastest solution using comparisons. Given that the question was for vector, there could be a linear time solution. Here is a (messy) draft of something like that. The idea is to do partition by exchange based on the even/odd distinction (note that duplicates have the same parity). So in one pass, we put all the even elements to the left of all the odd elements. During the same pass, we shift every element down by one bit. Now, we return true if there is a duplicate in the left segment or in the right segment. Farther down in the recursion, we can sometimes return true just based on the length of the segment: if we have shifted so many times that all values are in [0..15], then any segment of length 17 is bound to contain duplicates. The run time for the worst case should be O( N * bit_length ). This sounds reasonable, though I've not looked at the details carefully. But with bit_length typically equal to 32 or 64, would it be too slow?
But it changes the original vector. Can you think of a way to do it without changing the original? Same here, I do not see how to avoid that. You can go for the O(N^2) algorithm, as in the other post. Jul 22 '05 #14

 P: n/a Siemel Naran wrote: Kai-Uwe Bux wrote in message Siemel Naran wrote: I agree that this is probably the fastest solution using comparisons. Given that the question was for vector, there could be a linear time solution. Here is a (messy) draft of something like that. The idea is to do partition by exchange based on the even/odd distinction (note that duplicates have the same parity). So in one pass, we put all the even elements to the left of all the odd elements. During the same pass, we shift every element down by one bit. Now, we return true if there is a duplicate in the left segment or in the right segment. Farther down in the recursion, we can sometimes return true just based on the length of the segment: if we have shifted so many times that all values are in [0..15], then any segment of length 17 is bound to contain duplicates. The run time for the worst case should be O( N * bit_length ). This sounds reasonable, though I've not looked at the details carefully. But with bit_length typically equal to 32 or 64, would it be too slow?

Well, uhm, yes. I thought of that too: in order for O( N * bit_length ) to be faster than O( N * log(N) ), one would need log(N) > bit_length, that is, N > MAX_INT. That is usually a very unreasonable assumption. However, the difference from a suitably modified qsort is that this gives you a good *worst case*, which is still quadratic for qsort. Nonetheless, I found empirically that

template < typename ConstIter >
bool has_dublicates ( ConstIter from, ConstIter to ) {
  typedef typename std::iterator_traits< ConstIter >::value_type value_type;
  typedef typename std::vector< value_type > vector_type;
  vector_type v ( from, to );
  std::sort( v.begin(), v.end() );
  return( std::adjacent_find( v.begin(), v.end() ) != v.end() );
}

is strictly faster than the even/odd partitioning scheme; and I would expect that in many implementations of std::sort(), quicksort has been replaced by introsort or something similar, which would guarantee an O( N * log(N) ) worst case, too.
Moreover, there is one consideration that should be taken most seriously: what will typically be the case, duplicates or not? If the algorithm will most often encounter duplicate-free sequences, then it makes sense to optimize for proving that no duplicates occur. If the sequence will typically have duplicates, the algorithm should try to find those as quickly as possible. I realized that the even/odd partitioning scheme sucks in this regard because duplicates are not found until towards the end. This is where the real strength of std::set<> based solutions lies:

template < typename ConstIter >
bool has_dublicates_a ( ConstIter from, ConstIter to ) {
  typedef typename std::iterator_traits< ConstIter >::value_type value_type;
  typedef typename std::set< value_type > set_type;
  typedef typename set_type::size_type size_type;
  size_type i = 0;
  set_type s;
  for ( ConstIter iter = from; iter != to; ++iter ) {
    ++ i;
    s.insert( *iter );
    // this test could also be done after the loop:
    if ( s.size() != i ) { return( true ); }
  }
  return( false );
}

This flags duplicates early on. However, it is about 10 times slower in proving that there are no duplicates. Maybe one could speed up the process by using an array of 256 sets, one for each possible value of the low-order byte. Best Kai-Uwe Bux Jul 22 '05 #15
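The "256 sets" idea could be sketched as follows. This is only an illustration of the bucketing, assuming unsigned int elements and C++11 for the range-based loop; the name has_duplicates_bucketed is made up here, and whether it actually beats a single std::set is untested (the post above only speculates that it might).

```cpp
#include <set>
#include <vector>

// One std::set per possible value of the low-order byte. Equal values
// always share a low byte, so they always land in the same bucket.
bool has_duplicates_bucketed(const std::vector<unsigned int>& v) {
    std::vector< std::set<unsigned int> > buckets(256);
    for (unsigned int x : v) {
        // .second is false when x was already in its bucket
        if (!buckets[x & 0xFFu].insert(x).second) {
            return true;
        }
    }
    return false;
}
```

Each individual set stays smaller than one big set would, so lookups walk shallower trees; the early exit on the first repeat is kept.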

 P: n/a "Siemel Naran" wrote in message news:3d**************************@posting.google.com... "Conrad Weyns" wrote in message If time is not an issue, perhaps this will do?

bool has_duplicates(std::vector<int> const& i_Ints)
{
  for (std::vector<int>::const_iterator cit = i_Ints.begin(); cit != i_Ints.end(); )
  {
    int i = *cit++;
    std::vector<int>::const_iterator jit = cit;
    while (jit != i_Ints.end())
      if (*jit++ == i)
        return true;
  }
  return false;
}

Very good. Remember though to handle the special case of a vector of one element. It's already done: jit == i_Ints.end() when size() == 1, so the inner loop never runs. I'll admit though, I never thought of checking it, so it's mere luck... :-) /Conrad Jul 22 '05 #16