> I need to write an algorithm that sheds the outliers in a large data
> set; for example, I might want to ignore the smallest 2% of values and
> find the next smallest. Boost has an nth_element algorithm; however, it
> partially sorts the data. I have a requirement that the data remain in
> the original order. Of course I could make a copy of the vector or
> array and then apply nth_element, but this would be expensive in
> memory and would require another pass through the data. Does anyone
> know of a more efficient algorithm that meets my requirements?
You haven't described your data, but if a histogram is feasible, use it:
one pass over the data to build the counts, then one pass over the
histogram with a running total until you reach the rank you want.
Otherwise:
size_t buf[2*n]                      // n = rank you want, i.e. outliers to skip + 1
buf[0..n-1] = 0..n-1
sort buf[0..n-1]
for (size_t i = n; i < length; i += n) {
    buf[n..2n-1] = i..i+n-1
    sort buf[n..2n-1]
    inplace_merge buf, buf+n, buf+2n // buf[0..n-1] now holds the n smallest so far
}
index = buf[n-1]
(this obviously assumes the data set has some positive length m*n, but
it's trivial to add the proper bounds checks)
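A sketch of the same idea in real C++ (the function name and value type
are illustrative; here n is the rank of the value you want, i.e. one more
than the number of outliers to discard, and the length is assumed to be a
positive multiple of n):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the index of the n-th smallest value without reordering data.
// Assumes data.size() is a positive multiple of n.
std::size_t index_of_nth_smallest(const std::vector<double>& data, std::size_t n) {
    std::vector<std::size_t> buf(2 * n);
    auto by_value = [&](std::size_t a, std::size_t b) { return data[a] < data[b]; };
    std::iota(buf.begin(), buf.begin() + n, 0);        // buf[0..n-1] = 0..n-1
    std::sort(buf.begin(), buf.begin() + n, by_value);
    for (std::size_t i = n; i < data.size(); i += n) {
        std::iota(buf.begin() + n, buf.end(), i);      // next chunk of n indices
        std::sort(buf.begin() + n, buf.end(), by_value);
        std::inplace_merge(buf.begin(), buf.begin() + n, buf.end(), by_value);
    }
    return buf[n - 1];  // buf[0..n-1] always holds the n smallest seen so far
}
```

Only the index buffer ever moves; the data itself is read through the
comparator and never touched.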
buf is an index buffer, so the sort/inplace_merge comparisons should
indirect through it into the data. If the values are cheap to copy, you
don't even need an index buffer: sort and merge copies of the values
themselves, which also gives much better locality of reference.
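For completeness, the histogram route mentioned at the top could look
like this (a sketch under the assumption that the values are non-negative
ints below some known bound hi; the name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Value at a given zero-based rank: one pass to build the histogram,
// one pass over the counts with a running total. The data never moves.
int value_at_rank(const std::vector<int>& data, std::size_t rank, int hi) {
    std::vector<std::size_t> counts(hi, 0);
    for (int v : data) ++counts[v];      // pass 1: count each value
    std::size_t seen = 0;
    for (int v = 0; v < hi; ++v) {       // pass 2: walk the histogram
        seen += counts[v];
        if (seen > rank) return v;
    }
    return hi;                           // rank out of range
}
```

To skip the smallest k values and get the next one, ask for rank k.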