Algorithm to find nth largest or nth smallest in a range

Code4u

I need to write an algorithm that sheds the outliers in a large data
set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

TIA

Jul 23 '05 #1

Subscribe Post Reply

6453

Ian

Code4u wrote:

I need to write an algorithm that sheds the outliers in a large data
set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

I don't see how you could do this without two passes.

You have to make one pass to determine whatever statistics you require
from the data.

Is the set that large and the overhead that great that you cant use
nth_element?

Ian

Jul 23 '05 #2

Paul Bilnoski

Code4u wrote:

I need to write an algorithm that sheds the outliers in a large data
set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

TIA

Because you have to trim a portion of the data based on the entire data
set, as in the smallest 2%, you will have to have the data sorted (or
partially sorted) to know what elements lie out of range.
I think the algorithm you mention will work in linear time using a
version of quicksort's partitioning algorithm, but it works by
recursively shuffling elements around to find the nth smallest.

If you know where the 2% mark is before you start looking at the data,
then a linear search - single pass - ignoring elements out of range will
work.

Otherwise, you will have to do one pass to find the bounds and another
to find the right element. This is still linear with respect to the
input data size, complexity in the average case is 1.5 * N

--Paul

Jul 23 '05 #3

> I need to write an algorithm that sheds the outliers in a large data

set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

You haven't described your data but if you can use a histogram, use it.
It's one pass on the data and then another on the histogram while
keeping a running total to get the index you want. Otherwise:

size_t buf[2*n]
buf[0..n-1] = 0..n-1
sort buf[0..n-1]
for (size_t i = n; i < whatever; i += n) {
buf[n..2n-1] = i..2i-1
sort buf[n..2n-1]
inplace_merge buf, buf+n, buf+2n
}
index = buf[n]
(this obviously assumes the data set is of some positive length m*n but
it's trivial to fix it so it does the proper bounds checks)

buf is an index buffer, so the sort/inplace_merge should do an
indirection on it when doing the comparison. You don't even need to use
an index buffer if the values are cheap to copy as opposed to doing an
indirection. Not to mention the much better locality of reference
copying the values has.

Jul 23 '05 #4

Stephen Howe

>I need to write an algorithm that sheds the outliers in a large data

set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

There is an algorithm that does this.
It is much more expensive than nth_element as it must make multiple passes
and do some book-keeping to keep track of the element you want because of
the requirement of "remain" in sorted order. It zooms into the kth element
in multiple passes.

But if I was you, I would look into having an extra element in your data set
called (intrusive)

int originalOrder;

When you want to eliminate the smallest 2%

1. Iterate over your vector numbering elements originalOrder as 0 to
size() - 1.
2. Do nth_element and use erase(remove() ) to get rid of the bottom 2%
3. Apply sort() with 3 parameter ordering by originalOrder.

You now have your original order and eliminated smallest 2%.

Stephen Howe

Jul 23 '05 #5

by: Keke922 | last post by:

I have to write a program that allows the user to enter a series of integers and -99 when they want to exit the loop. How do I display the largest and smallest number the user entered?

C / C++

Is there an STL algorithm for this?

by: Mark P | last post by:

I'm working with a std::multiset and I want to efficiently select a (continuous) range of elements in the set. What I know is that the desired range is exactly those elements which compare equal...

C / C++

Finding next largest/smallest floating point value

by: Peter Ammon | last post by:

I have a floating point number. I'd like to get the nearest floating point number that is larger or smaller than the given number. I investigated FLT_EPSILON but it only seems to be useful if the...

C / C++

Generic heap sort algorithm A

by: JosAH | last post by:

Greetings, I was asked to write a Tip Of the Week; so here goes: a lot of topics are started here in this forum (and a lot of other forums too) mentioning a problem about sorting data. ...

Java

volume/dimension algorithm question

by: sklett | last post by:

I realize this could be a little off-topic, but there are some great minds on this NG and I hope you can let me slide this time ;0) I'm designing our system to manage what products can fit in...

C# / C Sharp

best fit algorithm

by: kamsmartx | last post by:

I'm new to programming and need some help figuring out an algorithm. I need to design some kind of algorithm which will help us define capacity for one of our portfolios....here's the problem...

Java

Finding smallest and largest string

by: crystal2005 | last post by:

I am writing a program that receive maximum of 25 line of string each has 20 characters maximum. The program will print the smallest and the largest string. However the following program gives me...

C / C++

Quadratic Sorting Algorithm Program

by: slapsh0t11 | last post by:

Hello! I need help with a program that I believe I am nearly done with. However, there seems to be a few details that preclude me from success. Here is my assignment: Here is my class file...

Java

Dijkstra's Algorithm for finding the shortest route

by: Glenton | last post by:

Hi All Here is a very simple little class for finding a shortest route on a network, following Dijkstra's Algorithm: #!/usr/bin/env python #This is meant to solve a maze with Dijkstra's...

Python

Access Europe: Command bars, the Access Shortcut Tool and a simple Audit Log - Wed 3 April

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...

General

One-click Importing Excel Data into a*Database

by: ryjfgjl | last post by:

In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...

Microsoft Excel

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

Algorithm to find nth largest or nth smallest in a range

Similar topics