Why is std::sort ~20% faster than introsort coded by hand?

Hello C++ Gurus!

I'm comparing sorting algorithms for study purposes. I've compared STL
std::sort and a hand-coded introsort on arrays of millions (tens of
millions) of integers. They were tested on random, sorted, and
reverse-sorted arrays. In all tests on such huge arrays std::sort is
faster than my version by ~1/4, so it seems there is some call
optimization.

My 'hand-coded' version is based on D. Musser's suggestions and on an
STL header I found in different STL installations (GCC and MSVS2008).
Maximum optimization flags were used in both cases, for g++ and cl, but
the 'hand-coded' version is still slower.

Why? Maybe the production std::sort uses some SIMD or assembler
optimization internally?

Thanks in advance.
Alex.

Pseudocode (C++, there may be bugs) of the introsort version:

// CODE

template <typename T>
inline void swap(T *a, T *b)
{
    T temp = *a;
    *a = *b;
    *b = temp;
}

template <typename T>
inline void cmpswap(T *a, T *b)   // swap so that *a <= *b
{
    if (*a > *b) {
        T temp = *a;
        *a = *b;
        *b = temp;
    }
}

template <typename T>
inline T *partition(T *first, T *last)   // *(last-1) is the pivot (placed by median3)
{
    T *v = last-1;
    first--;

    do {
        while (*(++first) < *v);                          // scan from the left
        while (*last-- < *v) if (first == last) break;    // scan from the right

        if (first >= last) break;
        swap(first, last);
    } while (true);

    swap(v, first);
    return ++first;
}

template <typename T>
inline void median3(T *first, T *last)   // median-of-three: pivot ends up at last-2
{
    T *r = last-1, *rr = last-2;
    swap(first + (last - first)/2, rr);
    cmpswap(first, rr);
    cmpswap(first, r);
    cmpswap(rr, r);
}

template <typename T>
inline void _introsort(T *first, T *last, int recurs_depth)
{
    while (last - first > M) {   // M: cutoff below which the range is left for insertion sort
        if (recurs_depth--) return hsort(first, last);   // fall back to heap sort

        median3(first, last);
        T *k = partition(first+1, last-1);
        introsort(k, last, recurs_depth);
        last = k;
    }
}

template <typename T>
void introsort(T *first, T *last, int recurs_depth)
{
    _introsort(first, last, recurs_depth);
    insertion(first, last);   // final insertion-sort pass
}

// CODE END
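(For reference: the recursion-depth limit in Musser's paper is roughly
2*log2(n). A minimal driver for the pseudocode above might look like the
sketch below; ilog2 and sort_array are made-up names, and hsort, insertion
and M are the heap sort, insertion sort and cutoff assumed above.)

// Hypothetical driver, not part of the original code.
#include <cstddef>

inline int ilog2(std::size_t n)      // floor(log2(n)) for n >= 1
{
    int k = 0;
    while (n >>= 1) ++k;
    return k;
}

template <typename T>
void sort_array(T *a, std::size_t n)
{
    if (n < 2) return;
    introsort(a, a + n, 2 * ilog2(n));   // Musser's suggested depth limit
}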
Sep 14 '08 #1
ikarus wrote:
Why?
Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.
That accounts for approximately that 20% speed increase.

I don't know exactly how small the partition must be before it skips
subpartitioning it, but according to my own tests something between 15
and 25 elements seems optimal (although it depends on the element type).

If you add this additional step to your sorting algorithm you will
most probably see the ~20% speed increase.
Sep 14 '08 #2
Juha Nieminen wrote:
Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.
Oh, and by the way: When I say insertion sort, I mean insertion sort.
I do *not* mean bubble sort, selection sort or any other type of sort.
I *do* mean insertion sort. Nothing else.

(Yes, I have seen tons of misconceptions about this subject. Insertion
sort is the fastest comparison-based sorting algorithm for very small
arrays. Don't even bother trying anything else.)
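For reference, this is the algorithm meant, written out as a plain sketch
over a pointer range (reference code only, not taken from any library):

// Plain insertion sort: each element is shifted left until it meets an
// element that is not greater.
template <typename T>
void insertion_sort(T *first, T *last)
{
    for (T *i = first + 1; i < last; ++i) {
        T v = *i;                               // element being inserted
        T *j = i;
        while (j > first && v < *(j - 1)) {     // shift larger elements right
            *j = *(j - 1);
            --j;
        }
        *j = v;                                 // drop it into its slot
    }
}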
Sep 14 '08 #3
In article <aae2f3e6-16c1-4e60-a214-
90**********@c58g2000hsc.googlegroups.com>, Ka*************@gmail.com
says...
Hello C++ Gurus!

I'm comparing sorting algorithms for study purposes. I've compared STL
std::sort and a hand-coded introsort on arrays of millions (tens of
millions) of integers. They were tested on random, sorted, and
reverse-sorted arrays. In all tests on such huge arrays std::sort is
faster than my version by ~1/4, so it seems there is some call
optimization.
While it's impossible to say for _sure_ without even knowing which exact
library implementation you're looking at, the obvious guess would be
that the library is using a "modified introsort", which is essentially
the same as a modified quicksort, but (obviously enough) with the extra
book-keeping and switching of an introsort added.

The idea of a modified quicksort is pretty simple: below some number of
items, the "simple" sorts (insertion sort, selection sort, bubble sort)
are generally faster than a quicksort. Therefore, when you reach that
threshold, you're better off switching to something else (and the
threshold is rarely very critical).

A further optimization is due to Robert Sedgewick: he noted that you
still have a fair amount of overhead repeatedly calling the simple sort
on all those small partitions. He also noted that with an insertion sort
(unlike anything else) the speed doesn't really depend on the number of
items being sorted, but strictly upon the distances they need to be
moved during sorting -- therefore, rather than calling it once every
time a partition reaches a certain size threshold, you just _stop_ doing
more sorting when the size reaches the threshold, and then when all the
partitions have been reduced to that size, you call an insertion sort
_once_ for the entire collection. Even though you're using an insertion
sort on a large collection, its speed is proportional to the sizes of the
partitions (when you stopped partitioning) rather than to the size of
the entire collection.
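
In code, the idea looks roughly like the sketch below. This is an
illustration only, not any particular library's internals; THRESHOLD and
the function names are made up:

#include <algorithm>   // std::swap

const int THRESHOLD = 16;   // typical small-partition cutoff

// Quicksort that simply stops partitioning below THRESHOLD.
template <typename T>
void quicksort_above_threshold(T *first, T *last)
{
    while (last - first > THRESHOLD) {
        T pivot = *(first + (last - first) / 2);
        T *i = first, *j = last - 1;
        while (i <= j) {                          // simple Hoare-style partition
            while (*i < pivot) ++i;
            while (pivot < *j) --j;
            if (i <= j) { std::swap(*i, *j); ++i; --j; }
        }
        quicksort_above_threshold(i, last);       // recurse on one side
        last = j + 1;                             // loop on the other side
    }
    // small partitions are deliberately left unsorted here
}

template <typename T>
void modified_quicksort(T *first, T *last)
{
    quicksort_above_threshold(first, last);
    // every element is now at most THRESHOLD-1 positions from its final
    // place, so a single insertion sort over the whole range is cheap
    for (T *i = first + 1; i < last; ++i) {
        T v = *i;
        T *j = i;
        while (j > first && v < *(j - 1)) { *j = *(j - 1); --j; }
        *j = v;
    }
}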

The difference this makes will depend _heavily_ upon the speed of
function calls. For example, it makes a MUCH larger difference on an x86
than it does on a SPARC. Based on the 20% difference you noted, it's a
fair guess that you were probably testing on an x86 or something very
similar. More importantly, that overhead also has an effect on the point
at which you should stop partitioning -- there's no one size that's
right for all architectures (though, as I mentioned above, the threshold
isn't usually critical).

--
Later,
Jerry.

The universe is a figment of its own imagination.
Sep 14 '08 #4
Juha Nieminen wrote:
ikarus wrote:
>Why?

Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.
That accounts for approximately that 20% speed increase.

I don't know exactly how small the partition must be before it skips
subpartitioning it, but according to my own tests something between 15
and 25 elements seems optimal (although it depends on the element type).

If you add this additional step to your sorting algorithm you will
most probably see the ~20% speed increase.
I think he already has that. In the code, the threshold would be M:

template <typename T>
inline void _introsort(T *first, T *last, int recurs_depth)
{
    while (last - first > M) {
        ...
    }
}

template <typename T>
void introsort(T *first, T *last, int recurs_depth)
{
    _introsort(first, last, recurs_depth);
    insertion(first, last);
}
Best

Kai-Uwe Bux
Sep 14 '08 #5
ikarus wrote:
[... original post and most of the quoted code snipped ...]
template <typename T>
inline void _introsort(T *first, T *last, int recurs_depth)
{
while(last - first > M) {
if (recurs_depth--) return hsort(first, last);
Is this if() right? As written, the test succeeds whenever recurs_depth is
non-zero, so you fall back to heap sort almost immediately instead of only
when the recursion depth is exhausted.
median3(first, last);
T *k = partition(first+1, last-1);
introsort(k,last, recurs_depth);
last = k;
}
}

template <typename T>
void introsort(T *first, T *last, int recurs_depth)
{
_introsort(first, last, recurs_depth);
insertion(first, last);
}

// CODE END

Best

Kai-Uwe Bux
Sep 14 '08 #6
On Sep 14, 4:25 pm, Jerry Coffin <jcof...@taeus.com> wrote:
[... quoted text snipped ...]
Two other possibilities to consider:

-- Except for random data, quicksort is very sensitive to how
the pivot is chosen. If he's constantly seeing roughly the
same difference (including with already sorted arrays), then
this probably isn't the case, but it can certainly explain
significant differences in some specific cases.

-- Using the modified quicksort, it's possible to gain a little
time once you go to do the final insertion sort by first
searching for the smallest value (which can only be in the
first partition, so you don't have to search far), and
putting it in place first. Having done that, you can drop
the test for having reached the beginning of the array in
the loop of the insertion sort, since you're guaranteed to
find a value which precedes the one you're inserting before
reaching it.
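
A sketch of that second trick, under the assumption that partitioning
stopped at some THRESHOLD (the names here are made up for illustration):
the global minimum must sit in the leftmost partition, so after swapping
it to the front the inner insertion-sort loop no longer needs the
"reached the beginning" test.

#include <algorithm>   // std::min_element, std::swap
#include <cstddef>

const int THRESHOLD = 16;    // same cutoff used when partitioning stopped

template <typename T>
void insertion_sort_with_sentinel(T *first, T *last)
{
    if (last - first < 2) return;
    std::ptrdiff_t n = last - first;
    T *search_end = first + (n < THRESHOLD ? n : THRESHOLD);
    std::swap(*first, *std::min_element(first, search_end));   // sentinel in place

    for (T *i = first + 1; i < last; ++i) {
        T v = *i;
        T *j = i;
        while (v < *(j - 1)) { *j = *(j - 1); --j; }   // no j > first test needed
        *j = v;
    }
}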

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Sep 14 '08 #7

Thanks to everybody for the suggestions. But as I mentioned, I'm
already using most of the suggested improvements, except for searching
for the minimal element only in the leftmost small 'subproblem' and
setting it up as a 'sentinel' for the insertion sort used in the last
stage. I have already tried to play with the M coefficient, i.e. the
array size below which insertion sort comes into play. It doesn't give
any notable improvement; std::sort is still faster by about 4 times on
average (not 20%, sorry for misleading you) on different sets of data.

The reference implementations of introsort were one from D. Musser's
papers and another from the STL <algorithm> headers of the MSVS 2008
C++ implementation, which looks to me like plain introsort. But I'm not
sure that it is the one actually used at compile time; maybe it is only
there for compatibility and some other internal code is used (I'll
check this).
The difference this makes will depend _heavily_ upon the speed of
function calls. For example, it makes a MUCH larger difference on an x86
than it does on a SPARC. Based on the 20% difference you noted, it's a
fair guess that you were probably testing on an x86 or something very
similar.
Yes, I agree, it seems to me too that the problem is some 'call'
overhead, but I'm using templates with the inline keyword and all the
needed optimization flags (/Ox for MS's cl, -O3 for mingw g++). Also I
looked at the generated assembler, and yes, most (all) of the used
functions are inlined.

I can't see any assembler inlines or similar optimizations in the STL
implementation which I'm using as reference, but it is faster by 4
times.

So how can we avoid this overhead using only C++ language constructs?
Sep 15 '08 #8
Jerry Coffin wrote:
A further optimization is due to Robert Sedgewick: he noted that
you still have a fair amount of overhead repeatedly calling the
simple sort on all those small partitions.
Do you have a reference handy?
Thanks,
Martin

--
Quidquid latine scriptum est, altum videtur.
Sep 15 '08 #9

Why not just read the sources to the GNU
std::sort() and see what they're doing
differently to you?

Sean
Sep 15 '08 #10
In article <6j************@mid.uni-berlin.de>, ma**************@udo.edu
says...
Jerry Coffin wrote:
A further optimization is due to Robert Sedgewick: he noted that
you still have a fair amount of overhead repeatedly calling the
simple sort on all those small partitions.

Do you have a reference handy?
Sorry, but no -- I don't have _any_ references handy right now (I just
sold my house, and am in the process of moving).

--
Later,
Jerry.

The universe is a figment of its own imagination.
Sep 16 '08 #11
