A Sort Optimization Technique: decorate-sort-dedecorate

xahlee

Last year, I posted a tutorial and commentary about Python's and
Perl's sort functions. (http://xahlee.org/perl-python/sort_list.html)

In that article, I discussed a technique known among juvenile Perlers
as the Schwartzian Transform, which also shows up in Python as the
optional “key” parameter of its sort.

Here, I give a more detailed account of the why and how of this construct.

----

Language and Sort Optimization: decorate-sort-dedecorate

There are many algorithms for sorting. Each language can choose which to
use. See Wikipedia for a detailed list and explanation:
Sorting_algorithm.

However, no matter which algorithm is used, ultimately it will
need the order-deciding function on the items to be sorted. Suppose
your items are (a,b,c,d,...), and your order-deciding function is F.
Various algorithms will try to minimize the number of times F is
called, but nevertheless, F will be applied to a particular element in
the list multiple times. For example, F(a,b) may be called to see which
of “a” or “b” comes first. Then, later, the algorithm might need
to call F(m,a), or F(a,z). The point here is that F will be called
many times on arbitrary pairs of items in your list, even if one of the
elements has been compared to others before.

Now suppose you are sorting some polygons in 3D space, or personal
records by the distance of the person's address from a location, or
matrices by their eigenvalues in some math application, or
files by the number of occurrences of some text in the file.

In general, when you define your decision function F(x,y), you will
need to extract some property from the elements to be sorted. For
example, when sorting points in space by distance, one
will need to compute the distance for each point. When sorting personal
records from a database by the person's location, the decision function
will need to retrieve the person's address from the database, then find
the coordinates of that address, then compute the distance from there to
a given coordinate. When sorting matrices by eigenvalues, the
order-decision function will first compute the eigenvalues of the matrix. A
common theme in all of the above is that they need to do some
non-trivial computation on each element.

As we can see, the order-decision function F may need to do some
expensive computation on each element first, and this is almost always
the case when sorting elements other than simple numbers. Also, we know
that a sorting algorithm will need to call F(x,y) many times, even if
one of x or y has been compared to others before. So, this can be
highly inefficient. For example, say you need to order people by the
distance from their location to your home. When F(Mary,Jane) is called,
Mary's address is first retrieved from a database across a network, then
the coordinates of her address are looked up in another database. Then the
distance to your home is computed using spherical geometry. The exact same
thing is done for Jane. But later on, the sort may call F(Mary,Chrissy),
F(Mary,Jonesy), F(Mary,Pansy) and so on, and the entire record
retrieval for Mary is repeated many times.
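
The repeated work described above can be made visible with a small counting experiment. This is only a sketch: the `DISTANCE_KM` table and the call counter are made up for illustration and stand in for the database lookups and geometry, and `functools.cmp_to_key` is used to plug a two-argument F into Python 3's sort.

```python
from functools import cmp_to_key

# Hypothetical pre-made distances; a stand-in for the slow lookups.
DISTANCE_KM = {"Mary": 12.5, "Jane": 3.0, "Chrissy": 40.2,
               "Jonesy": 7.7, "Pansy": 21.1}

calls = {}  # how many times the extraction ran for each person


def gist(person):
    """Pretend-expensive extraction: count the call, return the distance."""
    calls[person] = calls.get(person, 0) + 1
    return DISTANCE_KM[person]


def F(x, y):
    """Order-deciding function: negative if x comes first, positive if y."""
    gx, gy = gist(x), gist(y)  # recomputed on every single comparison
    return (gx > gy) - (gx < gy)


people = ["Mary", "Jane", "Chrissy", "Jonesy", "Pansy"]
by_distance = sorted(people, key=cmp_to_key(F))

print(by_distance)
print(calls)  # each comparison costs two extractions
```

Even for five people, the total number of extractions exceeds the number of people, because every comparison triggers two of them.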

One solution is to do the expensive extraction one time for each
element, then associate the result with the corresponding element. Suppose
this expensive extraction function is called gist(). You create a
new list ([Mary,gist(Mary)], [Jane,gist(Jane)], [John,gist(John)],
[Jenny,gist(Jenny)], ...) and sort this list instead. When done, remove
the associated gist. This technique is sometimes called
decorate-sort-dedecorate.
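
Written out by hand in Python, the three steps might look like the sketch below. The `DISTANCE_KM` table is made up for illustration. Note two deviations from the list shown above, both assumptions of this sketch: the gist is placed first in each tuple so that plain tuple comparison sorts by it, and the original index is added as a tie-breaker so the elements themselves never get compared.

```python
# Hypothetical data standing in for the expensive lookup.
DISTANCE_KM = {"Mary": 12.5, "Jane": 3.0, "John": 40.2, "Jenny": 7.7}


def gist(name):
    """Stand-in for the expensive extraction."""
    return DISTANCE_KM[name]


people = ["Mary", "Jane", "John", "Jenny"]

# Decorate: compute each gist exactly once and attach it to its element.
decorated = [(gist(p), i, p) for i, p in enumerate(people)]

# Sort: tuples compare by gist first, then by original position.
decorated.sort()

# Dedecorate: throw the gists away and keep the elements.
result = [p for _, _, p in decorated]

print(result)  # ['Jane', 'Jenny', 'Mary', 'John']
```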

In Perl programming, this decorate-sort-dedecorate technique is sillily
known as the Schwartzian Transform, as we have demonstrated previously. In
Python, they tried to incorporate this technique into the language by
adding the optional “key” parameter, which is our gist() function.
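
A short sketch of the “key” parameter in use. The data and the call counter are made up for illustration; the point is that Python computes the key once per element and reuses it for the whole sort.

```python
# Hypothetical data standing in for the expensive lookup.
DISTANCE_KM = {"Mary": 12.5, "Jane": 3.0, "John": 40.2, "Jenny": 7.7}

gist_calls = 0


def gist(name):
    """Stand-in for the expensive extraction; counts its own calls."""
    global gist_calls
    gist_calls += 1
    return DISTANCE_KM[name]


people = ["Mary", "Jane", "John", "Jenny"]

# The sort decorates, sorts and dedecorates internally.
result = sorted(people, key=gist)

print(result)      # ['Jane', 'Jenny', 'Mary', 'John']
print(gist_calls)  # 4, one call per element
```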

----------
This post is archived at:
http://xahlee.org/perl-python/sort_list.html

I would be interested in comments about how Common Lisp, Scheme, and
Haskell deal with the decorate-sort-dedecorate technique. In
particular, does it manifest in the language itself? If not, how does
one usually do it in code? (Note: please reply to the appropriate groups
if it is of no general interest. Thanks.) (I am also interested in how
Perl6 or Python3000 does this, if there are major changes to their sort
functions.)

Thanks.

Xah
xa*@xahlee.org
∑ http://xahlee.org/

Aug 27 '06 #1


Tom Cole

Well you cross-posted this enough, including a Java group, and didn't
even ask about us... What a pity.

In Java, classes can implement the Comparable interface. This interface
contains only one method, compareTo(Object o), and it is
defined to return a value < 0 if the object is considered less than the
one being passed as an argument, a value > 0 if it is considered
greater, and 0 if they are considered equal.

The object implementing this interface can use any of the variables
available to it (e.g. address, zip code, longitude, latitude, first
name, whatever) to return this -1, 0 or 1. This is slightly different
from what you mention, as we don't have to "decorate" the object. These
are all variables that already exist in the object, and in fact make it
what it is. So, of course, there is no need to un-decorate at the end.

There are several built-in objects and methods available to sort
Objects that are Comparable, even full Arrays of them.

Aug 27 '06 #2

William James

xa****@gmail.com wrote:

I would be interested in comments about how Common Lisp, Scheme, and
Haskell deal with the decorate-sort-dedecorate technique.

%w(FORTRAN LISP COBOL).sort_by{|s| s.reverse}
==>["COBOL", "FORTRAN", "LISP"]
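
For comparison, the same sort in Python, as a minimal sketch: `sorted` with `key=` plays the role of Ruby's `sort_by`, and the slice `s[::-1]` reverses each string.

```python
words = ["FORTRAN", "LISP", "COBOL"]

# Sort by the reversed spelling; the key is computed once per word.
result = sorted(words, key=lambda s: s[::-1])

print(result)  # ['COBOL', 'FORTRAN', 'LISP']
```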

--
Common Lisp did kill Lisp. Period. ... It is to Lisp what
C++ is to C. A monstrosity that totally ignores the basics
of language design, simplicity, and orthogonality to begin
with. --- Bernard Lang

Aug 28 '06 #3

Marc 'BlackJack' Rintsch

In <11*********************@m79g2000cwm.googlegroups. com>, Tom Cole wrote:

In Java, classes can implement the Comparable interface. This interface
contains only one method, a compareTo(Object o) method, and it is
defined to return a value < 0 if the Object is considered less than the
one being passed as an argument, it returns a value > 0 if considered
greater than, and 0 if they are considered equal.

The object implementing this interface can use any of the variables
available to it (AKA address, zip code, longitude, latitude, first
name, whatever) to return this -1, 0 or 1. This is slightly different
than what you mention as we don't have to "decorate" the object. These
are all variables that already exist in the Object, and if fact make it
what it is. So, of course, there is no need to un-decorate at the end.

Python has such a mechanism too: the special `__cmp__()` method
has basically the same signature. The problem the decorate, sort,
un-decorate pattern solves is that these object-specific compare operations
use only *one* criterion.

Let's say you have a `Person` object with name, surname, date of birth and
so on. When you have a list of such objects and want to sort them by name
or by date of birth you can't use the `compareTo()` method for both.
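
A sketch of the several-criteria case, written for current Python 3 (where `__cmp__()` is no longer used). The `Person` class and its sample records are made up for illustration; the ordering criterion is chosen per sort call through `key=`, not fixed inside the class.

```python
from dataclasses import dataclass
from datetime import date
from operator import attrgetter


@dataclass
class Person:
    name: str
    surname: str
    born: date


# Sample records, for illustration only.
people = [
    Person("Ada", "Lovelace", date(1815, 12, 10)),
    Person("Alan", "Turing", date(1912, 6, 23)),
    Person("Grace", "Hopper", date(1906, 12, 9)),
]

# Same list, two different orderings, no compare method on the class.
by_surname = sorted(people, key=attrgetter("surname"))
by_birth = sorted(people, key=attrgetter("born"))

print([p.surname for p in by_surname])  # ['Hopper', 'Lovelace', 'Turing']
print([p.name for p in by_birth])       # ['Ada', 'Grace', 'Alan']
```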

Ciao,
Marc 'BlackJack' Rintsch

Aug 28 '06 #4

Jim Gibson

In article <pa****************************@gmx.net>, Marc 'BlackJack'
Rintsch <bj****@gmx.net> wrote:

In <11*********************@m79g2000cwm.googlegroups. com>, Tom Cole wrote:

In Java, classes can implement the Comparable interface. This interface
contains only one method, a compareTo(Object o) method, and it is
defined to return a value < 0 if the Object is considered less than the
one being passed as an argument, it returns a value > 0 if considered
greater than, and 0 if they are considered equal.

The object implementing this interface can use any of the variables
available to it (AKA address, zip code, longitude, latitude, first
name, whatever) to return this -1, 0 or 1. This is slightly different
than what you mention as we don't have to "decorate" the object. These
are all variables that already exist in the Object, and if fact make it
what it is. So, of course, there is no need to un-decorate at the end.

Python has such a mechanism too, the special `__cmp__()` method
has basically the same signature. The problem the decorate, sort,
un-decorate pattern solves is that this object specific compare operations
only use *one* criteria.

I can't believe I am getting drawn into a thread started by xahlee, but
here goes anyway:

The problem addressed by what is known in Perl as the 'Schwartzian
Transform' is that the compare operation can be an expensive one,
regardless of whether the comparison uses multiple keys. Since in
comparison sorts the compare operation will be executed N(logN) times,
it is more efficient to pre-compute a set of keys, one for each object
to be sorted. That need be done only N times. The sort can then use
these pre-computed keys to sort the objects. See, for example:

http://en.wikipedia.org/wiki/Schwartzian_transform

--
Jim Gibson


Aug 28 '06 #5

Dr.Ruud

Jim Gibson wrote:

The problem addressed by what is know in Perl as the 'Schwartzian
Transform' is that the compare operation can be an expensive one,
regardless of the whether the comparison uses multiple keys. Since in
comparison sorts, the compare operation will be executed N(logN)
times, it is more efficient to pre-compute a set of keys, one for
each object to be sorted. That need be done only N times. The sort
can then use these pre-computed keys to sort the objects.

Basically it first builds, then sorts, an index.

The pre-computed (multi-)keys can often be optimized, see Uri's
Sort::Maker http://search.cpan.org/search?query=Sort::Maker
for facilities.

--
Affijn, Ruud

"Gewoon is een tijger."

Aug 28 '06 #6

Joachim Durchholz

Jim Gibson wrote:

The problem addressed by what is know in Perl as the 'Schwartzian
Transform' is that the compare operation can be an expensive one,
regardless of the whether the comparison uses multiple keys. Since in
comparison sorts, the compare operation will be executed N(logN) times,
it is more efficient to pre-compute a set of keys, one for each object
to be sorted. That need be done only N times.

Wikipedia says it's going from 2NlogN to N. If a sort is massively
dominated by the comparison, that could give a speedup of up to 100%
(approximately - dropping the logN factor is almost irrelevant, what
counts is losing that factor of 2).

Regards,
Jo

Aug 29 '06 #7

xhoster

Joachim Durchholz <jo@durchholz.org> wrote:

Jim Gibson wrote:

The problem addressed by what is know in Perl as the 'Schwartzian
Transform' is that the compare operation can be an expensive one,
regardless of the whether the comparison uses multiple keys. Since in
comparison sorts, the compare operation will be executed N(logN) times,
it is more efficient to pre-compute a set of keys, one for each object
to be sorted. That need be done only N times.

Wikipedia says it's going from 2NlogN to N. If a sort is massively
dominated by the comparison, that could give a speedup of up to 100%
(approximately - dropping the logN factor is almost irrelevant, what
counts is losing that factor of 2).

It seems to me that ln 1,000,000 is 13.8, and that 13.8 is quite a bit
greater than 2.

Cheers,

Xho


Aug 29 '06 #8

Gabriel Genellina

At Tuesday 29/8/2006 07:50, Joachim Durchholz wrote:

>Wikipedia says it's going from 2NlogN to N. If a sort is massively
dominated by the comparison, that could give a speedup of up to 100%
(approximately - dropping the logN factor is almost irrelevant, what
counts is losing that factor of 2).

In fact it's the other way - losing a factor of 2 is irrelevant,
O(2N)=O(N). The logN factor is crucial here.

Gabriel Genellina
Softlab SRL


Aug 29 '06 #9

Joachim Durchholz

Gabriel Genellina wrote:

At Tuesday 29/8/2006 07:50, Joachim Durchholz wrote:

>Wikipedia says it's going from 2NlogN to N. If a sort is massively
dominated by the comparison, that could give a speedup of up to 100%
(approximately - dropping the logN factor is almost irrelevant, what
counts is losing that factor of 2).

In fact it's the other way - losing a factor of 2 is irrelevant,
O(2N)=O(N). The logN factor is crucial here.

That's just a question of what you're interested in.

If it's asymptotic behavior, then the O(logN) factor makes a difference.

If it's practical speed, a constant factor of 2 is far more relevant
than any O(logN) factor.

(I'm not on the list, so I won't see responses unless specifically CC'd.)

Regards,
Jo

Aug 29 '06 #10

Tim Peters

[Joachim Durchholz]

>>Wikipedia says it's going from 2NlogN to N. If a sort is massively
dominated by the comparison, that could give a speedup of up to 100%
(approximately - dropping the logN factor is almost irrelevant, what
counts is losing that factor of 2).

[Gabriel Genellina]

>In fact it's the other way - losing a factor of 2 is irrelevant,
O(2N)=O(N). The logN factor is crucial here.

[Joachim Durchholz]

That's just a question of what you're interested in.

If it's asymptotic behavior, then the O(logN) factor is a difference.

If it's practical speed, a constant factor of 2 is far more relevant
than any O(logN) factor.

Nope. Even if you're thinking of base 10 logarithms, log(N, 10) > 2
for every N > 100. Base 2 logarithms are actually most appropriate
here, and log(N, 2) > 2 for every N > 4. So even if the "2" made
sense here (it doesn't -- see next paragraph), the log(N) term
dominates it for even relatively tiny values of N.

Now it so happens that current versions of Python (and Perl) use merge
sorts, where the worst-case number of comparisons is roughly N*log(N,
2), not Wikipedia's 2*N*log(N, 2) (BTW, the Wikipedia article
neglected to specify the base of its logarithms, but it's clearly
intended to be 2). So that factor of 2 doesn't even exist in current
reality: the factor of log(N, 2) is the /whole/ worst-case story
here, and, e.g., is near 10 when N is near 1000. A factor of 10 is
nothing to sneeze at.

OTOH, current versions of Python (and Perl) also happen to use
"adaptive" merge sorts, which can do as few as N-1 comparisons in the
best case (the input is already sorted, or reverse-sorted). Now I'm
not sure why anyone would sort a list already known to be sorted, but
if you do, DSU is a waste of time. In any other case, it probably
wins, and usually wins big, and solely by saving a factor of (up to)
log(N, 2) key computations.
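
The best-case and typical-case comparison counts can be observed with a counting wrapper. This is a sketch that assumes CPython's built-in sort; the `Counted` class and `count_comparisons` helper are made up for illustration, and the exact counts may differ between Python versions.

```python
import random


class Counted:
    """Wraps a value and counts every < comparison the sort performs."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    """Sort wrapped copies of the values; return the number of comparisons."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons


n = 1000
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)  # fixed seed for repeatability

sorted_cmps = count_comparisons(already_sorted)
shuffled_cmps = count_comparisons(shuffled)

print(sorted_cmps)    # close to n for already-sorted input
print(shuffled_cmps)  # several times n, bounded by roughly n*log2(n)
```

Without DSU, every one of those comparisons would have to recompute keys, which is why the shuffled case is where key= pays off.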

(I'm not on the list, so I won't see responses unless specifically CC'd.)

Done.

Aug 29 '06 #11

Joachim Durchholz

Tim Peters wrote:

[Joachim Durchholz]

>>>Wikipedia says it's going from 2NlogN to N. If a sort is massively
dominated by the comparison, that could give a speedup of up to 100%
(approximately - dropping the logN factor is almost irrelevant, what
counts is losing that factor of 2).

[Gabriel Genellina]

>>In fact it's the other way - losing a factor of 2 is irrelevant,
O(2N)=O(N). The logN factor is crucial here.

[Joachim Durchholz]
>That's just a question of what you're interested in.

If it's asymptotic behavior, then the O(logN) factor is a difference.

If it's practical speed, a constant factor of 2 is far more relevant
than any O(logN) factor.

Nope. Even if you're thinking of base 10 logarithms, log(N, 10) > 2
for every N > 100. Base 2 logarithms are actually most appropriate
here, and log(N, 2) > 2 for every N > 4. So even if the "2" made
sense here (it doesn't -- see next paragraph), the log(N) term
dominates it for even relatively tiny values of N.

Whether this argument is relevant depends on the constant factors
associated with each term.
Roughly speaking, if the constant factor on the O(N) term is 100 and the
constant factor on the O(logN) term is 1, then it's still irrelevant.

My point actually is this: big-Oh measures are fine for comparing
algorithms in general, but when it comes to optimizing concrete
implementations, their value greatly diminishes: you still have to
investigate the constant factors and all the details that the big-Oh
notation abstracts away.
From that point of view, it's irrelevant whether some part of the
algorithm contributes an O(1) or an O(logN) factor: the decision where
to optimize is almost entirely dominated by the constant factors.

Regards,
Jo

Aug 29 '06 #12

Tim Peters

[Joachim Durchholz]

>>>>Wikipedia says it's going from 2NlogN to N. If a sort is massively
dominated by the comparison, that could give a speedup of up to 100%
(approximately - dropping the logN factor is almost irrelevant, what
counts is losing that factor of 2).

[Gabriel Genellina]

>>>In fact it's the other way - losing a factor of 2 is irrelevant,
O(2N)=O(N). The logN factor is crucial here.

[Joachim Durchholz]

>>That's just a question of what you're interested in.

If it's asymptotic behavior, then the O(logN) factor is a difference.

If it's practical speed, a constant factor of 2 is far more relevant
than any O(logN) factor.

[Tim Peters]

>Nope. Even if you're thinking of base 10 logarithms, log(N, 10) > 2
for every N > 100. Base 2 logarithms are actually most appropriate
here, and log(N, 2) > 2 for every N > 4. So even if the "2" made
sense here (it doesn't -- see next paragraph), the log(N) term
dominates it for even relatively tiny values of N.

[Joachim Durchholz]

Whether this argument is relevant depends on the constant factors
associated with each term.

I'm afraid you're still missing the point in this example: it's not
just that Python's (& Perl's) current sorts do O(N*log(N)) worst-case
comparisons, it's that they /do/ N*log(N, 2) worst-case comparisons.
O() notation isn't being used, and there is no "constant factor" here:
the count of worst-case comparisons made is close to exactly N*log(N,
2), not to some mystery-constant times N*log(N, 2). For example,
sorting a randomly permuted array with a million distinct elements
will require nearly 1000000*log(1000000, 2) ~= 1000000 * 20 = 20
million comparisons, and DSU will save about 19 million key
computations in this case. O() arguments are irrelevant to this, and
the Wikipedia page you started from wasn't making an O() argument
either:

http://en.wikipedia.org/wiki/Schwartzian_transform

For an efficient ordinary sort function, the number of invocations of the
transform function goes from an average of 2nlogn to n;

No O() in sight, and no O() was intended there either. You do exactly
N key computations when using DSU, while the hypothetical "efficient
ordinary sort function" the author had in mind does about 2*N*log(N,
2) key computations when not using DSU. That's overly pessimistic for
Python's & Perl's current sort functions, where no more than N*log(N,
2) key computations are done when not using DSU. The /factor/ of key
computations saved is thus as large as N*log(N, 2) / N = log(N, 2).
O() behavior has nothing to do with this result, and the factor of
log(N, 2) is as real as it gets. If key extraction is at all
expensive, and N isn't trivially small, saving a factor of log(N, 2)
key extractions is /enormously/ helpful.
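
The arithmetic for the million-element example can be replayed directly. A sketch that follows the accounting used above (one key computation charged per worst-case comparison without DSU, exactly N with DSU); the variable names are made up for illustration.

```python
import math

n = 1_000_000

# Worst-case comparisons for a merge sort: about N * log2(N).
worst_case_comparisons = n * math.log2(n)

# With DSU, each key is computed exactly once.
key_computations_with_dsu = n

saved = worst_case_comparisons - key_computations_with_dsu
factor = worst_case_comparisons / key_computations_with_dsu

print(round(math.log2(n), 2))  # 19.93
print(round(saved / 1e6, 1))   # 18.9 (million key computations saved)
print(round(factor, 2))        # 19.93, i.e. log2(N)
```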

If this is sinking in now, reread the rest of my last reply that got
snipped here.

Roughly speaking, if the constant factor on the O(N) term is 100 and the
constant factor on the O(logN) term is 1, then it's still irrelevant.

As above, it's talking about O() that's actually irrelevant in this
specific case.

My point actually is this: big-Oh measures are fine for comparing
algorithms in general, but when it comes to optimizing concrete
implementations, its value greatly diminishes: you still have to
investigate the constant factors and all the details that the big-Oh
notation abstracts away.

That's true, although the argument here wasn't actually abstracting
away anything. You've been adding abstraction to an argument that
didn't have any ;-)

From that point of view, it's irrelevant whether some part of the
algorithm contributes an O(1) or an O(logN) factor: the decision where
to optimize is almost entirely dominated by the constant factors.

While true in some cases, it's irrelevant to this specific case.
More, in practice a factor of O(log(N)) is almost always more
important "for real" than a factor of O(1) anyway -- theoretical
algorithms hiding gigantic constant factors in O() notation are very
rarely used in real life. For example, the best-known algorithm for
finding the number of primes <= x has O(sqrt(x)) time complexity, but
AFAIK has /never/ been implemented because the constant factor is
believed to be gigantic indeed. Theoretical CompSci is full of
results like that, but they have little bearing on practical
programming.

Aug 29 '06 #13

Fredrik Lundh

Tim Peters wrote:

OTOH, current versions of Python (and Perl)

Just curious, but does all this use of "(& Perl)" mean that the Perl folks
have implemented timsort?

</F>

Aug 30 '06 #14

Tim Peters

[/T]

>OTOH, current versions of Python (and Perl)

[/F]

just curious, but all this use of (& Perl) mean that the Perl folks have
implemented timsort ?

A remarkable case of independent harmonic convergence:

http://mail.python.org/pipermail/pyt...ly/026946.html

Come to think of it, I don't actually know whether a /released/ Perl
ever contained the development code I saw. Looks like it got added to
Perl 5.8:

http://perldoc.perl.org/sort.html

Aug 30 '06 #15

Joachim Durchholz

Tim Peters wrote:

O() notation isn't being used

I was replying to Gabriel's post:

>>>>In fact it's the other way - losing a factor of 2 is irrelevant,
O(2N)=O(N). The logN factor is crucial here.

Regards,
Jo

Aug 30 '06 #16

Duncan Booth

Tim Peters wrote:

[/T]

>>OTOH, current versions of Python (and Perl)

[/F]
>just curious, but all this use of (& Perl) mean that the Perl folks have
implemented timsort ?

A remarkable case of independent harmonic convergence:

http://mail.python.org/pipermail/pyt...ly/026946.html

Come to think of it, I don't actually know whether a /released/ Perl
ever contained the development code I saw. Looks like it got added to
Perl 5.8:

http://perldoc.perl.org/sort.html

The difference in style between Perl and Python is quite interesting: Perl
lets you specify the algorithm which you think is going to be best for your
code, but in a global manner, which means your pragmas may or may not have
the desired effect (which sounds like a headache for maintenance). Python
simply gets on with the job.

What the Perl docs don't say, though, is whether they do any of the fancy
timsort optimisations. I guess not, or at least not all of them: the Perl
docs warn that mergesort can be much slower if there are a lot of
duplicates, and from the Python dev thread above we can see that early
timsort was also much slower for that case (~sort), but in the final version
there is no effective difference.

Aug 30 '06 #17

neoedmund

Yeah, Java also has two interfaces, Comparator and Comparable, which
roughly correspond to Python's cmp argument to sort() and the __cmp__() method.

Marc 'BlackJack' Rintsch wrote:

In <11*********************@m79g2000cwm.googlegroups. com>, Tom Cole wrote:

In Java, classes can implement the Comparable interface. This interface
contains only one method, a compareTo(Object o) method, and it is
defined to return a value < 0 if the Object is considered less than the
one being passed as an argument, it returns a value > 0 if considered
greater than, and 0 if they are considered equal.

The object implementing this interface can use any of the variables
available to it (AKA address, zip code, longitude, latitude, first
name, whatever) to return this -1, 0 or 1. This is slightly different
than what you mention as we don't have to "decorate" the object. These
are all variables that already exist in the Object, and if fact make it
what it is. So, of course, there is no need to un-decorate at the end.

Python has such a mechanism too, the special `__cmp__()` method
has basically the same signature. The problem the decorate, sort,
un-decorate pattern solves is that this object specific compare operations
only use *one* criteria.

Let's say you have a `Person` object with name, surname, date of birth and
so on. When you have a list of such objects and want to sort them by name
or by date of birth you can't use the `compareTo()` method for both.

Ciao,
Marc 'BlackJack' Rintsch

Aug 31 '06 #18

Xah Lee

I just want to make it known that I think most if not all of the
replies in this thread are of not much technical value. They are either
wrong and/or misleading, and the Perl module mentioned about sorting and
the Java language aspect of sorting, as they are discussed or
represented, are rather stupid.

I may or may not write a detailed account later. If you have specific
questions, or want to know specific reasons of my claims, please don't
hesitate to email. (privately if you deem it proper)

Xah
xa*@xahlee.org
∑ http://xahlee.org/

xa****@gmail.com wrote:

Last year, i've posted a tutorial and commentary about Python and
Perl's sort function. (http://xahlee.org/perl-python/sort_list.html)

In that article, i discussed a technique known among juvenile Perlers
as the Schwartzian Transform, which also manifests in Python as its
“key” optional parameter.

Here, i give a more detailed account on why and how of this construct.
...
This post is archived at:
http://xahlee.org/perl-python/sort_list.html

I would be interested in comments about how Common Lisp, Scheme, and
Haskell deal with the decorate-sort-dedecorate technique. In
particular, does it manifest in the language itself? If not, how does
one usually do it in the code? (note: please reply to approprate groups
if it is of no general interest. Thanks) (am also interested on how
Perl6 or Python3000 does this, if there are major changes to their sort
function)

Thanks.

Xah
xa*@xahlee.org
∑ http://xahlee.org/

Sep 8 '06 #19
