Nasty gotcha/bug in heapq.nlargest/nsmallest

George Sakkis

I spent several hours debugging some bogus data results that turned
out to be caused by the fact that heapq.nlargest doesn't respect rich
comparisons:

import heapq
import random

class X(object):
def __init__(self, x): self.x=x
def __repr__(self): return 'X(%s)' % self.x
if True:
# this succeeds
def __cmp__(self, other): return cmp(self.x , other.x)
else:
# this fails
def __lt__(self, other): return self.x < other.x

s = [X(i) for i in range(10)]
random.shuffle(s)

s1 = heapq.nlargest(5, s)
s2 = sorted(s, reverse=True)[:5]
assert s1 == s2, (s,s1,s2)

s1 = heapq.nsmallest(5, s)
s2 = sorted(s)[:5]
assert s1 == s2, (s,s1,s2)
According to the docs, nlargest is equivalent to: "sorted(iterable,
key=key, reverse=True)[:n]" and similarly for nsmallest. So that must
be at least a documentation bug, if not an implementation one.

George

Jun 27 '08 #1

Subscribe Reply

1468

Gabriel Genellina

En Wed, 14 May 2008 23:47:56 -0300, George Sakkis <ge***********@gmail.comescribió:

I spent several hours debugging some bogus data results that turned
out to be caused by the fact that heapq.nlargest doesn't respect rich
comparisons:

import heapq
import random

class X(object):
def __init__(self, x): self.x=x
def __repr__(self): return 'X(%s)' % self.x
if True:
# this succeeds
def __cmp__(self, other): return cmp(self.x , other.x)
else:
# this fails
def __lt__(self, other): return self.x < other.x

s = [X(i) for i in range(10)]
random.shuffle(s)

s1 = heapq.nlargest(5, s)
s2 = sorted(s, reverse=True)[:5]
assert s1 == s2, (s,s1,s2)

s1 = heapq.nsmallest(5, s)
s2 = sorted(s)[:5]
assert s1 == s2, (s,s1,s2)
According to the docs, nlargest is equivalent to: "sorted(iterable,
key=key, reverse=True)[:n]" and similarly for nsmallest. So that must
be at least a documentation bug, if not an implementation one.

Your class is not properly defined. sorted() only uses < to compare items, but this fact is NOT documented as far as I can tell. So you are abusing this implementation detail by not defining the other comparison operators, and the X class has a rather strange behavior:

pyX(1)<X(2)
True
pyX(1)<=X(2)
False # !!!
pyX(1)>X(2)
False
pyX(1)==X(1)
False # !!!

If you write all the six rich comparison methods (__lt__, __le__, etc) nlargest and nsmallest work fine.
If your objects are fully comparable (like numbers; that is, given A and B, one and only one of these holds: A<B, A==B, A>B) the easiest way is to write a helper function (like the old __cmp__) and implement the rich comparison methods using it. Those six methods could be defined in a mixin class.

--
Gabriel Genellina

Jun 27 '08 #2

Peter Otten

George Sakkis wrote:

I spent several hours debugging some bogus data results that turned
out to be caused by the fact that heapq.nlargest doesn't respect rich
comparisons:

import heapq
import random

class X(object):
def __init__(self, x): self.x=x
def __repr__(self): return 'X(%s)' % self.x
if True:
# this succeeds
def __cmp__(self, other): return cmp(self.x , other.x)
else:
# this fails
def __lt__(self, other): return self.x < other.x

s = [X(i) for i in range(10)]
random.shuffle(s)

s1 = heapq.nlargest(5, s)
s2 = sorted(s, reverse=True)[:5]
assert s1 == s2, (s,s1,s2)

s1 = heapq.nsmallest(5, s)
s2 = sorted(s)[:5]
assert s1 == s2, (s,s1,s2)
According to the docs, nlargest is equivalent to: "sorted(iterable,
key=key, reverse=True)[:n]" and similarly for nsmallest. So that must
be at least a documentation bug, if not an implementation one.

Implementing a subset of the rich comparisons is always dangerous. According
to my ad hoc test you need <, <=, and == for nlargest()/nsmallest() to
work:

import heapq
import random

used_rich = set()

class X(object):
def __init__(self, x): self.x=x
def __repr__(self): return 'X(%s)' % self.x
def __lt__(self, other):
used_rich.add("lt")
return self.x < other.x
def __eq__(self, other):
used_rich.add("eq")
return self.x == other.x
def __gt__(self, other):
used_rich.add("gt")
return self.x other.x
def __ge__(self, other):
used_rich.add("ge")
return self.x >= other.x
def __ne__(self, other):
used_rich.add("ne")
return self.x != other.x
def __le__(self, other):
used_rich.add("le")
return self.x <= other.x

s = [X(8), X(0), X(3), X(4), X(5), X(2), X(1), X(6), X(7), X(9)]

smallest = sorted(s)[:5]
largest = sorted(s, reverse=True)[:5]

print "used by sorted:", used_rich
used_rich = set()

for i in range(10000):

s1 = heapq.nlargest(5, s)
assert s1 == largest, (s, s1, largest)

s1 = heapq.nsmallest(5, s)
assert s1 == smallest, (s, s1, smallest)

random.shuffle(s)

print "used by nlargest/nsmallest:", used_rich

Output:
used by sorted: set(['lt'])
used by nlargest/nsmallest: set(['lt', 'le', 'eq'])

Peter

Jun 27 '08 #3

George Sakkis

On May 15, 3:06 am, Peter Otten <__pete...@web.dewrote:

George Sakkis wrote:
I spent several hours debugging some bogus data results that turned
out to be caused by the fact that heapq.nlargest doesn't respect rich
comparisons:

import heapq
import random

class X(object):
def __init__(self, x): self.x=x
def __repr__(self): return 'X(%s)' % self.x
if True:
# this succeeds
def __cmp__(self, other): return cmp(self.x , other.x)
else:
# this fails
def __lt__(self, other): return self.x < other.x

s = [X(i) for i in range(10)]
random.shuffle(s)

s1 = heapq.nlargest(5, s)
s2 = sorted(s, reverse=True)[:5]
assert s1 == s2, (s,s1,s2)

s1 = heapq.nsmallest(5, s)
s2 = sorted(s)[:5]
assert s1 == s2, (s,s1,s2)

According to the docs, nlargest is equivalent to: "sorted(iterable,
key=key, reverse=True)[:n]" and similarly for nsmallest. So that must
be at least a documentation bug, if not an implementation one.

Implementing a subset of the rich comparisons is always dangerous. According
to my ad hoc test you need <, <=, and == for nlargest()/nsmallest() to
work:

import heapq
import random

used_rich = set()

class X(object):
def __init__(self, x): self.x=x
def __repr__(self): return 'X(%s)' % self.x
def __lt__(self, other):
used_rich.add("lt")
return self.x < other.x
def __eq__(self, other):
used_rich.add("eq")
return self.x == other.x
def __gt__(self, other):
used_rich.add("gt")
return self.x other.x
def __ge__(self, other):
used_rich.add("ge")
return self.x >= other.x
def __ne__(self, other):
used_rich.add("ne")
return self.x != other.x
def __le__(self, other):
used_rich.add("le")
return self.x <= other.x

s = [X(8), X(0), X(3), X(4), X(5), X(2), X(1), X(6), X(7), X(9)]

smallest = sorted(s)[:5]
largest = sorted(s, reverse=True)[:5]

print "used by sorted:", used_rich
used_rich = set()

for i in range(10000):

s1 = heapq.nlargest(5, s)
assert s1 == largest, (s, s1, largest)

s1 = heapq.nsmallest(5, s)
assert s1 == smallest, (s, s1, smallest)

random.shuffle(s)

print "used by nlargest/nsmallest:", used_rich

Output:
used by sorted: set(['lt'])
used by nlargest/nsmallest: set(['lt', 'le', 'eq'])

Interesting, I don't get 'eq' on one box ("2.5 (r25:51908, Sep 19
2006, 09:52:17) [MSC v.1310 32 bit (Intel)]") but I get it on another
("2.5.1 (r251:54863, May 9 2007, 15:27:54) [GCC 3.3.5 (Debian
1:3.3.5-13)]").

A more interesting question is how many (and which) of the rich
comparisons can you omit while maintaining correctness (see script
below). The results on my v2.5 on Windows are:

1. Use ('lt', 'le') if both present
2. If only 'lt' is missing use ('le', 'gt')
3. If only 'le' is missing use ('lt', 'ge')
4. If both ('lt', 'le') are missing use ('gt', 'ge')
5. If 3 or more of ('lt', 'le', 'gt', 'ge') are missing use the
remaining one (if any) plus 'cmp'
6. If (5) holds and there is no 'cmp', nlargest/nsmaller are WRONG

The results on the v2.5.1 debian box are the same, with the addition
of 'eq' in all steps; if 'eq' is not present, 'cmp' is called, and if
neither 'cmp' is present, the results are still correct (provided any
of the cases 1-4 above hold of course). IOW, it does some extra
equality checks if it can, but these are optional since it can be
correct without them.

For sorted(), the results are identical on both boxes:
- Use only 'lt' if present else
- Use only 'gt' if present else
- Use only 'cmp' if present else
- WRONG RESULTS
IOW, no attempt to use 'le','ge','eq','ne' is made.

I wonder how stable is this behavior across different versions and
implementations (Jython, IronPython, etc).

George
==== testing script ==================================

import heapq
from random import shuffle

used = set(); add = used.add

class X(object):
def __init__(self, x): self.x=x
def __repr__(self): return 'X(%s)' % self.x

# try to remove one or more of these included in a used set
# and see what happens
def __gt__(self, other): add('gt'); return self.x other.x
def __ge__(self, other): add('ge'); return self.x >= other.x
def __lt__(self, other): add('lt'); return self.x < other.x
def __le__(self, other): add('le'); return self.x <= other.x
def __eq__(self, other): add('eq'); return self.x == other.x
def __ne__(self, other): add('ne'); return self.x != other.x
def __cmp__(self, other): add('cmp'); return cmp(self.x, other.x)
if __name__ == '__main__':
check_heapq = True
n = 1000
s = [X(8), X(0), X(3), X(4), X(5), X(2), X(1), X(6), X(7), X(9)]
used.clear()
if check_heapq:
for i in xrange(n):
s2 = heapq.nlargest(5, s)
assert [a.x for a in s2] == range(9,4,-1), s2
shuffle(s)
else:
for i in xrange(n):
s2 = sorted(s, reverse=True)[:5]
assert [a.x for a in s2] == range(9,4,-1), s2
shuffle(s)
used_largest = used.copy()
print "largest:", used_largest

used.clear()
if check_heapq:
for i in xrange(n):
s2 = heapq.nsmallest(5, s)
assert [a.x for a in s2] == range(5), s2
shuffle(s)
else:
for i in xrange(n):
s2 = sorted(s)[:5]
assert [a.x for a in s2] == range(5), s2
shuffle(s)
used_smallest = used.copy()
print "smallest:", used_smallest
assert used_largest == used_smallest

Jun 27 '08 #4

Raymond Hettinger

On May 15, 12:06*am, Peter Otten <__pete...@web.dewrote:

According
to my ad hoc test you need <, <=, and == for nlargest()/nsmallest() to
work:

In Py2.6 and after, you only need < and ==. I replaced the LE tests
with LT to match list.sort() and bisect.bisect(). The == arises
because nlargest/nsmallest compare decorated tuples. Tuple comparison
always starts with equality tests to find the first unequal element
and then switches back to testing whichever inequality test was
requested.

Raymond

Jun 27 '08 #5

Similar topics

1766

just another default argument value gotcha

by: Frank Bechmann | last post by:

Eventually most of you will not learn much from this because it's just another event in the 'default argument value gotcha' series, but because it cost me some hours yesterday to spot this 'error'...

Python

1331

heapq iteration?

by: Eric | last post by:

I'm missing something critical about how heapq works. I assumed I could iterate through the heap, but I get partial iteration: >>> listB >>> heapify(listB) >>> for h in listB:...

Python

1412

Heap class for heapq module

by: Stefan Behnel | last post by:

Hi! I filed a patch for a Heap class to be integrated into the heapq module (in addition the the currently available functions). The main features are support for iteration and for the standard...

Python

7417

FileSystemWatcher raises Changed Twice....

by: ljh | last post by:

Has anyone else noticed that the FileSystemWatcher raises the changed event twice when a file is changed? Do you have any idea why this is the case?

.NET Framework

2823

A note on heapq module

by: bearophileHUGS | last post by:

In few minutes I have just written this quite raw class, it lacks doctring (the same of the functions of the heapq module), it may contain bugs still, I haven't tested it much. It's just a simple...

Python

4130

heapq.heappush and pop to use key

by: jm.suresh | last post by:

I wanted to have a heap of custom objects, and in different heaps I wanted to have the weights for my elements differently. So, I modified the heapq module to accept key arguments also. The...

Python

3208

Define key in nlargest() of heapq?

by: Davy | last post by:

Hi all, I have a dictionary with n elements, and I want to get the m(m<=n) keys with the largest values. For example, I have dic that includes n=4 elements, I want m=2 keys have the largest...

Python

968

Theading success (and a little gotcha lurking...)

by: Just_a_fan | last post by:

I am so happy about this I had to tell someone and since no one at the house even knows how to spell VB, I just had to make a short post. I finally got my little program to correctly do a second...

Visual Basic .NET

4131

heapq question

by: Giampaolo Rodola' | last post by:

Hi, this is related to what I'm trying to implement here: http://groups.google.com/group/comp.lang.python/browse_thread/thread/20796724c1daf1e1# My question is the following: is it safe to avoid...

Python

7269

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

7394

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

7123

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

5100

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

4756

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

C# / C Sharp

3248

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...

Networking - Hardware / Configuration

3237

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

811

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

470

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

General