
finding items that occur more than once in a list

Is there a more efficient way to do this?

def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)
>>> f([0, 0, 1, 1, 2, 2, 3])
set([0, 1, 2])

Mar 18 '08 #1
22 Replies


On Mar 18, 11:57 am, Simon Forman <sajmik...@gmail.com> wrote:
> Is there a more efficient way to do this?
>
> def f(L):
>     '''Return a set of the items that occur more than once in L.'''
>     L = list(L)
>     for item in set(L):
>         L.remove(item)
>     return set(L)
>
> >>> f([0, 0, 1, 1, 2, 2, 3])
> set([0, 1, 2])
def f(L):
    D = dict()
    for item in L:
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    return [i for i, j in D.items() if j > 1]

That would be my way to do it; you'd need to time both over several
thousand iterations to see which one is more efficient, though.
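
For anyone who wants to actually run that comparison, a minimal timeit
harness along these lines should do it (a sketch only; `f_remove` and
`f_count` are just illustrative names for the two versions above, and the
test data is made up):

import timeit

def f_remove(L):
    # original version: one list.remove() per distinct item
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)

def f_count(L):
    # dict-counting version from this post
    D = dict()
    for item in L:
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    return [i for i, j in D.items() if j > 1]

data = range(1000) * 2  # made-up workload: every value occurs twice

for name in ("f_remove", "f_count"):
    t = timeit.Timer("%s(data)" % name, "from __main__ import %s, data" % name)
    print name, min(t.repeat(3, 10))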
Mar 18 '08 #2

Simon Forman <sa*******@gmail.com> writes:
> Is there a more efficient way to do this?
http://aspn.activestate.com/ASPN/Coo.../Recipe/502263
Mar 18 '08 #3

Simon Forman wrote:
> Is there a more efficient way to do this?
>
> def f(L):
>     '''Return a set of the items that occur more than once in L.'''
>     L = list(L)
>     for item in set(L):
>         L.remove(item)
>     return set(L)
That's neat, but quadratic time because list.remove() requires
a linear search. We can make an efficient variant by using
remove on a set rather than a list:

def multiples(lst):
    singles = set(lst)
    mults = set()
    for x in lst:
        if x in singles:
            singles.remove(x)
        else:
            mults.add(x)
    return mults

Though probably better is:

def multiples(lst):
    seen = set()
    mults = set()
    for x in lst:
        if x in seen:
            mults.add(x)
        else:
            seen.add(x)
    return mults
I've typically used dicts for such things, as in:

def multiples(lst):
    h = {}
    for x in lst:
        h[x] = h.get(x, 0) + 1
    return set([x for x in h if h[x] > 1])
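
The same counting idea reads a little cleaner with
collections.defaultdict (new in Python 2.5); a rough sketch:

from collections import defaultdict

def multiples(lst):
    h = defaultdict(int)  # missing keys start at 0
    for x in lst:
        h[x] += 1
    return set(x for x in h if h[x] > 1)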
--
--Bryan
Mar 18 '08 #4

On Mar 18, 2:57 am, Simon Forman <sajmik...@gmail.com> wrote:
> Is there a more efficient way to do this?
>
> def f(L):
>     '''Return a set of the items that occur more than once in L.'''
>     L = list(L)
>     for item in set(L):
>         L.remove(item)
>     return set(L)
>
> >>> f([0, 0, 1, 1, 2, 2, 3])
> set([0, 1, 2])

def f(L):
    seen = set()
    dups = set()
    for e in L:
        if e in seen:
            dups.add(e)
        else:
            seen.add(e)
    return dups

Raymond
Mar 18 '08 #5

Ninereeds <st*************@aol.com> writes:
> As for the growth pattern, each time you grow the table you have to
> redistribute all the items previously inserted to new locations.
> Resizes would get rarer as more items are added due to the
> exponential growth, but every table resize would take longer too
> since there are more items to move. Inserting n items still
> intuitively looks like O(n^2) to me.
>
> That said, it does remind me of that old exponential realloc trick
> for array resizing. Same thing, I suppose, since a hash table is
> basically an array. Maybe my math "intuition" is just wrong.
That's it. While table resizes grow linearly in complexity, they
become geometrically rarer. This is exactly what happens when
resizing dynamic arrays such as Python lists. Applying your
intuition, appending n elements to a Python list would also have
O(n^2) complexity, which it doesn't. See, for example,
http://en.wikipedia.org/wiki/Dynamic...amortized_cost
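
A toy sketch of the accounting (assuming plain capacity doubling, which
is not CPython's exact growth factor) makes the amortized claim concrete:
the total number of element moves stays proportional to n.

def total_moves(n, initial=8):
    """Count element copies performed by doubling resizes over n appends."""
    moves, capacity, size = 0, initial, 0
    for _ in xrange(n):
        if size == capacity:
            moves += size    # a resize copies every existing element
            capacity *= 2
        size += 1
    return moves

for n in (1000, 10000, 100000):
    # moves/n stays bounded, i.e. amortized O(1) per append
    print n, total_moves(n), float(total_moves(n)) / n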
Mar 18 '08 #6

On 18 Mar, 10:57, Simon Forman <sajmik...@gmail.com> wrote:
> def f(L):
>     '''Return a set of the items that occur more than once in L.'''
>     L = list(L)
>     for item in set(L):
>         L.remove(item)
>     return set(L)

def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t: not (t[0] - t[1]), zip(slst[:-1], slst[1:]))]))


Mar 18 '08 #7

On 18 Mar, 22:22, sturlamolden <sturlamol...@yahoo.no> wrote:
> def nonunique(lst):
>     slst = sorted(lst)
>     return list(set([s[0] for s in
>         filter(lambda t: not (t[0] - t[1]), zip(slst[:-1], slst[1:]))]))

Or perhaps better:

def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t: t[0] != t[1], zip(slst[:-1], slst[1:]))]))

Mar 18 '08 #8

On 18 Mar, 22:25, sturlamolden <sturlamol...@yahoo.no> wrote:
> def nonunique(lst):
>     slst = sorted(lst)
>     return list(set([s[0] for s in
>         filter(lambda t: t[0] != t[1], zip(slst[:-1], slst[1:]))]))
Obviously that should be 'lambda t : t[0] == t[1]'. Instead of using
the set function, there is more structure to exploit when the list is
sorted:
def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
        filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
    return [dups[0]] + [s[1] for s in
        filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]

Mar 18 '08 #9

On Mar 18, 9:56 pm, sturlamolden <sturlamol...@yahoo.no> wrote:
> On 18 Mar, 22:25, sturlamolden <sturlamol...@yahoo.no> wrote:
> > def nonunique(lst):
> >     slst = sorted(lst)
> >     return list(set([s[0] for s in
> >         filter(lambda t: t[0] != t[1], zip(slst[:-1], slst[1:]))]))
>
> Obviously that should be 'lambda t : t[0] == t[1]'. Instead of using
> the set function, there is more structure to exploit when the list is
> sorted:
>
> def nonunique(lst):
>     slst = sorted(lst)
>     dups = [s[0] for s in
>         filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
>     return [dups[0]] + [s[1] for s in
>         filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
Argh! What's wrong with something like:

def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k

>>> list(duplicates([1, 2, 3, 2, 2, 4, 1]))
[1, 2]

--
Arnaud

Mar 18 '08 #10

On 18 Mar, 23:45, Arnaud Delobelle <arno...@googlemail.com> wrote:
> > def nonunique(lst):
> >     slst = sorted(lst)
> >     dups = [s[0] for s in
> >         filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
> >     return [dups[0]] + [s[1] for s in
> >         filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
>
> Argh! What's wrong with something like:
>
> def duplicates(l):
>     i = j = object()
>     for k in sorted(l):
>         if i != j == k: yield k
>         i, j = j, k

Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
as opposed to ours which are O(N log N).


Mar 18 '08 #11

On Mar 19, 10:08 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> On 18 Mar, 23:45, Arnaud Delobelle <arno...@googlemail.com> wrote:
> > > def nonunique(lst):
> > >     slst = sorted(lst)
> > >     dups = [s[0] for s in
> > >         filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
> > >     return [dups[0]] + [s[1] for s in
> > >         filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
> >
> > Argh! What's wrong with something like:
> >
> > def duplicates(l):
> >     i = j = object()
> >     for k in sorted(l):
> >         if i != j == k: yield k
> >         i, j = j, k
>
> Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
> as opposed to ours which are O(N log N).
I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.
Mar 19 '08 #12

On 19 Mar, 22:48, John Machin <sjmac...@lexicon.net> wrote:
> I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
> and is IMHO more readable than Paul's.
Is a Python set implemented using a hash table?
Mar 19 '08 #13

On Mar 19, 2:48 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Mar 19, 10:08 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> > On 18 Mar, 23:45, Arnaud Delobelle <arno...@googlemail.com> wrote:
> > > > def nonunique(lst):
> > > >     slst = sorted(lst)
> > > >     dups = [s[0] for s in
> > > >         filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
> > > >     return [dups[0]] + [s[1] for s in
> > > >         filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
> > >
> > > Argh! What's wrong with something like:
> > >
> > > def duplicates(l):
> > >     i = j = object()
> > >     for k in sorted(l):
> > >         if i != j == k: yield k
> > >         i, j = j, k
> >
> > Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
> > as opposed to ours which are O(N log N).
>
> I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
> and is IMHO more readable than Paul's.
It's not as much O(N)... Paul Robin's uses a sort first which is
definitely not O(N). Paul's could be prettied up a bit but the general
principle is sound.
Mar 19 '08 #14

On Mar 20, 9:57 am, Justin Bozonier <darkxant...@gmail.com> wrote:
> On Mar 19, 2:48 pm, John Machin <sjmac...@lexicon.net> wrote:
> > On Mar 19, 10:08 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> > > On 18 Mar, 23:45, Arnaud Delobelle <arno...@googlemail.com> wrote:
> > > > > def nonunique(lst):
> > > > >     slst = sorted(lst)
> > > > >     dups = [s[0] for s in
> > > > >         filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
> > > > >     return [dups[0]] + [s[1] for s in
> > > > >         filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
> > > >
> > > > Argh! What's wrong with something like:
> > > >
> > > > def duplicates(l):
> > > >     i = j = object()
> > > >     for k in sorted(l):
> > > >         if i != j == k: yield k
> > > >         i, j = j, k
> > >
> > > Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
> > > as opposed to ours which are O(N log N).
> >
> > I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
> > and is IMHO more readable than Paul's.
>
> It's not as much O(N)... Paul Robin's uses a sort first which is
> definitely not O(N). Paul's could be prettied up a bit but the general
> principle is sound.
"""
from collections import defaultdict
def nonunique(lst):
d = defaultdict(int)
for x in lst: d[x] += 1
return [x for x,n in d.iterkeys() if n 1]
"""

I see no sort here.
Mar 19 '08 #15

On Mar 20, 9:14 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> On 19 Mar, 22:48, John Machin <sjmac...@lexicon.net> wrote:
> > I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
> > and is IMHO more readable than Paul's.
>
> Is a Python set implemented using a hash table?

What don't you understand about the comments in the first two
screenfuls of Objects/setobject.c?

Mar 19 '08 #16

On 20 Mar, 00:16, John Machin <sjmac...@lexicon.net> wrote:
> What don't you understand about the comments in the first two
> screenfuls of Objects/setobject.c?
I had not looked at it, but now I have. It seems Hettinger is the
author :) OK, so sets are implemented as hash tables. Then I agree:
use Raymond Hettinger's solution.
Mar 20 '08 #17

On Wed, 19 Mar 2008 20:16:36 -0300, John Machin <sj******@lexicon.net>
wrote:
> On Mar 20, 9:14 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> > Is a Python set implemented using a hash table?
>
> What don't you understand about the comments in the first two
> screenfuls of Objects/setobject.c?

That comment was unnecessarily harsh, don't you think?

--
Gabriel Genellina

Mar 20 '08 #18

On Mar 20, 12:50 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:
> On Wed, 19 Mar 2008 20:16:36 -0300, John Machin <sjmac...@lexicon.net>
> wrote:
> > On Mar 20, 9:14 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> > > Is a Python set implemented using a hash table?
> >
> > What don't you understand about the comments in the first two
> > screenfuls of Objects/setobject.c?
>
> That comment was unnecessarily harsh, don't you think?
No, otherwise I wouldn't have made it.
Mar 20 '08 #19

On Mar 20, 2:07 am, John Machin <sjmac...@lexicon.net> wrote:
> On Mar 20, 12:50 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
> wrote:
> > On Wed, 19 Mar 2008 20:16:36 -0300, John Machin <sjmac...@lexicon.net>
> > wrote:
> > > On Mar 20, 9:14 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> > > > Is a Python set implemented using a hash table?
> > >
> > > What don't you understand about the comments in the first two
> > > screenfuls of Objects/setobject.c?
> >
> > That comment was unnecessarily harsh, don't you think?
>
> No, otherwise I wouldn't have made it.
Screenfuls is an awful lot, the way you guys write. Besides, what's
the answer? What don't you understand?
Mar 21 '08 #20

On Mar 21, 3:39 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Mar 22, 1:11 am, castiro...@gmail.com wrote:
> > A collision sequence is not so rare.
> > >>> [hash(2**i) for i in range(0, 256, 32)]
> > [1, 1, 1, 1, 1, 1, 1, 1]
>
> Bryan did qualify his remarks: "If we exclude the case where an
> adversary is choosing the keys, ..."
Some adversary. What, you mean, my boss or my customers?
Mar 22 '08 #21

ca********@gmail.com wrote:
> John Machin wrote:
> > On Mar 22, 1:11 am, castiro...@gmail.com wrote:
> > > A collision sequence is not so rare.
> > > >>> [hash(2**i) for i in range(0, 256, 32)]
> > > [1, 1, 1, 1, 1, 1, 1, 1]
> >
> > Bryan did qualify his remarks: "If we exclude the case where an
> > adversary is choosing the keys, ..."
>
> Some adversary. What, you mean, my boss or my customers?
We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here, but that's
an issue in theory and not in serving customers.
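
For the curious, here is a small sketch (a contrived class, not anything
from this thread) of what an adversary could do: keys that all hash alike
push set and dict inserts toward their quadratic worst case.

import time

class Evil(object):
    """Contrived adversarial key: every instance hashes to the same value."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 1                  # force every key into one bucket
    def __eq__(self, other):
        return self.n == other.n  # resolving collisions needs equality tests

for n in (500, 1000, 2000):
    keys = [Evil(i) for i in range(n)]
    t0 = time.time()
    s = set(keys)                 # each insert probes all prior colliding keys
    print n, round(time.time() - t0, 3)  # time roughly quadruples as n doubles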
--
--Bryan
Mar 22 '08 #22

On 22 Mar, 09:31, Bryan Olson <fakeaddr...@nowhere.org> wrote:
> Even a hash function that behaves as a random oracle has
> worst-case quadratic-time in the algorithm here
In which case inserts are not amortized to O(1).
Mar 22 '08 #23
