473,322 Members | 1,307 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,322 software developers and data experts.

Removing duplicates from a list

I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
if mylist.count(x) > 1:
mylist.remove(x)

Method 2 (not so traditional)

mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

What do you think?

Cheers,

Robin

Sep 14 '05 #1
20 3299
Am Wed, 14 Sep 2005 04:38:35 -0700 schrieb Rubinho:
I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice). mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

What do you think?


Hi,

I would use "set":

mylist=list(set(mylist))

Thomas

--
Thomas Güttler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de
Spam Catcher: ni**************@thomas-guettler.de

Sep 14 '05 #2
Rubinho wrote:
I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
if mylist.count(x) > 1:
mylist.remove(x)

Method 2 (not so traditional)

mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.


I would imagine that 2 would be significantly faster. Method 1 uses
'count' which must make a pass through every element of the list, which
would be slower than the efficient hashing that set does. I'm also not
sure about removing an element whilst iterating, I think thats a no-no.

Will McGugan
--
http://www.willmcgugan.com
"".join({'*':'@','^':'.'}.get(c,0) or chr(97+(ord(c)-84)%26) for c in
"jvyy*jvyyzpthtna^pbz")
Sep 14 '05 #3
Rubinho wrote:
I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
if*mylist.count(x)*>*1:
mylist.remove(x)


That would be an odd tradition:
mylist = [1, 2, 1, 3, 2, 3]
for x in mylist: .... if mylist.count(x) > 1:
.... mylist.remove(x)
.... mylist

[2, 1, 2, 3] # oops!

See "Unexpected Behavior Iterating over a Mutating Object"
http://mail.python.org/pipermail/pyt...er/298993.html
thread for the most recent explanation.

Rather, the traditional approach for an algorithmic problem in Python is to
ask Tim Peters, see his recipe at
http://aspn.activestate.com/ASPN/Coo.../Recipe/52560/
(which predates Python's set class).

Peter

Sep 14 '05 #4
Peter Otten wrote:
Rubinho wrote:
I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
if mylist.count(x) > 1:
mylist.remove(x)
That would be an odd tradition:


By tradition I wasn't really talking Python tradition; what I meant was
that the above pattern is similar to what would be generated by people
used to traditional programming languages.
mylist = [1, 2, 1, 3, 2, 3]
for x in mylist: ... if mylist.count(x) > 1:
... mylist.remove(x)
... mylist

[2, 1, 2, 3] # oops!


But you're absolutely right, it doesn't work! Oops indeed :)

I've gone with Thomas's suggestion above of: mylist=list(set(mylist))

Thanks,

Robin

Sep 14 '05 #5
I do this:

def unique(keys):
unique = []
for i in keys:
if i not in unique:unique.append(i)
return unique

I don't know what is faster at the moment.

Sep 14 '05 #6
<ma*****@gamecreators.nl> wrote in message
news:11*********************@o13g2000cwo.googlegro ups.com...
I do this:

def unique(keys):
unique = []
for i in keys:
if i not in unique:unique.append(i)
return unique

I don't know what is faster at the moment.


This is quadratic, O(n^2), in the length n of the list
if all keys are unique.
Conversion to a set just might use a better sorting
algorithm than this (i.e. n*log(n)) and throwing out
duplicates (which, after sorting, are positioned
next to each other) is O(n). If conversion
to a set should turn out to be slower than O(n*log(n))
[depending on the implementation], then you are well
advised to sort the list first (n*log(n)) and then
throw out the duplicate keys with a single walk over
the list. In this case you know at least what to
expect for large n...

Regards,
Christian
Sep 14 '05 #7
Rubinho wrote:
I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.


To add to what others said, I'd imagine that the technique that's going
to be fastest is going to depend not only on the length of the list, but
also the estimated redundancy. (i.e. a technique that gives good
performance with a list that has only one or two elements duplicated
might be painfully slow when there is 10-100 copies of each element.)

There really is no substitute for profiling with representitive data sets.
Sep 14 '05 #8
On Wed, 14 Sep 2005 13:28:58 +0100, Will McGugan wrote:
Rubinho wrote:
I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.
I would imagine that 2 would be significantly faster.


Don't imagine, measure.

Resist the temptation to guess. Write some test functions and time the two
different methods. But first test that the functions do what you expect:
there is no point having a blindingly fast bug.

Method 1 uses
'count' which must make a pass through every element of the list, which
would be slower than the efficient hashing that set does.


But count passes through the list in C and is also very fast. Is that
faster or slower than the hashing code used by sets? I don't know, and
I'll bet you don't either.
--
Steven.

Sep 14 '05 #9
Steven D'Aprano wrote:


Don't imagine, measure.

Resist the temptation to guess. Write some test functions and time the two
different methods. But first test that the functions do what you expect:
there is no point having a blindingly fast bug.
Thats is absolutely correct. Although I think you do sometimes have to
guess. Otherwise you would write multiple versions of every line of code.


But count passes through the list in C and is also very fast. Is that
faster or slower than the hashing code used by sets? I don't know, and
I'll bet you don't either.

Sure. But if I'm not currently optimizing I would go for the method with
the best behaviour, which usualy means hashing rather than searching.
Since even if it is actualy slower - its not likely to be _very_ slow.
Will McGugan
--
http://www.willmcgugan.com
"".join({'*':'@','^':'.'}.get(c,0) or chr(97+(ord(c)-84)%26) for c in
"jvyy*jvyyzpthtna^pbz")
Sep 14 '05 #10
> I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
if mylist.count(x) > 1:
mylist.remove(x)

Method 2 (not so traditional)

mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

What do you think?

Cheers,

Robin


Hi,

Try this:

def unique(s):
e = {}
for x in s:
if not e.has_key(x):
e[x] = 1
return e.keys()

Regards
Przemek
Sep 14 '05 #11
This works too, if speed isn't your thing..
a = [ 1,2,3,2,6,1,3,4,1,7,5,6,7]
a = dict( ( (i,None) for i in a)).keys()

a
[1, 2, 3, 4, 5, 6, 7]

Sep 14 '05 #12
przemek drochomirecki wrote:
def unique(s):
e = {}
for x in s:
if not e.has_key(x):
e[x] = 1
return e.keys()


This is basically identical in functionality to the code:

def unique(s):
return list(set(s))

And with the new-and-improved C implementation of sets coming in Python
2.5, there's even more of a reason to use them when you can.

STeVe
Sep 14 '05 #13
Rubinho napisal(a):
I've a list with duplicate members and I need to make each entry
unique.


hi,

other possibility (my newest discovery:) )
a = [1,2,2,4,2,1,3,4]
unique = d.fromkeys(a).keys()
unique

[1, 2, 3, 4]

regards
przemek

Sep 15 '05 #14
Look at the code below

def unique(s):
return list(set(s))

def unique2(keys):
unique = []
for i in keys:
if i not in unique:unique.append(i)
return unique

tmp = [0,1,2,4,2,2,3,4,1,3,2]
print tmp
print unique(tmp)
print unique2(tmp)
--------------------------
[0, 1, 2, 4, 2, 2, 3, 4, 1, 3, 2]
[0, 1, 2, 3, 4]
[0, 1, 2, 4, 3]

As you can see the end result is not the same.
I must get the end result [0, 1, 2, 4, 3] and not [0, 1, 2, 3, 4].
Thats why I use unique2()

Sep 15 '05 #15
there wasn't any information about ordering...
maybe i'll find something better which don't destroy original ordering

regards
przemek

Sep 15 '05 #16
i suppose this one is faster (but in most cases efficiency doesn't
matter)
def stable_unique(s):

e = {}
ret = []
for x in s:
if not e.has_key(x):
e[x] = 1
ret.append(x)
return ret

cheers,
przemek

Sep 15 '05 #17
Ow thanks , i'm I newbie and I did this test. (don't know if this is
the best way to do a small speed test)

import timeit

def unique2(keys):
unique = []
for i in keys:
if i not in unique:unique.append(i)
return unique

def unique3(s):
e = {}
ret = []
for x in s:
if not e.has_key(x):
e[x] = 1
ret.append(x)
return ret

tmp = [0,1,2,4,2,2,3,4,1,3,2]
s = """\
try:
str.__nonzero__
except AttributeError:
pass
"""
t = timeit.Timer(stmt=s)
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
print tmp
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
print unique2(tmp)
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
print unique3(tmp)
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
---------------------
5.80 usec/pass
[0, 1, 2, 4, 2, 2, 3, 4, 1, 3, 2]
7.51 usec/pass
[0, 1, 2, 4, 3]
6.93 usec/pass
[0, 1, 2, 4, 3]
6.45 usec/pass <--- your code unique2(s):

Sep 15 '05 #18
thanks, nice job. but this benchmark is pretty deceptive:

try this:
(definition of unique2 and unique3 as above)
import timeit
a = range(1000)
t = timeit.Timer('unique2(a)','from __main__ import unique2,a')
t2 = timeit.Timer('stable_unique(a)','from __main__ import stable_unique,a')
t2.timeit(2000) 1.8392596235778456 t.timeit(2000)

51.52945844819817

unique2 has quadratic complexity
unique3 has amortized linear complexity
what it means?
it means that speed of your algorithm strongly depends on
len(unique2(a)). the greater distinct elements in a the greater
difference in execution time of both implementations

regards
przemek

Sep 15 '05 #19
Thanks for all the information.
And now I understand the timeit module ;)

GC-Martijn

Sep 16 '05 #20
drochom wrote:
i suppose this one is faster (but in most cases efficiency doesn't
matter)
def stable_unique(s):


e = {}
ret = []
for x in s:
if not e.has_key(x):
e[x] = 1
ret.append(x)
return ret


I'll repeat Peter Otten's link to Tim Peters's recipe here:
http://aspn.activestate.com/ASPN/Coo.../Recipe/52560/
Read the comments at the end, they talk about order-preserving lists.

See Raymond Hettinger's response:

def uniq(alist) # Fastest order preserving
set = {}
return [set.setdefault(e,e) for e in alist if e not in set]

STeVe
Sep 17 '05 #21

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: Drew | last post by:
I have a permission tracking app that I am working on, and I have made the insert page for it. I am having issues on how to prevent duplicates from getting entered. Currently the interface for...
4
by: Gerry Abbott | last post by:
Hi all, I've bound cbo field in a form, which draws from a list, with its 'limit to list' property set to true.When i add an entry to the form, from the list, I want to remove this item from the...
5
by: Jani Yusef | last post by:
Based on an interview question I heard of but did not know the answer to....... How do you find and remove a loop from a singly linked list? In a google groups search I found the following code...
9
by: vbportal | last post by:
Hi, I would like to add BitArrays to an ArrayList and then remove any duplicates - can someone please help me forward. I seem to have (at leaset ;-) )2 problems/lack of understanding (see test...
0
by: makthar | last post by:
In your query use DISTINCT SELECT DISTINCT CITY FROM <tablename> WHERE STATE='<state name>'. This will bring only one of each city from the table. >-----Original Message----- >I'm getting a...
16
by: tyrfboard | last post by:
I've been searching for awhile now on how to remove duplicates from a table within an Access db and have found plenty of articles on finding or deleting duplicates. All I want to do is remove them...
6
by: Niyazi | last post by:
Hi all, What is fastest way removing duplicated value from string array using vb.net? Here is what currently I am doing but the the array contains over 16000 items. And it just do it in 10 or...
7
by: vsgdp | last post by:
I have a container of pointers. It is possible for two pointers to point to the same element. I want to remove duplicates. I am open to which container is best for this. I thought of using...
5
by: maphew | last post by:
Hello, I have some lists for which I need to remove duplicates. I found the sets.Sets() module which does exactly this, but how do I get the set back out again? # existing input: A,B,B,C,D #...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.