469,923 Members | 1,357 Online

# filter list fast

I have a list I filter using another list and I would like this to be
as fast as possible
right now I do like this:

[x for x in list1 if x not in list2]

i tried using the method filter:

filter(lambda x: x not in list2, list1)

but it didn't make much difference, because of lambda I guess
is there any way I can speed this up

Mar 18 '06 #1
9 1505
lars_woetmann wrote:
I have a list I filter using another list and I would like this to be
as fast as possible
right now I do like this:

[x for x in list1 if x not in list2]

i tried using the method filter:

filter(lambda x: x not in list2, list1)

but it didn't make much difference, because of lambda I guess
is there any way I can speed this up

Both of these techniques are O(n^2). You can reduce it to O(n log n)
by using sets:
set2 = set(list2)
[x for x in list1 if x not in set2]

Checking to see if an item is in a set is much more efficient than a
list.

--Ben

Mar 18 '06 #2
lars_woetmann wrote:
I have a list I filter using another list and I would like this to be
as fast as possible right now I do like this:

[x for x in list1 if x not in list2]

i tried using the method filter:

filter(lambda x: x not in list2, list1)

but it didn't make much difference, because of lambda I guess
is there any way I can speed this up

if list2 is a list object, "not in list2" is an O(N) operation.

maybe you should use sets instead ? does the following work
better ?

set2 = set(list2)
result = [x for x in list1 if x not in set2]

?

</F>

Mar 18 '06 #3
Lars Woetmann wrote:
I have a list I filter using another list and I would like
this to be as fast as possible
right now I do like this:

[x for x in list1 if x not in list2]

i tried using the method filter:

filter(lambda x: x not in list2, list1)

but it didn't make much difference, because of lambda I guess
is there any way I can speed this up

If you use a reasonably new python version, you could use sets:

#v+
a = set(range(10))
a set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) b = set(range(5, 15))
b set([5, 6, 7, 8, 9, 10, 11, 12, 13, 14]) a.difference(b) set([0, 1, 2, 3, 4]) a-b set([0, 1, 2, 3, 4]) list(a-b) [0, 1, 2, 3, 4]

#v-

Cheers,

--
Klaus Alexander Seistrup
SubZeroNet, Copenhagen, Denmark
http://magnetic-ink.dk/
Mar 18 '06 #4
> Both of these techniques are O(n^2). You can reduce it to O(n log n)
by using sets:
set2 = set(list2)
[x for x in list1 if x not in set2]

Checking to see if an item is in a set is much more efficient than a
list.

Is the set-lookup reliably O(log n)? I was under the impression that it is
hash-based, and this should be O(1) usually, but couldbve O(n) worst-case
(hash the same for _all_ entries).

Regards,

Diez
Mar 18 '06 #5
Diez B. Roggisch wrote:
Both of these techniques are O(n^2). You can reduce it to O(n log n)
by using sets:
>set2 = set(list2)
>[x for x in list1 if x not in set2]

Checking to see if an item is in a set is much more efficient than a
list.

Is the set-lookup reliably O(log n)? I was under the impression that it is
hash-based, and this should be O(1) usually, but couldbve O(n) worst-case
(hash the same for _all_ entries).

That's largely a theoretical concern. Google for something like

'''dict worst-case performance "tim peters"'''

to learn more. (The third article there (no doubt obsolete in some
ways, given that it was in 2000) says that Python "keeps at least 1/3 of
the internal hash table entries unused, making collisions very rarely a
problem... It's possible to contrive keys that will cause collisions
systematically ... but unlikely to happen by accident in 2.0")

-Peter

Mar 18 '06 #6
Thanks all, sets work like i charm

Mar 18 '06 #7
What was the speed-up ?

Mar 18 '06 #8
comparing
[x for x in list1 if x not in list2]
with
set1, set2 = set(list1), set(list2)
list(set1-set2)

gives something like
len(list2) speedup
------------------------------
100 10
1000 100
10000 1000

the speedup is constant for different len(list1)

Mar 18 '06 #9
Thanks.

Mar 18 '06 #10

### This discussion thread is closed

Replies have been disabled for this discussion.