
# fastest way to find the intersection of n lists of sets

I have 3 variable-length lists of sets. I need to find the common elements in each list (across sets) really really quickly. Here is some sample code:

```python
# Doesn't make sense to union the sets - we're going to do
# intersections later anyway
l1 = reduce(operator.add, (list(x) for x in l1))
l2 = reduce(operator.add, (list(x) for x in l2))
l3 = reduce(operator.add, (list(x) for x in l3))

# Should I do this in two steps? Maybe by intersecting the two
# shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)
```

(The generator expressions need parentheses here, since `reduce` takes a second argument.)

I'm assuming frozensets are (somehow) quicker than sets because they're immutable.

Any code suggestions? Maybe using something in the new fancy-schmancy itertools module?

Thanks,
Prateek

Apr 29 '07 #1
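A runnable sketch of the approach in the question, using made-up sample data (the real `l1`/`l2`/`l3` come from the poster's application and are not shown in the thread):

```python
import operator
from functools import reduce  # a builtin in Python 2; explicit import for Python 3

# Hypothetical sample data standing in for the poster's lists of sets
l1 = [{1, 2, 3}, {3, 4}]
l2 = [{2, 3, 5}, {3, 6}]
l3 = [{3, 7}, {2, 3, 8}]

# Flatten each list of sets into one long list of elements
f1 = reduce(operator.add, (list(x) for x in l1))
f2 = reduce(operator.add, (list(x) for x in l2))
f3 = reduce(operator.add, (list(x) for x in l3))

# Intersect the flattened lists as frozensets
s = frozenset(f1) & frozenset(f2) & frozenset(f3)
print(s)  # the elements common to all three flattened lists
```

With this sample data the result is `frozenset({2, 3})`.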
11 Replies

James Stroud wrote, quoting Prateek's question:

> I have 3 variable length lists of sets. I need to find the common elements in each list (across sets) really really quickly. [...]

I don't understand why you cast to list. I would propose:

```python
lists_of_sets = [l1, l2, l3]
reduce(set.intersection, (reduce(set.union, x) for x in lists_of_sets))
```

Since this is simplest, I'm guessing it should be fastest, because I'm also guessing that the set implementation is as optimized as list's; I think this would hold true especially for large sets with sparse overlap between lists. So it might be reasonable to consider the content of your sets and lists and write your code based on that content, or on an assumed composition.

James

Apr 29 '07 #2
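James's suggestion, assembled into a self-contained sketch with hypothetical data:

```python
from functools import reduce  # needed on Python 3

# Hypothetical lists of sets
l1 = [{1, 2, 3}, {3, 4}]
l2 = [{2, 3, 5}, {3, 6}]
l3 = [{3, 7}, {2, 3, 8}]

lists_of_sets = [l1, l2, l3]

# Union each inner list into one set, then intersect the three unions
result = reduce(set.intersection,
                (reduce(set.union, x) for x in lists_of_sets))
print(result)  # {2, 3} for this sample data
```

No casting to list is involved: the inner `reduce` unions each list's sets directly, and the outer `reduce` intersects the results.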

John Nagle wrote, quoting James Stroud's suggestion:

> lists_of_sets = [l1, l2, l3]
> reduce(set.intersection, (reduce(set.union, x) for x in lists_of_sets))

Python sets are hashes, like dictionaries, not trees. Intersection is implemented by iterating over the smallest set and trying all its keys on the other set. The Python implementation compares the lengths of the two sets being intersected. This is OK (it's about O(N log N), maybe better).

For the above example, it's worth sorting lists_of_sets by the length of the sets, and doing the short ones first.

How big are the sets? If they're small, but you have a lot of them, you may be better off with a bit-set representation, then using AND operations for intersection. If they're huge (tens of millions of entries), you might be better off doing sorts and merges on the sets.

When you ask questions like this, it's helpful to give some background. We don't know whether this is a homework assignment, or some massive application that's slow and you need to fix it, even if it requires heavy implementation effort.

John Nagle

Apr 29 '07 #3
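Both of John's suggestions can be sketched briefly. The data here is hypothetical, and the bit-set version assumes the elements are small non-negative integers (packing each set into a Python int so that intersection becomes a single `&`):

```python
from functools import reduce

# Hypothetical sets
sets_ = [{2, 3, 7, 8}, {1, 2, 3, 4}, {2, 3, 5, 6}]

# 1. Intersect shortest-first, so intermediate results stay small
result = reduce(set.intersection, sorted(sets_, key=len))

# 2. Bit-set representation: pack each set into an int, intersect with &
def to_bits(s):
    mask = 0
    for i in s:          # assumes small non-negative integer elements
        mask |= 1 << i
    return mask

masks = [to_bits(s) for s in sets_]
combined = reduce(lambda a, b: a & b, masks)

# Unpack the combined mask back into a set of element values
recovered = {i for i in range(combined.bit_length()) if combined >> i & 1}
```

For this sample data both approaches yield `{2, 3}`. The bit-set form trades generality (integer elements only) for very cheap AND operations.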

Prateek wrote, replying to John Nagle:

> For the above example, it's worth sorting lists_of_sets by the length of the sets, and doing the short ones first.

Thanks. I thought so - I'm doing just that using a simple Decorate-Sort-Undecorate idiom.

> How big are the sets? If they're small, but you have a lot of them, you may be better off with a bit-set representation [...] If they're huge (tens of millions of entries), you might be better off doing sorts and merges on the sets.

I have either 2 or 3 sets (never more) which can be arbitrarily large. Most of them are small (between 0 and a few elements - say less than 5). A few will be megamonstrous (100,000 items).

> When you ask questions like this, it's helpful to give some background. We don't know whether this is a homework assignment, or some massive application that's slow and you need to fix it, even if it requires heavy implementation effort.

It's definitely not a homework assignment - it's part of a commercial database query engine. Heavy implementation effort is no problem.

Prateek

Apr 29 '07 #4

On Apr 30, 3:48 am, James Stroud [remainder of post missing]

John Nagle wrote, quoting Prateek's reply:

> I have either 2 or 3 sets (never more) which can be arbitrarily large. Most of them are small (between 0 and a few elements - say less than 5). A few will be megamonstrous (100,000 items).

If you're intersecting a set of 5 vs a set of 100,000, the intersection time won't be the problem. That's just five lookups. It's building a set of 100,000 items that may be time consuming.

Does the big set stay around for a while, or do you have to pay that cost on each query?

Those really aren't big data sets on modern machines.

John Nagle

Apr 30 '07 #6
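A quick hypothetical micro-benchmark illustrating John's point: rebuilding the 100,000-element set dominates the cost, while intersecting it with a 5-element set is only a handful of hash lookups. The element values are made up:

```python
import timeit

big = set(range(100_000))
small = {3, 50_000, 99_999, 123_456, 999_999}

# Cost paid per query if the big set must be rebuilt each time
build = timeit.timeit('set(range(100_000))', number=10)

# Cost of the intersection itself: iterates only the smaller operand
inter = timeit.timeit('small & big', globals=globals(), number=10)

print(f'build: {build:.4f}s  intersect: {inter:.4f}s')
```

On typical hardware the build time is orders of magnitude larger than the intersection time, so keeping the big set cached between queries is the win.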

Prateek wrote, replying to John Nagle (Apr 30, 5:08 am):

Here is what I've got now:

```python
from itertools import chain

ids1 = [...]
ids2 = [...]
ids3 = [...]

_efs = frozenset()
# dbx.get calls return sets
l1 = frozenset(chain(*[db1.get(x, _efs) for x in ids1]))
l2 = frozenset(chain(*[db2.get(x, _efs) for x in ids2]))
l3 = frozenset(chain(*[db3.get(x, _efs) for x in ids3]))

decorated = [(len(x), x) for x in [l1, l2, l3]]
decorated.sort()
# undecorate: intersect the sets, not the (len, set) tuples
result = reduce(frozenset.intersection, [x[1] for x in decorated])
```

What do you think?

Prateek

Apr 30 '07 #7

Prateek [remainder of post missing]

Prateek [remainder of post missing]

Raymond Hettinger wrote:

[Prateek]
> I have 3 variable length lists of sets. I need to find the common elements in each list (across sets) really really quickly.
> [...]
> l1 = reduce(operator.add, list(x) for x in l1)
> l2 = reduce(operator.add, list(x) for x in l2)
> l3 = reduce(operator.add, list(x) for x in l3)
> s = frozenset(l1) & frozenset(l2) & frozenset(l3)

I would write it like this:

```python
def multi_union(listofsets):
    result = set()
    for s in listofsets:
        result |= s
    return result

def multi_intersection(listofsets):
    return reduce(set.intersection, sorted(listofsets, key=len))

s = multi_intersection(map(multi_union, [l1, l2, l3]))
```

> I'm assuming frozensets are (somehow) quicker than sets because they're immutable.

Frozensets share the same implementation code as regular sets. They are not more efficient.

> Any code suggestions? Maybe using something in the new fancy-schmancy itertools module?

The sets.py module shows how to implement set operations using itertools. In general, real set objects should do as well or better than anything you can cook up using itertools. Real set objects have the advantage of knowing the hash values of their elements. Accordingly, binary set operations can run without any calls to element.__hash__().

Raymond Hettinger

Apr 30 '07 #10
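Raymond's two helpers, assembled into a runnable example with made-up data (the `list(...)` around `map` is only needed on Python 3, where `map` returns an iterator and `sorted` would otherwise consume it inside `multi_intersection`):

```python
from functools import reduce  # needed on Python 3

def multi_union(listofsets):
    # Grow one result set in place; no intermediate sets are created
    result = set()
    for s in listofsets:
        result |= s
    return result

def multi_intersection(listofsets):
    # Intersect shortest-first so intermediate results stay small
    return reduce(set.intersection, sorted(listofsets, key=len))

# Hypothetical data
l1 = [{1, 2, 3}, {3, 4}]
l2 = [{2, 3, 5}, {3, 6}]
l3 = [{3, 7}, {2, 3, 8}]

s = multi_intersection(list(map(multi_union, [l1, l2, l3])))
print(s)  # {2, 3} for this sample data
```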

Raymond Hettinger wrote:

[Prateek]
> The reason why I'm casting to a list first is because I found that creating a long list which I convert to a set in a single operation is faster (although probably less memory efficient - which I can deal with) than doing all the unions.

That would be a surprising result, because set-to-set operations do not have to recompute hash values. Also, underneath the hood, both approaches share the exact same implementation for inserting new values once the hash value is known.

If you've seen an unfavorable speed comparison, then you most likely had code that built new intermediate sets between steps:

```python
common = s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9
```

Instead, it is faster to build up a single result set:

```python
common = set()
for s in s1, s2, s3, s4, s5, s6, s7, s8, s9:
    common |= s
```

Raymond Hettinger

Apr 30 '07 #11
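The difference Raymond describes, sketched with hypothetical sets: each `|` in the chained form allocates a fresh intermediate set, while the loop grows one result set in place via `|=`:

```python
# Hypothetical input sets with overlapping ranges
sets_ = [set(range(i, i + 5)) for i in range(0, 20, 2)]

# Chained unions: every | builds and discards an intermediate set
chained = sets_[0] | sets_[1] | sets_[2]

# Incremental union: one result set, updated in place
common = set()
for s in sets_:
    common |= s
```

Both forms compute the same union; the incremental version simply avoids the throwaway allocations, which is what shows up in timings on many sets.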

On Apr 30, 12:37 pm, Raymond Hettinger [remainder of post missing]
