473,856 Members | 1,773 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

fastest way to find the intersection of n lists of sets

I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.

Here is some sample code:

# Doesn't make sense to union the sets - we're going to do
intersections later anyway
l1 = reduce(operator .add, list(x) for x in l1)
l2 = reduce(operator .add, list(x) for x in l2)
l3 = reduce(operator .add, list(x) for x in l3)

# Should I do this in two steps? Maybe by intersecting the two
shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)

I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.

Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?

Thanks,
Prateek

Apr 29 '07 #1
11 8578
Prateek wrote:
I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.

Here is some sample code:

# Doesn't make sense to union the sets - we're going to do
intersections later anyway
l1 = reduce(operator .add, list(x) for x in l1)
l2 = reduce(operator .add, list(x) for x in l2)
l3 = reduce(operator .add, list(x) for x in l3)

# Should I do this in two steps? Maybe by intersecting the two
shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)

I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.

Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?

Thanks,
Prateek
I don't understand why you cast to list. I would propose:

lists_of_sets = [l1, l2, l3]

reduce(set.inte rsection, (reduce(set.uni on, x) for x in lists_of_sets))

Since this is simplest, I'm guessing it should be fastest because I'm
also guessing that set impelmentation is as optimized as list--I think
this would hold true especially for large sets with sparse overlap
between lists. So it might be reasonable consider the content of your
sets and lists and write your code based on the content or on assuming a
particular composition.

James
Apr 29 '07 #2
James Stroud wrote:
Prateek wrote:
>I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.

Here is some sample code:

# Doesn't make sense to union the sets - we're going to do
intersection s later anyway
l1 = reduce(operator .add, list(x) for x in l1)
l2 = reduce(operator .add, list(x) for x in l2)
l3 = reduce(operator .add, list(x) for x in l3)

# Should I do this in two steps? Maybe by intersecting the two
shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)

I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.

Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?

Thanks,
Prateek

I don't understand why you cast to list. I would propose:

lists_of_sets = [l1, l2, l3]

reduce(set.inte rsection, (reduce(set.uni on, x) for x in lists_of_sets))

Since this is simplest, I'm guessing it should be fastest because I'm
also guessing that set impelmentation is as optimized as list--I think
this would hold true especially for large sets with sparse overlap
between lists. So it might be reasonable consider the content of your
sets and lists and write your code based on the content or on assuming a
particular composition.
Python sets are hashes, like dictionaries, not trees. Intersection
is implemented by iterating over the smallest set and trying all its keys
on the other set. The Python implementation compares the length of two
sets being intersected. This is OK (it's about O(N log N), maybe better).

For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.

How big are the sets? If they're small, but you have a lot of
them, you may be better off with a bit-set representation, then
using AND operations for intersection. If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.

When you ask questions like this, it's helpful to give some
background. We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.

John Nagle
Apr 29 '07 #3
For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.
Thanks. I thought so - I'm doing just that using a simple Decorate-
Sort-Undecorate idiom.
How big are the sets? If they're small, but you have a lot of
them, you may be better off with a bit-set representation, then
using AND operations for intersection. If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.
I have either 2 or 3 sets (never more) which can be arbitrarily large.
Most of them are small (between 0 and few elements - say less that 5).
A few will be megamonstrous ( 100,000 items)
When you ask questions like this, it's helpful to give some
background. We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.
Its definitely not a homework assignment - its part of a commercial
database query engine. Heavy implementation effort is no problem.

Prateek
Apr 29 '07 #4
On Apr 30, 3:48 am, James Stroud <jstr...@mbi.uc la.eduwrote:
Prateek wrote:
I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.
Here is some sample code:
# Doesn't make sense to union the sets - we're going to do
intersections later anyway
l1 = reduce(operator .add, list(x) for x in l1)
l2 = reduce(operator .add, list(x) for x in l2)
l3 = reduce(operator .add, list(x) for x in l3)
# Should I do this in two steps? Maybe by intersecting the two
shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)
I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.
Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?
Thanks,
Prateek

I don't understand why you cast to list. I would propose:
The reason why I'm casting to a list first is because I found that
creating a long list which I convert to a set in a single operation is
faster (although probably less memory efficient - which I can deal
with) than doing all the unions.

Prateek

Apr 29 '07 #5
Prateek wrote:
> For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.


Thanks. I thought so - I'm doing just that using a simple Decorate-
Sort-Undecorate idiom.

> How big are the sets? If they're small, but you have a lot of
them, you may be better off with a bit-set representation, then
using AND operations for intersection. If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.


I have either 2 or 3 sets (never more) which can be arbitrarily large.
Most of them are small (between 0 and few elements - say less that 5).
A few will be megamonstrous ( 100,000 items)

> When you ask questions like this, it's helpful to give some
background. We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.

Its definitely not a homework assignment - its part of a commercial
database query engine. Heavy implementation effort is no problem.

Prateek
If you're intersecting a set of 5 vs a set of 100,000, the
intersection time won't be the problem. That's just five lookups.
It's building a set of 100,000 items that may be time consuming.

Does the big set stay around for a while, or do you have to pay
that cost on each query?

Those really aren't big data sets on modern machines.

John Nagle
Apr 30 '07 #6
On Apr 30, 5:08 am, John Nagle <n...@animats.c omwrote:
Prateek wrote:
For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.
Thanks. I thought so - I'm doing just that using a simple Decorate-
Sort-Undecorate idiom.
How big are the sets? If they're small, but you have a lot of
them, you may be better off with a bit-set representation, then
using AND operations for intersection. If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.
I have either 2 or 3 sets (never more) which can be arbitrarily large.
Most of them are small (between 0 and few elements - say less that 5).
A few will be megamonstrous ( 100,000 items)
When you ask questions like this, it's helpful to give some
background. We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.
Its definitely not a homework assignment - its part of a commercial
database query engine. Heavy implementation effort is no problem.
Prateek

If you're intersecting a set of 5 vs a set of 100,000, the
intersection time won't be the problem. That's just five lookups.
It's building a set of 100,000 items that may be time consuming.

Does the big set stay around for a while, or do you have to pay
that cost on each query?

Those really aren't big data sets on modern machines.

John Nagle
100,000 is an arbitrary number - that is potentially equivalent to the
number of unique cells in all tables of a typical database (thats the
best analogy I can come up with since this isn't a typical RDBMS).

The big set does stay around for a while - I've implemented an LRU
based caching algorithm on the code that does the I/O. Since the db is
transactioned, I keep one copy in the current transaction cache (which
is a simple dictionary) and one in the main read cache (LRU based)
(which obviously survives across transactions). Since the largest sets
also tend to be the most frequently used, this basically solves my I/O
caching issue.

My problem is that I had ugly code doing a lot of unnecessary list <->
set casting. Here is what I've got now:

from itertools import chain
ids1 = [...], ids2 = [...], ids3 = [...]
_efs = frozenset()
# dbx.get calls return sets
l1 = frozenset(chain (*[db1.get(x, _efs) for x in ids1])
l2 = frozenset(chain (*[db2.get(x, _efs) for x in ids2])
l3 = frozenset(chain (*[db3.get(x, _efs) for x in ids3])

decorated = [(len(x), x) for x in [l1, l2, l3]]
decorated.sort( )
result = reduce(set.inte rsection, [x[1] for x in decorated])

What do you think?

Prateek

Apr 30 '07 #7
Prateek <su*****@gmail. comwrites:
The big set does stay around for a while - I've implemented an LRU
based caching algorithm on the code that does the I/O. Since the db is
transactioned, I keep one copy in the current transaction cache (which
is a simple dictionary) and one in the main read cache (LRU based)
(which obviously survives across transactions). Since the largest sets
also tend to be the most frequently used, this basically solves my I/O
caching issue.
The best approach varies from instance to instance. Some big
databases often will do stuff like statistically sample the sets from
a given query, try a few different strategies on the samples in order
to figure out which one works best on them, then use the apparent best
strategy on the full sets.
Apr 30 '07 #8
Prateek <su*****@gmail. comwrote:
For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.

Thanks. I thought so - I'm doing just that using a simple Decorate-
Sort-Undecorate idiom.
Use, instead, the DSU that is now build into Python:

listofsets.sort (key=len)

this will be faster (as well as more readable &c) than programming your
own DSU, and is exactly the reason the key= parameter was added.

I also suggest avoiding reduce in favor of a simple explicit loop.
Alex
Apr 30 '07 #9
[Prateek]
I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.
.. . .
l1 = reduce(operator .add, list(x) for x in l1)
l2 = reduce(operator .add, list(x) for x in l2)
l3 = reduce(operator .add, list(x) for x in l3)
s = frozenset(l1) & frozenset(l2) & frozenset(l3)
I would write it like this:

def multi_union(lis tofsets):
result = set()
for s in listofsets:
result |= s
return result

def multi_intersect ion(listofsets) :
return reduce(set.inte rsection, sorted(listofse ts, key=len))

s = multi_intersect ion(map(multi_u nion, [l1, l2, l3]))

I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.
Frozensets share the same implementation code as regular sets.
They are not more efficient.

Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?
The sets.py module shows how to implement set operations using
itertools.
In general, real set objects should do as well or better than anything
you can cook-up using itertools.

Real set objects have the advantage of knowing the hash values of
their
elements. Accordingly, binary set operations can run without any
calls to element.__hash_ _().
Raymond Hettinger

Apr 30 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
80996
by: Mickel Grönroos | last post by:
Hi! Are there any standard list methods for getting the intersection and difference of two lists? (The union is easy ("list1.extend(list2)"), unless you want it to contain unique values.) Here is what I would like to have: list1 = list2 =
5
9179
by: Antoine Logean | last post by:
Hi, What is the easiest way to get the intersection of two strings in python (a kind a "and" operator) ? ex: string_1 = "the_car_of_my_fried_is_bigger_as_mine_but_my_girlfriend_is_more_beautifull" string_2 = "my_girlfriend_is_more_beautifull_and_has_blue_eyes"
17
2794
by: Gordon Williams | last post by:
Hi, I have to lists that I need to find the common numbers (2nd rounded to nearest integral) and I am wondering if there is a more efficient way of doing it. >>> a= >>> b= >>> (l,round(m))]
7
2222
by: les_ander | last post by:
Hi, I have 2 lists of tuples that look like: E1= and E2=. In this tuple, the ordering does not matter, i.e. (u,v) is the same as (v,u). What I want to do is the following: given 2 list of tuples, E1 and E2, I want to create another list with tuples that are common to both. So in the above example I would like
2
2538
by: James Stroud | last post by:
Hello All, I find myself in this situation from time to time: I want to compare two lists of arbitrary objects and (1) find those unique to the first list, (2) find those unique to the second list, (3) find those that overlap. But here is the catch: comparison is not straight-forward. For example, I will want to compare 2 objects based on a set of common attributes. These two objects need not be members of the same class, etc. A function...
3
5511
by: ryu | last post by:
Hi, May I know how to do an intersection of sets using C#? Where the number of sets will only be known during runtime. Many Thanks
3
2612
by: Suresh Jeevanandam | last post by:
I have a list of sets in variable lsets . Now I want to find the intersection of all the sets. r = lsets for s in r: r = r & s Is there any other shorter way?
2
2329
by: mkppk | last post by:
I have kind of strange change I'd like to make to the sets.Set() intersection() method.. Normally, intersection would return items in both s1 and s2 like with something like this: s1.intersection(s2) I want the item matching to be a bit "looser".. that is, items in s2 that match to just the beginning of items in s1 would be included in the result of intersection().
22
2725
by: SETT Programming Contest | last post by:
The SETT Programming Contest: The fastest set<Timplementation Write the fastest set<Timplementation using only standard C++/C. Ideally it should have the same interface like std::set. At least the following methods must be implemented: insert(), find(), begin(), end(), erase(), size(), operator<(), and at least the forward iterator. Here, speed and correctness are the 2 most important factors. Functionally it should behave similar to...
0
9916
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9762
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
11056
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10782
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10384
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
7932
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
7093
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5761
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
2
4174
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.