I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.
Here is some sample code:
# Doesn't make sense to union the sets - we're going to do
# intersections later anyway
l1 = reduce(operator.add, (list(x) for x in l1))
l2 = reduce(operator.add, (list(x) for x in l2))
l3 = reduce(operator.add, (list(x) for x in l3))
# Should I do this in two steps? Maybe by intersecting the two
# shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)
I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.
Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?
Thanks,
Prateek
Prateek wrote:
I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.
Here is some sample code:
# Doesn't make sense to union the sets - we're going to do
# intersections later anyway
l1 = reduce(operator.add, (list(x) for x in l1))
l2 = reduce(operator.add, (list(x) for x in l2))
l3 = reduce(operator.add, (list(x) for x in l3))
# Should I do this in two steps? Maybe by intersecting the two
# shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)
I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.
Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?
Thanks,
Prateek
I don't understand why you cast to list. I would propose:
lists_of_sets = [l1, l2, l3]
reduce(set.intersection, (reduce(set.union, x) for x in lists_of_sets))
Since this is simplest, I'm guessing it should be fastest, because I'm
also guessing that the set implementation is as optimized as list's. I think
this would hold true especially for large sets with sparse overlap
between lists. So it might be reasonable to consider the content of your
sets and lists and write your code based on that content, or on assuming a
particular composition.
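With some made-up sample data, that proposal runs like so (reduce is a builtin in Python 2, where this thread lives; Python 3 needs functools):

```python
from functools import reduce  # builtin in Python 2

# Hypothetical sample data: three lists of sets
l1 = [{1, 2, 3}, {2, 3, 4}]
l2 = [{2, 3, 5}, {3, 6}]
l3 = [{3, 4}, {2, 3, 7}]

lists_of_sets = [l1, l2, l3]
# Union each list of sets into one set, then intersect the unions
result = reduce(set.intersection,
                (reduce(set.union, x) for x in lists_of_sets))
# unions are {1,2,3,4}, {2,3,5,6}, {2,3,4,7}; their intersection is {2, 3}
```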
James
James Stroud wrote:
Prateek wrote:
> I have 3 variable length lists of sets. I need to find the common
> elements in each list (across sets) really really quickly.
> Here is some sample code:
> # Doesn't make sense to union the sets - we're going to do
> # intersections later anyway
> l1 = reduce(operator.add, (list(x) for x in l1))
> l2 = reduce(operator.add, (list(x) for x in l2))
> l3 = reduce(operator.add, (list(x) for x in l3))
> # Should I do this in two steps? Maybe by intersecting the two
> # shortest lists first?
> s = frozenset(l1) & frozenset(l2) & frozenset(l3)
> I'm assuming frozensets are (somehow) quicker than sets because
> they're immutable.
> Any code suggestions? Maybe using something in the new fancy-schmancy
> itertools module?
> Thanks, Prateek
I don't understand why you cast to list. I would propose:
lists_of_sets = [l1, l2, l3]
reduce(set.intersection, (reduce(set.union, x) for x in lists_of_sets))
Since this is simplest, I'm guessing it should be fastest, because I'm
also guessing that the set implementation is as optimized as list's. I think
this would hold true especially for large sets with sparse overlap
between lists. So it might be reasonable to consider the content of your
sets and lists and write your code based on that content, or on assuming a
particular composition.
Python sets are hash tables, like dictionaries, not trees. Intersection
is implemented by iterating over the smallest set and trying all its keys
on the other set. The Python implementation compares the lengths of the two
sets being intersected and iterates over the smaller one. This is OK (it's
about O(min(N, M)) on average).
For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.
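That ordering can be sketched like this (hypothetical data; `set.intersection` accepts multiple arguments, so sorting by length just decides which set is scanned first):

```python
# Sketch: intersect smallest-first so each step probes the fewest keys
sets = [set(range(100)), set(range(0, 100, 2)), set(range(0, 100, 5))]
smallest_first = sorted(sets, key=len)
result = set.intersection(*smallest_first)
# elements divisible by both 2 and 5 below 100, i.e. multiples of 10
```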
How big are the sets? If they're small, but you have a lot of
them, you may be better off with a bitset representation, then
using AND operations for intersection. If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.
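For small sets of small integers, the bitset idea might look like the following sketch. The helper names and the 64-element universe are assumptions for illustration, not anything from the thread:

```python
def to_bitset(s):
    """Pack a set of small non-negative ints into one integer."""
    bits = 0
    for x in s:
        bits |= 1 << x
    return bits

def from_bitset(bits):
    """Unpack back to a set (universe of 0..63 assumed)."""
    return {i for i in range(64) if bits >> i & 1}

a = to_bitset({1, 5, 9})
b = to_bitset({5, 9, 12})
common = from_bitset(a & b)  # intersection is a single AND
```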
When you ask questions like this, it's helpful to give some
background. We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.
John Nagle
For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.
Thanks. I thought so - I'm doing just that using a simple
Decorate-Sort-Undecorate idiom.
How big are the sets? If they're small, but you have a lot of
them, you may be better off with a bitset representation, then
using AND operations for intersection. If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.
I have either 2 or 3 sets (never more) which can be arbitrarily large.
Most of them are small (between 0 and a few elements - say, less than 5).
A few will be mega-monstrous (>100,000 items).
When you ask questions like this, it's helpful to give some
background. We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.
It's definitely not a homework assignment - it's part of a commercial
database query engine. Heavy implementation effort is no problem.
Prateek
On Apr 30, 3:48 am, James Stroud <jstr...@mbi.ucla.edu> wrote:
Prateek wrote:
I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.
Here is some sample code:
# Doesn't make sense to union the sets - we're going to do
# intersections later anyway
l1 = reduce(operator.add, (list(x) for x in l1))
l2 = reduce(operator.add, (list(x) for x in l2))
l3 = reduce(operator.add, (list(x) for x in l3))
# Should I do this in two steps? Maybe by intersecting the two
# shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)
I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.
Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?
Thanks,
Prateek
I don't understand why you cast to list. I would propose:
The reason I'm casting to a list first is that I found creating one
long list and converting it to a set in a single operation is
faster (although probably less memory efficient - which I can deal
with) than doing all the unions.
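A quick way to check a claim like that is timeit. This sketch compares the two approaches on made-up data; the absolute numbers are machine-dependent and the data shape (many small sets with heavy overlap) is an assumption:

```python
import timeit

setup = """
import operator
from functools import reduce  # builtin in Python 2
lists = [set(range(i, i + 50)) for i in range(0, 2000, 25)]
"""
# OP's approach: concatenate into one big list, then build one set
via_list = "set(reduce(operator.add, (list(x) for x in lists), []))"
# Direct approach: union the sets together
via_union = "reduce(operator.or_, lists, set())"
t_list = timeit.timeit(via_list, setup, number=10)
t_union = timeit.timeit(via_union, setup, number=10)
```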
Prateek
Prateek wrote:
> For the above example, it's worth sorting lists_of_sets by the length of the sets, and doing the short ones first.
Thanks. I thought so - I'm doing just that using a simple
Decorate-Sort-Undecorate idiom.
> How big are the sets? If they're small, but you have a lot of them, you may be better off with a bitset representation, then using AND operations for intersection. If they're huge (tens of millions of entries), you might be better off doing sorts and merges on the sets.
I have either 2 or 3 sets (never more) which can be arbitrarily large.
Most of them are small (between 0 and a few elements - say, less than 5).
A few will be mega-monstrous (>100,000 items).
> When you ask questions like this, it's helpful to give some background. We don't know whether this is a homework assignment, or some massive application that's slow and you need to fix it, even if it requires heavy implementation effort.
It's definitely not a homework assignment - it's part of a commercial
database query engine. Heavy implementation effort is no problem.
Prateek
If you're intersecting a set of 5 vs a set of 100,000, the
intersection time won't be the problem. That's just five lookups.
It's building a set of 100,000 items that may be time consuming.
Does the big set stay around for a while, or do you have to pay
that cost on each query?
Those really aren't big data sets on modern machines.
John Nagle
On Apr 30, 5:08 am, John Nagle <n...@animats.com> wrote:
Prateek wrote:
For the above example, it's worth sorting lists_of_sets by the length of the sets, and doing the short ones first.
Thanks. I thought so - I'm doing just that using a simple
Decorate-Sort-Undecorate idiom.
How big are the sets? If they're small, but you have a lot of them, you may be better off with a bitset representation, then using AND operations for intersection. If they're huge (tens of millions of entries), you might be better off doing sorts and merges on the sets.
I have either 2 or 3 sets (never more) which can be arbitrarily large.
Most of them are small (between 0 and a few elements - say, less than 5).
A few will be mega-monstrous (>100,000 items).
When you ask questions like this, it's helpful to give some background. We don't know whether this is a homework assignment, or some massive application that's slow and you need to fix it, even if it requires heavy implementation effort.
It's definitely not a homework assignment - it's part of a commercial
database query engine. Heavy implementation effort is no problem.
Prateek
If you're intersecting a set of 5 vs a set of 100,000, the
intersection time won't be the problem. That's just five lookups.
It's building a set of 100,000 items that may be time consuming.
Does the big set stay around for a while, or do you have to pay
that cost on each query?
Those really aren't big data sets on modern machines.
John Nagle
100,000 is an arbitrary number - it is potentially equivalent to the
number of unique cells in all tables of a typical database (that's the
best analogy I can come up with, since this isn't a typical RDBMS).
The big set does stay around for a while - I've implemented an LRU-based
caching algorithm in the code that does the I/O. Since the db is
transactional, I keep one copy in the current transaction cache (which
is a simple dictionary) and one in the main read cache (LRU-based,
which obviously survives across transactions). Since the largest sets
also tend to be the most frequently used, this basically solves my I/O
caching issue.
My problem is that I had ugly code doing a lot of unnecessary list <->
set conversion. Here is what I've got now:
from itertools import chain
ids1 = [...]
ids2 = [...]
ids3 = [...]
_efs = frozenset()
# dbx.get calls return sets
l1 = frozenset(chain(*[db1.get(x, _efs) for x in ids1]))
l2 = frozenset(chain(*[db2.get(x, _efs) for x in ids2]))
l3 = frozenset(chain(*[db3.get(x, _efs) for x in ids3]))
decorated = [(len(x), x) for x in [l1, l2, l3]]
decorated.sort()
result = reduce(frozenset.intersection, [x[1] for x in decorated])
What do you think?
Prateek
Prateek <su*****@gmail.com> writes:
The big set does stay around for a while - I've implemented an LRU-based
caching algorithm in the code that does the I/O. Since the db is
transactional, I keep one copy in the current transaction cache (which
is a simple dictionary) and one in the main read cache (LRU-based,
which obviously survives across transactions). Since the largest sets
also tend to be the most frequently used, this basically solves my I/O
caching issue.
The best approach varies from instance to instance. Some big
database engines will do things like statistically sample the sets from
a given query, try a few different strategies on the samples in order
to figure out which one works best on them, then use the apparent best
strategy on the full sets.
Prateek <su*****@gmail.com> wrote:
For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.
Thanks. I thought so - I'm doing just that using a simple
Decorate-Sort-Undecorate idiom.
Use, instead, the DSU that is now built into Python:
listofsets.sort(key=len)
this will be faster (as well as more readable &c) than programming your
own DSU, and is exactly the reason the key= parameter was added.
I also suggest avoiding reduce in favor of a simple explicit loop.
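Both suggestions together, as a sketch with made-up data:

```python
listofsets = [{1, 2, 3}, {2}, {2, 3}]  # hypothetical sample data

# key=len replaces the hand-rolled decorate-sort-undecorate
listofsets.sort(key=len)

# explicit loop instead of reduce, intersecting shortest-first
result = set(listofsets[0])
for s in listofsets[1:]:
    result &= s
# the only element in every set is 2
```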
Alex
[Prateek]
I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.
...
l1 = reduce(operator.add, (list(x) for x in l1))
l2 = reduce(operator.add, (list(x) for x in l2))
l3 = reduce(operator.add, (list(x) for x in l3))
s = frozenset(l1) & frozenset(l2) & frozenset(l3)
I would write it like this:
def multi_union(listofsets):
    result = set()
    for s in listofsets:
        result |= s
    return result

def multi_intersection(listofsets):
    return reduce(set.intersection, sorted(listofsets, key=len))

s = multi_intersection(map(multi_union, [l1, l2, l3]))
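Exercised with made-up data (the helpers are restated so the snippet stands alone, with reduce imported for Python 3):

```python
from functools import reduce  # builtin in Python 2

def multi_union(listofsets):
    result = set()
    for s in listofsets:
        result |= s
    return result

def multi_intersection(listofsets):
    # sort by size so the smallest union drives the intersection
    return reduce(set.intersection, sorted(listofsets, key=len))

l1 = [{1, 2}, {2, 3}]   # union {1, 2, 3}
l2 = [{2, 3, 4}]        # union {2, 3, 4}
l3 = [{2, 5}, {3, 2}]   # union {2, 3, 5}
s = multi_intersection(list(map(multi_union, [l1, l2, l3])))
# s is {2, 3}
```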
I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.
Frozensets share the same implementation code as regular sets.
They are not more efficient.
Any code suggestions? Maybe using something in the new fancyschmancy
itertools module?
The sets.py module shows how to implement set operations using
itertools.
In general, real set objects should do as well or better than anything
you can cook up using itertools.
Real set objects have the advantage of knowing the hash values of their
elements. Accordingly, binary set operations can run without any
calls to element.__hash__().
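That claim can be observed in CPython by counting `__hash__` calls on a noisy element type. The `Noisy` class is an illustrative assumption, and the zero-call result relies on CPython storing each element's hash in the set's table:

```python
class Noisy:
    """Element that counts how often its __hash__ runs."""
    calls = 0
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        Noisy.calls += 1
        return hash(self.v)
    def __eq__(self, other):
        return self.v == other.v

a = {Noisy(1), Noisy(2)}
b = {Noisy(2), Noisy(3)}
before = Noisy.calls        # hashing happened while building a and b
c = a & b                   # set-to-set intersection reuses stored hashes
during = Noisy.calls - before
# during is 0 in CPython: no element.__hash__() calls during the intersection
```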
Raymond Hettinger
[Prateek]
The reason why I'm casting to a list first is because I found that
creating a long list which I convert to a set in a single operation is
faster (although probably less memory efficient  which I can deal
with) than doing all the unions.
That would be a surprising result because set-to-set operations do
not have to recompute hash values. Also, underneath-the-hood,
both approaches share the exact same implementation for inserting
new values once the hash value is known.
If you've seen an unfavorable speed comparison, then you most likely
had code that built new intermediate sets between steps:
common = s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9
Instead, it is faster to build up a single result set:
common = set()
for s in (s1, s2, s3, s4, s5, s6, s7, s8, s9):
    common |= s
Raymond Hettinger
On Apr 30, 12:37 pm, Raymond Hettinger <pyt...@rcn.com> wrote:
[Prateek]
The reason why I'm casting to a list first is because I found that
creating a long list which I convert to a set in a single operation is
faster (although probably less memory efficient  which I can deal
with) than doing all the unions.
That would be a surprising result because set-to-set operations do
not have to recompute hash values. Also, underneath-the-hood,
both approaches share the exact same implementation for inserting
new values once the hash value is known.
If you've seen an unfavorable speed comparison, then you most likely
had code that built new intermediate sets between steps:
common = s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9
Instead, it is faster to build up a single result set:
common = set()
for s in (s1, s2, s3, s4, s5, s6, s7, s8, s9):
    common |= s
Raymond Hettinger
Thanks Raymond,
This was my old code:
self.lv is a dictionary which retrieves data from the disk or cache
v_r = reduce(operator.or_, [self.lv[x.id] for x in v], set())
This code ran faster:
v_r = reduce(operator.add, [list(self.lv.get(x.id, [])) for x in v], [])
v_r = set(v_r)
I was doing 3 of these and then intersecting them.
Now, I'm doing...
v_r = set()
_efs = frozenset()
for y in [self.lv.get(x.id, _efs) for x in v]:
    v_r |= y
Since the number of sets is always 2 or 3, I just do the intersections
explicitly like so:
if len(list_of_unioned_sets) == 3:
    result = list_of_unioned_sets[0]
    result &= list_of_unioned_sets[1]
    result &= list_of_unioned_sets[2]
elif len(list_of_unioned_sets) == 2:
    result = list_of_unioned_sets[0]
    result &= list_of_unioned_sets[1]
else:
    # Do something else...
Sorry for the relatively nondescript variable names.
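Since the branches only fix an intersection order, they could also collapse into a single call; a sketch with made-up data, sorted shortest-first as suggested earlier in the thread:

```python
list_of_unioned_sets = [{1, 2, 3}, {2, 3}, {2, 3, 4}]  # hypothetical data
if list_of_unioned_sets:
    # handles 2, 3, or any number of unioned sets uniformly
    result = set.intersection(*sorted(list_of_unioned_sets, key=len))
# result is {2, 3}
```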
Prateek