440,262 Members | 1,128 Online Need help? Post your question and get tips & solutions from a community of 440,262 IT Pros & Developers. It's quick & easy.

# [perl-python] exercise: partition a list by equivalence

 P: n/a here's another interesting algorithmic exercise, again from part of a larger program in the previous series. Here's the original Perl documentation: =pod merge(\$pairings) takes a list of pairs, each pair indicates the sameness of the two indexes. Returns a partitioned list of same indexes. For example, if the input is merge( [ [1,2], [2,4], [5,6] ] ); that means 1 and 2 are the same. 2 and 4 are the same. Therefore 1==2==4. The result returned is [[4,2,1],[6,5]]; (ordering of the returned list and sublists are not specified.) =cut For those of you unfamiliar with math lingoes, partition a list means to create sublists, based on some criteria. In our case, the input list comes in the form of pairs, and its members are the union of all members in the pairs. The criterion for partition in our case is that of equivalence, specified in the pairs input. Try to write a Python code for this. In the Python code, the input should be a list of couples. (for Perlers, sketch out the algorithm on paper and try to code in Python directly.) I'll post Perl & Python code tomorrow. ==This is brought to you by the Perl-Python community.== == http://xahlee.org/perl-python/python.html == Xah xa*@xahlee.org http://xahlee.org/PageTwo_dir/more.html Jul 18 '05 #1
41 Replies

 P: n/a >For those of you unfamiliar with math lingoes, partition a list means to create sublists, based on some criteria. Typical moronic mathematicians with their exclusionary lingoes...why can't they just say "create sublists based on some criteria" like normal people? - alex23 Jul 18 '05 #2

 P: n/a Xah Lee wrote: here's another interesting algorithmic exercise, again from part of a larger program in the previous series. Here's the original Perl documentation: =pod merge(\$pairings) takes a list of pairs, each pair indicates the sameness of the two indexes. Returns a partitioned list of same indexes. For example, if the input is merge( [ [1,2], [2,4], [5,6] ] ); that means 1 and 2 are the same. 2 and 4 are the same. Therefore 1==2==4. The result returned is [[4,2,1],[6,5]]; (ordering of the returned list and sublists are not specified.) =cut Almost a joke: from numarray import * def merge(*pairs): flattened = reduce(tuple.__add__, pairs, tuple()) m, M = min(flattened), max(flattened) d = M - m + 1 matrix = zeros((d, d), type = Bool) for x, y in pairs: X, Y = x - m, y - m matrix[X, X] = 1 matrix[X, Y] = 1 matrix[Y, X] = 1 matrix[Y, Y] = 1 while True: next = greater(dot(matrix, matrix), 0) if alltrue(ravel(next == matrix)): break matrix = next results = [] for i in range(d): eqls, = nonzero(matrix[i]) if eqls.size(): if i == eqls: results.append(tuple(x + m for x in eqls)) return results Disclaimer: I'm not an expert in numarray and suppose the something can be dramatically imporved. Jul 18 '05 #3

 P: n/a On Thu, Feb 17, 2005 at 03:46:20PM -0800, Xah Lee wrote: here's another interesting algorithmic exercise, again from part of a larger program in the previous series. Here's the original Perl documentation: =pod merge(\$pairings) takes a list of pairs, each pair indicates the sameness of the two indexes. Returns a partitioned list of same indexes. For example, if the input is merge( [ [1,2], [2,4], [5,6] ] ); that means 1 and 2 are the same. 2 and 4 are the same. Therefore 1==2==4. The result returned is [[4,2,1],[6,5]]; (ordering of the returned list and sublists are not specified.) =cut in Python: def merge(pairings): rev = {} res = [] for pairing in pairings: first, second = pairing has_first = first in rev has_second = second in rev if has_first and has_second: rev[first].extend(rev[second]) rev[second][:] = [] rev[second] = rev[first] elif has_first: rev[first].append(second) rev[second] = rev[first] elif has_second: rev[second].append(first) rev[first] = rev[second] else: copy = [first, second] res.append(copy) rev[first] = rev[second] = copy return filter(None, res) and in Perl: sub merge(\$) { my %rev = (); my @res = (); my (\$pairing, \$first, \$second, \$has_first, \$has_second); foreach \$pairing (@{\$_}) { (\$first, \$second) = @\$pairing; \$has_first = exists \$rev{\$first}; \$has_second = exists \$rev{\$second}; if (\$has_first and \$has_second) { push @{\$rev{\$first}}, @{\$rev{\$second}}; @{\$rev{\$second}} = (); \$rev{\$second} = \$rev{\$first}; } elsif (exists \$rev{\$first}) { push @{\$rev{\$first}}, \$second; \$rev{\$second} = \$rev{\$first}; } elsif (exists \$rev{\$second}) { push @{\$rev{\$second}}, \$first; \$rev{\$first} = \$rev{\$second}; } else { my @copy = (\$first, \$second); push @res, \@copy; \$rev{\$first} = \$rev{\$second} = \@copy; } } return [grep @\$_, @res]; } although in Perl you wouldn't define it to take a reference to a list (as you did), but rather a list, and you wouldn't define it to return a reference to a list, but rather a list in list context (and possibly a reference in scalar context). Both versions are (IMHO) pretty clear (when the docstring/pod is included), and O(n) because dict/hash lookup and list appending/pushing is O(1) in both languages. Both versions can probably be tweaked for speed quite a bit, but I don't *think* there's a better-than-O(n) algorithm for this. Note that the Python version assumes that the pairs' elements are hashable; your example used numbers, so I thought it was pretty safe assumption. The Perl version has no such restriction. -- John Lenton (jo**@grulic.org.ar) -- Random fortune: Noble cosa es, aún para un anciano, el aprender. -- Sófocles. (497-406 a.C.) Filósofo griego. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (GNU/Linux) iD8DBQFCFhXjgPqu395ykGsRAus5AJ9iCSbGsyP3QHZy/whP195GSVNIwwCcDyqf hIA+oWaLBHdSGUi7t79Wfx8= =BlC7 -----END PGP SIGNATURE----- Jul 18 '05 #4

 P: n/a John Lenton wrote: On Thu, Feb 17, 2005 at 03:46:20PM -0800, Xah Lee wrote: here's another interesting algorithmic exercise, again from part of a larger program in the previous series. Here's the original Perl documentation: =pod merge(\$pairings) takes a list of pairs, each pair indicates the sameness of the two indexes. Returns a partitioned list of same indexes. For example, if the input is merge( [ [1,2], [2,4], [5,6] ] ); that means 1 and 2 are the same. 2 and 4 are the same. Therefore 1==2==4. The result returned is [[4,2,1],[6,5]]; (ordering of the returned list and sublists are not specified.) =cut in Python: def merge(pairings): rev = {} res = [] for pairing in pairings: first, second = pairing has_first = first in rev has_second = second in rev Not robust in the face of input like: [[1,1]] or [[1,2], [1,2]] or [[1,2], [2,1]] or [[1,2], [2,3], [3,1]] needs "if first == second: continue" here if has_first and has_second: needs "if rev[first] == rev[second]: continue" here rev[first].extend(rev[second]) rev[second][:] = [] rev[second] = rev[first] elif has_first: rev[first].append(second) rev[second] = rev[first] elif has_second: rev[second].append(first) rev[first] = rev[second] else: copy = [first, second] res.append(copy) My reaction to the "magic" by which res grows was "omigod that's the most horrible thing I've seen for quite a while" but there was worse to come :-) rev[first] = rev[second] = copy return filter(None, res) and in Perl: [snip] Both versions are (IMHO) pretty clear (when the docstring/pod is included), and O(n) because dict/hash lookup and list appending/pushing is O(1) in both languages. Both versions can probably be tweaked for speed quite a bit, but I don't *think* there's a better-than-O(n) algorithm for this. List appending is O(1) only in the amortised sense. Extending is not O(1) in any sense. Neither is the list comparison that is necessary for robustness (using your data structure, that is). You don't need to think. This problem has been extensively picked over by folk who are a lot smarter than us, starting from 30 years ago. Google for "disjoint set" and "union-find". One gets the impression that the best possible algorithm would be slightly *worse* than O(n). Note that the Python version assumes that the pairs' elements are hashable; your example used numbers, so I thought it was pretty safe assumption. The Perl version has no such restriction. In the real world, the objects under comparison (images, fingerprints, customer records, etc) may not themselves be hashable and comparison for equality may be expensive but a unique identifier would be assigned to each object and that ID would of course be cheaply hashable and comparable. Jul 18 '05 #5

 P: n/a On Fri, Feb 18, 2005 at 03:21:10PM -0800, John Machin wrote: Not robust in the face of input like: [[1,1]] or [[1,2], [1,2]] or [[1,2], [2,1]] or [[1,2], [2,3], [3,1]] oops, my bad. needs "if first == second: continue" here if has_first and has_second: needs "if rev[first] == rev[second]: continue" here an 'is' is enough, and better. rev[first].extend(rev[second]) rev[second][:] = [] rev[second] = rev[first] elif has_first: rev[first].append(second) rev[second] = rev[first] elif has_second: rev[second].append(first) rev[first] = rev[second] else: copy = [first, second] res.append(copy) My reaction to the "magic" by which res grows was "omigod that's the most horrible thing I've seen for quite a while" but there was worse to come :-) what is magic about it? is it really that horrible? List appending is O(1) only in the amortised sense. Extending is not O(1) in any sense. Neither is the list comparison that is necessary for robustness (using your data structure, that is). You don't need to think. This problem has been extensively picked over by folk who are a lot smarter than us, starting from 30 years ago. Google for "disjoint set" and "union-find". One gets the impression that the best possible algorithm would be slightly *worse* than O(n). Of course! I'd forgotten clean about union-find. And yes, it's O(n*log(n)) union and find, and my implementation is O(n**2) for union, O(1) for find; I'm pleased that, in spite of having forgotten about union-find, I "reinvented" (heh) something that is better than the "naive" implementation we saw in class. Now I'm even more worried about your dismissal of this as magic and horrible. -- John Lenton (jo**@grulic.org.ar) -- Random fortune: Why don't you fix your little problem... and light this candle? -- Alan Shepherd, the first man into space, Gemini program -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (GNU/Linux) iD8DBQFCFoBIgPqu395ykGsRApUIAKCnNmctznqq1KTfZhi7Dl 8YpjPnbQCguo2E z/Ss9rWC64h9vuDrJM//Ge8= =rVAU -----END PGP SIGNATURE----- Jul 18 '05 #6

 P: n/a Hello John, Try your python code on this example: merge([[1,2], [3,4], [1,2], [5,3]]) The result given by your function is: [[3, 4, 5]] Sorry... To Xah: next time you propose an exercise, write some UNIT TESTS!!! Then people will be able to test if there answers are correct or not. Cyril On Fri, 18 Feb 2005 13:20:51 -0300, John Lenton wrote: On Thu, Feb 17, 2005 at 03:46:20PM -0800, Xah Lee wrote: here's another interesting algorithmic exercise, again from part of a larger program in the previous series. Here's the original Perl documentation: =pod merge(\$pairings) takes a list of pairs, each pair indicates the sameness of the two indexes. Returns a partitioned list of same indexes. For example, if the input is merge( [ [1,2], [2,4], [5,6] ] ); that means 1 and 2 are the same. 2 and 4 are the same. Therefore 1==2==4. The result returned is [[4,2,1],[6,5]]; (ordering of the returned list and sublists are not specified.) =cut in Python: def merge(pairings): rev = {} res = [] for pairing in pairings: first, second = pairing has_first = first in rev has_second = second in rev if has_first and has_second: rev[first].extend(rev[second]) rev[second][:] = [] rev[second] = rev[first] elif has_first: rev[first].append(second) rev[second] = rev[first] elif has_second: rev[second].append(first) rev[first] = rev[second] else: copy = [first, second] res.append(copy) rev[first] = rev[second] = copy return filter(None, res) and in Perl: sub merge(\$) { my %rev = (); my @res = (); my (\$pairing, \$first, \$second, \$has_first, \$has_second); foreach \$pairing (@{\$_}) { (\$first, \$second) = @\$pairing; \$has_first = exists \$rev{\$first}; \$has_second = exists \$rev{\$second}; if (\$has_first and \$has_second) { push @{\$rev{\$first}}, @{\$rev{\$second}}; @{\$rev{\$second}} = (); \$rev{\$second} = \$rev{\$first}; } elsif (exists \$rev{\$first}) { push @{\$rev{\$first}}, \$second; \$rev{\$second} = \$rev{\$first}; } elsif (exists \$rev{\$second}) { push @{\$rev{\$second}}, \$first; \$rev{\$first} = \$rev{\$second}; } else { my @copy = (\$first, \$second); push @res, \@copy; \$rev{\$first} = \$rev{\$second} = \@copy; } } return [grep @\$_, @res]; } although in Perl you wouldn't define it to take a reference to a list (as you did), but rather a list, and you wouldn't define it to return a reference to a list, but rather a list in list context (and possibly a reference in scalar context). Both versions are (IMHO) pretty clear (when the docstring/pod is included), and O(n) because dict/hash lookup and list appending/pushing is O(1) in both languages. Both versions can probably be tweaked for speed quite a bit, but I don't *think* there's a better-than-O(n) algorithm for this. Note that the Python version assumes that the pairs' elements are hashable; your example used numbers, so I thought it was pretty safe assumption. The Perl version has no such restriction. -- John Lenton (jo**@grulic.org.ar) -- Random fortune: Noble cosa es, aún para un anciano, el aprender. -- Sófocles. (497-406 a.C.) Filósofo griego. -- http://mail.python.org/mailman/listinfo/python-list Jul 18 '05 #7

 P: n/a John Lenton wrote: On Fri, Feb 18, 2005 at 03:21:10PM -0800, John Machin wrote: Not robust in the face of input like: [[1,1]] or [[1,2], [1,2]] or [[1,2], [2,1]] or [[1,2], [2,3], [3,1]] oops, my bad. needs "if first == second: continue" here if has_first and has_second: needs "if rev[first] == rev[second]: continue" here an 'is' is enough, and better. Good point. You're redeeming yourself :-) rev[first].extend(rev[second]) rev[second][:] = [] rev[second] = rev[first] elif has_first: rev[first].append(second) rev[second] = rev[first] elif has_second: rev[second].append(first) rev[first] = rev[second] else: copy = [first, second] res.append(copy) My reaction to the "magic" by which res grows was "omigod that's the most horrible thing I've seen for quite a while" but there was worse to come :-) what is magic about it? is it really that horrible? Try explaining to the newbies over on the tutor list how despite "res" only ever *explicitly* having little bits like [3, 4] appended to it, it (or more properly the thing to which it refers) is actually festering and growing and being mutated under the surface until at the finale it bursts out dripping slime just like the creature from the black lagoon ... Jul 18 '05 #8

 P: n/a Xah Lee wrote: merge(\$pairings) takes a list of pairs, each pair indicates the sameness of the two indexes. Returns a partitioned list of same indexes. For example, if the input is merge( [ [1,2], [2,4], [5,6] ] ); that means 1 and 2 are the same. 2 and 4 are the same. Therefore 1==2==4. The result returned is [[4,2,1],[6,5]]; Not sure how efficient this is, but I decided to take advantage of the operations provided by sets: def merge(pairs): pairs = set(tuple(sorted(p)) for p in pairings) merged = [] # Each loop will result in a new, complete sublist in the result. while pairs: p = set(pairings.pop()) remove = set([]) for pair in pairs: pairSet = set(pair) if pairSet & p: p |= pairSet remove.add(pair) pairs -= remove merged.append(list(p)) return merged -- Brian Beck Adventurer of the First Order Jul 18 '05 #9

 P: n/a Brian Beck wrote: [code] Whoops, that should say: def merge(pairs): pairs = set(tuple(sorted(p)) for p in pairs) merged = [] # Each loop will result in a new, complete sublist in the result. while pairs: p = set(pairs.pop()) remove = set([]) for pair in pairs: pairSet = set(pair) if pairSet & p: p |= pairSet remove.add(pair) pairs -= remove merged.append(list(p)) return merged -- Brian Beck Adventurer of the First Order Jul 18 '05 #10

 P: n/a Brian Beck wrote: Brian Beck wrote: > [code] Ah heck, nevermind... it worked for my small test cases but the algorithm is just wrong. -- Brian Beck Adventurer of the First Order Jul 18 '05 #11

 P: n/a In article <11**********************@c13g2000cwb.googlegroups .com>, "John Machin" wrote: You don't need to think. This problem has been extensively picked over by folk who are a lot smarter than us, starting from 30 years ago. Google for "disjoint set" and "union-find". One gets the impression that the best possible algorithm would be slightly *worse* than O(n). It can be solved by union-find (e.g. with UnionFind from ): import UnionFind import sets def merge(pairs): uf = UnionFind.UnionFind() items = sets.Set() for a,b in pairs: uf.union(a,b) items.add(a) items.add(b) leaders = {} for a in items: leaders.setdefault(uf[a],sets.Set()).add(a) return[list(component) for component in leaders.values()] If you do all the unions before all the finds, as in this algorithm, union-find is O(n). However it might be easier to treat the input pairs as the edges of a graph and apply a depth-first-search based graph connected components algorithm. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ Jul 18 '05 #12

 P: n/a Well, it looks like David posted a good solution, but I haven't tested it (I'm assuming his works fine). I fixed mine up anyway... it actually works now. If you're not using 2.4 you'll have to import sets.Set as set. def merge(pairList): pairList = set(tuple(sorted(p)) for p in pairList) # Sort & set to remove duplicates, tuple to make hashable merged = [] removePairs = set([]) # Each loop will result in a new, complete sublist in the result while pairList: if removePairs: removePairs = set([]) else: subList = set(pairList.pop()) # Start a new sublist for pair in pairList: pairSet = set(pair) # True when pairSet and subList share at least one element if pairSet & subList: subList |= pairSet # Merge pair with subList removePairs.add(pair) # Mark pair for removal if removePairs: pairList -= removePairs else: merged.append(list(subList)) return merged -- Brian Beck Adventurer of the First Order Jul 18 '05 #13

 P: n/a On Fri, Feb 18, 2005 at 04:52:46PM -0800, John Machin wrote: needs "if rev[first] == rev[second]: continue" here an 'is' is enough, and better. Good point. You're redeeming yourself :-) this, together with you saying that it is hard to explain, makes me think that you aren't comfortable thinking of lists as mutable objects. what is magic about it? is it really that horrible? Try explaining to the newbies over on the tutor list how despite "res" only ever *explicitly* having little bits like [3, 4] appended to it, it (or more properly the thing to which it refers) is actually festering and growing and being mutated under the surface until at the finale it bursts out dripping slime just like the creature from the black lagoon ... understanding why that works, and why it is 'is' and not '==', are both part of the same thing. Lists are mutable, and you can mutate them, and they mutate. Unless you actually write code that uses the fact you will forget it, and it will bite you. Of course, don't use it just for the heck of it, but that creature you dismiss as a slime-dripping mutation is actually quite useful. While I'm at being unpolite, do you really think this code was harder to understand than the code posted by anton, using numarray? And, of course, if this code were for anything non-throw-awayable, there would've been quite a bit of explaining going on between those lines of code. Ok, now back to being polite :) -- John Lenton (jo**@grulic.org.ar) -- Random fortune: Hay más días que longanizas. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (GNU/Linux) iD8DBQFCFq8rgPqu395ykGsRAnTNAJ4rtCpfoFUYJYLpQ6WmvH bmTLSsYgCeJzRE dXMUU9lYxyECNtld9JjcdeA= =2pzJ -----END PGP SIGNATURE----- Jul 18 '05 #14

 P: n/a In article , David Eppstein wrote: It can be solved by union-find (e.g. with UnionFind from ): Here's a cleaned up version, after I added a proper iter() method to the UnionFind data structure: import UnionFind def merge(pairs): uf = UnionFind.UnionFind() for a,b in pairs: uf.union(a,b) components = {} for a in uf: components.setdefault(uf[a],[]).append(a) return components.values() If you do all the unions before all the finds, as in this algorithm, union-find is O(n). That was a little too sloppy. It is possible to solve the union find problem, with all unions before all finds, in time O(n). However the usual union find data structure takes more time, O(n alpha(n)). However it might be easier to treat the input pairs as the edges of a graph and apply a depth-first-search based graph connected components algorithm. This would be O(n), though. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ Jul 18 '05 #15

 P: n/a John Lenton wrote: On Fri, Feb 18, 2005 at 04:52:46PM -0800, John Machin wrote: > needs "if rev[first] == rev[second]: continue" here an 'is' is enough, and better. Good point. You're redeeming yourself :-) this, together with you saying that it is hard to explain, makes me think that you aren't comfortable thinking of lists as mutable objects. How so? There is no connection between is/== and mutability. Let me amplify: The point about 'is' is a good one, and aids your redemption after your failure to have adequate guards caused your algorithm not to work. what is magic about it? is it really that horrible? Try explaining to the newbies over on the tutor list how despite "res" only ever *explicitly* having little bits like [3, 4] appended to it, it (or more properly the thing to which it refers) is actually festering and growing and being mutated under the surface until at the finale it bursts out dripping slime just like the creature from the black lagoon ... understanding why that works, and why it is 'is' and not '==', are both part of the same thing. What same thing is that? Lists are mutable, and you can mutate them, and they mutate. Unless you actually write code that uses the fact you will forget it, and it will bite you. Of course, don't use it just for the heck of it, but that creature you dismiss as a slime-dripping mutation is actually quite useful. You are confusing mutability of lists (without which they would be called tuples!) with my point that the 'res' list was having its contents fiddled with implicitly through other "pointers". While I'm at being unpolite, do you really think this code was harder to understand than the code posted by anton, using numarray? I only read it as far as the bit where it creating a matrix of size O(N**2) -- in my app N can be over a million so I lost interest rapidly. And, of course, if this code were for anything non-throw-awayable, there would've been quite a bit of explaining going on between those lines of code. Not of course, but of necessity. Ok, now back to being polite :) Welcome back. Jul 18 '05 #16

 P: n/a Xah Lee wrote: here's the answer to the partition by equivalence exercise. Your Python solution is, as expected, wrong (try with [[10, 8], [7, 3], [1, 7], [5, 4], [2, 2], [3, 8], [7, 10], [2, 3], [6, 10], [3, 2]] for example). The solution by John Lenton is wrong, too. The solution by Brian Beck delivers the correct result for most input, but for some input lists like [[3, 3], [8, 7], [3, 2], [8, 5], [5, 6], [6, 3], [10, 8], [8, 10], [4, 10], [10, 2]] it chokes and returns the empty list. My solution (which may not be the fastest or most effective, but till now is the shortest and it works): def merge(pairings): sets = {} for x1, x2 in pairings: newset = (sets.get(x1, frozenset([x1])) | sets.get(x2, frozenset([x2]))) for i in newset: sets[i] = newset return[list(aset) for aset in set(sets.itervalues())] Reinhold Jul 18 '05 #19

 P: n/a David Eppstein: However it might be easier to treat the input pairs as the edges of a graph and apply a depth-first-search based graph connected components algorithm. || This would be O(n), though. Is the DFS the best one here (instead of BFS)? With the graph implementation that I've just shown here it becomes: .. import graph .. def merge(l): .. g = graph.Graph() .. g.addArcs(l) .. return g.connectedComponents() .. print merge( [ [1,2], [2,4], [5,6] ] ) I presume the complexity is O(n+a); n (the nodes) is proportional to the number of pairs, and a (the arcs) depends on the "intricacy" of the input pairs. For a complete graph, this can become slow (but there is a high probably that all the pairs are in the same connected component). Bye, Bearophile Jul 18 '05 #20

 P: n/a Bearophile: I presume the complexity is O(n+a); n (the nodes) is proportional to the number of pairs, and a (the arcs) depends on the "intricacy" of the input pairs. Opps... n (the number of nodes) is the number of different numbers contained in the pairs :-] Bearophile Jul 18 '05 #21

 P: n/a In article <11**********************@f14g2000cwb.googlegroups .com>, be************@lycos.com wrote: David Eppstein: However it might be easier to treat the input pairs as the edges of a graph and apply a depth-first-search based graph connected components algorithm. || This would be O(n), though. Is the DFS the best one here (instead of BFS)? I'm not sure it makes much difference. With the graph implementation that I've just shown here it becomes: . import graph . def merge(l): . g = graph.Graph() . g.addArcs(l) . return g.connectedComponents() . print merge( [ [1,2], [2,4], [5,6] ] ) I presume the complexity is O(n+a); n (the nodes) is proportional to the number of pairs, and a (the arcs) depends on the "intricacy" of the input pairs. For a complete graph, this can become slow (but there is a high probably that all the pairs are in the same connected component). It's still linear in the input size, which is the best you could hope for. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ Jul 18 '05 #22

 P: n/a Reinhold Birkenfeld wrote: def merge(pairings): sets = {} for x1, x2 in pairings: newset = (sets.get(x1, frozenset([x1])) | sets.get(x2, frozenset([x2]))) for i in newset: sets[i] = newset return[list(aset) for aset in set(sets.itervalues())] Looks good. I used it as inspiration for this new one, which takes less time for large datasets, and especially for those where a good number of merges are expected (at least that looks to be the pattern; with some large datasets this takes less than a quarter of the time as the one above). I'm sure it can be improved still -- yours is still faster in many cases. def merge2(pairings): elements = {} sets = [] for x1, x2 in pairings: i = [elements.get(x1, -1), elements.get(x2, -1)] i.sort() if i == -1: i = len(sets) sets.append(set([x1, x2])) elif i == -1: sets[i] |= set([x1, x2]) elif i == i: continue else: sets[i] |= sets[i] sets[i].clear() for x in sets[i]: elements[x] = i return[list(s) for s in sets if s] # Comparison import profile import random # Play with these values xMax, nPairs = 1000, 5000 l = [[random.randint(0, xMax), random.randint(0, xMax)] for i in range(nPairs)] print 'merge2:' profile.run('merge2(l)') # Mine print 'merge:' profile.run('merge(l)') # Reinhold's -- Brian Beck Adventurer of the First Order Jul 18 '05 #24

 P: n/a Reinhold Birkenfeld wrote: Xah Lee wrote: here's the answer to the partition by equivalence exercise. Your Python solution is, as expected, wrong (try with [[10, 8], [7, 3], [1, 7], [5, 4], [2, 2], [3, 8], [7, 10], [2, 3], [6, 10], [3, 2]] for example). The solution by John Lenton is wrong, too. The solution by Brian Beck delivers the correct result for most input, but for some input lists like [[3, 3], [8, 7], [3, 2], [8, 5], [5, 6], [6, 3], [10, 8], [8, 10], [4, 10], [10, 2]] it chokes and returns the empty list. My solution (which may not be the fastest or most effective, but till now is the shortest and it works): def merge(pairings): sets = {} for x1, x2 in pairings: newset = (sets.get(x1, frozenset([x1])) | sets.get(x2, frozenset([x2]))) for i in newset: sets[i] = newset return[list(aset) for aset in set(sets.itervalues())] Reinhold FWIW, here's a brief UAT report: Appears to work: Reinhold, David, Xah (v2) Has bug(s): John L (v*), Brian (v*) Incomprehensible: Xah (v*) 'Come back after lunch' award goes to Xah v2, which at a glance appears to be O(N**4) -- dictionary.update() inside 3-deep nest of 'for' statements. Here's a small input that busts all non-working versions: input: [[1, 2], [3, 4], [2, 3], [4, 5]] merge_RB -> [[1, 2, 3, 4, 5]] merge_DE -> [[1, 2, 3, 4, 5]] merge_JL -> [[1, 2, 3, 4], ] merge_JL2 -> [[1, 2, 3, 4], ] merge_BB -> [] merge_XL -> [[1, 2, 3, 4, 5], [3, 4, 5]] merge_XL2 -> [[1, 2, 3, 4, 5]] Jul 18 '05 #25

 P: n/a an interesting problem so developed now is to write a function that generate test cases for the purpose of testing performance. (just for fun) the design of this function could be interesting. We want to be able to give parameters in this function so as to spit out all possible screw test cases. First of all, a range of n (some millions) numbers. Then, a fraction that specifies the percentage of these number are equivalent. 1 being all equivalent. 0 being all "distinct" (having only one equivalent member (since the input comes in pairs)). Then we want to have a parameter that says something like the sizes of the equivalence groups. Suppose 50% of number are equal among themselves (i.e. have more than one equivalent member). 1 would be all in one group. 0 would mean all partitions have size 3 or 2. (there are more to it here... since this is a distribution) ... Then, we need to turn this range of integers into pairs. That's something more to think about. So with this function at hand, we'll be able to say for sure which code performs better (and under what type of input) the Order notion in computing mathematics is fairly useless for finer details. PS it is not trivial to design this pair generating function. I don't know if the above is on the right track, but essentially we want to categorize the type of inputs according to the mathematical operational performance aspect of partition by equivalence, and distill them into parameters. another func to write is one that canonicalize the output. Such as sorting. So that all results can be compared simply by = in Python. failing to design a elaborate pair_gen, we can just have pairs of random numbers. But exactly what nature is such input... is more to think about. (in my original application, each number represent a computer file, there are up to tens of thousands of files, and much less than 0.1% is same as another, and for files that are same, each equivalent group number no more than 10 or so.) Xah xa*@xahlee.org http://xahlee.org/PageTwo_dir/more.html John Machin wrote: FWIW, here's a brief UAT report: Appears to work: Reinhold, David, Xah... Has bug(s): ... Jul 18 '05 #26

 P: n/a John Machin>FWIW, here's a brief UAT report: Here is something else for you. Note: for more correct comparisons, for the following tests I've disabled the use of Psyco in Graph (usually enabled, if present). This looks faster than merge2 for this (quite sparse) random pairs test: .. import graph # http://www.fantascienza.net/animalia/graph.zip .. def merge4(l): .. g = graph.Graph() .. for n1,n2 in l: .. g.addArc(n1,n2) .. return g.connectedComponents() Creating a better addArcs produces good results (addArcs method is present in Graph, but addArcs2 isn't present): .. def merge5(l): .. g = graph.Graph() .. g.addArcs2(l) .. return g.connectedComponents() If you like to see the actual code, without using Graph, I've made a very stripped down version of addArcs2 and connectedComponents: .. from collections import deque .. from itertools import chain .. .. def merge6(l): .. # graph creation .. gi = {} .. go = {} .. for n1,n2 in l: .. if n1 not in go: .. go[n1] = {} .. gi[n1] = {} .. if n2 not in go: .. go[n2] = {} .. gi[n2] = {} .. go[n1][n2] = 0 .. gi[n2][n1] = 0 .. # Connected components: .. result = [] .. visited = set() .. for root in go.iterkeys(): .. if root not in visited: .. component = [root] .. visited.add(root) .. Q = deque() # A queue .. Q.append(root) .. while Q: .. n1 = Q.popleft() .. for n2 in chain( go[n1].iterkeys(), gi[n1].iterkeys() ): .. if n2 not in visited: .. visited.add(n2) .. Q.append(n2) .. component.append(n2) .. result.append(component) .. return result At the end of the tests I've added this to be sure the results are always the same: r2 = set( frozenset(e) for e in merge2(l) ) r4 = set( frozenset(e) for e in merge4(l) ) r5 = set( frozenset(e) for e in merge5(l) ) r6 = set( frozenset(e) for e in merge6(l) ) assert r2 == r4 assert r2 == r6 For xMax, nPairs = 1000, 5000: merge2: 15750 function calls in 0.939 CPU seconds merge4: 11029 function calls in 0.560 CPU seconds merge5: 6030 function calls in 0.303 CPU seconds merge6: 6011 function calls in 0.297 CPU seconds For xMax, nPairs = 2000, 10000: merge2: 31503 function calls in 2.505 CPU seconds merge6: 12011 function calls in 0.639 CPU seconds Timings using clock() (much more reliable than the profilers!), for xMax, nPairs = 2000, 10000: merge2: 1.222 merge6: 0.201 Probably merge6 can be improved, reducing memory used... Bear hugs, Bearophile Jul 18 '05 #27

 P: n/a On 19 Feb 2005 17:56:27 -0800, be************@lycos.com wrote: John Machin>FWIW, here's a brief UAT report:Here is something else for you.Note: for more correct comparisons, for the following tests I'vedisabled the use of Psyco in Graph (usually enabled, if present).This looks faster than merge2 for this (quite sparse) random pairstest: [snip] Timings using clock() (much more reliable than the profilers!), forxMax, nPairs = 2000, 10000:merge2: 1.222merge6: 0.201Probably merge6 can be improved, reducing memory used... Perhaps I'm missing a posting of yours -- what are merge2 and merge4? What is "this random pairs test"? What is xMax (nPairs isn't hard to guess), and how are you generating your input data? FWIW, my offering is attached. Jul 18 '05 #28

 P: n/a Xah Lee wrote: when i try to run the following program, Python complains about some global name frozenset is not defined. Is set some new facility in Python 2.4? http://www.python.org/doc/2.3/whatsnew/ http://www.python.org/doc/2.4/whatsnew/whatsnew24.html You must be running 2.3. If you persist in running 2.3, put the following at the top of the file: import sys PY_VERSION = sys.version_info[:2] if PY_VERSION == (2,3): from sets import Set as set from sets import ImmutableSet as frozenset That will get Reinhold's gadget working under 2.3. The bearophile's function uses collections.deque which is built-in to 2.4. it would be nice if the two working programs do not use some package. This problem shouldn't need to. Look at the newsgroup again. By my count there are apparently-working versions from SIX people: Reinhold, bear.+, Brian Beck, David Eppstein, yourself, and me. Only David's uses stuff that's not in the Python 2.4 off-the-shelf distribution. Jul 18 '05 #30

 P: n/a John Machin:Perhaps I'm missing a posting of yours -- what are merge2 and merge4? What is "this random pairs test"? What is xMax (nPairs isn't hard to guess), and how are you generating your input data?< merge2 and this random pairs test comes from the post by Brian Beck. merge4 is the first in my last post. And merge3 isn't included (nor tested) because it uses the original (slow) addArcs. This is the simple test generator taken from his post: xMax = 3000 nPairs = xMax*5 l = [[randint(0,xMax), randint(0,xMax)] for i in xrange(nPairs)] For such high number of nPairs (arcs), the resulting graph usuall has one big connected component, and few tiny debris... To produce a more divided graph, nPairs have to be much smaller, like xMax*1.5. FWIW, my offering is attached. < I'm sorry, I don't see the attach... (just the different title). John Machin:Only David's uses stuff that's not in the Python 2.4 off-the-shelf distribution.< Well, using already mede functions and classes is good, it avoids you to reinvent things each time. Some of my versions too use my graph library (for a problem like this I usually don't create stand alone versions like merge6 and merge7). And my original point was adding one graph data structure to the standard Python library :-] I haven't tested the speed of David Eppstein version... Anyway, this is a simplified version of merge6, that uses an indirect graph g, instead of the directed graph, and avoids the use of chain: .. from collections import deque .. .. def merge7(l): .. # Undirect graph g creation: .. g = {} .. for n1,n2 in l: .. if n1 not in g: g[n1] = {n2:0} .. else: g[n1][n2] = 0 .. if n2 not in g: g[n2] = {n1:0} .. else: g[n2][n1] = 0 .. # Connected components: .. result = [] .. visited = set() .. for root in g.iterkeys(): .. if root not in visited: .. component = [root] .. visited.add(root) .. Q = deque() # A queue .. Q.append(root) .. while Q: .. n1 = Q.popleft() .. for n2 in g[n1].iterkeys(): .. if n2 not in visited: .. visited.add(n2) .. Q.append(n2) .. component.append(n2) .. result.append(component) .. return result It's a bit shorter and a little faster than merge6, memory used is similar (there is only one dictionary, so there is only one ammortization overhead of the dictionary, but it contains the sum of both arcs, so the ammortization can be twice as big, so... memory used can be the same). (Usually BFS and DFS memory requirements are a little different; this implementation uses BFS, and I think it uses a bit less memory than DFS, but for this problem/test the difference isn't significative.) Bear hugs, Bearophile Jul 18 '05 #31

 P: n/a > From: "Xah Lee" The GOTO statement from Perl has been messed up. hey, I didn't do it! This block: Â© for group in interm: what do the funny little "Â©"s stand for? Eric Pederson http://www.songzilla.blogspot.com ::::::::::::::::::::::::::::::::::: domainNot="@something.com" domainIs=domainNot.replace("s","z") ePrefix="".join([chr(ord(x)+1) for x in "do"]) mailMeAt=ePrefix+domainIs ::::::::::::::::::::::::::::::::::: Jul 18 '05 #32

 P: n/a Reinhold Birkenfeld wrote: My solution (which may not be the fastest or most effective, but till now is the shortest and it works): def merge(pairings): sets = {} for x1, x2 in pairings: newset = (sets.get(x1, frozenset([x1])) | sets.get(x2, frozenset([x2]))) for i in newset: sets[i] = newset return[list(aset) for aset in set(sets.itervalues())] A recursive solution (around twice as fast as the above, though very slow still...) def merge_rb2(pairings): def merge(sets): res_sets = [] for tset in sets: for item in tset: for rset in res_sets: if item in rset: rset.update(item for item in tset) break else: continue break else: res_sets.append(set(tset)) if len(res_sets) == len(sets): return res_sets else: return merge(res_sets) return[list(s) for s in merge([set(pair) for pair in pairings])] Another one: def merge_rb3(pairings): dic = {} for x1, x2 in pairings: dic.setdefault(x1, []).append(x2) dic.setdefault(x2, []).append(x1) def red(k, l): l.append(k) sub = dic[k] for i in dic[k]: if i not in l: red(i, l) del dic[k] res = [] while len(dic): rl = [] red(iter(dic).next(), rl) res.append(rl) return res Reinhold Jul 18 '05 #34

 P: n/a Reinhold Birkenfeld wrote: Reinhold Birkenfeld wrote: My solution (which may not be the fastest or most effective, but till now is the shortest and it works): [snip RB] A recursive solution (around twice as fast as the above, though very slow still...) [snip RB2] Another one: [snip RB3] Dunno what data you are using for timing, but my tests suggest that RB is fast enough, RB3 is slightly faster, but RB2 is a real dog and appears to be quadratic [hint: it has that same for-for-for-update signature found in phase 2 of Xah's effort]. Not only that, but it seems to be somewhat irregular. Below are some results on trivial test data: prototype input: python -m timeit -s "import merge;n=3000;inp=((x,x+1)for x in xrange(0,n,2))" "merge.merge_RB3(inp)" 100000 loops, best of 3: 3.98 usec per loop n=3000;RB2 64.9 msec per loop n=3000;RB3 3.98 usec per loop n=3000;RB 3.06 usec per loop n=3000;JM3 2.73 usec per loop n=1000;RB2 4.92 usec per loop n=1250;RB2 5.34 usec per loop n=1500;RB2 20.4 usec per loop n=1750;RB2 22.1 msec per loop n=2000;RB2 28.8 msec per loop n=2500;RB2 44.9 msec per loop n=3000;RB2 64.9 msec per loop [background: Python 2.4, Win2000, 1.4GHz Athlon chip, 768Mb memory] Here's a function to generate some serious timing data: !def mktimedata(n,lev): ! res = [] ! d = 1 ! for k in range(lev): ! res.extend(zip(xrange(0, n, 2*d), xrange(d, n, 2*d))) ! d *= 2 ! return res Try that with n=3000 and lev=5 ... Cheers, John Jul 18 '05 #35

 P: n/a Eric Pederson wrote: From: "Xah Lee" © for group in interm: what do the funny little "©"s stand for? ....>>> import unicodedata as ucd; ucd.name('©'.decode('cp1252')) 'COPYRIGHT SIGN' Xah is asserting his right to be recognised as the author of his artistic creations, line by line. Jul 18 '05 #36

 P: n/a John Machin wrote: Xah is asserting his right to be recognised as the author of his artistic creations, line by line. Not very well, however, since his usage doesn't constitute a valid copyright notice :-). -- Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/ San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis Wyrd has swept all my kin / all the brave chiefs away! / Now I must follow them! -- Beowulf Jul 18 '05 #37

 P: n/a John Machin wrote: Reinhold Birkenfeld wrote: Reinhold Birkenfeld wrote: > My solution (which may not be the fastest or most effective, but till > now is the shortest and it works): [snip RB] A recursive solution (around twice as fast as the above, though very slow still...) [snip RB2] Another one: [snip RB3] Dunno what data you are using for timing, but my tests suggest that RB is fast enough, RB3 is slightly faster, but RB2 is a real dog and appears to be quadratic [hint: it has that same for-for-for-update signature found in phase 2 of Xah's effort]. Not only that, but it seems to be somewhat irregular. Below are some results on trivial test data: [snip] Yes, I don't know exactly how I timed this, and I just posted the solutions to show that there are very different solutions possible. They are surely not using the best algorithms, as bearophile's function showed. Reinhold Jul 18 '05 #38

 P: n/a In article <11**********************@l41g2000cwc.googlegroups .com>, "John Machin" wrote: it would be nice if the two working programs do not use some package. This problem shouldn't need to. Look at the newsgroup again. By my count there are apparently-working versions from SIX people: Reinhold, bear.+, Brian Beck, David Eppstein, yourself, and me. Only David's uses stuff that's not in the Python 2.4 off-the-shelf distribution. Where "stuff" means a single 62-line file that I linked to in my post. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ Jul 18 '05 #39

 P: n/a I started to collect i believe the 4 or so solutions by different people... but seems it's gonna take some an hour or more... So far the only other one i've run and find alright is Reinhold Birkenfeld's original. Others worth noting i'm aware of is David Epsteinn, improved versions from Reinhold Birkenfeld, the one using graphs by bearophile .... since many of you have already collected and tested these... i wonder if anyone'd be interested to put together all the (working) code in a single message? (or even a webpage.) thanks. Xah xa*@xahlee.org http://xahlee.org/PageTwo_dir/more.html Jul 18 '05 #40

 P: n/a Please consider my submission also (Python 2.3-compatible). -- Paul McGuire .. import sets .. .. input = [[1, 2], [3, 4], [2, 3], [4, 5]] .. input = [[1, 2], [3, 4], [4, 5]] .. input = [[1, 2],[2,1], [3, 4], [4, 5],[2,2],[2,3],[6,6]] .. .. def merge(pairings): .. ret = [] .. for a,b in pairings: .. s1 = None .. s2 = None .. for s in ret: .. if a in s or b in s: .. if s1 is None: .. s1 = s .. else: .. s2 = s .. break .. else: .. for s in ret: .. if a in s: .. s.add(b) .. break .. elif b in s: .. s.add(a) .. break .. else: .. ret.append(sets.Set([a,b])) .. continue .. ret.remove(s1) .. ret.remove(s2) .. ret.append(s1.union(s2)) .. .. return[list(s) for s in ret] .. .. print merge(input) Jul 18 '05 #41

 P: n/a A slightly better version, only walks the set of cumulative list of sets once per pairing. -- Paul .. import sets .. .. input = [[1, 2], [3, 4], [2, 3], [4, 5]] .. input = [[1, 2], [3, 4], [4, 5]] .. input = [[1, 2],[2,1], [3, 4], [4, 5],[2,2],[2,3],[6,6]] .. ..def merge(pairings): .. ret = [] .. for a,b in pairings: .. aset = None .. bset = None .. for s in ret: .. if not aset and a in s: .. aset = s .. if not bset and b in s: .. bset = s .. if aset and bset: .. break .. else: .. if aset: .. aset.add(b) .. elif bset: .. bset.add(a) .. else: .. ret.append(sets.Set([a,b])) .. continue .. if aset is not bset: .. ret.remove(aset) .. ret.remove(bset) .. ret.append(aset.union(bset)) .. .. return[list(s) for s in ret] .. ..print merge(input) Jul 18 '05 #42

### This discussion thread is closed

Replies have been disabled for this discussion. 