Merging ordered lists

Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # descending length (Python 2 positional sort args)
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref[i] != s) and s not in ref:
                ref.insert(ref.index(src[i-1]) + 1, s)
    # Remove duplicates, keeping the last occurrence of each element
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

This came up while using the csv module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
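
For concreteness, a sketch of how it gets called (the field names here are made up):

field_lists = [
    ["name", "email", "phone"],
    ["name", "phone", "fax"],
    ["name", "email"],
]
print merge_to_unique(field_lists)
# -> ['name', 'email', 'phone', 'fax'] (input order, not alphabetical)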
Jun 1 '08 #1
On May 31, 10:00 pm, etal <eric.talev...@gmail.com> wrote:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?
I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Coo.../Recipe/491285
http://aspn.activestate.com/ASPN/Coo.../Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]
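
For what it's worth, once everything fits in memory the standard library's heapq.merge (new in Python 2.6) can stand in for the imerge recipe, provided each input list is already sorted. A minimal sketch with made-up inputs:

import heapq
from itertools import groupby

sources = [[1, 2, 4], [1, 3, 4], [2, 5]]  # made-up, already-sorted inputs
result = [k for k, g in groupby(heapq.merge(*sources))]
# -> [1, 2, 3, 4, 5]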
Raymond
Jun 1 '08 #2
etal wrote:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # descending length (Python 2 positional sort args)
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref[i] != s) and s not in ref:
                ref.insert(ref.index(src[i-1]) + 1, s)
    # Remove duplicates, keeping the last occurrence of each element
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

This came up while using the csv module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
# untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        else:
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return reduce(merge, sorted(sources, key=len, reverse=True))

Peter
Jun 1 '08 #3
Hi!

Use set (union).
Example:

la=[2,1,3,5,4,6]
lb=[2,8,6,4,12]

#compact:
print list(set(la).union(set(lb)))

#detail:
s1 = set(la)
s2 = set(lb)
s3 = s1.union(s2)
print list(s3)
@-salutations

Michel Claveau

Jun 1 '08 #4
Peter Otten wrote:
#untested
Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
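
A quick sanity check with made-up lists, showing that input order survives the merge:

sources = [["a", "b", "d"], ["a", "c", "d"], ["b", "e"]]
print merge_to_unique(sources)
# should print ['a', 'b', 'c', 'd', 'e'] -- order from the inputs, duplicates gone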
Jun 1 '08 #5
etal wrote:
Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
Looks as if set does it for you.

--
Taekyon
Jun 1 '08 #7
On Jun 1, 1:49 am, Peter Otten <__pete...@web.de> wrote:
Peter Otten wrote:
#untested

Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))
difflib.SequenceMatcher looks promising; I'll try it. Thanks!

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?

(Secret: the reference list (or, any of the sources) is unlikely to be
more than a few dozen elements long. The data set that puts
merge_to_unique through a workout will be a giant list of
comparatively short lists, so the unique() part just needs to be short
and conceptually clean, while merge() should attempt sane behavior for
large len(sources).)
Jun 27 '08 #12
On Jun 1, 12:34 am, Raymond Hettinger <pyt...@rcn.com> wrote:
I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Co.../Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]

Raymond
Thanks for the tip; itertools never ceases to amaze. One issue:
groupby doesn't seem to remove all duplicates, just consecutive ones
(for lists of strings and integers, at least):
>>> [k for k, g in itertools.groupby(list("asdfdfffdf"))]
['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']
Another issue: dropping everything into a heap and pulling it back out
looks like it loses the original ordering, which isn't necessarily
alphabetical, but "however the user wants to organize the
spreadsheet". That's why I originally avoided using
sorted(set(itertools.chain(*sources))). Do you see another way around
this?
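
For comparison, sorting first does let groupby drop every duplicate, but then the user's column order is lost, which is the whole point:

>>> [k for k, g in itertools.groupby(sorted("asdfdfffdf"))]
['a', 'd', 'f', 's']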
Jun 27 '08 #13
etal wrote:
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?
Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

def unique_lc(ref):
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

sample_a = range(1000)
sample_b = range(1000)
import random
for i in random.sample(sample_b, 10):
    sample_b[i] = 0

$ python -m timeit -s"from unique import unique as u, sample_a as s" "u(s)"
10000 loops, best of 3: 52.8 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_a as s" "u(s)"
10 loops, best of 3: 44.2 msec per loop

3 orders of magnitude for the shortcut.

$ python -m timeit -s"from unique import unique as u, sample_b as s" "u(s)"
1000 loops, best of 3: 646 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_b as s" "u(s)"
10 loops, best of 3: 39 msec per loop

Still 2 orders of magnitude if my unique() does some actual work.

Peter
Jun 27 '08 #14
On Jun 3, 1:22 am, Peter Otten <__pete...@web.de> wrote:
Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
Yikes... and the list comprehension looked so innocent. I had resigned
myself to O(n) lookup, but you and Raymond appear to agree on the
basic concept for unique() -- check set membership, then generate the
final sequence and update the set based on that.
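
For the record, the usual one-pass "seen set" idiom does the same thing in a shape closer to what I had in mind -- order-preserving, with O(1) lookups:

def unique(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result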
Jun 27 '08 #15
