
Merging ordered lists

Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # Descending length
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref[i] != s) and s not in ref:
                ref.insert(ref.index(src[i-1]) + 1, s)
    # Remove duplicates
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

This came up while using the csv module's DictWriter class to merge a
set (a list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings in this case, but
it would be nice if the function didn't require that.
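
For context, a rough sketch of how the merged field list gets used; the
file name and sample rows are made up for illustration, and merge_to_unique
is the function above:

import csv

# Hypothetical inputs: one header list per source file, and the rows
# (as dicts) collected from all of them.
sources = [['name', 'email'], ['name', 'phone']]
rows = [{'name': 'a', 'email': 'a@example.com'},
        {'name': 'b', 'phone': '555-0100'}]

fieldnames = merge_to_unique(sources)  # every column, duplicates removed
out = open('merged.csv', 'wb')
writer = csv.DictWriter(out, fieldnames, restval='')
writer.writerow(dict(zip(fieldnames, fieldnames)))  # header row
writer.writerows(rows)                               # missing keys become ''
out.close()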

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
Jun 1 '08 #1
On May 31, 10:00 pm, etal <eric.talev...@gmail.com> wrote:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?
I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Coo.../Recipe/491285
http://aspn.activestate.com/ASPN/Coo.../Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]
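
A runnable sketch of the two-step idea, assuming Python 2.6+ where
heapq.merge plays the role of the imerge recipes; sorted inputs are what
let groupby drop every duplicate rather than just some:

from heapq import merge
from itertools import groupby

def merge_sorted_unique(*sources):
    # merge() interleaves the already-sorted sources; groupby() then
    # collapses runs of equal values, so each one appears exactly once.
    return [k for k, g in groupby(merge(*sources))]

print merge_sorted_unique([1, 3, 5], [1, 2, 3, 4])   # [1, 2, 3, 4, 5]
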
Raymond
Jun 1 '08 #2
etal wrote:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

[...]

#untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        else:
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return reduce(merge, sorted(sources, key=len, reverse=True))

Peter
Jun 1 '08 #3
Hi!

Use set (union).
Example:

la=[2,1,3,5,4,6]
lb=[2,8,6,4,12]

#compact:
print list(set(la).union(set(lb)))

#detail:
s1 = set(la)
s2 = set(lb)
s3 = s1.union(s2)
print list(s3)
@-salutations

Michel Claveau

Jun 1 '08 #4
Peter Otten wrote:
#untested
Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
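
A quick check with the kind of header rows described in the original post
(the field names are made up, and this assumes the functions above are
defined):

sources = [["name", "email", "phone"],
           ["name", "phone", "fax"],
           ["name", "email"]]
print merge_to_unique(sources)
# -> ['name', 'email', 'phone', 'fax'], i.e. order follows the longest list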
Jun 1 '08 #5
etal wrote:
Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
Looks as if set does it for you.

--
Taekyon
Jun 1 '08 #7
On Jun 1, 1:49 am, Peter Otten <__pete...@web.de> wrote:
Peter Otten wrote:
#untested

Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))
difflib.SequenceMatcher looks promising; I'll try it. Thanks!

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?

(Secret: the reference list (or, any of the sources) is unlikely to be
more than a few dozen elements long. The data set that puts
merge_to_unique through a workout will be a giant list of
comparatively short lists, so the unique() part just needs to be short
and conceptually clean, while merge() should attempt sane behavior for
large len(sources).)
Jun 27 '08 #12
On Jun 1, 12:34 am, Raymond Hettinger <pyt...@rcn.com> wrote:
I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Co.../Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]

Raymond
Thanks for the tip; itertools never ceases to amaze. One issue:
groupby doesn't seem to remove all duplicates, just consecutive ones
(for lists of strings and integers, at least):
>>> [k for k, g in itertools.groupby(list("asdfdfffdf"))]
['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']
Another issue: dropping everything into a heap and pulling it back out
looks like it loses the original ordering, which isn't necessarily
alphabetical, but "however the user wants to organize the
spreadsheet". That's why I originally avoided using
sorted(set(itertools.chain(*sources))). Do you see another way around
this?
Jun 27 '08 #13
etal wrote:
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?
Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

def unique_lc(ref):
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

sample_a = range(1000)
sample_b = range(1000)

import random
for i in random.sample(sample_b, 10):
    sample_b[i] = 0

$ python -m timeit -s"from unique import unique as u, sample_a as s" "u(s)"
10000 loops, best of 3: 52.8 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_a as s" "u(s)"
10 loops, best of 3: 44.2 msec per loop

3 orders of magnitude for the shortcut.

$ python -m timeit -s"from unique import unique as u, sample_b as s" "u(s)"
1000 loops, best of 3: 646 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_b as s" "u(s)"
10 loops, best of 3: 39 msec per loop

Still 2 orders of magnitude if my unique() does some actual work.

Peter
Jun 27 '08 #14
On Jun 3, 1:22 am, Peter Otten <__pete...@web.de> wrote:
Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
Yikes... and the list comprehension looked so innocent. I had resigned
myself to O(n) lookup, but you and Raymond appear to agree on the
basic concept for unique() -- check set membership, then generate the
final sequence and update the set based on that.
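
For reference, the simplest form of that membership-check idea -- build a
"seen" set as you go and keep each value only the first time it appears
(the helper name here is made up):

def unique_in_order(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:      # first occurrence wins; order preserved
            seen.add(item)
            result.append(item)
    return result

print unique_in_order(list("asdfdfffdf"))   # ['a', 's', 'd', 'f']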
Jun 27 '08 #15
