Merging ordered lists

Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # descending length (Python 2 positional sort args)
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref[i] != s) and s not in ref:
                ref.insert(ref.index(src[i-1]) + 1, s)
    # Remove duplicates, keeping the last occurrence of each element
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

This came up while using the csv module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
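
For concreteness, a sketch of how it gets called (the field names here are made up):

field_lists = [
    ["name", "email", "phone"],
    ["name", "phone", "fax"],
    ["name", "email"],
]
print merge_to_unique(field_lists)
# -> ['name', 'email', 'phone', 'fax'] (input order, not alphabetical)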
Jun 1 '08 #1
On May 31, 10:00 pm, etal <eric.talev...@gmail.com> wrote:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?
I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Coo.../Recipe/491285
http://aspn.activestate.com/ASPN/Coo.../Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]
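
For what it's worth, once everything fits in memory the standard library's heapq.merge (new in Python 2.6) can stand in for the imerge recipe, provided each input list is already sorted. A minimal sketch with made-up inputs:

import heapq
from itertools import groupby

sources = [[1, 2, 4], [1, 3, 4], [2, 5]]  # made-up, already-sorted inputs
result = [k for k, g in groupby(heapq.merge(*sources))]
# -> [1, 2, 3, 4, 5]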
Raymond
Jun 1 '08 #2
etal wrote:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # descending length (Python 2 positional sort args)
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref[i] != s) and s not in ref:
                ref.insert(ref.index(src[i-1]) + 1, s)
    # Remove duplicates, keeping the last occurrence of each element
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

This came up while using the csv module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
# untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        else:
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return reduce(merge, sorted(sources, key=len, reverse=True))

Peter
Jun 1 '08 #3
Hi!

Use set (union).
Example:

la=[2,1,3,5,4,6]
lb=[2,8,6,4,12]

#compact:
print list(set(la).union(set(lb)))

#detail:
s1 = set(la)
s2 = set(lb)
s3 = s1.union(s2)
print list(s3)
@-salutations

Michel Claveau

Jun 1 '08 #4
Peter Otten wrote:
#untested
Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
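
A quick sanity check with made-up lists, showing that input order survives the merge:

sources = [["a", "b", "d"], ["a", "c", "d"], ["b", "e"]]
print merge_to_unique(sources)
# should print ['a', 'b', 'c', 'd', 'e'] -- order from the inputs, duplicates gone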
Jun 1 '08 #5
etal wrote:
Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
Looks as if set does it for you.

--
Taekyon
Jun 1 '08 #7
On Jun 1, 1:49 am, Peter Otten <__pete...@web.de> wrote:
Peter Otten wrote:
#untested

Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))
difflib.SequenceMatcher looks promising; I'll try it. Thanks!

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?

(Secret: the reference list (or, any of the sources) is unlikely to be
more than a few dozen elements long. The data set that puts
merge_to_unique through a workout will be a giant list of
comparatively short lists, so the unique() part just needs to be short
and conceptually clean, while merge() should attempt sane behavior for
large len(sources).)
Jun 27 '08 #12
On Jun 1, 12:34 am, Raymond Hettinger <pyt...@rcn.com> wrote:
I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Co.../Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]

Raymond
Thanks for the tip; itertools never ceases to amaze. One issue:
groupby doesn't seem to remove all duplicates, just consecutive ones
(for lists of strings and integers, at least):
>>> [k for k, g in itertools.groupby(list("asdfdfffdf"))]
['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']
Another issue: dropping everything into a heap and pulling it back out
looks like it loses the original ordering, which isn't necessarily
alphabetical, but "however the user wants to organize the
spreadsheet". That's why I originally avoided using
sorted(set(itertools.chain(*sources))). Do you see another way around
this?
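
For comparison, sorting first does let groupby drop every duplicate, but then the user's column order is lost, which is the whole point:

>>> [k for k, g in itertools.groupby(sorted("asdfdfffdf"))]
['a', 'd', 'f', 's']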
Jun 27 '08 #13
etal wrote:
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?
Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

def unique_lc(ref):
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]

sample_a = range(1000)
sample_b = range(1000)
import random
for i in random.sample(sample_b, 10):
    sample_b[i] = 0

$ python -m timeit -s"from unique import unique as u, sample_a as s" "u(s)"
10000 loops, best of 3: 52.8 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_a as s" "u(s)"
10 loops, best of 3: 44.2 msec per loop

3 orders of magnitude for the shortcut.

$ python -m timeit -s"from unique import unique as u, sample_b as s" "u(s)"
1000 loops, best of 3: 646 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_b as s" "u(s)"
10 loops, best of 3: 39 msec per loop

Still 2 orders of magnitude if my unique() does some actual work.

Peter
Jun 27 '08 #14
On Jun 3, 1:22 am, Peter Otten <__pete...@web.de> wrote:
Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
Yikes... and the list comprehension looked so innocent. I had resigned
myself to O(n) lookup, but you and Raymond appear to agree on the
basic concept for unique() -- check set membership, then generate the
final sequence and update the set based on that.
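
For the record, the usual one-pass "seen set" idiom does the same thing in a shape closer to what I had in mind -- order-preserving, with O(1) lookups:

def unique(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result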
Jun 27 '08 #15
