Bytes IT Community

Merging sorted lists/iterators/generators into one stream of values...

I need to merge several sources of values into one stream of values. All
of the sources are sorted already and I need to retrieve the values from
them all in sorted order.

In other words:
s1 = [10, 20, 30, 40, 50]
s2 = [15, 25]
s3 = [17, 27, 37]

for value in ???(s1, s2, s3):
print value

will print out 10, 15, 17, 20, 25, 27, 30, 37, 40, 50 in that order.

The sources are cursors retrieving data from several databases, not from
the same server, and there's a potential for a large number of rows from
several of the sources. As such, any method that would load it all into
memory and sort it is no good as it would use too much memory.

Is there a function or whatnot in Python that will do what I want? I
have come up with my own method but since it uses generators I'm not
sure how much overhead there is. Additionally, since the values
retrieved from the cursors will be dictionaries of "fieldname":value
pairs, the method would either need to handle that construct (and be
told what fieldname to sort on), or be able to take a function object to
use for the comparison operation.

Basically, I'll use my own function for this unless someone can point me
to something that is already available. Couldn't seem to find anything
in the builtin modules but I didn't find glob.glob when I was looking
for how to find filenames either so who knows what it's called :)

Since I need to deal with a variable list of sources (that is, 2 in one
application, 3 in another and 4-5 in a third), a simple 2-source method
isn't enough but if it's better than what I do for 2 sources then I can
make a wrapper for it, since that's what I do anyway.

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la***@vkarlsen.no
PGP KeyID: 0x2A42A1C2
Oct 6 '05 #1
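For readers finding this thread later: Python 2.6 added heapq.merge, which does exactly this, lazily merging already-sorted inputs without loading them into memory. A Python 3 sketch of the example above:

```python
import heapq

s1 = [10, 20, 30, 40, 50]
s2 = [15, 25]
s3 = [17, 27, 37]

# heapq.merge consumes the sources lazily, so large cursors are fine
merged = list(heapq.merge(s1, s2, s3))
print(merged)  # [10, 15, 17, 20, 25, 27, 30, 37, 40, 50]
```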
24 Replies


Lasse Vågsæther Karlsen wrote:
I need to merge several sources of values into one stream of values. All
of the sources are sorted already and I need to retrieve the values from
them all in sorted order.

<snip>

I doubt you'll find a prebuilt one; it's fairly easy to create your own,
after all. Generators are fairly fast constructs in Python, btw.
Here's what I whipped up in a few minutes:

def firstIter( value ):
    it = iter( value )
    try:
        return it.next(), it
    except StopIteration, err:
        return None, None

def inorder( comparison, *args ):
    iterators = [
        [value,it]
        for (value,it) in [ firstIter( arg ) for arg in args ]
        if it is not None
    ]
    iterators.sort()
    while iterators:
        yield iterators[0][0]
        try:
            value = iterators[0][0] = iterators[0][1].next()
        except StopIteration, err:
            iterators.pop(0)
        else:
            if len(iterators) > 1 and comparison( value, iterators[1][0] ) > 0:
                iterators.sort()
                continue

if __name__ == "__main__":
    s1 = [10, 20, 30, 40, 50]
    s2 = [15, 25]
    s3 = [17, 27, 37]
    s4 = []
    for value in inorder(cmp, s1, s2, s3, s4):
        print value
Anyway, have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

Oct 7 '05 #2

Ok, that one looks more sleek than what I came up with.

A couple of things I learned from your solution; please correct me if I
misunderstood something:

1. a list containing other lists will sort itself based on the first
element of the lists inside?
2. sort() and pop() are not costly operations?

Other than that you seem to not use the cmp operator but that's easily
fixed.

This one looks much better than mine, here's what I came up with:

def merge_sorted(iterables, comparer=None):
    """
    Generator that will take a list of iterables/generators that are
    individually sorted, and then produce values in sorted order by
    taking the lowest value from each source on each call.

    @param iterables: List of iterables/generators to retrieve values from
    @type iterables: List of iterables/generators

    @param comparer: Optional fn(v1, v2) function that can compare two
        values; should return <0 if v1<v2, >0 if v1>v2 and 0 if v1==v2.
    @type comparer: function-object, example: lambda x, y: x-y

    @note: The "list of iterables" can actually be anything that produces
        a list of iterables, so you can use a function that yields lists
        for instance.
    """

    # First convert whatever we're given into a list of sources
    iterables = [iterable for iterable in iterables]

    # This series of if-statements will determine how many sources we
    # have and work out sub-problems that are manageable.
    if len(iterables) != 2:
        if len(iterables) == 0:
            # List, but no sources
            pass
        elif len(iterables) == 1:
            # Only 1 source, just return its contents
            for value in iterables[0]:
                yield value
        elif len(iterables) == 3:
            # 3 sources, sub-divide into 0 <--> (1, 2)
            left_iterable = iterables[0]
            right_iterable = merge_sorted([iterables[1], iterables[2]],
                comparer)
            for value in merge_sorted([left_iterable, right_iterable],
                comparer):
                yield value
        elif len(iterables) == 4:
            # 4 sources, sub-divide into (0, 1) <--> (2, 3)
            left_iterable = merge_sorted([iterables[0], iterables[1]],
                comparer)
            right_iterable = merge_sorted([iterables[2], iterables[3]],
                comparer)
            for value in merge_sorted((left_iterable, right_iterable),
                comparer):
                yield value
        elif len(iterables) > 4:
            # >4 sources, sub-divide into (0, 1) <--> (2, ...)
            left_iterable = merge_sorted([iterables[0], iterables[1]],
                comparer)
            right_iterable = merge_sorted(iterables[2:], comparer)
            for value in merge_sorted((left_iterable, right_iterable),
                comparer):
                yield value
        raise StopIteration

    # The method will only get here if there are only two sources, which
    # is an easy case to handle
    i1 = iter(iterables[0])
    i2 = iter(iterables[1])

    # Grab the first two values from the two sources, if possible
    try:
        v1 = i1.next()
        has_v1 = True
    except StopIteration:
        has_v1 = False
    try:
        v2 = i2.next()
        has_v2 = True
    except StopIteration:
        has_v2 = False

    # As long as we got values from both generators/iterators/whatever,
    # compare and yield the lowest of the two, and then get the next
    # value from the corresponding source
    while has_v1 and has_v2:
        # Work out which of v1 and v2 comes first
        if comparer is not None:
            yield_v1 = comparer(v1, v2) <= 0
        else:
            yield_v1 = v1 <= v2

        # Yield the next value, then grab a new value from the
        # corresponding source
        if yield_v1:
            yield v1
            try:
                v1 = i1.next()
            except StopIteration:
                has_v1 = False
        else:
            yield v2
            try:
                v2 = i2.next()
            except StopIteration:
                has_v2 = False

    # When we get here, we got 3 possibilities:
    # 1. has_v1 == True, has_v2 == False --> yield rest of v1/i1 and
    #    just exit on StopIteration exception
    # 2. has_v1 == False, has_v2 == True --> yield rest of v2/i2 and
    #    just exit on StopIteration exception
    # 3. has_v1 == has_v2 == False --> while-loops will skip, function
    #    falls off the end
    while has_v1:
        yield v1
        v1 = i1.next()
    while has_v2:
        yield v2
        v2 = i2.next()

Oct 7 '05 #3

"Lasse Vågsæther Karlsen" <la*****@gmail.com> wrote:
Ok, that one looks more sleek than what I came up with.


For the most efficient and elegant solution, check out Raymond Hettinger's reply to:

http://aspn.activestate.com/ASPN/Coo.../Recipe/141934

George
Oct 7 '05 #4

Thanks, that looks like Mike's solution except that it uses the
built-in heapq module.

While this one contains less code than Mike's solution it seems to lack
the ability to control the comparison operation, which means it won't
work in my case. I need to both be able to sort on an arbitrary field
name (could be done using a list where I place the field first), and
also to be able to sort in a different order than smallest-first.

Perhaps by adjusting the data that is returned through each source
would do that. I'll look into it.

Oct 7 '05 #5
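Both of these needs (an arbitrary sort field and a non-ascending order) are covered directly in later Pythons: 3.5 added key= and reverse= parameters to heapq.merge. A sketch with made-up rows; the field name "n" is just an illustration:

```python
import heapq

# Two sources, each already sorted descending on the "n" field
rows_a = [{"name": "x", "n": 50}, {"name": "y", "n": 30}]
rows_b = [{"name": "z", "n": 45}]

# key= selects the sort field; reverse=True merges descending inputs
merged = list(heapq.merge(rows_a, rows_b, key=lambda r: r["n"], reverse=True))
print([r["n"] for r in merged])  # [50, 45, 30]
```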

"Lasse Vågsæther Karlsen" <la*****@gmail.com> wrote:
Thanks, that looks like Mike's solution except that it uses the
built-in heapq module.
This makes a big difference for the algorithmic complexity; replacing an item in a heap is much more
efficient than sorting the whole list.
While this one contains less code than Mike's solution it seems to lack
the ability to control the comparison operation, which means it won't
work in my case. I need to both be able to sort on an arbitrary field
name (could be done using a list where I place the field first), and
also to be able to sort in a different order than smallest-first.

Perhaps by adjusting the data that is returned through each source
would do that. I'll look into it.


Yes, it's a little inconvenient that the builtin heap doesn't take a comparison operation but you
can easily roll your own heap by transforming each item to a (key,item) tuple. Now that I'm thinking
about it, it might be a good addition to the cookbook.

George
Oct 7 '05 #6
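The (key, item) decoration George describes can be sketched like this; the rows and the tie-breaking index are my own illustration, not from the thread:

```python
import heapq

rows = [{"name": "b", "n": 3}, {"name": "a", "n": 1}, {"name": "c", "n": 2}]

# Decorate each row with its sort key; the index breaks ties so the
# heap never needs to compare the dicts themselves.
heap = [(row["n"], i, row) for i, row in enumerate(rows)]
heapq.heapify(heap)

ordered = [heapq.heappop(heap)[2]["name"] for _ in range(len(rows))]
print(ordered)  # ['a', 'c', 'b']
```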

Lasse Vågsæther Karlsen wrote:
Ok, that one looks more sleek than what I came up with.

Couple of things I learn from your solution, please correct me if I
misunderstood something:

1. list containing other lists will sort itself based on first element
on lists inside ?

Correct, comparison is recursive for lists/tuples.
2. sort(), pop() is not costly operations

They *can* be costly, but the algorithm reduces the number of times they
are called to less-than-linear for all but pathological cases (i.e. they
are only called when you need to switch streams). It also only requires
sorting the small list of streams, rather than the whole data set.
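Point 1 is easy to check in an interpreter, since sequence comparison is lexicographic, element by element:

```python
pairs = [[3, "a"], [1, "b"], [1, "a"]]
# Inner lists compare by first element, then by second on ties
print(sorted(pairs))  # [[1, 'a'], [1, 'b'], [3, 'a']]
```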
Other than that you seem to not use the cmp operator but that's easily
fixed.

Sorry about that, revised version below:

def firstIter( value ):
    it = iter( value )
    try:
        return it.next(), it
    except StopIteration, err:
        return None, None

def cmpFirstWith( comparison ):
    def compare( a,b ):
        return comparison( a[0], b[0] )
    return compare

def inorder( comparison, *args ):
    iterators = [
        [value,it]
        for (value,it) in [ firstIter( arg ) for arg in args ]
        if it is not None
    ]
    iterators.sort( cmpFirstWith( comparison ) )
    while iterators:
        yield iterators[0][0]
        try:
            value = iterators[0][0] = iterators[0][1].next()
        except StopIteration, err:
            iterators.pop(0)
        else:
            if len(iterators) > 1 and comparison( value, iterators[1][0] ) > 0:
                iterators.sort( cmpFirstWith( comparison ) )
                continue

if __name__ == "__main__":
    s1 = [10, 20, 30, 40, 50]
    s2 = [15, 25]
    s3 = [17, 27, 37]
    s4 = []
    for value in inorder(cmp, s1, s2, s3, s4):
        print value
    s1 = [{'a':'b'}, {'a':'e'}]
    s2 = [{'a':'d'}, {'a':'z'}]
    def comp( a,b ):
        return cmp( a['a'], b['a'] )
    for value in inorder(comp, s1, s2 ):
        print value

Have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

Oct 7 '05 #7
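A side note for anyone porting this style of code: Python 3 dropped cmp-style arguments entirely, but functools.cmp_to_key adapts a comparer like the comp function above into a key function. A sketch, not from the thread:

```python
from functools import cmp_to_key

def comp(a, b):
    # Classic cmp contract: negative, zero or positive
    return (a["a"] > b["a"]) - (a["a"] < b["a"])

rows = [{"a": "d"}, {"a": "b"}, {"a": "z"}]
rows.sort(key=cmp_to_key(comp))
print([r["a"] for r in rows])  # ['b', 'd', 'z']
```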

Lasse Vågsæther Karlsen wrote:
I need to merge several sources of values into one stream of values. All
of the sources are sorted already and I need to retrieve the values from

<snip>

Ok, after working through the various sources and solutions, here's what
I finally ended up with:

def merge_sorted(comparison, *sources):
    iterables = []
    for source in sources:
        try:
            source = iter(source)
            iterables.append([source.next(), source])
        except StopIteration:
            pass

    iterables.sort(cmp=comparison, key=lambda x: x[0])
    while iterables:
        yield iterables[0][0]
        try:
            iterables[0][0] = iterables[0][1].next()
            if len(iterables) > 1 and comparison(iterables[0][0],
                    iterables[1][0]) > 0:
                iterables.sort(comparison, key=lambda x: x[0])
        except StopIteration:
            iterables.pop(0)

Thanks to Mike and George for the solutions and pointers.

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la***@vkarlsen.no
PGP KeyID: 0x2A42A1C2
Oct 7 '05 #8

George Sakkis <gs*****@rutgers.edu> wrote:
"Lasse Vågsæther Karlsen" <la*****@gmail.com> wrote:
Thanks, that looks like Mike's solution except that it uses the
built-in heapq module.
This makes a big difference for the algorithmic complexity; replacing an
item in a heap is much more efficient than sorting the whole list.


In the most general case, yes. However, Python's sort ("timsort") is
preternaturally fast when sorting sequences that are mostly sorted
except for maybe one element being in the wrong place... try it (and
read the Timbot's article included in Python's sources, and the sources
themselves)... I suspect that heapq will still be faster, but by
nowhere as much as one might think.

Yes, it's a little inconvenient that the builtin heap doesn't take a
comparison operation but you can easily roll your own heap by transforming
each item to a (key,item) tuple. Now that I'm thinking about it, it might
be a good addition to the cookbook.


I believe Python 2.5 adds a key= argument to heapq's functions...
Alex
Oct 8 '05 #9

Alex Martelli wrote:
try it (and read the Timbot's article included in Python's sources, and the
sources themselves)...


Just a reading tip. The translated PyPy source
pypy/objectspace/listsort.py might be more accessible than the
corresponding C code.

Kay

Oct 8 '05 #10

Kay Schluehr wrote:
Alex Martelli wrote:
try it (and read the Timbot's article included in Python's sources, and the
sources themselves)...


Just a reading tip. The translated PyPy source
pypy/objectspace/listsort.py might be more accessible than the
corresponding C code.


indeed. it is at

http://codespeak.net/svn/pypy/dist/p...td/listsort.py

Cheers,

Carl Friedrich Bolz

Oct 8 '05 #11

[Alex Martelli]
try it (and read the Timbot's article included in Python's sources, and the
sources themselves)...

[Kay Schluehr] Just a reading tip. The translated PyPy source
pypy/objectspace/listsort.py might be more accessible than the
corresponding C code.

[cfbolz] indeed. it is at

http://codespeak.net/svn/pypy/dist/p...td/listsort.py


While the Python version is certainly easier to read, I believe Alex
had in mind the detailed English _explanation_ of the algorithm:

http://cvs.sf.net/viewcvs.py/python/...s/listsort.txt

It's a complex algorithm, dripping with subtleties that aren't
apparent from staring at an implementation.

Note that if a list has N elements, sorting it requires at least N-1
comparisons, because that's the minimum number of compares needed
simply to determine whether or not it's already sorted. A heap-based
priority queue never requires more than O(log(N)) compares to push or
pop an element. If N is small, it shouldn't make much difference. As
N gets larger, the advantage of a heap grows without bound.
Oct 8 '05 #12
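The per-operation bound mentioned above can be made visible by counting comparisons with a small wrapper class (my illustration, not from the thread):

```python
import heapq

class Counted:
    """Int wrapper that counts how often it is compared."""
    comparisons = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.comparisons += 1
        return self.v < other.v

heap = [Counted(v) for v in range(16)]
heapq.heapify(heap)

# Pushing a new smallest element walks at most the height of the heap
Counted.comparisons = 0
heapq.heappush(heap, Counted(-1))
print(Counted.comparisons)  # a handful, bounded by the heap's height
```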

Alex Martelli wrote:
George Sakkis <gs*****@rutgers.edu> wrote:

<snip>
Yes, it's a little inconvenient that the builtin heap doesn't take a
comparison operation but you can easily roll your own heap by transforming
each item to a (key,item) tuple. Now that I'm thinking about it, it might
be a good addition to the cookbook.

I believe Python 2.5 adds a key= argument to heapq's functions...

<snip>

I will revisit the heapq solution when 2.5 is released then.

Thanks for the heads up. For the moment I will stay with the list
solution that Mike came up with, slightly changed to accommodate tips and
pointers from others in this thread.

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la***@vkarlsen.no
PGP KeyID: 0x2A42A1C2
Oct 8 '05 #13

"Lasse Vågsæther Karlsen" <la***@vkarlsen.no> wrote:
Alex Martelli wrote:
George Sakkis <gs*****@rutgers.edu> wrote:

<snip>
Yes, it's a little inconvenient that the builtin heap doesn't take a
comparison operation but you can easily roll your own heap by transforming
each item to a (key,item) tuple. Now that I'm thinking about it, it might
be a good addition to the cookbook.

I believe Python 2.5 adds a key= argument to heapq's functions...

<snip>

I will revisit the heapq solution when 2.5 is released then.

Thanks for the heads up. For the moment I will stay with the list
solution that Mike came up with, slightly changed to accommodate tips and
pointers from others in this thread.


Just added a recipe at http://aspn.activestate.com/ASPN/Coo.../Recipe/440673. You can try
both and see if there's any significant performance difference for your data.

George
Oct 8 '05 #14

George Sakkis wrote:
<snip>
Just added a recipe at http://aspn.activestate.com/ASPN/Coo.../Recipe/440673. You can try
both and see if there's any significant performance difference for your data.

<snip>

Thanks, will take a look at it later. The sort solution seems to work
nicely. There might be a big difference if I have a lot of sources, though,
as I bet the overhead of sorting N items gets higher than that of
manipulating a heap to place an item in the right spot; but with 4-5
or a few more sources it might not make an impact at all.

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la***@vkarlsen.no
PGP KeyID: 0x2A42A1C2
Oct 8 '05 #15

"Lasse Vågsæther Karlsen" <la***@vkarlsen.no> wrote:
George Sakkis wrote:
<snip>
Just added a recipe at http://aspn.activestate.com/ASPN/Coo.../Recipe/440673. You can try both and see if there's any significant performance difference for your data.

<snip>

Thanks, will take a look at it later. The sort solution seems to work
nicely. There might be a big difference if I have a lot of sources, though,
as I bet the overhead of sorting N items gets higher than that of
manipulating a heap to place an item in the right spot; but with 4-5
or a few more sources it might not make an impact at all.


Unless you're talking about hundreds or thousands of sources, it probably
won't. I would still go for the heap solution since IMO the resulting
code is more readable and easier to understand.

George

Oct 8 '05 #16

George Sakkis <gs*****@rutgers.edu> wrote:
...
manipulation of a heap to place an item in the right spot, but with 4-5
or a few more sources might not make an impact at all.


Unless you're talking about hundreds or thousands of sources, it probably
won't. I would still go for the heap solution since IMO the resulting
code is more readable and easier to understand.


I'm not so sure about either sentence...:

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=S
Best time for 10 loops: 0.247116088867

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=H
Best time for 10 loops: 0.10344004631

i.e., a heap solution may be over 4 times faster than a sort-based one
(in the following implementations). Readability seems quite comparable
(skipping the rest of the infrastructure, which generates random sorted
streams and ensures a stream is exhausted and verifies it etc etc):

def merge_by_sort(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    while sources:
        sources.sort(reverse=True)
        best_source = sources[-1]
        yield best_source[0]
        try: best_source[0] = best_source[-1]()
        except StopIteration: sources.pop()

def merge_by_heap(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    heapq.heapify(sources)
    while sources:
        best_source = sources[0]
        yield best_source[0]
        try: best_source[0] = best_source[-1]()
        except StopIteration: heapq.heappop(sources)
        else: heapq.heapreplace(sources, best_source)
Hmmm, I wonder if something like merge_by_heap would be a good candidate
for itertools. Raymond...?
Alex
Oct 10 '05 #17
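Alex's test harness wasn't posted; here is a Python 3 adaptation of his merge_by_sort with stand-in random streams (the stream sizes and names are my own), which at least lets the correctness claim be checked:

```python
import random

def merge_by_sort(streams):
    # [value, tie-breaking index, iterator] per live stream
    sources = [[next(s), i, s] for i, s in enumerate(map(iter, streams))]
    while sources:
        sources.sort(reverse=True)   # smallest value ends up last
        best = sources[-1]
        yield best[0]
        try:
            best[0] = next(best[2])
        except StopIteration:
            sources.pop()

streams = [sorted(random.randrange(1000) for _ in range(100))
           for _ in range(10)]
out = list(merge_by_sort(streams))
assert out == sorted(v for s in streams for v in s)
```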

"Alex Martelli" <al***@mail.comcast.net> wrote:
George Sakkis <gs*****@rutgers.edu> wrote:
...
manipulation of a heap to place an item in the right spot, but with 4-5
or a few more sources might not make an impact at all.
Unless you're talking about hundreds or thousands of sources, it probably
won't. I would still go for the heap solution since IMO the resulting
code is more readable and easier to understand.


I'm not so sure about either sentence...:

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=S
Best time for 10 loops: 0.247116088867

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=H
Best time for 10 loops: 0.10344004631

i.e., a heap solution may be over 4 times faster than a sort-based one
(in the following implementations).


Interesting; I thought timsort on small, almost-ordered lists would be practically as fast as the
heap. Still, how is 0.10344 over 4 times faster than 0.247116?
Readability seems quite comparable
(skipping the rest of the infrastructure, which generates random sorted
streams and ensures a stream is exhausted and verifies it etc etc):

def merge_by_sort(streams):
sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
while sources:
sources.sort(reverse=True)
best_source = sources[-1]
yield best_source[0]
try: best_source[0] = best_source[-1]()
except StopIteration: sources.pop()

def merge_by_heap(streams):
sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
heapq.heapify(sources)
while sources:
best_source = sources[0]
yield best_source[0]
try: best_source[0] = best_source[-1]()
except StopIteration: heapq.heappop(sources)
else: heapq.heapreplace(sources, best_source)
Indeed, these are almost equivalent as far as readability goes; the previous examples in the thread
were less clear. By the way, why do you need 'i' and enumerate above?
Hmmm, I wonder if something like merge_by_heap would be a good candidate
for itertools. Raymond...?
Alex


Yes, it would be nice to make it into 2.5.

George
Oct 10 '05 #18

Alex Martelli wrote:
George Sakkis <gs*****@rutgers.edu> wrote:
...

manipulation of a heap to place an item in the right spot, but with 4-5
or a few more sources might not make an impact at all.

Unless you're talking about hundreds or thousands of sources, it probably
won't. I would still go for the heap solution since IMO the resulting
code is more readable and easier to understand.

....
i.e., a heap solution may be over 4 times faster than a sort-based one
(in the following implementations). Readability seems quite comparable
(skipping the rest of the infrastructure, which generates random sorted
streams and ensures a stream is exhausted and verifies it etc etc):

One thing to keep in mind (if you care about performance) is that
one could use bisect, instead of sort, as the sorted list of streams is
already in order save for the one element you are processing. Btw, nice
trick with reverse to reduce memory copying, when did that get
introduced? Wonder if bisect can deal with reverse-sorted elements.
Anyway, should reduce the O-complexity of that part of the operation,
though you'll still have to do a memcpy to shift the rest of the source
list's array, and if it can't deal with reverse-sorted lists it would
move you back to front-of-list popping.

Oh, we're still missing use of a comparison function in both versions.
I'd think you'd want that functionality *available* if you're going to
make this a general tool. You'll also need to check for StopIteration
on creation of sources for null sequences. Finally, why the 'i'
element? It's never used AFAICS.
def merge_by_sort(streams):
sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
while sources:
sources.sort(reverse=True)
best_source = sources[-1]
yield best_source[0]
try: best_source[0] = best_source[-1]()
except StopIteration: sources.pop()

....

Have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

Oct 10 '05 #19
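The bisect idea Mike raises, sketched on an ascending list of (value, stream-index) heads (the tuples are my illustration); note bisect finds the slot in O(log N) comparisons, but the insert itself still shifts elements:

```python
import bisect

# (value, stream-index) heads, kept ascending
heads = [(10, 0), (15, 1), (17, 2)]

value, idx = heads.pop(0)   # take the current smallest: (10, 0)
# Suppose stream 0's next value is 20: reinsert its head in order
bisect.insort(heads, (20, 0))
print(heads)  # [(15, 1), (17, 2), (20, 0)]
```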

George Sakkis <gs*****@rutgers.edu> wrote:
...
i.e., a heap solution may be over 4 times faster than a sort-based one
(in the following implementations).
Interesting; I thought timsort on small almost ordered lists would be
practically as fast as the heap. Still, how is 0.10344 over 4 times faster
than 0.247116 ?


Oops, 2.5 times (for 10 streams) -- the ratio keeps getting bigger with
the number of streams, the figure 4 I had in mind was probably from
comparisons with 50 streams or so.

timsort needs N comparisons even if the list is already ordered (just to
check it is) while heaps can do with log(N) comparisons or even less in
the best case -- that's the key issue here.
sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]

... Indeed, these are almost equivalent as far as readability goes; the
previous examples in the thread were less clear. By the way, why do you
need 'i' and enumerate above ?


Making sure that, if the first items are identical, the sorting (or
comparison for heap) does not compare the _methods_ -- no big deal if it
does, but that doesn't logically make sense. If we had key= arguments,
that would ENSURE no such comparison, but heaps didn't support that in
2.4 and I wanted to keep the playing field level, so I used the above
old trick to ensure stable sorting without comparisons beyond a given
field (more details were explained in the 1st edition of the Cookbook,
but I hope I give the general idea).
Alex
Oct 10 '05 #20

Mike C. Fletcher <mc******@rogers.com> wrote:
...
One thing to keep in mind (if you care about performance) is that
one could use bisect, instead of sort, as the sorted list of streams is
already in order save for the one element you are processing. Btw, nice
trick with reverse to reduce memory copying, when did that get
introduced?
Python 2.4
Wonder if bisect can deal with reverse-sorted elements.
Not AFAIK.
Anyway, should reduce the O-complexity of that part of the operation,
though you'll still have to do a memcpy to shift the rest of the source
list's array, and if it can't deal with reverse-sorted lists it would
move you back to front-of-list popping.
Yeah, if you have to pop the front you're O(N) anyway, which is
basically what timsort does with nearly-ordered lists (though timsort
uses O(N) _comparisons_ while bisecting would use O(N) _moves_).

Oh, we're still missing use of a comparison function in both versions.
Trivial, just have the functions accept a key= and use that to prepare
the auxiliary list they're using anyway.
I'd think you'd want that functionality *available* if you're going to
make this a general tool. You'll also need to check for StopIteration
on creation of sources for null sequences.
True, the functions as prepared don't accept empty streams (exactly
because they don't specialcase StopIteration on the first calls to
next). Pretty trivial to remedy, of course.
Finally, why the 'i'
element? It's never used AFAICS.


It's used by the list's lexicographic comparison when the first elements
of two lists being compared are equal (which can happen, of course) to
avoid comparing the last elements (which are methods... no big deal if
they get compared, but it makes no logical sense).

So, an example enhanced merge_by_sort:
def merge_by_sort(streams, key=None):

    if not key: key = lambda x: x
    sources = []
    for i, s in enumerate(streams):
        try: first_item = s.next()
        except StopIteration: pass
        else: sources.append([key(first_item), i, first_item, s.next])

    while sources:
        sources.sort(reverse=True)
        best_source = sources[-1]
        yield best_source[2]
        try: best_source[2] = best_source[-1]()
        except StopIteration: sources.pop()
        else: best_source[0] = key(best_source[2])
Of course, since the sort method DOES accept a key= parameter, this
could be simplified, but I'll leave it like this to make it trivial to
see how to recode the merging by heap as well (in 2.4)...
Alex

Oct 10 '05 #21

<snip>

Another idea for this method: in some cases I noticed that it was useful
to know which source each element came from, as well as to remove
duplicates from the results.

For instance

s1 = [1, 3, 5, 7]
s2 = [2, 3, 4]

for k, s in merge_by_sort(s1, s2):
print k, "from source", s

this would print:

1 from source 0
2 from source 1
3 from source 1
3 from source 0
4 from source 1
5 from source 0
7 from source 0

and the above list has 3 twice, so possibly:

1 from sources [0]
2 from sources [1]
3 from sources [0, 1]
4 from sources [1]
5 from sources [0]
7 from sources [0]

This latter one would be a slightly heavier method as it would have
to compare the first elements of all N sources to figure out which
indices to yield as well.

However, the previous solution could be:

def merge_by_sort(*sources, **options):
    if "cmp" in options:
        comparison = options["cmp"]
    else:
        comparison = cmp

    iterables = []
    for index, source in enumerate(sources):
        try:
            source = iter(source)
            iterables.append([source.next(), index, source])
        except StopIteration:
            pass

    iterables.sort(cmp=comparison, key=lambda x: x[0], reverse=True)
    while iterables:
        yield iterables[-1][0], iterables[-1][1]
        try:
            iterables[-1][0] = iterables[-1][2].next()
            if len(iterables) > 1 and comparison(iterables[-1][0],
                    iterables[-2][0]) > 0:
                iterables.sort(comparison, key=lambda x: x[0], reverse=True)
        except StopIteration:
            iterables.pop(-1)

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la***@vkarlsen.no
PGP KeyID: 0x2A42A1C2
Oct 11 '05 #22
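In later Pythons the which-source variant falls out of heapq.merge (2.6+) by tagging each value with its source index; a sketch (note that ties then break by source index, so source 0's 3 comes before source 1's 3 here):

```python
import heapq

s1 = [1, 3, 5, 7]
s2 = [2, 3, 4]

# Tag every value with the index of the stream it came from
tagged = [[(v, i) for v in s] for i, s in enumerate([s1, s2])]
merged = list(heapq.merge(*tagged))
print(merged)  # [(1, 0), (2, 1), (3, 0), (3, 1), (4, 1), (5, 0), (7, 0)]
```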

Lasse Vågsæther Karlsen wrote:
<snip>

Another idea for this method: in some cases I noticed that it was useful
to know which source each element came from, as well as to remove
duplicates from the results.

<snip>

The "removing duplicates" problem would probably be best as a separate
function and it occurs to me that perhaps Python has such a function
already.

Again, this function would need the following criteria:

1. Be able to iterate through something other than a list
2. Yield the values, not return a list
3. Take an arbitrary cmp function to determine what is a duplicate

As sugar, perhaps also the following criteria:

- Ability to "combine" the duplicates through a special function

A simple first-version function I hacked together does this:

def unique(source, cmp=cmp, key=None, combine=None):
    it = iter(source)
    value = it.next()
    values = [value]
    while True:
        try:
            value = it.next()
            if key is not None:
                cmp_result = cmp(values[0][key], value[key])
            else:
                cmp_result = cmp(values[0], value)
            if cmp_result == 0:
                values.append(value)
            else:
                if combine is not None:
                    yield combine(values)
                else:
                    yield values[0]
                values = [value]
        except StopIteration:
            if combine is not None:
                yield combine(values)
            else:
                yield values[0]
            break

Note that this function does not do any sorting so if the source that it
gets the values from is not sorted, the result will be very wrong. This
is again due to my criteria of being able to handle cursors retrieving
data from a database and thus avoid loading everything into memory.

The combine function is passed a list of "duplicate" values and must
return a value that will be yielded out of unique.

Example of usage:

def sum_counts(values):
    value = values[0][0]
    sum = 0
    for row in values:
        sum += row[1]
    return value, sum

fruits = [["Apple", 10], ["Apple", 15], ["Banana", 23], ["Orange", 17],
          ["Orange", 17]]
for fruit, total_sum in unique(fruits, key=0, combine=sum_counts):
    print fruit, "has a sum of", total_sum

This will produce:

Apple has a sum of 25
Banana has a sum of 23
Orange has a sum of 34

Function name is perhaps not the best one. It occurs to me that this is
the GROUP BY function in SQL so perhaps a different name is better, but
then again this might be moot if such a function already exists somewhere :)

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la***@vkarlsen.no
PGP KeyID: 0x2A42A1C2
Oct 11 '05 #23

> Function name is perhaps not the best one. It occurs to me that this
is the GROUP BY function in SQL so perhaps a different name is better, but
then again this might be moot if such a function already exists somewhere :)


Amazing, you keep reinventing things, even with the exact same name :)

from itertools import imap,groupby
from operator import itemgetter

for fruit,group in groupby(fruits, itemgetter(0)):
    print fruit, "has a sum of", sum(imap(itemgetter(1),group))

For this to work as intended, fruits has to be already sorted by the
same key given to groupby; otherwise just replace fruits with
sorted(fruits, key=itemgetter(0)).

By the way, read all the functions in the itertools module
(http://docs.python.org/lib/itertools-functions.html), it will save you
a lot of time.

George

Oct 11 '05 #24
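George's snippet, restated as runnable Python 3 for the record (print is a function there and imap is gone, plain sum with a generator does the job):

```python
from itertools import groupby
from operator import itemgetter

fruits = [["Apple", 10], ["Apple", 15], ["Banana", 23],
          ["Orange", 17], ["Orange", 17]]

# groupby requires its input to be sorted on the grouping key already
totals = {fruit: sum(n for _, n in group)
          for fruit, group in groupby(fruits, key=itemgetter(0))}
print(totals)  # {'Apple': 25, 'Banana': 23, 'Orange': 34}
```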

George Sakkis wrote:
Function name is perhaps not the best one. It occurs to me that this
is the GROUP BY in SQL so perhaps a different name is better, but
then again this might be moot if such a already exists somewhere :)

Amazing, you keep reinventing things, even with the exact same name :)

from itertools import imap,groupby
from operator import itemgetter

for fruit,group in groupby(fruits, itemgetter(0)):
print fruit, "has a sum of", sum(imap(itemgetter(1),group))

For this to work as intended, fruits has to be already sorted by the
same key given to groupby; otherwise just replace fruits with
sorted(fruits, key=itemgetter(0)).

By the way, read all the functions in the itertools module
(http://docs.python.org/lib/itertools-functions.html), it will save you
a lot of time.

George


Itertools, meh, there's that wheel again :)

Didn't know about this one so thank you :)

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la***@vkarlsen.no
PGP KeyID: 0x2A42A1C2
Oct 11 '05 #25
