
itertools.groupby

Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example:
import itertools

mylist = ['a', 1, 'b', 2, 3, 'c']

def isString(x):
    s = str(x)
    if s == x:
        return True
    else:
        return False

uniquekeys = []
groups = []
for k, g in itertools.groupby(mylist, isString):
    uniquekeys.append(k)
    groups.append(list(g))

print uniquekeys
print groups

--output:--
[True, False, True, False, True]
[['a'], [1], ['b'], [2, 3], ['c']]

May 27 '07 #1
13 Replies


On May 27, 11:28 am, Steve Howell <showel...@yahoo.com> wrote:
--- 7stud <bbxx789_0...@yahoo.com> wrote:
Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example: [...]

The groupby method has its uses, but its behavior is
going to be very surprising to anybody that has used
the "group by" syntax of SQL, because Python's groupby
method will repeat groups if your data is not sorted,
whereas SQL has the luxury of (knowing that it's)
working with a finite data set, so it can provide the
more convenient semantics.

The groupby method has its uses
I'd settle for a simple explanation of what it does in python.

May 27 '07 #2

On Sun, 2007-05-27 at 10:17 -0700, 7stud wrote:
Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example:
import itertools

mylist = ['a', 1, 'b', 2, 3, 'c']

def isString(x):
    s = str(x)
    if s == x:
        return True
    else:
        return False

uniquekeys = []
groups = []
for k, g in itertools.groupby(mylist, isString):
    uniquekeys.append(k)
    groups.append(list(g))

print uniquekeys
print groups

--output:--
[True, False, True, False, True]
[['a'], [1], ['b'], [2, 3], ['c']]
The so-called example you're quoting from the docs is not an actual
example of using itertools.groupby, but suggested code for how you can
store the grouping if you need to iterate over it twice, since iterators
are in general not repeatable.

As such, 'uniquekeys' lists the key values that correspond to each group
in 'groups'. groups[0] is the list of elements grouped under
uniquekeys[0], groups[1] is the list of elements grouped under
uniquekeys[1], etc. You are getting surprising results because your data
is not sorted by the group key. Your group key alternates between True
and False.
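
A minimal sketch of that point, written for Python 3 rather than the Python 2 used in this thread. The name is_string is a stand-in for the OP's isString, and sorting by the key function is just one way to make equal keys adjacent:

```python
from itertools import groupby

def is_string(x):
    return isinstance(x, str)

mylist = ['a', 1, 'b', 2, 3, 'c']

# Unsorted input: a new group starts every time the key changes.
unsorted_keys = [k for k, g in groupby(mylist, key=is_string)]

# Sorted by the same key function: each key value appears exactly once.
# sorted() is stable, so items keep their relative order within a group.
ordered = sorted(mylist, key=is_string)
unique_keys = []
groups = []
for k, g in groupby(ordered, key=is_string):
    unique_keys.append(k)
    groups.append(list(g))

print(unsorted_keys)  # [True, False, True, False, True]
print(unique_keys)    # [False, True]
print(groups)         # [[1, 2, 3], ['a', 'b', 'c']]
```

Only after the sort do the keys live up to the name 'uniquekeys'.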

Maybe you need to explain to us what you're actually trying to do.
User-supplied comments to the documentation won't help with that.

Regards,

--
Carsten Haese
http://informixdb.sourceforge.net
May 27 '07 #3

7stud wrote:
Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example:
This is my first exposure to this function, and I see that it does have
some uses in my code. I agree that it is confusing, however.
IMO the confusion could be lessened if the function with the current
behavior were renamed 'telescope' or 'compact' or 'collapse' or
something (since it collapses the iterable linearly over homogeneous
sequences.)
A function named groupby could then have what I think is the clearly
implied behavior of creating just one iterator for each unique type of
thing in the input list, as categorized by the key function.
May 28 '07 #4

Gordon Airporte <JH*****@fbi.gov> writes:
This is my first exposure to this function, and I see that it does
have some uses in my code. I agree that it is confusing, however.
IMO the confusion could be lessened if the function with the current
behavior were renamed 'telescope' or 'compact' or 'collapse' or
something (since it collapses the iterable linearly over homogeneous
sequences.)
It chops up the iterable into a bunch of smaller ones, but the total
size ends up the same. "Telescope", "compact", "collapse" etc. make
it sound like the output is going to end up smaller than the input.

There is also a dirty secret involved <wink>, which is that the
itertools functions (including groupby) are mostly patterned after
similarly named functions in the Haskell Prelude, which do about the
same thing. They are aimed at helping a similar style of programming,
so staying with similar names IMO is a good thing.
A function named groupby could then have what I think is the clearly
implied behavior of creating just one iterator for each unique type of
thing in the input list, as categorized by the key function.
But that is what groupby does, except its notion of uniqueness is
limited to contiguous runs of elements having the same key.
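
To make the difference concrete, here is a rough Python 3 sketch that puts the two behaviors side by side. group_all is a hypothetical helper, not anything in itertools; it is only an assumption about what "one group per unique key" might look like:

```python
from collections import defaultdict
from itertools import groupby

data = [1, 1, 2, 2, 1, 3, 3]

# groupby: one group per contiguous run, so key 1 shows up twice.
runs = [(k, list(g)) for k, g in groupby(data)]

# Hypothetical helper (not part of itertools): one group per distinct key.
# It has to consume the whole input before it can return anything.
def group_all(iterable, key=lambda x: x):
    buckets = defaultdict(list)
    for item in iterable:
        buckets[key(item)].append(item)
    return dict(buckets)

print(runs)             # [(1, [1, 1]), (2, [2, 2]), (1, [1]), (3, [3, 3])]
print(group_all(data))  # {1: [1, 1, 1], 2: [2, 2], 3: [3, 3]}
```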
May 28 '07 #5

Paul Rubin wrote:
It chops up the iterable into a bunch of smaller ones, but the total
size ends up the same. "Telescope", "compact", "collapse" etc. make
it sound like the output is going to end up smaller than the input.
Good point... I guess I was thinking in terms of the number of iterators
being returned being smaller than the length of the input, and ordered
relative to the input - not about the fact that the iterators contain
all of the objects.

There is also a dirty secret involved <wink>, which is that the
itertools functions (including groupby) are mostly patterned after
similarly named functions in the Haskell Prelude, which do about the
same thing. They are aimed at helping a similar style of programming,
so staying with similar names IMO is a good thing.
Ah - those horrible, intolerant Functionalists. I dig ;-).
But that is what groupby does, except its notion of uniqueness is
limited to contiguous runs of elements having the same key.
"itertools.groupby_except_the_notion_of_uniqueness_is_limited_to_contiguous_runs_of_elements_having_the_same_key()"
doesn't have much of
a ring to it. I guess this gets back to documentation problems, because
the help string says nothing about this limitation:

'''
class groupby(__builtin__.object)
| groupby(iterable[, keyfunc]) -> create an iterator which returns
| (key, sub-iterator) grouped by each value of key(value).
|
'''

"Each" seems to imply uniqueness here.
May 29 '07 #6

Gordon Airporte <JH*****@fbi.gov> writes:
"itertools.groupby_except_the_notion_of_uniqueness_is_limited_to_contiguous_runs_of_elements_having_the_same_key()"
doesn't have much
of a ring to it. I guess this gets back to documentation problems,
because the help string says nothing about this limitation:

'''
class groupby(__builtin__.object)
| groupby(iterable[, keyfunc]) -> create an iterator which returns
| (key, sub-iterator) grouped by each value of key(value).
|
'''
I wouldn't call it a "limitation"; it's a designed behavior which is
the right thing for some purposes and maybe not for others. For
example, groupby (as currently defined) works properly on infinite
sequences, but a version that scans the entire sequence to bring
together every occurrence of every key would fail in that situation.
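
A small Python 3 sketch of the infinite-sequence case. The key function (integer division by 3) is only an illustrative choice:

```python
from itertools import count, groupby, islice

# count() never ends; the key maps 0-2 -> 0, 3-5 -> 1, 6-8 -> 2, ...
blocks = groupby(count(), key=lambda n: n // 3)

# Each group is produced as soon as its run is reached,
# so we can take the first three groups and stop.
first_three = [(k, list(g)) for k, g in islice(blocks, 3)]

print(first_three)
# [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8])]
```

A version that had to collect every occurrence of every key before yielding anything would never return on this input.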
I agree that the doc description could be reworded slightly.
May 29 '07 #7

On Mon, 28 May 2007 23:02:31 -0400, Gordon Airporte wrote
'''
class groupby(__builtin__.object)
| groupby(iterable[, keyfunc]) -> create an iterator which returns
| (key, sub-iterator) grouped by each value of key(value).
|
'''

"Each" seems to imply uniqueness here.
Yes, I can see how somebody might read it this way.

How about "...grouped by contiguous runs of key(value)" instead? And while
we're at it, it probably should be keyfunc(value), not key(value).

--
Carsten Haese
http://informixdb.sourceforge.net

May 29 '07 #8

On May 28, 8:36 pm, "Carsten Haese" <cars...@uniqsys.com> wrote:
And while
we're at it, it probably should be keyfunc(value), not key(value).
No dice. The itertools.groupby() function is typically used
in conjunction with sorted(). It would be a mistake to call
it keyfunc in one place and not in the other. The mental
association is essential. The key= nomenclature is used
throughout Python -- see min(), max(), sorted(), list.sort(),
itertools.groupby(), heapq.nsmallest(), and heapq.nlargest().

Really. People need to stop making-up random edits to the docs.
For the most part, the current wording is there for a reason.
The poster who wanted to rename the function to telescope() did
not participate in the extensive python-dev discussions on the
subject, did not consider the implications of unnecessarily
breaking code between versions, did not consider that the term
telescope() would mean A LOT of different things to different
people, did not consider the useful mental associations with SQL, etc.

I recognize that the naming of things and the wording
of documentation is something *everyone* has an opinion
about. Even on python-dev, it is typical that posts with
technical analysis or use case studies are far outnumbered
by posts from folks with strong opinions about how to
name things.

I also realize that you could write a book on the subject
of this particular itertool and someone somewhere would still
find it confusing. In response to this thread, I've put in
additional documentation (described in an earlier post).
I think it is time to call this one solved and move on.
It currently has a paragraph of plain-English description,
a pure python equivalent, an example, advice on when to
list-out the iterator, triply repeated advice to pre-sort
using the same key function, an alternate description as
a tool that groups whenever key(x) changes, a comparison to
UNIX's uniq filter, a contrast against SQL's GROUP BY clauses,
and two worked-out examples on the next page which show
sample inputs and outputs. It is now one of the most
thoroughly documented individual functions in the language.
If someone reads all that, runs a couple of experiments
at the interactive prompt, and still doesn't get it,
then god help them when they get to the threading module
or to regular expressions.

If the posters on this thread have developed an interest
in the subject, I would find it useful to hear their
ideas on new and creative ways to use groupby(). The
analogy to UNIX's uniq filter was found only after the
design was complete. Likewise, the page numbering trick
(shown above by Paul and in the examples in the docs)
was found afterwards. I have a sense that there are entire
classes of undiscovered use cases which would emerge
if serious creative effort were focused on new and
interesting key= functions (the page numbering trick
ought to serve as inspiration in this regard).
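
For readers who have not seen the two ideas mentioned here, a rough Python 3 sketch follows. The sample data is made up, and the second half is one way to write the page-numbering trick, not necessarily the exact form used in the docs:

```python
from itertools import groupby

# uniq-style: collapse adjacent duplicates (like `uniq`),
# optionally with counts (like `uniq -c`).
words = ['a', 'a', 'b', 'a', 'c', 'c', 'c']
deduped = [k for k, _ in groupby(words)]
counted = [(k, len(list(g))) for k, g in groupby(words)]

# Page-numbering trick: within a run of consecutive integers,
# value - index stays constant, so it can serve as the group key.
pages = [1, 2, 3, 7, 8, 10, 15, 16, 17]
ranges = []
for _, g in groupby(enumerate(pages), key=lambda pair: pair[1] - pair[0]):
    run = [page for _, page in g]
    ranges.append((run[0], run[-1]))

print(deduped)  # ['a', 'b', 'a', 'c']
print(counted)  # [('a', 2), ('b', 1), ('a', 1), ('c', 3)]
print(ranges)   # [(1, 3), (7, 8), (10, 10), (15, 17)]
```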

The gauntlet has been thrown down. Any creative thinkers
up to the challenge? Give me cool recipes.
Raymond

May 29 '07 #9

On May 28, 8:02 pm, Gordon Airporte <JHoo...@fbi.gov> wrote:
"Each" seems to imply uniqueness here.
Doh! This sort of micro-massaging the docs misses the big picture.
If "each" meant unique across the entire input stream, then how the
heck could the function work without reading in the entire data stream
all at once? An understanding of iterators and the itertools philosophy
reveals the correct interpretation. Without that understanding, it is
a fool's errand to try to inject all of the attendant knowledge into
the docs for each individual function. Without that understanding, a
user would be *much* better off using list-based functions (e.g. using
zip() instead of izip() so that they will have a thorough understanding
of what their code actually does).

The itertools module necessarily requires an understanding of
iterators. The module has a clear philosophy and unifying theme. It
is about consuming data lazily, writing out results in small bits,
keeping as little as possible in memory, and being a set of composable
functional-style tools running at C speed (often making it possible to
avoid the Python eval-loop entirely).

The docs intentionally include an introduction that articulates the
philosophy and unifying theme. Likewise, there is a reason for the
examples page and the recipes page. Taken together, those three
sections and the docs on the individual functions guide a programmer
to a clear sense of what the tools are for, when to use them, how to
compose them, their inherent strengths and weaknesses, and a good
intuition about how they work under the hood.

Given that context, it is a trivial matter to explain what groupby()
does: it is an itertool (with all that implies) that emits groups
from the input stream whenever the key(x) function changes or the
stream ends.
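
That one-sentence description can be turned into a short Python 3 generator. This is a deliberately simplified sketch, not the implementation in itertools: the hypothetical simple_groupby builds each group as a list instead of handing out a lazy sub-iterator.

```python
def simple_groupby(iterable, key=lambda x: x):
    """Eager simplification: yields (key, list) instead of (key, sub-iterator)."""
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return  # empty input: no groups at all
    current_key = key(first)
    group = [first]
    for item in it:
        k = key(item)
        if k != current_key:
            # The key changed: emit the finished run and start a new one.
            yield current_key, group
            current_key, group = k, []
        group.append(item)
    # The stream ended: emit the last run.
    yield current_key, group

print(list(simple_groupby('AAABBA')))
# [('A', ['A', 'A', 'A']), ('B', ['B', 'B']), ('A', ['A'])]
```

Nothing in the loop ever looks back at earlier keys, which is why 'A' can come out twice.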

Without the context, someone somewhere will find a way to get confused
no matter how the individual function docs are worded. When the OP
said that he hadn't read the examples, it is not surprising that he
found a way to get confused about the most complex tool in the
toolset.*

Debating the meaning of "each" is a sure sign of ignoring context and
editing with tunnel vision instead of holistic thinking. Similar
issues arise in the socket, decimal, threading and regular expression
modules. For users who do not grok those modules' unifying concepts,
no massaging of the docs for individual functions can prevent
occasional bouts of confusion.
Raymond
* -- FWIW, the OP then did the RightThing (tm) by experimenting at the
interactive prompt to observe what the function actually does and then
posted on comp.lang.python in a further effort to resolve his
understanding.

May 29 '07 #10

On Mon, 2007-05-28 at 23:34 -0700, Raymond Hettinger wrote:
On May 28, 8:36 pm, "Carsten Haese" <cars...@uniqsys.com> wrote:
And while
we're at it, it probably should be keyfunc(value), not key(value).

No dice. The itertools.groupby() function is typically used
in conjunction with sorted(). It would be a mistake to call
it keyfunc in one place and not in the other. The mental
association is essential. The key= nomenclature is used
throughout Python -- see min(), max(), sorted(), list.sort(),
itertools.groupby(), heapq.nsmallest(), and heapq.nlargest().
Point taken, but in that case, the argument name in the function
signature is technically incorrect. I don't really need this corrected,
I was merely pointing out the discrepancy between the name 'keyfunc' in
the signature and the call 'key(value)' in the description. For what
it's worth, which is probably very little, help(sorted) correctly
identifies the name of the key argument as 'key'.

As an aside, while groupby() will indeed often be used in conjunction
with sorted(), there is a significant class of use cases where that's
not the case: I use groupby to produce grouped reports from the results
of an SQL query. In such cases, I use ORDER BY to guarantee that the
results are supplied in the correct order rather than using sorted().
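
A sketch of that reporting pattern in Python 3, using the standard sqlite3 module with an in-memory database. The orders table, its columns, and the sample rows are all invented for illustration; the post does not show the actual queries:

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (region TEXT, customer TEXT, amount INTEGER)')
conn.executemany(
    'INSERT INTO orders VALUES (?, ?, ?)',
    [('West', 'Acme', 120), ('East', 'Bolt', 75),
     ('West', 'Cogs', 30), ('East', 'Dyno', 50)],
)

# ORDER BY region makes rows with the same region adjacent, which is
# all groupby needs; no sorted() call on the Python side.
rows = conn.execute(
    'SELECT region, customer, amount FROM orders ORDER BY region, customer'
).fetchall()
conn.close()

report = []
for region, group in groupby(rows, key=itemgetter(0)):
    details = [(customer, amount) for _, customer, amount in group]
    subtotal = sum(amount for _, amount in details)
    report.append((region, details, subtotal))

for region, details, subtotal in report:
    print(region)
    for customer, amount in details:
        print('   ', customer, amount)
    print('    subtotal', subtotal)
```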

Having said that, I'd like to expressly thank you for providing such a
mindbogglingly useful feature. Writing reports would be much less
enjoyable without groupby.

Best regards,

--
Carsten Haese
http://informixdb.sourceforge.net
May 29 '07 #11

On May 29, 2:34 am, Raymond Hettinger <pyt...@rcn.com> wrote:
If the posters on this thread have developed an interest
in the subject, I would find it useful to hear their
ideas on new and creative ways to use groupby(). The
analogy to UNIX's uniq filter was found only after the
design was complete. Likewise, the page numbering trick
(shown above by Paul and in the examples in the docs)
was found afterwards. I have a sense that there are entire
classes of undiscovered use cases which would emerge
if serious creative effort were focused on new and
interesting key= functions (the page numbering trick
ought to serve as inspiration in this regard).

The gauntlet has been thrown down. Any creative thinkers
up to the challenge? Give me cool recipes.
Although obfuscated one-liners don't have a large coolness factor in
Python, I'll bite:

from itertools import groupby
from random import randint
x = [randint(0,100) for _ in xrange(20)]
print x
n = 7
# <-- insert fat comments here about the next line --#
reduce(lambda acc,(rem,divs): acc[rem].extend(divs) or acc,
       groupby(x, key=lambda div: div%n),
       [[] for _ in xrange(n)])
George
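
For anyone untangling the one-liner, this is roughly what it does, unrolled into a plain loop in Python 3 (where tuple-unpacking lambdas and xrange no longer exist). A fixed sample list stands in for the random input so the result is reproducible:

```python
from itertools import groupby

x = [14, 3, 10, 7, 21, 22, 8, 15, 1]  # fixed sample instead of random input
n = 7

# One bucket per possible remainder 0..n-1.
buckets = [[] for _ in range(n)]

# groupby yields one (remainder, run) pair per contiguous run. Because x is
# unsorted, the same remainder can come up several times; extend() merges
# those separate runs into the same bucket.
for rem, run in groupby(x, key=lambda v: v % n):
    buckets[rem].extend(run)

print(buckets)
# [[14, 7, 21], [22, 8, 15, 1], [], [3, 10], [], [], []]
```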

May 30 '07 #12

Raymond Hettinger <py****@rcn.com> writes:
The gauntlet has been thrown down. Any creative thinkers
up to the challenge? Give me cool recipes.
Here is my version (with different semantics) of the grouper recipe in
the existing recipe section:

import operator
from itertools import groupby, imap

snd = operator.itemgetter(1) # I use this so often...

def grouper(seq, n):
    for k,g in groupby(enumerate(seq), lambda (i,x): i//n):
        yield imap(snd, g)

I sometimes use the above for chopping large (multi-gigabyte) data
sets into manageable sized runs of a program. That is, my value of n
might be something like 1 million, so making tuples that size (as the
version in the itertools docs does) starts being unpleasant. Plus,
I think the groupby version makes more intuitive sense, though it
has pitfalls if you do anything with the output other than iterate
through each item as it emerges. I guess you could always use map
instead of imap.
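
A Python 3 rendering of the same idea, as a sketch: imap is gone (map is already lazy) and the tuple-unpacking lambda has to be rewritten. The second half illustrates the pitfall mentioned above, on the assumption that it refers to sub-iterators going stale once the outer groupby advances:

```python
from itertools import groupby
from operator import itemgetter

snd = itemgetter(1)

def grouper(seq, n):
    # enumerate pairs each item with its index; index // n is constant
    # within each block of n items, so groupby splits on block boundaries.
    for _, g in groupby(enumerate(seq), key=lambda pair: pair[0] // n):
        yield map(snd, g)  # lazy in Python 3, like imap in Python 2

# Safe: consume each chunk before asking for the next one.
chunks = [list(chunk) for chunk in grouper('abcdefg', 3)]
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]

# Pitfall: collecting all the chunks first advances the underlying groupby,
# so the earlier sub-iterators no longer see their items.
stale = [list(chunk) for chunk in list(grouper('abcdefg', 3))]
print(stale)
```

If the chunks need to outlive the iteration, materializing each one with list() inside the loop avoids the problem, at the cost of holding n items in memory at a time.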

May 30 '07 #13

On 27 May 2007 10:49:06 -0700, 7stud <bb**********@yahoo.com> wrote:
On May 27, 11:28 am, Steve Howell <showel...@yahoo.com> wrote:
The groupby method has its uses, but its behavior is
going to be very surprising to anybody that has used
the "group by" syntax of SQL, because Python's groupby
method will repeat groups if your data is not sorted,
whereas SQL has the luxury of (knowing that it's)
working with a finite data set, so it can provide the
more convenient semantics.
The groupby method has its uses

I'd settle for a simple explanation of what it does in python.
Here is another example:

import itertools
import random

dierolls = sorted(random.randint(1, 6) for x in range(200))

for number, numbers in itertools.groupby(dierolls):
    number_count = len(list(numbers))
    print number, "came up", number_count, "times."

--
mvh Björn
Jun 5 '07 #14
