itertools.groupby

7stud

Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example:
import itertools

mylist = ['a', 1, 'b', 2, 3, 'c']

def isString(x):
    # True only when x is already a string
    s = str(x)
    if s == x:
        return True
    else:
        return False

uniquekeys = []
groups = []
for k, g in itertools.groupby(mylist, isString):
    uniquekeys.append(k)
    groups.append(list(g))

print(uniquekeys)
print(groups)

--output:--
[True, False, True, False, True]
[['a'], [1], ['b'], [2, 3], ['c']]

May 27 '07 #1


7stud

On May 27, 11:28 am, Steve Howell <showel...@yahoo.com> wrote:

--- 7stud <bbxx789_0...@yahoo.com> wrote:
Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example: [...]

The groupby method has its uses, but its behavior is
going to be very surprising to anybody who has used
the "group by" syntax of SQL, because Python's groupby
method will repeat groups if your data is not sorted,
whereas SQL has the luxury of (knowing that it's)
working with a finite data set, so it can provide the
more convenient semantics.


The groupby method has its uses

I'd settle for a simple explanation of what it does in python.

May 27 '07 #2

Carsten Haese

On Sun, 2007-05-27 at 10:17 -0700, 7stud wrote:

Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example:

import itertools

mylist = ['a', 1, 'b', 2, 3, 'c']

def isString(x):
    # True only when x is already a string
    s = str(x)
    if s == x:
        return True
    else:
        return False

uniquekeys = []
groups = []
for k, g in itertools.groupby(mylist, isString):
    uniquekeys.append(k)
    groups.append(list(g))

print(uniquekeys)
print(groups)

--output:--
[True, False, True, False, True]
[['a'], [1], ['b'], [2, 3], ['c']]

The so-called example you're quoting from the docs is not an actual
example of using itertools.groupby, but suggested code for how you can
store the grouping if you need to iterate over it twice, since iterators
are in general not repeatable.

As such, 'uniquekeys' lists the key values that correspond to each group
in 'groups'. groups[0] is the list of elements grouped under
uniquekeys[0], groups[1] is the list of elements grouped under
uniquekeys[1], etc. You are getting surprising results because your data
is not sorted by the group key. Your group key alternates between True
and False.
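
A minimal sketch of the sorting point, reusing the list from the original post. The name is_string is only a shortened stand-in for the poster's isString, and the code assumes Python 3. Sorting by the same key before grouping makes each key value show up exactly once:

```python
import itertools

mylist = ['a', 1, 'b', 2, 3, 'c']

def is_string(x):
    # Same test as isString in the original post.
    return str(x) == x

# Sort by the grouping key first so equal keys become adjacent.
# sorted() is stable, so items keep their relative order within a key.
ordered = sorted(mylist, key=is_string)

uniquekeys = []
groups = []
for k, g in itertools.groupby(ordered, is_string):
    uniquekeys.append(k)
    groups.append(list(g))

print(uniquekeys)  # [False, True]
print(groups)      # [[1, 2, 3], ['a', 'b', 'c']]
```

Only the boolean keys are compared during the sort, so mixing strings and integers in the list is not a problem here.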

Maybe you need to explain to us what you're actually trying to do.
User-supplied comments to the documentation won't help with that.

Regards,

--
Carsten Haese
http://informixdb.sourceforge.net

May 27 '07 #3

Gordon Airporte

7stud wrote:

Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used this example:

This is my first exposure to this function, and I see that it does have
some uses in my code. I agree that it is confusing, however.
IMO the confusion could be lessened if the function with the current
behavior were renamed 'telescope' or 'compact' or 'collapse' or
something (since it collapses the iterable linearly over homogeneous
sequences.)
A function named groupby could then have what I think is the clearly
implied behavior of creating just one iterator for each unique type of
thing in the input list, as categorized by the key function.
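
For comparison, here is a rough sketch of the "one group per unique key" behavior described above. The helper name group_all is hypothetical and not part of itertools; it is just a dictionary of lists, and it assumes the keys are hashable and the input is finite:

```python
from collections import defaultdict

def group_all(iterable, key):
    # Hypothetical helper, not part of itertools: collect every item
    # under its key, regardless of where it appears in the input.
    buckets = defaultdict(list)
    for item in iterable:
        buckets[key(item)].append(item)
    return dict(buckets)

mylist = ['a', 1, 'b', 2, 3, 'c']
result = group_all(mylist, lambda x: isinstance(x, str))
print(result)  # {True: ['a', 'b', 'c'], False: [1, 2, 3]}
```

Unlike groupby, this has to read the whole input before it can return anything, and it keeps every item in memory.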

May 28 '07 #4

Paul Rubin

Gordon Airporte <JH*****@fbi.gov> writes:

This is my first exposure to this function, and I see that it does
have some uses in my code. I agree that it is confusing, however.
IMO the confusion could be lessened if the function with the current
behavior were renamed 'telescope' or 'compact' or 'collapse' or
something (since it collapses the iterable linearly over homogeneous
sequences.)

It chops up the iterable into a bunch of smaller ones, but the total
size ends up the same. "Telescope", "compact", "collapse" etc. make
it sound like the output is going to end up smaller than the input.

There is also a dirty secret involved <wink>, which is that the
itertools functions (including groupby) are mostly patterned after
similarly named functions in the Haskell Prelude, which do about the
same thing. They are aimed at helping a similar style of programming,
so staying with similar names IMO is a good thing.

A function named groupby could then have what I think is the clearly
implied behavior of creating just one iterator for each unique type of
thing in the input list, as categorized by the key function.

But that is what groupby does, except its notion of uniqueness is
limited to contiguous runs of elements having the same key.
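
A small illustration of "contiguous runs", using a made-up string as input. The key 'A' appears twice in the output because its two runs are not adjacent, and the run lengths still add up to the length of the input:

```python
from itertools import groupby

data = "AAABBAAC"

# With no key function, each element is its own key.
runs = [(key, len(list(group))) for key, group in groupby(data)]
print(runs)  # [('A', 3), ('B', 2), ('A', 2), ('C', 1)]

# Nothing is dropped: the runs cover the whole input.
total = sum(n for _, n in runs)
print(total == len(data))  # True
```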

May 28 '07 #5

Gordon Airporte

Paul Rubin wrote:

It chops up the iterable into a bunch of smaller ones, but the total
size ends up the same. "Telescope", "compact", "collapse" etc. make
it sound like the output is going to end up smaller than the input.

Good point... I guess I was thinking in terms of the number of iterators
being returned being smaller than the length of the input, and ordered
relative to the input - not about the fact that the iterators contain
all of the objects.

There is also a dirty secret involved <wink>, which is that the
itertools functions (including groupby) are mostly patterned after
similarly named functions in the Haskell Prelude, which do about the
same thing. They are aimed at helping a similar style of programming,
so staying with similar names IMO is a good thing.

Ah - those horrible, intolerant Functionalists. I dig ;-).

But that is what groupby does, except its notion of uniqueness is
limited to contiguous runs of elements having the same key.

"itertools.groupby_except_the_notion_of_uniqueness_is_limited_to_contiguous_runs_of_elements_having_the_same_key()"
doesn't have much of a ring to it. I guess this gets back to
documentation problems, because the help string says nothing about
this limitation:

'''
class groupby(__builtin__.object)
| groupby(iterable[, keyfunc]) -> create an iterator which returns
| (key, sub-iterator) grouped by each value of key(value).
|
'''

"Each" seems to imply uniqueness here.

May 29 '07 #6

Paul Rubin

Gordon Airporte <JH*****@fbi.gov> writes:

"itertools.groupby_except_the_notion_of_uniqueness_is_limited_to_contiguous_runs_of_elements_having_the_same_key()"
doesn't have much of a ring to it. I guess this gets back to
documentation problems, because the help string says nothing about
this limitation:

'''
class groupby(__builtin__.object)
| groupby(iterable[, keyfunc]) -> create an iterator which returns
| (key, sub-iterator) grouped by each value of key(value).
|
'''

I wouldn't call it a "limitation"; it's a designed behavior which is
the right thing for some purposes and maybe not for others. For
example, groupby (as currently defined) works properly on infinite
sequences, but a version that scans the entire sequence to bring
together every occurrence of every key would fail in that situation.
I agree that the doc description could be reworded slightly.
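
To make the infinite-sequence point concrete, here is a short sketch. The input and the key function (integer division by 3) are made up for illustration; islice is used only to stop after a few groups:

```python
from itertools import count, groupby, islice

# count() never ends; the key puts each run of three integers together.
blocks = groupby(count(), key=lambda n: n // 3)

# Take just the first three groups. Only a handful of items are ever read.
first_three = [(k, list(g)) for k, g in islice(blocks, 3)]
print(first_three)  # [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8])]
```

A version that had to collect every occurrence of every key before returning would never finish on this input.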

May 29 '07 #7

Carsten Haese

On Mon, 28 May 2007 23:02:31 -0400, Gordon Airporte wrote

'''
class groupby(__builtin__.object)
| groupby(iterable[, keyfunc]) -> create an iterator which returns
| (key, sub-iterator) grouped by each value of key(value).
|
'''

"Each" seems to imply uniqueness here.

Yes, I can see how somebody might read it this way.

How about "...grouped by contiguous runs of key(value)" instead? And while
we're at it, it probably should be keyfunc(value), not key(value).

--
Carsten Haese
http://informixdb.sourceforge.net

May 29 '07 #8

Raymond Hettinger

On May 28, 8:36 pm, "Carsten Haese" <cars...@uniqsys.com> wrote:

And while
we're at it, it probably should be keyfunc(value), not key(value).

No dice. The itertools.groupby() function is typically used
in conjunction with sorted(). It would be a mistake to call
it keyfunc in one place and not in the other. The mental
association is essential. The key= nomenclature is used
throughout Python -- see min(), max(), sorted(), list.sort(),
itertools.groupby(), heapq.nsmallest(), and heapq.nlargest().

Really. People need to stop making up random edits to the docs.
For the most part, the current wording is there for a reason.
The poster who wanted to rename the function to telescope() did
not participate in the extensive python-dev discussions on the
subject, did not consider the implications of unnecessarily
breaking code between versions, did not consider that the term
telescope() would mean A LOT of different things to different
people, did not consider the useful mental associations with SQL, etc.

I recognize that the naming of things and the wording
of documentation is something *everyone* has an opinion
about. Even on python-dev, it is typical that posts with
technical analysis or use case studies are far outnumbered
by posts from folks with strong opinions about how to
name things.

I also realize that you could write a book on the subject
of this particular itertool and someone somewhere would still
find it confusing. In response to this thread, I've put in
additional documentation (described in an earlier post).
I think it is time to call this one solved and move on.
It currently has a one-paragraph plain English description,
a pure python equivalent, an example, advice on when to
list-out the iterator, triply repeated advice to pre-sort
using the same key function, an alternate description as
a tool that groups whenever key(x) changes, a comparison to
UNIX's uniq filter, a contrast against SQL's GROUP BY clauses,
and two worked-out examples on the next page which show
sample inputs and outputs. It is now one of the most
thoroughly documented individual functions in the language.
If someone reads all that, runs a couple of experiments
at the interactive prompt, and still doesn't get it,
then god help them when they get to the threading module
or to regular expressions.
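
The comparison to UNIX's uniq filter mentioned above can be sketched in a couple of lines. The input list is made up; the two list comprehensions roughly mirror what `uniq` and `uniq -c` do to adjacent duplicate lines:

```python
from itertools import groupby

lines = ["a", "a", "b", "a", "a", "c", "c"]

# Like `uniq`: keep one copy of each run of adjacent duplicates.
deduped = [key for key, _ in groupby(lines)]
print(deduped)  # ['a', 'b', 'a', 'c']

# Like `uniq -c`: count the length of each run.
counted = [(len(list(group)), key) for key, group in groupby(lines)]
print(counted)  # [(2, 'a'), (1, 'b'), (2, 'a'), (2, 'c')]
```

As with uniq, the second run of "a" is reported separately unless the input is sorted first.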

If the posters on this thread have developed an interest
in the subject, I would find it useful to hear their
ideas on new and creative ways to use groupby(). The
analogy to UNIX's uniq filter was found only after the
design was complete. Likewise, the page numbering trick
(shown above by Paul and in the examples in the docs)
was found afterwards. I have a sense that there are entire
classes of undiscovered use cases which would emerge
if serious creative effort were focused on new and
interesting key= functions (the page numbering trick
ought to serve as inspiration in this regard).
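
For readers who have not seen it, the page numbering trick can be sketched roughly as follows. This is a Python 3 rendering with a made-up page list, and it assumes the numbers are increasing with no duplicates. The key is index minus value, which stays constant across a run of consecutive integers:

```python
from itertools import groupby

pages = [1, 2, 3, 7, 8, 10, 15, 16, 17]

ranges = []
# index - value is constant within a run of consecutive integers,
# so each run of consecutive pages lands in one group.
for _, group in groupby(enumerate(pages), key=lambda pair: pair[0] - pair[1]):
    run = [page for _, page in group]
    ranges.append((run[0], run[-1]))

print(ranges)  # [(1, 3), (7, 8), (10, 10), (15, 17)]
```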

The gauntlet has been thrown down. Any creative thinkers
up to the challenge? Give me cool recipes.
Raymond

May 29 '07 #9

Raymond Hettinger

On May 28, 8:02 pm, Gordon Airporte <JHoo...@fbi.gov> wrote:

"Each" seems to imply uniqueness here.

Doh! This sort of micro-massaging the docs misses the big picture.
If "each" meant unique across the entire input stream, then how the
heck could the function work without reading in the entire data stream
all at once? An understanding of iterators and itertools philosophy
reveals the correct interpretation. Without that understanding, it is
a fool's errand to try to inject all of the attendant knowledge into
the docs for each individual function. Without that understanding, a
user would be *much* better off using list-based functions (i.e. using
zip() instead of izip() so that they will have a thorough understanding
of what their code actually does).

The itertools module necessarily requires an understanding of
iterators. The module has a clear philosophy and unifying theme. It
is about consuming data lazily, writing out results in small bits,
keeping as little as possible in memory, and being a set of composable
functional-style tools running at C speed (often making it possible to
avoid the Python eval-loop entirely).

The docs intentionally include an introduction that articulates the
philosophy and unifying theme. Likewise, there is a reason for the
examples page and the recipes page. Taken together, those three
sections and the docs on the individual functions guide a programmer
to a clear sense of what the tools are for, when to use them, how to
compose them, their inherent strengths and weaknesses, and a good
intuition about how they work under the hood.

Given that context, it is a trivial matter to explain what groupby()
does: it is an itertool (with all that implies) that emits groups
from the input stream whenever the key(x) function changes or the
stream ends.
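
That one-sentence description can be turned into a simplified sketch. The generator below is not the real implementation: the name runs_by_key is hypothetical, and it yields plain lists instead of the lazy sub-iterators that itertools.groupby returns. It only illustrates "emit a group whenever key(x) changes or the stream ends":

```python
def runs_by_key(iterable, key=lambda x: x):
    # Simplified, eager sketch: yields (key, list) pairs rather than
    # the lazy sub-iterators that itertools.groupby returns.
    sentinel = object()
    current_key = sentinel
    current_group = []
    for item in iterable:
        k = key(item)
        if current_key is not sentinel and k != current_key:
            # The key changed: emit the finished group.
            yield current_key, current_group
            current_group = []
        current_key = k
        current_group.append(item)
    if current_key is not sentinel:
        # The stream ended: emit the last group.
        yield current_key, current_group

result = list(runs_by_key(['a', 1, 'b', 2, 3, 'c'],
                          key=lambda x: isinstance(x, str)))
print(result)
# [(True, ['a']), (False, [1]), (True, ['b']), (False, [2, 3]), (True, ['c'])]
```

Run on the list from the original post, it reproduces the alternating True/False keys that started this thread.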

Without the context, someone somewhere will find a way to get confused
no matter how the individual function docs are worded. When the OP
said that he hadn't read the examples, it is not surprising that he
found a way to get confused about the most complex tool in the
toolset.*

Debating the meaning of "each" is a sure sign of ignoring context and
editing with tunnel vision instead of holistic thinking. Similar
issues arise in the socket, decimal, threading and regular expression
modules. For users who do not grok those modules' unifying concepts,
no massaging of the docs for individual functions can prevent
occasional bouts of confusion.
Raymond
* -- FWIW, the OP then did the RightThing (tm) by experimenting at the
interactive prompt to observe what the function actually does and then
posted on comp.lang.python in a further effort to resolve his
understanding.

May 29 '07 #10
