
groupby() seems slow

P: n/a
I'm applying groupby() in a very simplistic way to split up some data,
but when I timeit against another method, it takes twice as long. The
following groupby() code groups the data between the "</tr>" strings:

data = [
    "1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
    "1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
    "1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
    if s[0] == "<":
        return 'a'
    else:
        return 'b'

def test3():
    master_list = []
    for group_key, group in itertools.groupby(data, key):
        if group_key == "b":
            master_list.append(list(group) )

def test1():
    master_list = []
    row = []

    for elmt in data:
        if elmt[0] != "<":
            row.append(elmt)
        else:
            if row:
                master_list.append(" ".join(row) )
                row = []

import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

Oct 16 '07 #1
10 Replies


P: n/a
On Oct 15, 11:02 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
> I thought groupby() would be faster. Am I doing something wrong?

Yes and no. Yes, the groupby version can be improved a little by
calling a builtin method instead of a Python function. No, test1 still
beats it hands down (and with Psyco even further); it is almost as good
as it gets in pure Python.

FWIW, here's a faster and more compact version with groupby:

def test3b(data):
    join = ' '.join
    return [join(group) for key,group in
            itertools.groupby(data, "</tr>".__eq__)
            if not key]

George
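
To make the "builtin method instead of a Python function" point concrete, here is a minimal sketch in Python 3 syntax (the thread itself is Python 2). The function names rows_with_python_key and rows_with_builtin_key are made up for this illustration. Note that the two keys are not strictly equivalent: the Python function treats any string starting with "<" as a separator, while "</tr>".__eq__ matches only that exact string.

```python
import itertools

data = ["1.5", "</tr>", "2.5", "3.5", "4.5", "</tr>", "</tr>", "5.5", "6.5", "</tr>"]

def python_key(s):
    # Pure-Python key, as in the original test3(): one Python-level call per element.
    return 'a' if s[0] == "<" else 'b'

def rows_with_python_key(items):
    # Hypothetical helper: keep only the groups of data elements (key 'b').
    return [" ".join(group)
            for group_key, group in itertools.groupby(items, python_key)
            if group_key == 'b']

def rows_with_builtin_key(items):
    # "</tr>".__eq__ is a bound method implemented in C. It returns True for
    # separators and False for data, so the groups whose key is False are kept.
    return [" ".join(group)
            for is_separator, group in itertools.groupby(items, "</tr>".__eq__)
            if not is_separator]

print(rows_with_python_key(data))
print(rows_with_builtin_key(data))   # ['1.5', '2.5 3.5 4.5', '5.5 6.5']
```

Both variants produce the same rows for this data; only the cost of computing the key differs.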

Oct 16 '07 #2

P: n/a
Shouldn't this

>>> print re.sub('a','\\n','bab')
b
b

output

b\nb

instead?

Massimo

Oct 16 '07 #3

P: n/a
Even stranger

>>> re.sub('a', '\\n','bab')
'b\nb'
>>> print re.sub('a', '\\n','bab')
b
b

Massimo
Oct 16 '07 #4

P: n/a
On 10/16/07, Massimo Di Pierro <md*******@cti.depaul.edu> wrote:
> Even stranger
> >>> re.sub('a', '\\n','bab')
> 'b\nb'
> >>> print re.sub('a', '\\n','bab')
> b
> b

You called print, so instead of getting an escaped string literal, the
string is being printed to your terminal, which is printing the
newline.
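
The underlying question is why the two-character replacement '\\n' ends up as a newline at all. A small sketch, in Python 3 syntax rather than the Python 2 used in the thread, suggests it is re.sub() itself that processes backslash escapes in a replacement string. The variable names here are illustrative only.

```python
import re

replacement = '\\n'            # two characters: a backslash followed by the letter n
assert len(replacement) == 2

result = re.sub('a', replacement, 'bab')
# re.sub() processes backslash escapes in the replacement template,
# so the two-character sequence becomes a single newline character.
print(len(result))     # 3
print(repr(result))    # 'b\nb'

# A function used as the replacement is not template-processed:
# whatever it returns is inserted unchanged.
literal = re.sub('a', lambda match: replacement, 'bab')
print(repr(literal))   # 'b\\nb'
```

So the interactive prompt and print are showing the same three-character string; the conversion happened inside re.sub() before either of them displayed it.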
Oct 16 '07 #5

P: n/a
It is the first line that is wrong; the second follows from the first, I agree.

________________________________________
From: Tim Chase [py*********@tim.thechases.com]
Sent: Tuesday, October 16, 2007 1:20 PM
To: DiPierro, Massimo
Cc: py*********@python.org; Berthiaume, Andre
Subject: Re: re.sub
> Even stranger
> >>> re.sub('a', '\\n','bab')
> 'b\nb'
> >>> print re.sub('a', '\\n','bab')
> b
> b

That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:

>>> s = 'a\nb'
>>> s
'a\nb'
>>> print s
a
b
>>> print repr(s)
'a\nb'

-tkc
Oct 16 '07 #6

P: n/a
Let me show you a very bad consequence of this...

a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)

Now if file1.txt contains a \n or \" then file2.txt is not the same as file1.txt while it should be.
Massimo

Oct 16 '07 #7

P: n/a
> Even stranger
>
> >>> re.sub('a', '\\n','bab')
> 'b\nb'
> >>> print re.sub('a', '\\n','bab')
> b
> b

That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:

>>> s = 'a\nb'
>>> s
'a\nb'
>>> print s
a
b
>>> print repr(s)
'a\nb'

-tkc


Oct 16 '07 #8

P: n/a
> Let me show you a very bad consequence of this...
>
> a=open('file1.txt','rb').read()
> b=re.sub('x',a,'x')
> open('file2.txt','wb').write(b)
>
> Now if file1.txt contains a \n or \" then file2.txt is not the
> same as file1.txt while it should be.

That's functioning as designed. If you want to treat file1.txt
as a literal pattern for replacement, use re.escape() on it to
escape things you don't want.

http://docs.python.org/lib/node46.html#l2h-407

Or, you can specially treat newlines:

b=re.sub('x', a.replace('\n', '\\n'), 'x')

or just escape the backslashes on the incoming pattern:

b=re.sub('x', a.replace('\\', '\\\\'), 'x')

In the help for the RE module's syntax, this is explicitly noted:

http://docs.python.org/lib/re-syntax.html
"""
If you're not using a raw string to express the pattern, remember
that Python also uses the backslash as an escape sequence in
string literals; if the escape sequence isn't recognized by
Python's parser, the backslash and subsequent character are
included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be
repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.
"""

The short upshot: "it's highly recommended that you use raw
strings for all but the simplest expressions."

Thus, the string that you pass as your regexp should be a regexp,
not a "Python interpretation of a regexp before the regex engine
gets to touch it".

-tkc
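
For the file round-trip case, a minimal sketch of two ways to insert text literally, written in Python 3 syntax. The helper names substitute_literal and substitute_escaped and the sample string are invented for this illustration. It deliberately avoids re.escape(): the current Python documentation describes that function as intended for patterns and says that in a replacement string only backslashes should be escaped.

```python
import re

def substitute_literal(pattern, replacement_text, subject):
    """Replace matches of pattern with replacement_text, taken literally."""
    # The return value of a replacement function is inserted as-is;
    # no escape processing is applied to it.
    return re.sub(pattern, lambda match: replacement_text, subject)

def substitute_escaped(pattern, replacement_text, subject):
    """Same result, by doubling backslashes before template processing."""
    return re.sub(pattern, replacement_text.replace('\\', '\\\\'), subject)

# Stand-in for the contents of file1.txt: it contains a backslash followed by n,
# a backslash followed by a double quote, and a real trailing newline.
file_contents = 'path\\name with a literal \\n and a quote \\" inside\n'

# Passing the text directly as the replacement alters it ...
assert re.sub('x', file_contents, 'x') != file_contents
# ... while both helpers reproduce it unchanged.
assert substitute_literal('x', file_contents, 'x') == file_contents
assert substitute_escaped('x', file_contents, 'x') == file_contents
```

The function form is usually the simpler choice when the replacement comes from outside the program, since nothing in it needs to be escaped.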

Oct 16 '07 #9

P: n/a
Thank you this answers my question. I wanted to make sure it was actually designed this way.

Massimo

Oct 16 '07 #10

P: n/a
On Oct 15, 8:02 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
> t = timeit.Timer("test3()", "from __main__ import test3, key, data")
> print t.timeit()
> t = timeit.Timer("test1()", "from __main__ import test1, data")
> print t.timeit()
>
> --output:---
> 42.791079998
> 19.0128788948
>
> I thought groupby() would be faster. Am I doing something wrong?

The groupby() function is not where you are losing speed. In test1,
you've in-lined the code for computing the key. In test3, groupby()
makes expensive, repeated calls to a pure python key function. For
an apples-to-apples comparison, try something like this:

def test4():
    master_list = []
    row = []
    for elem in data:
        # key() returns 'a' for tags and 'b' for data, so collect the 'b' elements
        if key(elem) == 'b':
            row.append(elem)
        elif row:
            master_list.append(' '.join(row))
            del row[:]

Raymond
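
A rough way to try the apples-to-apples comparison on a current interpreter is sketched below in Python 3 syntax. The function names are made up for this sketch, the groupby variant joins each group so that all three return the same rows, and the iteration count is an arbitrary choice. Timings depend on the machine and Python version, so none are shown.

```python
import itertools
import timeit

data = ["1.5", "</tr>", "2.5", "3.5", "4.5", "</tr>", "</tr>", "5.5", "6.5", "</tr>"] * 3

def key(s):
    return 'a' if s[0] == "<" else 'b'

def inline_loop():
    # Like test1(): the tag check is written directly in the loop body.
    master_list, row = [], []
    for elem in data:
        if elem[0] != "<":
            row.append(elem)
        elif row:
            master_list.append(" ".join(row))
            row = []
    return master_list

def loop_with_key_call():
    # Like test4(): same loop, but it pays for a key() call per element.
    master_list, row = [], []
    for elem in data:
        if key(elem) == 'b':
            row.append(elem)
        elif row:
            master_list.append(" ".join(row))
            row = []
    return master_list

def groupby_with_key():
    # Like test3(), except that each group is joined into a string.
    return [" ".join(group)
            for group_key, group in itertools.groupby(data, key)
            if group_key == 'b']

# All three must agree before their timings are worth comparing.
assert inline_loop() == loop_with_key_call() == groupby_with_key()

for func in (inline_loop, loop_with_key_call, groupby_with_key):
    seconds = timeit.timeit(func, number=20000)
    print("%-20s %.3f s" % (func.__name__, seconds))
```

If the explanation above holds, the gap between loop_with_key_call and groupby_with_key should be much smaller than the gap between inline_loop and either of them.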
Oct 16 '07 #11
