groupby() seems slow

I'm applying groupby() in a very simplistic way to split up some data,
but when I timeit against another method, it takes twice as long. The
following groupby() code groups the data between the "</tr>" strings:

data = [
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
    if s[0] == "<":
        return 'a'
    else:
        return 'b'

def test3():
    master_list = []
    for group_key, group in itertools.groupby(data, key):
        if group_key == "b":
            master_list.append(list(group))

def test1():
    master_list = []
    row = []

    for elmt in data:
        if elmt[0] != "<":
            row.append(elmt)
        else:
            if row:
                master_list.append(" ".join(row))
                row = []

import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

Oct 16 '07 #1
On Oct 15, 11:02 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
> I thought groupby() would be faster. Am I doing something wrong?

Yes and no. Yes, the groupby version can be improved a little by
calling a builtin method instead of a Python function. No, test1 still
beats it hands down (and with Psyco even further); it is almost as good
as it gets in pure Python.

FWIW, here's a faster and more compact version with groupby:

def test3b(data):
    join = ' '.join
    return [join(group) for key, group in
            itertools.groupby(data, "</tr>".__eq__)
            if not key]

George
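
To see why the `if not key` filter works, here is a small sketch in Python 3 syntax (the thread itself uses Python 2) that lists every run groupby() produces when the key is "</tr>".__eq__. The shortened sample list and the names `runs` and `rows` are illustrative assumptions, not part of the original posts. It also assumes every element is a string; for other types, str.__eq__ returns NotImplemented rather than False.

```python
import itertools

# Illustrative sample, shorter than the data in the original post.
data = ["1.5", "</tr>", "2.5", "3.5", "</tr>", "</tr>", "5.5"]

# "</tr>".__eq__ is a bound built-in method: it returns True for separator
# strings and False for any other string, with no Python-level function call.
runs = [(is_sep, list(group))
        for is_sep, group in itertools.groupby(data, "</tr>".__eq__)]

for is_sep, items in runs:
    print(is_sep, items)

# Keeping only the runs whose key is False gives the rows between separators.
rows = [" ".join(items) for is_sep, items in runs if not is_sep]
print(rows)
```

Consecutive separators collapse into a single run with key True, which is why no empty rows appear in the result.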

Oct 16 '07 #2
Shouldn't this
>>> print re.sub('a','\\n','bab')
b
b

output

b\nb

instead?

Massimo

Oct 16 '07 #3
Even stranger
>>> re.sub('a', '\\n','bab')
'b\nb'
>>> print re.sub('a', '\\n','bab')
b
b

Massimo
Oct 16 '07 #4
On 10/16/07, Massimo Di Pierro <md*******@cti.depaul.edu> wrote:
> Even stranger
> >>> re.sub('a', '\\n','bab')
> 'b\nb'
> >>> print re.sub('a', '\\n','bab')
> b
> b

You called print, so instead of getting an escaped string literal, the
string is being printed to your terminal, which is printing the
newline.
Oct 16 '07 #5
It is the first line that is wrong; the second follows from the first, I agree.

Oct 16 '07 #6
Let me show you a very bad consequence of this...

a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)

Now if file1.txt contains a \n or \" then file2.txt is not the same as file1.txt while it should be.
Massimo

Oct 16 '07 #7
> Even stranger
>
> >>> re.sub('a', '\\n','bab')
> 'b\nb'
> >>> print re.sub('a', '\\n','bab')
> b
> b

That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:

>>> s = 'a\nb'
>>> s
'a\nb'
>>> print s
a
b
>>> print repr(s)
'a\nb'

-tkc


Oct 16 '07 #8
> Let me show you a very bad consequence of this...
>
> a=open('file1.txt','rb').read()
> b=re.sub('x',a,'x')
> open('file2.txt','wb').write(b)
>
> Now if file1.txt contains a \n or \" then file2.txt is not the
> same as file1.txt while it should be.

That's functioning as designed. If you want to treat file1.txt
as a literal pattern for replacement, use re.escape() on it to
escape things you don't want.

http://docs.python.org/lib/node46.html#l2h-407

Or, you can specially treat newlines:

b=re.sub('x', a.replace('\n', '\\n'), 'x')

or just escape the backslashes on the incoming pattern:

b=re.sub('x', a.replace('\\', '\\\\'), 'x')

In the help for the RE module's syntax, this is explicitly noted:

http://docs.python.org/lib/re-syntax.html
"""
If you're not using a raw string to express the pattern, remember
that Python also uses the backslash as an escape sequence in
string literals; if the escape sequence isn't recognized by
Python's parser, the backslash and subsequent character are
included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be
repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.
"""

The short upshot: "it's highly recommended that you use raw
strings for all but the simplest expressions."

Thus, the string that you pass as your regexp should be a regexp,
not a "Python interpretation of a regexp before the regex engine
gets to touch it".

-tkc
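
As a hedged illustration of the point above, the following sketch in Python 3 syntax (the thread predates Python 3) shows two ways to make re.sub() insert text verbatim: doubling the backslashes in the replacement string, or passing a callable, whose return value is not processed for escapes. The string `contents` stands in for the contents of file1.txt and is an assumption. Note that re.escape() is meant for patterns; the current Python documentation advises against using it on the replacement string, where only backslashes need escaping.

```python
import re

# Stand-in for the contents of file1.txt (an assumption): a literal
# backslash followed by "n", some quotes, and a real newline character.
contents = 'path\\name "quoted"\nsecond line'

# Naive call: the replacement is treated as a template, so the two
# characters backslash + n are turned into a single newline.
naive = re.sub('x', contents, 'x')

# Option 1: double every backslash so the template yields the original text.
escaped = re.sub('x', contents.replace('\\', '\\\\'), 'x')

# Option 2: pass a callable; its return value is inserted as-is.
via_callable = re.sub('x', lambda match: contents, 'x')

print(naive == contents)
print(escaped == contents)
print(via_callable == contents)
```

The callable form avoids having to reason about escaping at all, at the cost of one function call per match.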

Oct 16 '07 #9
Thank you, this answers my question. I wanted to make sure it was actually designed this way.

Massimo

Oct 16 '07 #10
On Oct 15, 8:02 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
> t = timeit.Timer("test3()", "from __main__ import test3, key, data")
> print t.timeit()
> t = timeit.Timer("test1()", "from __main__ import test1, data")
> print t.timeit()
>
> --output:---
> 42.791079998
> 19.0128788948
>
> I thought groupby() would be faster. Am I doing something wrong?

The groupby() function is not where you are losing speed. In test1,
you've in-lined the code for computing the key. In test3, groupby()
makes expensive, repeated calls to a pure-Python key function. For
an apples-to-apples comparison, try something like this:

def test4():
    master_list = []
    row = []
    for elem in data:
        if key(elem) == 'b':    # 'b' marks data elements, matching test1
            row.append(elem)
        elif row:
            master_list.append(' '.join(row))
            del row[:]

Raymond
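
For readers who want to rerun the comparison, here is a minimal harness in Python 3 syntax (the thread's code is Python 2). It assumes the same data list as the original post, has every variant return joined rows so the outputs can be checked against each other, and uses a smaller run count than timeit's default of one million; the function names and the run count are illustrative choices. Timings vary by machine and interpreter version, so no results are claimed here.

```python
import itertools
import timeit

data = ["1.5", "</tr>", "2.5", "3.5", "4.5", "</tr>", "</tr>",
        "5.5", "6.5", "</tr>"] * 3


def key(s):
    """Pure-Python key function from the original post."""
    return 'a' if s[0] == "<" else 'b'


def manual_loop():
    """Explicit loop with the test inlined (modeled on test1)."""
    rows, row = [], []
    for elem in data:
        if elem[0] != "<":
            row.append(elem)
        elif row:
            rows.append(" ".join(row))
            row = []
    return rows


def loop_with_key():
    """Same loop, but calling key() per element (modeled on test4)."""
    rows, row = [], []
    for elem in data:
        if key(elem) == 'b':
            row.append(elem)
        elif row:
            rows.append(" ".join(row))
            row = []
    return rows


def groupby_python_key():
    """groupby() with the pure-Python key (test3, adapted to join rows)."""
    return [" ".join(group)
            for group_key, group in itertools.groupby(data, key)
            if group_key == 'b']


def groupby_builtin_key():
    """groupby() with a built-in method as key (modeled on test3b)."""
    return [" ".join(group)
            for is_sep, group in itertools.groupby(data, "</tr>".__eq__)
            if not is_sep]


variants = [manual_loop, loop_with_key, groupby_python_key, groupby_builtin_key]

# All variants should agree before their timings are compared.
expected = manual_loop()
assert all(func() == expected for func in variants)

for func in variants:
    seconds = timeit.timeit(func, number=20000)
    print(f"{func.__name__:22s} {seconds:.3f} s")
```

Comparing manual_loop against loop_with_key isolates the cost of the key() calls, while comparing the two groupby variants isolates the cost of a Python-level key versus a built-in one.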
Oct 16 '07 #11
