I'm applying groupby() in a very simplistic way to split up some data,
but when I timeit against another method, it takes twice as long. The
following groupby() code groups the data between the "</tr>" strings:
data = [
    "1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
    "1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
    "1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
    if s[0] == "<":
        return 'a'
    else:
        return 'b'

def test3():
    master_list = []
    for group_key, group in itertools.groupby(data, key):
        if group_key == "b":
            master_list.append(list(group))

def test1():
    master_list = []
    row = []
    for elmt in data:
        if elmt[0] != "<":
            row.append(elmt)
        else:
            if row:
                master_list.append(" ".join(row))
                row = []

import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()
--output:---
42.791079998
19.0128788948
I thought groupby() would be faster. Am I doing something wrong?
On Oct 15, 11:02 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
> I thought groupby() would be faster. Am I doing something wrong?
Yes and no. Yes, the groupby version can be improved a little by
calling a builtin method instead of a Python function. No, test1 still
beats it hands down (and with Psyco even further); it is almost as good
as it gets in pure Python.
FWIW, here's a faster and more compact version with groupby:
def test3b(data):
    join = ' '.join
    return [join(group) for key,group in
            itertools.groupby(data, "</tr>".__eq__)
            if not key]
George
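A quick way to check that the groupby one-liner and the hand-written loop build the same rows is to have both return their result and compare. The sketch below is written in Python 3 syntax (the thread itself uses Python 2), and the names `rows_loop` and `rows_groupby` are made up for illustration. One difference to keep in mind: the loop treats anything starting with "<" as a separator, while the groupby version only matches the exact string "</tr>"; for this data the two agree.

```python
import itertools

data = [
    "1.5", "</tr>", "2.5", "3.5", "4.5", "</tr>", "</tr>", "5.5", "6.5", "</tr>",
] * 3


def rows_loop(items):
    """Hand-written loop in the style of test1, returning the rows."""
    rows = []
    row = []
    for item in items:
        if item[0] != "<":
            row.append(item)
        elif row:
            rows.append(" ".join(row))
            row = []
    if row:  # flush a trailing row with no closing tag (not needed for this data)
        rows.append(" ".join(row))
    return rows


def rows_groupby(items):
    """groupby keyed on a bound builtin method, in the style of test3b."""
    return [
        " ".join(group)
        for is_tag, group in itertools.groupby(items, "</tr>".__eq__)
        if not is_tag
    ]


print(rows_loop(data)[:3])                   # ['1.5', '2.5 3.5 4.5', '5.5 6.5']
print(rows_loop(data) == rows_groupby(data))  # True
```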
Shouldn't this
>>> print re.sub('a','\\n','bab')
b
b
output
b\nb
instead?
Massimo
Even stranger
>>> re.sub('a', '\\n','bab')
'b\nb'
>>> print re.sub('a', '\\n','bab')
b
b
Massimo
On 10/16/07, Massimo Di Pierro <md*******@cti.depaul.edu> wrote:
> Even stranger
> >>> re.sub('a', '\\n','bab')
> 'b\nb'
> >>> print re.sub('a', '\\n','bab')
> b
> b
You called print, so instead of getting an escaped string literal, the
string is being printed to your terminal, which is printing the
newline.
It is the first line that is wrong; the second follows from the first, I agree.
________________________________________
From: Tim Chase [py*********@tim.thechases.com]
Sent: Tuesday, October 16, 2007 1:20 PM
To: DiPierro, Massimo
Cc: py*********@python.org; Berthiaume, Andre
Subject: Re: re.sub
> Even stranger
> >>> re.sub('a', '\\n','bab')
> 'b\nb'
> >>> print re.sub('a', '\\n','bab')
> b
> b
That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:
>>> s = 'a\nb'
>>> s
'a\nb'
>>> print s
a
b
>>> print repr(s)
'a\nb'
-tkc
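One way to see exactly what the result string holds, independent of how the interpreter displays it, is to look at its length and its individual characters. This is a small sketch in Python 3 syntax (the session above is Python 2); the variable name `result` is only for illustration.

```python
import re

result = re.sub('a', '\\n', 'bab')

# The replacement argument is two characters: a backslash and the letter n.
print(len('\\n'))     # 2
# re.sub() interprets that escape, so the result holds one real newline.
print(len(result))    # 3
print(list(result))   # ['b', '\n', 'b']
print(repr(result))   # 'b\nb', which is what the interactive prompt echoes
print(result)         # b, a line break, then b
```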
Let me show you a very bad consequence of this...
a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)
Now if file1.txt contains a \n or \" then file2.txt is not the same as file1.txt while it should be.
Massimo
> Let me show you a very bad consequence of this...
>
> a=open('file1.txt','rb').read()
> b=re.sub('x',a,'x')
> open('file2.txt','wb').write(b)
>
> Now if file1.txt contains a \n or \" then file2.txt is not the
> same as file1.txt while it should be.
That's functioning as designed. If you want to treat file1.txt
as a literal pattern for replacement, use re.escape() on it to
escape things you don't want. http://docs.python.org/lib/node46.html#l2h-407
Or, you can specially treat newlines:
b=re.sub('x', a.replace('\n', '\\n'), 'x')
or just escape the backslashes on the incoming pattern:
b=re.sub('x', a.replace('\\', '\\\\'), 'x')
In the help for the RE module's syntax, this is explicitly noted: http://docs.python.org/lib/re-syntax.html
"""
If you're not using a raw string to express the pattern, remember
that Python also uses the backslash as an escape sequence in
string literals; if the escape sequence isn't recognized by
Python's parser, the backslash and subsequent character are
included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be
repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.
"""
The short upshot: "it's highly recommended that you use raw
strings for all but the simplest expressions."
Thus, the string that you pass as your regexp should be a regexp,
not a "Python interpretation of a regexp before the regex engine
gets to touch it".
-tkc
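For the file-copy case, here is a sketch of two ways to insert text verbatim with re.sub(), in Python 3 syntax. The sample string `text` stands in for the contents of file1.txt and is an assumption for illustration. Passing a function as the replacement skips escape processing altogether, and doubling the backslashes, as suggested above, gives the same result. Note that the current Python documentation describes re.escape() as a tool for patterns and says it should not be applied to the replacement string, so these two options are the safer choice on recent versions.

```python
import re

# Stand-in for the contents of file1.txt: a literal backslash followed by n.
text = 'path\\name "quoted"'

# Naive call: the replacement template turns backslash-n into a newline.
naive = re.sub('x', text, 'x')

# Option 1: a callable replacement is used as-is, with no escape processing.
via_function = re.sub('x', lambda match: text, 'x')

# Option 2: double every backslash so the template yields the original text.
via_escaping = re.sub('x', text.replace('\\', '\\\\'), 'x')

print(naive == text)          # False
print(via_function == text)   # True
print(via_escaping == text)   # True
```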
Thank you, this answers my question. I wanted to make sure it was actually designed this way.
Massimo
On Oct 15, 8:02 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
> t = timeit.Timer("test3()", "from __main__ import test3, key, data")
> print t.timeit()
> t = timeit.Timer("test1()", "from __main__ import test1, data")
> print t.timeit()
> --output:---
> 42.791079998
> 19.0128788948
> I thought groupby() would be faster. Am I doing something wrong?
The groupby() function is not where you are losing speed. In test1,
you've in-lined the code for computing the key. In test3, groupby()
makes expensive, repeated calls to a pure python key function. For
an apples-to-apples comparison, try something like this:
def test4():
    master_list = []
    row = []
    for elem in data:
        if key(elem) == 'b':    # 'b' marks a data item, 'a' a tag
            row.append(elem)
        elif row:
            master_list.append(' '.join(row))
            del row[:]
Raymond
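To reproduce the comparison, the variants discussed in this thread can be timed side by side. The sketch below uses Python 3 syntax and a small `number` so that it finishes quickly; the function names are chosen here for illustration, and the absolute timings will differ from the figures posted above. Each variant returns its rows so that the results can be checked for equality before trusting the timings.

```python
import itertools
import timeit

data = [
    "1.5", "</tr>", "2.5", "3.5", "4.5", "</tr>", "</tr>", "5.5", "6.5", "</tr>",
] * 3


def key(s):
    return 'a' if s[0] == "<" else 'b'


def loop_inline():
    # Key test written inline, as in test1.
    rows, row = [], []
    for elem in data:
        if elem[0] != "<":
            row.append(elem)
        elif row:
            rows.append(' '.join(row))
            row = []
    return rows


def loop_with_key():
    # Same loop, but paying for a Python-level key() call per element.
    rows, row = [], []
    for elem in data:
        if key(elem) == 'b':
            row.append(elem)
        elif row:
            rows.append(' '.join(row))
            row = []
    return rows


def groupby_with_key():
    return [' '.join(g) for k, g in itertools.groupby(data, key) if k == 'b']


def groupby_builtin():
    # Key is a bound builtin method, so no Python function call per element.
    return [' '.join(g)
            for is_tag, g in itertools.groupby(data, "</tr>".__eq__)
            if not is_tag]


for func in (loop_inline, loop_with_key, groupby_with_key, groupby_builtin):
    seconds = timeit.timeit(func, number=10000)
    print('%-18s %.4f s' % (func.__name__, seconds))
```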