473,395 Members | 1,891 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,395 software developers and data experts.

str.count is slow

It seems to me that str.count is awfully slow. Is there some reason
for this?
Evidence:

######## str.count time test ########
import string
import time
import array

s = string.printable * int(1e5) # 10**7 character string
a = array.array('c', s)
u = unicode(s)
RIGHT_ANSWER = s.count('a')

def main():
print 'str: ', time_call(s.count, 'a')
print 'array: ', time_call(a.count, 'a')
print 'unicode:', time_call(u.count, 'a')

def time_call(f, *a):
start = time.clock()
assert RIGHT_ANSWER == f(*a)
return time.clock()-start

if __name__ == '__main__':
main()

###### end ########

On my machine, the output is:

str: 0.29365715475
array: 0.448095498171
unicode: 0.0243757237303

If a unicode object can count characters so fast, why should an str
object be ten times slower? Just curious, really - it's still fast
enough for me (so far).

This is with Python 2.4.1 on WinXP.
Chris Perkins

Feb 27 '06 #1
3 2839
ch************@gmail.com wrote:
It seems to me that str.count is awfully slow. Is there some reason
for this?
Evidence:

######## str.count time test ########
import string
import time
import array

s = string.printable * int(1e5) # 10**7 character string
a = array.array('c', s)
u = unicode(s)
RIGHT_ANSWER = s.count('a')

def main():
print 'str: ', time_call(s.count, 'a')
print 'array: ', time_call(a.count, 'a')
print 'unicode:', time_call(u.count, 'a')

def time_call(f, *a):
start = time.clock()
assert RIGHT_ANSWER == f(*a)
return time.clock()-start

if __name__ == '__main__':
main()

###### end ########

On my machine, the output is:

str: 0.29365715475
array: 0.448095498171
unicode: 0.0243757237303

If a unicode object can count characters so fast, why should an str
object be ten times slower? Just curious, really - it's still fast
enough for me (so far).

This is with Python 2.4.1 on WinXP.
Chris Perkins

Your evidence points to some unoptimized code in the underlying C
implementation of Python. As such, this should probably go to the
python-dev list (http://mail.python.org/mailman/listinfo/python-dev).

The problem is that the C library function memcmp is slow, and
str.count calls it frequently. See lines 2165+ in stringobject.c
(inside function string_count):

r = 0;
while (i < m) {
if (!memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}

This could be optimized as:

r = 0;
while (i < m) {
if (s[i] == *sub && !memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}

This tactic typically avoids most (sometimes all) of the calls to
memcmp. Other string search functions, including unicode.count,
unicode.index, and str.index, use this tactic, which is why you see
unicode.count performing better than str.count.

The above might be optimized further for cases such as yours, where a
single character appears many times in the string:

r = 0;
if (n == 1) {
/* optimize for a single character */
while (i < m) {
if (s[i] == *sub)
r++;
i++;
}
} else {
while (i < m) {
if (s[i] == *sub && !memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}
}

Note that there might be some subtle reason why neither of these
optimizations are done that I'm unaware of... in which case a comment
in the C source would help. :-)

--Ben

Feb 27 '06 #2
Ben Cartwright wrote:
On my machine, the output is:

str: 0.29365715475
array: 0.448095498171
unicode: 0.0243757237303
This tactic typically avoids most (sometimes all) of the calls to
memcmp. Other string search functions, including unicode.count,
unicode.index, and str.index, use this tactic, which is why you see
unicode.count performing better than str.count.


it's about time that someone sat down and merged the string and unicode
implementations into a single "stringlib" code base (see the SRE sources for
an efficient way to do this in plain C).

moving to (basic) C++ might also be a good idea (in 3.0, perhaps). is any-
one still stuck with pure C89 these days ?

</F>

Feb 28 '06 #3

"Ben Cartwright" <be****@gmail.com> wrote in message
news:11**********************@v46g2000cwv.googlegr oups.com...
Your evidence points to some unoptimized code in the underlying C
implementation of Python. As such, this should probably go to the
python-dev list (http://mail.python.org/mailman/listinfo/python-dev).

The problem is that the C library function memcmp is slow, and
str.count calls it frequently. See lines 2165+ in stringobject.c
(inside function string_count):

r = 0;
while (i < m) {
if (!memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}

This could be optimized as:

r = 0;
while (i < m) {
if (s[i] == *sub && !memcmp(s+i, sub, n)) {
r++;
i += n;
} else {
i++;
}
}

This tactic typically avoids most (sometimes all) of the calls to
memcmp. Other string search functions, including unicode.count,
unicode.index, and str.index, use this tactic, which is why you see
unicode.count performing better than str.count.


If not doing the same in str.count is indeed an oversight. a patch should
be welcome (on the SF tracker).

Feb 28 '06 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: phil | last post by:
Using Tkinter Canvas to teach High School Geometry with A LOT of success. My drawing gets very slow after a lot of actions. For instance I have created code to rotate a set of objects about a...
3
by: Xah Lee | last post by:
if i have mytext.replace(a,b) how to find out many many occurances has been replaced? Xah xah@xahlee.org ∑ http://xahlee.org/
2
by: boonkit | last post by:
I run a query on a 3 million rows table (avrg row length 2988) as below: SELECT COUNT(*) FROM tbl WHERE MATCH (col) AGAINST ('keyword'); The query above took from 5 - 10 seconds. Below is...
2
by: cefrancke | last post by:
I can't seem to find a straight answer for my specific issue. Any help would be appreciated. I would like to count the various items in a table where the fields have a 'group' relationship. I...
10
by: Henk Ernst Blok | last post by:
Hi Posgres users/developers, Can anyone explain why PosgreSQL (version 7.4.5 on Linux) does a full table scan to compute a count(*) on a base table after a vacuum analyze has been done with no...
11
by: Farel | last post by:
Which is Faster in Python and Why? jc = {}; m = x = ,,.......] # upwards of 10000 entries def mcountb(): for item in x: b = item; b.sort(); bc = 0 for bitem in b: bc += int(bitem) try: m...
1
by: aps786 | last post by:
Hi, There is a table where I store ipaddress and user who logged in from that IP. I have a query to findout all ipaddresses, from where diff users had made request. stat ------------ ip...
22
by: MP | last post by:
vb6,ado,mdb,win2k i pass the sql string to the .Execute method on the open connection to Table_Name(const) db table fwiw (the connection opened via class wrapper:) msConnString = "Data Source="...
8
by: C10B | last post by:
hi, I have a table with several million rows. Each row is simply the date and time a certain page was viewed. eg page1 1-1-00 page2 2-1-00 page1 16-1-00 page1 17-1-00
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.