str.count is slow

chrisperkins99

It seems to me that str.count is awfully slow. Is there some reason
for this?
Evidence:

######## str.count time test ########
import string
import time
import array

s = string.printable * int(1e5) # 10**7 character string
a = array.array('c', s)
u = unicode(s)
RIGHT_ANSWER = s.count('a')

def main():
    print 'str: ', time_call(s.count, 'a')
    print 'array: ', time_call(a.count, 'a')
    print 'unicode:', time_call(u.count, 'a')

def time_call(f, *a):
    start = time.clock()
    assert RIGHT_ANSWER == f(*a)
    return time.clock()-start

if __name__ == '__main__':
    main()

###### end ########

On my machine, the output is:

str: 0.29365715475
array: 0.448095498171
unicode: 0.0243757237303

If a unicode object can count characters so fast, why should an str
object be ten times slower? Just curious, really - it's still fast
enough for me (so far).

This is with Python 2.4.1 on WinXP.
Chris Perkins

Feb 27 '06 #1
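
For anyone repeating the measurement on a current interpreter, the script above can be approximated as follows. This is only a sketch under stated assumptions: it targets Python 3, where bytes plays the role of the 2.x str type and str plays the role of 2.x unicode, it uses timeit because time.clock no longer exists in recent versions, and the helper name time_count and the smaller REPEAT value are made up for illustration. The array case is left out because the 'c' typecode is not available in Python 3.

```python
import string
import timeit

# Assumption: Python 3, where bytes corresponds to the 2.x str type
# and str corresponds to the 2.x unicode type.
REPEAT = 10**4  # smaller than the original 10**5 to keep the run short

text = string.printable * REPEAT   # Unicode string, 100 * REPEAT characters
data = text.encode("ascii")        # the same content as a byte string
EXPECTED = REPEAT                  # 'a' occurs exactly once per copy


def time_count(obj, needle, number=5):
    """Return the best per-call time in seconds for obj.count(needle)."""
    # Check the result once, outside the timed region.
    assert obj.count(needle) == EXPECTED
    timer = timeit.Timer(lambda: obj.count(needle))
    return min(timer.repeat(repeat=3, number=number)) / number


if __name__ == "__main__":
    print("bytes:", time_count(data, b"a"))
    print("str:  ", time_count(text, "a"))
```

The absolute numbers will not be comparable with the 2.4.1 figures quoted above; the sketch only shows how the same comparison could be set up today.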


Ben Cartwright

Your evidence points to some unoptimized code in the underlying C
implementation of Python. As such, this should probably go to the
python-dev list (http://mail.python.org/mailman/listinfo/python-dev).

The problem is that calling the C library function memcmp is relatively
expensive, and str.count calls it at every candidate position. See lines
2165+ in stringobject.c (inside function string_count):

r = 0;
while (i < m) {
    if (!memcmp(s+i, sub, n)) {
        r++;
        i += n;
    } else {
        i++;
    }
}

This could be optimized as:

r = 0;
while (i < m) {
    if (s[i] == *sub && !memcmp(s+i, sub, n)) {
        r++;
        i += n;
    } else {
        i++;
    }
}

This tactic typically avoids most (sometimes all) of the calls to
memcmp. Other string search functions, including unicode.count,
unicode.index, and str.index, use this tactic, which is why you see
unicode.count performing better than str.count.
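
To see how many memcmp calls the first-byte check saves, both loops can be modelled in Python with a counter. This is a rough sketch and not the CPython code: count_naive and count_guarded are hypothetical names, a slice comparison stands in for memcmp, and both functions assume a non-empty sub.

```python
def count_naive(s, sub):
    """Model of the original loop: one memcmp at every position tried."""
    n = len(sub)              # assumed to be >= 1
    m = len(s) - n + 1        # one past the last valid start index
    i = r = calls = 0
    while i < m:
        calls += 1            # stand-in for one memcmp call
        if s[i:i + n] == sub:
            r += 1
            i += n            # skip past the match (non-overlapping count)
        else:
            i += 1
    return r, calls


def count_guarded(s, sub):
    """Model of the proposed loop: check the first byte before memcmp."""
    n = len(sub)              # assumed to be >= 1
    m = len(s) - n + 1
    i = r = calls = 0
    while i < m:
        if s[i] == sub[0]:
            calls += 1        # memcmp only runs when the first byte matches
            if s[i:i + n] == sub:
                r += 1
                i += n
                continue
        i += 1
    return r, calls


if __name__ == "__main__":
    s = b"abracadabra" * 1000
    print("naive:  ", count_naive(s, b"abra"))
    print("guarded:", count_guarded(s, b"abra"))
```

Each function returns the match count together with the number of simulated memcmp calls, so the two tallies can be compared directly for any input.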

The above might be optimized further for cases such as yours, where a
single character appears many times in the string:

r = 0;
if (n == 1) {
    /* optimize for a single character */
    while (i < m) {
        if (s[i] == *sub)
            r++;
        i++;
    }
} else {
    while (i < m) {
        if (s[i] == *sub && !memcmp(s+i, sub, n)) {
            r++;
            i += n;
        } else {
            i++;
        }
    }
}

Note that there might be some subtle reason, one I'm unaware of, why
neither of these optimizations is done... in which case a comment
in the C source would help. :-)
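
One way to gain some confidence that the two-branch version still counts correctly is to port it to Python and compare it with the built-in count. The sketch below does that; count_bytes is a hypothetical name, a slice comparison again stands in for memcmp, and a non-empty sub is assumed (the real C function handles the empty case separately).

```python
def count_bytes(s, sub):
    """Python model of the two-branch loop; assumes len(sub) >= 1."""
    n = len(sub)
    m = len(s) - n + 1        # one past the last valid start index
    i = r = 0
    if n == 1:
        # Single-character path: a plain comparison per position.
        first = sub[0]
        while i < m:
            if s[i] == first:
                r += 1
            i += 1
    else:
        # General path: first-byte check, then the full comparison.
        while i < m:
            if s[i] == sub[0] and s[i:i + n] == sub:
                r += 1
                i += n        # non-overlapping matches, like str.count
            else:
                i += 1
    return r


if __name__ == "__main__":
    cases = [(b"banana", b"a"), (b"banana", b"ana"), (b"aaaa", b"aa")]
    for haystack, needle in cases:
        # Each result should agree with the built-in count.
        print(haystack, needle, count_bytes(haystack, needle),
              haystack.count(needle))
```

Agreement on a handful of inputs is of course not a proof, but it is a cheap check before touching the C source.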

--Ben

Feb 27 '06 #2

Fredrik Lundh

Ben Cartwright wrote:

> This tactic typically avoids most (sometimes all) of the calls to
> memcmp. Other string search functions, including unicode.count,
> unicode.index, and str.index, use this tactic, which is why you see
> unicode.count performing better than str.count.

it's about time that someone sat down and merged the string and unicode
implementations into a single "stringlib" code base (see the SRE sources for
an efficient way to do this in plain C).

moving to (basic) C++ might also be a good idea (in 3.0, perhaps). is
anyone still stuck with pure C89 these days?

</F>

Feb 28 '06 #3

Terry Reedy

"Ben Cartwright" wrote:

> This tactic typically avoids most (sometimes all) of the calls to
> memcmp. Other string search functions, including unicode.count,
> unicode.index, and str.index, use this tactic, which is why you see
> unicode.count performing better than str.count.

If not doing the same in str.count is indeed an oversight, a patch should
be welcome (on the SF tracker).

Feb 28 '06 #4
