PATCH: Speed up direct string concatenation by 20+%!

Larry Hastings

This is such a long posting that I've broken it out into sections.
Note that while developing this patch I discovered a Subtle Bug
in CPython, which I have discussed in its own section below.

____________
THE OVERVIEW

I don't remember where I picked it up, but I remember reading years
ago that the simple, obvious Python approach for string concatenation:
x = "a" + "b"
is slow. Obviously it's fine for the simple case above, but if you're
adding together twenty-five things:
x = a + b + c + d + ... + y
the performance is terrible, as the interpreter creates temporary
objects
for each intermediate result. (a + b, then (a + b) + c,
then ((a + b) + c) + d, and so on.)

The suggested idiom for fast string concatenation was this:
x = "".join([a, b, c, d, ... y])
This works fine. But I don't like this approach, not only because it's
kind of ugly and inobvious, but because it violates the "one obvious
way
to do it" principle.

I mentioned all this to a friend of mine (Mike Thornburgh), and told
him matter-of-factly that I'd like to fix this, but the design of
Python
simply precluded it. He said "no it doesn't!" and proceeded to
describe
in great detail an approach to permit fast string concatenation using
+.
I have implemented what he suggested--with only slight changes--and
present it to you below. I only had to touch four files, and only did
major surgery on one. The resulting Python passes the test suite as
well
as my unmodified locally-built Python 2.5. (I didn't install the
external
source trees for stuff like SQLite, so it fails those tests.)

Note that times have changed; using Python 2.5, a + b + c isn't *that*
much slower than "".join([]). But is is still slower. Or, was--with
my
patch, a + b + c becomes *the fastest method* for string concatenation.
Hooray!

(It didn't pay off as *much* as I'd hoped, but it's still an
improvement.)

_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:

* String concatenation using + is now the fastest way to
concatenate
strings (that I know of).

* Throw off the shackles of "".join([]), you don't need it
anymore.

* Did I mention it was faster?

Downsides to this approach:

* Changes how PyStringObjects are stored internally; ob_sval is
no
longer a char[1], but a char *. This makes each StringObject
four bytes larger.

* Adds another memory dereference in order to get the value of
a string, which is a teensy-weensy slowdown.

* Would force a recompile of all C modules that deal directly
with string objects (which I imagine is most of them).

* Also, *requires* that C modules use the PyString_AS_STRING()
macro, rather than casting the object and grabbing ob_sval
directly. (I was pleased to see that the Python source
was very good about using this macro; if all Python C
modules are this well-behaved, this point is happily moot.)

* On a related note, the file Mac/Modules/MacOS.c implies
that there are Mac-specific Python scripts that peer
directly into string objects. These would have to be
changed to understand the new semantics.

* String concatenation objects are 36 bytes larger than
string objects, and this space will often go unreclaimed
after the string is rendered.

* When rendered, string concatenation objects storing long
strings will allocate a second buffer from the heap to
store the string. So this adds some minor allocation
overhead (though this is offset by the speed gain from
the approach overall).

* Will definitely need some heavy review before it could
go in, in particular I worry I got the semantics surrounding
"interned" strings wrong.
Internally, a "string concatenation" object is the same as a "string"
object with two extra fields: an array of references to string objects
(and string concatenation objects), and a count. Also, ob_sstate now
stores a third flag, indicating whether or not the string is a string
concatenation object. External users don't need to worry about these
details, they just use PyString_AS_STRING() and all is well with their
world.

I've already added some optimizations to the approach:

* If you're adding a string to an existing string concatenation
object whose reference count is 1, who isn't full, and who
hasn't been rendered yet, it adds the string directly to the
existing concatenation object (rather than creating a new
one).
This works whether the existing object is on the right or the
left side.

* The general case is a + b + c ..., where you're only adding
strings to the *end*. This will build a degenerate tree
where
the first slot points to the child. The tree for
concatenating
the first twenty-two letters of the alphabet would look like
this:

[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| p q r s t u v
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| i j k l m n o
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
v v v v v v v v
a b c d e f g h

The rendering function is by necessity recursive. However,
it is optimized for this shape; when the tree looks like
this,
it will be rendered purely iteratively--no recursion.

* If, after rendering, the resulting string will be small
enough
to fit inside the extra space normally used by the
concatenation
object, the final string will be copied on top of this
otherwise
now-wasted space.

______________
THE BENCHMARKS

Benchmark 1:
def add(a, b, c, ... t): return a + b + c + ... + t
for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
Benchmark 2:
for i in range(10000000): x = "aaa" + "bbb" + "ccc" + ... +
"ttt"
Benchmark 3:
for i in range(10000000): x = "".join(["aaa", "bbb", "ccc",
...., "ttt"])
Benchmark 4:
for i in range(10000000):
x = "aaa"
x += "bbb"
x += "ccc"
...
x += "ttt"
Benchmark 5:
for i in range(10000000):
array = []
array.append("aaa")
array.append("bbb")
array.append("ccc")
...
array.append("ttt")
x = "".join(array)
Benchmark 6:
for i in range(10000000):
x = cStringIO.StringIO()
x.write("aaa")
x.write("bbb")
x.write("ccc")
...
x.write("ttt")

Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | %
improvement
1 | 32.1s | 32.8s | 26.7s |
22.8%
2 | 18.7s | 18.7s | 15.1s |
23.8%
3 | 16.1s | 16.2s | 16.6s |
-1.1%
4 | 64.6s | 64.0s | 57.1s |
12.1%
5 | 74.1s | 79.4s | 76.7s |
3.5%
6 | 126.0s | too slow, didn't bother!

______________
THE SUBTLE BUG

While developing this patch, I discovered a subtle bug in Python.
It happened only on one gruesome test in the middle of test_descr.py.
The problem was: '' == '' was returning False.

Starting on line 1116 of stringobject.c we have this:

if (op == Py_EQ) {
/* Supporting Py_NE here as well does not save
much time, since Py_NE is rarely used. */
if (a->ob_size == b->ob_size
&& (a->ob_sval[0] == b->ob_sval[0]
&& memcmp(a->ob_sval, b->ob_sval,
a->ob_size) == 0)) {
result = Py_True;
} else {
result = Py_False;
}
goto out;
}

The bug is with the "a->ob_sval[0] == b->ob_sval[0]". This is
apparently
a speed optimization; check that the first byte is the same before
proceeding with the memcmp(). But it does this even when *both* strings
are *zero length*! I doubt the Python source tree has an official
position on what the first byte of a zero-length string should be, but
I assert that it should be undefined. In practice, it's seemingly
always
set to the same value in the released Python, but with my changes it's
now possible for them to be different.

I think it's sinful to compare the first bytes of zero-length strings.
Therefore I suggest this bugfix:

if (a->ob_size == b->ob_size
/* >change >*/ && ( (!a->ob_size || a->ob_sval[0] ==
b->ob_sval[0])
&& memcmp(a->ob_sval, b->ob_sval,

I've already incorporated this fix in my code.

__________
THE FUTURE

If this patch is accepted, it paves the way for another cheap
optimization: string slices could be done without copying.
Add another field into the PyString structure: a reference
to another PyString *. If non-NULL, it means the local ob_sval
actually points into the ob_sval owned by that other PyString *.
It'd be another four bytes, but we'd only need to incur that
when creating a slice; we could set a bit in ob_sstate indicating
"this is a slice", and if so we'd have to copy-on-write ob_sval,
and free the reference when the object was destroyed.

______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere? Put it up on a web page? Post some diff output? (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again. I would be happy to submit that separately. (What can I
say, old habits die hard. I can't stand the newer Microsoft IDEs.)
Cheers,
/larry/

p.s. No, I haven't looked at a Unicode port yet. If there's interest
in
this patch at all, I might take a stab at it.

Sep 29 '06 #1

Subscribe Post Reply

2610

Steve Holden

Larry Hastings wrote:

This is such a long posting that I've broken it out into sections.
Note that while developing this patch I discovered a Subtle Bug
in CPython, which I have discussed in its own section below.

[...]

______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere? Put it up on a web page? Post some diff output? (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again. I would be happy to submit that separately. (What can I
say, old habits die hard. I can't stand the newer Microsoft IDEs.)
Cheers,
/larry/

p.s. No, I haven't looked at a Unicode port yet. If there's interest
in
this patch at all, I might take a stab at it.

A fairly short reply: you should diff your source against the current
SVN repository and lodge that diff as a patch on SourceForge. It
wouldn't hurt to let one or two developers know, thought they *might*
pick it up without.

Your suggested bug isn't, I think a real bug in the current
implementation because as I understand it Python strings do always
include a trailing null byte to act as a terminator when the string is
passed to C code in extensions.

Nice idea, though. You might also see how it does on tasks like

s = ""
for i in range(100000):
s += "a"

since I believe that's the main culprit in increasing the O(). Also note
that some optimisation has recently been performed on string
concatenation, which I presume your code has already taken into account.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 29 '06 #2

Robin Becker

Larry Hastings wrote:
______

THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:

.........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real code.
--
Robin Becker

Sep 29 '06 #3

Rob Williscroft

Larry Hastings wrote in news:1159495643.213830.289620
@m7g2000cwm.googlegroups.com in comp.lang.python:

_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

On the python 3k list there is a thread about stings becoming "views",
I think this is similar to what you are doing here.

<url:http://search.gmane.org/?
query=string+view&author=&group=gmane.comp.python. python-
3000.devel&sort=relevance&DEFAULTOP=and&xP=string. view.
&xFILTERS=Gcomp.python.python-3000.devl---A>

http://tinyurl.com/kadco

Rob.

Sep 29 '06 #4

Felipe Almeida Lessa

28 Sep 2006 19:07:23 -0700, Larry Hastings <la***@hastings.org>:

THE BENCHMARKS

Benchmark 1:
def add(a, b, c, ... t): return a + b + c + ... + t
for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")

[snip]

What about "a + b"? Or "a + b + c"? Does it have a large overhead on
small string concatenations?

--
Felipe.

Sep 29 '06 #5

Carl Friedrich Bolz

Robin Becker wrote:

Larry Hastings wrote:
______
>THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.

I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Cheers,

Carl Friedrich

Sep 29 '06 #6

Carl Friedrich Bolz

Robin Becker wrote:

Larry Hastings wrote:
______
>THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.

I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real

code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Cheers,

Carl Friedrich

Sep 29 '06 #7

cfbolz

Robin Becker wrote:

Larry Hastings wrote:
______
>THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too (at least if you don't do additional things).

I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real code.

I think there are quite a lot of string additions around, it's just
that people get told to use join all the time, so they are not written
using "+".

Cheers,

Carl Friedrich Bolz

Sep 29 '06 #8

Carl Friedrich Bolz

Robin Becker wrote:

Larry Hastings wrote:
______
>THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.

I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real

code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Cheers,

Carl Friedrich

Sep 29 '06 #9

Carl Friedrich Bolz

Carl Friedrich Bolz wrote:

Robin Becker wrote:
>Larry Hastings wrote:
______
>>THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:
........

wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.

no, I think it depends on strings being immutable. If you do list1 +
list2 that way and list1 is mutated then the resulting list would be
changed too.

>I suppose the utility of such an approach depends on the frequency with
which multiple strings/lists/sequences etc are added together in real
code.

I think there are quite a lot of string additions around, it's just that
people get told to use join all the time, so they are not written using "+".

Sorry for the multiple post :-(

Sep 29 '06 #10

Larry Hastings

Steve Holden wrote:

you should diff your source against the current
SVN repository and lodge that diff as a patch on SourceForge.

Okay, I'll try to do that today.

Your suggested bug isn't, I think a real bug in the current
implementation because as I understand it Python strings do always
include a trailing null byte to act as a terminator when the string
is passed to C code in extensions.

Ah! Excellent point. So the Python source tree *does* have an
official position on what the first byte of a zero-length string should
be. I'll fix my code so that is ensured.

Nice idea, though. You might also see how it does on tasks like

s = ""
for i in range(100000):
s += "a"

Sure. Here are the results, but with range (10000000):
Python 2.5 release: 31.0s
Python 2.5 locally built: 30.7s
Python 2.5 concat: 4.4s
% Improvement: *697%*!

Woah! Thanks for suggesting a benchmark where it really shines :)

The real big win here is storing eight pointers per concatenation
object. That means 1/8th as many objects created/destroyed.

In testing your suggested benchmark, it overflowed the stack
when freeing the concatenation tree. So I changed deallocation
of string concatenation objects so it's iterative down the
left-hand side like rendering; now it's happy again. Thanks
again for suggesting the test.

It would still blow up if you ran
s = ""
for i in range(10000000):
s = "a" + s
because it's only clever about recursion down the left-hand side.
So my advice is, don't. :)

Also note that some optimisation has recently been performed on string
concatenation, which I presume your code has already taken into account.

Totally. I'm just glad there was something left for me to contribute!
Rob Williscroft wrote:

On the python 3k list there is a thread about stings becoming "views",
I think this is similar to what you are doing here.

As I understand it, sort-of. Definitely moreso in my section on THE
FUTURE--adding another subtype of string for slices which point into
other strings. I believe views are more sophisticated however.
Felipe Almeida Lessa wrote:

What about "a + b"? Or "a + b + c"? Does it have a large overhead on
small string concatenations?

It's *slightly* slower for two:

def addTwoThings(a, b):
return a + b
for i in range(10000000):
x = addTwoThings("aaa", "bbb")

Python 2.5 release: 6.219s
Python 2.5 locally built: 6.219s
Python 2.5 concat: 6.234s
% improvement: -0.03%

But starts paying off already, even with three:

def addThreeThings(a, b, c):
return a + b + c
for i in range(10000000):
x = addThreeThings("aaa", "bbb", "ccc")

Python 2.5 release: 7.9s
Python 2.5 locally built: 7.7s
Python 2.5 concat: 7.2s
% improvement: 6.94%

Note that I was lazy and only did one benchmark run apiece.

Also, keep in mind that memory utilization will be a little higher
with the string concatenation objects.
Carl Friedrich Bolz wrote:

Robin Becker wrote:
wouldn't this approach apply to other additions eg list+list seq+seq etc
etc.
no, I think it depends on strings being immutable.

Exactly.
/larry/

Sep 29 '06 #11

Fredrik Lundh

Larry Hastings wrote:

>Nice idea, though. You might also see how it does on tasks like

s = ""
for i in range(100000):
s += "a"

Sure. Here are the results, but with range (10000000):

ten million? what hardware are you running this on?

Python 2.5 release: 31.0s
Python 2.5 locally built: 30.7s
Python 2.5 concat: 4.4s
% Improvement: *697%*!

Woah! Thanks for suggesting a benchmark where it really shines :)

what's in "s" when that loop is done?

</F>

Sep 29 '06 #12

Larry Hastings

Fredrik Lundh wrote:

Sure. Here are the results, but with range (10000000):
ten million? what hardware are you running this on?

Athlon 64 x2 4400+ (aka 2.2GHz), 3GB of RAM, Windows XP.

what's in "s" when that loop is done?

It's equivalent to " 'a' * 10000000 ". (I shan't post it here.)
Also, I misspoke earlier; it's not an improvement of 697%, but of 597%.
To be precise, it takes about 1/7 the time of the original.

Cheers,
/larry/

Sep 29 '06 #13

William Heymann

On Friday 29 September 2006 08:34, Larry Hastings wrote:

It would still blow up if you ran
s = ""
for i in range(10000000):
s = "a" + s

This is a pretty small change but I would suggest xrange instead of range.
That way you don't allocate that large list just to throw all the items away.

In this case xrange will probably be faster then range also.

Sep 29 '06 #14

Fredrik Lundh

Larry Hastings wrote:

>what's in "s" when that loop is done?

It's equivalent to " 'a' * 10000000 ". (I shan't post it here.)

but what *is* it ? an ordinary PyString object with a flattened buffer,
or something else ?

</F>

Sep 29 '06 #15

Larry Hastings

William Heymann wrote:

This is a pretty small change but I would suggest xrange instead of range.

Good point! Since I was calling range() during the benchmark, it was
timed too. Switching to xrange() will mean less overhead.

I re-ran this benchmark (again):
s = ""
for i in range(100000):
s += "a"

The result:
Python 2.5 release: 30.1s
Python 2.5 locally built: 30.4s
Python 2.5 concat: 3.95s
Improvement: *669%*! (1/7.7 the time.)

So xrange(10000000) was adding between 0.4s and 0.9s overhead, and
losing it makes my approach look even better. Thanks!
/larry/

Sep 29 '06 #16

Larry Hastings

Fredrik Lundh wrote:

what's in "s" when that loop is done?
It's equivalent to " 'a' * 10000000 ". (I shan't post it here.)
but what *is* it ? an ordinary PyString object with a flattened buffer,
or something else ?

At the exact moment that the loop is done, it's a
PyStringConcatenationObject * which points to a deep one-sided tree of
more PyStringConcatenationObject * objects. Its ob_sval is NULL, which
means that the first time someone asks for its value (via the macro
PyString_AS_STRING()) it will be computed. When it's computed, the
interpreter will allocate a buffer of 10000001 bytes and walk the tree,
filling the buffer with ten million 'a's followed by a zero. It'll
then dereference all its children.

The PyStringConcatenationObject struct is a child of PyStringObject,
and external users can ignore the difference as long as they use the
macros in stringobject.h (e.g. using PyString_AS_STRING(), rather than
casting to PyStringObject and using ob_sval directly).

Sorry for misunderstanding the nature of your question the first time,
/larry/

Sep 29 '06 #17

Fredrik Lundh

Larry Hastings wrote:

>

At the exact moment that the loop is done, it's a
PyStringConcatenationObject * which points to a deep one-sided tree of
more PyStringConcatenationObject * objects. Its ob_sval is NULL, which
means that the first time someone asks for its value (via the macro
PyString_AS_STRING()) it will be computed. When it's computed, the
interpreter will allocate a buffer of 10000001 bytes and walk the tree,
filling the buffer with ten million 'a's followed by a zero. It'll
then dereference all its children.

so what does the benchmark look like if you actually do this ?

</F>

Sep 29 '06 #18

Larry Hastings

Fredrik Lundh wrote:

so what does the benchmark look like if you actually do this ?

Okay, timing this:
x = ""
for i in range(100000):
x += "a"
t = x[1] # forces the concat object to render

The result:
Python 2.5 release: 30.0s
Python 2.5 locally built: 30.2s
Python 2.5 concat: 4.3s
Improvement: 600% (1/7 of the time)
/larry/

Sep 29 '06 #19

Carl Friedrich Bolz

Larry Hastings wrote:
[snip]

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

[snip]

Sounds cool. If I understand it correctly you can write even meaner (and
probably even more useless in practive) benchmarks where you safe some
of the intermediate results of the concatenation, thus also explointing
the better space behaviour:
all = []

s = ""

for i in range(1000):
s = s + (str(i) + " ") * 1000
all.append(s)
This should take around 2GB of RAM (so maybe you shouldn't even run it
:-) ) on an unpatched CPython but a lot less with your patch.

<sidenode>
This is exactly the sort of experiment that is extremely easy
to do with PyPy. In fact, some other PyPyers and me wrote a very similar
optimization, which can be compiled in if wanted (or not). The whole
implementation took roughly 100 lines of code, of which 65 are in a
separate file not touching anything else, the rest being minimally
invasive changes. We also implemented a similar optimization for string
slicing (only if the slice has a substantial length and a step of 1).
</sidenode>

Cheers,

Carl Friedrich Bolz

Sep 29 '06 #20

Steve Holden

Larry Hastings wrote:

Fredrik Lundh wrote:

>>so what does the benchmark look like if you actually do this ?

Okay, timing this:
x = ""
for i in range(100000):
x += "a"
t = x[1] # forces the concat object to render

The result:
Python 2.5 release: 30.0s
Python 2.5 locally built: 30.2s
Python 2.5 concat: 4.3s
Improvement: 600% (1/7 of the time)
/larry/

Interesting. Does a comparison also force it to render? If not, I wonder
if comparisons might not end up slower. It does sound like memory usage
might go through the roof with this technique under certain
circumstances, so the more extensive your tests are the more likely you
are to see the change actually used (I'm not convinced you'll persuade
the developers to include this).

Whether or not it does get incorporated I think it's a great
demonstration of how amenable Python is to experimentation with
alternate representations, and I think your project might make a very
interesting PyCon paper for people who were thinking about joining the
development effort but hadn't yet started.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 30 '06 #21

Colin J. Williams

Congratulations on the clear way in which you have set out your proposal.

I hope that it will make its way to PEPdom.

Colin W.

Larry Hastings wrote:

This is such a long posting that I've broken it out into sections.
Note that while developing this patch I discovered a Subtle Bug
in CPython, which I have discussed in its own section below.

____________
THE OVERVIEW

I don't remember where I picked it up, but I remember reading years
ago that the simple, obvious Python approach for string concatenation:
x = "a" + "b"
is slow. Obviously it's fine for the simple case above, but if you're
adding together twenty-five things:
x = a + b + c + d + ... + y
the performance is terrible, as the interpreter creates temporary
objects
for each intermediate result. (a + b, then (a + b) + c,
then ((a + b) + c) + d, and so on.)

The suggested idiom for fast string concatenation was this:
x = "".join([a, b, c, d, ... y])
This works fine. But I don't like this approach, not only because it's
kind of ugly and inobvious, but because it violates the "one obvious
way
to do it" principle.

I mentioned all this to a friend of mine (Mike Thornburgh), and told
him matter-of-factly that I'd like to fix this, but the design of
Python
simply precluded it. He said "no it doesn't!" and proceeded to
describe
in great detail an approach to permit fast string concatenation using
+.
I have implemented what he suggested--with only slight changes--and
present it to you below. I only had to touch four files, and only did
major surgery on one. The resulting Python passes the test suite as
well
as my unmodified locally-built Python 2.5. (I didn't install the
external
source trees for stuff like SQLite, so it fails those tests.)

Note that times have changed; using Python 2.5, a + b + c isn't *that*
much slower than "".join([]). But is is still slower. Or, was--with
my
patch, a + b + c becomes *the fastest method* for string concatenation.
Hooray!

(It didn't pay off as *much* as I'd hoped, but it's still an
improvement.)

_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object. Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet. The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:

* String concatenation using + is now the fastest way to
concatenate
strings (that I know of).

* Throw off the shackles of "".join([]), you don't need it
anymore.

* Did I mention it was faster?

Downsides to this approach:

* Changes how PyStringObjects are stored internally; ob_sval is
no
longer a char[1], but a char *. This makes each StringObject
four bytes larger.

* Adds another memory dereference in order to get the value of
a string, which is a teensy-weensy slowdown.

* Would force a recompile of all C modules that deal directly
with string objects (which I imagine is most of them).

* Also, *requires* that C modules use the PyString_AS_STRING()
macro, rather than casting the object and grabbing ob_sval
directly. (I was pleased to see that the Python source
was very good about using this macro; if all Python C
modules are this well-behaved, this point is happily moot.)

* On a related note, the file Mac/Modules/MacOS.c implies
that there are Mac-specific Python scripts that peer
directly into string objects. These would have to be
changed to understand the new semantics.

* String concatenation objects are 36 bytes larger than
string objects, and this space will often go unreclaimed
after the string is rendered.

* When rendered, string concatenation objects storing long
strings will allocate a second buffer from the heap to
store the string. So this adds some minor allocation
overhead (though this is offset by the speed gain from
the approach overall).

* Will definitely need some heavy review before it could
go in, in particular I worry I got the semantics surrounding
"interned" strings wrong.
Internally, a "string concatenation" object is the same as a "string"
object with two extra fields: an array of references to string objects
(and string concatenation objects), and a count. Also, ob_sstate now
stores a third flag, indicating whether or not the string is a string
concatenation object. External users don't need to worry about these
details, they just use PyString_AS_STRING() and all is well with their
world.

I've already added some optimizations to the approach:

* If you're adding a string to an existing string concatenation
object whose reference count is 1, who isn't full, and who
hasn't been rendered yet, it adds the string directly to the
existing concatenation object (rather than creating a new
one).
This works whether the existing object is on the right or the
left side.

* The general case is a + b + c ..., where you're only adding
strings to the *end*. This will build a degenerate tree
where
the first slot points to the child. The tree for
concatenating
the first twenty-two letters of the alphabet would look like
this:

[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| p q r s t u v
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
| v v v v v v v
| i j k l m n o
|
v
[0 1 2 3 4 5 6 7]
| | | | | | | |
v v v v v v v v
a b c d e f g h

The rendering function is by necessity recursive. However,
it is optimized for this shape; when the tree looks like
this,
it will be rendered purely iteratively--no recursion.

* If, after rendering, the resulting string will be small
enough
to fit inside the extra space normally used by the
concatenation
object, the final string will be copied on top of this
otherwise
now-wasted space.

______________
THE BENCHMARKS

Benchmark 1:
def add(a, b, c, ... t): return a + b + c + ... + t
for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
Benchmark 2:
for i in range(10000000): x = "aaa" + "bbb" + "ccc" + ... +
"ttt"
Benchmark 3:
for i in range(10000000): x = "".join(["aaa", "bbb", "ccc",
..., "ttt"])
Benchmark 4:
for i in range(10000000):
x = "aaa"
x += "bbb"
x += "ccc"
...
x += "ttt"
Benchmark 5:
for i in range(10000000):
array = []
array.append("aaa")
array.append("bbb")
array.append("ccc")
...
array.append("ttt")
x = "".join(array)
Benchmark 6:
for i in range(10000000):
x = cStringIO.StringIO()
x.write("aaa")
x.write("bbb")
x.write("ccc")
...
x.write("ttt")

Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | %
improvement
1 | 32.1s | 32.8s | 26.7s |
22.8%
2 | 18.7s | 18.7s | 15.1s |
23.8%
3 | 16.1s | 16.2s | 16.6s |
-1.1%
4 | 64.6s | 64.0s | 57.1s |
12.1%
5 | 74.1s | 79.4s | 76.7s |
3.5%
6 | 126.0s | too slow, didn't bother!

______________
THE SUBTLE BUG

While developing this patch, I discovered a subtle bug in Python.
It happened only on one gruesome test in the middle of test_descr.py.
The problem was: '' == '' was returning False.

Starting on line 1116 of stringobject.c we have this:

if (op == Py_EQ) {
/* Supporting Py_NE here as well does not save
much time, since Py_NE is rarely used. */
if (a->ob_size == b->ob_size
&& (a->ob_sval[0] == b->ob_sval[0]
&& memcmp(a->ob_sval, b->ob_sval,
a->ob_size) == 0)) {
result = Py_True;
} else {
result = Py_False;
}
goto out;
}

The bug is with the "a->ob_sval[0] == b->ob_sval[0]". This is
apparently
a speed optimization; check that the first byte is the same before
proceeding with the memcmp(). But it does this even when *both* strings
are *zero length*! I doubt the Python source tree has an official
position on what the first byte of a zero-length string should be, but
I assert that it should be undefined. In practice, it's seemingly
always
set to the same value in the released Python, but with my changes it's
now possible for them to be different.

I think it's sinful to compare the first bytes of zero-length strings.
Therefore I suggest this bugfix:

if (a->ob_size == b->ob_size
/* >change >*/ && ( (!a->ob_size || a->ob_sval[0] ==
b->ob_sval[0])
&& memcmp(a->ob_sval, b->ob_sval,

I've already incorporated this fix in my code.

__________
THE FUTURE

If this patch is accepted, it paves the way for another cheap
optimization: string slices could be done without copying.
Add another field into the PyString structure: a reference
to another PyString *. If non-NULL, it means the local ob_sval
actually points into the ob_sval owned by that other PyString *.
It'd be another four bytes, but we'd only need to incur that
when creating a slice; we could set a bit in ob_sstate indicating
"this is a slice", and if so we'd have to copy-on-write ob_sval,
and free the reference when the object was destroyed.

______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere? Put it up on a web page? Post some diff output? (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again. I would be happy to submit that separately. (What can I
say, old habits die hard. I can't stand the newer Microsoft IDEs.)
Cheers,
/larry/

p.s. No, I haven't looked at a Unicode port yet. If there's interest
in
this patch at all, I might take a stab at it.

Oct 1 '06 #22

Larry Hastings

An update: I have submitted this as a patch on SourceForge.
It's request ID #1569040.
http://sourceforge.net/tracker/?grou...70&atid=305470
I invite everyone to take it for a spin!

There are some improvements in this version. Specifically:

* Python will no longer crash if you do ten million prepends
( x = 'a' + x ). Since the problem was blowing the stack
with an incredibly deep render, I now limit the depth of
the string concatenation objects (currently set at 16384).
Note that string prepending is now *immensely* faster, as
prepending in the existing implementation is a worst-case.

* I figured out why my zero-length strings were occasionally
not zero terminated. It had to do with subclassing a string
and storing an attribute in the object, which meant storing
a dict, and where specifically the interpreter chose to store
that. The solution was essentially to ensure there's always
space in the object for the trailing zero.

When running regrtest.py, my patched version produces identical
output to a non-patched build on Windows.
Steve Holden wrote:

Does a comparison also force it to render?

Yes. Any attempt to examine the string causes it to render.

It does sound like memory usage
might go through the roof with this technique under certain
circumstances, so the more extensive your tests are the more likely you
are to see the change actually used (I'm not convinced you'll persuade
the developers to include this).

Yeah, I expect memory usage to be higher too, but not by a fantastic
amount. Once you render the concatenation, it drops all the references
to the child objects held in the tree, and what's left is a string
object with some extra space on the end.

I think your project might make a very
interesting PyCon paper for people who were thinking about joining the
development effort but hadn't yet started.

Perhaps; I've never been to PyCon, but it might be fun to give a
presentation there. That said, it would be way more relevant if the
patch got accepted, don'tcha think?

Cheers,
/larry/

p.s. Thanks for the sentiment, Colin W.!

Oct 2 '06 #23

Fredrik Lundh

Larry Hastings wrote:

There are some improvements in this version. Specifically:

* Python will no longer crash if you do ten million prepends
( x = 'a' + x ). Since the problem was blowing the stack
with an incredibly deep render, I now limit the depth of
the string concatenation objects (currently set at 16384).
Note that string prepending is now *immensely* faster, as
prepending in the existing implementation is a worst-case.

You really should look up something called "ropes".

You should also benchmark this against code that uses the ordinary
append/join pattern. (you've posted conflicting benchmarks for 2.5,
but if I'm trusting the benchmarks that looks more reasonable, the
standard implementation pattern is still around 10 times faster than
your approach).

Perhaps; I've never been to PyCon, but it might be fun to give a
presentation there. That said, it would be way more relevant if the
patch got accepted, don'tcha think?

It's rather unlikely that something like this will ever be added to
the 2.X series. It's pretty unlikely for 3.X as well (GvR used a
rope-like structure for ABC, and it was no fun for anyone), but it'll
most likely be a lot easier to provide this as an option for 3.X.

</F>

Oct 2 '06 #24

John Machin

Fredrik Lundh wrote:

>
You should also benchmark this against code that uses the ordinary
append/join pattern. (you've posted conflicting benchmarks for 2.5,
but if I'm trusting the benchmarks that looks more reasonable, the
standard implementation pattern is still around 10 times faster than
your approach).

And the OP might like to go a bit further, and as well as benchmarking
this style:
array = []
array.append("aaa")
array.append("bbb")
etc
try benchmarking this ... well "style" may not be the appropriate word
:-)
array = []
arrapp = array.append
arrapp('aaa')
arrapp('bbb')
etc

Cheers,
John

Oct 2 '06 #25

Aahz

In article <11**********************@b28g2000cwb.googlegroups .com>,
Larry Hastings <la***@hastings.orgwrote:

>Steve Holden wrote:
>>
I think your project might make a very
interesting PyCon paper for people who were thinking about joining the
development effort but hadn't yet started.

Perhaps; I've never been to PyCon, but it might be fun to give a
presentation there. That said, it would be way more relevant if the
patch got accepted, don'tcha think?

Not really. The principles involved are timeless, and I guarantee you a
large audience regardless of whether the patch gets accepted (provided
you specify an appropriate presentation title and summary).
--
Aahz (aa**@pythoncraft.com) <* http://www.pythoncraft.com/

"LL YR VWL R BLNG T S" -- www.nancybuttons.com

Oct 2 '06 #26

Larry Hastings

Fredrik Lundh wrote:

You should also benchmark this against code that uses the ordinary
append/join pattern.

Sorry, thought I had. Of course, now that the patch is up on
Sourceforce you could download it and run all the benchmarks you like.

For all the benchmarks I ran below, the number listed is the best of
three runs. Time was computed using sum(os.times()[:2]).

Running this under Python 2.5 release:
x = []
for i in xrange(10000000):
x.append("a")
y = "".join(x)
took 4421ms.

Running this under my patched Python:
x = ""
for i in xrange(10000000):
x += "a"
y = x[1]
took 4406ms.

I assert that my code makes + as fast as the old "".join([]) idiom.

It's rather unlikely that something like this will ever be added to
the 2.X series. It's pretty unlikely for 3.X as well (GvR used a
rope-like structure for ABC, and it was no fun for anyone), but it'll
most likely be a lot easier to provide this as an option for 3.X.

I can't address the ABC implementation as I've never seen it. But my
patch only touches four files. Two are obviously stringobject.c and .h.
The other two files, ceval.c and codeobject.c, only got one-line
changes. My changes to PyStringObject are well-encapsulated; as long
as core / extension programmers continue to use PyString_AS_STRING() to
access the char * in a PyStringObject they will never notice the
difference.
John Machin wrote:

try benchmarking this ... well "style" may not be the appropriate word

Running this under Python 2.5 release:
x = []
xappend = x.append
for i in xrange(10000000):
xappend("a")
y = "".join(x)
took 3281ms.

Running this under my patched Python 2.5:
x = ""
xappend = x.__add__
for i in xrange(10000000):
xappend("a")
y = "".join(x)
took 3343ms.
/larry/

Oct 3 '06 #27

John Machin

Larry Hastings wrote:

>
John Machin wrote:
try benchmarking this ... well "style" may not be the appropriate word

Running this under Python 2.5 release:
x = []
xappend = x.append
for i in xrange(10000000):
xappend("a")
y = "".join(x)
took 3281ms.

Running this under my patched Python 2.5:
x = ""
xappend = x.__add__
for i in xrange(10000000):
xappend("a")
y = "".join(x)

Don't you mean y = x[1] or something like that? y = "".join(x) looks
like a copy-paste error.

Playing the devil's advocate here: 10M adds each of a single byte
followed by 1 render doesn't seem very typical. How about 10 adds each
of 10 bytes followed by a render, then repeat that 10 M times.

Another question:
How does this:
buff = ""
for datum in data:
if buff:
buff += "|"
buff += str(datum)
compare with
buff = "|".join(str(datum) for datum in data)
?

Some of us have Windows boxes and don't have the necessary MS compiler.
Is there any chance of someone making a 2.5+patch Windows binary?

Cheers,
John

took 3343ms.
/larry/

Oct 3 '06 #28

Larry Hastings

John Machin wrote:

Don't you mean y = x[1] or something like that? y = "".join(x) looks
like a copy-paste error.

You're right, by gum. Worse than that, my benchmark wasn't actually
*doing* much of anything there; at the end of the run x was still
length 0. That was sloppy, and I apologize.

I cannot find a speedy equivalent to the 'xappend = x.append' trick for
string concatenation using +. 'operator.__iadd__' is slower, a labored
attempt using 'x = x("a").__add__' is better but still slower. So far,
just calling 'x += "a"' is the fastest way, and that's 4.4s, still beat
by the 'xappend = x.append' trick at 3.2s.

Playing the devil's advocate here: 10M adds each of a single byte
followed by 1 render doesn't seem very typical. How about [...]
Some of us have Windows boxes and don't have the necessary MS compiler.
Is there any chance of someone making a 2.5+patch Windows binary?

I actually developed it on Windows. Here you go:
http://larryhastings.com/programming...2.5.concat.zip
That contains the new python25.dll and my hacked-up benchmark script.
Just back up your existing python25.dll, then unzip to your Python
directory, and you're ready to rock.

Enjoy,
/larry/

Oct 3 '06 #29

Istvan Albert

I remember a similar patch from some time ago:

http://mail.python.org/pipermail/pyt...st/046686.html

i

Oct 3 '06 #30

Larry Hastings

Istvan Albert wrote:

I remember a similar patch from some time ago:
http://mail.python.org/pipermail/pyt...st/046686.html

That patch addressed the same general problem, but took a completely
different approach. For the record, this patch (#980695) optimized "x
+= a" by examining the next instruction after the addition. If it's a
store instruction, and the object being stored to holds a reference to
the left side of the concatenation, it makes the object drop the
reference *before* doing the concatenation. In the case where there
was only one reference to the string on the left side (that being "x"),
this allows the string concatenation function to use a very fast
shortcut: resize the left side and concatenate the right side directly
to the end.

The patch was accepted in August of 2004, so it might have been folded
in to CPython during the 2.3 era; certainly it was part of CPython 2.4.

And, happily, that optimization is complimentary to mine. In my patch,
when the left side only has one reference and it's a "string
concatenation object", it will often have space left in its internal
array of strings objects. In that case I just append the string.
Speeds things up immensely.

Cheers,
/larry/

Oct 3 '06 #31

Istvan Albert

Larry Hastings wrote:

That patch addressed the same general problem, but took a completely
different approach. For the record, this patch (#980695) optimized "x

Larry,

Forgot to mention this, in case you haven't done so post your original
message/patch on the python-dev lists since that's where the decisions
are made. This group is more end-user oriented.

http://mail.python.org/pipermail/pyt...er/thread.html

i.

Oct 3 '06 #32

Terry Reedy

"Forgot to mention this, in case you haven't done so post your original

message/patch on the python-dev lists since that's where the decisions
are made. This group is more end-user oriented.

http://mail.python.org/pipermail/pyt...er/thread.html

It is often good to get community input first, as the OP has done. When he
wishes, the OP should submit his latest (not original) version of the patch
to SourceForge (not the list) and then optionally post a new message with a
link to the SF item for discussion.

tjr

Oct 3 '06 #33

Nicko

Larry Hastings wrote:

It's *slightly* slower for two:

def addTwoThings(a, b):
return a + b
for i in range(10000000):
x = addTwoThings("aaa", "bbb")

....

But starts paying off already, even with three:

def addThreeThings(a, b, c):
return a + b + c
for i in range(10000000):
x = addThreeThings("aaa", "bbb", "ccc")

I note that in both of those tests you didn't actually ever realise the
concatenated string. Can you give us figures for these tests having
forced the concatenated string to be computed?

Cheers,
Nicko

Oct 5 '06 #34

Larry Hastings

Nicko wrote:

I note that in both of those tests you didn't actually ever realise the
concatenated string. Can you give us figures for these tests having
forced the concatenated string to be computed?

Sure, good call. And bad news.

All these benchmarks were with functions taking N arguments that just
added (concatenated) the arguments and returned the result. I only ran
each benchmark once.

| Python 2.5 | Python 2.5
arguments | release | concat
-------------+--------------+--------------
2 | 7718ms | 9078ms
3 | 9203ms | 10500ms
4 | 10656ms | 11656ms
5 | 12156ms | 13390ms
20 | 29359ms | 26703ms

Interpret as you see fit,
/larry/

Oct 5 '06 #35

PATCH: Speed up direct string concatenation by 20+%!

Similar topics