
vector::insert performance tip.

Hi,
Trying to dump a set<int> (s) into a vector (v), my guess was that
v.insert(v.end(),s.begin(),s.end())
would be at least as efficient as
copy(s.begin(), s.end(), back_inserter(v))
It turns out that it is almost two times slower (even without using
reserve() before copy), and that's because it traverses the set twice.
I'm posting it as an exception to the usual (and usually correct)
advice
"Prefer member functions to algorithms".

Regards
Kostas

Apr 28 '07 #1
kostas wrote:
Trying to dump a set<int> (s) into a vector (v), my guess was that
v.insert(v.end(),s.begin(),s.end())
would be at least as efficient as
copy(s.begin(), s.end(), back_inserter(v))
It turns out that it is almost two times slower (even without using
reserve() before copy), and that's because it traverses the set twice.
I don't see why it would need to traverse the set twice. Any hint
to why it was happening?
I'm posting it as an exception to the usual (and usually correct)
advice
"Prefer member functions to algorithms".
I don't think the advice has anything to do with performance. The
usual (and always correct) advice WRT performance is "measure first
then try to optimize".

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Apr 28 '07 #2
On Apr 28, 10:49 pm, "Victor Bazarov" <v.Abaza...@comAcast.net> wrote:
I don't see why it would need to traverse the set twice. Any hint
to why it was happening?
The STL implementation I use needs first to calculate the distance
s.end() - s.begin() in order to reserve the appropriate space. Is this
not the standard (or the usual) implementation of vector::insert?
I don't think the advice has anything to do with performance.
With what, then?

Apr 28 '07 #3
On 28 Apr 2007 14:32:55 -0700, kostas <sk******@gmail.com> wrote:
>On Apr 28, 10:49 pm, "Victor Bazarov" <v.Abaza...@comAcast.net> wrote:
>I don't see why it would need to traverse the set twice. Any hint
to why it was happening?

The STL implementation I use needs first to calculate the
distance s.end() - s.begin() in order to reserve the appropriate
space. Is this not the standard (or the usual) implementation
of vector::insert?
It's probably the 'normal' implementation, which is faster for
random-access iterators but slower in some cases of bidirectional
iterators. I guess your number of elements is very small. Otherwise
the vector relocations in the copy(...,back_inserter(v)) variant would
be more time-consuming than the double traversal.
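As an illustrative aside (not a quote of any library): the cost of that
first counting pass depends entirely on the iterator category, since
std::distance is constant time for random-access iterators but walks every
node for bidirectional ones such as std::set's. A minimal sketch:

#include <cstddef>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

int main()
{
    std::vector<int> v;
    std::set<int> s;
    for (int i = 0; i < 1000000; ++i) { v.push_back(i); s.insert(i); }

    // O(1) for the vector's random-access iterators (pointer subtraction).
    std::ptrdiff_t nv = std::distance(v.begin(), v.end());

    // O(n) for the set's bidirectional iterators (increments through every node).
    std::ptrdiff_t ns = std::distance(s.begin(), s.end());

    std::cout << nv << ' ' << ns << '\n';
}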
--
Roland Pibinger
"The best software is simple, elegant, and full of drama" - Grady Booch
Apr 29 '07 #4
On Apr 29, 12:09 pm, rpbg...@yahoo.com (Roland Pibinger) wrote:
I guess your number of elements is very small.
Wrong guess. Actually, in order to measure the difference I used one
million ints and more.
Otherwise
the vector relocations in the copy(...,back_inserter(v)) variant would
be more time-consuming than the double traversal.
Both the vector expansion and the set traversal are amortized linear
operations in the number of items. It just happens that in this case
of built-in types the first coefficient is smaller than the other.
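For reference, a minimal compilable sketch of the comparison with one
million ints (the timing harness here is an illustrative assumption, not
the exact benchmark used):

#include <algorithm>
#include <ctime>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

int main()
{
    std::set<int> s;
    for (int i = 0; i < 1000000; ++i)
        s.insert(i);

    std::vector<int> v1, v2;

    std::clock_t t0 = std::clock();
    v1.insert(v1.end(), s.begin(), s.end());                // member range insert
    std::clock_t t1 = std::clock();
    std::copy(s.begin(), s.end(), std::back_inserter(v2));  // algorithm + back_inserter
    std::clock_t t2 = std::clock();

    std::cout << "insert: " << double(t1 - t0) / CLOCKS_PER_SEC << " s\n"
              << "copy:   " << double(t2 - t1) / CLOCKS_PER_SEC << " s\n";
}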
Apr 29 '07 #5
On Apr 28, 9:49 pm, "Victor Bazarov" <v.Abaza...@comAcast.net> wrote:
kostas wrote:
Trying to dump a set<int> (s) into a vector (v), my guess was that
v.insert(v.end(),s.begin(),s.end())
would be at least as efficient as
copy(s.begin(), s.end(), back_inserter(v))
It turns out that it is almost two times slower (even without using
reserve() before copy), and that's because it traverses the set twice.
I don't see why it would need to traverse the set twice. Any hint
to why it was happening?
It's probably an "optimization":-). If the number of elements
in the set is large, and the elements are expensive to copy,
it's faster to traverse the set a first time to calculate how
many elements are needed, ensure sufficient capacity, and then
traverse the set a second time to copy them.
I'm posting it as an exception to the usual (and usually correct)
advice
"Prefer member functions to algorithms".
I don't think the advice has anything to do with performance. The
usual (and always correct) advice WRT performance is "measure first
then try to optimize".
And this turns out to be a good example of why:-). Off hand, I
would have imagined that traversing the set twice would be
faster, because it avoids unnecessary allocations. Apparently,
I'd have guessed wrong. And apparently, the author of the STL
he's using guessed wrong as well.

(I suspect which form is faster will depend on any number of
external factors, but if the number of elements is large...
std::set has very poor locality, so you're likely to get some
paging, or at least cache misses, in each pass.)

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Apr 29 '07 #6
On Apr 28, 11:32 pm, kostas <skola...@gmail.com> wrote:
On Apr 28, 10:49 pm, "Victor Bazarov" <v.Abaza...@comAcast.net> wrote:
I don't see why it would need to traverse the set twice. Any hint
to why it was happening?
The STL implementation I use needs first to calculate
the distance s.end() - s.begin() in order to reserve the
appropriate space. Is this not the standard (or the usual)
implementation of vector::insert?
It's the required implementation: "Complexity: The constructor
template <class InputIterator> vector(InputIterator first,
InputIterator last) makes only N calls to the copy constructor
of T (where N is the distance between first and last) and no
reallocations if iterators first and last are of forward,
bidirectional, or random access categories." The apparent
assumption in the standard is that copies and allocations are
expensive, and iteration is cheap. Copying basic types,
however, is very, very cheap; modern allocators aren't that
expensive, and iterating over node based containers can be
extremely expensive due to their poor locality.
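To make that concrete, a simplified sketch of the kind of range append
this wording leads to (illustrative only, not any particular vendor's
code; range_append is a made-up name):

#include <iterator>
#include <vector>

// For forward/bidirectional iterators the implementation may make only N
// copies and no reallocations, so it has to know N before copying anything:
template <class T, class FwdIt>
void range_append(std::vector<T>& v, FwdIt first, FwdIt last)
{
    typename std::iterator_traits<FwdIt>::difference_type n =
        std::distance(first, last);   // pass 1: count the elements
    v.reserve(v.size() + n);          // one allocation, no relocations
    for (; first != last; ++first)    // pass 2: copy them
        v.push_back(*first);
}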

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Apr 29 '07 #7
On Apr 29, 11:09 am, rpbg...@yahoo.com (Roland Pibinger) wrote:
On 28 Apr 2007 14:32:55 -0700, kostas <skola...@gmail.com> wrote:
On Apr 28, 10:49 pm, "Victor Bazarov" <v.Abaza...@comAcast.net> wrote:
I don't see why it would need to traverse the set twice. Any hint
to why it was happening?
The STL implementation I use needs first to calculate the
distance s.end() - s.begin() in order to reserve the appropriate
space. Is this not the standard (or the usual) implementation
of vector::insert?
It's probably the 'normal' implementation, which is faster for
random-access iterators but slower in some cases of bidirectional
iterators. I guess your number of elements is very small. Otherwise
the vector relocations in the copy(...,back_inserter(v)) variant would
be more time-consuming than the double traversal.
I just did the test with 10 million members in the set. I get
similar results. I suspect that the lack of locality in node
based containers is the reason.
--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Apr 29 '07 #8
On Apr 29, 1:28 pm, kostas <skola...@gmail.com> wrote:

[...]
Both the vector expansion and the set traversal are amortized linear
operations in the number of items.
Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary between less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.
--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Apr 29 '07 #9
"James Kanze" <ja*********@gmail.com> wrote in message
news:11**********************@n59g2000hsh.googlegroups.com...
On Apr 29, 1:28 pm, kostas <skola...@gmail.com> wrote:

[...]
>Both the vector expansion and the set traversal are amortized linear
operations in the number of items.
>Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary between less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.
[...]

I'm not sure if I'm missing the point here, but for the purposes of
complexity analysis, does it actually matter how long each individual
instruction is taking? If we've got a constant (i.e. independent of the
number of items) upper bound for the time taken for an instruction (however
large), then an operation which requires a number of instructions that is
linear in terms of the problem size will run in linear time. For example, if
I've got an operation on n items that requires no more than (say) 2n
instructions, with each instruction taking at most k milliseconds, then the
maximum time taken on a problem of size n is 2nk milliseconds, which (for k
not a function of n) is linear in n. The actual value of k isn't relevant to
the complexity analysis, even if it's relevant to the person implementing
the code.

Regards,
Stu
Apr 30 '07 #10
On Apr 29, 9:22 pm, James Kanze <james.ka...@gmail.com> wrote:
I just did the test with 10 million members in the set. I get
similar results. I suspect that the lack of locality in node
based containers is the reason.
There are not only memory locality issues involved; there is also
algorithmic complexity. Set traversal is amortized linear, but not as
"linear" as list traversal, for example.

Apr 30 '07 #11
On Apr 30, 2:30 am, "Stuart Golodetz"
<stuart.golod...@NnOeSwP.AoMx.ac.uk> wrote:
"James Kanze" <james.ka...@gmail.com> wrote in message
news:11**********************@n59g2000hsh.googlegroups.com...
On Apr 29, 1:28 pm, kostas <skola...@gmail.com> wrote:
[...]
Both the vector expansion and the set traversal are amortized linear
operations in the number of items.
Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary between less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.
[...]
I'm not sure if I'm missing the point here, but for the purposes of
complexity analysis, does it actually matter how long each individual
instruction is taking?
You're missing the point that what the original poster measured
was execution time. My point is, precisely, that complexity has
little relationship to execution time. It's a useful indicator
as to whether something will scale, but even then, for example,
an O(n) algorithm will suddenly become much, much slower if page
faults occur, and for very large data sets, locality can become
as important a consideration as complexity.
If we've got a constant (i.e. independent of the number of
items) upper bound for the time taken for an instruction
(however large), then an operation which requires a number of
instructions that is linear in terms of the problem size will
run in linear time.
No. That's only true if the time taken for an instruction is
independent of the size of the data set. Paging and caching
introduce a non-linear component into the execution time of
instructions, which can depend on the size of the data set.
For example, if
I've got an operation on n items that requires no more than (say) 2n
instructions, with each instruction taking at most k milliseconds, then the
maximum time taken on a problem of size n is 2nk milliseconds, which (for k
not a function of n) is linear in n. The actual value of k isn't relevant to
the complexity analysis, even if it's relevant to the person implementing
the code.
Once k has any dependence on n, your complexity analysis no
longer makes reasonable predictions concerning the execution time.
Saying, for example, that ++i takes a maximum of 10
milliseconds, and calculating bounds from that, doesn't mean
much if most of the time it only takes 1 or 2 microseconds, and
if it will only take more than 5 milliseconds when the number of
elements exceeds, say, 10 million. You still do get asymptotic
performance with enough elements, but the number of elements
needed to get there is so high that it won't occur in actual
practice. For performance reasons, if nothing else; once you
start getting page faults, your performance will end up being
unacceptable.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Apr 30 '07 #12
On Apr 30, 8:57 am, kostas <skola...@gmail.com> wrote:
On Apr 29, 9:22 pm, James Kanze <james.ka...@gmail.com> wrote:
I just did the test with 10 million members in the set. I get
similar results. I suspect that the lack of locality in node
based containers is the reason.
There are not only memory locality issues involved; there is also
algorithmic complexity. Set traversal is amortized linear, but not as
"linear" as list traversal, for example.
Both can play a role. Typically allocators will allocate
successively allocated objects close to one another, which means
that locality when iterating over a set will be better if the
set was created by insertions in order. A quick test of 50
million elements on a Linux based PC shows that iterating over a
set where the elements were inserted in a random order takes
roughly 8 or 9 times longer than iterating over one where they
were inserted in the final order (and that iterating over an
std::list is considerably faster than either---about 3 or 4
times faster than the sorted set). This is probably due
partially to a simpler algorithm (i.e. the elementary operation
of iter++ is faster, even if everything is already in the
cache), and partially due to better locality: the node of a list
is smaller, so more of them fit into each cache line.
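For what it's worth, a sketch of that kind of traversal experiment (the
harness, element count, and shuffling scheme below are assumptions, not
the exact program behind the figures above):

#include <algorithm>
#include <ctime>
#include <iostream>
#include <list>
#include <set>
#include <vector>

template <class Container>
double time_traversal(const Container& c, long long& sum)
{
    std::clock_t t0 = std::clock();
    for (typename Container::const_iterator it = c.begin(); it != c.end(); ++it)
        sum += *it;                                   // touch every node
    return double(std::clock() - t0) / CLOCKS_PER_SEC;
}

int main()
{
    const int N = 50000000;   // 50 million, as above; scale down if memory is tight
    std::vector<int> values;
    for (int i = 0; i < N; ++i) values.push_back(i);

    std::set<int> ordered(values.begin(), values.end());    // nodes allocated in final order
    std::random_shuffle(values.begin(), values.end());
    std::set<int> shuffled(values.begin(), values.end());   // nodes allocated in random order
    std::list<int> lst(values.begin(), values.end());

    long long sum = 0;        // printed so the loops aren't optimized away
    std::cout << "ordered set:  " << time_traversal(ordered, sum)  << " s\n"
              << "shuffled set: " << time_traversal(shuffled, sum) << " s\n"
              << "list:         " << time_traversal(lst, sum)      << " s\n"
              << "(checksum: " << sum << ")\n";
}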

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Apr 30 '07 #13
James Kanze wrote:
On Apr 30, 2:30 am, "Stuart Golodetz"
<stuart.golod...@NnOeSwP.AoMx.ac.uk> wrote:
>"James Kanze" <james.ka...@gmail.com> wrote in message
>news:11**********************@n59g2000hsh.googlegroups.com...
On Apr 29, 1:28 pm, kostas <skola...@gmail.com> wrote:
> [...]
>>>Both the vector expansion and the set traversal are amortized linear
operations in the number of items.
Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary between less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.
>[...]
>I'm not sure if I'm missing the point here, but for the purposes of
complexity analysis, does it actually matter how long each individual
instruction is taking?

You're missing the point that what the original poster measured
was execution time. My point is, precisely, that complexity has
little relationship to execution time. It's a useful indicator
as to whether something will scale, but even then, for example,
an O(n) algorithm will suddenly become much, much slower if page
faults occur, and for very large data sets, locality can become
as important a consideration as complexity.
Agreed.
>If we've got a constant (i.e. independent of the number of
items) upper bound for the time taken for an instruction
(however large), then an operation which requires a number of
instructions that is linear in terms of the problem size will
run in linear time.

No. That's only true if the time taken for an instruction is
independent of the size of the data set.
I think that's what I said, actually, assuming we equate "number of
items" and "size of the data set".
Paging and caching
introduce a non-linear component into the execution time of
instructions, which can depend on the size of the data set.
Ah, this is the point which I missed. If the execution time for an
instruction is dependent on the problem size then it's a different
matter entirely. Thanks for clarifying :)
>For example, if
I've got an operation on n items that requires no more than (say) 2n
instructions, with each instruction taking at most k milliseconds, then the
maximum time taken on a problem of size n is 2nk milliseconds, which (for k
not a function of n) is linear in n. The actual value of k isn't relevant to
the complexity analysis, even if it's relevant to the person implementing
the code.

Once k has any dependence on n, your complexity analysis no
longer makes reasonable predictions concerning the execution time.
Saying, for example, that ++i takes a maximum of 10
milliseconds, and calculating bounds from that, doesn't mean
much if most of the time it only takes 1 or 2 microseconds, and
if it will only take more than 5 milliseconds when the number of
elements exceeds, say, 10 million. You still do get asymptotic
performance with enough elements, but the number of elements
needed to get there is so high that it won't occur in actual
practice. For performance reasons, if nothing else; once you
start getting page faults, your performance will end up being
unacceptable.
Once again, agreed. I think we seem to have a clear case of a discussion
heading along the lines of:

"If and only if X (about which I personally know little), then Y."
"Well, actually, not X."
"Ok, then we're agreed that not Y."

(In this particular instance, I think what I'm trying to say is that X
is "k has no dependence on n" and Y is "the complexity analysis given
above makes sense".)

No worries, anyway, was just curious what you were driving at :)

Regards,
Stu

[...]
Apr 30 '07 #14
On Apr 30, 5:14 pm, Stuart Golodetz
<stuart.golod...@nNeOwS.oPxA.aMc.uk> wrote:
James Kanze wrote:
On Apr 30, 2:30 am, "Stuart Golodetz"
<stuart.golod...@NnOeSwP.AoMx.ac.uk> wrote:
"James Kanze" <james.ka...@gmail.com> wrote in message
[...]
Paging and caching
introduce a non-linear component into the execution time of
instructions, which can depend on the size of the data set.
Ah, this is the point which I missed. If the execution time for an
instruction is dependent on the problem size then it's a different
matter entirely. Thanks for clarifying :)
It's not necessarily directly dependent on the problem size, but
if the time for an instruction can vary over several orders of
magnitude, depending on locality of access, it starts to mean that
locality of access can be more significant than complexity with
regard to execution time.

Strictly speaking, complexity still wins out in the end; at
some point, you end up getting page faults even with the fastest
algorithm, and the individual operations become enormously
expensive everywhere. In practice, however, we aren't concerned
with such cases, since when (and if) we get there, the program
will take forever anyway; the solution isn't to find a faster
algorithm, but to buy more main memory:-).

BTW: I wasn't so much disagreeing with what you said, but
putting it into the practical context which was being discussed.
--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 1 '07 #15
On Apr 28, 7:53 pm, kostas <skola...@gmail.com> wrote:
Hi,
Trying to dump a set<int> (s) into a vector (v), my guess was that
v.insert(v.end(),s.begin(),s.end())
would be at least as efficient as
copy(s.begin(), s.end(), back_inserter(v))
It turns out that it is almost two times slower (even without using
reserve() before copy), and that's because it traverses the set twice.
I'm posting it as an exception to the usual (and usually correct)
advice
"Prefer member functions to algorithms".
So, would it be a good idea to add an insert variant that, instead of
the last iterator of the [first, last) insertion range, takes as an
argument the number of steps from the first iterator? In the case of
back insertion it would not make any difference, but what if you
wanted to insert the set at v.begin()?
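A sketch of what such a count-based variant could look like as a free
function (insert_n is a hypothetical name, not part of the standard
library, and this version assumes T is default-constructible):

#include <vector>

// Hypothetical helper: insert n elements read from 'first' at index 'pos'.
// Knowing n up front lets the vector make room in a single step, and the
// source range is traversed only once.
template <class T, class InputIt>
void insert_n(std::vector<T>& v,
              typename std::vector<T>::size_type pos,
              InputIt first,
              typename std::vector<T>::size_type n)
{
    v.insert(v.begin() + pos, n, T());          // open a gap of n slots
    typename std::vector<T>::iterator out = v.begin() + pos;
    for (typename std::vector<T>::size_type i = 0; i != n; ++i, ++first, ++out)
        *out = *first;                          // single pass over the source
}

// Usage: dump the set at the front of the vector with one traversal of it:
// insert_n(v, 0, s.begin(), s.size());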

Regards
Kostas
May 1 '07 #16
