
Why large performance degradation?

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

Thanks in advance!

array_copy_dumb.c:

/* array_copy_dumb.c */
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

int main(void)
{
    char source[65536], destination[65536];
    int i, j;

    for (j = 0; j < 100; ++j)
        DUMBCOPY; /* 1 */

    return 0;
}

array_copy_smart.c:

/* array_copy_smart.c */
#include <string.h>

#define SMARTCOPY memcpy(destination, source, 65536)

int main(void)
{
    char source[65536], destination[65536];
    int i, j;

    for (j = 0; j < 100; ++j)
        SMARTCOPY;

    return 0;
}

$ gcc -O3 array_copy_dumb.c -o array_copy_dumb
$ time array_copy_dumb
real 4.99
user 4.86
sys 0.00
$ gcc -O3 array_copy_smart.c -o array_copy_smart
$ time array_copy_smart
real 0.15
user 0.15
sys 0.00

I just know it's due to cache! But would anyone explain it in detail?

Thanks again!
Nov 13 '05 #1

"Kevin Wan" <jf***@vip.sina.com> wrote in message
news:5c**************************@posting.google.com...
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

[code and timings snipped]

Any decent C library will have a highly optimized version of memcpy. Typically
it will be implemented in assembly language.
At the very least it will copy 4 bytes at a time where possible. Some
implementations copy 8 bytes at a time using floating-point instructions,
others take advantage of SIMD and copy 16 bytes at a time, and some play
tricks with cache lines.
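
As a rough illustration of the word-at-a-time idea (a sketch only, not the
actual library code; it assumes both buffers are suitably aligned and the
size is an exact multiple of 4 bytes):

#include <stddef.h>
#include <stdint.h>

/* Sketch only: copy n bytes as 32-bit words. Assumes dst and src are
   4-byte aligned and n is a multiple of sizeof(uint32_t). */
void copy_words(void *dst, const void *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t words = n / sizeof(uint32_t);

    while (words--)
        *d++ = *s++;        /* one 4-byte load and store per iteration */
}

A real memcpy also has to handle misaligned buffers and leftover bytes at
the end, which is part of why it is usually written in assembly.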

Your program has undefined behavior as you copy uninitialized data.

Carsten Hansen
Nov 13 '05 #2
Kevin Wan wrote:
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

[code and timings snipped]

I just know it's due to cache! But would anyone explain it in detail?

If you know so much, but aren't willing to read up on it or to tell us
anything about your architecture, why are you asking?
But, do consider that memcpy() surely performs this 4, 8, 16 or more bytes
at a time, as appropriate to the architecture for which it was built.
--
Tim Prince
Nov 13 '05 #3
Tim Prince wrote:
But, do consider that memcpy() surely performs this 4, 8, 16 or more
bytes at a time, as appropriate to the architecture for which it was
built.


Apart from that, what is faster on your particular implementation may not be
faster on another. Some reasons:

- memcpy may not use array indexing but increment pointers instead, avoiding
the extra address calculation for source and destination on every iteration.
You could try this in your own code and see what happens (see the sketch
after this list).

- memcpy may copy from back to front; on some systems a comparison with 0 is
faster than a comparison with an arbitrary number. That is only a very minor
gain, but over 65536 comparisons it may help.

- memcpy may make direct use of registers, whereas in your code the compiler
may not keep i in a register, causing extra traffic to and from memory.

- memcpy may use movsb, movsw, movsd (Intel only) or similar assembly
instructions.
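
As a concrete example of the first two points, here is a pointer-based,
count-down variant of the dumb loop. It is only a sketch to experiment with
(the ptr_copy name is hypothetical), and it is not guaranteed to be faster
on any particular machine:

/* Pointer-based, count-down version of the dumb copy (sketch only).
   Counting down means the loop test is a compare with zero. */
void ptr_copy(char *destination, const char *source)
{
    char *d = destination;
    const char *s = source;
    int k;

    for (k = 65536; k != 0; --k)
        *d++ = *s++;
}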

Hope this helped,

--
Martijn Haak
http://www.serenceconcepts.nl
Nov 13 '05 #4
On Sun, 27 Jul 2003 20:38:06 -0700, Kevin Wan wrote:
Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

65536 copy operations.

#define SMARTCOPY memcpy(destination, source, 65536)

1 call to memcpy that has probably been optimized.

You'll have to look at the implementation of memcpy to see what it does
differently.

hth
NPV
Nov 13 '05 #5
In <5c**************************@posting.google.com> jf***@vip.sina.com (Kevin Wan) writes:
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

#define SMARTCOPY memcpy(destination, source, 65536)


There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has put some effort into recognising it as a dumb memcpy and
replacing it with a smart memcpy).

The first thing to do is to replace the byte-by-byte copying by a
register-by-register copying. The next thing is to play tricks with the
cache, so that the input data is already in the cache by the time you
need it. There may be even more processor-specific tricks for
accelerating the memory accesses when performed in a purely sequential
way. The processor may even have a specialised instruction for performing
this operation as fast as possible. To exploit all these things, memcpy
is typically implemented in assembly.
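
To illustrate the cache idea only, here is a sketch, not how any particular
memcpy is written. It uses the GCC-specific __builtin_prefetch and assumes
aligned buffers and a size that is a multiple of sizeof(long); the name
copy_prefetch is hypothetical:

#include <stddef.h>

/* Sketch: copy long-sized chunks while asking the hardware to prefetch
   source data a couple of cache lines ahead (GCC builtin). */
void copy_prefetch(char *dst, const char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i += sizeof(long)) {
        __builtin_prefetch(src + i + 256, 0, 0);  /* read, low temporal locality */
        *(long *)(dst + i) = *(const long *)(src + i);
    }
}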

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #6
In article <bg**********@sunnews.cern.ch>, Da*****@cern.ch says...
There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has put some effort into recognising it as a dumb memcpy and
replacing it with a smart memcpy).


I had a situation not too long ago where memcmp() on MSVC 6.0 was a
huge performance bottleneck. gcc seems to be smart enough to use
block compare instructions on appropriate Intel CPUs, whereas the MS
compiler was not. This isn't normally important, but this app did
some very large block memcmp()'s, and it became a problem. Also,
knowing that the size of the blocks being compared would always
be evenly divisible by 4 allowed it to really be optimized on Intel.

Rewriting it as an inline assembly version for WIN32 resulted in a 30%
performance improvement. Linux/gcc code compiled from the same source
(minus the memcmp() replacement) ran with basically identical performance
with no manual intervention.
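
A portable (non-assembly) sketch of the same idea, comparing word-sized
chunks when the block size is known to be a multiple of 4, could look like
the hypothetical helper below. The actual fix was inline assembly; this only
shows the principle, and unlike memcmp it does not report ordering:

#include <stddef.h>
#include <stdint.h>

/* Sketch only: returns nonzero if the blocks differ. Assumes both
   pointers are 4-byte aligned and n is a multiple of sizeof(uint32_t). */
int blocks_differ(const void *a, const void *b, size_t n)
{
    const uint32_t *pa = a;
    const uint32_t *pb = b;
    size_t words = n / sizeof(uint32_t);

    while (words--)
        if (*pa++ != *pb++)
            return 1;
    return 0;
}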
Nov 13 '05 #7
In article <m3************@eeyore.valparaiso.cl>, vo******@inf.utfsm.cl
says...
In any case, the moral of the story is "Don't do it yourself if the
language gives you a way of doing it".


Usually. Not always. But there isn't much point in not following the
above unless you have performance measurements showing that a particular
library routine is a bottleneck. Only in those cases is it worth trying
to find a better method.

Nov 13 '05 #8
