
Why large performance degradation?

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

Thanks in advance!

array_copy_dumb.c:

/* array_copy_dumb.c */
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

int main(void)
{
    char source[65536], destination[65536];
    int i, j;

    for (j = 0; j < 100; ++j)
        DUMBCOPY; /* 1 */

    return 0;
}

array_copy_smart.c:

/* array_copy_smart.c */
#include <string.h>

#define SMARTCOPY memcpy(destination, source, 65536)

int main(void)
{
    char source[65536], destination[65536];
    int i, j;

    for (j = 0; j < 100; ++j)
        SMARTCOPY;

    return 0;
}

$ gcc -O3 array_copy_dumb.c -o array_copy_dumb
$ time array_copy_dumb
real 4.99
user 4.86
sys 0.00
$ gcc -O3 array_copy_smart.c -o array_copy_smart
$ time array_copy_smart
real 0.15
user 0.15
sys 0.00

I just know it's due to cache! But would anyone explain it in detail?

Thanks again!
Nov 13 '05 #1

"Kevin Wan" <jf***@vip.sina.com> wrote in message
news:5c**************************@posting.google.com...
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

[code and timings snipped]

Any decent C library will have a highly optimized version of memcpy. Typically
it will be implemented in assembly language.
At the very least it will copy 4 bytes at a time where possible. Some
implementations copy 8 bytes at a time using floating-point instructions,
others take advantage of SIMD and copy 16 bytes at a time, and some play
tricks with cache lines.
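
As a rough illustration of the word-at-a-time idea (a sketch only, not the
actual library code; it assumes both buffers are suitably aligned and the
size is an exact multiple of 4 bytes):

#include <stddef.h>
#include <stdint.h>

/* Sketch only: copy n bytes as 32-bit words. Assumes dst and src are
   4-byte aligned and n is a multiple of sizeof(uint32_t). */
void copy_words(void *dst, const void *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t words = n / sizeof(uint32_t);

    while (words--)
        *d++ = *s++;        /* one 4-byte load and store per iteration */
}

A real memcpy also has to handle misaligned buffers and leftover bytes at
the end, which is part of why it is usually written in assembly.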

Your program has undefined behavior as you copy uninitialized data.

Carsten Hansen
Nov 13 '05 #2
Kevin Wan wrote:
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

[code and timings snipped]

I just know it's due to cache! But would anyone explain it in detail?

If you know so much, but aren't willing to read up on it or to tell us
anything about your architecture, why are you asking?
But, do consider that memcpy() surely performs this 4, 8, 16 or more bytes
at a time, as appropriate to the architecture for which it was built.
--
Tim Prince
Nov 13 '05 #3
Tim Prince wrote:
But, do consider that memcpy() surely performs this 4, 8, 16 or more
bytes at a time, as appropriate to the architecture for which it was
built.


Apart from that, what is faster on your particular implementation may not be
faster on another. Some reasons:

- memcpy may not use array indexing but increment pointers instead, avoiding
the extra address calculation for source and destination on every iteration.
You could try this in your own code and see what happens (see the sketch
after this list).

- memcpy may copy from back to front; on some systems a comparison with 0 is
faster than a comparison with an arbitrary number. That is only a very minor
gain, but over 65536 comparisons it may help.

- memcpy may make direct use of registers, whereas in your code the compiler
may not keep i in a register, causing extra traffic to and from memory.

- memcpy may use movsb, movsw, movsd (Intel only) or similar assembly
instructions.
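
As a concrete example of the first two points, here is a pointer-based,
count-down variant of the dumb loop. It is only a sketch to experiment with
(the ptr_copy name is hypothetical), and it is not guaranteed to be faster
on any particular machine:

/* Pointer-based, count-down version of the dumb copy (sketch only).
   Counting down means the loop test is a compare with zero. */
void ptr_copy(char *destination, const char *source)
{
    char *d = destination;
    const char *s = source;
    int k;

    for (k = 65536; k != 0; --k)
        *d++ = *s++;
}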

Hope this helped,

--
Martijn Haak
http://www.serenceconcepts.nl
Nov 13 '05 #4
On Sun, 27 Jul 2003 20:38:06 -0700, Kevin Wan wrote:
Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

65536 copy operations.

#define SMARTCOPY memcpy(destination, source, 65536)

1 call to memcpy that has probably been optimized.

You'll have to look at the implementation of memcpy to see what it does
differently.

hth
NPV
Nov 13 '05 #5
In <5c**************************@posting.google.com> jf***@vip.sina.com (Kevin Wan) writes:
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

#define SMARTCOPY memcpy(destination, source, 65536)


There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has put some effort into recognising it as a dumb memcpy and
replacing it with a smart memcpy).

The first thing to do is to replace the byte-by-byte copying by a
register-by-register copying. The next thing is to play tricks with the
cache, so that the input data is already in the cache by the time you
need it. There may be even more processor-specific tricks for
accelerating the memory accesses when performed in a purely sequential
way. The processor may even have a specialised instruction for performing
this operation as fast as possible. To exploit all these things, memcpy
is typically implemented in assembly.
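
To illustrate the cache idea only, here is a sketch, not how any particular
memcpy is written. It uses the GCC-specific __builtin_prefetch and assumes
aligned buffers and a size that is a multiple of sizeof(long); the name
copy_prefetch is hypothetical:

#include <stddef.h>

/* Sketch: copy long-sized chunks while asking the hardware to prefetch
   source data a couple of cache lines ahead (GCC builtin). */
void copy_prefetch(char *dst, const char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i += sizeof(long)) {
        __builtin_prefetch(src + i + 256, 0, 0);  /* read, low temporal locality */
        *(long *)(dst + i) = *(const long *)(src + i);
    }
}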

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #6
In article <bg**********@sunnews.cern.ch>, Da*****@cern.ch says...
There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has put some effort into recognising it as a dumb memcpy and
replacing it with a smart memcpy).


I had a situation not too long ago where memcmp() on MSVC 6.0 was a
huge performance bottleneck. gcc seems to be smart enough to use
block compare instructions on appropriate Intel CPUs, whereas the MS
compiler was not. This isn't normally important, but this app did
some very large block memcmp()'s, and it became a problem. Also,
knowing that the size of the blocks being compared would always
be evenly divisible by 4 allowed it to really be optimized on Intel.

Rewriting it as an inline assembly version for WIN32 resulted in a 30%
performance improvement. Linux/gcc code compiled from the same source
(minus the memcmp() replacement) ran with basically identical performance
with no manual intervention.
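
A portable (non-assembly) sketch of the same idea, comparing word-sized
chunks when the block size is known to be a multiple of 4, could look like
the hypothetical helper below. The actual fix was inline assembly; this only
shows the principle, and unlike memcmp it does not report ordering:

#include <stddef.h>
#include <stdint.h>

/* Sketch only: returns nonzero if the blocks differ. Assumes both
   pointers are 4-byte aligned and n is a multiple of sizeof(uint32_t). */
int blocks_differ(const void *a, const void *b, size_t n)
{
    const uint32_t *pa = a;
    const uint32_t *pb = b;
    size_t words = n / sizeof(uint32_t);

    while (words--)
        if (*pa++ != *pb++)
            return 1;
    return 0;
}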
Nov 13 '05 #7
In article <m3************@eeyore.valparaiso.cl>, vo******@inf.utfsm.cl
says...
In any case, the moral of the story is "Don't do it yourself if the
language gives you a way of doing it".


Usually. Not always. But there isn't much point in not following the
above unless you have performance measurements showing that a particular
library routine is a bottleneck. Only in those cases is it worth trying
to find a better method.

Nov 13 '05 #8
