Why large performance degradation?

Kevin Wan

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

Thanks in advance!

array_copy_dumb .c:

/* array_copy_dumb .c */
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] =
source[i]

int main(void)
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; ++j)
DUMBCOPY; /* 1 */

return 0;
}

array_copy_smar t.c:

/* array_copy_smar t.c */
#include <string.h>

#define SMARTCOPY memcpy(destinat ion, source, 65536)

int main(void)
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; ++j)
SMARTCOPY;

return 0;
}

$ gcc -O3 array_copy_dumb .c -o array_copy_dumb
$ time array_copy_dumb
real 4.99
user 4.86
sys 0.00
$ gcc -O3 array_copy_smar t.c -o array_copy_smar t
$ time array_copy_smar t
real 0.15
user 0.15
sys 0.00

I just know it's due to cache! But would anyone explain it in detail?

Thanks again!

Nov 13 '05 #1

Subscribe Reply

3242

Carsten Hansen

"Kevin Wan" <jf***@vip.sina .com> wrote in message
news:5c******** *************** ***@posting.goo gle.com...

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

Thanks in advance!

array_copy_dumb .c:

/* array_copy_dumb .c */
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] =
source[i]

int main(void)
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; ++j)
DUMBCOPY; /* 1 */

return 0;
}

array_copy_smar t.c:

/* array_copy_smar t.c */
#include <string.h>

#define SMARTCOPY memcpy(destinat ion, source, 65536)

int main(void)
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; ++j)
SMARTCOPY;

return 0;
}

$ gcc -O3 array_copy_dumb .c -o array_copy_dumb
$ time array_copy_dumb
real 4.99
user 4.86
sys 0.00
$ gcc -O3 array_copy_smar t.c -o array_copy_smar t
$ time array_copy_smar t
real 0.15
user 0.15
sys 0.00

I just know it's due to cache! But would anyone explain it in detail?

Thanks again!

Any decent C library will have a highly optimized version memcpy. Typically
it will be implemented in assembly language.
At least it will do copying of 4 bytes at a time if possible. Some may do 8
bytes at a time using floating-point instructions.
Others may take advantage of SIMD and do copying of 16 bytes at a time. Some
may play with cache lines.

Your program has undefined behavior as you copy uninitialized data.

Carsten Hansen

Nov 13 '05 #2

Tim Prince

Kevin Wan wrote:

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

Thanks in advance!

array_copy_dumb .c:

/* array_copy_dumb .c */
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] =
source[i]

int main(void)
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; ++j)
DUMBCOPY; /* 1 */

return 0;
}

array_copy_smar t.c:

/* array_copy_smar t.c */
#include <string.h>

#define SMARTCOPY memcpy(destinat ion, source, 65536)

int main(void)
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; ++j)
SMARTCOPY;

return 0;
}

$ gcc -O3 array_copy_dumb .c -o array_copy_dumb
$ time array_copy_dumb
real 4.99
user 4.86
sys 0.00
$ gcc -O3 array_copy_smar t.c -o array_copy_smar t
$ time array_copy_smar t
real 0.15
user 0.15
sys 0.00

I just know it's due to cache! But would anyone explain it in detail?

If you know so much, and aren't willing to read up on it, nor to tell
anything about the architecture, why are you asking?
But, do consider that memcpy() surely performs this 4, 8, 16 or more bytes
at a time, as appropriate to the architecture for which it was built.
--
Tim Prince

Nov 13 '05 #3

Martijn

Tim Prince wrote:

But, do consider that memcpy() surely performs this 4, 8, 16 or more
bytes at a time, as appropriate to the architecture for which it was
built.

Apart from that, it may be faster on your particular implementation, it may
not be faster on another. Some reasons:

- memcpy may not be using array-indexing, but incrementing a pointer instead
(hence some extra memory access and calculations to get the memory address
of the source and destination) - you could do this to your own code and see
what happens

- memcpy may copy from back to front (on some systems, a comparison with 0
is faster than with an arbitrary number - only a very minor gain but in
65536 comparisons it may help)

- memcpy may make direct use of registers, and in your case i may not be
translated by the compiler to be a register (hence some extra copying to and
from the memory)

- memcpy may use movsb, movsw, movsd (intel only) or similar assembly
instructions

Hope this helped,

--
Martijn Haak
http://www.serenceconcepts.nl

Nov 13 '05 #4

Nils Petter Vaskinn

On Sun, 27 Jul 2003 20:38:06 -0700, Kevin Wan wrote:

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = 65536 copy operations

#define SMARTCOPY memcpy(destinat ion, source, 65536)

1 call to memcpy that has probably been optimized
You'll have to look at the implementation of memcpy to see what it does
differently.

hth
NPV

Nov 13 '05 #5

Dan Pop

In <5c************ **************@ posting.google. com> jf***@vip.sina. com (Kevin Wan) writes:

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] =

#define SMARTCOPY memcpy(destinat ion, source, 65536)

There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has spent some effort into recognising it as a dumb memcpy and
replaced it by a smart memcpy).

The first thing to do is to replace the byte-by-byte copying by a
register-by-register copying. The next thing is to play tricks with the
cache, so that the input data is already in the cache by the time you
need it. There may be even more processor-specific tricks for
accelerating the memory accesses when performed in a purely sequential
way. The processor may even have a specialised instruction for performing
this operation as fast as possible. To exploit all these things, memcpy
is typically implemented in assembly.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 13 '05 #6

Randy Howard

In article <bg**********@s unnews.cern.ch> , Da*****@cern.ch says...

There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has spent some effort into recognising it as a dumb memcpy and
replaced it by a smart memcpy).

I had a situation not too long ago where memcmp() on MSVC 6.0 was a
huge performance bottleneck. gcc seems to be smart enough to use
block compare instructions for appropriate Intel CPUs, where the MS
compiler was not. This isn't normally important, but this app did
some very large block memcmp()'s, and it became a problem. Also
knowing that that the size of the blocks being compared would always
be evenly divisible by 4 allowed it to really be optimized on Intel.

Rewriting an inline assembly version for WIN32 resulted in a 30%
performance improvement. Linux/gcc code compiled from the same source
(minus the memcmp() replacement) ran w/basically identical performance
with no manual intervention.

Nov 13 '05 #7

Randy Howard

In article <m3************ @eeyore.valpara iso.cl>, vo******@inf.ut fsm.cl
says...

In any case, the morals of the story is "Don't do it yourself if the
language gives you a way of doing it".

Usually. Not always. But, there isn't much point in not following the
above, unless you have performance measurements that show a particular
library routine is a bottleneck. In those cases, worry about trying
to find a better method.

Nov 13 '05 #8

Similar topics

1932

Small dynamic objects and performance degradation.

by: Jason Heyes | last post by:

What can I do to circumvent the performance degradation associated with dynamic allocation and small objects? Thanks.

C / C++

1193

FYI: a weird performance problem with MSDE and possibly SQL Server 2000

by: Andrew Mayo | last post by:

This problem was discovered with MSDE2000 SP2 and under WinXP SP2. We are unsure whether it is more widespread as it has only been seen on one machine to date. The problem is related to name resolution. If you attempt to connect to a local database with a connect string using server=. rather than

Microsoft SQL Server

2323

Database performance degrading (again)

by: teedilo | last post by:

We have an application with a SQL Server 2000 back end that is fairly database intensive -- lots of fairly frequent queries, inserts, updates -- the gamut. The application does not make use of performance hogs like cursors, but I know there are lots of ways the application could be made more efficient database-wise. The server code is running VB6 of all things, using COM+ database interfaces. There are some clustered and non-clustered...

Microsoft SQL Server

1707

GIL, threads and scheduling - performance cost

by: adsheehan | last post by:

Hi all, Wondering if a GIL lock/unlock causes a re-schedule/contect swap when embedding Python in a multi-threaded C/C++ app on Unix ? If so, do I have any control or influence on this re-scheduling ? The app suffers from serious performance degradation (compared to pure c/C++) and high context switches that I suspect the GIL unlocking may be aggravating ?

Python

3529

dramatical performance degradation on dynamic sql over night

by: florian | last post by:

Hello, we are running DB2 UDB EEE Version 7.2 Fixpack 12 on a two machine Windows 2000 Advanced Server Cluster in a dss environment. Some dynamic sql statements for etl processes and even some for online user queries switched overnight from some minutes runtime to a few hours or "never come back" statements. We didn't change a single database or instance parameter. Even the

DB2 Database

3351

performance of IN (subquery)

by: Kevin Murphy | last post by:

I'm using PG 7.4.3 on Mac OS X. I am disappointed with the performance of queries like 'select foo from bar where baz in (subquery)', or updates like 'update bar set foo = 2 where baz in (subquery)'. PG always seems to want to do a sequential scan of the bar table. I wish there were a way of telling PG, "use the index on baz in your plan, because I know that the subquery will return very few results". Where it really matters, I have...

PostgreSQL Database

3522

simple code performance question

by: galiorenye | last post by:

Hi, Given this code: A** ppA = new A*; A *pA = NULL; for(int i = 0; i < 10; ++i) { pA = ppA; //do something with pA

C / C++

2110

how can i large divide?

by: jty0734 | last post by:

i wanna large divide. input is 1~1024 bit integer. first, i think using sub is solution. so i made two array. and sub divisor to dividend. but this situation happen.

C / C++

2540

how to make large macro paste the code as it is

by: sh.vipin | last post by:

how to make large macro paste the code as it is Problem Explanation '-- For example in the program below /* a.c - starts here */ #define DECL_VARS() \ unsigned int a0;\ unsigned int a1;\ unsigned int a2;\

C / C++

8675

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9160

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

9029

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

8897

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

8862

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

7729

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

5860

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4370

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

4619

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET