By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
446,194 Members | 884 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 446,194 IT Pros & Developers. It's quick & easy.

Memset is faster than simple loop?

P: n/a
Hi,

dose anybody here explain to me why memset would be faster than a
simple loop. I doubt about it!

In an int array scenario:

int array[10];

for(int i=0;i<10;i++) //ten loops
array[i]=0;

or

memset(array,0,sizeof(array));

So, what will memset do inside? Here is a snippet from MS c-run-time
codes:

void * __cdecl memset (
void *dst,
int val,
size_t count
)
{
void *start = dst;

#if defined (_M_MRX000) || defined (_M_ALPHA) || defined (_M_PPC) ||
defined (_M_IA64)
{
extern void RtlFillMemory( void *, size_t count, char );

RtlFillMemory( dst, count, (char)val );
}
#else /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */
while (count--) { //Watch
here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!
*(char *)dst = (char)val;
dst = (char *)dst + 1;
}
#endif /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */

return(start);
}

memset initializes a block of memory byte by byte. So, the while loop
will be executed 4*10 times!
I got confused. Why people still believe memset is faster than loop
and pertains most of scenarios.
Thank you for your help!

Mar 21 '07 #1
Share this Question
Share on Google+
23 Replies


P: n/a
An********@gmail.com wrote:
dose anybody here explain to me why memset would be faster than a
simple loop.
`memset` is provided by the implementation. It may use
lots of Cunning Implementation Tricks that -- for whatever
reason -- might not be applied by default to ordinary
C code.

Whether or not it's /actually/ faster will depend on all
sorts of things.
I doubt about it!
Doubt is good. I think. Well, I'm not sure.
In an int array scenario:

int array[10];

for(int i=0;i<10;i++) //ten loops
array[i]=0;

or

memset(array,0,sizeof(array));

So, what will memset do inside?
That depends.
Here is a snippet from MS c-run-time
codes:
(fx:snip #ifdef)
extern void RtlFillMemory( void *, size_t count, char );

RtlFillMemory( dst, count, (char)val );
(fx:snip #else)
while (count--) { //Watch
here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!
*(char *)dst = (char)val;
dst = (char *)dst + 1;
}
(fx:snip #endif)
memset initializes a block of memory byte by byte. So, the while loop
will be executed 4*10 times!
Urm, did you not notice the call to RtlFillMemory which happends
on suitable architectures?

Also you're /assuming/ that the compiler doesn't implement
`memset` with inline code, which it's allowed to. Perhaps
the original call
memset(array,0,sizeof(array));
was replaced by 10 move-integer instructions.
I got confused. Why people still believe memset is faster than loop
and pertains most of scenarios.
The essence of the truth here is measureument as opposed to
speculation.

--
Chris Dollin
RIP John "BNF, Fortran, FP" Backus 3Dec1924 - 17Mar2007

Mar 21 '07 #2

P: n/a
Chris Dollin wrote:
Also you're /assuming/ that the compiler doesn't implement
`memset` with inline code, which it's allowed to. Perhaps
the original call
>memset(array,0,sizeof(array));

was replaced by 10 move-integer instructions.
Make that 10 clear-integer instructions.

--
Chris "moving zero" Dollin
RIP John "BNF, Fortran, FP" Backus 3Dec1924 - 17Mar2007

Mar 21 '07 #3

P: n/a
An********@gmail.com wrote:
dose anybody here explain to me why memset would be faster than a
simple loop. I doubt about it!
In an int array scenario:
int array[10];
for(int i=0;i<10;i++) //ten loops
array[i]=0;
or
memset(array,0,sizeof(array));
So, what will memset do inside?
There's no guarantee that memset() is faster, it's probably just some
observation a number of people have made. And if it is faster it is
probably due to the implementation using some carefully tuned method
that exploits some features of the processor the program is running
on which the compiler may not be able to find when it compiles the loop.
But that doesn't has to be the case, it's a question of how good the
implementation of memset() is on the one hand and how god the compiler
is on the other hand (and the compiler could even be clever enough to
replace the loop by a single call of memset() and there goes all your
difference in speed;-).
Here is a snippet from MS c-run-time
codes:
void * __cdecl memset (
void *dst,
int val,
size_t count
)
{
void *start = dst;
#if defined (_M_MRX000) || defined (_M_ALPHA) || defined (_M_PPC) ||
defined (_M_IA64)
{
extern void RtlFillMemory( void *, size_t count, char );
RtlFillMemory( dst, count, (char)val );
}
#else /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */
while (count--) { //Watch
here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!
*(char *)dst = (char)val;
dst = (char *)dst + 1;
}
#endif /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */
return(start);
}
memset initializes a block of memory byte by byte. So, the while loop
will be executed 4*10 times!
That's just one of many implementations and you can't deduce any
general statement from looking at a certain one. And, as you will
notice when you have a close look, memset() seems to be implemented
in a different way depending on the architecture, so you can't
even say how memset() is implemented for "MS c-run-time" but only
for "MS c-run-time" on a certain architecture.
I got confused. Why people still believe memset is faster than loop
and pertains most of scenarios.
It isn't a question of believes. You have to carefully measure the
behaviour for the implementation and architecture you are using.

Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
Mar 21 '07 #4

P: n/a
Thank you for yr reply. Actually, I am working on measurement and I
have to agree the fact that you are right, memset is fater than loop
even in the example I wrote above.
Mar 21 '07 #5

P: n/a
An********@gmail.com wrote:
>
dose anybody here explain to me why memset would be faster than a
simple loop. I doubt about it!
It might be faster, it might be slower, or there
might be no difference. The C language Standard says
nothing about the speeds of language constructs or
library functions, and those speeds -- and relative
speeds -- will be different in different implementations
of C.
[...] Here is a snippet from MS c-run-time
codes: [...]
Compilers play many tricks in pursuit of speedier or
smaller code. One of the tricks often played with the
Standard functions (which the compiler can "know about")
is to generate in-line machine instructions instead of
generating an actual function call. What looks like a
call on the fabs() function might actually produce an
FABS instruction in the generated code, and no function
call at all.

memset() is a fairly simple function -- not like
printf(), say -- and many compilers will play this kind
of game with it. If you write what looks like a call
to memset(), the program might actually use a block-clear
instruction or instruction sequence instead of a call.

But even if the implementation plays this kind of
game with memset() or sqrt() or whatever, it must still
provide an actual, callable function that does the same
thing (possibly at a different speed). This enables a
program to use a function pointer to call a library
function, without necessarily being able to predict
what function will be called until run-time:

#include <string.h /* for memset */

void shortset(void *s, int c, size_t n) {
short *sp = s;
for (n /= sizeof(short); n 0; --n)
*sp++ = c;
}

void intset(void *s, int c, size_t n) {
int *ip = s;
for (n /= sizeof(int); n 0; --n)
*ip++ = c;
}

void (*fptr)(void*, int, size_t);
...
switch (something_unpredictable) {
default: fptr = memset; break;
case 1: fptr = shortset; break;
case 2: fptr = intset; break;
}
fptr(buffer, 42, sizeof buffer);

In short, the code you have found (MS disclosed
their source? Surprising, but not astonishing) may
be used only in oddball circumstances like the above,
while "ordinary" memset() calls wind up using another
mechanism altogether. You'll need to dig deeper.

... and you'll need to remember that all such
tricks and timings vary from one implementation to
the next; they aren't the province of the language
as such.

--
Eric Sosman
es*****@acm-dot-org.invalid
Mar 21 '07 #6

P: n/a
Eric Sosman wrote:
In short, the code you have found (MS disclosed
their source? Surprising, but not astonishing) may
The source code to MSVCRT.DLL (the C library)
is available as part of their SDK, which is free to download
from microsoft itself.

Mar 21 '07 #7

P: n/a
On Wed, 21 Mar 2007 10:42:11 -0500, An********@gmail.com wrote
(in article <11*********************@p15g2000hsd.googlegroups. com>):
Thank you for yr reply. Actually, I am working on measurement and I
have to agree the fact that you are right, memset is fater than loop
even in the example I wrote above.
Well, just keep in mind that what's true for you today, on a particular
system with a particular development environment installed, may not be
true anywhere else, or even on your own system 6 months from now. You
can not extrapolate "rules" for this sort of thing and actually know
anything at all about what will happen in general.

--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw

Mar 21 '07 #8

P: n/a
Racaille wrote:
>The source code to MSVCRT.DLL (the C library)
is available as part of their SDK, which is free to download
from microsoft itself.
You are also able to find the c run-time library source codes in your
VSStudio/vc_version/crt/src folder

Mar 21 '07 #9

P: n/a
An********@gmail.com wrote:
Thank you for yr reply. Actually, I am working on measurement and I
have to agree the fact that you are right, memset is fater than loop
even in the example I wrote above.
An observation: If the time spent in memset() is a
significant or even a noticeable fraction of your program's
running time, it is highly unlikely that the cure is to use
a faster memset() equivalent. (After all, memset() generates
almost no "new information," and does not "advance the state
of the computation" by very much.) Rather, the cure is to
consider what it is about the program that makes the memset()
necessary, and to rearrange things so it isn't. For example,

char buffer[BIGSIZE];
...
memset (buffer, 0, sizeof buffer);
if (arriving)
strcat (buffer, "Hello,");
else
strcat (buffer, "Goodbye, cruel");
strcat (buffer, " world!");

.... doesn't really need the full effect of memset(), but just
its effect on buffer[0]. With that in mind, the code can be

char buffer[BIGSIZE];
...
buffer[0] = 0;
if (arriving)
... etc ...

Still better:

char buffer[BIGSIZE];
...
if (arriving)
strcpy (buffer, "Hello,");
else
strcpy (buffer, "Goodbye, cruel");
strcat (buffer, " world!");

And, of course, there are lots more alternatives. My point is
that a very large fraction of memset() calls can be eliminated
by transformations not much more complex than these, and when
you get rid of an operation altogether you get an infinite
improvement in its speed. It is said that "the fastest I/O is
the one you don't do," and this generalizes to "the fastest X
is the one you don't do."

Just about the only time memset() speed is crucial is when
one needs to "destroy" information. For example, an O/S may want
to ensure that a new memory page given to Program X doesn't still
contain data left in it by Previous Program Y, and so uses memset()
or an equivalent to clobber Y's data. The speed of memset() in
this kind of setting can indeed be important -- but in most user
programs, memset() should be a negligible fraction of the running
time, and even an infinite speedup makes a negligible overall
improvement.

--
Eric Sosman
es*****@acm-dot-org.invalid
Mar 21 '07 #10

P: n/a
Eric Sosman,

I appreciate your replies. I have to say I partially agree with you:)
An observation: If the time spent in memset() is a
significant or even a noticeable fraction of your program's
running time, it is highly unlikely that the cure is to use
a faster memset() equivalent.
That's right! I agree:)
>(After all, memset() generates
almost no "new information," and does not "advance the state
of the computation" by very much.) Rather, the cure is to
consider what it is about the program that makes the memset()
necessary, and to rearrange things so it isn't.
It really depands on what you are gonna do. Let say, I have an
initialized integer array with ten thousand elements,all elements
inside are zero, and then I will fill out this array by user input,
actually I don't know how many elements will be assigned new values
other than 0.(assuming 0 is not valid from user input).

So, if I want to retrieve these values, I have to loop the array until
seeing the first zero element. In this scenario,"new information" is
also very important. If I can not initialize the array with 0 , no
expected result will happen next. memset() becomes crucial to my
program.
Just about the only time memset() speed is crucial is when
one needs to "destroy" information. For example, an O/S may want
to ensure that a new memory page given to Program X doesn't still
contain data left in it by Previous Program Y, and so uses memset()
or an equivalent to clobber Y's data. The speed of memset() in
this kind of setting can indeed be important -- but in most user
programs, memset() should be a negligible fraction of the running
time, and even an infinite speedup makes a negligible overall
improvement.
If the size of the memory block is very huge, it'll be a different
story,isn't it?
Mar 21 '07 #11

P: n/a
An********@gmail.com wrote:
Eric Sosman,
>[...]
(After all, memset() generates
almost no "new information," and does not "advance the state
of the computation" by very much.) Rather, the cure is to
consider what it is about the program that makes the memset()
necessary, and to rearrange things so it isn't.

It really depands on what you are gonna do. Let say, I have an
initialized integer array with ten thousand elements,all elements
inside are zero, and then I will fill out this array by user input,
actually I don't know how many elements will be assigned new values
other than 0.(assuming 0 is not valid from user input).

So, if I want to retrieve these values, I have to loop the array until
seeing the first zero element. In this scenario,"new information" is
also very important. If I can not initialize the array with 0 , no
expected result will happen next. memset() becomes crucial to my
program.
Probably not. Compared to the time taken by the input
itself, the time for memset() will likely be negligible. Ten
thousand elements, you say? Let's suppose we can get them all
with one disk read, which will take perhaps ten milliseconds.
In that ten milliseconds, your CPU's clock will tick ten to
thirty million times. With a dual-issue instruction pipeline
that's twenty to sixty million instruction opportunities, forty
to one-twenty with a quad-issue. ("Instruction opportunities"
are not necessarily "instructions executed" because the CPU will
eventually need to wait for memory to catch up to it, but you
get the idea: The speed difference between the CPU and the I/O
is about six or seven decimal orders of magnitude.)

"File system buffer cache," I hear you cry. Fine, but you've
told us this is some kind of sparse array, which means you can't
just read the data "in place." Instead, you need to read it into
some kind of buffer, then examine what you find in the buffer,
decide where it belongs in the big array, and deposit it there.
Do you think that will be faster or slower than the memset()?
And then you've got to scan the whole array afterwards, searching
out the data you deposited there -- faster than memset(), or
slower?

"But it's a *very* sparse array, and I'll only need to take
the decide-and-deposit step a few times, whereas memset() needs
to clear the entire thing." If the array is *that* sparse, it's
probably the wrong data structure: Stick the data on a linked
list, for example, and memset() will be entirely unnecessary.
If the size of the memory block is very huge, it'll be a different
story,isn't it?
Yes, clearing a large memory area to all-bytes-constant
will probably take a noticeable amount of time. My point is
that the clearing itself is often a waste, and could be avoided
altogether by rearranging the program.

An old friend of mine used to characterize this sort of thing
as a laudable effort to clean the beach of bottle caps and other
kinds of trash, to make the sand nice and neat around the whale
carcases.

--
Eric Sosman
es*****@acm-dot-org.invalid
Mar 21 '07 #12

P: n/a
Eric,

Now I understand you:)

Don't put too much energy on this sort of tiny things, try to figure
out how to get rid of the whale carcases. Don't use memset(),if not
necessary. If have to, memset() won't be ur concern on performance,
unless the purpose of the program is to test the speed of memset()
itself:)

Mar 21 '07 #13

P: n/a
An********@gmail.com wrote:
Don't put too much energy on this sort of tiny things, try to figure
out how to get rid of the whale carcases. Don't use memset(),if not
necessary. If have to, memset() won't be ur concern on performance,
unless the purpose of the program is to test the speed of memset()
itself:)
If I understood Eric correctly he wasn't advising against the use
of memset() but against worrying about possible speed differences
between using memset() and a loop. Actually there are arguments for
using memset() instead of a loop: it's shorter (just one line of
code instead of 2 or 3), no need for a loop variable and it might
be easier to understand what you're doing there (and it might even
be a bit faster;-). On the other hand you have to know where you
can't use memset() (e.g. for zeroing out an array of doubles or
pointers).
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
Mar 22 '07 #14

P: n/a
On Mar 21, 1:36 pm, Eric Sosman <esos...@acm-dot-org.invalidwrote:
[snip]
An old friend of mine used to characterize this sort of thing
as a laudable effort to clean the beach of bottle caps and other
kinds of trash, to make the sand nice and neat around the whale
carcases.
Too bad it needs context, or that would be a Jim-Dandy Sig.

Duff's device makes for a faster memset, except when it makes things
slower (Dik Winter provided an example IIRC).

Typically, library memsets use CPU specific assembly instructions to:
Fill in register wide words at a time,
then with modulus remainder -- fill in bytes.

Alternatively, some memset variants will use fancy-pants registers
like the MMX stuff or other CPU specific instuctions.

At any rate, trying to speed up memset() is a tempest in a teapot if
there ever was one.

Once again, I am forced to cough up Mike Lee's mantra:
"The number one rule of code optimization is: Don't do it.
The number two rule of code optimization (for experts only) is: Don't
do it yet."

IMO-YMMV.

Mar 22 '07 #15

P: n/a
Jens Thoms Toerring wrote:
An********@gmail.com wrote:
>Don't put too much energy on this sort of tiny things, try to figure
out how to get rid of the whale carcases. Don't use memset(),if not
necessary. If have to, memset() won't be ur concern on performance,
unless the purpose of the program is to test the speed of memset()
itself:)

If I understood Eric correctly he wasn't advising against the use
of memset() but against worrying about possible speed differences
between using memset() and a loop. Actually there are arguments for
using memset() instead of a loop: it's shorter (just one line of
code instead of 2 or 3), no need for a loop variable and it might
be easier to understand what you're doing there (and it might even
be a bit faster;-). On the other hand you have to know where you
can't use memset() (e.g. for zeroing out an array of doubles or
pointers).
Right: that's my argument, with the further point that if
you're doing so much memory-clearing that memset() becomes a
bottleneck, you're doing far too much memory-clearing and should
reconsider your design.

--
Eric Sosman
es*****@acm-dot-org.invalid
Mar 22 '07 #16

P: n/a
user923005 wrote:
[...]
Duff's device makes for a faster memset, except when it makes things
slower (Dik Winter provided an example IIRC).
FWIW, Duff's original Device was not memset-like.

http://en.wikipedia.org/wiki/Duff's_device
[...]
Once again, I am forced to cough up Mike Lee's mantra:
"The number one rule of code optimization is: Don't do it.
The number two rule of code optimization (for experts only) is: Don't
do it yet."
s/Mike Lee/Michael A. Jackson/

Since I harp on this theme at such nauseating length, some
people have accused me of being in favor of slow code. I reject
the accusation: I am in favor of optimization, but only when it
makes economic sense. If I spend five minutes on an optimization
that shaves a microsecond off a piece of code that will run one
million times, I have made a bad bargain. But if the code will
run a million times a minute twenty-four by seven, or if the
microsecond savings is the difference between keeping the video
streaming smoothly or suffering a glitch, that's another matter.

Like Theodore Roosevelt's Big Stick, optimization skills
should be carefully honed, well maintained, and rarely used.

--
Eric Sosman
es*****@acm-dot-org.invalid
Mar 22 '07 #17

P: n/a
On Mar 22, 6:02 am, Eric Sosman <esos...@acm-dot-org.invalidwrote:
user923005 wrote:
[...]
Duff's device makes for a faster memset, except when it makes things
slower (Dik Winter provided an example IIRC).

FWIW, Duff's original Device was not memset-like.

http://en.wikipedia.org/wiki/Duff's_device
[snip]
It's also a C-FAQ, but (of course) the idea is used to unroll loops
and (ostensibly) make them faster.
20.35: What is "Duff's Device"?

A: It's a devastatingly deviously unrolled byte-copying loop,
devised by Tom Duff while he was at Lucasfilm. In its "classic"
form, it looks like:

register n = (count + 7) / 8; /* count 0 assumed */
switch (count % 8)
{
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n 0);
}

where count bytes are to be copied from the array pointed to by
from to the memory location pointed to by to (which is a memory-
mapped device output register, which is why to isn't
incremented). It solves the problem of handling the leftover
bytes (when count isn't a multiple of 8) by interleaving a
switch statement with the loop which copies bytes 8 at a time.
(Believe it or not, it *is* legal to have case labels buried
within blocks nested in a switch statement like this. In his
announcement of the technique to C's developers and the world,
Duff noted that C's switch syntax, in particular its "fall
through" behavior, had long been controversial, and that "This
code forms some sort of argument in that debate, but I'm not
sure whether it's for or against.")

But appearances can be deceiving. Despite that wonderfully arcane
switch goo, it might not be faster (and could even be slower).

Most optimizing compilers can unroll loops automatically anyway.

Mar 22 '07 #18

P: n/a
On Thu, 22 Mar 2007 14:06:33 -0500, user923005 wrote
(in article <11**********************@b75g2000hsg.googlegroups .com>):

[duff's device]
But appearances can be deceiving. Despite that wonderfully arcane
switch goo, it might not be faster (and could even be slower).

Most optimizing compilers can unroll loops automatically anyway.
It turns out that isn't always a true. There was an interesting
article on this in DDJ in the last year or two that goes into some
depth on it and some variations on the same theme.

--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw

Mar 22 '07 #19

P: n/a
On 21 Mar 2007 11:08:54 -0700, An********@gmail.com wrote:
>You are also able to find the c run-time library source codes in your
VSStudio/vc_version/crt/src folder
I've checked my debian system all over and I can't find any such directory.
--
#include <standard.disclaimer>
_
Kevin D Quitt USA 91387-4454 96.37% of all statistics are made up
Mar 28 '07 #20

P: n/a
Kevin D. Quitt <KQ****@IEEInc.comwrites:
On 21 Mar 2007 11:08:54 -0700, An********@gmail.com wrote:
>>You are also able to find the c run-time library source codes in your
VSStudio/vc_version/crt/src folder

I've checked my debian system all over and I can't find any such directory.
<OT><SILLY>
Well, there's your problem; you need to look for a folder, not a
directory.
</SILLY></OT>

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Mar 28 '07 #21

P: n/a
Kevin D. Quitt wrote:
On 21 Mar 2007 11:08:54 -0700, An********@gmail.com wrote:
>You are also able to find the c run-time library source codes in your
VSStudio/vc_version/crt/src folder

I've checked my debian system all over and I can't find any such directory.

I think you will find most of your headers in /usr/include

Kind Regards,
Anthony Irwin
Mar 28 '07 #22

P: n/a
Anthony Irwin wrote, On 28/03/07 05:38:
Kevin D. Quitt wrote:
>On 21 Mar 2007 11:08:54 -0700, An********@gmail.com wrote:
>>You are also able to find the c run-time library source codes in your
VSStudio/vc_version/crt/src folder

I've checked my debian system all over and I can't find any such
directory.

I think you will find most of your headers in /usr/include
I think you missed at least two points.

1) It is the source for the libraries not the headers being talked about.

2) Saying that the sources are in a specific directory is wrong in
general, because it depends on lots of things including the OS. The
same, of course, applies to headers, including that in both cases they
may not be available as text files.
--
Flash Gordon
Mar 28 '07 #23

P: n/a
An********@gmail.com wrote:
Hi,

dose anybody here explain to me why memset would be faster than a
simple loop. I doubt about it!

This is a compiler QoI issue, and has nothing to do about the C
language. Also, it depend on which hardware you are running the compiled
program on. Speedup tricks working for one CPU, might change when you
run your program on a different model of the same CPU family.

How fast a memcpy() implementation execute, may also depend of what kind
of load/store you want to do, e.g.

L1 to L1
L1 to L2
L1 to main memory
etc.

first speedup rule, is to make sure data is aligned, before moving big
chunks of data. Then there are possibilities with pre-warming the cache,
loop-unrolling etc.

When the Pentium II came out, I wrote a memcpy() replacement in C, which
used all the tricks I knew, when I checked it vs the asm implementation
by Microsoft, my replacement ran 2x faster.

According to Intel Manuals, really fast memory transfers was possible
via the MMX registers, but I didn't check that out, since it would
require inline assembler.

Of course, when I measured my memcpy() implementation a couple of years
later on a P3, the latest Microsoft version in the standard library ran
2x faster!!! :-)
--
Tor
Mar 28 '07 #24

This discussion thread is closed

Replies have been disabled for this discussion.