Bytes IT Community

Cache

Hi,
I would like to know which high-level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler; I just
want to know which things might be considered to help a good compiler
"understand" my code better. Some examples or links would be very
helpful.

Thanks in advance,
Thomas Kowalski

Sep 25 '06 #1
7 Replies


Thomas Kowalski wrote:
I would like to know which high level techniques can improve the number of cache hits. [...]
The last time I came across recommendations of this sort, they were mostly
about placing objects in such a way that when you follow the references
(not C++ references, but generally pointers, indices, etc.) from one to
another, the objects found (resolved) are close by.

Beware: those are always platform-specific, and so is the advice you'll
receive, I am fairly certain. Try a good book, like "Efficient C++".

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Sep 25 '06 #2

Thomas Kowalski wrote:
I would like to know which high level techniques can improve the number of cache hits. [...]
My reply is perhaps slightly OT here: have you tried profiling it with
valgrind's cachegrind, or the frontend to it, kcachegrind? I can't give
many hard-and-fast rules (except in the case of linear algebra), but I
imagine kcachegrind could give you a somewhat better view of what's
going on.
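For reference, a typical cachegrind session looks roughly like this (assuming a Linux machine with valgrind installed; `myprog` is a placeholder for your binary):

```shell
# Build with debug info so cachegrind can map misses back to source lines.
g++ -O2 -g myprog.cpp -o myprog

# Run under the cache simulator; results go to cachegrind.out.<pid>
valgrind --tool=cachegrind ./myprog

# Annotate the output with per-source-line read/write miss counts
cg_annotate cachegrind.out.*
```

The annotated output shows which lines account for most L1 and last-level data misses, which is usually where loop-order or layout changes pay off.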

Alan
Sep 25 '06 #3

Thomas Kowalski wrote:
I would like to know which high level techniques can improve the number of cache hits. [...]
I imagine this would be hard to do at the source level, but one
piece of advice springs to mind (similar to what Victor pointed out):
it is usually beneficial to access memory addresses that
are close together, in increasing order. For instance

int arr[1000][1000];
long int sum = 0;

for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        sum += arr[i][j];

would probably run significantly faster than

for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        sum += arr[j][i];

because in the first case the memory occupied by arr is
traversed linearly. Therefore it is sometimes beneficial
to keep a transposed copy of an array that you need to
access in the latter fashion.

<OT>
Other than that, I remember that aligning objects to
certain boundaries may speed things up. Try looking up
"alignment" or __declspec(align(n)) in your compiler's
manual. For example, on the Itanium 2 it is beneficial
to align variables on 16-byte boundaries. Caches also
typically have the property that certain alignments are
ill-advised, because they interfere with the cache's
internal indexing scheme. For instance, on the Itanium 2,
accessing in succession variables that all lie at
addresses divisible by 2048 causes very poor performance.
</OT>

As a last note -- definitely use a profiler to
check where the bottleneck lies. It quite often
lies far from where you'd expect it to be by
inspecting the source code.

HTH,
- J.
Sep 25 '06 #4

Thomas Kowalski wrote:
I would like to know which high level techniques can improve the number of cache hits. [...]
As others have pointed out, it's obviously very platform-specific.
Ideally, the part of your data structure that you are accessing should
fit into cache. There are various tricks already mentioned by others
like alignment, but again, these are just too specific to the
particular processor you're using.

Regards,
Bart.

Sep 25 '06 #5

Jacek Dziedzic wrote:
it is usually beneficial to access memory addresses that are close together, in an increasing order. [...] in the first case the memory occupied by arr is traversed linearly.
That's entirely dependent on the situation. If the whole structure fits
into cache then that wouldn't be the case. Okay, arguably not too many
computers have 1MB of cache, but the point is that you have to analyze
each case separately. General advice on this particular topic is rarely
useful and gets obsolete very quickly.

Regards,
Bart.

Sep 25 '06 #6

Bart wrote:

That's entirely dependent on the situation. If the whole structure fits
into cache then that wouldn't be the case. Okay, arguably not too many
computers have 1MB of cache,
For some reason, I'm somewhat careless today. I meant 1 million ints
but wrote 1MB instead.

Sorry for that,

Regards,
Bart.

Sep 25 '06 #7

Thomas Kowalski wrote:
I would like to know which high level techniques can improve the number of cache hits. [...]
One of the most effective techniques is "blocking": taking chunks of
your data and operating on each chunk while it is in cache.

e.g. If I was doing this:

for ( i = 0; i < 1000; ++i ) {
    for ( j = 0; j < 1000; ++j ) {
        magic....
    }
}

it can be turned into:

for ( i = 0; i < 1000; i += 10 ) {
    for ( j = 0; j < 1000; j += 10 ) {
        type temp_data[10];
        for ( ii = i; ii < i + 10; ++ii ) {
            for ( jj = j; jj < j + 10; ++jj ) {
                temp_data[ii - i] += magic ....
            }
        }
    }
}

This is very simplistic, but basically the idea is to keep as much data
as possible in cache. I have seen 100x speed increases using this
technique on large data.
Sep 25 '06 #8
