I have C code which computes the row sums of a matrix, divides each
element of each row by the row sum, and then computes the column sums of
the resulting matrix. Is there a way I can speed up the code in C?
/* Here is the code */
// Table is "wij"
int i, j;
for(i = 0; i < N; ++i)
{
for(j = 0; j < N; ++j)
{
sum_over_j_wij[i] += wij[i,j];
}
for(j = 0; j < N; ++j)
{
sum_over_i[j] += wij[i,j]/sum_over_j_wij[i)];
}
}
There are some loop unrolling techniques that might help. You can also
avoid indexing all the time by using pointers and advancing those using
the builtin ++ operator or += operator.
Next time, please refrain from typing your code into the message and
instead use the "copy-and-paste" capability offered by all modern OSes
and applications.
V
1. If you're talking C, why are you posting to c.l.c++?
2. Your code won't compile; x[i,j] may be a syntax error
2.a. Your code might compile, but it may only provide x[j] (or x[i]; I
don't remember whether the comma operator returns the right or left
operand).
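For reference, a minimal sketch of what the comma operator does inside a subscript. The helper name `comma_index` is hypothetical and exists only for illustration. In C the comma operator evaluates its left operand, discards the result, and yields the right operand, so `x[i,j]` indexes with `j`:

```c
#include <stddef.h>

/* Hypothetical helper for illustration: index x the way x[i,j] would. */
int comma_index(const int *x, size_t i, size_t j)
{
    /* The comma operator evaluates i, discards it, and yields j.
       The (void) cast only silences "unused value" warnings. */
    return x[((void)i, j)];
}
```

With a true two-dimensional array, `wij[i,j]` is therefore `wij[j]`, a whole row rather than a single element, so the expression does not do what was intended.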
You have some errors in your code.
Besides the errors, perhaps you could see something by
expanding your loop into a series.
For N = 1:
sum_over_j_wij[0] = wij[0][0];
sum_over_i[0] = wij[0][0] / (wij[0][0]); /* substitution */
For N = 2:
sum_over_j_wij[0] = wij[0][0] + wij[0][1];
sum_over_j_wij[1] = wij[1][0] + wij[1][1];
sum_over_i[0] = wij[0][0] / (wij[0][0] + wij[0][1])
+ wij[1][0] / (wij[1][0] + wij[1][1]);
sum_over_i[1] = wij[0][1] / (wij[0][0] + wij[0][1])
+ wij[1][1] / (wij[1][0] + wij[1][1]);
Follow this through N = 5.
See if there are any commonalities or if the terms
can be rearranged so that the operation can be
distributed (such as one pass is addition, another
division, etc.)
See also loop unrolling and also repetitive subtraction
rather than division.
Think about trying to "pipeline" the operations.

Thomas Matthews
Try this:
cat f.c
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
bool f( // false if any row sum is zero.
const size_t m, // number of rows
const size_t n, // number of columns
const double w[m][n], // input matrix
double_t* restrict sum // output (column sums), n elements
) {
for (size_t j = 0; j < n; ++j) {
sum[j] = 0.0;
}
for (size_t i = 0; i < m; ++i) {
double_t t = 0.0; // row sum accumulator
for (size_t j = 0; j < n; ++j) {
t += w[i][j];
}
if (0.0 == t)
return false;
for (size_t j = 0; j < n; ++j) {
sum[j] += w[i][j]/t;
}
}
return true;
}
gcc -Wall -std=c99 -pedantic -O3 -c f.c
Performance will degrade when rows of array w are too large
to keep in level 1 cache along with the column sums.
> There are some loop unrolling techniques that might help.
The biggest time consumer here is likely to be the division. Division is
usually a lot slower than other arithmetic operations. Loop unrolling
probably won't help much in comparison, and is something the compiler
might do for you anyway.
> You can also avoid indexing all the time by using pointers and advancing those using the builtin ++ operator or += operator.
Indexing is usually not an expensive operation, and compilers tend to be
good at optimising it. It isn't uncommon for an indexed version of code
to turn out faster than one using pointer arithmetic.
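To make the comparison concrete, here is a minimal sketch of the same row sum written both ways. The function names are made up for illustration. Which one is faster depends on the compiler and target, so it is worth timing both rather than assuming:

```c
#include <stddef.h>

/* Row sum using array indexing. */
double row_sum_indexed(const double *row, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; ++j)
        s += row[j];
    return s;
}

/* Same row sum, walking a pointer with ++ instead of indexing. */
double row_sum_pointer(const double *row, size_t n)
{
    double s = 0.0;
    for (const double *p = row, *end = row + n; p != end; ++p)
        s += *p;
    return s;
}
```

Both functions add the elements in the same order, so they return identical results; an optimising compiler may well generate the same machine code for the two.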
You could experiment with the following (untested) code.
int i, j;
/* Initialise sum_over_i elements if necessary here */
for(j = 0; j < N; ++j)
{
sum_over_i[j] = 0.0;
}
for(i = 0; i < N; ++i)
{
/* Assuming elements are of type double */
const double *const wij_row = wij[i];
double sum = 0.0;
double sum_reciprocal;
for(j = 0; j < N; ++j)
{
sum += wij_row[j];
}
sum_over_j_wij[i] = sum;
sum_reciprocal = 1.0 / sum;
for(j = 0; j < N; ++j)
{
sum_over_i[j] += wij_row[j] * sum_reciprocal;
}
}
wij_row may or may not help; the compiler could well optimise in a similar
or perhaps better way. Multiplying by the reciprocal may be marginally
less accurate than dividing.
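A small sketch of where the accuracy difference comes from, using hypothetical helper names. Direct division rounds once, while the reciprocal route rounds twice (once for 1.0 / d, once for the multiply):

```c
/* Divide directly: a single rounding. */
double div_direct(double x, double d)
{
    return x / d;
}

/* Divide by multiplying with a precomputed reciprocal. */
double div_via_reciprocal(double x, double d)
{
    double r = 1.0 / d;   /* rounded once here */
    return x * r;         /* and rounded again here */
}

/* Absolute difference between the two results. */
double div_discrepancy(double x, double d)
{
    double diff = div_direct(x, d) - div_via_reciprocal(x, d);
    return diff < 0.0 ? -diff : diff;
}
```

On IEEE 754 doubles, 49.0 is a commonly cited case: 49.0 / 49.0 is exactly 1.0, while 49.0 * (1.0 / 49.0) typically comes out one unit in the last place lower. Whether an error of that size matters depends on the application.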
Lawrence
As others have pointed out, you are using Pascal-style subscripts
rather than C or C++ style. And you have a strange syntax error in
there too. I'm going to assume all these arrays are "doubles".
Remember, division, modulo and square root are all roughly the same
speed and are very slow (as a rule of thumb you should think of them as
about 10 times slower than addition). So reducing their quantity in
your inner loops is of prime importance in this case. So, if you can
live with slight accuracy issues, the key "speeding up" consideration
is conversion of division to reciprocal multiplication:
for (i=0; i < N; i++) {
for (s=0.0,j=0; j < N; j++) s += wij[i][j];
s = 1.0 / s; /* if s == 0, you are SOL. */
for (j=0; j < N; j++) sum_over_i[j] += s*wij[i][j];
}
Some compilers are not able to hoist out the wij[i] calculation, so it
might be useful to precalculate this as double *wptr = wij[i]; and
replace the instances of wij[i] with wptr.
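Putting the pieces above together, here is a minimal self-contained sketch. It assumes the matrix is stored row-major in a flat array of doubles. The function name `normalized_column_sums` and its signature are made up for illustration. Unlike the fragment above, it reports a zero row sum instead of dividing by it:

```c
#include <stddef.h>

/* Hypothetical helper: w is an n-by-n matrix stored row-major in a flat
   array. On success, col_sum[j] holds the sum over i of
   w[i][j] / (sum of row i). Returns 0 on success, -1 if a row sums to zero. */
int normalized_column_sums(size_t n, const double *w, double *col_sum)
{
    for (size_t j = 0; j < n; ++j)
        col_sum[j] = 0.0;

    for (size_t i = 0; i < n; ++i) {
        const double *wptr = w + i * n;   /* row i, computed once */
        double s = 0.0;

        for (size_t j = 0; j < n; ++j)
            s += wptr[j];
        if (s == 0.0)
            return -1;                    /* cannot normalise this row */

        s = 1.0 / s;                      /* one division per row */
        for (size_t j = 0; j < n; ++j)
            col_sum[j] += s * wptr[j];
    }
    return 0;
}
```

For the 2x2 matrix {1, 3; 2, 2} both row sums are 4, so the column sums come out as 0.75 and 1.25.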
Certainly, using a good vectorizing compiler, such as Intel's compiler,
will likely make a *huge* difference on code like this. Microsoft
claims that their latest compilers have vectorization capabilities, but
I have not verified this myself. In any event, the benefits of using
the SIMD instruction set basically go straight to the bottom line,
especially in cases like this.
If you are on a processor which has a "multiply accumulate" instruction
(PowerPC, Itanium, PA-RISC) instead of SIMD, you can invert the loops:
for (i=0; i < N; i++) {
for (s=0.0,j=0; j < N; j++) s += wij[i][j];
recip_sum_over_j[i] = 1.0 / s; /* if s == 0, you are SOL. */
}
for (j=0; j < N; j++) {
for (s=0.0,i=0; i < N; i++) s += recip_sum_over_j[i]*wij[i][j];
sum_over_i[j] = s;
}
So you can try each possibility, and check your compiler settings
(w.r.t. SIMD and "Multiply-Accumulate") to see which one works better
for your platform.
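For experimenting with the inverted-loop variant, here is a minimal self-contained sketch. It assumes a flat row-major matrix of doubles and a caller-provided scratch array. The function name and signature are hypothetical, and like the fragment above it assumes no row sums to zero:

```c
#include <stddef.h>

/* Hypothetical helper: two-pass variant. w is n-by-n, row-major, flat.
   recip is caller-provided scratch of length n.
   Assumes every row has a non-zero sum. */
void column_sums_two_pass(size_t n, const double *w,
                          double *recip, double *col_sum)
{
    /* Pass 1: reciprocal of each row sum. */
    for (size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j)
            s += w[i * n + j];
        recip[i] = 1.0 / s;
    }

    /* Pass 2: each column sum is a chain of multiply-adds into one scalar. */
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += recip[i] * w[i * n + j];
        col_sum[j] = s;
    }
}
```

Note that the second pass walks down columns, so with row-major storage it strides through memory. Whether the multiply-accumulate gain outweighs that is something to measure on the target machine.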

Easy. Note that wij[i,j] is exactly the same as wij[j], so we change
this to
int i, j;
for(i = 0; i < N; ++i)
{
for(j = 0; j < N; ++j)
{
sum_over_j_wij[i] += wij[j];
}
for(j = 0; j < N; ++j)
{
sum_over_i[j] += wij[j]/sum_over_j_wij[i)];
}
}
Now the two inner loops are independent and can be split; then i and j
can be exchanged in the second loop, so we get:
int i, j;
for(i = 0; i < N; ++i)
{
for(j = 0; j < N; ++j)
{
sum_over_j_wij[i] += wij[j];
}
}
for(j = 0; j < N; ++j)
{
for(i = 0; i < N; ++i)
{
sum_over_i[j] += wij[j]/sum_over_j_wij[i)];
}
}
In the first nested loop, we always add the same values to
sum_over_j_wij, so we calculate that sum only once:
int i, j;
double s = 0.0;
for (j = 0; j < N; ++j) s += wij[j];
for(i = 0; i < N; ++i)
{
sum_over_j_wij[i] += s;
}
for(j = 0; j < N; ++j)
{
for(i = 0; i < N; ++i)
{
sum_over_i[j] += wij[j]/sum_over_j_wij[i)];
}
}
In the second nested loop, we add wij[j] multiplied by the sum over 1 /
sum_over_j_wij, so we change this to:
int i, j;
double s = 0.0, t = 0.0;
for (j = 0; j < N; ++j) s += wij[j];
for (i = 0; i < N; ++i) sum_over_j_wij[i] += s;
for (i = 0; i < N; ++i) t += 1.0 / sum_over_j_wij[i];
for (j = 0; j < N; ++j) sum_over_i[j] += wij[j] * t;
