
Hi,
I have read somewhere that C code sometimes cannot be compiled to be as
efficient as FORTRAN, e.g. for matrix multiplication, because a C
compiler cannot make the assumptions about arrays that a FORTRAN
compiler can. But I don't understand the example, not least because I
don't understand FORTRAN. I also don't understand why it is more
efficient in this case for a compiler to choose the order of evaluation
(or whatever it is that it does for matrix multiplication to make it
faster).
Can anyone explain all this, please? And how much speedup might one
get from using FORTRAN over C for such things? What sort of compilers
offer the best performance for issues like this? Is there any general
advice about how to achieve efficient code for such linear algebra?
This is a fairly live issue, because matrix multiplication (and other
things, like evaluating a dot product) often take an extremely long
time for large matrices and vectors.
I am also wondering how other languages like Pascal might compare to C
and Fortran in this regard; does Pascal have enough array structure to
allow compilers to take advantage of such optimisations?
Any more general docs on issues like this would also be interesting
reading.
Thanks  

<tr*************@hotmail.com> wrote in message
news:11**********************@f14g2000cwb.googlegroups.com...
> I have read somewhere that C code sometimes cannot be compiled to be as
> efficient as FORTRAN, e.g. for matrix multiplication, because a C
> compiler cannot make the assumptions about arrays that a FORTRAN
> compiler can. [snip]
> Can anyone explain all this, please? And how much speedup might one get
> from using FORTRAN over C for such things? What sort of compilers offer
> the best performance for issues like this? Is there any general advice
> about how to achieve efficient code for such linear algebra?
Although Fortran has provided matmul() and dot_product() intrinsics for
some 15 years, that is mainly a matter of convenience. As these operations,
in themselves, pose no danger of overwriting their operands, the larger
differences in potential efficiency lie elsewhere, provided you don't
choose an algorithm which is more awkward in one language than the other.
Compilers tend to ignore differences in restrictions on order of evaluation;
"vectorizing" compilers are likely to implement them the same way in C and
Fortran. Even on a single CPU, it is generally necessary to break a long
dot product down into 8 or so partial sums which can be calculated in
parallel (a sketch follows below). In principle, the Fortran intrinsics
could incorporate MP parallelism, but in practice that is likely to depend
on application of OpenMP, which is not part of either language, but works
the same way with either.
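To make the "partial sums" point concrete, here is a rough sketch in C of
what a vectorizing compiler does internally, using four partial sums rather
than 8 for brevity. This is my own illustration, not tuned for any
particular CPU, and note that the reassociation can change rounding
slightly:

#include <stddef.h>

/* Dot product with four independent accumulators, so the
   additions can proceed in parallel instead of forming one
   long dependency chain through a single register. */
double dot(size_t n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}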
For large matrix multiplication (say, with minimum dimension 24 or more) or
dot products, you may want to use a library optimized for your processor,
usually following the BLAS schemes. If you are trying to avoid wrapper
overhead in calling basic BLAS, note that it is written with an old-style
Fortran interface, which requires some study to emulate in C (a sketch
follows below). Other than that,
differences in efficiency among calling languages or compilers disappear,
for those operations supported in the library.
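As an illustration of the old-style interface, here is how the Fortran
BLAS DDOT is typically reached from C on Unix-like systems: a
trailing-underscore symbol, with every argument passed by reference. This
linkage convention is an assumption that varies by platform and compiler,
so check your system's documentation:

/* Declaration of the Fortran BLAS routine DDOT as typically
   exported under f77-style linkage: underscored name, all
   arguments by reference. */
extern double ddot_(const int *n, const double *x, const int *incx,
                    const double *y, const int *incy);

double dot(int n, const double *x, const double *y)
{
    int one = 1;                      /* unit stride */
    return ddot_(&n, x, &one, y, &one);
}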
If you do have operations which one compiler succeeds in vectorizing
effectively, while another fails, the speedup could range up to a factor of
6 or so on common processors. It may be more difficult to achieve that in
C, requiring full use of restrict keywords and the like, and maybe more hand
optimization of the code. If you are "restricted" to a C subset, or are
required to use options which support violations of the C standard, you may
have no hope of matching Fortran performance.  
tr*************@hotmail.com wrote:
> I have read somewhere that C code sometimes cannot be compiled to be as
> efficient as FORTRAN, e.g. for matrix multiplication, because a C
> compiler cannot make the assumptions about arrays that a FORTRAN
> compiler can. [snip]
> I am also wondering how other languages like Pascal might compare to C
> and Fortran in this regard; does Pascal have enough array structure to
> allow compilers to take advantage of such optimisations?
Fortran compilers have long been optimised for SIMD and vector
architectures, e.g. Cray supercomputers; this improves the speed of
vector/matrix calculations. I presume the same is now true for C matrix
libraries on SIMD machines (a sketch of SIMD code in C follows below).
See http://en.wikipedia.org/wiki/SIMD
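To give a flavour of SIMD at the C source level, here is a sketch using
x86 SSE intrinsics, processing four floats per operation. This is my own
illustration; a good vectorizing compiler can often generate equivalent
code from a plain loop:

#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics; x86-specific */

/* Dot product, four lanes at a time.  Assumes n is a
   multiple of 4 to keep the sketch short. */
float dot_sse(size_t n, const float *x, const float *y)
{
    __m128 acc = _mm_setzero_ps();
    float lane[4];

    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(x + i),
                                         _mm_loadu_ps(y + i)));
    _mm_storeu_ps(lane, acc);       /* spill the four partial sums */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}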
gtoomey  

In article <11**********************@f14g2000cwb.googlegroups.com>
<tr*************@hotmail.com> wrote:
> I have read somewhere that C code sometimes cannot be compiled to be
> as efficient as FORTRAN, e.g. for matrix multiplication, because a C
> compiler cannot make the assumptions about arrays that a FORTRAN
> compiler can.
In theory, there is no reason the C code cannot be just as fast.
As a practical matter, however, this was true before C99. In
C99, one can now tell a compiler that a given pointer is the only
"handle" by which some array is accessed, through the "restrict"
type qualifier.
> But I don't understand the example, not least because I don't
> understand FORTRAN. I also don't understand why it is more efficient
> in this case for a compiler to choose the order of evaluation (or
> whatever it is that it does for matrix multiplication to make it
> faster).
> Can anyone explain all this, please?
Assume we have some routine/function f() that takes two or more
pointers and performs some operation using them. For instance,
consider the boring old Fortran "DAXPY" function: double-precision
a*X + Y, where X and Y are vectors with N elements. (I am leaving
out the xinc and yinc parameters on purpose; this is not quite the
usual DAXPY.) The result of the multiply-and-add is stuffed back
into the vector Y. This gives the following simple C code:
#include <stddef.h>  /* for size_t */

void daxpy(size_t n, double a, double *x, double *y) {
    size_t i;
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}
(aside from the increments and occasional optimizations for a==0.0
and such, this really *is* all there is to DAXPY).
In C, given the above function, the following is perfectly
legal as a call:
double arr[100];
... /* fill in some initial values for arr */ ...
daxpy(99, 3.0, &arr[0], &arr[1]);
Inside daxpy(), we compute "y[i] += a * x[i]"; this is now the
same as doing:
    for (i = 0; i < 99; i++)
        arr[i + 1] += a * arr[i];
This means that the first trip through the loop, we use arr[0] to
adjust arr[1], then we use the new, adjusted arr[1] to adjust
arr[2], and so on. In other words, the C compiler *must not*
"preload" the input values of arr[1] through arr[98] and use some
or all of those, instead of the new values that will be written
into arr[1] through arr[98] when i is 0 through 97 respectively.
The C code *must* use the newly computed values each time.
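(To put numbers on it -- this little trace is my own addition -- suppose
arr[] starts as {1.0, 1.0, 1.0} and we call daxpy(2, 2.0, &arr[0], &arr[1]):

/* i == 0:  arr[1] += 2.0 * arr[0];  arr[1] becomes 3.0          */
/* i == 1:  arr[2] += 2.0 * arr[1];  the NEW 3.0 must be used,   */
/*          so arr[2] becomes 1.0 + 6.0 == 7.0                   */

Had the compiler preloaded the original arr[1] == 1.0, it would compute
arr[2] == 3.0 instead, which is wrong under C's rules.)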
If this were Fortran code instead, the call would be spelled slightly
differently, using all-capitals (in F77 at least), adding the CALL
keyword, omitting the "&"s, changing [] to (), and allowing for
Fortran's 1-based array indexing:
      CALL DAXPY(99, 3.0, ARR(1), ARR(2))
The kicker is that this call is *illegal* in Fortran. More precisely,
it invokes undefined behavior, just like "i = i++" in C. A Fortran
compiler does not have to detect the problem; no diagnostic is
required; but the code is allowed to misbehave in arbitrary ways.
The call is legal and well-defined in C, but not in Fortran. Well,
so what?
The answer is: "so, the Fortran compiler *can* preload the input
values". On some machines, this helps a little. On some machines,
this helps a lot. On some machines, it does not help at all.
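(To show what "preload" buys, here is roughly the transformation the
no-alias assumption permits, written out as C; this is a sketch of
compiler output, not something you would write by hand:

for (i = 0; i + 4 <= n; i += 4) {
    /* batch-load four inputs before storing anything to y[] */
    double x0 = x[i], x1 = x[i+1], x2 = x[i+2], x3 = x[i+3];
    y[i]   += a * x0;
    y[i+1] += a * x1;
    y[i+2] += a * x2;
    y[i+3] += a * x3;
}
/* ...plus a cleanup loop for the last n % 4 elements */

If x and y overlap, x1 through x3 can be stale by the time they are used,
which is exactly why the compiler must not do this without a no-alias
guarantee.)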
> And how much speedup might one get from using FORTRAN over C for
> such things?
A little (say, 5%), a lot (e.g., subroutine runs about 40 times
faster), or perhaps none at all.
C99 compilers (what few of them there are) have a new keyword,
"restrict". If we rewrite daxpy() as:
void daxpy(size_t n, double a, double *restrict x, double *restrict y) {
    size_t i;
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}
then that C99 compiler is now allowed (but not required) to make
the *same* assumptions as the Fortran compiler. In other words,
this function is *less* useful to you (the programmer) than the
original, unrestricted version -- but by making it less useful
(i.e., more constrained), we tell the compiler more about expressions
like x[i] and y[i]. In particular, we have told the compiler --
whether it is true or not -- that x[i] and y[j] *never* name the
same underlying object, no matter what valid values of i and j are
used. The compiler may, if it chooses, "preload" some or all of
x[i] and/or y[i], if that speeds up the machine code.
> What sort of compilers offer the best performance for issues like this?
comp.lang.c is all about portable, Standard-conforming code;
performance is irrelevant here. Fortunately, it turns out that
the answer depends greatly on your particular hardware, and
hence a hardwarespecific group is the right place to ask.
(*Which* hardwarespecific group depends on the hardware, of
course.)
> I am also wondering how other languages like Pascal might compare to
> C and Fortran in this regard; does Pascal have enough array structure
> to allow compilers to take advantage of such optimisations?
Pascal -- the original J&W version, anyway -- has such strong
constraints on arrays that the code can be very fast, but you cannot
write useful programs in it. :) You will find, as a general rule,
that the more generally-useful the language, the more difficult it
is for a compiler to produce fast machine code for it (at least,
without "hints" that the program does not use most of those wonderful
features). On the other hand, if you can work directly "in the
problem" as it were -- for instance, writing mathematical equations
directly, rather than expanding them out into loops or subroutine
calls like DAXPY -- then you (the human) may be able to simplify
the problem, so that nowhere near as much machine code is required.
As a rule, the fastest and most reliable parts of any program are
those that are not there.
(For this last reason, I always thought it was kind of nutty to
write these programs in Fortran instead of just using APL... :) )

In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.  

To optimize the performance of linear algebra operations, one should
consider calling hardware-specific libraries, such as the Intel Math
Kernel Library (http://www.intel.com/software/products/mkl/features.htm)
or the AMD Core Math Library (ACML). Typically, such libraries are
callable from Fortran, C, and C++; a sketch follows below.
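For instance, such libraries usually export the CBLAS interface for C
callers alongside the Fortran entry points. Here is a sketch of a square
matrix multiply through cblas_dgemm; the header name varies by vendor
(cblas.h generally, mkl.h for MKL), and linking details are
platform-specific:

#include <cblas.h>   /* header name varies by vendor */

/* C = A * B for n-by-n row-major matrices of doubles. */
void matmul(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,     /* alpha, A, lda */
                B, n,          /* B, ldb */
                0.0, C, n);    /* beta, C, ldc */
}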

Chris Torek wrote:
> On the other hand, if you can work directly "in the problem" as it
> were -- for instance, writing mathematical equations directly, rather
> than expanding them out into loops or subroutine calls like DAXPY --
> then you (the human) may be able to simplify the problem, so that
> nowhere near as much machine code is required. As a rule, the fastest
> and most reliable parts of any program are those that are not there.
> (For this last reason, I always thought it was kind of nutty to write
> these programs in Fortran instead of just using APL... :) )
I think it's "kind of nutty" that you write this in the year 2004, when
the last 3 Fortran standards (1990, 1995, and 2003) have provided a
full set of array operations, including array sections, described at
http://www.pcc.qub.ac.uk/tec/courses...dentMIF_5.html .
There are many Fortran 95 compilers available, including a free one
called g95, at http://www.g95.org .

In article <11**********************@f14g2000cwb.googlegroups.com>
<be*******@aol.com> wrote:
>> Chris Torek wrote: (For this last reason, I always thought it was
>> kind of nutty to write these programs in Fortran instead of just
>> using APL... :) )
> I think it's "kind of nutty" that you write this in the year 2004,
> when the last 3 Fortran standards (1990, 1995, and 2003) have
> provided a full set of array operations, including array sections.
> [snip] There are many Fortran 95 compilers available, including a
> free one called g95 at http://www.g95.org .
Well, I *was* kidding. Perhaps I should have said "F77" though.
It is true that F90 and F95 have greatly improved the language's
capabilities (and greatly changed it -- if people thought C90 to
C99 was a shock, try F77 to F90...). The last time I actually
*used* Fortran was in the days of F77, in any case.
One might also note that APL never really "made it" as a general-use
language, unlike Fortran. This may have been because it required
a special typeball on the Selectric, to print out the funny
character set -- remember, this language was used in the days of
printing terminals, and even depended on them in the form of
overstrike (quote-quad, for instance, was formed by typing both
the quad and quote characters in the same position, by backspacing
between the two) -- or perhaps because APL seemed to be one of
those "write-only languages": no one could read programs written
in APL, not even the author. :) (On the other hand, if "ugly
syntax" is such a problem, perhaps F90-and-successors are doomed
too. :) Oddly enough, Fortran aficionados use this same argument
against C...)
(I did a google search -- keywords "apl iverson", to avoid hits on
other APL TLAs -- and was surprised to find that the language is
still in use. Of course, today, with bitmapped displays, it is
easy enough to construct APL fonts. Input methods are a bit
problematic: if C programmers think {} are troublesome on
international keyboards, well, where are your rho and iota keys,
unless you have a Greek keyboard? But see, e.g.,
<http://home.earthlink.net/~swsirlin/apl.faq.html>. A couple
of follow-on languages, J and K, are perhaps more suitable today,
though.)

In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.  

In article <11**********************@f14g2000cwb.googlegroups.com>,
tr*************@hotmail.com wrote:
> I have read somewhere that C code sometimes cannot be compiled to be
> as efficient as FORTRAN, e.g. for matrix multiplication, because a C
> compiler cannot make the assumptions about arrays that a FORTRAN
> compiler can. [snip]
> Can anyone explain all this, please?
For details of various processors, go to www.intel.com, www.amd.com, www.apple.com etc. etc. You will find all the information about
processors that you want, with as much detail as you want.
A Google search will at least find a freely available copy of the
Fortran 77 Standard, and a freely available copy of the last draft for
the C99 Standard, which contains enough information to understand the
differences between C and Fortran in this respect.  
tr*************@hotmail.com wrote:
> I have read somewhere
Do you remember where?
> that C code sometimes cannot be compiled to be as efficient as
> FORTRAN, e.g. for matrix multiplication, because a C compiler cannot
> make the assumptions about arrays that a FORTRAN compiler can. [snip]
> This is a fairly live issue
No. It is a *dead* issue.
> because matrix multiplication (and other things, like evaluating a
> dot product) often take an extremely long time for large matrices
> and vectors.
> I am also wondering how other languages like Pascal
Please ask in the comp.lang.pascal newsgroup instead.
> might compare to C and Fortran in this regard; does Pascal have
> enough array structure to allow compilers to take advantage of such
> optimisations?
> Any more general docs on issues like this would also be interesting
> reading.
cat main.f
      program main
      real C(1024, 1024)
      real B(1024, 1024)
      real A(1024, 1024)
      integer i, j, k
      do j = 1, 1024
         do i = 1, 1024
            A(i, j) = i - 1 + (j - 1)*1024
            B(i, j) = i - 1 + (j - 1)*1024
         end do
      end do
! C <- A^T B (matrix-matrix dot product)
      do j = 1, 1024
         do i = 1, 1024
            C(i, j) = 0.0
            do k = 1, 1024
               C(i, j) = C(i, j) + A(k, i)*B(k, j)
            end do
         end do
      end do
      end program main
f90 -O3 -o main main.f
time ./main
27.957u 0.105s 0:28.06 99.9% 0+0k 0+0io 0pf+0w
cat main.c
#include <stddef.h>   /* size_t */

enum { n = 1024 };    /* integer constant, so the arrays are not VLAs */

/* File scope and static: three 4 MB automatic arrays would be
   likely to overflow the stack on many systems. */
static float A[n][n];
static float B[n][n];
static float C[n][n];

int main(void) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            A[i][j] = j + i*n;
            B[i][j] = j + i*n;
        }
    }
    // C <- B A^T (matrix-matrix dot product)
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            C[i][j] = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                C[i][j] = C[i][j] + B[i][k]*A[j][k];
            }
        }
    }
    return 0;
}
gcc -Wall -std=c99 -pedantic -O3 -o main main.c
time ./main
31.365u 0.113s 0:31.62 99.5% 0+0k 0+0io 0pf+0w
Quality C and Fortran compilers will optimize these loops
in almost exactly the same way. The difference appears
when you pass arrays to "subprograms".
void matrix_matrix_dot(float C[][1024],
                       const float A[][1024], const float B[][1024]);
or
      interface
         subroutine matrix_matrix_dot(C, A, B)
            real, intent(in), dimension(1024, 1024) :: A
            real, intent(in), dimension(1024, 1024) :: B
            real, intent(out), dimension(1024, 1024) :: C
         end subroutine matrix_matrix_dot
      end interface
The problem is that the compiler does not know that
the destination array C is not an *alias*
for [part of] one of the source operands --
it can't be sure that programmers won't write
      call matrix_matrix_dot(A, A, B)
for example. If C or Fortran programmers do this,
they are going to get garbage in A
instead of the matrix-matrix dot product.
Fortran programmers are simply admonished *not* to do this
but, for some reason, C programmers expect to get the same thing
as if they had written
for (size_t i = 0; i < n; ++i) {
    for (size_t j = 0; j < n; ++j) {
        A[i][j] = 0.0;
        for (size_t k = 0; k < n; ++k) {
            A[i][j] = A[i][j] + B[i][k]*A[j][k];
        }
    }
}
at the same place in their program where they wrote
matrix_matrix_dot(A, A, B);
which inhibits the C compiler from performing any optimizations
which might yield a different [wrong] result.
The new C99 standard lets programmers
qualify pointers with the 'restrict' keyword,
promising the compiler that the data is not aliased,
so that the C99 compiler is allowed to perform
the same optimizations as the Fortran compiler.
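For the matrix_matrix_dot example above, the C99 prototype could read as
follows; restrict written inside the array brackets qualifies the pointer
that each parameter decays to:

/* C99: calling this as matrix_matrix_dot(A, A, B) is then
   undefined behavior, and the compiler may optimize as
   freely as the Fortran one. */
void matrix_matrix_dot(float C[restrict][1024],
                       const float A[restrict][1024],
                       const float B[restrict][1024]);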
NOTE: Neither Fortran nor C99 does anything
to enhance the safety of subprograms
that may modify input arrays through aliases.

In article <cq********@news2.newsguy.com>, Chris Torek
<no****@torek.net> writes:
> (I did a google search -- keywords "apl iverson", to avoid hits on
> other APL TLAs -- and was surprised to find that the language is
> still in use.
There's a fair bit of activity on comp.lang.apl. A number of free
APL implementations are available, though I haven't found one that's
completely satisfactory. What really seems to be lacking, though, is
a good online tutorial.
I posted an APL program to comp.programming just a few months back,
for one of those silly "write a program to do some trivial thing"
contests -- I think it was to add the squares of the numbers from 1
to 10. My APL implementation was, of course, the shortest solution;
that's just the sort of thing APL has operators for.
> Of course, today, with bitmapped displays, it is easy enough to
> construct APL fonts. Input methods are a bit problematic: if C
> programmers think {} are troublesome on international keyboards,
> well, where are your rho and iota keys, unless you have a Greek
> keyboard?
Nah. rho is meta-r and iota is meta-i. The tough ones are the
characters that don't have an obvious mapping to standard keyboards
and that you don't use often enough to remember where they are -- I
don't even know the names of some of the symbols on my keyboard
chart.
> But see, e.g., <http://home.earthlink.net/~swsirlin/apl.faq.html>.
> A couple of follow-on languages, J and K, are perhaps more suitable
> today, though.)
There are some implementations of "APL in ASCII" too, which are APL2
workspaces that define ASCII names for all the APL special characters.
They're not so fun and pretty as using the APL glyphs, but they're
useful for posting source to Usenet and the like.

Michael Wojcik mi************@microfocus.com
Advertising Copy in a Second Language Dept.:
The precious ovum itself is proof of the oath sworn to those who set
eyes upon Mokona: Your wishes will be granted if you are able to invest
it with eternal radiance... -- Noriyuki Zinguzi
