Matrix library where's the speed ?

Frank-O

Hi ,

Recently I have been commited to the task of "translating" some complex
statistical algorithms from Matlab to C++. The goal is to be three
times as fast as matlab ( the latest) .

I've used various techniques ( loop unrolling, loop jamming...) and
tried some matrix libraries : newmat (slow for large matrix) , STL
(fast but ..not usefull) , hand coding (brain consuming...), and
recently Meschach...
I am sad and upset that it is not obvious to write a fast program that
can beat matlab, even for a simple vector or matrix addition, ... Who
told me that matlab is slow...!
Despite my great effort, matlab is fast especially when the size of the
matrix increases...

My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

Ps : I am using Dev-cpp /GCC on XP.

Oct 19 '06 #1

Subscribe Post Reply

5166

Phlip

Frank-O wrote:

Despite my great effort, matlab is fast especially when the size of the
matrix increases...

My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

How you know things are slow? What do your unit tests look like?

Did you profile? And I must ask the obvious: Did you compile with debug info
turned off, and optimizations turned on?

If all those fail, your next effort should be Blitz++ or a similar
Expression Metatemplates library. Thence Boost.

--
Phlip
http://www.greencheese.us/ZeekLand <-- NOT a blog!!!

Oct 19 '06 #2

Victor Bazarov

Frank-O wrote:

Recently I have been commited to the task of "translating" some
complex statistical algorithms from Matlab to C++. [...snip...]

My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

Tips: use fast algorithms, optimize only after measuring so you
know what to optimize. Strategy? Choose what to sacrifice. It
is often portability for the price of performance. Compiler?
Intel C/C++ compiler is good. Microsoft isn't bad either. Like
with anything else, you get what you pay for. Invest in tools
that will help you identify performance bottlenecks, then work
on improving your program.

Now, all this doesn't really have much to do with C++. It is
relevant for pretty much every decent language. So, next time
you want to ask about such nebulous topic, try 'comp.software-eng'.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

Oct 20 '06 #3

Thomas Matthews

Frank-O wrote:

Hi ,

[snip]

>
My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

Ps : I am using Dev-cpp /GCC on XP.

After the code works robustly and correctly, try:
1. Writing code in a style that will optimize for your platform.
For example, the ARM processor can conditionally execute
instructions. Replacing "if" statements by boolean expressions
may help the compiler take advantage of this feature.

2. Remove branches.
Most modern processors love to pre-fetch a lot of instructions.
However, branches tend to complicate life and force many to
clear out the prefetched instructions and reload.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library

Oct 20 '06 #4

Frank-O

Thx for your answer,
for matlab I use the profile on/profile report to get accurate time
measure ( line by line)
for C++ I use gprof thanks to the option -pg and I read a
not-so-pretty flat file
and I also use a method based on crontructor destructor...and clock()
regarding optimization I use the -02 of g++, it has reduced my
execution time by a a factor of 18/14.

Phlip wrote

Frank-O wrote:

Despite my great effort, matlab is fast especially when the size of the
matrix increases...

My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

How you know things are slow? What do your unit tests look like?

Did you profile? And I must ask the obvious: Did you compile with debug info
turned off, and optimizations turned on?

If all those fail, your next effort should be Blitz++ or a similar
Expression Metatemplates library. Thence Boost.

--
Phlip
http://www.greencheese.us/ZeekLand <-- NOT a blog!!!

Oct 20 '06 #5

Frank-O

well thank you for your concern, which tools ?

p.s :

It has to do with C++, and scientific computing, etc.., since I talk
about matrix operations...
and the speed of C++.

Victor Bazarov wrote:

Frank-O wrote:
Recently I have been commited to the task of "translating" some
complex statistical algorithms from Matlab to C++. [...snip...]

My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

Tips: use fast algorithms, optimize only after measuring so you
know what to optimize. Strategy? Choose what to sacrifice. It
is often portability for the price of performance. Compiler?
Intel C/C++ compiler is good. Microsoft isn't bad either. Like
with anything else, you get what you pay for. Invest in tools
that will help you identify performance bottlenecks, then work
on improving your program.

Now, all this doesn't really have much to do with C++. It is
relevant for pretty much every decent language. So, next time
you want to ask about such nebulous topic, try 'comp.software-eng'.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

Oct 20 '06 #6

Frank-O

Here is an example :

------matlab
function RunTest(niter,N)

A=ones(N)*4;
B=A+6;
for i=1:niter;
tic
C=B+A;
sumer(i)=toc;
end
disp(sum(sumer));disp(mean(sumer));disp(C(1:5,1:5) )

-----C++
// Test Addition

void Test_perf(int niter,int N)
{
Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter;
while(niter--)
{
stopwatch sw("STL",niter0);

transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}

MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;

A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;

Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;

fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );

stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}

//m_output(C1);

m_free(A1);m_free(B1), m_free(C1);

Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();

stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}

Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();

stopwatch::reset();
uint32_t _nbelement;

while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);

while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}

switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
case 0 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();

}
}

Oct 20 '06 #7

Jacek Dziedzic

Frank-O wrote:

Hi ,

Recently I have been commited to the task of "translating" some complex
statistical algorithms from Matlab to C++. The goal is to be three
times as fast as matlab ( the latest) .

I've used various techniques ( loop unrolling, loop jamming...) and
tried some matrix libraries : newmat (slow for large matrix) , STL
(fast but ..not usefull) , hand coding (brain consuming...), and
recently Meschach...
I am sad and upset that it is not obvious to write a fast program that
can beat matlab, even for a simple vector or matrix addition, ... Who
told me that matlab is slow...!
Despite my great effort, matlab is fast especially when the size of the
matrix increases...

My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

How about using a maths library that professionals wrote and
had optimized already, like LAPACK? Get it, along with BLAS,
fine-tuned for your particular platform and enjoy the speed.
It is a tad unconvenient to interface it (it's mostly fortran
with a C interface), but the speed should be a reward.

Even simple operations, like vector scalar product or
matrix-matrix multiplication will be optimized heavily
and perform way faster than anything us mortals are likely
to code.

HTH,
- J.

Oct 20 '06 #8

toton

Frank-O wrote:

Here is an example :

------matlab
function RunTest(niter,N)

A=ones(N)*4;
B=A+6;
for i=1:niter;
tic
C=B+A;
sumer(i)=toc;
end
disp(sum(sumer));disp(mean(sumer));disp(C(1:5,1:5) )

-----C++
// Test Addition

void Test_perf(int niter,int N)
{
Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter;
while(niter--)
{
stopwatch sw("STL",niter0);

transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}

MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;

A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;

Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;

fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );

stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}

//m_output(C1);

m_free(A1);m_free(B1), m_free(C1);

Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();

stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}

Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();

stopwatch::reset();
uint32_t _nbelement;

while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);

while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}

switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
case 0 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();

}
}

The codes do not look comparable!
Can you try with Blitz library & check performance?
using namespace blitz;
Array<double,2A(N,N,fortranArray); //Malab default array element is
double, and column major
format. It creates 2D array of size NxN and have garbage value.
A = 4; //all the elements are initialized to 4.
Array<double,2B(N,N,fortranArray);
B = A+6;
Array<double,2C = A+B;
The code may not compile as it is (not tested), but will compile with
little mopdification. In my experience with optimization on (-O2 for
GCC,) and ( /O2 for VC7 may be optionally /G7 for P4 ) it will
outperform matlab array computation by a factor. And if you have a
dynamic array (resizable one ) you will gain a large factor. Add your
timing code from liberary (boost timer or whatever). And dont forget to
turn on optimization (it is a must to check performance for esp
template based library).
I usually gain from a huge factor when I move from matlab to C++
(sometimes more than 100 !) . However it depends on complexity of the
problem, library used (matlab uses ATLAS for computation, well
optimized for P4 by intel compiler! , matlab function calls very slow,
same IO operation) and many other factors.
And be sure to have some sort of shared memory allocation when matrix
size is huge (If you are from the field of CFD ! ) . There is no point
loading a 2 GB matrix fully ointo RAM to do processing! . Use
appropriate library for it.
Blitz link http://www.oonumerics.org/blitz/
Looking for some performance gain results! .
abir

Oct 20 '06 #9

loufoque

Frank-O wrote:

I've used various techniques ( loop unrolling, loop jamming...)

Why would you have to do that?
That's the job of the compiler.

and
tried some matrix libraries : newmat (slow for large matrix) , STL
(fast but ..not usefull) , hand coding (brain consuming...), and
recently Meschach...

You haven't tested the good ones. Boost.ublas and blitz++.

Oct 20 '06 #10

Frank-O

I will try your libraries and post the result.

Oct 20 '06 #11

Diego Martins

In intel processors, Matlab use specialized math intel instructions.

Oct 20 '06 #12

Bill Shortall

"Frank-O" <fr*****************@gmail.comwrote in message
news:11*********************@f16g2000cwb.googlegro ups.com...

Hi ,

Recently I have been commited to the task of "translating" some complex
statistical algorithms from Matlab to C++. The goal is to be three
times as fast as matlab ( the latest) .

I've used various techniques ( loop unrolling, loop jamming...) and
tried some matrix libraries : newmat (slow for large matrix) , STL
(fast but ..not usefull) , hand coding (brain consuming...), and
recently Meschach...
I am sad and upset that it is not obvious to write a fast program that
can beat matlab, even for a simple vector or matrix addition, ... Who
told me that matlab is slow...!
Despite my great effort, matlab is fast especially when the size of the
matrix increases...

My question is :
What kind of tips, strategy, compiler, the real world "professionnal"
use to write very fast c++ code ?

Ps : I am using Dev-cpp /GCC on XP.

Hi Frank,
I've noticed that dev-cpp is a lot slower
than microsoft VC6 . So use this coupled
with ppLinear ( see www.pecos-place.com) and you should be able to do
a 500 x 500 SPFP square matrix matrix multiply
in about 125 millisecs which is a 2 Gigaflop rate. Other timings are given
under news
VC7 is another good compiler for windows but make sure you get the
"professional" version as the developer
version does not have the O2 option for max speed.

Bill

Oct 20 '06 #13

loufoque

Bill Shortall wrote:

I've noticed that dev-cpp is a lot slower
than microsoft VC6 .

MSVC6 is so outdated and non conforming that it doesn't deserve to be
called a C++ compiler.
And I seriously doubt it's really that much faster than GCC.

Oct 21 '06 #14

Frank-O

I've tested Blitz++, Although it is a nice library, ---it does not
outperform anyone...
two computers were used.

1- laptop intel pentium M 1.70Ghz 504Mb
The Me-shark library ranks first along with the "dynamic unroll-loops
style" for any size N*N. matlab always is always a tad better.

2- desktop computer. AMD athlon XP2400+
1.99Ghz 320MB
The "dynamic unroll-loops style" ranks first along with the STL for any
size N*N.

There is one thing I can't understand when comparing the performance
of the computers.
when N (matrix size) is less than 400 , the AMD is three times faster
than intel, Conversely when N>400 intel is five times faster than AMD
!!!

Regarding my algorithm itself it is very fast on the computer 2 (AMD)
and very slow on the other....
using namespace blitz;
//BZ_USING_NAMESPACE(blitz)

using namespace std;

void Test_perf(int niter,int N)
{
//Time comparision for Matrix Addition

Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter, niter4=niter , niter5=niter;

while(niter--)
{
stopwatch sw("STL",niter0);

transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}

A.release(); B.release(); C.release();
// blitz++
Array<Real,2 A_(N,N,fortranArray) ;
Array<Real,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<Real,2 C_(N,N,fortranArray) ;

stopwatch::reset();
while(niter4--)
{
stopwatch sw("Blitz++ fortranArray",niter5);
C_ = A_ + B_;
}

// me-shark
MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;

A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;

Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;

fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );

stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}

m_free(A1);m_free(B1), m_free(C1);

Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();

stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}

A2.release(); B2.release(); C2.release();

Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();

stopwatch::reset();
uint32_t _nbelement;

while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);

while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}

switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();

}
//cout << C3 << endl;
A3.release();B3.release(); C3.release();

return;

Oct 21 '06 #15

toton

Frank-O wrote:

I've tested Blitz++, Although it is a nice library, ---it does not
outperform anyone...
two computers were used.

1- laptop intel pentium M 1.70Ghz 504Mb
The Me-shark library ranks first along with the "dynamic unroll-loops
style" for any size N*N. matlab always is always a tad better.

2- desktop computer. AMD athlon XP2400+
1.99Ghz 320MB
The "dynamic unroll-loops style" ranks first along with the STL for any
size N*N.

There is one thing I can't understand when comparing the performance
of the computers.
when N (matrix size) is less than 400 , the AMD is three times faster
than intel, Conversely when N>400 intel is five times faster than AMD
!!!

Regarding my algorithm itself it is very fast on the computer 2 (AMD)
and very slow on the other....
using namespace blitz;
//BZ_USING_NAMESPACE(blitz)

using namespace std;

void Test_perf(int niter,int N)
{
//Time comparision for Matrix Addition

Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter, niter4=niter , niter5=niter;

while(niter--)
{
stopwatch sw("STL",niter0);

transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}

A.release(); B.release(); C.release();
// blitz++
Array<Real,2 A_(N,N,fortranArray) ;
Array<Real,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<Real,2 C_(N,N,fortranArray) ;

stopwatch::reset();
while(niter4--)
{
stopwatch sw("Blitz++ fortranArray",niter5);
C_ = A_ + B_;
}

// me-shark
MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;

A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;

Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;

fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );

stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}

m_free(A1);m_free(B1), m_free(C1);

Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();

stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}

A2.release(); B2.release(); C2.release();

Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();

stopwatch::reset();
uint32_t _nbelement;

while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);

while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}

switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();

}
//cout << C3 << endl;
A3.release();B3.release(); C3.release();

return;

Here are ther results for your program dith different test runs.
(Processor Pentium D, 3 GHz, RAM 1GB, os WinXP SP2, neither matlab nor
the C++ compiler uses the PentiumD features, WinXP doesnt support it,
if they support it can be made parallel very easily)
Matlab =RunTest(1000,500) 1000 iter, 500x500 matrix time 2.4460 s,
mean 0.0024 s
Matlab =RunTest(1000,5000) 1000 iter, 5000x5000 matrix, time
304.5400 s, mean 0.3045 s.
You can check with varying the parameters. In MY PC, the per process
RAM starts getting exhausted for a size 600x600 or more. and the mean
increases drastically! (The difference you will experience at some
other size!)
For curiosity, I have a Java program also. The performance support my
expectation (JVM just removes the call, as nobody uses the result!. And
it is such a beautiful hotspot !
JVM client 1.5 (Sun) 1000 iter, 500x500 matrix, 1.73 sec
1000 iter ,5000x5000 matrix 173.761 s
Now for C++
with blitz 0.9, VC 7.1 optimization /O2, & /G7 1000 iter, 500x500
matrix 2 sec (not very accurate, used C time.h library, it rounds off
to nearest sec )
1000 iter , 5000x5000 matrix 273 sec.
One point easily can be seen here,
C++ program allocates memory for C_ (check your blitz version ) only
once, while Matlab version (your earlier post) allocates in each loop.
The Java version I think too smart here, it KNOWS you have given him
some garbage to compute and never bothered about the result. May be it
just do not compute anything, or computes only a few times, or even
computes always, but writes only once! .
My programs are,
Matlab =same as your first post, I am not reposting the program,
C++ Blitz,
#include <blitz/array.h>
#include <iostream>
#include <time.h>
using namespace blitz;
using namespace std;
int main(){
int N = 5000;
int iter = 1000;
Array<double,2 A_(N,N,fortranArray) ;
Array<double,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<double,2 C_(N,N,fortranArray) ;
time_t time1 = time(NULL);
for(int i = 0; i< iter ;++i){
C_ = A_ + B_;
}
time_t time2 = time(NULL);
cout<<"time "<<time2-time1<<endl;
}
I dont have the stopwatch library ,replaced it by C time.h library
(where to get that library? )
Java version,
public class RunTest{
public static void main(String[] args){
int N = 5000;
int runs = 1000;
double[][] A = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
A[i][j] = 4;
}
}
double[][] B = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
B[i][j] = A[i][j]+6;
}
}
long time1 = System.currentTimeMillis();

double[][] C = new double[N][N];
for (int k = 0; k< runs ;++k){
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
C[i][j] = A[i][j]+B[i][j];
}
}
}
long time2 = System.currentTimeMillis();
System.out.println ((double)(time2-time1)*0.001);
}
}

Any conversional mistake should be reported.

Thanks
abir

Oct 23 '06 #16

Frank-O

Hmm,
Blitz++ is not so fast... even though you show me that it depends upon
the platform..
You should try also the others hand-coded versions with an iterator on
blitz++ data or a simple pointer on double[N][N]; and the STL
(transform)

so we'll get the full picture

I'm installing ATLAS to test also...
I will send you my class stopwatch at your email...
toton wrote:

Frank-O wrote:
I've tested Blitz++, Although it is a nice library, ---it does not
outperform anyone...
two computers were used.

1- laptop intel pentium M 1.70Ghz 504Mb
The Me-shark library ranks first along with the "dynamic unroll-loops
style" for any size N*N. matlab always is always a tad better.

2- desktop computer. AMD athlon XP2400+
1.99Ghz 320MB
The "dynamic unroll-loops style" ranks first along with the STL for any
size N*N.

There is one thing I can't understand when comparing the performance
of the computers.
when N (matrix size) is less than 400 , the AMD is three times faster
than intel, Conversely when N>400 intel is five times faster than AMD
!!!

Regarding my algorithm itself it is very fast on the computer 2 (AMD)
and very slow on the other....
using namespace blitz;
//BZ_USING_NAMESPACE(blitz)

using namespace std;

void Test_perf(int niter,int N)
{
//Time comparision for Matrix Addition

Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter, niter4=niter , niter5=niter;

while(niter--)
{
stopwatch sw("STL",niter0);

transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}

A.release(); B.release(); C.release();
// blitz++
Array<Real,2 A_(N,N,fortranArray) ;
Array<Real,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<Real,2 C_(N,N,fortranArray) ;

stopwatch::reset();
while(niter4--)
{
stopwatch sw("Blitz++ fortranArray",niter5);
C_ = A_ + B_;
}

// me-shark
MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;

A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;

Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;

fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );

stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}

m_free(A1);m_free(B1), m_free(C1);

Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();

stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}

A2.release(); B2.release(); C2.release();

Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();

stopwatch::reset();
uint32_t _nbelement;

while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);

while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}

switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();

}
//cout << C3 << endl;
A3.release();B3.release(); C3.release();

return;
Here are ther results for your program dith different test runs.
(Processor Pentium D, 3 GHz, RAM 1GB, os WinXP SP2, neither matlab nor
the C++ compiler uses the PentiumD features, WinXP doesnt support it,
if they support it can be made parallel very easily)
Matlab =RunTest(1000,500) 1000 iter, 500x500 matrix time 2.4460 s,
mean 0.0024 s
Matlab =RunTest(1000,5000) 1000 iter, 5000x5000 matrix, time
304.5400 s, mean 0.3045 s.
You can check with varying the parameters. In MY PC, the per process
RAM starts getting exhausted for a size 600x600 or more. and the mean
increases drastically! (The difference you will experience at some
other size!)
For curiosity, I have a Java program also. The performance support my
expectation (JVM just removes the call, as nobody uses the result!. And
it is such a beautiful hotspot !
JVM client 1.5 (Sun) 1000 iter, 500x500 matrix, 1.73 sec
1000 iter ,5000x5000 matrix 173.761 s
Now for C++
with blitz 0.9, VC 7.1 optimization /O2, & /G7 1000 iter, 500x500
matrix 2 sec (not very accurate, used C time.h library, it rounds off
to nearest sec )
1000 iter , 5000x5000 matrix 273 sec.
One point easily can be seen here,
C++ program allocates memory for C_ (check your blitz version ) only
once, while Matlab version (your earlier post) allocates in each loop.
The Java version I think too smart here, it KNOWS you have given him
some garbage to compute and never bothered about the result. May be it
just do not compute anything, or computes only a few times, or even
computes always, but writes only once! .
My programs are,
Matlab =same as your first post, I am not reposting the program,
C++ Blitz,
#include <blitz/array.h>
#include <iostream>
#include <time.h>
using namespace blitz;
using namespace std;
int main(){
int N = 5000;
int iter = 1000;
Array<double,2 A_(N,N,fortranArray) ;
Array<double,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<double,2 C_(N,N,fortranArray) ;
time_t time1 = time(NULL);
for(int i = 0; i< iter ;++i){
C_ = A_ + B_;
}
time_t time2 = time(NULL);
cout<<"time "<<time2-time1<<endl;
}
I dont have the stopwatch library ,replaced it by C time.h library
(where to get that library? )
Java version,
public class RunTest{
public static void main(String[] args){
int N = 5000;
int runs = 1000;
double[][] A = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
A[i][j] = 4;
}
}
double[][] B = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
B[i][j] = A[i][j]+6;
}
}
long time1 = System.currentTimeMillis();

double[][] C = new double[N][N];
for (int k = 0; k< runs ;++k){
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
C[i][j] = A[i][j]+B[i][j];
}
}
}
long time2 = System.currentTimeMillis();
System.out.println ((double)(time2-time1)*0.001);
}
}

Any conversional mistake should be reported.

Thanks
abir

Oct 24 '06 #17

toton

Frank-O wrote:

Hmm,
Blitz++ is not so fast... even though you show me that it depends upon
the platform..
You should try also the others hand-coded versions with an iterator on
blitz++ data or a simple pointer on double[N][N]; and the STL
(transform)

so we'll get the full picture

Yes. One can not say a library is fast or slow in a general way. It
depends on a whole lots of thing. It depends on processor, cache, RAM,
OS, which type of program one is comparing as benchmark. Ther result
may differ even by a factor but never by an order!
But a few things here,
1) ATLAS (or BLAS or anything) doesn't do anything special for matrix
addition! None of the library will make a drastic change in result.
However you will feel it when using other Linear Algebra operations
(say matrix inverse, svd etc).
2) Pointer version is always slower than non-pointer version. Pointer
reduces the scope of optimizatron.
3) Iterator (is nothing but a pointer) is always slower than using
original container, as it adds onel level of inderection. Very good
compiler will however will reduce the gap, esp for simple iteration
loops. However iterator makes it generalized, one need not to worry
about the detail of container.
4) as STL operates on a different philosophy of iterator, container and
algorithm concept, where algorithm operates on iterator, which in turns
gives access to container, they are bound to be slower than BLAS or
ATLAS , but again by a factor only, not by an order. Same is true for
Blitz when one uses iterator.
5) I can't compile your code untill I know what is Real, what is
Matrix, and what is MAT.
However for first try transform is a real problem, as it evaluates a
function in each call and uses pointer to the matrix content. blitz
version I don't see any coding problem , unless you have something
different in Real (in it typedef double ?). 3rd version it depends on
m_add function how it is implemented, and the data(is it an 1D array or
2D ? if 2D do it do justice to cache spacial locality ? ) Final one, I
am not sure. Loop unrolling is done by compiler if needed at the
compile time statically ( as in Blitz) . But it looks in your case it
is done at run time! Again not very sure of that. BTW got your
stopwatch code. It is not a singleton class, but has some static reset.
be aware when using it in MT environment. And may be you are using a ST
application. Do check whether Matlab ATLAS library is ST or MT. There
will be a performance difference (however note Matlab itself is Single
Threaded). And don't forget to tune your compiler for the particular
processor. Matlab specific library are listed in blas.spec file.
6) It never depends on a language whether a program is faster or
slower. It depends on the compiler & compiler options, algorithms uses,
thye processor and related things, and the run time (Hmm runtime! That
has the info what is actually happening! In my PC for computation in
loops, Java always runs faster! )
7) Finally, matlab compiler is much much slower than C++ (or Java) . In
your example Matlab have no role in purformance. it is ATLAS (which is
written in FORTRAN, but do has port in other language) tuned for your
processor. (compiler I suspect Intel Fortran ) . However as for this
specific case (addition) the library has nothing great to offer, any
handwriten code will have nearly same performance as the library, if
proper compiler switch is used. And you hadn't sent your performance
test results!

I'm installing ATLAS to test also...
I will send you my class stopwatch at your email...
toton wrote:
Frank-O wrote:
I've tested Blitz++, Although it is a nice library, ---it does not
outperform anyone...
two computers were used.
>
1- laptop intel pentium M 1.70Ghz 504Mb
The Me-shark library ranks first along with the "dynamic unroll-loops
style" for any size N*N. matlab always is always a tad better.
>
2- desktop computer. AMD athlon XP2400+
1.99Ghz 320MB
The "dynamic unroll-loops style" ranks first along with the STL for any
size N*N.
>
There is one thing I can't understand when comparing the performance
of the computers.
when N (matrix size) is less than 400 , the AMD is three times faster
than intel, Conversely when N>400 intel is five times faster than AMD
!!!
>
Regarding my algorithm itself it is very fast on the computer 2 (AMD)
and very slow on the other....
>
>
using namespace blitz;
//BZ_USING_NAMESPACE(blitz)
>
using namespace std;
>
void Test_perf(int niter,int N)
{
//Time comparision for Matrix Addition
>
Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter, niter4=niter , niter5=niter;
>
while(niter--)
{
stopwatch sw("STL",niter0);
>
transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}
>
A.release(); B.release(); C.release();
>
>
// blitz++
Array<Real,2 A_(N,N,fortranArray) ;
Array<Real,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<Real,2 C_(N,N,fortranArray) ;
>
stopwatch::reset();
while(niter4--)
{
stopwatch sw("Blitz++ fortranArray",niter5);
C_ = A_ + B_;
}
>
>
>
// me-shark
MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;
>
A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;
>
Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;
>
fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );
>
stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}
>
m_free(A1);m_free(B1), m_free(C1);
>
Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();
>
stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}
>
A2.release(); B2.release(); C2.release();
>
Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();
>
stopwatch::reset();
uint32_t _nbelement;
>
while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);
>
while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}
>
switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();
>
}
//cout << C3 << endl;
A3.release();B3.release(); C3.release();
>
return;
Here are ther results for your program dith different test runs.
(Processor Pentium D, 3 GHz, RAM 1GB, os WinXP SP2, neither matlab nor
the C++ compiler uses the PentiumD features, WinXP doesnt support it,
if they support it can be made parallel very easily)
Matlab =RunTest(1000,500) 1000 iter, 500x500 matrix time 2.4460 s,
mean 0.0024 s
Matlab =RunTest(1000,5000) 1000 iter, 5000x5000 matrix, time
304.5400 s, mean 0.3045 s.
You can check with varying the parameters. In MY PC, the per process
RAM starts getting exhausted for a size 600x600 or more. and the mean
increases drastically! (The difference you will experience at some
other size!)
For curiosity, I have a Java program also. The performance support my
expectation (JVM just removes the call, as nobody uses the result!. And
it is such a beautiful hotspot !
JVM client 1.5 (Sun) 1000 iter, 500x500 matrix, 1.73 sec
1000 iter ,5000x5000 matrix 173.761 s
Now for C++
with blitz 0.9, VC 7.1 optimization /O2, & /G7 1000 iter, 500x500
matrix 2 sec (not very accurate, used C time.h library, it rounds off
to nearest sec )
1000 iter , 5000x5000 matrix 273 sec.
One point easily can be seen here,
C++ program allocates memory for C_ (check your blitz version ) only
once, while Matlab version (your earlier post) allocates in each loop.
The Java version I think too smart here, it KNOWS you have given him
some garbage to compute and never bothered about the result. May be it
just do not compute anything, or computes only a few times, or even
computes always, but writes only once! .
My programs are,
Matlab =same as your first post, I am not reposting the program,
C++ Blitz,
#include <blitz/array.h>
#include <iostream>
#include <time.h>
using namespace blitz;
using namespace std;
int main(){
int N = 5000;
int iter = 1000;
Array<double,2 A_(N,N,fortranArray) ;
Array<double,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<double,2 C_(N,N,fortranArray) ;
time_t time1 = time(NULL);
for(int i = 0; i< iter ;++i){
C_ = A_ + B_;
}
time_t time2 = time(NULL);
cout<<"time "<<time2-time1<<endl;
}
I dont have the stopwatch library ,replaced it by C time.h library
(where to get that library? )
Java version,
public class RunTest{
public static void main(String[] args){
int N = 5000;
int runs = 1000;
double[][] A = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
A[i][j] = 4;
}
}
double[][] B = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
B[i][j] = A[i][j]+6;
}
}
long time1 = System.currentTimeMillis();

double[][] C = new double[N][N];
for (int k = 0; k< runs ;++k){
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
C[i][j] = A[i][j]+B[i][j];
}
}
}
long time2 = System.currentTimeMillis();
System.out.println ((double)(time2-time1)*0.001);
}
}

Any conversional mistake should be reported.

Thanks
abir

Thanks
abir

Oct 24 '06 #18

Frank-O

Hi abir,

First of all I am using an ST application.
Matrix and Mat are specific Matrix types in newmat and me-shark library
Real is a typedef of double. Anyway I was talking about the hand coded
version not these ones obviously.

>2) Pointer version is always slower than non-pointer version. Pointer
reduces the scope of optimizatron.

Pointer or iterator enable me to use the STL which is famous to be
faster than one programmer...
I am not sure that pointer version is always slower than non-pointer
version, it depends...
When walking through a matrix array (let's say row major)
for i=0:N-1 and j...
a[i][j] + b[i][j] is equivalent to *(a+nbcolumns*i +j) +
*(b+nbcolumns*i +j) (pointers)
I think that *(_a++) + *(_b++) is faster because it doesn't use
multiplication

>Loop unrolling is done by compiler if needed at the
compile time statically ( as in Blitz)

For optimization I am not sure that compiler can do "dynamic" loop
unrolling...
if it does, does that make any big difference...

>It never depends on a language whether a program is faster or
slower. It depends on the compiler & compiler options, algorithms uses,
thye processor and related things, and the run time (Hmm runtime! That
has the info what is actually happening! In my PC for computation in
loops, Java always runs faster! )

Hmm, then why people say that C or Fortran is faster than C++, Lapack
is writen in fortran could it have been written in Java or VB ?? I am
not an expert so I can't tell.

any handwriten code will have nearly same performance as the library, if
proper compiler switch is used. And you hadn't sent your performance
test results!

ok if so, all the libraries must have the same performance, everybody
can write an addition or an element wise multiplication... but again
why the newmat library is so slow :...

Figures is not much important because my task is to be three times as
faster...
I just look after the ratio between them...
Anyway I will post my results this afternoon there is a lot of tests to
carry on...

Take care.

toton wrote:

Frank-O wrote:
Hmm,
Blitz++ is not so fast... even though you show me that it depends upon
the platform..
You should try also the others hand-coded versions with an iterator on
blitz++ data or a simple pointer on double[N][N]; and the STL
(transform)

so we'll get the full picture
Yes. One can not say a library is fast or slow in a general way. It
depends on a whole lots of thing. It depends on processor, cache, RAM,
OS, which type of program one is comparing as benchmark. Ther result
may differ even by a factor but never by an order!
But a few things here,
1) ATLAS (or BLAS or anything) doesn't do anything special for matrix
addition! None of the library will make a drastic change in result.
However you will feel it when using other Linear Algebra operations
(say matrix inverse, svd etc).
2) Pointer version is always slower than non-pointer version. Pointer
reduces the scope of optimizatron.
3) Iterator (is nothing but a pointer) is always slower than using
original container, as it adds onel level of inderection. Very good
compiler will however will reduce the gap, esp for simple iteration
loops. However iterator makes it generalized, one need not to worry
about the detail of container.
4) as STL operates on a different philosophy of iterator, container and
algorithm concept, where algorithm operates on iterator, which in turns
gives access to container, they are bound to be slower than BLAS or
ATLAS , but again by a factor only, not by an order. Same is true for
Blitz when one uses iterator.
5) I can't compile your code untill I know what is Real, what is
Matrix, and what is MAT.
However for first try transform is a real problem, as it evaluates a
function in each call and uses pointer to the matrix content. blitz
version I don't see any coding problem , unless you have something
different in Real (in it typedef double ?). 3rd version it depends on
m_add function how it is implemented, and the data(is it an 1D array or
2D ? if 2D do it do justice to cache spacial locality ? ) Final one, I
am not sure. Loop unrolling is done by compiler if needed at the
compile time statically ( as in Blitz) . But it looks in your case it
is done at run time! Again not very sure of that. BTW got your
stopwatch code. It is not a singleton class, but has some static reset.
be aware when using it in MT environment. And may be you are using a ST
application. Do check whether Matlab ATLAS library is ST or MT. There
will be a performance difference (however note Matlab itself is Single
Threaded). And don't forget to tune your compiler for the particular
processor. Matlab specific library are listed in blas.spec file.
6) It never depends on a language whether a program is faster or
slower. It depends on the compiler & compiler options, algorithms uses,
thye processor and related things, and the run time (Hmm runtime! That
has the info what is actually happening! In my PC for computation in
loops, Java always runs faster! )
7) Finally, matlab compiler is much much slower than C++ (or Java) . In
your example Matlab have no role in purformance. it is ATLAS (which is
written in FORTRAN, but do has port in other language) tuned for your
processor. (compiler I suspect Intel Fortran ) . However as for this
specific case (addition) the library has nothing great to offer, any
handwriten code will have nearly same performance as the library, if
proper compiler switch is used. And you hadn't sent your performance
test results!

I'm installing ATLAS to test also...
I will send you my class stopwatch at your email...
toton wrote:
Frank-O wrote:
I've tested Blitz++, Although it is a nice library, ---it does not
outperform anyone...
two computers were used.

1- laptop intel pentium M 1.70Ghz 504Mb
The Me-shark library ranks first along with the "dynamic unroll-loops
style" for any size N*N. matlab always is always a tad better.

2- desktop computer. AMD athlon XP2400+
1.99Ghz 320MB
The "dynamic unroll-loops style" ranks first along with the STL for any
size N*N.

There is one thing I can't understand when comparing the performance
of the computers.
when N (matrix size) is less than 400 , the AMD is three times faster
than intel, Conversely when N>400 intel is five times faster than AMD
!!!

Regarding my algorithm itself it is very fast on the computer 2 (AMD)
and very slow on the other....

using namespace blitz;
//BZ_USING_NAMESPACE(blitz)

using namespace std;

void Test_perf(int niter,int N)
{
//Time comparision for Matrix Addition

Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter, niter4=niter , niter5=niter;

while(niter--)
{
stopwatch sw("STL",niter0);

transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}

A.release(); B.release(); C.release();

// blitz++
Array<Real,2 A_(N,N,fortranArray) ;
Array<Real,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<Real,2 C_(N,N,fortranArray) ;

stopwatch::reset();
while(niter4--)
{
stopwatch sw("Blitz++ fortranArray",niter5);
C_ = A_ + B_;
}

// me-shark
MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;

A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;

Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;

fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );

stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}

m_free(A1);m_free(B1), m_free(C1);

Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();

stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}

A2.release(); B2.release(); C2.release();

Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();

stopwatch::reset();
uint32_t _nbelement;

while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);

while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}

switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();

}
//cout << C3 << endl;
A3.release();B3.release(); C3.release();

return;
Here are ther results for your program dith different test runs.
(Processor Pentium D, 3 GHz, RAM 1GB, os WinXP SP2, neither matlab nor
the C++ compiler uses the PentiumD features, WinXP doesnt support it,
if they support it can be made parallel very easily)
Matlab =RunTest(1000,500) 1000 iter, 500x500 matrix time 2.4460 s,
mean 0.0024 s
Matlab =RunTest(1000,5000) 1000 iter, 5000x5000 matrix, time
304.5400 s, mean 0.3045 s.
You can check with varying the parameters. In MY PC, the per process
RAM starts getting exhausted for a size 600x600 or more. and the mean
increases drastically! (The difference you will experience at some
other size!)
For curiosity, I have a Java program also. The performance support my
expectation (JVM just removes the call, as nobody uses the result!. And
it is such a beautiful hotspot !
JVM client 1.5 (Sun) 1000 iter, 500x500 matrix, 1.73 sec
1000 iter ,5000x5000 matrix 173.761 s
Now for C++
with blitz 0.9, VC 7.1 optimization /O2, & /G7 1000 iter, 500x500
matrix 2 sec (not very accurate, used C time.h library, it rounds off
to nearest sec )
1000 iter , 5000x5000 matrix 273 sec.
One point easily can be seen here,
C++ program allocates memory for C_ (check your blitz version ) only
once, while Matlab version (your earlier post) allocates in each loop.
The Java version I think too smart here, it KNOWS you have given him
some garbage to compute and never bothered about the result. May be it
just do not compute anything, or computes only a few times, or even
computes always, but writes only once! .
My programs are,
Matlab =same as your first post, I am not reposting the program,
C++ Blitz,
#include <blitz/array.h>
#include <iostream>
#include <time.h>
using namespace blitz;
using namespace std;
int main(){
int N = 5000;
int iter = 1000;
Array<double,2 A_(N,N,fortranArray) ;
Array<double,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<double,2 C_(N,N,fortranArray) ;
time_t time1 = time(NULL);
for(int i = 0; i< iter ;++i){
C_ = A_ + B_;
}
time_t time2 = time(NULL);
cout<<"time "<<time2-time1<<endl;
}
I dont have the stopwatch library ,replaced it by C time.h library
(where to get that library? )
Java version,
public class RunTest{
public static void main(String[] args){
int N = 5000;
int runs = 1000;
double[][] A = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
A[i][j] = 4;
}
}
double[][] B = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
B[i][j] = A[i][j]+6;
}
}
long time1 = System.currentTimeMillis();
>
double[][] C = new double[N][N];
for (int k = 0; k< runs ;++k){
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
C[i][j] = A[i][j]+B[i][j];
}
}
}
long time2 = System.currentTimeMillis();
System.out.println ((double)(time2-time1)*0.001);
}
}
>
Any conversional mistake should be reported.
>
Thanks
abir

Thanks
abir

Oct 25 '06 #19

Frank-O

>2) Pointer version is always slower than non-pointer version. Pointer
reduces the scope of optimizatron.

>Loop unrolling is done by compiler if needed at the
compile time statically ( as in Blitz)

For optimization I am not sure that compiler can do "dynamic" loop
unrolling...
if it does, does that make any big difference...

>It never depends on a language whether a program is faster or
slower. It depends on the compiler & compiler options, algorithms uses,
thye processor and related things, and the run time (Hmm runtime! That
has the info what is actually happening! In my PC for computation in
loops, Java always runs faster! )

Hmm, then why people say that C or Fortran is faster than C++, Lapack
is writen in fortran could it have been written in Java or VB ?? I am
not an expert so I can't tell.

any handwriten code will have nearly same performance as the library, if
proper compiler switch is used. And you hadn't sent your performance
test results!

Frank-O wrote:
Hmm,
Blitz++ is not so fast... even though you show me that it depends upon
the platform..
You should try also the others hand-coded versions with an iterator on
blitz++ data or a simple pointer on double[N][N]; and the STL
(transform)

so we'll get the full picture
Yes. One can not say a library is fast or slow in a general way. It
depends on a whole lots of thing. It depends on processor, cache, RAM,
OS, which type of program one is comparing as benchmark. Ther result
may differ even by a factor but never by an order!
But a few things here,
1) ATLAS (or BLAS or anything) doesn't do anything special for matrix
addition! None of the library will make a drastic change in result.
However you will feel it when using other Linear Algebra operations
(say matrix inverse, svd etc).
2) Pointer version is always slower than non-pointer version. Pointer
reduces the scope of optimizatron.
3) Iterator (is nothing but a pointer) is always slower than using
original container, as it adds onel level of inderection. Very good
compiler will however will reduce the gap, esp for simple iteration
loops. However iterator makes it generalized, one need not to worry
about the detail of container.
4) as STL operates on a different philosophy of iterator, container and
algorithm concept, where algorithm operates on iterator, which in turns
gives access to container, they are bound to be slower than BLAS or
ATLAS , but again by a factor only, not by an order. Same is true for
Blitz when one uses iterator.
5) I can't compile your code untill I know what is Real, what is
Matrix, and what is MAT.
However for first try transform is a real problem, as it evaluates a
function in each call and uses pointer to the matrix content. blitz
version I don't see any coding problem , unless you have something
different in Real (in it typedef double ?). 3rd version it depends on
m_add function how it is implemented, and the data(is it an 1D array or
2D ? if 2D do it do justice to cache spacial locality ? ) Final one, I
am not sure. Loop unrolling is done by compiler if needed at the
compile time statically ( as in Blitz) . But it looks in your case it
is done at run time! Again not very sure of that. BTW got your
stopwatch code. It is not a singleton class, but has some static reset.
be aware when using it in MT environment. And may be you are using a ST
application. Do check whether Matlab ATLAS library is ST or MT. There
will be a performance difference (however note Matlab itself is Single
Threaded). And don't forget to tune your compiler for the particular
processor. Matlab specific library are listed in blas.spec file.
6) It never depends on a language whether a program is faster or
slower. It depends on the compiler & compiler options, algorithms uses,
thye processor and related things, and the run time (Hmm runtime! That
has the info what is actually happening! In my PC for computation in
loops, Java always runs faster! )
7) Finally, matlab compiler is much much slower than C++ (or Java) . In
your example Matlab have no role in purformance. it is ATLAS (which is
written in FORTRAN, but do has port in other language) tuned for your
processor. (compiler I suspect Intel Fortran ) . However as for this
specific case (addition) the library has nothing great to offer, any
handwriten code will have nearly same performance as the library, if
proper compiler switch is used. And you hadn't sent your performance
test results!

I'm installing ATLAS to test also...
I will send you my class stopwatch at your email...
toton wrote:
Frank-O wrote:
I've tested Blitz++, Although it is a nice library, ---it does not
outperform anyone...
two computers were used.

1- laptop intel pentium M 1.70Ghz 504Mb
The Me-shark library ranks first along with the "dynamic unroll-loops
style" for any size N*N. matlab always is always a tad better.

2- desktop computer. AMD athlon XP2400+
1.99Ghz 320MB
The "dynamic unroll-loops style" ranks first along with the STL for any
size N*N.

There is one thing I can't understand when comparing the performance
of the computers.
when N (matrix size) is less than 400 , the AMD is three times faster
than intel, Conversely when N>400 intel is five times faster than AMD
!!!

Regarding my algorithm itself it is very fast on the computer 2 (AMD)
and very slow on the other....

using namespace blitz;
//BZ_USING_NAMESPACE(blitz)

using namespace std;

void Test_perf(int niter,int N)
{
//Time comparision for Matrix Addition

Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter, niter4=niter , niter5=niter;

while(niter--)
{
stopwatch sw("STL",niter0);

transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}

A.release(); B.release(); C.release();

// blitz++
Array<Real,2 A_(N,N,fortranArray) ;
Array<Real,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<Real,2 C_(N,N,fortranArray) ;

stopwatch::reset();
while(niter4--)
{
stopwatch sw("Blitz++ fortranArray",niter5);
C_ = A_ + B_;
}

// me-shark
MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;

A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;

Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;

fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );

stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}

m_free(A1);m_free(B1), m_free(C1);

Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();

stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}

A2.release(); B2.release(); C2.release();

Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();

stopwatch::reset();
uint32_t _nbelement;

while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);

while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}

switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();

}
//cout << C3 << endl;
A3.release();B3.release(); C3.release();

return;
Here are ther results for your program dith different test runs.
(Processor Pentium D, 3 GHz, RAM 1GB, os WinXP SP2, neither matlab nor
the C++ compiler uses the PentiumD features, WinXP doesnt support it,
if they support it can be made parallel very easily)
Matlab =RunTest(1000,500) 1000 iter, 500x500 matrix time 2.4460 s,
mean 0.0024 s
Matlab =RunTest(1000,5000) 1000 iter, 5000x5000 matrix, time
304.5400 s, mean 0.3045 s.
You can check with varying the parameters. In MY PC, the per process
RAM starts getting exhausted for a size 600x600 or more. and the mean
increases drastically! (The difference you will experience at some
other size!)
For curiosity, I have a Java program also. The performance support my
expectation (JVM just removes the call, as nobody uses the result!. And
it is such a beautiful hotspot !
JVM client 1.5 (Sun) 1000 iter, 500x500 matrix, 1.73 sec
1000 iter ,5000x5000 matrix 173.761 s
Now for C++
with blitz 0.9, VC 7.1 optimization /O2, & /G7 1000 iter, 500x500
matrix 2 sec (not very accurate, used C time.h library, it rounds off
to nearest sec )
1000 iter , 5000x5000 matrix 273 sec.
One point easily can be seen here,
C++ program allocates memory for C_ (check your blitz version ) only
once, while Matlab version (your earlier post) allocates in each loop.
The Java version I think too smart here, it KNOWS you have given him
some garbage to compute and never bothered about the result. May be it
just do not compute anything, or computes only a few times, or even
computes always, but writes only once! .
My programs are,
Matlab =same as your first post, I am not reposting the program,
C++ Blitz,
#include <blitz/array.h>
#include <iostream>
#include <time.h>
using namespace blitz;
using namespace std;
int main(){
int N = 5000;
int iter = 1000;
Array<double,2 A_(N,N,fortranArray) ;
Array<double,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<double,2 C_(N,N,fortranArray) ;
time_t time1 = time(NULL);
for(int i = 0; i< iter ;++i){
C_ = A_ + B_;
}
time_t time2 = time(NULL);
cout<<"time "<<time2-time1<<endl;
}
I dont have the stopwatch library ,replaced it by C time.h library
(where to get that library? )
Java version,
public class RunTest{
public static void main(String[] args){
int N = 5000;
int runs = 1000;
double[][] A = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
A[i][j] = 4;
}
}
double[][] B = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
B[i][j] = A[i][j]+6;
}
}
long time1 = System.currentTimeMillis();
>
double[][] C = new double[N][N];
for (int k = 0; k< runs ;++k){
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
C[i][j] = A[i][j]+B[i][j];
}
}
}
long time2 = System.currentTimeMillis();
System.out.println ((double)(time2-time1)*0.001);
}
}
>
Any conversional mistake should be reported.
>
Thanks
abir

Thanks
abir

Oct 25 '06 #20

toton

Frank-O wrote:

Hi abir,

First of all I am using an ST application.
Matrix and Mat are specific Matrix types in newmat and me-shark library
Real is a typedef of double. Anyway I was talking about the hand coded
version not these ones obviously.

2) Pointer version is always slower than non-pointer version. Pointer
reduces the scope of optimizatron.
Pointer or iterator enable me to use the STL which is famous to be
faster than one programmer...
I am not sure that pointer version is always slower than non-pointer
version, it depends...
When walking through a matrix array (let's say row major)
for i=0:N-1 and j...
a[i][j] + b[i][j] is equivalent to *(a+nbcolumns*i +j) +
*(b+nbcolumns*i +j) (pointers)
I think that *(_a++) + *(_b++) is faster because it doesn't use
multiplication

Given a pair of indices (i,j) you need atleast one multiplication to
know total number of elements!. And most of the compiler will make it
that way, i.e two index will get transformed to one. However for
multi-processing environment (or distributed one), the 2-index loop
will be easily distribute the computation , while the pointer version
will not. However I agree, some times the later one faster, esp on
intel processors & gcc compiler, but by a minor factor! . The reason is
not multiplication, but _a++ operation. Intel have a special addressing
mode for that , something like add & increment! check addl, subl etc
instructions. THis hint it gets from_a++ , but don't get from a[i][j]
and then i++ and j++ (or even a[i] and ++i ) ; Thus two instruction
gets merged into a single one! However most probably VC do it for both
cases (and there is no reason not to do it, if such instruction is
available).

Loop unrolling is done by compiler if needed at the
compile time statically ( as in Blitz)

For optimization I am not sure that compiler can do "dynamic" loop
unrolling...
if it does, does that make any big difference...

>It never depends on a language whether a program is faster or
slower. It depends on the compiler & compiler options, algorithms uses,
thye processor and related things, and the run time (Hmm runtime! That
has the info what is actually happening! In my PC for computation in
loops, Java always runs faster! )

Hmm, then why people say that C or Fortran is faster than C++, Lapack
is writen in fortran could it have been written in Java or VB ?? I am
not an expert so I can't tell.

Don't believe what people say! Do it by yourself and check. Check my
results, it shows Java is faster than fine tuned ATLAS library. And I
explained the reason also! Java has a runtime, and which knows what
exactly happening at that momemnt. It checks that same code you are
running again and again (they call it hotspot) , and thus it makes very
fast code for that portion using all advanced instruction. Which you
can't do in prior, as you dont know where it is going to run , a P4 or
P3 or CoreDuo or AMD :( Even after a few iteration it 'sees' nobody is
bothered about its result! What will it do? If it is intelligent enough
(and I think so!) it will get frustrated and simply stop computing?
simply because nobody bothers about the result! Thus it is faster.
Simply drop the loop count, you will see Java is slower. Because If you
run something once or twice, Java doesnt make optimized code, neither
check if it is really needed. And why Java is slow now? Just because it
has a runtime (The same reason it was faster for a large count ! )
which adds one level of indirection. That is the thumb rule. Compile it
dircetly for your specific platform just like C or FORTRAN, and drop
the additional seafty checks (like array bounds check, RTTI etc) you
will see it has same performance (or exactly same performace esp for
GCC, which has a diffrent front-end for different language, the backend
is same for all, be it f77,f90, CC, CPP, Java, gnat or whatever) . The
minor differences are differences with different compilers.
Thus LAPACK can be virtually written in any language, and performance
will be same, provided they have "same algorithm" and "same level of
compiler", no extra check, no extra feature. After all, language is a
syntax to represent something, compiler translates it to actual machine
language.

>
any handwriten code will have nearly same performance as the library, if
proper compiler switch is used. And you hadn't sent your performance
test results!

ok if so, all the libraries must have the same performance, everybody
can write an addition or an element wise multiplication... but again
why the newmat library is so slow :...

I hadn't used newmat library. Thus I can not comment. If it is a
open-source library, you can check the code. But again, from my example
and results, none of them are slower than matlab call. And ther is no
reason to be! Infact Matlab call has a reason to be slower, it calls
the ATLAS using a function ("builtin") and after a few check gives a
pointer to the array real data to the library! Thus it can be slower
than C++ version. It also slower than Java version (in this particular
case) as its JIT is not as strong as Java's (I have a matlab 7.0
version in office where I found 4-5 bugs in the JIT compiler, reported
a few! )

Figures is not much important because my task is to be three times as
faster...
I just look after the ratio between them...

I also do not like absolute figures! Just a comparison is ok (but ratio
is not a proper way to say I think! mean/sd is a better measure? I am
not a statistics person! )

Anyway I will post my results this afternoon there is a lot of tests to
carry on...

Take care.

Bye

>

toton wrote:
Frank-O wrote:
Hmm,
Blitz++ is not so fast... even though you show me that it depends upon
the platform..
You should try also the others hand-coded versions with an iterator on
blitz++ data or a simple pointer on double[N][N]; and the STL
(transform)
>
so we'll get the full picture
Yes. One can not say a library is fast or slow in a general way. It
depends on a whole lots of thing. It depends on processor, cache, RAM,
OS, which type of program one is comparing as benchmark. Ther result
may differ even by a factor but never by an order!
But a few things here,
1) ATLAS (or BLAS or anything) doesn't do anything special for matrix
addition! None of the library will make a drastic change in result.
However you will feel it when using other Linear Algebra operations
(say matrix inverse, svd etc).
2) Pointer version is always slower than non-pointer version. Pointer
reduces the scope of optimizatron.
3) Iterator (is nothing but a pointer) is always slower than using
original container, as it adds onel level of inderection. Very good
compiler will however will reduce the gap, esp for simple iteration
loops. However iterator makes it generalized, one need not to worry
about the detail of container.
4) as STL operates on a different philosophy of iterator, container and
algorithm concept, where algorithm operates on iterator, which in turns
gives access to container, they are bound to be slower than BLAS or
ATLAS , but again by a factor only, not by an order. Same is true for
Blitz when one uses iterator.
5) I can't compile your code untill I know what is Real, what is
Matrix, and what is MAT.
However for first try transform is a real problem, as it evaluates a
function in each call and uses pointer to the matrix content. blitz
version I don't see any coding problem , unless you have something
different in Real (in it typedef double ?). 3rd version it depends on
m_add function how it is implemented, and the data(is it an 1D array or
2D ? if 2D do it do justice to cache spacial locality ? ) Final one, I
am not sure. Loop unrolling is done by compiler if needed at the
compile time statically ( as in Blitz) . But it looks in your case it
is done at run time! Again not very sure of that. BTW got your
stopwatch code. It is not a singleton class, but has some static reset.
be aware when using it in MT environment. And may be you are using a ST
application. Do check whether Matlab ATLAS library is ST or MT. There
will be a performance difference (however note Matlab itself is Single
Threaded). And don't forget to tune your compiler for the particular
processor. Matlab specific library are listed in blas.spec file.
6) It never depends on a language whether a program is faster or
slower. It depends on the compiler & compiler options, algorithms uses,
thye processor and related things, and the run time (Hmm runtime! That
has the info what is actually happening! In my PC for computation in
loops, Java always runs faster! )
7) Finally, matlab compiler is much much slower than C++ (or Java) . In
your example Matlab have no role in purformance. it is ATLAS (which is
written in FORTRAN, but do has port in other language) tuned for your
processor. (compiler I suspect Intel Fortran ) . However as for this
specific case (addition) the library has nothing great to offer, any
handwriten code will have nearly same performance as the library, if
proper compiler switch is used. And you hadn't sent your performance
test results!

I'm installing ATLAS to test also...
I will send you my class stopwatch at your email...
>
>
toton wrote:
Frank-O wrote:
I've tested Blitz++, Although it is a nice library, ---it does not
outperform anyone...
two computers were used.
>
1- laptop intel pentium M 1.70Ghz 504Mb
The Me-shark library ranks first along with the "dynamic unroll-loops
style" for any size N*N. matlab always is always a tad better.
>
2- desktop computer. AMD athlon XP2400+
1.99Ghz 320MB
The "dynamic unroll-loops style" ranks first along with the STL for any
size N*N.
>
There is one thing I can't understand when comparing the performance
of the computers.
when N (matrix size) is less than 400 , the AMD is three times faster
than intel, Conversely when N>400 intel is five times faster than AMD
!!!
>
Regarding my algorithm itself it is very fast on the computer 2 (AMD)
and very slow on the other....
>
>
using namespace blitz;
//BZ_USING_NAMESPACE(blitz)
>
using namespace std;
>
void Test_perf(int niter,int N)
{
//Time comparision for Matrix Addition
>
Matrix A(N,N);A=4;
Matrix B(A); B=10;
Matrix C(N,N);
uint32_t niter0 = niter , niter1 = niter, niter2=niter,
niter3=niter, niter4=niter , niter5=niter;
>
while(niter--)
{
stopwatch sw("STL",niter0);
>
transform(A.data(),A.data()+N*N,B.data(),C.data(), plus<Real>());
}
>
A.release(); B.release(); C.release();
>
>
// blitz++
Array<Real,2 A_(N,N,fortranArray) ;
Array<Real,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<Real,2 C_(N,N,fortranArray) ;
>
stopwatch::reset();
while(niter4--)
{
stopwatch sw("Blitz++ fortranArray",niter5);
C_ = A_ + B_;
}
>
>
>
// me-shark
MAT *A1=MNULL,*B1=MNULL,*C1=MNULL;
>
A1 = m_get(N,N);
B1 = m_get(N,N);
C1 = m_get(N,N);
const uint32_t N2=N*N;
>
Real *_a = A1->base;
Real *_b = B1->base;
Real *_c = C1->base;
>
fill(_a,_a+N2, 4 );
fill(_b,_b+N2, 10 );
>
stopwatch::reset();
while(niter0--)
{
stopwatch sw("Meschach",niter1);
C1=m_add(A1,B1,C1);
}
>
m_free(A1);m_free(B1), m_free(C1);
>
Matrix A2(N,N);A2=4;
Matrix B2(N,N);B2=10;
Matrix C2(N,N);
Real *_a2 = A2.data();
Real *_b2 = B2.data();
Real *_c2 = C2.data();
>
stopwatch::reset();
while(niter1--)
{
{
stopwatch sw("simple style",niter2);
for(uint32_t i =1;i<=N2;i++)
*_c2++ = *_a2++ + *_b2++;
}
_c2=C2.data(); _a2=A2.data(); _b2=B2.data();
}
>
A2.release(); B2.release(); C2.release();
>
Matrix A3(N,N);A3=4;
Matrix B3(N,N);B3=10;
Matrix C3(N,N);
Real *_a3 = A3.data();
Real *_b3 = B3.data();
Real *_c3 = C3.data();
>
stopwatch::reset();
uint32_t _nbelement;
>
while(niter2--)
{
{
stopwatch sw("unroll-loops style",niter3);
_nbelement = (N2>>BLOCKELEMENT)<<BLOCKELEMENT;
uint32_t indice(0);
>
while(indice<_nbelement)
{
*_c3++ = *_a3++ + *_b3++; *_c3++ = *_a3++ +
*_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
*_c3++ = *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;*_c3++
= *_a3++ + *_b3++;*_c3++ = *_a3++ + *_b3++;
indice+=BLOCKSIZE;
}
>
switch(N2-indice)
{
case 7 : *_c3++ = *_a3++ + *_b3++;indice++;
case 6 : *_c3++ = *_a3++ + *_b3++;indice++;
case 5 : *_c3++ = *_a3++ + *_b3++;indice++;
case 4 : *_c3++ = *_a3++ + *_b3++;indice++;
case 3 : *_c3++ = *_a3++ + *_b3++;indice++;
case 2 : *_c3++ = *_a3++ + *_b3++;indice++;
case 1 : *_c3++ = *_a3++ + *_b3++;indice++;
}
}
_c3=C3.data(); _a3=A3.data(); _b3=B3.data();
>
}
//cout << C3 << endl;
A3.release();B3.release(); C3.release();
>
return;
Here are ther results for your program dith different test runs.
(Processor Pentium D, 3 GHz, RAM 1GB, os WinXP SP2, neither matlab nor
the C++ compiler uses the PentiumD features, WinXP doesnt support it,
if they support it can be made parallel very easily)
Matlab =RunTest(1000,500) 1000 iter, 500x500 matrix time 2.4460 s,
mean 0.0024 s
Matlab =RunTest(1000,5000) 1000 iter, 5000x5000 matrix, time
304.5400 s, mean 0.3045 s.
You can check with varying the parameters. In MY PC, the per process
RAM starts getting exhausted for a size 600x600 or more. and the mean
increases drastically! (The difference you will experience at some
other size!)
For curiosity, I have a Java program also. The performance support my
expectation (JVM just removes the call, as nobody uses the result!. And
it is such a beautiful hotspot !
JVM client 1.5 (Sun) 1000 iter, 500x500 matrix, 1.73 sec
1000 iter ,5000x5000 matrix 173.761 s
Now for C++
with blitz 0.9, VC 7.1 optimization /O2, & /G7 1000 iter, 500x500
matrix 2 sec (not very accurate, used C time.h library, it rounds off
to nearest sec )
1000 iter , 5000x5000 matrix 273 sec.
One point easily can be seen here,
C++ program allocates memory for C_ (check your blitz version ) only
once, while Matlab version (your earlier post) allocates in each loop.
The Java version I think too smart here, it KNOWS you have given him
some garbage to compute and never bothered about the result. May be it
just do not compute anything, or computes only a few times, or even
computes always, but writes only once! .
My programs are,
Matlab =same as your first post, I am not reposting the program,
C++ Blitz,
#include <blitz/array.h>
#include <iostream>
#include <time.h>
using namespace blitz;
using namespace std;
int main(){
int N = 5000;
int iter = 1000;
Array<double,2 A_(N,N,fortranArray) ;
Array<double,2 B_(N,N,fortranArray) ;
A_ = 4;
B_ = A_ + 6;
Array<double,2 C_(N,N,fortranArray) ;
time_t time1 = time(NULL);
for(int i = 0; i< iter ;++i){
C_ = A_ + B_;
}
time_t time2 = time(NULL);
cout<<"time "<<time2-time1<<endl;
}
I dont have the stopwatch library ,replaced it by C time.h library
(where to get that library? )
Java version,
public class RunTest{
public static void main(String[] args){
int N = 5000;
int runs = 1000;
double[][] A = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
A[i][j] = 4;
}
}
double[][] B = new double[N][N];
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
B[i][j] = A[i][j]+6;
}
}
long time1 = System.currentTimeMillis();

double[][] C = new double[N][N];
for (int k = 0; k< runs ;++k){
for(int i = 0; i< N;++i){
for(int j = 0; j< N ;++j){
C[i][j] = A[i][j]+B[i][j];
}
}
}
long time2 = System.currentTimeMillis();
System.out.println ((double)(time2-time1)*0.001);
}
}

Any conversional mistake should be reported.

Thanks
abir
Thanks
abir

Oct 25 '06 #21

Matrix library where's the speed ?

Similar topics