Matrix optimization

Pat
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat
Jul 22 '05 #1
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat


This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per se; I suggest you ask in
discussion groups related directly to the platform you're targeting
to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest
approach is probably something like this.

template <typename T, int Rows, int Cols>
void Add(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        Ap[i] = Bp[i] + Cp[i];
    }
}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;

*** Note - I did not syntax check it.
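
For what it's worth, a quick usage sketch (my own toy example, equally unchecked) would look something like this:

#include <iostream>

// Add() as defined above

int main()
{
    int B[2][3] = { { 1, 2, 3 }, { 4, 5, 6 } };
    int C[2][3] = { { 6, 5, 4 }, { 3, 2, 1 } };
    int A[2][3];

    Add( A, B, C );   // Rows and Cols are deduced from the array types

    std::cout << A[0][0] << ' ' << A[1][2] << '\n';   // prints "7 7"
}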

Jul 22 '05 #2
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?


(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        A[i][j] = B[i][j] + C[i][j];

becomes

for (int i = 0; i < M; ++i) {
    TYPE *pa = A[i], *pb = B[i], *pc = C[i];
    for (int j = 0; j < N; ++j)
        *pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by [i], but I'll leave it
to you to figure out.

(c) Consider that "premature optimization is the root of all evil".
The [i][j] form, while it may not be extremely efficient, at least
conveys the intent more clearly than *pa++ = ...

Victor
Jul 22 '05 #3
Gianni Mariani wrote:
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?
Thanks.
Pat


This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per se; I suggest you ask in
discussion groups related directly to the platform you're targeting
to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest
approach is probably something like this.

template <typename T, int Rows, int Cols>
void Add(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        Ap[i] = Bp[i] + Cp[i];
    }
}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;


Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
    *Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

V
Jul 22 '05 #4
Victor Bazarov wrote:
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?

(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        A[i][j] = B[i][j] + C[i][j];

becomes

for (int i = 0; i < M; ++i) {
    TYPE *pa = A[i], *pb = B[i], *pc = C[i];
    for (int j = 0; j < N; ++j)
        *pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by [i], but I'll leave it
to you to figure out.


With a good optimizing compiler it is possible that it makes no
difference at all...
(c) Consider that "premature optimization is the root of all evil".
The [i][j] form, while it may not be extremely efficient, at least
conveys the intent more clearly than *pa++ = ...

Also you may want to consider using a matrix library rather than
inventing your own.
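
For example, even the standard library's std::valarray gives you element-wise
arithmetic without hand-written loops. A rough sketch, storing each matrix as a
flat array (not a substitute for a real matrix library):

#include <cstddef>
#include <iostream>
#include <valarray>

int main()
{
    const std::size_t rows = 3, cols = 4;

    // each matrix stored as a flat valarray of rows*cols elements
    std::valarray<double> B(1.0, rows * cols);
    std::valarray<double> C(2.0, rows * cols);

    std::valarray<double> A = B + C;   // element-wise addition, no explicit loop

    // element (i,j) lives at index i*cols + j
    std::cout << A[1 * cols + 2] << '\n';   // prints 3
}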

--
Peter van Merkerk
peter.van.merkerk(at)dse.nl
Jul 22 '05 #5
Peter van Merkerk wrote:
[..]

Also you may want to consider using a matrix library rather than
inventing your own.


That's reasonable, unless the OP's goal is to create a matrix library
for others to consider.
Jul 22 '05 #6
Victor Bazarov wrote:
Gianni Mariani wrote:
....

Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
    *Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...


It might, it might not. (And it is not faster in the test below in the
default case, but it is faster when unrolling on my 2 GHz Athlon.)

Some architectures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offset) addressing mode, then the indexed
version might be faster because there is a single increment (and shift)
instead of four increments.

On an Athlon64 machine with gcc 3.3.3, without unrolling, indexing is 25%
faster. With unrolling, it seems the gcc compiler likes the pointer code
better, and the pointer code is 15% faster.

Timings for the code below:
g++ (GCC) 3.4.0
g++ -O3 -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
3.070u 0.010s 0:03.08 100.0% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
4.080u 0.000s 0:04.08 100.0% 0+0k 0+0io 164pf+0w

- index code is faster

now with unrolling

g++ -O3 -funroll-loops -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
2.330u 0.000s 0:02.31 100.8% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
2.080u 0.010s 0:02.08 100.4% 0+0k 0+0io 164pf+0w

The compiler does a pretty bad job of unrolling index loops!

If you do a disassembly, you'll see that the four increments cost more.
Sure, you could eliminate one increment, and that might result in faster code
in the non-unrolling case but slower code in the unrolling case, because
knowing when to terminate is a little harder for the compiler to figure out.

Anyhow, I think I made the point that "it depends". In general, in
numeric code when you're indexing into multiple arrays, indexing might
be faster than manipulating multiple pointers.

Having had a look at the assembler code, I think the compiler could do a much
better job of unrolling the index code. I suspect other compilers might
actually do a much better job.
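
For comparison, manual four-way unrolling of the index loop would look roughly
like this (a sketch only, assuming the element count is a multiple of 4, which
holds for the 20x20 case below):

// Hand-unrolled variant of the indexed loop (untested sketch).
// Assumes Rows * Cols is a multiple of 4.
template <typename T, int Rows, int Cols>
inline void Add4(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; i += 4 )
    {
        Ap[i]     = Bp[i]     + Cp[i];
        Ap[i + 1] = Bp[i + 1] + Cp[i + 1];
        Ap[i + 2] = Bp[i + 2] + Cp[i + 2];
        Ap[i + 3] = Bp[i + 3] + Cp[i + 3];
    }
}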

---- the code ----

template <typename T, int Rows, int Cols>
inline void Add(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        Ap[i] = Bp[i] + Cp[i];
    }
}

template <typename T, int Rows, int Cols>
inline void Addp(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * Ap = &A[0][0];
    const T * Bp = &B[0][0];
    const T * Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        *Ap++ = *Bp++ + *Cp++;
    }
}

enum {
    rows = 20,
    cols = 20
};

typedef int matrix_t[ rows ][ cols ];

void foo(matrix_t & A, matrix_t & B, matrix_t & C)
{
    Add( A, B, C );
}

void foop(matrix_t & A, matrix_t & B, matrix_t & C)
{
    Addp( A, B, C );
}

matrix_t A, B, C;

int main( int argc, char ** argv )
{
    void ( * foof )(matrix_t & A, matrix_t & B, matrix_t & C);

    foof = argc == 2 ? & foop : & foo;

    for ( int i = 0; i < 5000000; i++ )
    {
        ( *foof )( A, B, C );
    }
}
Jul 22 '05 #7
Gianni Mariani wrote:
Victor Bazarov wrote:
Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
    *Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

It might, it might not. (And it is not faster in the test below in the
default case, but it is faster when unrolling on my 2 GHz Athlon.)

Some architectures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offset) addressing mode, then the indexed
version might be faster because there is a single increment (and shift)
instead of four increments.


It is not just a matter of processor architecture. The compiler can do
some optimization tricks as well, like replacing indexing operations
with pointer arithmetic.

--
Peter van Merkerk
peter.van.merkerk(at)dse.nl
Jul 22 '05 #8
On Thu, 29 Jul 2004 01:31:19 +0800, "Pat" <Pa*@Pat.com> wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?


Generally the best way to optimize matrix operations is to merge
different loops into one loop, which is usually done with the
assistance of expression templates. With a decent compiler,
micro-optimization of an individual loop is not going to give you
order of magnitude increases in performance, and that kind of
optimization should come only once you've done the algorithmic
improvements and found it's still not fast enough.
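
To illustrate the idea (this is just a toy sketch with made-up names, not any
particular library): an expression template lets something like A = B + C + D
evaluate in a single merged loop, with no temporary matrices:

#include <cstddef>
#include <iostream>
#include <vector>

// "Addition node" that evaluates lazily, element by element.
template <typename L, typename R>
struct MatSum {
    const L & l;
    const R & r;
    MatSum(const L & ll, const R & rr) : l(ll), r(rr) {}
    double operator()(std::size_t i, std::size_t j) const { return l(i, j) + r(i, j); }
};

struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    double & operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    double   operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }

    // Assigning from any expression runs one merged loop over all elements.
    template <typename Expr>
    Matrix & operator=(const Expr & e) {
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                (*this)(i, j) = e(i, j);
        return *this;
    }
};

// Building B + C + D just builds a tree of MatSum nodes; no work happens yet.
template <typename L, typename R>
MatSum<L, R> operator+(const L & l, const R & r) { return MatSum<L, R>(l, r); }

int main() {
    Matrix A(2, 2), B(2, 2), C(2, 2), D(2, 2);
    B(0, 0) = 1; C(0, 0) = 2; D(0, 0) = 3;

    A = B + C + D;                  // one pass over the elements, no temporaries
    std::cout << A(0, 0) << '\n';   // prints 6
}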

There are a lot of good matrix maths libraries available. A good
starting point is www.oonumerics.org, or you could go the boost route:
http://www.boost.org/libs/numeric/ublas/doc/index.htm
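
If I remember the uBLAS interface correctly, basic usage is roughly along
these lines (check the documentation linked above; my recollection may be off
in the details):

#include <iostream>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/io.hpp>

int main()
{
    namespace ublas = boost::numeric::ublas;

    ublas::matrix<double> A(3, 4), B(3, 4), C(3, 4);

    for (unsigned i = 0; i < B.size1(); ++i)
        for (unsigned j = 0; j < B.size2(); ++j)
        {
            B(i, j) = i + j;
            C(i, j) = i * j;
        }

    A = B + C;              // element-wise matrix addition

    std::cout << A << '\n'; // io.hpp provides a stream inserter
}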

Tom
Jul 22 '05 #9
On Thu, 29 Jul 2004 01:31:19 +0800, "Pat" <Pa*@Pat.com> wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat

One thing to be aware of is whether your system stores matrices in row-major
or column-major order (yes, I know there are other ways to store a matrix).
If it uses one of these two methods, then you should time the two different
ways of setting up the i and j loops:

// i outside, j inside
for (int i = 0; i < IMAX; ++i) {
    for (int j = 0; j < JMAX; ++j) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

and

// j outside, i inside
for (int j = 0; j < JMAX; ++j) {
    for (int i = 0; i < IMAX; ++i) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

There may be a difference, and if there is then it is worth knowing
since the code change is so simple. This is also worth trying with
some of the other methods suggested.
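
A quick-and-dirty way to time the two orderings (just a sketch using
std::clock; the sizes and repeat count are arbitrary, and a clever optimizer
may still skew the results):

#include <cstdio>
#include <ctime>

enum { IMAX = 500, JMAX = 500, REPS = 100 };

static double A[IMAX][JMAX], B[IMAX][JMAX], C[IMAX][JMAX];

int main()
{
    std::clock_t t0 = std::clock();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < IMAX; ++i)          // i outside, j inside
            for (int j = 0; j < JMAX; ++j)
                A[i][j] = B[i][j] + C[i][j];

    std::clock_t t1 = std::clock();
    for (int r = 0; r < REPS; ++r)
        for (int j = 0; j < JMAX; ++j)          // j outside, i inside
            for (int i = 0; i < IMAX; ++i)
                A[i][j] = B[i][j] + C[i][j];

    std::clock_t t2 = std::clock();

    std::printf("i-outer: %.2f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
    std::printf("j-outer: %.2f s\n", double(t2 - t1) / CLOCKS_PER_SEC);
    std::printf("(ignore) %f\n", A[IMAX - 1][JMAX - 1]);   // keep the stores observable
}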

rossum

--

The ultimate truth is that there is no Ultimate Truth
Jul 22 '05 #10
