Matrix optimization

Pat
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat
Jul 22 '05 #1
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat


This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per se; I suggest you ask in
discussion groups related directly to the platform you're targeting
to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest
approach is probably something like this.

template <typename T, int Rows, int Cols>
void Add(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        Ap[i] = Bp[i] + Cp[i];
    }
}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;

*** Note - I did not syntax check it.
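
For what it's worth, a quick usage sketch (my own toy example, equally unchecked) would look something like this:

#include <iostream>

// Add() as defined above

int main()
{
    int B[2][3] = { { 1, 2, 3 }, { 4, 5, 6 } };
    int C[2][3] = { { 6, 5, 4 }, { 3, 2, 1 } };
    int A[2][3];

    Add( A, B, C );   // Rows and Cols are deduced from the array types

    std::cout << A[0][0] << ' ' << A[1][2] << '\n';   // prints "7 7"
}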

Jul 22 '05 #2
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?


(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        A[i][j] = B[i][j] + C[i][j];

becomes

for (int i = 0; i < M; ++i) {
    TYPE *pa = A[i], *pb = B[i], *pc = C[i];
    for (int j = 0; j < N; ++j)
        *pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by [i], but I'll leave it
to you to figure out.

(c) Consider that "premature optimization is the root of all evil".
The [i][j] form, while it may not be extremely efficient, at least
conveys the intent more clearly than *pa++ = ...

Victor
Jul 22 '05 #3
Gianni Mariani wrote:
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?
Thanks.
Pat


This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per se; I suggest you ask in
discussion groups related directly to the platform you're targeting
to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest
approach is probably something like this.

template <typename T, int Rows, int Cols>
void Add(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        Ap[i] = Bp[i] + Cp[i];
    }
}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;


Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
    *Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

V
Jul 22 '05 #4
Victor Bazarov wrote:
Pat wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?

(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        A[i][j] = B[i][j] + C[i][j];

becomes

for (int i = 0; i < M; ++i) {
    TYPE *pa = A[i], *pb = B[i], *pc = C[i];
    for (int j = 0; j < N; ++j)
        *pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by [i], but I'll leave it
to you to figure out.


With a good optimizing compiler it is possible that it makes no
difference at all...
(c) Consider that "premature optimization is the root of all evil".
The [i][j] form, while it may not be extremely efficient, at least
conveys the intent more clearly than *pa++ = ...

Also you may want to consider using a matrix library rather than
inventing your own.
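
For example, even the standard library's std::valarray gives you element-wise
arithmetic without hand-written loops. A rough sketch, storing each matrix as a
flat array (not a substitute for a real matrix library):

#include <cstddef>
#include <iostream>
#include <valarray>

int main()
{
    const std::size_t rows = 3, cols = 4;

    // each matrix stored as a flat valarray of rows*cols elements
    std::valarray<double> B(1.0, rows * cols);
    std::valarray<double> C(2.0, rows * cols);

    std::valarray<double> A = B + C;   // element-wise addition, no explicit loop

    // element (i,j) lives at index i*cols + j
    std::cout << A[1 * cols + 2] << '\n';   // prints 3
}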

--
Peter van Merkerk
peter.van.merkerk(at)dse.nl
Jul 22 '05 #5
Peter van Merkerk wrote:
[..]

Also you may want to consider using a matrix library rather than
inventing your own.


That's reasonable, unless the OP's goal is to create a matrix library
for others to consider.
Jul 22 '05 #6
Victor Bazarov wrote:
Gianni Mariani wrote:
....

Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
    *Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...


It might, it might not. (And it is not faster in the test below in the
default case, but it is faster when unrolling on my 2 GHz Athlon.)

Some architectures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offset) addressing mode, then the indexed
version might be faster because there is a single increment (and shift)
instead of four increments.

On an Athlon64 machine with gcc 3.3.3, without unrolling, indexing is 25%
faster. With unrolling, it seems the gcc compiler likes the pointer code
better, and the pointer code is 15% faster.

Timings for the code below:
g++ (GCC) 3.4.0
g++ -O3 -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
3.070u 0.010s 0:03.08 100.0% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
4.080u 0.000s 0:04.08 100.0% 0+0k 0+0io 164pf+0w

- index code is faster

now with unrolling

g++ -O3 -funroll-loops -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
2.330u 0.000s 0:02.31 100.8% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
2.080u 0.010s 0:02.08 100.4% 0+0k 0+0io 164pf+0w

The compiler does a pretty bad job of unrolling index loops!

If you do a disassembly, you'll see that the four increments cost more.
Sure, you could eliminate one increment, and that might result in faster code
in the non-unrolling case but slower code in the unrolling case, because
knowing when to terminate is a little harder for the compiler to figure out.

Anyhow, I think I made the point that "it depends". In general, in
numeric code when you're indexing into multiple arrays, indexing might
be faster than manipulating multiple pointers.

Having had a look at the assembler code, I think the compiler could do a much
better job of unrolling the index code. I suspect other compilers might
actually do a much better job.
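
For comparison, manual four-way unrolling of the index loop would look roughly
like this (a sketch only, assuming the element count is a multiple of 4, which
holds for the 20x20 case below):

// Hand-unrolled variant of the indexed loop (untested sketch).
// Assumes Rows * Cols is a multiple of 4.
template <typename T, int Rows, int Cols>
inline void Add4(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; i += 4 )
    {
        Ap[i]     = Bp[i]     + Cp[i];
        Ap[i + 1] = Bp[i + 1] + Cp[i + 1];
        Ap[i + 2] = Bp[i + 2] + Cp[i + 2];
        Ap[i + 3] = Bp[i + 3] + Cp[i + 3];
    }
}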

---- the code ----

template <typename T, int Rows, int Cols>
inline void Add(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * const Ap = &A[0][0];
    const T * const Bp = &B[0][0];
    const T * const Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        Ap[i] = Bp[i] + Cp[i];
    }
}

template <typename T, int Rows, int Cols>
inline void Addp(
    T (&A)[Rows][Cols],
    const T (&B)[Rows][Cols],
    const T (&C)[Rows][Cols]
) {
    T * Ap = &A[0][0];
    const T * Bp = &B[0][0];
    const T * Cp = &C[0][0];

    const int count = Rows * Cols;

    for ( int i = 0; i < count; ++i )
    {
        *Ap++ = *Bp++ + *Cp++;
    }
}

enum {
    rows = 20,
    cols = 20
};

typedef int matrix_t[ rows ][ cols ];

void foo(matrix_t & A, matrix_t & B, matrix_t & C)
{
    Add( A, B, C );
}

void foop(matrix_t & A, matrix_t & B, matrix_t & C)
{
    Addp( A, B, C );
}

matrix_t A, B, C;

int main( int argc, char ** argv )
{
    void ( * foof )(matrix_t & A, matrix_t & B, matrix_t & C);

    foof = argc == 2 ? & foop : & foo;

    for ( int i = 0; i < 5000000; i++ )
    {
        ( *foof )( A, B, C );
    }
}
Jul 22 '05 #7
Gianni Mariani wrote:
Victor Bazarov wrote:
Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
    *Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

It might, it might not. (And it is not faster in the test below in the
default case, but it is faster when unrolling on my 2 GHz Athlon.)

Some architectures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offset) addressing mode, then the indexed
version might be faster because there is a single increment (and shift)
instead of four increments.


It is not just a matter of processor architecture. The compiler can do
some optimization tricks as well, like replacing indexing operations
with pointer arithmetic.

--
Peter van Merkerk
peter.van.merkerk(at)dse.nl
Jul 22 '05 #8
On Thu, 29 Jul 2004 01:31:19 +0800, "Pat" <Pa*@Pat.com> wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?


Generally the best way to optimize matrix operations is to merge
different loops into one loop, which is usually done with the
assistance of expression templates. With a decent compiler,
micro-optimization of an individual loop is not going to give you
order of magnitude increases in performance, and that kind of
optimization should come only once you've done the algorithmic
improvements and found it's still not fast enough.
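
To illustrate the idea (this is just a toy sketch with made-up names, not any
particular library): an expression template lets something like A = B + C + D
evaluate in a single merged loop, with no temporary matrices:

#include <cstddef>
#include <iostream>
#include <vector>

// "Addition node" that evaluates lazily, element by element.
template <typename L, typename R>
struct MatSum {
    const L & l;
    const R & r;
    MatSum(const L & ll, const R & rr) : l(ll), r(rr) {}
    double operator()(std::size_t i, std::size_t j) const { return l(i, j) + r(i, j); }
};

struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    double & operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    double   operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }

    // Assigning from any expression runs one merged loop over all elements.
    template <typename Expr>
    Matrix & operator=(const Expr & e) {
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                (*this)(i, j) = e(i, j);
        return *this;
    }
};

// Building B + C + D just builds a tree of MatSum nodes; no work happens yet.
template <typename L, typename R>
MatSum<L, R> operator+(const L & l, const R & r) { return MatSum<L, R>(l, r); }

int main() {
    Matrix A(2, 2), B(2, 2), C(2, 2), D(2, 2);
    B(0, 0) = 1; C(0, 0) = 2; D(0, 0) = 3;

    A = B + C + D;                  // one pass over the elements, no temporaries
    std::cout << A(0, 0) << '\n';   // prints 6
}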

There are a lot of good matrix maths libraries available. A good
starting point is www.oonumerics.org, or you could go the boost route:
http://www.boost.org/libs/numeric/ublas/doc/index.htm
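
If I remember the uBLAS interface correctly, basic usage is roughly along
these lines (check the documentation linked above; my recollection may be off
in the details):

#include <iostream>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/io.hpp>

int main()
{
    namespace ublas = boost::numeric::ublas;

    ublas::matrix<double> A(3, 4), B(3, 4), C(3, 4);

    for (unsigned i = 0; i < B.size1(); ++i)
        for (unsigned j = 0; j < B.size2(); ++j)
        {
            B(i, j) = i + j;
            C(i, j) = i * j;
        }

    A = B + C;              // element-wise matrix addition

    std::cout << A << '\n'; // io.hpp provides a stream inserter
}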

Tom
Jul 22 '05 #9
On Thu, 29 Jul 2004 01:31:19 +0800, "Pat" <Pa*@Pat.com> wrote:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[i][j]=B[i][j]+C[i][j] for each i,j

The easiest way is to use a for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat

One thing to be aware of is whether your system stores matrices in row-major
or column-major order (yes, I know there are other ways to store a matrix).
If it uses one of these two methods, then you should time the two different
ways of setting up the i and j loops:

// i outside, j inside
for (int i = 0; i < IMAX; ++i) {
    for (int j = 0; j < JMAX; ++j) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

and

// j outside, i inside
for (int j = 0; j < JMAX; ++j) {
    for (int i = 0; i < IMAX; ++i) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

There may be a difference, and if there is then it is worth knowing
since the code change is so simple. This is also worth trying with
some of the other methods suggested.
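
A quick-and-dirty way to time the two orderings (just a sketch using
std::clock; the sizes and repeat count are arbitrary, and a clever optimizer
may still skew the results):

#include <cstdio>
#include <ctime>

enum { IMAX = 500, JMAX = 500, REPS = 100 };

static double A[IMAX][JMAX], B[IMAX][JMAX], C[IMAX][JMAX];

int main()
{
    std::clock_t t0 = std::clock();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < IMAX; ++i)          // i outside, j inside
            for (int j = 0; j < JMAX; ++j)
                A[i][j] = B[i][j] + C[i][j];

    std::clock_t t1 = std::clock();
    for (int r = 0; r < REPS; ++r)
        for (int j = 0; j < JMAX; ++j)          // j outside, i inside
            for (int i = 0; i < IMAX; ++i)
                A[i][j] = B[i][j] + C[i][j];

    std::clock_t t2 = std::clock();

    std::printf("i-outer: %.2f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
    std::printf("j-outer: %.2f s\n", double(t2 - t1) / CLOCKS_PER_SEC);
    std::printf("(ignore) %f\n", A[IMAX - 1][JMAX - 1]);   // keep the stores observable
}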

rossum

--

The ultimate truth is that there is no Ultimate Truth
Jul 22 '05 #10
