Designing Numerical Computation in C++

Phocas

Hello everyone.

I've been using C++ for many years now, and I am starting to work on
my first large project which has many different components. Some of
these components do extremely performance critical things like
scientific computations and ray-tracing. These are done for many
iterations so the code should run as fast as possible. On the other
hand, I also want the code to make sense to people, be easy to
understand and modify, and to change internally if a higher
performance solution comes along. This has been very challenging so
far. In this post I talk about a particular problem I've been having,
in the hope that those who've worked with C++ in a professional
setting might share some good advice about what they would do in
similar situations.

In particular I have tried to implement a vector class, which models n-
dimensional Cartesian space vectors. Let me start off by saying that
I can't really envision the application right now needing anything
other than 2, 3, or 4 dimensional vectors. I thought of 3 different
ways to make this class:

1.) Don't do anything, use Blitz++, tvmet, uBLAS, etc. Unfortunately,
the expression template techniques which make these go so much faster
than normal C++ on statements like "v = 2*f - dot(f, v)*g + 3/lengh(h)
* k" make them go significantly slower in statements like "v = 2*f"
because of the extra constructor invocation. Even tvmet shows a 50%
penalty when performing y = a*x + b for very small vectors. This can
cause a 50% performance hit because there are so few instructions to
begin with, that adding some extra in tight loop kills performance.
In ray tracing, I think stuff like this really matters because you
measure in millions of rays per second. My solution is to not use
these classes. In the documentation, I write: "No methods contain
Vector as the return type. That means vec.method(...) cannot evaluate
to another Vector. This is because doing so creates a copy (i.e.,
runs the constructor). If this is what you need to do, you should
_see_ the cost of constructing a new object to ensure you really need
it. If you do not want to change "this", manually create a copy and
call an operation on the copy that will change "this". The assignment
operators +=, -=, etc. are implemented because they change "this".
Operators like +, -, etc. are not implemented because they should
treat both operands as const and return a new Vector." Already this
is a big hit to usability and creating a "friendly" library for future
developers, but I really think this is worth it.

2.) The absolute guaranteed fastest solution, I also really dislike.
This is to create Vector2f, Vector2d, Vector3f, etc. for the different
dimensions and floating point precisions. Unfortunately this ensures
that adding new code, changing code, etc., must be changed in many
different places. However, there are several advantages to this
method: if I were to implement something like x86 SEE parallelized
versions of vector additions, subtractions, etc., they must be done in
assembly, in completely different ways for single and double precision
floating pointing. So they are easier to specialize if the classes
are seperate anyway. Plus I doubt anyone will want to change this
extremely simplistic classes anyway. Still, the solution is very
"unelegant".

3.) The way I've actually done it now, with templates and template
partial specializations. There is a base class
VectorBase<NumericType, Sizewhich implements many of the general
functionality like +=, -=, dot, length, etc. in terms of for loops
whose size is known at compile time. There are specialized versions
of Vector, which inherit most of their functionality from VectorBase
(I used this technique because member partial specialization is not
possible), but can also add special extra operations like the cross
product for 3 dimensional vectors only, and special constructors which
use the known size of the vectors. Mostly all they do is implement
constructors which call the VectorBase constructors, which the C++0x
forwarding constructor syntax should clean up nicely, when it comes
out.

So I was really happy with (3), which dramatically minimizes the code
compared to (2) but should go as fast since the loops should get
unrolled. But the compiler I am using (Microsoft Visual C++) won't
unroll the loops, even when they're size 2. This actually results in
a factor of 2 performance difference when you're doing just a few
floating point multiplies and adds (which is roughly like what is used
in a k-d tree traversal in raytracing, although I used the dot product
to test it). And there is no way to force them to unroll that I can
find, except with Intel's compilers which I do not have access to.
Basically, this frustrating exercise has just shown me the cynical
thing is true: you really do need to sacrifice clarity for
performance. I will probably end up implementing 6 high performance
vector classes, and 6 high performance matrix classes, all with
basically the same code. At least I can make SSE versions more
cleanly.

What bothers me is all the time I spent on this. I'm sure the more
experienced software developers out there would look at situations
like this and had some intution about the "best" thing to do. I, on
the other hand, have a directory full of dead implementations of the
same thing. My biggest problem is probably worrying too much over
details, and not knowing what its important. Does anyone have any
advice, if its actually possible to give advice this general?

Thanks for your time,
-Ken

Mar 26 '07 #1

Subscribe Post Reply

2083

=?iso-8859-1?q?Erik_Wikstr=F6m?=

On 26 Mar, 22:45, "Phocas" <kjcam...@gmail.comwrote:

Hello everyone.

I've been using C++ for many years now, and I am starting to work on
my first large project which has many different components. Some of
these components do extremely performance critical things like
scientific computations and ray-tracing. These are done for many
iterations so the code should run as fast as possible. On the other
hand, I also want the code to make sense to people, be easy to
understand and modify, and to change internally if a higher
performance solution comes along. This has been very challenging so
far. In this post I talk about a particular problem I've been having,
in the hope that those who've worked with C++ in a professional
setting might share some good advice about what they would do in
similar situations.

In particular I have tried to implement a vector class, which models n-
dimensional Cartesian space vectors. Let me start off by saying that
I can't really envision the application right now needing anything
other than 2, 3, or 4 dimensional vectors. I thought of 3 different
ways to make this class:

1.) Don't do anything, use Blitz++, tvmet, uBLAS, etc. Unfortunately,
the expression template techniques which make these go so much faster
than normal C++ on statements like "v = 2*f - dot(f, v)*g + 3/lengh(h)
* k" make them go significantly slower in statements like "v = 2*f"
because of the extra constructor invocation. Even tvmet shows a 50%
penalty when performing y = a*x + b for very small vectors. This can
cause a 50% performance hit because there are so few instructions to
begin with, that adding some extra in tight loop kills performance.
In ray tracing, I think stuff like this really matters because you
measure in millions of rays per second. My solution is to not use
these classes. In the documentation, I write: "No methods contain
Vector as the return type. That means vec.method(...) cannot evaluate
to another Vector. This is because doing so creates a copy (i.e.,
runs the constructor). If this is what you need to do, you should
_see_ the cost of constructing a new object to ensure you really need
it. If you do not want to change "this", manually create a copy and
call an operation on the copy that will change "this". The assignment
operators +=, -=, etc. are implemented because they change "this".
Operators like +, -, etc. are not implemented because they should
treat both operands as const and return a new Vector." Already this
is a big hit to usability and creating a "friendly" library for future
developers, but I really think this is worth it.

2.) The absolute guaranteed fastest solution, I also really dislike.
This is to create Vector2f, Vector2d, Vector3f, etc. for the different
dimensions and floating point precisions. Unfortunately this ensures
that adding new code, changing code, etc., must be changed in many
different places. However, there are several advantages to this
method: if I were to implement something like x86 SEE parallelized
versions of vector additions, subtractions, etc., they must be done in
assembly, in completely different ways for single and double precision
floating pointing. So they are easier to specialize if the classes
are seperate anyway. Plus I doubt anyone will want to change this
extremely simplistic classes anyway. Still, the solution is very
"unelegant".

3.) The way I've actually done it now, with templates and template
partial specializations. There is a base class
VectorBase<NumericType, Sizewhich implements many of the general
functionality like +=, -=, dot, length, etc. in terms of for loops
whose size is known at compile time. There are specialized versions
of Vector, which inherit most of their functionality from VectorBase
(I used this technique because member partial specialization is not
possible), but can also add special extra operations like the cross
product for 3 dimensional vectors only, and special constructors which
use the known size of the vectors. Mostly all they do is implement
constructors which call the VectorBase constructors, which the C++0x
forwarding constructor syntax should clean up nicely, when it comes
out.

So I was really happy with (3), which dramatically minimizes the code
compared to (2) but should go as fast since the loops should get
unrolled. But the compiler I am using (Microsoft Visual C++) won't
unroll the loops, even when they're size 2. This actually results in
a factor of 2 performance difference when you're doing just a few
floating point multiplies and adds (which is roughly like what is used
in a k-d tree traversal in raytracing, although I used the dot product
to test it). And there is no way to force them to unroll that I can
find, except with Intel's compilers which I do not have access to.
Basically, this frustrating exercise has just shown me the cynical
thing is true: you really do need to sacrifice clarity for
performance. I will probably end up implementing 6 high performance
vector classes, and 6 high performance matrix classes, all with
basically the same code. At least I can make SSE versions more
cleanly.

What bothers me is all the time I spent on this. I'm sure the more
experienced software developers out there would look at situations
like this and had some intution about the "best" thing to do. I, on
the other hand, have a directory full of dead implementations of the
same thing. My biggest problem is probably worrying too much over
details, and not knowing what its important. Does anyone have any
advice, if its actually possible to give advice this general?

Been working on a similar thing for some time and I too choose to
implement my own version of vectors and matrices using templates, much
for the same reasons. I was wondering what compiler settings you are
using and if you have asked about it at MS groups like
microsoft.public.vc or microsoft.public.dotnet.languages.vc (I don't
know if my loops get unrolled or not). Perhaps there are some pragmas
that can help?

Another way to improve speed (but code cleanliness will take a hit) is
to use OpenMP, if you use VS2005 (non-express) just look it up in the
help, I've personally seen that it can improve performance with very
little effort.

--
Erik Wikström

Mar 27 '07 #2

Lionel B

On Mon, 26 Mar 2007 13:45:57 -0700, Phocas wrote:

Hello everyone.

I've been using C++ for many years now, and I am starting to work on
my first large project which has many different components. Some of
these components do extremely performance critical things like
scientific computations and ray-tracing. These are done for many
iterations so the code should run as fast as possible. On the other
hand, I also want the code to make sense to people, be easy to
understand and modify, and to change internally if a higher
performance solution comes along. This has been very challenging so
far. In this post I talk about a particular problem I've been having,
in the hope that those who've worked with C++ in a professional
setting might share some good advice about what they would do in
similar situations.

In particular I have tried to implement a vector class, which models n-
dimensional Cartesian space vectors. Let me start off by saying that
I can't really envision the application right now needing anything
other than 2, 3, or 4 dimensional vectors. I thought of 3 different
ways to make this class:

1.) Don't do anything, use Blitz++, tvmet, uBLAS, etc. Unfortunately,
the expression template techniques which make these go so much faster
than normal C++ on statements like "v = 2*f - dot(f, v)*g + 3/lengh(h)
* k" make them go significantly slower in statements like "v = 2*f"
because of the extra constructor invocation.

Out of interest, did you look at the Blitz++ TinyVector class:

http://www.oonumerics.org/blitz/docs..._7.html#SEC131

Never used it myself, but it looks to be designed for very much the
situation you have; in particular, a good optimising compiler ought to be
able to unroll its loops.

[...]

Regards,

--
Lionel B

Mar 27 '07 #3

Amol

may be a look at
http://math.nist.gov/tnt/
is worth IMHO

Amol

Mar 27 '07 #4

by: dont bother | last post by:

Hey, I have these attributes: index which is a numerical value value vector which is a numerical float value and I want to concatenate like this:

Python

Designing a Volume Class

by: E G | last post by:

Hi! I am having problems in designing a class. First, I have a base class that allocates a 3D data set and allows some other mathematical operations with it, something like this: template...

C / C++

Designing a Matrix class

by: Charulatha Kalluri | last post by:

Hi, I'm implementing a Matrix class, as part of a project. This is the interface I've designed: class Matrix( )

C / C++

Weird numerical behavior regarding zero

by: matthias_k | last post by:

Hi, I am currently implementing a solver for linear optimization problems using the revised simplex method and I stumbled accross some strange behavior regarding the treatment of the number 0....

C / C++

"Numerical Recipes in C" code

by: Abhi | last post by:

Hi.. I wanted the C source code in machine readable format for the book "Numerical Recipes in C". I got hold of the pdf version of the book somehow. Does anyone have the complete C code of the...

C / C++

Numerical Recipes Rational Approximation Algorithm

by: lcw1964 | last post by:

Greetings groups! I am a rank novice in both C programming and numerical analysis, so I ask in advance your indulgence. Also, this question is directed specifically to those familiar with Numerical...

C / C++

Parallel Python

by: parallelpython | last post by:

Has anybody tried to run parallel python applications? It appears that if your application is computation-bound using 'thread' or 'threading' modules will not get you any speedup. That is because...

Python

designing your own characters in C++?

by: Andy Leese | last post by:

Is there a way of designing your own characters in C++? (For example some foreign symbols with accents etc...) I am relatively inexperienced in C++, but in other programming languages it used to...

C / C++

vectorized computation in C++ such as those in Matlab (Matlab toC++)?

by: Luna Moon | last post by:

Dear all, Can C++/STL/Boost do the vectorized calculation as those in Matlab? For example, in the following code, what I really want to do is to send in a vector of u's. All other...

C / C++

Designing superclasses so inherited methods return objects with sametype as the instance.

by: Felix T. | last post by:

I have a class called Interval(type.ObjectType) that is supposed to mimic closed mathematical intervals. Right now, it has a lot of methods like this: def __add__(self,other): if type(other) in...

Python

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Designing Numerical Computation in C++

Similar topics