Bytes | Software Development & Data Engineering Community
speed of int vs bool for large matrix

raj
Hi,
I saw it mentioned that "int" is the fastest data type for use in C;
that is, data storage/retrieval would be fastest if I use int among
the following 4 declarations on a 32-bit machine with 4-byte ints:

int m[1000][1000];
bool m[1000][1000]; // assuming I use C++
char m[1000][1000];
unsigned char m[1000][1000];

I need to know if this statement is true. Also, if int is indeed the
fastest, how much slower would the bool matrix be in practice? If I
had to dynamically allocate all 4 of the cases above, would the same
statements still hold?

Actually I need a large matrix for storing just binary values (so
bool would be most appropriate), but speed is my utmost concern; the
space required comes next.

Some pointers?

Thanks in advance.
Rajorshi

Nov 14 '05 #1
16 replies, 4490 views
raj wrote:

Hi,
I saw it mentioned that "int" is the fastest data-type for use in C
[...] I need to know if this statement is true.


It's almost true.
This is what the rules of C actually say:

N869
6.2.5 Types
[#5]
A ``plain'' int object has the natural size suggested by the
architecture of the execution environment (large enough to
contain any value in the range INT_MIN to INT_MAX as defined
in the header <limits.h>).

--
pete
Nov 14 '05 #2
raj wrote:
Hi,
I saw it mentioned that "int" is the fastest data-type for use in C
that is, data storage/retrieval would be the fastest if I use int among
the following 4 situations in a 32 bit machine with 4-byte ints:

int m[1000][1000];
bool m[1000][1000]; // assuming I use C++
char m[1000][1000];
unsigned char m[1000][1000];

I need to know if this statement is true. Also, if int is indeed the
fastest, how much slower would the bool matrix be in practice? What if I
had to dynamically allocate all the 4 cases above - would the same
statements still hold?

Actually I need a large matrix for storing just a binary value (so
bool would be most appropriate) - but speed is my utmost concern, space
required comes next.

Some pointers ?

Thanks in advance.
Rajorshi


If you declare your table as a table of unsigned chars, memory
requirements will be 1 000 000 bytes; in the other case it will be
4 times as much (supposing a 32-bit machine), i.e. 4 times as much
data will fit in the Level 1 cache.

Input output requirements will be much reduced and the program will
surely be much faster.

If you have a purely boolean matrix, I would use a packed
representation with 8 bits/byte, making your matrix just
125K, i.e. 8 times less than if you use the int data type.

Note that reducing space requirements increases speed, since
filling the cache from RAM introduces an enormous number of wait
states. Memory accesses go through the bus at 300 MHz, with a CPU
running at 3 000 MHz, i.e. a memory access can be 10 times slower
when it doesn't hit the L1 cache.

Of course this analysis assumes sequential access to each
element of the matrix. In other cases, where you access elements
of the matrix randomly, an integer representation would beat all
others, since you are NOT limited by I/O.

MAIN POINT
----------
Speculating about speed will never replace actual measurements.

1: Build a test matrix.
2: Simulate the access pattern of your application.
3: Time it using unsigned char, int.

Then you will know for sure.

jacob
Nov 14 '05 #3
"raj" <ra******@fastmail.fm> wrote:
# Hi,
# I saw it mentioned that "int" is the fastest data-type for use in C

That's wrong. The correct answer is MU. Suppose you're solving PDEs; the fastest
data type is float or double. You can do it in int, using various integers to
hold the exponents and fractions, but it's going to be a lot faster using what
is encoded in the hardware and math libraries. What is fastest depends on the
operations performed on what data in what order on what hardware with what caches
and what kind of virtual memory management (if any). There is no
universally correct answer.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Who's leading this mob?
Nov 14 '05 #4
jacob navia <ja***@jacob.remcomp.fr> writes:
raj wrote:
I saw it mentioned that "int" is the fastest data-type for use in C
that is, data storage/retrieval would be the fastest if I use int among
the following 4 situations in a 32 bit machine with 4-byte ints:

int m[1000][1000];
bool m[1000][1000]; // assuming I use C++
char m[1000][1000];
unsigned char m[1000][1000];

I need to know if this statement is true. Also, if int is indeed the
fastest, how much slower would the bool matrix be in practice? What if I
had to dynamically allocate all the 4 cases above - would the same
statements still hold?

Actually I need a large matrix for storing just a binary value (so
bool would be most appropriate) - but speed is my utmost concern, space
required comes next.
[...]
If you declare your table as a table of unsigned chars, memory
requirements will be 1 000 000 bytes; in the other case it will be
4 times as much (supposing a 32 bit machine), i.e. there is 4 times
as much data that will fit the Level 1 cache.
You're assuming there is a "Level 1 cache".
Input output requirements will be much more reduced and the
program will be surely much faster.
What input output requirements? The OP didn't indicate that the
matrix would be read from or written to an external file.
If you have a purely boolean matrix, I would use a packed
representation with 8 bits/byte, making your matrix just
125K, i.e. 8 times less than if you use the int data type.
Which will likely make access to individual elements slower.
Note that reducing space requirements increases speed, since
filling the cache from RAM introduces an enormous amount of wait
states. Memory accesses go through the bus at 300 MHz, with a CPU
running at 3 000 MHz, i.e. a memory access can be 10 times slower
when it doesn't hit the L1 cache.
That's all very system-specific.
Of course this analysis implies a sequential access to each
element of the matrix. In other cases where you just access
randomly elements of the matrix, an integer representation would
beat all others since you are NOT limited by I/O.

MAIN POINT
----------
Speculating about speed will never replace actual measurements.

1: Build a test matrix.
2: Simulate the access pattern of your application.
3: Time it using unsigned char, int.

Then you will know for sure.


You could have safely skipped everything before "MAIN POINT".

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #5
In article <ch********@odbk17.prod.google.com>,
"raj" <ra******@fastmail.fm> wrote:
Hi,
I saw it mentioned that "int" is the fastest data-type for use in C
that is, data storage/retrieval would be the fastest if I use int among
the following 4 situations in a 32 bit machine with 4-byte ints:

int m[1000][1000];
bool m[1000][1000]; // assuming I use C++
char m[1000][1000];
unsigned char m[1000][1000];

I need to know if this statement is true. Also, if int is indeed the
fastest, how much slower would the bool matrix be in practice? What if I
had to dynamically allocate all the 4 cases above - would the same
statements still hold?

Actually I need a large matrix for storing just a binary value (so
bool would be most appropriate) - but speed is my utmost concern, space
required comes next.


Why don't you just try it out? Write four functions, one for each kind
of matrix, then set every element to zero, then to one, and repeat
this a thousand times. Measure the time. Observe the difference.
Nov 14 '05 #6
Keith Thompson wrote:
jacob navia <ja***@jacob.remcomp.fr> writes:

If you declare your table as a table of unsigned chars, memory
requirements will be 1 000 000 bytes; in the other case it will be
4 times as much (supposing a 32 bit machine), i.e. there is 4 times
as much data that will fit the Level 1 cache.

You're assuming there is a "Level 1 cache".


I am assuming of course a modern CPU, either a PPC, x86 or similar.
Please let's be realistic: handling 1-4 MB matrices on a DSP
with 4K of RAM would be ridiculous...
Input output requirements will be much more reduced and the
program will be surely much faster.

What input output requirements? The OP didn't indicate that the
matrix would be read from or written to an external file.


Input output to/from the CPU of course!
I am not speaking about file I/O but RAM I/O.
If you have a purely boolean matrix, I would use a packed
representation with 8 bits/byte, making your matrix just
125K, i.e. 8 times less than if you use the int data type.

Which will likely make access to individual elements slower.


No, this is my point. If you have to read 8 times less data
from RAM, each read producing at least 10-20 wait states, the few
extra instructions needed to access the individual bits would be
compensated for by the smaller I/O.
Note that reducing space requirements increases speed, since
filling the cache from RAM introduces an enormous amount of wait
states. Memory accesses go through the bus at 300 MHz, with a CPU
running at 3 000 MHz, i.e. a memory access can be 10 times slower
when it doesn't hit the L1 cache.

That's all very system-specific.


Not really. As the CPUs of modern micro computers go to
ever higher clock speeds, memory doesn't catch up at all,
and I/O from RAM becomes the limiting factor.
Of course this analysis implies a sequential access to each
element of the matrix. In other cases where you just access
randomly elements of the matrix, an integer representation would
beat all others since you are NOT limited by I/O.

MAIN POINT
----------
Speculating about speed will never replace actual measurements.

1: Build a test matrix.
2: Simulate the access pattern of your application.
3: Time it using unsigned char, int.

Then you will know for sure.

You could have safely skipped everything before "MAIN POINT".


No, I wanted to describe the alternatives and leave the OP the
choice, but an informed choice.
Nov 14 '05 #7
raj wrote:
int m[1000][1000];
bool m[1000][1000]; // assuming I use C++
char m[1000][1000];
unsigned char m[1000][1000];


Like many people, you make the mistake of assuming that bools occupy a
single bit (also note that C99 includes a bool type much like the C++
one.) There is a difficult-to-predict performance tradeoff here that
will vary from machine to machine. Here's a comparison of the advantages
and disadvantages:

int m[1000][1000];

This stores a single value in each word, consuming at least one million
words. Random access requires only the address calculation plus a single
read/write operation on most machines, with no need for bit operations.
That makes this alternative good for random access on machines with weak
bit-level capabilities. If the array is going to be traversed
sequentially on a machine with a data cache though, it will produce many
more cache misses than other approaches.

char m[1000][1000];

This stores a single value in each byte, so that you only need about a
million bytes. If you traverse the array in order on a machine with a
data cache, you'll get far fewer cache misses with this version, and on
byte-addressable machines with instructions to fetch bytes, no bit
operations are needed on top of address calculation. On word-addressable
machines with no special instructions for loading bytes, however, this
may turn out to compile into something as slow as the bit array below.
It also still takes 8 times as much space as the bit array, and has
about 8 times as many cache misses when traversed in order on machines
with data caches.

#define INT_BITS (sizeof(int)*8)
#define CEIL_DIV(x,y) (((x)+(y)-1)/(y))
int m[1000][CEIL_DIV(1000,INT_BITS)];
#define M_GET(r,c) (m[(r)][(c)/INT_BITS] & (1 << ((c) % INT_BITS)))
#define M_SET(r,c) (m[(r)][(c)/INT_BITS] |= (1 << ((c) % INT_BITS)))
#define M_CLEAR(r,c) (m[(r)][(c)/INT_BITS] &= ~(1 << ((c) % INT_BITS)))

This creates an array of bits and some macros for accessing it. Although
the division and modulus operations will probably compile to bit
operations, each random access to a non-constant index pair will require
on most machines at least two ANDs and two shifts that the word version
doesn't. Accesses to fixed or partially fixed indexes are faster, since
the compiler can precompute the index and bit mask (shown here for an
INT_BITS of 32):

M_GET(40,65)
(m[(40)][(65)/INT_BITS] & (1 << ((65) % INT_BITS)))
m[40][65/32] & (1 << (65 % 32))
m[40][2] & (1 << 1)
m[40][2] & 2

Sequential access actually can yield very good performance, since you
can do that with a loop like this:

unsigned int current_word, i, j, mask;
for (i = 0; i < 1000; i++) {
    for (j = 0; j < CEIL_DIV(1000, INT_BITS); j++) {
        current_word = m[i][j];
        for (mask = 1; mask != 0; mask <<= 1) {
            int current_bit = current_word & mask; /* get current bit   */
            current_word |= mask;                  /* set current bit   */
            current_word &= ~mask;                 /* clear current bit */
        }
        m[i][j] = current_word;
    }
}

This amounts to two bit operations and 2/INT_BITS word accesses per
element on average, which is not only often faster than two word
accesses, but on machines with caches it also cuts cache misses by
about a factor of INT_BITS.

Even if you only do random access, considering that you reduce your
space requirements by a factor of INT_BITS, a few extra cycles per
access are justified in many applications.
--
Derrick Coetzee
I grant this newsgroup posting into the public domain. I disclaim all
express or implied warranty and all liability. I am not a professional.
Nov 14 '05 #8
In article <ch**********@news-reader5.wanadoo.fr>, ja***@jacob.remcomp.fr
says...
I am assuming of course a modern CPU, either a PPC, x86 or similar.
Intel is STILL selling Celerons. *sigh*
No, this is my point. If you have to read from RAM 8 times less
data, each read producing at least 10-20 wait states, the few
more instructions to access the individual bits would be
compensated by the smaller I/O.


I don't know that's a given. Hypertransport 4-way Opteron boxes
can move almost 8GB/s to/from memory with current DDR modules when
spinning through 64GB of total memory, using 4GB block move/compares.

I'm not at all sure that 8GB/s (being close to 4X what a Xeon 4-way can
do) is going to care much for typical application read/write sizes
whether you use char's, int's, etc. As usual, it depends on the
specifics, the hardware, etc. There is no general rule for this sort of
thing that isn't so general as to be practically worthless. Measure,
then you'll know.

--
Randy Howard
To reply, remove FOOBAR.
Nov 14 '05 #9
jacob navia <ja***@jacob.remcomp.fr> writes:
Keith Thompson wrote:
jacob navia <ja***@jacob.remcomp.fr> writes:
If you declare your table as a table of unsigned chars, memory
requirements will be 1 000 000 bytes; in the other case it will be
4 times as much (supposing a 32 bit machine), i.e. there is 4 times
as much data that will fit the Level 1 cache.

You're assuming there is a "Level 1 cache".


I am assuming of course a modern CPU, either a PPC, x86 or similar.
Please lets be realistic: handling 1-4MB matrices in a DSP
with 4K of RAM would be ridiculous...
Input output requirements will be much more reduced and the
program will be surely much faster.

What input output requirements? The OP didn't indicate that the
matrix would be read from or written to an external file.


Input output to/from the CPU of course!
I am not speaking about file I/O but RAM I/O.


Hmm. I don't think that's what most people mean by "I/O".
If you have a purely boolean matrix, I would use a packed
representation with 8 bits/byte, making your matrix just
125K, i.e. 8 times less than if you use the int data type.

Which will likely make access to individual elements slower.


No, this is my point. If you have to read from RAM 8 times less
data, each read producing at least 10-20 wait states, the few
more instructions to access the individual bits would be
compensated by the smaller I/O.


If the matrix is read randomly, the code will still have to read
memory for each access, *plus* the overhead of extracting the desired
bit (and similarly for write access).

[...]
MAIN POINT
----------
Speculating about speed will never replace actual measurements.

1: Build a test matrix.
2: Simulate the access pattern of your application.
3: Time it using unsigned char, int.

Then you will know for sure.

You could have safely skipped everything before "MAIN POINT".


No, I wanted to describe the alternatives and leave the OP the
choice, but an informed choice.


I may have been slightly too harsh. My real complaint is that you
didn't qualify your statements. The amount of cache, if any, is still
system-specific, even if it happens that most modern systems have at
least some particular amount. If you had made that clearer, I
wouldn't have minded as much.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #10
Randy Howard <ra*********@FOOverizonBAR.net> writes:
In article <ch**********@news-reader5.wanadoo.fr>, ja***@jacob.remcomp.fr
says...
No, this is my point. If you have to read from RAM 8 times less
data, each read producing at least 10-20 wait states, the few
more instructions to access the individual bits would be
compensated by the smaller I/O.


I don't know that's a given. Hypertransport 4-way Opteron boxes
can move almost 8GB/s to/from memory with current DDR modules when
spinning through 64GB of total memory, using 4GB block move/compares.

I'm not at all sure that 8GB/s (being close to 4X what a Xeon 4-way can
do) is going to care much for typical application read/write sizes
whether you use char's, int's, etc. As usual, it depends on the
specifics, the hardware, etc. There is no general rule for this sort of
thing that isn't so general as to be practically worthless. Measure,
then you'll know.


This is comparing apples and oranges - 8GB/s is a measure of
bandwidth, 10-20 wait states is a measure of latency. The bandwidth
could be a million GB/s, and the latency would still very likely have
a significant performance impact. At least, for processors with less
cache than the 4MB that a 1 million entry word array would take, which
includes most processors in use today.

You're right that the only way to know for sure is to measure.

[Yes, we have gotten off topic for comp.lang.c here....]
Nov 14 '05 #11
On Sat, 11 Sep 2004 12:30:27 -0000
SM Ryan <wy*****@tango-sierra-oscar-foxtrot-tango.fake.org> wrote:
"raj" <ra******@fastmail.fm> wrote:
# Hi,
# I saw it mentioned that "int" is the fastest data-type for use in C

That's wrong. The correct answer is MU. Suppose you're solving PDEs;
the fastest data type is float or double. You can do it in int, using
various integers to hold the exponents and fractions, but it's going
to be a lot faster using what is encoded in the hardware and math
libraries. What is fastest depends on the operations performed on what
data in what order on what hardware with what caches and what kind of
virtual memory management (if any). There is no universally correct
answer.


That's not true. It depends on the hardware and how the library is
implemented. It is entirely conceivable that you could solve such a
problem to sufficient accuracy on some implementations using integer
arithmetic and various look up tables, just as I know of lots of trig
being done in SW mainly using integers for speed and only occasionally
using floating point.

However, on most implementations unless you actually need floating point
(and when discussing the speed of integer types it is implicit that you
don't) integer types are almost always faster than floating point types.
--
Flash Gordon
Sometimes I think shooting would be far too good for some people.
Although my email address says spam, it is real and I read it.
Nov 14 '05 #12
Actually I need a large matrix for storing just a binary value (so
bool would be most appropriate) - but speed is my utmost concern, space
required comes next.


How big is a bool in C? Isn't it strange that any language would make
a bool bigger than a bit? I came across the same problem in Java once
and someone told me that a bool is 32 bits! Imagine a one-bit value
stored in enough space for 32 one-bit values!

It must be a speed issue. The extra steps of doing bit manipulations
would certainly slow things down if you are measuring speed by the
number of steps the program must go through.

I suppose you could sit down at the computer with your stopwatch and
time bit manipulations vs. using the bloated built-in bool. You could
also list the machine code and count the number of instructions. But,
then again, some CPUs might be smart enough to process the
instructions in such a way that you really won't know unless you
actually run the thing and time it.

I would go the route of storing bools as bits because if speed is a
concern, you are probably programming non-portable code for some slow
embedded system that will probably be short on storage as well as
speed. Or maybe not.

wana
Nov 14 '05 #13
io*****@yahoo.com (wana) writes:
Actually I need a large matrix for storing just a binary value (so
bool would be most appropriate) - but speed is my utmost concern, space
required comes next.


How big is a bool in C? Isn't it strange that any language would make
a bool bigger than a bit? I came across the same problem in Java once
and someone told me that a bool is 32 bits! Imagine a one bit value
stored in enough space for 32 one bit values?


C90 doesn't have a bool type. C99 does (it's actually a typedef for
the predefined type _Bool, for compatibility reasons).

C doesn't support objects (other than bit fields) smaller than one
byte, and a byte has to be at least 8 bits. Bit fields have a number
of restrictions, the most important of which is that you can't take
their address. Since array indexing depends on addressing, if bool were
a one-bit type you couldn't have an array of bool.

It would have been possible, I suppose, for C99 to have introduced
full support for sub-byte objects. A pointer to such an object would
probably have to be a "fat" pointer, incorporating the address of the
containing byte and a bit offset. This would have been a fairly
radical change to C's object model (no, I don't mean "object" in the
sense of "object-oriented"); it could have been very useful if done
right, but it would have been a lot of work, probably exceeding the
available resources.

If you want a single boolean object, you don't want it to occupy less
than a byte anyway. If you want a single-bit boolean in a struct, C99
allows bit fields of type bool. If you want an array of single-bit
boolean values, you'll need to use the existing mechanisms with
explicit bitwise operations. The time and space tradeoffs are
system-dependent, and actual measurements are the only way to
determine which is going to be best for you.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #14
Derrick Coetzee wrote:
It also still takes 8 times as much space as the bit array, and has
about 8 times as many cache misses when traversed in order on machines
with data caches.

#define INT_BITS (sizeof(int)*8)


Er, wherever I said 8 here, I meant CHAR_BITS (not that we've seen a
machine with CHAR_BITS != 8 in a long time, but at least it kills the
magic number and enhances compliance).
--
Derrick Coetzee
I grant this newsgroup posting into the public domain. I disclaim all
express or implied warranty and all liability. I am not a professional.
Nov 14 '05 #15
On Thu, 16 Sep 2004 04:07:13 -0400
Derrick Coetzee <dc****@moonflare.com> wrote:
Derrick Coetzee wrote:
It also still takes 8 times as much space as the bit array, and has
about 8 times as many cache misses when traversed in order on
machines with data caches.

#define INT_BITS (sizeof(int)*8)


Er, wherever I said 8 here, I meant CHAR_BITS (not that we've seen a
machine with CHAR_BITS != 8 in a long time, but at least it kills the
magic number and enhances compliance).


You might not see them but they are still very common in the embedded
world and, IMHO, one is much more likely to be concerned about saving
space in an embedded system than on a *modern* hosted implementation.

Of course, most people are asking questions about hosted
implementations with CHAR_BITS==8, especially the novices.
--
Flash Gordon
Sometimes I think shooting would be far too good for some people.
Although my email address says spam, it is real and I read it.
Nov 14 '05 #16
Derrick Coetzee <dc****@moonflare.com> writes:
Er, wherever I said 8 here, I meant CHAR_BITS (not that we've seen a
machine with CHAR_BITS != 8 in a long time, but at least it kills the
magic number and enhances compliance).


The constant you're looking for is actually CHAR_BIT.
--
Ben Pfaff
email: bl*@cs.stanford.edu
web: http://benpfaff.org
Nov 14 '05 #17
