Efficient sieving implementation?

Dear Reader,

I'm trying to implement the following short algorithm:

Given a number n, remove all the multiples of n
from the list of positive integers from 1 through a limit k.

My solution is to take an array with all entries '1' and iterate
over it in steps of n, setting the current entry to '0'.

Here's the code:

const int LIMIT = k + 1;
unsigned char Sieve[LIMIT];

memset(Sieve, 1, LIMIT); // initialise all entries to '1'
for (int i = n; i < LIMIT; i += n) { Sieve[i] = 0; }

As k gets large this implementation is really slow for small n,
especially if you have to sieve more than one n. Is there a
more efficient way to implement this? Are there fast memory
operations like memset in C++ for this task?

Thanks for your help!

Best,
Oswald Kluge

Jul 24 '06 #1

Oswald Kluge wrote:
> Is there a more efficient way to implement this? Are there fast
> memory operations like memset in C++ for this task?
Sure, instead of

if (Sieve[blah] == 0) { happy_dance(); }

Do

if (Sieve(blah) == 0) { happy_dance(); }

Where Sieve() is a function that returns the value desired.

If your domain is truly large, then a table is just inefficient and
wasteful. If the table is really small, then your concern is moot.
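
A minimal sketch of that direct-computation idea (the divisor list and
names here are just illustrative, not Tom's actual code):

/* Return 0 if x is a multiple of one of the sieved n's, 1 otherwise.
   The divisor list is an example; fill it with your own n's. */
static const int divisors[] = {3, 7, 11};
static const int ndiv = sizeof divisors / sizeof divisors[0];

int Sieve(int x)
{
    for (int i = 0; i < ndiv; i++)
        if (x % divisors[i] == 0)
            return 0;            /* struck out, like Sieve[x] == 0 in the table */
    return 1;
}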

Tom

Jul 24 '06 #2

Oswald Kluge wrote:
> As k gets large this implementation is really slow for small n,
> especially if you have to sieve more than one n. Is there a
> more efficient way to implement this?
It may be possible to improve this for a generalized case (it can be
done for specific cases of multiple "n"s), but it will certainly
involve more memory.

I know of nothing like memset that would be useful in this case. When
you get down to the machine level, any processor I've ever seen (and
I've seen many) will still need to do a loop like your for loop. A
good compiler will turn the for loop into efficient machine code - it's
a very simple bit of code.

Can you provide more information about the range of values for which
you are doing this, the number and types of "n"s, programmatic
constraints on cpu time and memory availability, and the overall
purpose? If you are doing this to find primes, etc, this is not likely
the best route.

Jeff

Jul 24 '06 #3
> I'm trying to implement the following short algorithm:
>
> Given a number n, remove all the multiples of n
> from the list of positive integers from 1 through a limit k.

Seems to be the "Sieve of Eratosthenes", or at least a very similar
algorithm ...

> My solution is to take an array with all entries '1' and iterate
> over it in steps of n, setting the current entry to '0'.

That might be what you want (input and output). On the other hand, a
direct method (like Tom already suggested) can be much faster. And even
if you need this table-driven approach, the usual iteration is not the
only way to achieve this goal.

> As k gets large this implementation is really slow for small n,

This is because memory is optimized for streaming access, and such a
pseudo-random access pattern isn't cache-friendly.

> especially if you have to sieve more than one n.

Instead of sieving for each n separately, try to combine several passes
for different n's and sieve in blocks of your processor's cache size.
Depending on the number of n's, this can give you a good performance
boost.

Christian

Jul 24 '06 #4
It is worth caching if you will call Sieve(x) on the same x several
times. Otherwise don't cache. If you are not sure, try this:
___________________________________
#include <string.h>

const int K = 1000000;
unsigned char sieve[K];

const int N = 3;
int divs[N] = {3, 7, 11};

const unsigned char IS_MULTIPLE = 0, NOT_MULTIPLE = 1, UNKNOWN = 2;

void init_sieve() { memset(sieve, UNKNOWN, K); }

inline int Sieve(int x)
{
    if (sieve[x] == UNKNOWN)
    {
        sieve[x] = NOT_MULTIPLE;           /* assume coprime until a divisor is found */
        for (int i = 0; i < N; i++)
            if (x % divs[i] == 0) {
                sieve[x] = IS_MULTIPLE;
                break;
            }
    }
    return sieve[x];
}
_______________________________________
This way you make sure you never calculate twice for the same x, with a
trade-off -- another couple of CPU cycles for each call to Sieve(x), BUT
you never calculate Sieve(x) for any x that is never used.
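
A quick usage sketch of the above (values purely illustrative):

init_sieve();
if (Sieve(21) == 0) { /* 21 is a multiple of 3, so it is struck out */ }
if (Sieve(25) == 1) { /* 25 shares no factor with 3, 7 or 11 */ }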

Jul 24 '06 #5
Jeff wrote:
> Can you provide more information about the range of values for which
> you are doing this, the number and types of "n"s, programmatic
> constraints on cpu time and memory availability, and the overall
> purpose? If you are doing this to find primes, etc, this is not likely
> the best route.
The purpose of this program is to find coprime numbers. If n is prime,
then if you sieve out all multiples of n, all remaining numbers will
have no common divisors with n.

If n is composite, I always know which prime factors occur, so after
sieving out all the prime factors you still get all the coprime
numbers.

The range of values starts at k=10^5. I have a 2 GHz Pentium with 2 GB
of RAM.

Oswald

Jul 24 '06 #6

Oswald Kluge wrote:
> The purpose of this program is to find coprime numbers. If n is prime,
> then if you sieve out all multiples of n, all remaining numbers will
> have no common divisors with n.
Why not just compute the GCD of the two numbers?
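
For checking a single pair, that is just Euclid's algorithm (a sketch;
more recent C++ also has std::gcd in <numeric>):

/* n and m are coprime exactly when gcd(n, m) == 1. */
unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* e.g. gcd(12, 35) == 1, so 12 and 35 are coprime; gcd(12, 34) == 2. */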

Tom

Jul 24 '06 #7

Oswald Kluge wrote:
...snip...
> If n is composite, I always know which prime factors occur, so after
> sieving out all the prime factors you still get all the coprime
> numbers.
>
> The range of values starts at k=10^5. I have a 2 GHz Pentium with 2 GB
> of RAM.
I was really looking for a range of values (i.e. both ends). If the
cardinality of the range is not too large, sieving might be worthwhile.
I'm uncertain why you would need to know all of the coprimes of n
within a given range.

If the range of values is not too large, and if you have to do a LOT of
sieving, it might be possible to set up a network in memory, wherein
each number in the range would be connected to its prime factors. To
eliminate numbers sharing a factor with n, you would only have to
specify the factors of n itself. This concept would require a fair bit
of precomputation and would use a lot of memory for overhead, so it
would not be useful if your range was, say, hundreds of millions of
numbers.
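
A rough sketch of that "network" idea (container choices and names are
mine, and as noted it trades a lot of memory and precomputation for
cheap per-n sieving):

#include <cstddef>
#include <vector>

/* multiples[p] holds every multiple of the prime p in [1, K]. */
std::vector<std::vector<int> > multiples;

void precompute(int K)
{
    multiples.assign(K + 1, std::vector<int>());
    std::vector<char> is_prime(K + 1, 1);
    for (int p = 2; p <= K; ++p) {
        if (!is_prime[p]) continue;
        for (int m = p; m <= K; m += p) {
            if (m > p) is_prime[m] = 0;      /* ordinary prime sieve */
            multiples[p].push_back(m);       /* remember the "edge" p -> m */
        }
    }
}

/* Strike everything sharing a factor with n, given n's prime factors. */
void strike_factors(std::vector<char>& sieve, const std::vector<int>& factors)
{
    for (std::size_t f = 0; f < factors.size(); ++f) {
        const std::vector<int>& ms = multiples[factors[f]];
        for (std::size_t i = 0; i < ms.size(); ++i)
            sieve[ms[i]] = 0;
    }
}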

Are you really certain that you need to know all of the coprimes of n
within a given range? If not, Tom's suggestion of calculating the GCD
of n and a specific other number is excellent.

Jeff

Jul 24 '06 #8
Christian Siebert wrote:
>> As k gets large this implementation is really slow for small n,
> This is because memory is optimized for streaming access, and such a
> pseudo-random access pattern isn't cache-friendly.
>> especially if you have to sieve more than one n.
> Instead of sieving for each n separately, try to combine several passes
> for different n's and sieve in blocks of your processor's cache size.
One way to do this, though it may be obvious to some, is to combine n's
into groups by taking the least common multiple. That gives you a pattern
you can repeat. For example, when doing a sieve with 2 and 3, you can do
the straightforward way:

for (i = 0; i < end; i += 2) { a[i] = 0; }
for (i = 0; i < end; i += 3) { a[i] = 0; }

Or instead, you can (mostly) combine these into one pass:

for (i = 0; i < end; i += 6)
{
/* multiples of 2 */
a[i + 0] = 0;
a[i + 2] = 0;
a[i + 4] = 0;

/* multiples of 3 */
a[i + 0] = 0;
a[i + 3] = 0;
}

This can be minimized by combining the redundant assignments:

for (i = 0; i < end; i += 6)
{
a[i + 0] = 0;
a[i + 2] = 0;
a[i + 3] = 0;
a[i + 4] = 0;
a[i + 5] = 0;
}

Now, here's what I think is the fun part. Once you've established
that pattern that repeats after 6 elements, you can, instead of
performing all the assignment statements separately, create a mask:

const int lcm = 6;
bool mask[lcm];

for (i = 0; i < lcm; i++) { mask[i] = 1; }

for (i = 0; i < lcm; i += 2) { mask[i] = 0; }
for (i = 0; i < lcm; i += 3) { mask[i] = 0; }

and then apply the mask:

for (i = 0; i < end; i += lcm)
{
for (j = 0; j < lcm; j++)
{
a[i + j] &= mask[j];
}
}

The good thing about this, obviously, is that the inner loop in the
code just above can be optimized to DEATH. Your operating system
or some readily available library may in fact already provide a
routine to AND two regions of memory together efficiently. You
can duplicate the mask until it is some convenient size. You can
make its size a multiple of 8 or 16 or 32 if that helps you align
it in a useful way. And you can combine multiple different n's
together to produce one mask.
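
There is no memand() counterpart to memcpy() in the standard C or C++
library, but a word-at-a-time loop is easy to write and easy for a
compiler to vectorise; a sketch (memcpy of the words sidesteps the
alignment and aliasing questions):

#include <stdint.h>
#include <string.h>

/* dst[i] &= src[i] for n bytes, processed a machine word at a time. */
void and_regions(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t a, b;
        memcpy(&a, dst + i, sizeof a);
        memcpy(&b, src + i, sizeof b);
        a &= b;
        memcpy(dst + i, &a, sizeof a);
    }
    for (; i < n; ++i)          /* leftover tail bytes */
        dst[i] &= src[i];
}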

Of course, all this is a waste of time if the n's aren't relatively
small compared to k.

- Logan
Jul 25 '06 #9
Hi Logan,

>> Instead of sieving for each n separately, try to combine several passes
>> for different n's and sieve in blocks of your processor's cache size.
> One way to do this, though it may be obvious to some, is to combine n's
> into groups by taking the least common multiple. That gives you a pattern
> you can repeat.

Nice that you think of this improvement too. I posted this idea some
months ago in conjunction with the QS:
http://groups.google.de/group/sci.cr...90676702d754ba
which gives you an additional speed-up. Though I meant the "more
common" optimization technique in the first place:

Instead of using the whole large sieve at once

  memset(Sieve, 1, LIMIT);    // initialise all entries to '1'
  for each n do
      for (i = n; i < LIMIT; i += n) { Sieve[i] = 0; }
  use_sieve(Sieve, LIMIT);

it is often better to split this large sieve into smaller blocks

  start = 0;
  while (start < LIMIT) {
      memset(SieveBlock, 1, CACHESIZE);
      for each n do
          for (i = ceil(start/n)*n; i < start + CACHESIZE; i += n)
              { SieveBlock[i - start] = 0; }
      use_sieve_block(SieveBlock, start, CACHESIZE);
      start += CACHESIZE;     // proceed with next block
  }
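
For what it's worth, a more concrete (though untested) C++ rendering of
that blocked loop -- CACHE_BLOCK, the divisor list and use_sieve_block
are placeholders to fill in:

#include <string.h>

enum { CACHE_BLOCK = 32 * 1024 };   /* pick something near your cache size */

void sieve_blocked(long limit, const long *divs, int ndiv,
                   void (*use_sieve_block)(const unsigned char *, long, long))
{
    unsigned char block[CACHE_BLOCK];

    for (long start = 0; start < limit; start += CACHE_BLOCK) {
        long len = (limit - start < CACHE_BLOCK) ? (limit - start) : CACHE_BLOCK;
        memset(block, 1, len);
        for (int d = 0; d < ndiv; d++) {
            long n = divs[d];
            long first = ((start + n - 1) / n) * n;   /* first multiple of n >= start */
            if (first < n) first = n;                 /* skip index 0, as in the original */
            for (long i = first; i < start + len; i += n)
                block[i - start] = 0;
        }
        use_sieve_block(block, start, len);
    }
}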
> Now, here's what I think is the fun part. Once you've established
> that pattern that repeats after 6 elements, you can, instead of
> performing all the assignment statements separately, create a mask:
> ... and then apply the mask: ... & ...
> Your operating system or some readily available library may in fact
> already provide a routine to AND two regions of memory together efficiently.

Why should you waste an additional logical operation, when it is
possible to shift this work into the initialization step? Create the
"mask" as suggested, and then use memcpy() to repeatedly copy it to
your sieve memory...
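
A sketch of that memcpy-based initialisation (the mask must cover
exactly one period -- the lcm -- for the tiling to stay in phase; names
are illustrative):

#include <string.h>

/* Tile a precomputed 0/1 pattern of length mask_len across the sieve,
   replacing memset(Sieve,1,...) plus the striking passes for the n's
   that were baked into the mask. */
void tile_mask(unsigned char *sieve, long limit,
               const unsigned char *mask, long mask_len)
{
    long pos = 0;
    while (pos < limit) {
        long chunk = (limit - pos < mask_len) ? (limit - pos) : mask_len;
        memcpy(sieve + pos, mask, chunk);
        pos += chunk;
    }
}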

Christian

Jul 25 '06 #10
Christian Siebert wrote:
> Why should you waste an additional logical operation, when it is
> possible to shift this work into the initialization step? Create the
> "mask" as suggested, and then use memcpy() to repeatedly copy it to
> your sieve memory...
I was thinking that you'd create one mask for several n's, then another
mask for a few more n's, then another for some more, and so on.
For example, if doing a sieve of Eratosthenes, you might do one mask
of size 2310 for 2, 3, 5, 7, and 11 (2310 = 2*3*5*7*11). Then you might
do another mask of size 4199 for 13, 17, and 19 (4199 = 13*17*19). And so on.

There's a reason for this: if you try to create one mask for all n's,
then the process of creating the mask degenerates into just running
the sieve. If you do it with only one n per mask, then it also
degenerates into just running the sieve! So you have to do more than
one of the n's at once, but fewer than all of them, for this method to
have any benefit.

Of course, for the first pass through the array, you could do something
like memcpy() of your mask. That would eliminate a single pass, which
might be worthwhile if you are doing only a few passes total.

Come to think of it, if you want to optimize for sequential memory
access, you could (in the example above) make your {2,3,5,7,11} mask
of size 2310 and your {13,17,19} mask of size 4199, and you could
go through the target array in chunks, going through the masks like
a circular list. For example, you might write the first 1024 of
the two masks to the array. Then write the next 1024 of both.
Then the 2310-element mask would only have 262 elements left in it, so
you'd write those final 262 elements of the mask, then make up the
rest of that 1024-element chunk by writing another 762 from the beginning.

With this method, if you have enough cache space to hold all your
masks, you could actually write the entire array in a single pass,
causing almost pure sequential memory access, which would probably
be a win with a lot of memory controllers.
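
Here's roughly what that circular-list treatment could look like for one
mask (chunk size and names invented; with several masks you'd keep one
cursor per mask and AND each in turn):

#include <algorithm>
#include <cstddef>

void apply_mask_chunked(unsigned char *sieve, std::size_t sieve_len,
                        const unsigned char *mask, std::size_t mask_len,
                        std::size_t chunk = 1024)
{
    std::size_t mask_pos = 0;                    /* cursor into the circular mask */
    for (std::size_t start = 0; start < sieve_len; start += chunk) {
        std::size_t todo = std::min(chunk, sieve_len - start);
        std::size_t done = 0;
        while (done < todo) {
            std::size_t run = std::min(todo - done, mask_len - mask_pos);
            for (std::size_t j = 0; j < run; ++j)   /* first pass could memcpy instead */
                sieve[start + done + j] &= mask[mask_pos + j];
            done += run;
            mask_pos += run;
            if (mask_pos == mask_len)
                mask_pos = 0;                    /* wrap around to the mask's start */
        }
    }
}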

- Logan
Jul 25 '06 #11
Logan Shaw said:

<snip>

> Or instead, you can (mostly) combine these into one pass:
>
> for (i = 0; i < end; i += 6)
> {
> /* multiples of 2 */
> a[i + 0] = 0;
> a[i + 2] = 0;
> a[i + 4] = 0;
>
> /* multiples of 3 */
> a[i + 0] = 0;
> a[i + 3] = 0;
> }

Good and readable. Well done. The redundancy stands out, and your next step
is obvious.

> This can be minimized by combining the redundant assignments:

Indeed.

> for (i = 0; i < end; i += 6)
> {
> a[i + 0] = 0;
> a[i + 2] = 0;
> a[i + 3] = 0;
> a[i + 4] = 0;

So far, so good...

> a[i + 5] = 0;

...but this is an optimisation too far.

<snip>

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Jul 25 '06 #12
>> Why should you waste an additional logical operation, when it is
>> possible to shift this work into the initialization step? Create the
>> "mask" as suggested, and then use memcpy() to repeatedly copy it to
>> your sieve memory...
> I was thinking that you'd create one mask for several n's, then another
> mask for a few more n's, then another for some more, and so on.
> For example, if doing a sieve of Eratosthenes, you might do one mask
> of size 2310 for 2, 3, 5, 7, and 11.

I agree. Note that this mask saves you 2310*(1/2 + 1/3 + 1/5 + 1/7 +
1/11) = 2927 strikes for each new sieving block. So you save
approximately 27% of the memory accesses...

> Then you might do another mask of size 4199 for 13, 17, and 19. And so on.

Theoretically you could continue this way. Practically this would save
you only 4199*(1/13 + 1/17 + 1/19) = 791 strikes per block, while you
introduce a "logical and" over 4199 memory cells. Therefore you have a
factor of 4.3 times as many memory accesses as the original algorithm.
It gets even worse when you continue this way. All in all, this
technique is only helpful for the first few primes...

> Come to think of it

I hope you do too. :-)

Christian

Jul 25 '06 #13
On Tue, 25 Jul 2006 03:34:47 GMT
Logan Shaw <ls**********@austin.rr.com> wrote:
> Now, here's what I think is the fun part. Once you've established
> that pattern that repeats after 6 elements, you can, instead of
> performing all the assignment statements separately, create a mask:
>
> const int lcm = 6;
> bool mask[lcm];
>
> for (i = 0; i < lcm; i++) { mask[i] = 1; }

At this point bit flags start getting very tempting.
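
A minimal bit-per-number sketch, in case it helps (sizes illustrative;
std::bitset or vector<bool> would be the tidier C++ spelling):

#include <string.h>

enum { LIMIT = 1000000 };
static unsigned char bits[(LIMIT + 7) / 8];   /* one bit per candidate, 1/8 the memory */

void set_all(void)        { memset(bits, 0xFF, sizeof bits); }
void clear_bit(long i)    { bits[i >> 3] &= (unsigned char)~(1u << (i & 7)); }
int  test_bit(long i)     { return (bits[i >> 3] >> (i & 7)) & 1; }

/* Strike the multiples of n, as in the original loop. */
void strike(long n)
{
    for (long i = n; i < LIMIT; i += n)
        clear_bit(i);
}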

--
C:>WIN | Directable Mirror Arrays
The computer obeys and wins. | A better way to focus the sun
You lose and Bill collects. | licences available see
| http://www.sohara.org/
Jul 25 '06 #14
Thanks for all the input! I actually need all coprime numbers to a
given n ranging from 1 to LIMIT.

> The good thing about this, obviously, is that the inner loop in the
> code just above can be optimized to DEATH. Your operating system
> or some readily available library may in fact already provide a
> routine to AND two regions of memory together efficiently. You
> can duplicate the mask until it is some convenient size. You can
> make its size a multiple of 8 or 16 or 32 if that helps you align
> it in a useful way. And you can combine multiple different n's
> together to produce one mask.

What ways are there to AND/OR two regions of memory in C++?
I just recently found the bitfield data structure... would it be
advantageous to use this as a basic data structure for this program?

Jul 25 '06 #15
Christian Siebert wrote:
> Theoretically you could continue this way. Practically this would save
> you only 4199*(1/13 + 1/17 + 1/19) = 791 strikes per block, while you
> introduce a "logical and" over 4199 memory cells. Therefore you have a
> factor of 4.3 times as many memory accesses as the original algorithm.
I'm thinking in terms of real hardware. Most modern computers have
cache lines well larger than a single byte, even significantly larger
than a single word, and most modern computers have memory controllers
that can perform bursts of sequential memory access much faster
than random access to the same number of bytes. The net result is
that this:

char a[10000];

for (i = 0; i < 10000; i++)
{
a[i] = 0;
}

and this:

char a[10000];

for (i = 0; i < 10000; i += 2)
{
a[i] = 0;
}

might generate nearly-identical numbers of memory accesses on
certain machines. Even replacing "i += 2" with "i += 16" might
not reduce the actual number of memory accesses.

Since a sieve is inherently demanding of memory bandwidth (unless the
entire sieve is small enough to fit in cache), and since many
processors' raw computing power far outstrips their ability to access
memory not in the cache (especially the case on desktop machines),
the limiting factor is likely to be memory access. Specifically,
the fewer passes through memory you do, the better.

I would definitely agree that if you take the classical model of a
computer where every address in memory takes equal time to access,
then the technique I was describing doesn't benefit you that much.
For one thing, it requires reading the memory when you do the logical
"or", not just writing it (as a traditional sieve would), and for
another, it doesn't really save that many modifications to the sieve.
But in reality, if you can reduce the number of passes through
memory by a factor of 2 or 3, on real hardware that could be a big
win in terms of actual execution time. If you can save even more
by building up a whole set of masks, you might be able to reduce
the number of passes to one or two total passes, which could be a
huge performance win.

So, I think it all depends on how you model the machine...

- Logan
Jul 26 '06 #16
