P: n/a

For a while now i have been "playing" with a little C program to
compute the factorial of large numbers. Currently it takes aboy 1
second per 1000 multiplications, that is 25000P1000 will take about a
second. It will be longer for 50000P1000 as expected, since more digits
will be in the answer. Now, on the Num Analyses forum/Group there is a
post reaporting that that person wrot java code that computed 1000000!
in about a second. That is about 10000 times faste than I would expect
my code to do it. So the two possiblilities are:
1) I am doing something terribly wrong
2) The othr person is lying
At the moment i am inclined to believe that its number 1.
I am posing my code below, I would like to hear your opinions about
why it is slow and how i can improove its speed.
I know that there are public BIGNUM libraries which are already
optimized for such calculations, but I dont want to use them, bcause i
want to approach this problem on a lower level. I am mostly interested
to find out how to get this code perform faster or what alternative
algorythms i should consider. The factorial calculation is just a test
program.
===================start paste==========================
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define al 1024*20
#define base 1000
typedef long int IntegerArrayType;
struct AEI{
IntegerArrayType data[al];
long int digits;
};
void pack(IntegerArrayType i, struct AEI *N1);
void Amult(struct AEI * A, struct AEI * B, struct AEI * C);
void AEIprintf(struct AEI * N1);
void AEIfprintf(FILE * fp, struct AEI * N1);
int main(void)
{
struct AEI *N1, *MO, *Ans;
long i = 0, j = 0, ii, NUM, iii;
FILE *ff;
N1=malloc(sizeof(struct AEI));
MO=malloc(sizeof(struct AEI));
Ans=malloc(sizeof(struct AEI));
while (i < al){
N1>data[i] = 0;
MO>data[i] = 0;
Ans>data[i]=0;
i++;
}
printf("Enter integer to Factorialize: ");
scanf("%ld", &NUM);
pack(1, N1);
pack(1, Ans);
ff = fopen("Results.txt", "w");
printf("you entered: %ld", NUM);
i=1;
while(i < NUM ){
iii=0;
while(iii<NUM && iii<1000){
ii = 1;
while (ii < al)
{
MO>data[ii] = 0;
ii++;
}
pack((i+iii), MO);
Amult(N1, MO, N1);
iii++;
}
i+=iii;
Amult(Ans, N1, Ans);
printf("\nProgress is: %d",i);
pack(1, N1);
}
if(ff!=NULL){
fprintf(ff,"\n%d\n",i1);
AEIfprintf(ff, Ans);
}
fclose(ff);
printf("\nProgress: 100\%");
return 0;
}
void AEIprintf(struct AEI *N1){
float fieldLength;
double temp;
char format1[8];
long j, FL0;
j = N1>digits1;
FL0=(long)log10((float)base);
fieldLength = (float)log10((float)base);
temp = modf(fieldLength, &fieldLength);
format1[0] = '%';
format1[1] = '0';
format1[2] = fieldLength + 48;
format1[3] = 'd';
format1[4] = 0x00;
printf("%*d", FL0, N1>data[j]);
j;
while (j >= 0)
{
printf(format1, N1>data[j]);
j;
}
return;
}
void AEIfprintf(FILE * fp, struct AEI *N1){
long j = N1>digits1;
double fieldLength, temp;
char format0[8], format1[8];
fieldLength = (int)log10(base);
temp = modf(fieldLength, &fieldLength);
format0[0] = '%';
format0[1] = fieldLength + 48;
format0[2] = 'd';
format0[3] = 0x00;
format1[0] = '%';
format1[1] = '0';
format1[2] = fieldLength + 48;
format1[3] = 'd';
format1[4] = 0x00;
fprintf(fp,format0, N1>data[j]);
j;
while (j >= 0){
fprintf(fp, format1, N1>data[j]);
j;
}
return;
}
void pack(IntegerArrayType i, struct AEI * N1)
{
long t = 1, i1, j = 0;
while (t == 1){
i1 = i % base;
N1>data[j] = i1;
i = (i  i1) / base;
j++;
if (i == 0)
t = 0;
}
N1>digits=j;
return;
}
void Amult(struct AEI * A, struct AEI * B, struct AEI * C){
/*C = A * B; */
long i, ii,d, result, carry=0, digits=0;
struct AEI *Ans;
Ans=malloc(sizeof(struct AEI));
i=0;
d= (A>digits+B>digits1);
while(i<d){
Ans>data[i]=carry;
carry=0;
ii=0;
while(ii<=i){
if(B>data[ii]!=0){
Ans>data[i]+=A>data[iii]*B>data[ii];
carry+=Ans>data[i]/base;
Ans>data[i]=Ans>data[i]%base;
}
ii++;
}
carry+=Ans>data[i]/base;
Ans>data[i]=Ans>data[i]%base;
i++;
}
if(carry!=0){
d++;
Ans>data[i]=carry;
}
C>digits=d;
i=0;
while(i<d){
C>data[i]=Ans>data[i];
i++;
}
return;
}
====================end paste============================
I tried to indent the code with spaces instead of tabs, but if some
parts end up not properly indented, I hope no one will hold it against
me.
Thanks ahead  
Share this Question
P: n/a

fermineutron wrote:
For a while now i have been "playing" with a little C program to
compute the factorial of large numbers. Currently it takes aboy 1
second per 1000 multiplications, that is 25000P1000 will take about a
second. It will be longer for 50000P1000 as expected, since more digits
will be in the answer. Now, on the Num Analyses forum/Group there is a
post reaporting that that person wrot java code that computed 1000000!
in about a second. That is about 10000 times faste than I would expect
my code to do it. So the two possiblilities are:
1) I am doing something terribly wrong
2) The othr person is lying
At the moment i am inclined to believe that its number 1.
I am posing my code below, I would like to hear your opinions about
why it is slow and how i can improove its speed.
Before anyone reviews the code, have you profiled it?
If not, why not? If you have, where were the bottlenecks?

Ian Collins.  
P: n/a

fermineutron:
Now, on the Num Analyses forum/Group there is a
post reaporting that that person wrot java code that computed 1000000!
in about a second. That is about 10000 times faste than I would expect
my code to do it. So the two possiblilities are:
1) I am doing something terribly wrong
2) The othr person is lying
The burden of proof is on the Java dude.
Probability suggests that he's lying, because only a very small proportion of
proficient programmers waste their time on mickeymouse holdmyhand
languages such as Java. Also, Java is slow.
Either that or the algorithm's something stupid like:
char const *Func(void)
{
return "2349507239573258471257975093275092375093275923875 09";
}
Ask the Java dude if you can see his code. If he refuses, assume that he's a
liar, then egg his house.

Frederick Gotham  
P: n/a

Ian Collins wrote:
Before anyone reviews the code, have you profiled it?
If not, why not? If you have, where were the bottlenecks?
I profilled it, but there were no obvious bottlenecks which i would not
anticipate to be there by design.
here is the profiler output http://igorpetrusky.awardspace.com/Temp/RunStats.html
I was thinking that maybe there is some other algorythm that is better
than mine for the long int arithemetic?
Factorial calculation is just a driver for the multiplication function.  
P: n/a

On 1 Nov 2006 14:28:20 0800, "fermineutron" <fr**********@yahoo.com>
wrote:
>For a while now i have been "playing" with a little C program to compute the factorial of large numbers. Currently it takes aboy 1 second per 1000 multiplications, that is 25000P1000 will take about a second. It will be longer for 50000P1000 as expected, since more digits will be in the answer. Now, on the Num Analyses forum/Group there is a post reaporting that that person wrot java code that computed 1000000! in about a second. That is about 10000 times faste than I would expect my code to do it. So the two possiblilities are: 1) I am doing something terribly wrong 2) The othr person is lying
At the moment i am inclined to believe that its number 1.
I am posing my code below, I would like to hear your opinions about why it is slow and how i can improove its speed.
I know that there are public BIGNUM libraries which are already optimized for such calculations, but I dont want to use them, bcause i want to approach this problem on a lower level. I am mostly interested to find out how to get this code perform faster or what alternative algorythms i should consider. The factorial calculation is just a test program.
I don't know if my comment in Amult addresses the question you voiced
but you need to pay attention to your compiler diagnostics.
> ===================start paste==========================
#include<stdio.h> #include<stdlib.h> #include<math.h>
#define al 1024*20 #define base 1000 typedef long int IntegerArrayType;
struct AEI{
IntegerArrayType data[al];
long int digits;
};
void pack(IntegerArrayType i, struct AEI *N1); void Amult(struct AEI * A, struct AEI * B, struct AEI * C); void AEIprintf(struct AEI * N1); void AEIfprintf(FILE * fp, struct AEI * N1);
int main(void) {
struct AEI *N1, *MO, *Ans; long i = 0, j = 0, ii, NUM, iii; FILE *ff;
N1=malloc(sizeof(struct AEI));
MO=malloc(sizeof(struct AEI));
Ans=malloc(sizeof(struct AEI)); while (i < al){ N1>data[i] = 0; MO>data[i] = 0;
Ans>data[i]=0; i++; }
printf("Enter integer to Factorialize: ");
Please learn to indent consistently. This is like trying to roller
skate on cobblestones.
snip rest of function until
> printf("\nProgress: 100\%");
The correct way to include a '%' in a printf string is %%, not \%.
Didn't your compiler complain about an undefined escape sequence?
> return 0; }
void AEIprintf(struct AEI *N1){
float fieldLength;
double temp; char format1[8];
long j, FL0;
j = N1>digits1; FL0=(long)log10((float)base); fieldLength = (float)log10((float)base); temp = modf(fieldLength, &fieldLength);
This is a constraint violation. The second argument should be a
double*, not a float*. During execution it will invoke undefined
behavior. Didn't your compiler complain about a type mismatch?
>
format1[0] = '%'; format1[1] = '0'; format1[2] = fieldLength + 48;
What is 48. You probably mean '0' and should use that.
> format1[3] = 'd'; format1[4] = 0x00;
While the hex value is correct, '\0' is more in keeping with common C
idiom.
> printf("%*d", FL0, N1>data[j]); j;
while (j >= 0) { printf(format1, N1>data[j]);
j; }
return; }
snip
>void Amult(struct AEI * A, struct AEI * B, struct AEI * C){
/*C = A * B; */ long i, ii,d, result, carry=0, digits=0;
struct AEI *Ans;
Ans=malloc(sizeof(struct AEI));
i=0;
d= (A>digits+B>digits1);
while(i<d){
Ans>data[i]=carry;
carry=0;
ii=0;
while(ii<=i){
if(B>data[ii]!=0){
Ans>data[i]+=A>data[iii]*B>data[ii];
carry+=Ans>data[i]/base;
Ans>data[i]=Ans>data[i]%base;
}
ii++;
}
carry+=Ans>data[i]/base;
Ans>data[i]=Ans>data[i]%base;
i++;
}
if(carry!=0){
d++;
Ans>data[i]=carry;
}
C>digits=d;
i=0;
while(i<d){
C>data[i]=Ans>data[i];
You could have simply said *C = *Ans and let the compiler do this
possibly more efficiently.
Is there a reason you did not do the work in C directly (and avoid the
time spent in malloc and copying *Ans to *C)?
i++;
Why do you go to so much trouble to avoid for loops?
for (i = 0; i < d; i++)
C>...;
} return;
Where do you free Ans? You are causing repeated memory leaks.
>}
====================end paste============================
I tried to indent the code with spaces instead of tabs, but if some parts end up not properly indented, I hope no one will hold it against me.
Thanks ahead
Remove del for email  
P: n/a

Barry Schwarz said:
On 1 Nov 2006 14:28:20 0800, "fermineutron" <fr**********@yahoo.com>
wrote:
<snip>
>> float fieldLength; double temp; char format1[8]; long j, FL0; j = N1>digits1; FL0=(long)log10((float)base); fieldLength = (float)log10((float)base); temp = modf(fieldLength, &fieldLength);
This is a constraint violation. The second argument should be a
double*, not a float*. During execution it will invoke undefined
behavior. Didn't your compiler complain about a type mismatch?
In my experience, fermineutron pays no heed to correctness issues, and is
concerned only with speed. He has a track record of ignoring corrections. I
think you're wasting your time, Barry.

Richard Heathfield
"Usenet is a strange place"  dmr 29/7/1999 http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)  
P: n/a

Hi
The math software in my ti89 calculator with a 12 MHz 68k CPU can
calculate all 613 digits of 299! in 1 second.
It then takes 5 seconds to convert that to a string for display. But
for giggles I once wrote my own display routine that does the integer
to string conversion in 1/3 of a second.
Frederick Gotham wrote:
fermineutron:
Now, on the Num Analyses forum/Group there is a
post reaporting that that person wrot java code that computed 1000000!
in about a second. That is about 10000 times faste than I would expect
my code to do it. So the two possiblilities are:
1) I am doing something terribly wrong
2) The othr person is lying
The burden of proof is on the Java dude.
Probability suggests that he's lying, because only a very small proportion of
proficient programmers waste their time on mickeymouse holdmyhand
languages such as Java. Also, Java is slow.
Either that or the algorithm's something stupid like:
char const *Func(void)
{
return "2349507239573258471257975093275092375093275923875 09";
}
Ask the Java dude if you can see his code. If he refuses, assume that he's a
liar, then egg his house.

Frederick Gotham
 
P: n/a

fermineutron wrote:
Ian Collins wrote:
Before anyone reviews the code, have you profiled it?
If not, why not? If you have, where were the bottlenecks?
Its a factorial calculation. The bottleneck is in the big integer
multiply, which itself should have a bottleneck in the platform's
multiply.
I profilled it, but there were no obvious bottlenecks which i would not
anticipate to be there by design.
here is the profiler output
http://igorpetrusky.awardspace.com/Temp/RunStats.html
I was thinking that maybe there is some other algorythm that is better
than mine for the long int arithemetic?
Probably so. Consider that you are doing nothing more than computing
the straight product of the numbers using no arithmetic short cuts at
all.
So here's what comes off the top of my head: Ask yourself the
following problem. How many factors of 2 are there in 1000000! ?
Certainly every other number is even. But every 4th number has 2
factors of 2, and every 8th number has 3 factors of two in it. So the
answer is:
f(2) = floor(1000000/2) + floor(1000000/4) + floor(1000000/8) + ...
Similarly we can figure out the number of factors of 3s, 5s, 7s, 11s,
and all the primes less than 1000000, as f(3), f(5), etc. Then the
result you are looking for is:
1000000! = pow(2,f(2))*pow(3,f(3))*pow(5,f(5))*...
Now, the question is  what makes us think this will be any faster?
Well, the pow() function can be computed with successive squaring
tricks. Squaring faster than straight multiplying because (a*q^r + b)
^ 2 = a^2*q^(2*r) + b^2 + 2*a*b*(q^r). And the resulting big number
multiplies that you have to perform here can be accelerated using any
number of big number multiply acceleration tricks (Karatsuba,
ToomCook, or DFTs.).
I don't know how much faster, if any, doing things this way would be.
If you find a faster way in the literature, I would be interested in
know it.

Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/  
P: n/a

fermineutron wrote:
Ian Collins wrote:
Before anyone reviews the code, have you profiled it?
If not, why not? If you have, where were the bottlenecks?
Its a factorial calculation. The bottleneck is in the big integer
multiply, which itself should have a bottleneck in the platform's
multiply.
I profilled it, but there were no obvious bottlenecks which i would not
anticipate to be there by design.
here is the profiler output
http://igorpetrusky.awardspace.com/Temp/RunStats.html
I was thinking that maybe there is some other algorythm that is better
than mine for the long int arithemetic?
Probably so. Consider that you are doing nothing more than computing
the straight product of the numbers using no arithmetic short cuts at
all.
So here's what comes off the top of my head: Ask yourself the
following problem. How many factors of 2 are there in 1000000! ?
Certainly every other number is even. But every 4th number has 2
factors of 2, and every 8th number has 3 factors of two in it. So the
answer is:
f(2) = floor(1000000/2) + floor(1000000/4) + floor(1000000/8) + ...
Similarly we can figure out the number of factors of 3s, 5s, 7s, 11s,
and all the primes less than 1000000, as f(3), f(5), etc. Then the
result you are looking for is:
1000000! = pow(2,f(2))*pow(3,f(3))*pow(5,f(5))*...
Now, the question is  what makes us think this will be any faster?
Well, the pow() function can be computed with successive squaring
tricks. Squaring faster than straight multiplying because (a*q^r + b)
^ 2 = a^2*q^(2*r) + b^2 + 2*a*b*(q^r). And the resulting big number
multiplies that you have to perform here can be accelerated using any
number of big number multiply acceleration tricks (Karatsuba,
ToomCook, or DFTs.).
I don't know how much faster, if any, doing things this way would be.
If you find a faster way in the literature, I would be interested in
knowing it.

Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/  
P: n/a

Frederick Gotham wrote:
Probability suggests that he's lying, because only a very small proportion of
proficient programmers waste their time on mickeymouse holdmyhand
languages such as Java. Also, Java is slow.
Can we drop the insults, please?

Chris "unhashedup hashed up hashing" Dollin
"I'm still here and I'm holding the answers"  Karnataka, /Love and Affection/  
P: n/a
 we******@gmail.com writes:
[...]
So here's what comes off the top of my head: Ask yourself the
following problem. How many factors of 2 are there in 1000000! ?
Certainly every other number is even. But every 4th number has 2
factors of 2, and every 8th number has 3 factors of two in it. So the
answer is:
f(2) = floor(1000000/2) + floor(1000000/4) + floor(1000000/8) + ...
Similarly we can figure out the number of factors of 3s, 5s, 7s, 11s,
and all the primes less than 1000000, as f(3), f(5), etc. Then the
result you are looking for is:
1000000! = pow(2,f(2))*pow(3,f(3))*pow(5,f(5))*...
Now, the question is  what makes us think this will be any faster?
Well, the pow() function can be computed with successive squaring
tricks. Squaring faster than straight multiplying because (a*q^r + b)
^ 2 = a^2*q^(2*r) + b^2 + 2*a*b*(q^r). And the resulting big number
multiplies that you have to perform here can be accelerated using any
number of big number multiply acceleration tricks (Karatsuba,
ToomCook, or DFTs.).
Hmm, interesting approach. But does it really speed things up? If
you want to compute, say, 3**1024 with extendedprecision integers,
successive squaring reduces the number of multiplications, but is it
faster than doing the 1023 multiplications? I'm thinking that
multiplying, say, a 499digit number by a 1digit number might be
quicker than multiplying two 250digit numbers.
I haven't really thought this through (feel free to say that that's
obvious).
I don't know how much faster, if any, doing things this way would be.
If you find a faster way in the literature, I would be interested in
know it.

Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.  
P: n/a

Chris Dollin said:
Frederick Gotham wrote:
>Probability suggests that he's lying, because only a very small proportion of proficient programmers waste their time on mickeymouse holdmyhand languages such as Java. Also, Java is slow.
Can we drop the insults, please?
That would at least be a start, but he still needs to make amends for past
insults.

Richard Heathfield
"Usenet is a strange place"  dmr 29/7/1999 http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)  
P: n/a

Keith Thompson said:
<snip>
If
you want to compute, say, 3**1024 with extendedprecision integers,
successive squaring reduces the number of multiplications, but is it
faster than doing the 1023 multiplications? I'm thinking that
multiplying, say, a 499digit number by a 1digit number might be
quicker than multiplying two 250digit numbers.
Sure, but obtaining the 499digit number in the first place is likely to be
slower.
Measure, measure, measure!

Richard Heathfield
"Usenet is a strange place"  dmr 29/7/1999 http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)  
P: n/a

Keith Thompson wrote:
we******@gmail.com writes:
[...]
So here's what comes off the top of my head: Ask yourself the
following problem. How many factors of 2 are there in 1000000! ?
Certainly every other number is even. But every 4th number has 2
factors of 2, and every 8th number has 3 factors of two in it. So the
answer is:
f(2) = floor(1000000/2) + floor(1000000/4) + floor(1000000/8) + ...
Similarly we can figure out the number of factors of 3s, 5s, 7s, 11s,
and all the primes less than 1000000, as f(3), f(5), etc. Then the
result you are looking for is:
1000000! = pow(2,f(2))*pow(3,f(3))*pow(5,f(5))*...
Now, the question is  what makes us think this will be any faster?
Well, the pow() function can be computed with successive squaring
tricks. Squaring faster than straight multiplying because (a*q^r + b)
^ 2 = a^2*q^(2*r) + b^2 + 2*a*b*(q^r). And the resulting big number
multiplies that you have to perform here can be accelerated using any
number of big number multiply acceleration tricks (Karatsuba,
ToomCook, or DFTs.).
Hmm, interesting approach. But does it really speed things up? If
you want to compute, say, 3**1024 with extendedprecision integers,
successive squaring reduces the number of multiplications, but is it
faster than doing the 1023 multiplications? I'm thinking that
multiplying, say, a 499digit number by a 1digit number might be
quicker than multiplying two 250digit numbers.
I haven't really thought this through (feel free to say that that's
obvious).
I'll resist the temptation. Especially, since I only went half way. :)
Anyhow, the issue is not whether or not the squaring trick is saving
you overall performance (it does  this is not controvertial.) The
real controversy with my approach is that its breaking numbers down
into its constituent factors first, then recombining them. I.e., its
faster to multiply 72 by 91 directly then to first break it down into 8
* 9 * 7 * 13. The hope, however, is that the number of primes are
sufficiently less than n (in this case there are about 87K primes under
1 million) to compensate for the cost of computing the power
computation. Notice that the vast majority of the primes will have a
power less than, say, 6, and that in fact the most common power is 1.

Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/  
P: n/a

Frederick Gotham wrote:
fermineutron:
Now, on the Num Analyses forum/Group there is a
post reaporting that that person wrot java code that computed 1000000!
in about a second. That is about 10000 times faste than I would expect
my code to do it. So the two possiblilities are:
1) I am doing something terribly wrong
2) The othr person is lying
The burden of proof is on the Java dude.
Probability suggests that he's lying, because only a very small proportion of
proficient programmers waste their time on mickeymouse holdmyhand
languages such as Java. Also, Java is slow.
Either that or the algorithm's something stupid like:
char const *Func(void)
{
return "2349507239573258471257975093275092375093275923875 09";
}
1 million factorial is a very very very very very very big number. I
assure you that they did not solve the problem in the way you suggest
unless they are rigging against a very specific benchmark. Can you
imagine a library that's 100s of megs larger than necesary just to hold
a table of factorials?
Ask the Java dude if you can see his code. If he refuses, assume that he's a
liar, then egg his house.
Right, because nobody ever writes proprietary code.
In the case of bignum libraries, while I don't know this first hand,
the specification and architecture of the Java language itself suggests
very strongly to me that their bignum performance should be equal to
the best bignum libraries in existence, and faster than any portable
pure C bignum library. Again, I don't know this from direct firsthand
knowledge, but there is very strong possibility that you are completely
wrong on this point. Java may be very slow at many things, but if its
fast at anything, bignum is probably one of the prime candidates.

Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/  
P: n/a

* fermineutron <fr**********@yahoo.comwrote:
For a while now i have been "playing" with a little C program to
compute the factorial of large numbers. Currently it takes aboy 1
second per 1000 multiplications, that is 25000P1000 will take about a
second. It will be longer for 50000P1000 as expected, since more digits
will be in the answer. Now, on the Num Analyses forum/Group there is a
post reaporting that that person wrot java code that computed 1000000!
in about a second. That is about 10000 times faste than I would expect
my code to do it. So the two possiblilities are:
1) I am doing something terribly wrong
2) The othr person is lying
At the moment i am inclined to believe that its number 1.
Yes. I'm pretty sure that "the person" used the PrimeSchönhage algorithm and
the Java port of the apfloat library.
Using the C++ version of apfloat I get these times for 1000000! :
builtin factorial(): 45s
PrimeSchönhage: 15s
All on a Celeron 1.2 GHz. Apfloat's number theoretic transforms probably perform
much faster on a 64 bit platform.
Stefan Krah  
P: n/a
 we******@gmail.com wrote:
fermineutron wrote:
>Ian Collins wrote:
>>Before anyone reviews the code, have you profiled it? If not, why not? If you have, where were the bottlenecks?
Its a factorial calculation. The bottleneck is in the big integer
multiply, which itself should have a bottleneck in the platform's
multiply.
>I profilled it, but there were no obvious bottlenecks which i would not anticipate to be there by design.
here is the profiler output
http://igorpetrusky.awardspace.com/Temp/RunStats.html
I was thinking that maybe there is some other algorythm that is better than mine for the long int arithemetic?
Probably so. Consider that you are doing nothing more than
computing the straight product of the numbers using no arithmetic
short cuts at all.
So here's what comes off the top of my head: Ask yourself the
following problem. How many factors of 2 are there in 1000000! ?
Certainly every other number is even. But every 4th number has 2
factors of 2, and every 8th number has 3 factors of two in it.
So the answer is:
f(2) = floor(1000000/2) + floor(1000000/4) + floor(1000000/8) + ...
Similarly we can figure out the number of factors of 3s, 5s, 7s,
11s, and all the primes less than 1000000, as f(3), f(5), etc.
Then the result you are looking for is:
1000000! = pow(2,f(2))*pow(3,f(3))*pow(5,f(5))*...
Now, the question is  what makes us think this will be any
faster? Well, the pow() function can be computed with successive
squaring tricks. Squaring faster than straight multiplying
because (a*q^r + b) ^ 2 = a^2*q^(2*r) + b^2 + 2*a*b*(q^r). And
the resulting big number multiplies that you have to perform
here can be accelerated using any number of big number multiply
acceleration tricks (Karatsuba, ToomCook, or DFTs.).
I don't know how much faster, if any, doing things this way would
be. If you find a faster way in the literature, I would be
interested in know it.
This doesn't answer the 'faster' question, but your method is
basically the same as my algorithm for computing large factorials
in limited size registers, which I published here over 3 years ago.
/* compute factorials, extended range
on a 32 bit machine this can reach fact(15) without
unusual output formats. With the prime table shown
overflow occurs at 101.
Public domain, by C.B. Falconer. 20030622
*/
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
/* 2 and 5 are handled separately
Placing 2 at the end attempts to preserve such factors
for use with the 5 factor and exponential notation
*/
static unsigned char primes[] = {3,7,11,13,17,19,23,29,31,37,
41,43,47,53,57,59,61,67,71,
/* add further primes here >*/
2,5,0};
static unsigned int primect[sizeof primes]; /* = {0} */
static double fltfact = 1.0;
/*  */
static
unsigned long int fact(unsigned int n, unsigned int *zeroes)
{
unsigned long val;
unsigned int i, j, k;
#define OFLOW ((ULONG_MAX / j) < val)
/* This is a crude mechanism for passing back values */
for (i = 0; i < sizeof primes; i++) primect[i] = 0;
for (i = 1, val = 1UL, *zeroes = 0; i <= n; i++) {
fltfact *= i; /* approximation */
j = i;
/* extract exponent of 10 */
while ((0 == (j % 5)) && (!(val & 1))) {
j /= 5; val /= 2;
(*zeroes)++;
}
/* Now try to avoid any overflows */
k = 0;
while (primes[k] && OFLOW) {
/* remove factors primes[k] */
while (0 == (val % primes[k]) && OFLOW) {
val /= primes[k];
++primect[k];
}
while (0 == (j % primes[k]) && OFLOW) {
j /= primes[k];
++primect[k];
}
k++;
}
/* Did we succeed in the avoidance */
if (OFLOW) {
#if DEBUG
fprintf(stderr, "Overflow at %u, %lue%u * %u\n",
i, val, *zeroes, j);
#endif
val = 0;
break;
}
val *= j;
}
return val;
} /* fact */
/*  */
int main(int argc, char *argv[])
{
unsigned int x, zeroes;
unsigned long f;
if ((2 == argc) && (1 == sscanf(argv[1], "%u", &x))) {
if (!(f = fact(x, &zeroes))) {
fputs("Overflow\n", stderr);
return EXIT_FAILURE;
}
printf("Factorial(%u) == %lu", x, f);
if (zeroes) printf("e%u", zeroes);
for (x = 0; primes[x]; x++) {
if (primect[x]) {
printf(" * pow(%d,%d)", primes[x], primect[x]);
}
}
putchar('\n');
printf("or approximately %.0f.\n", fltfact);
return 0;
}
fputs("Usage: fact n\n", stderr);
return EXIT_FAILURE;
} /* main */

Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>  
P: n/a

CBFalconer wrote:
we******@gmail.com wrote:
fermineutron wrote:
Ian Collins wrote:
Before anyone reviews the code, have you profiled it? If not, why not? If you have, where were the bottlenecks?
Its a factorial calculation. The bottleneck is in the big integer
multiply, which itself should have a bottleneck in the platform's
multiply.
I profilled it, but there were no obvious bottlenecks which i
would not anticipate to be there by design.
here is the profiler output
http://igorpetrusky.awardspace.com/Temp/RunStats.html
I was thinking that maybe there is some other algorythm that is
better than mine for the long int arithemetic?
Probably so. Consider that you are doing nothing more than
computing the straight product of the numbers using no arithmetic
short cuts at all.
So here's what comes off the top of my head: Ask yourself the
following problem. How many factors of 2 are there in 1000000! ?
Certainly every other number is even. But every 4th number has 2
factors of 2, and every 8th number has 3 factors of two in it.
So the answer is:
f(2) = floor(1000000/2) + floor(1000000/4) + floor(1000000/8) + ...
Similarly we can figure out the number of factors of 3s, 5s, 7s,
11s, and all the primes less than 1000000, as f(3), f(5), etc.
Then the result you are looking for is:
1000000! = pow(2,f(2))*pow(3,f(3))*pow(5,f(5))*...
Now, the question is  what makes us think this will be any
faster? Well, the pow() function can be computed with successive
squaring tricks. Squaring faster than straight multiplying
because (a*q^r + b) ^ 2 = a^2*q^(2*r) + b^2 + 2*a*b*(q^r). And
the resulting big number multiplies that you have to perform
here can be accelerated using any number of big number multiply
acceleration tricks (Karatsuba, ToomCook, or DFTs.).
I don't know how much faster, if any, doing things this way would
be. If you find a faster way in the literature, I would be
interested in know it.
This doesn't answer the 'faster' question, but your method is
basically the same as my algorithm for computing large factorials
in limited size registers, which I published here over 3 years ago.
Your solution computes a mathematically equivalent reexpression, but
the similarities end there. The differences start where the
*algorithms* begin. Your algorithm counts each factor individually
(++primect[k]), paying a divide and modulo penalty for each occurring
factor for each number one by one. As you can see in my formulations
above, calculating each f(p) will perform log(p,n) divides and no
modulos. So yeah, your solution definately doesn't answer the 'faster'
question.
/* compute factorials, extended range
on a 32 bit machine this can reach fact(15) without
unusual output formats. With the prime table shown
overflow occurs at 101.
The OP asked for 1000000!. You're off by 4 orders of magnitude. For
factorials that small, why wouldn't you just create a table?
Public domain, by C.B. Falconer. 20030622
*/
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
/* 2 and 5 are handled separately
Placing 2 at the end attempts to preserve such factors
for use with the 5 factor and exponential notation
*/
static unsigned char primes[] = {3,7,11,13,17,19,23,29,31,37,
41,43,47,53,57,59,61,67,71,
/* add further primes here >*/
2,5,0};
Ok, seriously dude. Do I need to say "640K ought to be enough for
anyone" yet again to you? What's your problem with prime number
sequences anyways? This is the second time I've seen you express them
in a miniscule finite table, for problems that require an unbounded
above sequence of them.

Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/  
P: n/a

fermineutron wrote:
For a while now i have been "playing" with a little C program to
compute the factorial of large numbers.
You asked this question before, and I gave you the answer.
You're doing one digit at a time. Doing it with 32 bits at a time will
be at least 400,000,000 times faster. Why,? because with 32 bits you
can store numbers up to about 4 billion, AND you don't need to do a
divide and a mod each time.  
P: n/a

Ancient_Hacker wrote:
You asked this question before, and I gave you the answer.
You're doing one digit at a time. Doing it with 32 bits at a time will
be at least 400,000,000 times faster. Why,? because with 32 bits you
can store numbers up to about 4 billion, AND you don't need to do a
divide and a mod each time.
I am doing 3 digits at the time, see the base constant at the top of a
file. Theoretically i could use 5 digits, or if to make the code only
compatible with C99 hence using 64bit long long int i could increase
base even higher. The restraint on the base is that base^2 must fit in
largest int type avaiable.  
P: n/a

>Barry Schwarz wrote:
Is there a reason you did not do the work in C directly (and avoid the
time spent in malloc and copying *Ans to *C)?
I do not work in C directly to allow the caller to use the same
argument for one of the multipliers as the result. Doing calculations
in C directly will screw up the math.
Where do you free Ans? You are causing repeated memory leaks.
OOPS, it did not copy over. but i did have it in the code.  
P: n/a

fermineutron wrote:
Ancient_Hacker wrote:
You asked this question before, and I gave you the answer.
You're doing one digit at a time. Doing it with 32 bits at a time will
be at least 400,000,000 times faster. Why,? because with 32 bits you
can store numbers up to about 4 billion, AND you don't need to do a
divide and a mod each time.
I am doing 3 digits at the time
Good! Then the speedup will only be by a factor of 400,000 times.
I would approach it this way:
First go to using base 2. Shouldnt be hard, just change all your 10's
to 2's.
Then get the code going as well as before, only using base 2.
Then you can try stepping up the number of digits from 3 up to whatever
will fit in your datatype. Better stop a few bits short the first time
to avoid problems with the sign bit.
Then notice that all your divides and mod's by 2^digits can be replaced
by shifts and masks. Although a good compiler will do this for you.
So by using most of the datatype width, and removing all the divs and
mods, things should get super speedy.  
P: n/a

Ancient_Hacker wrote:
First go to using base 2. Shouldnt be hard, just change all your 10's
to 2's.
Then get the code going as well as before, only using base 2.
Then you can try stepping up the number of digits from 3 up to whatever
will fit in your datatype. Better stop a few bits short the first time
to avoid problems with the sign bit.
Then notice that all your divides and mod's by 2^digits can be replaced
by shifts and masks. Although a good compiler will do this for you.
So by using most of the datatype width, and removing all the divs and
mods, things should get super speedy.
Hmm, i forgot about super speedy shifts, going to rework the code now.
I know there are better agorithms for computing factorial then straight
up multiplication, but it is the bignum arithemetic that i am mostly
curious about so factoriial calculations is just a driver/benchmark.  
P: n/a

On Thu, 2 Nov 2006, Ancient_Hacker wrote:
fermineutron wrote:
>Ancient_Hacker wrote:
>>You asked this question before, and I gave you the answer.
You're doing one digit at a time. Doing it with 32 bits at a time will be at least 400,000,000 times faster. Why,? because with 32 bits you can store numbers up to about 4 billion, AND you don't need to do a divide and a mod each time.
I am doing 3 digits at the time
Good! Then the speedup will only be by a factor of 400,000 times.
I would approach it this way:
First go to using base 2. Shouldnt be hard, just change all your 10's
to 2's.
Then get the code going as well as before, only using base 2.
Then you can try stepping up the number of digits from 3 up to whatever
will fit in your datatype. Better stop a few bits short the first time
to avoid problems with the sign bit.
Then notice that all your divides and mod's by 2^digits can be replaced
by shifts and masks. Although a good compiler will do this for you.
So by using most of the datatype width, and removing all the divs and
mods, things should get super speedy.
Also, there are almost certainly faster ways to multiply than O(n^2).
However, when fermineutron goes to base 2^k, he will need to invest
some time (or Google searching) in a new "AEIfprintf" routine. This
shouldn't be a big deal  his existing routine is totally braindead
and should be shot on sight  but it might take a while to get all
the new bugs out, especially since he's working in an unfamiliar language.
Followups set to comp.programming, since this thread has lost any
connection to the C programming language.
Arthur  
P: n/a

"fermineutron" <fr**********@yahoo.comwrites:
Ancient_Hacker wrote:
[...]
>Then notice that all your divides and mod's by 2^digits can be replaced by shifts and masks. Although a good compiler will do this for you.
Hmm, i forgot about super speedy shifts, going to rework the code now.
[...]
I predict that he will ignore the "Although a good compiler will do
this for you" statement, and just manually replace all his
multiplications and divisions by shift without first checking whether
it's necessary. (I could be wrong, of course.)

Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.  
P: n/a

"fermineutron" <fr**********@yahoo.comwrites:
>>Barry Schwarz wrote: Is there a reason you did not do the work in C directly (and avoid the time spent in malloc and copying *Ans to *C)?
I do not work in C directly to allow the caller to use the same
argument for one of the multipliers as the result. Doing calculations
in C directly will screw up the math.
For those tuning in late, "C" is the name of a variable. I suggest
that, like "a", "C" is a really bad variable name.

Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.  
P: n/a

fermineutron wrote:
For a while now i have been "playing" with a little C program to
compute the factorial of large numbers.
What's the largest factorial you've been able to compute with
your program?  
P: n/a

John Smith wrote:
What's the largest factorial you've been able to compute with
your program?
with old version the larges one i did in real time is 150000
after i swiched to base 1024 i have not tried yet, still stuck on
printing function, i seem to be unable to come up with algoryth to go
from base "base" to base 10 in a long array
Standard aproach such as
NumInBase10=NumInBaseB*(B/10)^N
will not work since N can be several thousand easy.  
P: n/a

Here is the function i have for printing to the screen an array encoded
integer, but it does not work, would someone please fix it. I dont
usually ask people to write code for me completely but i fried my brain
trying to figure this one out today. Dont know whats wrong with me
today, but i just seem to be unable to figure this one out today.
here is the code
=====================Start paste=====================
void AEIprintf(struct AEI *p) {
int fieldLength = (int)log10(base);
long fieldbase=pow(10,fieldLength);
char format[10];
size_t j = p>digits;
long dpf, dpf1;
sprintf(format, "%%0%dd", fieldLength);
dpf = 0;
dpf1 = (p>data[j] + dpf)/fieldbase;
dpf = (p>data[j] + dpf)%fieldbase;
printf("%*d", fieldLength, dpf1);
for (; j 0; j) {
dpf1 = (p>data[j] + dpf)/fieldbase;
dpf = (p>data[j] + dpf)%fieldbase;
printf(format, dpf1);
}
return;
}
=====================End Paste=====================  
P: n/a

can such a printing operation be done at all? I think there may be a
way if to 1st reencode the array into a decimal base and then print,
but honestly speaking i have no idea.  
P: n/a

"fermineutron" <fr**********@yahoo.comwrites:
can such a printing operation be done at all? I think there may be a
way if to 1st reencode the array into a decimal base and then print,
but honestly speaking i have no idea.
You've been here long enough to understand the importance of
providing context in a followup, haven't you?

Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.  
P: n/a

fermineutron wrote:
John Smith wrote:
>What's the largest factorial you've been able to compute with your program?
with old version the larges one i did in real time is 150000
after i swiched to base 1024 i have not tried yet, still stuck on
printing function, i seem to be unable to come up with algoryth to go
from base "base" to base 10 in a long array
Standard aproach such as
NumInBase10=NumInBaseB*(B/10)^N
will not work since N can be several thousand easy.
I suspect this may be beyond you, but take a look at dubldabl.
<http://cbfalconer.home.att.net/download>

Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>  
P: n/a

CBFalconer wrote:
I suspect this may be beyond you, but take a look at dubldabl.
<http://cbfalconer.home.att.net/download>
Correct me if i am wrong but, I dont think that this will work, because
in my case i have array of binary coded numbers but each element in the
array corresponds to a base not equal to 2^bits per field.
That is, my data looks like this
2^70 2^60 2^50 2^40 2^30 2^20 2^10
2^0

 01023  01023  01023  01023  01023  01023  01023  01023


4 bytes 4 bytes 4 bytes 4 bytes 4 bytes 4 bytes 4 bytes 4 bytes


theoretically the range of each field can be half the bits in the
field. This has to be so to allow for an integer datatype to be able to
hold a maximum product of 2 fields.
so if i were to setup something like:

 ASCII Memory  Binary Memory


and shift from right 2 left using algorythm you pointed me to,
shouldn't the unused 2 left bytes of each element in binary memeory
screw up the ASII memory?
Before you posted the post to which i am replying here, i tried to
print the numbers into char array, as follows:
base = 1024
===================Start Code=======================
void AEIprintf(struct AEI *p) {
int fieldLength = (int)log10(base);
long fieldbase=pow(10,fieldLength+1), N=0, carry=0, tbl=0;
char format[10], *buffer;
long j = p>digits;
long i=0;
buffer=malloc(sizeof(struct AEI)*10);
if(buffer!=NULL){
while(i<j){
N=(carry+p>data[i]+p>data[i+1]*base)%fieldbase;
carry=(carry+p>data[i]+p>data[i+1]*base)/fieldbase;
tbl+=sprintf(buffer+tbl,"%ld",N);
i++;
}
printf("\nprinting buffer %d\n", strlen(buffer));
printf("\n%s\n",buffer);
}else{
printf("\nCant Allocate memory\n");
}
return;
}
===================End Code========================
but because the base has integer value in it 1024 the contribution to
units decemal field of resulting number in base 10 will come from all
elements of array in base 1024 i would have to find (p>data[N])*1024^N
which is impossible to do since N is several thousand may be even 10s
of thousands.
It is also the case that even if i were to requre the field size be
half of the bits of the maximun int type, hence allowing to use all
bits in the field to store data, to not wory about separate sign field,
i casted the array as signed, i know its wastefull, but i have not
thought of the way to handle it oherwise, i guess it can be done adding
a sign field into the structure.
what would you seggest about printing function?
The more I am thinking about the more it seems that i should redo it
using unsigned ints using full field width, but i have not yet gathereg
my bravory to do that.  
P: n/a

On Thu, 2 Nov 2006, CBFalconer wrote (in response to fermineutron):
>
I suspect this may be beyond you, but take a look at dubldabl.
<http://cbfalconer.home.att.net/download>
Just for kicks, I implemented doubledabble from scratch based on
your description and added a description and C implementation to
Wikipedia: http://en.wikipedia.org/wiki/Double_dabble
Any comments on the implementation from the nitpickers?
(Other than "check calloc and realloc for NULL", that is. I
consciously decided to omit the errorchecking in a "reference"
implementation, because it isn't relevant to the algorithm itself
and would only distract readers who wanted to translate the algorithm
to some other language. Remember this is meant to strike a harmonious
mean between "real code" and "teaching code".)
Suggestions of a better test driver are particularly solicited.
Also, CBFalconer, do you know where the name "doubledabble" came
from originally, and how long it's been around? Curiously, some
online references use the name "doubledabble" to refer to an
"algorithm" that's little more than an observation about positional
number systems  see the page history of "Double dabble" for that
ridiculousness.
Thanks,
Arthur  
P: n/a

Arthur J. O'Dwyer wrote:
On Thu, 2 Nov 2006, CBFalconer wrote (in response to fermineutron):
I suspect this may be beyond you, but take a look at dubldabl.
<http://cbfalconer.home.att.net/download>
Just for kicks, I implemented doubledabble from scratch based on
your description and added a description and C implementation to
Wikipedia: http://en.wikipedia.org/wiki/Double_dabble
Any comments on the implementation from the nitpickers?
(Other than "check calloc and realloc for NULL", that is. I
consciously decided to omit the errorchecking in a "reference"
implementation, because it isn't relevant to the algorithm itself
and would only distract readers who wanted to translate the algorithm
to some other language. Remember this is meant to strike a harmonious
mean between "real code" and "teaching code".)
Suggestions of a better test driver are particularly solicited.
Also, CBFalconer, do you know where the name "doubledabble" came
from originally, and how long it's been around? Curiously, some
online references use the name "doubledabble" to refer to an
"algorithm" that's little more than an observation about positional
number systems  see the page history of "Double dabble" for that
ridiculousness.
Thanks,
Arthur
Thanks Arthur,
Your code works great, i was able to use it with no problems. Without
YOUR help i would not have gotten it done. Good news is that the
multiplication algo that i changed earlier works correctly and your
code is what made it possible for me to check that. Thanks again.  
P: n/a

Arthur,
Is there any chance you could add to the wikipedia the C code for
reverse transformation of long integers?
That is, given a null terminated string containing characters 09 how
would i transfer it to array of integers in base 2^16 or 2^32?
Thanks ahead.   This discussion thread is closed Replies have been disabled for this discussion.   Question stats  viewed: 2084
 replies: 35
 date asked: Nov 1 '06
