I have a Perl app that calculates the standard deviation of a 4000-element
16-bit integer array with large dynamic content, i.e. the range spans a
significant portion of the 16 bits.
I am trying to increase the performance of this critical loop, and
I've found that I am exceeding the 32-bit registers, causing Perl to
switch to an infinite-precision math library. I've rewritten the loop
such that I now have only one term that overflows 32 bits.
I am at the point of inlining assembler to speed up this loop.
However, the MMX terminology is confusing. I see there are 64-bit MMX
registers that can do multiply and accumulate, but I don't see any
64-bit add instructions. My loop is something like this:
for (i=0; i<4000; i++) { sum += array[i]; }
Of course my loop is somewhat bigger than this, but even this toy
example has terrible performance on real data once sum exceeds 32
bits.
Before I put in the time to inline C (and assembler), are the MMX
registers capable of accumulating into a 64-bit register by adding a
16- (or 32-)bit operand?
By the way, I've already tried the GMP library, and it did not help.
I also recompiled everything with the MMX optimizations turned on. Oh
yeah, this is on Linux.
Brian K. Michalk wrote:
> I have a perl app that is calculating the standard deviation of a 4000 element 16 bit integer array, that has large dynamic content. I.e, the range spans a significant portion of the 16 bits.
> I am trying to increase the performance of this critical loop, and I've found that I am exceeding the 32 bit registers causing Perl to switch to an infinite precision math library. I've rewritten the loop such that I now have only one term that overflows 32 bits.
Sorry, but that's bogus:
The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
cannot have more than 16+12 = 28 bits.
There must be some other problem.
Terje

<Te************@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

mi*****@awpi.com (Brian K. Michalk) writes:
> I have a perl app that is calculating the standard deviation of a 4000 element 16 bit integer array, that has large dynamic content. I.e, the range spans a significant portion of the 16 bits.
> I am trying to increase the performance of this critical loop, and I've found that I am exceeding the 32 bit registers causing Perl to switch to an infinite precision math library. I've rewritten the loop such that I now have only one term that overflows 32 bits.
> I am at the point of inlining assembler to speed up this loop. However, the MMX terminology is confusing. I see there are 64bit MMX registers that can do multiply and accumulate, but I don't see any 64bit add commands. My loop is something like this: for (i=0; i<4000; i++) { sum += array[i]; } Of course my loop is slightly bigger than this, but even this toy example has terrible performance on real data when sum exceeds 32 bits.
> Before I go into the time of inlining C (and assembler), are the MMX registers capable of accumulating a 64 bit register with adding a 16 (or 32) bit operand?
> By the way, I've already tried the GMP library, and it did not help. I also recompiled everything with the MMX optimizations turned on. Oh yeah, this is on Linux.
Sounds like you need a better algorithm for calculating the standard
deviation. The Welford update formula updates the mean (which can be 16
bits) rather than using the usual formula of sum/n (where the sum needs
16 + log2(n) bits).
For example, have a look at this code from
http://www.cs.trinity.edu/~joldham0/...varWelford.pl
#!/usr/bin/perl -w
##
## Oldham, Jeffrey D.
## 2000Mar13
## util
##
## Compute the Mean and Variance Using Welford's Formula
##
## Compute the mean and variance of a stream of floating-point
## numbers.  The numbers should appear at the beginning of each line.

$mean = 0.0;
$variance = 0.0;
$count = 0;

# Compute the mean and variance.
while (<ARGV>) {
    ($number, $_) = split;
    ++$count;
    $diff = $number - $mean;
    $mean += $diff / $count;
    $variance += $diff * $diff * ($count - 1) / $count;
}
$variance /= $count;

printf("The mean is %g.\n", $mean);
printf("The variance is %g.\n", $variance);
printf("The standard deviation is %g.\n", sqrt($variance));
Also, if you're desperate for speed then, since the standard deviation is
only an estimate, you can double the speed of your code by calculating the
standard deviation of every other number (i.e. use 2000 of them) and it will
be a pretty good estimate. You could probably use just a few hundred samples
and still get a very good approximation.
P.S.
If you do code this algorithm in MMX, please post it here as more people
should understand how to calculate means and SDs with this more accurate
formula.

"Thinks: I can't think of a thinks. End of thinks routine": Blue Bottle
** Aunty Spam says: Remove the trailing x from the To: field to reply **
Terje Mathisen <te************@hda.hydro.com> wrote in message news:<bk**********@osl016lin.hda.hydro.com>...
> Brian K. Michalk wrote:
>> I have a perl app that is calculating the standard deviation of a 4000 element 16 bit integer array, that has large dynamic content. I.e, the range spans a significant portion of the 16 bits.
>> I am trying to increase the performance of this critical loop, and I've found that I am exceeding the 32 bit registers causing Perl to switch to an infinite precision math library. I've rewritten the loop such that I now have only one term that overflows 32 bits.
> Sorry, but that's bogus:
> The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits, cannot have more than 16+12 = 28 bits.
> There must be some other problem.
> Terje
Uhh, you are right; the problem is in my example. I am indeed working
with addition of 32-bit integers; the A/D data is still 16 bits. The
loop is perhaps a little too complicated here, but it is a trace of
4000 points. I have a 400-point averaging (high-pass) window that
iterates over the entire 4000-element trace. My goal is to find an A/D
amplitude (alpha) such that 5% of the points in the averaging window
are above alpha.
Brute force works just fine, except that it's a really bad choice. If
I take the standard deviation of the moving window, and then apply the
z-transform for 5%, out pops the alpha that I need to filter the data.
It's an order-N filter, and is blazingly fast for low-amplitude
signals, because they stay within 32-bit operations. For large
deviations, the math exceeds 32 bits. Perhaps there is another
"almost as good" method for calculating the standard deviation?
I'm using (integer math):

$std = sqrt(
    (($n * $sumsq) - ($sum ** 2)) /
    ($n ** 2)
);

Which I have reduced in the interest of overflows to:

$std = sqrt(
    ($sumsq / $n) - (($sum / $n) ** 2)
);

$sum = 0;
$sumsq = 0;
&some_routine_that_primes_filter_here();  # prime sum and sumsq
# now iterate over the trace
for ($i = 0; $i < 3950; $i++) {
    $sum   += $ary[$i];
    $sum   -= $ary[$i + 50];
    $sumsq += ($ary[$i] ** 2);
    $sumsq -= ($ary[$i + 50] ** 2);
    # calculate standard deviation here
    # take the z-transform here
    # use the result to filter here
}
But my question still stands on the MMX issue. Will MMX registers do
this? It's more of an assembler question, but if there's an
alternative that will keep the data in 32 bits I'm interested in that
as well.
Brian K. Michalk wrote:
> Terje Mathisen <te************@hda.hydro.com> wrote in message news:<bk**********@osl016lin.hda.hydro.com>...
>> Sorry, but that's bogus:
>> The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits, cannot have more than 16+12 = 28 bits.
>> There must be some other problem.
> Uhh, you are right. Problem in my example. I am indeed working with addition of 32 bit integers, the A/D data is still 16 bits. [...]
> I'm using (integer math): $std = sqrt((($n*$sumsq)-($sum**2))/($n**2)) Which I have reduced in the interest of overflows to: $std = sqrt(($sumsq/$n)-(($sum/$n)**2))
> for ($i=0; $i<3950; $i++) { $sum += $ary[$i]; $sum -= $ary[$i+50]; $sumsq += ($ary[$i]**2); $sumsq -= ($ary[$i+50]**2); ... }
> But my question still stands on the MMX issue. Will MMX registers do this? It's more of an assembler question, but if there's an alternative that will keep the data in 32 bits I'm interested in that as well.
Since you're posting to c.l.a.x86, you're really wanting high speed
here, right?
Your inner loop seems kind of strange: You are adding in $ary[$i] and
subtracting $ary[$i+50], instead of the opposite: Is this intentional?
Assuming the logic is correct, it should easily run in about 10-15
cycles/iteration using integer registers on a PIII, and at about the same
speed or a little slower with FP on a P4.
With just 4000 table entries, I'd do the square calculation only once,
instead of twice, which makes it easy to overlap multiple iterations to
hide the latency of the multiplications.
The 64-bit intermediate values with integer ops would be held in a pair
of registers, while FP just works.
Terje

 <Te************@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
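[Terje's square-once-then-slide idea can be sketched in C before committing to assembler. A minimal sketch follows; `N`, `WIN`, and `window_stats` are illustrative names, and `WIN` follows the 50-sample offset in the posted loop rather than the 400-point window the poster described.]

```c
#include <stdint.h>

#define N    4000
#define WIN  50   /* window length implied by the posted loop */

/* Sliding-window sum and sum-of-squares with 64-bit accumulators.
 * Each sample is squared exactly once into sq[]; each slide is then
 * two adds and two subtracts, which a 32-bit x86 compiler can lower
 * to add/adc and sub/sbb register pairs. */
void window_stats(const int16_t *a, int64_t *sum_out, int64_t *sumsq_out)
{
    static int64_t sq[N];
    for (int i = 0; i < N; ++i)
        sq[i] = (int64_t)a[i] * a[i];

    int64_t sum = 0, sumsq = 0;
    for (int i = 0; i < WIN; ++i) {   /* prime the first window */
        sum   += a[i];
        sumsq += sq[i];
    }
    sum_out[0] = sum;
    sumsq_out[0] = sumsq;

    /* slide: drop a[i-1], add a[i+WIN-1] */
    for (int i = 1; i <= N - WIN; ++i) {
        sum   += a[i + WIN - 1] - a[i - 1];
        sumsq += sq[i + WIN - 1] - sq[i - 1];
        sum_out[i] = sum;
        sumsq_out[i] = sumsq;
    }
}
```

The overlap-multiple-iterations trick Terje mentions for hiding multiply latency is omitted here; the point is only the single pass over the squares.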
MMX does not support 64-bit additions. SSE2 adds the paddq instruction,
which can do 64-bit addition in the MMX registers, but it is only available
on the Pentium 4 and Opteron/Athlon 64.
You can do very fast CPU-neutral 64-bit arithmetic with add/adc and sub/sbb.
If Perl is compiled for 64 bits, it should already be capable of this,
though I'm not sure whether that means it would use 64-bit arithmetic for
everything or just for the things that need it. Certainly multiply, divide,
and square root are non-trivial in MMX.
If you were to write this routine in assembly, then I would accumulate the
integers using general registers and use the x87 with 80-bit precision for a
"perfect" square root. Any C compiler that supports long long should be able
to do the first half, and any C compiler that uses 10-byte long doubles can
do the latter half.
Matt
"Brian K. Michalk" <mi*****@awpi.com> wrote in message news:c3**************************@posting.google.com...
> [Brian's reply, quoted in full earlier in the thread, snipped]
JRS: In article <bk**********@osl016lin.hda.hydro.com>, seen in
news:comp.lang.asm.x86, Terje Mathisen <te************@hda.hydro.com>
posted at Sat, 20 Sep 2003 10:29:17:
> Brian K. Michalk wrote:
>> I have a perl app that is calculating the standard deviation of a 4000 element 16 bit integer array, that has large dynamic content. I.e, the range spans a significant portion of the 16 bits.
>> I am trying to increase the performance of this critical loop, and I've found that I am exceeding the 32 bit registers causing Perl to switch to an infinite precision math library. I've rewritten the loop such that I now have only one term that overflows 32 bits.
> Sorry, but that's bogus:
> The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits, cannot have more than 16+12 = 28 bits.
> There must be some other problem.
Standard deviation is root-mean-square deviation, corrected for
statistics. Therefore 2*16 + 12 = 44 bits. The 16 is actually 15,
because the sign does not matter in squaring.
The standard deviation of the data is exactly defined. But if it is to
be used as an estimate of the SD of an infinite population from which
the 4000 were drawn, then it has an uncertainty of about 1 in root 4000,
say 1 in 64. In that case, a small additional systematic/random error
may be tolerable.
The SD will be dominated by the larger entries. Take, therefore, 256
times the SD of the upper bytes (taken as -128..127) of the data,
possibly rounding in the MSB of the lower byte. We now have less than
2*8 + 12 = 28 < 32 bits.
Needs:
(a) Test it; emulation in a convenient language will do. If plausible,
(b) consult a professional statistician or two.

© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk DOS 3.3, 6.20; Win98. ©
Web <URL:http://www.merlyn.demon.co.uk/>  FAQqish topics, acronyms & links.
PAS EXE TXT ZIP via <URL:http://www.merlyn.demon.co.uk/programs/00index.htm>
My DOS <URL:http://www.merlyn.demon.co.uk/batfiles.htm> - also batprogs.htm.