Hi guys,
(rather lengthy...)
I'm trying to speed up the time spent on a postfilter for video.
YUV 4:2:0 data, each pixel is 1 byte (0-255)
The basic idea is to filter one pixel on each side of an 8-pixel border.
The filter used is a variant of (1,1,4,1,1).
In the example below I do a vertical filtering of line n and the
diff for pixel c1 is calculated as
diff(c1) = a1+b1+(c1<<2)+d1+e1 (1)
c2 as
diff(c2) = a2+b2+(c2<<2)+d2+e2
etc.
Pixel   1  2  3  4

n-2     a1 a2 a3 a4
n-1     b1 b2 b3 b4
n       c1 c2 c3 c4
------- pixel border -------
n+1     d1 d2 d3 d4
n+2     e1 e2 e3 e4
The current implementation reads the values of a1,b1,c1,d1,e1 one byte
at a time, does the calculation and writes back the filtered value for c1.
I.e., something close to the code below:
imdifftmp = *(ImageSrc_p-w2);
imdiff2 = *(ImageSrc_p-w2+1);
...
imdiff8 = *(ImageSrc_p-w2+7);
imdifftmp += *(ImageSrc_p-width);
imdiff2 += *(ImageSrc_p-width+1);
...
imdiff8 += *(ImageSrc_p-width+7);
imdifftmp -= (*(ImageSrc_p)) << 2;
imdiff2 -= (*(ImageSrc_p+1)) << 2;
...
imdiff8 -= (*(ImageSrc_p+7)) << 2;
imdifftmp += *(ImageSrc_p+width);
imdiff2 += *(ImageSrc_p+width+1);
...
imdiff8 += *(ImageSrc_p+width+7);
imdifftmp += *(ImageSrc_p+w2);
imdiff2 += *(ImageSrc_p+w2+1);
...
imdiff8 += *(ImageSrc_p+w2+7);
Not very efficient on a 32-bit machine! What I'm trying to achieve is
to read a 32-bit word containing 4 pixel values, do the calculation
on a whole word and write back a word. After some googling I found
the book "Hacker's Delight" by Henry S. Warren, Jr. He presents such
a method implemented by the two macros below:
//Multibyte Add of 4 1-byte integers packed into a word
#define MBA(x, y, s)\
do{\
s = ((x)&0x7f7f7f7f)+((y)&0x7f7f7f7f); \
s = (((x)^(y))&0x80808080)^s; \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
}while(0)
//Multibyte Subtract of 4 1-byte integers packed into a word
#define MBS(x, y, d)\
do{\
d = ((x)|0x80808080)-((y)&0x7f7f7f7f); \
d = ~((((x)^(y))|0x7f7f7f7f)^d); \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
}while(0)
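For experimenting outside the preprocessor, the two macros can be sketched as plain functions. This is only a sketch: it assumes uint32_t from <stdint.h>, and the names mba and mbs are made up here, not taken from the book.

```c
#include <stdint.h>

/* Add four 8-bit lanes packed in a 32-bit word; each lane wraps mod 256. */
static uint32_t mba(uint32_t x, uint32_t y)
{
    /* Add the low 7 bits of every lane; the cleared bit 7 absorbs the carry. */
    uint32_t s = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);
    /* Fold bit 7 of x and y back in with XOR, so nothing leaks into the next lane. */
    return ((x ^ y) & 0x80808080u) ^ s;
}

/* Subtract four 8-bit lanes (x - y); each lane wraps mod 256. */
static uint32_t mbs(uint32_t x, uint32_t y)
{
    /* Force bit 7 of every lane of x to 1 so a borrow never crosses a lane. */
    uint32_t d = (x | 0x80808080u) - (y & 0x7f7f7f7fu);
    /* Correct bit 7 of every lane. */
    return ~(((x ^ y) | 0x7f7f7f7fu) ^ d);
}
```

Each lane wraps modulo 256, so running the sample values further down through these functions (a+b+d+e, then subtracting c four times) gives 0xB0080808, the same word the macros print.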
He also states that the operation below gives the carry into each
position (where ¤ in this case denotes bitwise exclusive or (^)):
(x+y)¤x¤y
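A small sketch of that carry expression, assuming 32-bit unsigned arithmetic (the function name carry_in is hypothetical). Bit i of the result is set when the plain addition x + y carried into bit i, so bits 8, 16 and 24 show which byte lanes overflowed into their neighbour:

```c
#include <stdint.h>

/* Bit i of the result is 1 when the plain 32-bit addition x + y
   produced a carry INTO bit position i. */
static uint32_t carry_in(uint32_t x, uint32_t y)
{
    return (x + y) ^ x ^ y;
}
```

For x = 0xc7c8c9ca and y = 0xc8c9cacb this gives 0x9F939794, which matches the first carry line in the output further down. Note that it describes the ordinary word-wide addition, not the lane-wise result of MBA.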
These macros work great for small values! The problem is how to handle
the carry so that the correct values after the calculations in (1)
can be extracted. My question (finally!) is:
How can I (if it is possible) handle the carry to recreate the correct
signed integer value after the calculations above?
Some sample code below:
void main(void){
long a1 = 0xc7c8c9ca;
long b1 = 0xc8c9cacb;
long c1 = 0xddc8c9ca;
long d1 = 0xcacbcccd;
long e1 = 0xcbcccdce;
MBA(a1,b1,s1); //a+b
MBA(s1,d1,s2); //+d
MBA(s2,e1,s1); //+e
MBS(s1,c1,s2); //-c (is it possible to do the (c<<2) part smarter?)
MBS(s2,c1,s1); //-c
MBS(s1,c1,s2); //-c
MBS(s2,c1,s1); //-c
//Extract MSB Byte (B0) and add carry stuff...
printf("\nvalue after macros %08lX, value after calc %08lX\n", s1,
0xc7+0xc8-(0xdd<<2)+0xca+0xcb);
}
Gives:
carry 9F939794
carry 1F073F3A
carry B7B9BF9C
carry F8101000
carry BF81879C
carry FF313730
carry 3B818384
value after macros B0080808, value after calc FFFFFFB0
                   ^^                               ^^
                   Same value for different methods
Cheers
//Fredrik  
va*****@linuxmail.org wrote:
> I'm trying to speed up the time spent on a postfilter for video.
> YUV 4:2:0 data, each pixel is 1 byte (0-255)
> In the example below I do a vertical filtering of line n and the
> diff for pixel c1 is calculated as
> diff(c1) = a1+b1+(c1<<2)+d1+e1 (1)
> ...
> Not very efficient on a 32-bit machine! What I'm trying to achieve is
> to read a 32-bit word containing 4 pixel values, do the calculation
> on a whole word and write back a word. After some googling I found
> the book "Hacker's Delight" by Henry S. Warren, Jr. He presents such
> a method implemented by the two macros below:
> //Multibyte Add of 4 1-byte integers packed into a word
> #define MBA(x, y, s)\
> do{\
> s = ((x)&0x7f7f7f7f)+((y)&0x7f7f7f7f); \
> s = (((x)^(y))&0x80808080)^s; \
> // printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
> }while(0)
> //Multibyte Subtract of 4 1-byte integers packed into a word
> #define MBS(x, y, d)\
> do{\
> d = ((x)|0x80808080)-((y)&0x7f7f7f7f); \
> d = ~((((x)^(y))|0x7f7f7f7f)^d); \
> // printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
> }while(0)

Each of the 8-bit fields is added mod 2^8.

> These macros work great for small values! The problem is how to handle
> the carry so that the correct values after the calculations in (1)
> can be extracted. My question (finally!) is:
> How can I (if it is possible) handle the carry to recreate the correct
> signed integer value after the calculations above?
> Some sample code below:
> void main(void){
> long a1 = 0xc7c8c9ca;
> long b1 = 0xc8c9cacb;
> long c1 = 0xddc8c9ca;
> long d1 = 0xcacbcccd;
> long e1 = 0xcbcccdce;
> MBA(a1,b1,s1); //a+b
> MBA(s1,d1,s2); //+d
> MBA(s2,e1,s1); //+e
> MBS(s1,c1,s2); //-c (is it possible to do the (c<<2) part smarter?)
> MBS(s2,c1,s1); //-c
> MBS(s1,c1,s2); //-c
> MBS(s2,c1,s1); //-c
> ...
> value after macros B0080808, value after calc FFFFFFB0
>                    ^^                               ^^
>                    Same value for different methods

As you note, the 8 lsbs are correct. If you can guarantee that the
difference in pixel value over points a-e is less than 32 (so that the
sum of the four differences from c stays within -128..127), you can
simply use the msb as the sign bit. In your example the msb of B0 = 1,
so sign-extend it.
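A minimal sketch of that sign extension, assuming the true result of every lane fits in -128..127 (the helper name lane_signed is made up for illustration):

```c
#include <stdint.h>

/* Extract lane i (0 = least significant byte) from a packed word and
   sign-extend it to int. Only meaningful if the true per-lane result
   lies in -128..127; larger magnitudes have already wrapped. */
static int lane_signed(uint32_t word, unsigned i)
{
    uint32_t byte = (word >> (8u * i)) & 0xffu;
    /* Flip bit 7, then subtract 128: a portable 8-bit sign extension. */
    return (int)(byte ^ 0x80u) - 0x80;
}
```

With the sample word 0xB0080808, lane 3 comes out as -80 and lane 0 as 8.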
If you make no assumptions about value range in the group, then the
range of computed value is -4*255 to 4*255. That requires 11 bits to
uniquely represent each value. You could represent each pixel as 11
bits, with the initial 3 msbs = 0. You could thus pack 2 pixels in a
32-bit word or 5 pixels in a 64-bit word. If you can guarantee a pixel
value difference of less than 128 in each 5-point group, you could get by
with 10 bits/pixel, packing 3 pixels per 32-bit word.
If you choose to use two 11-bit pixels in a 32-bit word, you might as
well pack 2 16-bit values per 32-bit word, which gives easier packing
and unpacking.
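One way the two-values-per-word idea could look, as a sketch under stated assumptions: the filter is a+b+d+e-(c<<2) as in the sample code, each input word already holds two pixels in 16-bit lanes (for example from w & 0x00ff00ff and (w >> 8) & 0x00ff00ff), and a bias is added so no lane ever goes negative. The names filter2, lane16 and LANE_BIAS are invented for this example.

```c
#include <stdint.h>

/* Assumption: any bias >= 4*255 keeps every lane non-negative, so the
   word-wide subtraction never borrows across the lane boundary. */
#define LANE_BIAS 1024

/* Filter two pixels at once. Each argument carries two 8-bit pixel
   values, one in bits 0-7 and one in bits 16-23. */
static uint32_t filter2(uint32_t a, uint32_t b, uint32_t c,
                        uint32_t d, uint32_t e)
{
    const uint32_t bias = ((uint32_t)LANE_BIAS << 16) | LANE_BIAS;
    /* Per lane: at most 4*255 + 1024 = 2044, well below 65536. */
    return a + b + d + e + bias - (c << 2);
}

/* Recover the exact signed result of lane 0 (low) or lane 1 (high). */
static int lane16(uint32_t r, unsigned lane)
{
    return (int)((r >> (16u * lane)) & 0xffffu) - LANE_BIAS;
}
```

Taking the odd bytes of the sample words (0xc7/0xc9 from a1, and so on), the high lane gives -80 and the low lane gives 8, with no wrap-around to undo.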

Thad  

<va*****@linuxmail.org> wrote in message
news:11**********************@e56g2000cwe.googlegroups.com...

Valinor,
I've made some corrections. Don't let those get to you. There are some
useful non-correction-related comments below.

> I'm trying to speed up the time spent on a postfilter for video.
> YUV 4:2:0 data, each pixel is 1 byte (0-255)
> The basic idea is to filter one pixel on each side of an 8-pixel border.
> The filter used is a variant of (1,1,4,1,1).
> In the example below I do a vertical filtering of line n and the
> diff for pixel c1 is calculated as
> diff(c1) = a1+b1+(c1<<2)+d1+e1 (1)
> c2 as
> diff(c2) = a2+b2+(c2<<2)+d2+e2
> etc.
>
> Pixel   1  2  3  4
>
> n-2     a1 a2 a3 a4
> n-1     b1 b2 b3 b4
> n       c1 c2 c3 c4
> ------- pixel border -------
> n+1     d1 d2 d3 d4
> n+2     e1 e2 e3 e4
> <snip>
> //Multibyte Add of 4 1-byte integers packed into a word
> #define MBA(x, y, s)\
> do{\
> s = ((x)&0x7f7f7f7f)+((y)&0x7f7f7f7f); \
> s = (((x)^(y))&0x80808080)^s; \
> // printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\

The C++-style (//) comment ends in a backslash, so it continues onto the
next line and swallows the }while(0); GCC warns about a multi-line
comment. Rewrite like so:

/* printf("\ncarry %08lX", ((x)+(y))^(x)^(y)); */ \

> }while(0)
> //Multibyte Subtract of 4 1-byte integers packed into a word
> #define MBS(x, y, d)\
> do{\
> d = ((x)|0x80808080)-((y)&0x7f7f7f7f); \
> d = ~((((x)^(y))|0x7f7f7f7f)^d); \
> // printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\

Same problem with the C++ comment here. Rewrite like so:

/* printf("\ncarry %08lX", ((x)+(y))^(x)^(y)); */ \

> }while(0)
> He also states that the operation below gives the carry into each
> position (where ¤ in this case denotes bitwise exclusive or (^)):
> (x+y)¤x¤y
> These macros work great for small values! The problem is how to handle
> the carry so that the correct values after the calculations in (1)
> can be extracted. My question (finally!) is:
> How can I (if it is possible) handle the carry to recreate the correct
> signed integer value after the calculations above?

The MBS macro _appears_ (you'll need to confirm) to be calculating two's
complement correctly. This means that the values _should_ be correctly
signed when you extract each byte and cast them from an unsigned variable to
a signed one. This is because most compilers use two's complement for
negative integers.

> Some sample code below:
> void main(void){

#include <stdio.h>
#include <stdlib.h>
int main(void) { /* corrected */

> long a1 = 0xc7c8c9ca;
> long b1 = 0xc8c9cacb;
> long c1 = 0xddc8c9ca;
> long d1 = 0xcacbcccd;
> long e1 = 0xcbcccdce;

long s1,s2; /* missing */

> MBA(a1,b1,s1); //a+b
> MBA(s1,d1,s2); //+d
> MBA(s2,e1,s1); //+e
> MBS(s1,c1,s2); //-c (is it possible to do the (c<<2) part smarter?)
> MBS(s2,c1,s1); //-c
> MBS(s1,c1,s2); //-c
> MBS(s2,c1,s1); //-c
> //Extract MSB Byte (B0) and add carry stuff...
> printf("\nvalue after macros %08lX, value after calc %08lX\n", s1,
> 0xc7+0xc8-(0xdd<<2)+0xca+0xcb);

return(EXIT_SUCCESS); /* corrected */

> }

In (1) above, you _add_ (c1<<2), but here you _subtract_ (c1<<2). Did you
want MBS() or MBA()?

> MBS(s1,c1,s2); //-c (is it possible to do the (c<<2) part smarter?)

Yes, replace the four lines that compute (c1<<2) with (if you wanted MBS,
otherwise change to MBA):

MBS(s1,((c1&0x3f3f3f3f)<<2),s2); //-c (is it possible to do the (c<<2) part smarter?)
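
To see why the mask is needed, here is a small sketch of the per-lane shift on its own (the function name lanes_shl2 is hypothetical). Clearing the top two bits of every byte before shifting stops them from spilling into the neighbouring lane, so each lane ends up as 4*c modulo 256:

```c
#include <stdint.h>

/* Multiply each of the four 8-bit lanes by 4, modulo 256. */
static uint32_t lanes_shl2(uint32_t c)
{
    /* Drop bits 6 and 7 of every lane first; they would otherwise be
       shifted into the low bits of the next lane up. */
    return (c & 0x3f3f3f3fu) << 2;
}
```

For c1 = 0xddc8c9ca this gives 0x74202428. Subtracting that once is equivalent, modulo 256 per lane, to subtracting c1 four times. Note that with a single MBS the result ends up in s2 rather than s1, so the printf would need to print s2.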

> Gives:
> carry 9F939794
> carry 1F073F3A
> carry B7B9BF9C
> carry F8101000
> carry BF81879C
> carry FF313730
> carry 3B818384
> value after macros B0080808, value after calc FFFFFFB0

Sorry, I didn't check these.

Rod Pemberton
 date asked: Apr 28 '06
