Bytes | Software Development & Data Engineering Community
speed it up

Hi,

I have 2 C code snippets that produce the same result. However, A is 2x
faster than B on my PC (x86) but 1.5x slower on my PDA (StrongARM @
206 MHz)
// Startup conditions + types
unsigned short *pSrc, *pDst, *src, *dst;
pSrc = new unsigned short[320*240];
pDst = new unsigned short[320*240];
register unsigned short x, y, *ldst;
short xptch = 320, yptch = -1;
dst = pDst + 319;
src = pSrc;
// A:
unsigned long* pDisplay = (unsigned long*)dst;
for(x=0; x<240; x++)
{
for(y=0; y<160; y++)
{
*pDisplay++ = (*(src-240)<<16) | *(src); // Process 4 bytes at once
src-=480;
}
src+=76801; // (320*240+1); // Get a row ahead + 320 lines down to the bottom
}

// B:
for (y = 0; y < 320; y++ )
{
ldst = dst; // Get current line address
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
dst += yptch; // add a line to dst buffer
}

Can someone explain this to me? Even better: how do I make this really
fast? Using ASM? I need an optimized version for an ARM processor.
Example B shows what it does, I think.

Thanks in advance,

--
-Gernot
int main(int argc, char** argv) {printf
("%silto%c%cf%cgl%ssic%ccom%c", "ma", 58, 'g', 64, "ba", 46, 10);}

________________________________________
Looking for a good game? Do it yourself!
GLBasic - you can do
www.GLBasic.com
Jul 22 '05 #1
Gernot Frisch wrote:
[snip]

Looks like you are performing a {matrix or bitmap} rotation,
but I'm not sure.

Anyway, to optimize for the ARM: the ARM processor likes unrolled
for-loops and a reduced number of branches (which is true for
most processors). The ARM also has special instructions (LDM/STM) that
can load many registers at once from memory and store many registers
at once to memory. My information is that both instructions require
sequential memory locations. Thus we can use the load here, but not
the store.

Let us concentrate on algorithm B.
I will optimize it in steps.
B1:
/* The "const" modifiers will allow the compiler to better
* optimize the code.
*/
const short xpitch = 320;
const short ypitch = -1;
const unsigned short * src = pSrc;
for (y = 0; y < 320; ++y)
{
ldst = dst;
for (x = 0; x < 60; ++x)
{
*ldst = *src++;
ldst += xpitch;
*ldst = *src++;
ldst += xpitch;
*ldst = *src++;
ldst += xpitch;
*ldst = *src++;
ldst += xpitch;
}
dst += ypitch;
}
In the above modification, the inner loop is unrolled
so that 4 memory transfers are performed for each
branch, rather than one transfer per branch as in
your original code.

B2: Replace inner loop with:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
}
The above loop tells the compiler that 4 registers
are loaded at once, then written to memory.
Hopefully this will trigger that special load instruction.
The "ldst[index]" assignment tells the compiler
to use a register-indexed store instruction.
You can unroll the inner loop further, up to the
number of registers available. A function always reserves
a minimum of three registers: the program counter,
the return address (link register), and the stack/frame
pointer. So print the function in assembly language, see how many
registers are left, and expand the inner loop accordingly.

If you tell us what the algorithms are doing, perhaps
we can suggest a more optimal method for the processors.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Jul 22 '05 #2
> Looks like you are performing a {matrix or bitmap} rotation,
but I'm not sure.
Yes, right. I'm transfering my back buffer to the display memory here.
(PDA-device).
B2: Replace inner loop with:
[snip]


<klonk> (That was my jaw.) You made it almost 5x faster! Unbelievable.

Thank you so much,
Gernot
Jul 22 '05 #3

...but now it's got errors at runtime. I can't find any error (other
than the typo x(i)ptch).
Really strange...
Try defining/undefining FAST.
Can you find what's different!?
-Gernot

#include <cstdio>
#include <cstring>

#define FAST
void Do()
{
unsigned short *src, *dst=NULL, *pDst, *pDst2, *pSrc;
pSrc = new unsigned short[320*240];
pDst = new unsigned short[320*240];
pDst2 = new unsigned short[320*240];
long n;
for (n=0; n<320*240; n++) pSrc[n]=21+n*n;
memset(pDst, 0, 320*240*2);
memset(pDst2, 0, 320*240*2);

register unsigned short x, y, *ldst;
short xpitch=640;
short ypitch=-2;
short xptch = (xpitch >> 1), yptch = (ypitch >> 1);

dst = pDst2 + 319;
src = pSrc;
// Old method
for (y = 0; y < 320; y++ )
{
ldst = dst; // Get current line address
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
dst += yptch; // add a line to dst buffer
}

dst = pDst + 319;
src = pSrc;
// new method
for (y = 0; y < 320; y++ )
{
ldst = dst; // Get current line address
#ifndef FAST
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
#else
for (x = 0; x < 60; x++)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index=0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xptch;
ldst[index] = s2;
index += xptch;
ldst[index] = s3;
index += xptch;
ldst[index] = s4;
index += xptch;
}
#endif
dst += yptch; // add a line to dst buffer
}

long q=0;
for (n=0; n<320*240;n++)
{
if (pDst[n] != pDst2[n]) q++;
}
printf ("Errors: %ld\n", q);
}
Jul 22 '05 #4
Sorry for all that traffic. It was the index variable. It should have
been declared outside the loop...
-Gernot
Jul 22 '05 #5
Thomas Matthews wrote:
Gernot Frisch wrote:
[snip]


I'm a bit surprised that you have to explicitly unroll the loop
and use the register keyword. I would expect a good optimizing
compiler to perform these optimizations by itself, especially if the
target processor doesn't like branches.

I have seen compilers for other CPU architectures doing these kind of
optimizations by themselves. Sometimes a minor hint in the code is
required to trigger the optimization, but also quite often no
modifications to the code are needed at all.

--
Peter van Merkerk
peter.van.merkerk(at)dse.nl
Jul 22 '05 #6
> Yes, right. I'm transferring my back buffer to the display memory here.
(PDA-device).
I don't know anything about PDA hardware, but on the PC display memory
used to be a lot slower than main memory (in the days of DOS and VGA
cards, when the programmer had direct access to the card). The speed of
copying to video memory was always bottlenecked by the slow video
memory, never by the CPU. It didn't matter at all how you coded the
copy operation; it just couldn't get any faster.

Fortunately, on PDA this need not be so.

Best regards,
Marcin
Jul 22 '05 #7
> I don't know anything about PDA hardware, but on PC the display memory
> used to be a lot slower than main memory (in times of DOS and VGA cards
> when programmer had direct access to card). The speed of copying to
> video memory was always bottlenecked by this low v-memory speed, never
> by CPU. It completely didn't matter how you coded the copy operation,
> it just couldn't get any faster.
>
> Fortunately, on PDA this need not be so.


Here it does matter. My old version gave me max 60 fps; now I'm
getting 72 fps. I found a lib that does a smooth 95 fps, but it's pure
optimized asm, I think. Anyway, I'm close enough to avoid buying it
now.
-Gernot
Jul 22 '05 #8
Peter van Merkerk wrote:
Thomas Matthews wrote:
Gernot Frisch wrote:
[snip]

I'm a bit surprised to see that you have to explicitly unroll the loop
and have to use the register keyword. I would expect a good optimizing
compiler to perform these optimization by itself, especially if the
target processors doesn't like branches.

I have seen compilers for other CPU architectures doing these kind of
optimizations by themselves. Sometimes a minor hint in the code is
required to trigger the optimization, but also quite often no
modifications to the code are needed at all.


Hey, pretty good for not testing it!
Actually, I was converting from assembly language into C
on the fly. The "register" keyword may not be necessary, but
I wanted to remind the compiler to use registers. This is almost
exactly what the assembly language would look like.

The speedup comes mostly from reading all the data
at once and from unrolling the loop. I figured that using the
s* variables would trigger an optimization pattern match
within the compiler. Sometimes a little hint to the compiler
goes a long way.
--
Thomas Matthews


Jul 22 '05 #9
Gernot Frisch wrote:
Sorry for all that traffic. It was the index-variable. Should belong
out of the loop...
-Gernot


Mea culpa! The actual mistake is that I forgot
to update the pointer:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
/****** insert the line below *****/
ldst += 4 * xpitch;
/* Or: */
ldst += xpitch << 2;
/* add ldst, ldst, xpitch lsl 2 */
}

The index variable is fine within the loop.
You may be able to increase the speed by declaring
the s* and index variables outside of the loops.

Have you tried increasing the number of "s" variables
to 6 or 8?
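For reference, here is a small self-contained check (just a sketch, with the sizes hardcoded to your 320x240 case and the fixed line included) that the corrected unrolled loop produces exactly the same rotation as your original pixel-at-a-time loop:

```c
#include <assert.h>
#include <string.h>

#define W 320   /* source width  = destination height */
#define H 240   /* source height = destination width  */

/* Plain rotation, one pixel per iteration (algorithm B from the thread). */
static void rotate_naive(unsigned short *dst0, const unsigned short *src)
{
    const int xpitch = W, ypitch = -1;
    unsigned short *dst = dst0 + (W - 1);
    for (int y = 0; y < W; ++y) {
        unsigned short *ldst = dst;
        for (int x = 0; x < H; ++x) {
            *ldst = *src++;     /* one pixel right on src */
            ldst += xpitch;     /* one pixel right on dest */
        }
        dst += ypitch;          /* one line down in dst buffer */
    }
}

/* Unrolled-by-4 version with the missing line added:
 * ldst must advance by 4*xpitch after each group of four stores. */
static void rotate_unrolled(unsigned short *dst0, const unsigned short *src)
{
    const int xpitch = W, ypitch = -1;
    unsigned short *dst = dst0 + (W - 1);
    for (int y = 0; y < W; ++y) {
        unsigned short *ldst = dst;
        for (int x = 0; x < H / 4; ++x) {
            unsigned int index = 0;
            unsigned short s1 = *src++, s2 = *src++, s3 = *src++, s4 = *src++;
            ldst[index] = s1; index += (unsigned int)xpitch;
            ldst[index] = s2; index += (unsigned int)xpitch;
            ldst[index] = s3; index += (unsigned int)xpitch;
            ldst[index] = s4;
            ldst += 4 * xpitch;   /* the fix */
        }
        dst += ypitch;
    }
}
```

Comparing both output buffers byte for byte should report zero differences.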
--
Thomas Matthews


Jul 22 '05 #10
Gernot Frisch wrote:
Looks like you are performing a {matrix or bitmap} rotation,
but I'm not sure.

Yes, right. I'm transfering my back buffer to the display memory here.
(PDA-device).


Still doesn't look like a straight transfer. Looks like a rotation.
If the transfer is the bottleneck, you may want to change the
layout of the source array to match your PDA (destination) buffer.

You may also get more performance by declaring the arrays with the
processor's native bit-width. For example, if an integer is 32 bits and
your ARM processor uses 32 bits, declare your arrays as unsigned
ints. This reduces the number of instructions and cycles, since
processors love their native integer size and anything
else slows them down.
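As a rough sketch of that idea (not your code; it assumes both buffers are 4-byte aligned, as malloc'd or int-declared buffers are), a contiguous run of 16-bit pixels can be moved two at a time through 32-bit words:

```c
#include <assert.h>
#include <string.h>

/* Copy n 16-bit pixels (n even) two at a time through 32-bit words.
 * Assumes both buffers are 4-byte aligned. Note that the strided
 * destination in this thread's rotation cannot use this trick;
 * it only works on contiguous runs. */
static void copy_pixels32(unsigned short *dst, const unsigned short *src, int n)
{
    unsigned int *d = (unsigned int *)dst;
    const unsigned int *s = (const unsigned int *)src;
    for (int i = 0; i < n / 2; ++i)
        d[i] = s[i];    /* one store moves two pixels */
}
```
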

B2: Replace inner loop with:
[snip]

<klonk> (That was my yaws) You made it almost 5x faster! Unbelievable.

Thank you so much,
Gernot


I used some optimization techniques which you should research:
1. Loop unrolling.
2. Load (input) as much data as possible before processing.

Unfortunately, because the increments in the destination are not
contiguous, I could not apply the other two rules:
3. Process as much data as possible.
4. Output as much data as possible.

For more fun, obtain a book about the StrongARM processor and
see how it works. Then try making your C code look more like
the assembly code.

--
Thomas Matthews


Jul 22 '05 #11
> Still doesn't look like a straight transfer. Looks like a rotation.
If the transfer is the bottleneck, you may want to change the
layout of the source array to match your PDA (destination) buffer.
In fact, it is a rotation, since some devices have rotated displays
and some don't. Anyway.
You may also get more performance by declaring the arrays the same
bit-width as the PDA. For example, if an integer is 32 bits and
your ARM processor uses 32 bits, declare your arrays as unsigned
ints. This will reduce the number of instructions and cycles as
processors love to use their native integer size and anything
else slows it down.
That is exactly what I tried to do with my other version, which
copies to an unsigned long array rather than the short array. It
is very fast on my PC, but the ARM version takes twice as long!?
I used some optimization techniques, which you should research:
1. Loop unrolling.
2. Load (Input) as much data before processing.
I'll google for it.
Unfortunately, because the increment in the destination is not
contigous, I could not apply the other two rules:
3. Process as much data as possible
4. Output as much data as possible.
Yes, I know. That's pretty much my problem. I'm not even sure the
step-values are positive, since on my device e.g. pitchy = -2...
For more fun, obtain a book about the StrongARM processor and
see how it works. Try changing your C code look more like
the assembly code.


Yes, I thought about that, too. But I'm pretty busy, so I was hoping
some ASM/C guru would be able to help in the first place - which
you did by speeding it up 5 times!

-Gernot
Jul 22 '05 #12
> /****** insert the line below *****/
ldst += 4 * xpitch;
/* Or: */
ldst += xpitch << 2;
/* add ldst, ldst, xpitch lsl 2 */
}
Unfortunately I can't be sure that the next line begins after the
current one. On my device the previous line begins there. Pretty
stupid, I know. So now I'm doing:

ldst = dst; // Get current line address
/*
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
*/
for (x = 0; x < 60; ++x)
{
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
*ldst = s1;
ldst += xptch;
*ldst = s2;
ldst += xptch;
*ldst = s3;
ldst += xptch;
*ldst = s4;
ldst += xptch;
}
dst += yptch; // add a line to dst buffer

I declared ldst as "register unsigned short*", so it should be
faster to advance the pointer itself instead of keeping an index that
gets added to the original location, no?

short* p1 = p2;
for() *(p1+=x)=...
// instead of
long index=0;
for() p1[index]=...; index+=x;
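To convince myself, here is that comparison as a small compilable sketch (names made up). Both forms hit exactly the same addresses; on ARM the pointer bump can become a post-indexed halfword store (strh rX, [rY], #imm), so it should be at least as good:

```c
#include <assert.h>
#include <string.h>

/* Strided fill two ways: bumping the pointer itself vs. keeping a
 * running index off a base pointer. Both write the same addresses. */
static void fill_ptr(unsigned short *p, int stride, int count)
{
    for (int i = 0; i < count; ++i) {
        *p = (unsigned short)i;
        p += stride;            /* advance the pointer directly */
    }
}

static void fill_index(unsigned short *p, int stride, int count)
{
    int index = 0;
    for (int i = 0; i < count; ++i) {
        p[index] = (unsigned short)i;
        index += stride;        /* advance a separate index */
    }
}
```
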

Have you tried increasing the number of "s" variables
to 6 or 8?


I'll do that next.
Jul 22 '05 #13
> Unfortunately I can't be sure that the next line begins after the
current one.


Rubbish - I misread your changes. Sorry. Do you know ASM? Can I post
the ASM code my compiler produces, to see if some speedups might be
possible?
-Gernot
Jul 22 '05 #14
I guess _that_ listing can't be sped up much more now? The only chance
might be to use another approach... Anyway - it's really fast now.
; 11 : int xptch8 =(xptch<<3); // 8 pitches
; 12 : register unsigned short s1, s2, s3, s4, s5, s6, s7, s8;
; 13 : register unsigned int index = 0;
; 14 : for (y = 0; y < 320; y++)

mov r0, #0
mov lr, r0
b |$L42991|
|$L43502|
mov r0, #0
|$L42991|

; 15 : {
; 16 : ldst = dst; // Get current line address

mov r8, r11

; 17 : /*
; 18 : for (x = 0; x < 240; x++ )
; 19 : {
; 20 : *(ldst) = *src++; // one pixel rigth on src
; 21 : ldst += xptch; // add a pixel to the right on dest
; 22 : }
; 23 : */
; 24 : for (x = 0; x < 30; ++x)

mov r9, r0
|$L42994|

; 25 : {
; 26 : s1 = *src++;

ldrh r0, [r10], #2

; 27 : s2 = *src++;

ldrh r1, [r10], #2

; 28 : s3 = *src++;

ldrh r2, [r10], #2

; 29 : s4 = *src++;

ldrh r3, [r10], #2

; 30 : s5 = *src++;

ldrh r4, [r10], #2

; 31 : s6 = *src++;

ldrh r5, [r10], #2

; 32 : s7 = *src++;

ldrh r6, [r10], #2

; 33 : s8 = *src++;

ldrh r7, [r10], #2

; 34 : *ldst=s1; ldst+=xptch;

strh r0, [r8], #-2
add r0, r9, #1

; 35 : *ldst=s2; ldst+=xptch;

strh r1, [r8], #-2
mov r1, r0, lsl #16

; 36 : *ldst=s3; ldst+=xptch;

strh r2, [r8], #-2
mov r2, r1, lsr #16

; 37 : *ldst=s4; ldst+=xptch;

strh r3, [r8], #-2

; 38 : *ldst=s5; ldst+=xptch;

strh r4, [r8], #-2
mov r0, r2, lsl #16

; 39 : *ldst=s6; ldst+=xptch;

strh r5, [r8], #-2
mov r9, r0, lsr #16

; 40 : *ldst=s7; ldst+=xptch;

strh r6, [r8], #-2

; 41 : *ldst=s8; ldst+=xptch;

strh r7, [r8], #-2
cmp r9, #0x1E
bcc |$L42994|
add r0, lr, #1
mov r1, r0, lsl #16

; 42 : }
; 43 : dst += yptch; // add a line to dst buffer

add r11, r11, #0xA, 26
mov r2, r1, lsr #16
mov r0, r2, lsl #16
mov lr, r0, lsr #16
cmp lr, #5, 26
bcc |$L43502|

; 44 : }
; 45 : }
Jul 22 '05 #15
Gernot Frisch wrote:
I guess _that_ listing can't be sped up much more now? The only chance
might be to use another approach... Anyway - it's really fast now.
[snip]

Although assembly is not topical in this group,
the listing shows that the compiler is not using
the most efficient instruction.

Try converting your arrays from short to integer.
This should speed things up.

Otherwise, the assembly looks rather efficient.
I think the optimization is reaching its limit.
Any more code changes won't realize a drastic
improvement, only subtle ones.

--
Thomas Matthews


Jul 22 '05 #16
> Try converting your arrays from short to integer.
This should speed things up.


That's what I tried to do with my other code; it runs much faster on
my PC, but not on the ARM processor. The asm looks horrible. I have a
lot of "lsr #16" instructions which I can't get rid of.
Here's my code + the produced asm:

; 401 : {
; 402 : dst++;
; 403 : src+=(319*240);

mov r0, #0x95, 22
orr r0, r0, #0x22, 28

; 404 : unsigned long* pDisplay = (unsigned long*)dst;
; 405 : pDisplay-=160; // One line back

mov r1, #0x9F, 30
add r6, r7, r0
orr r1, r1, #2

; 406 : for(x=0; x<240; x++)

mov r0, #0
sub r5, r5, r1
mov r7, r0

; 414 : }
; 415 : }
; 416 : else // Safe & slow

b |$L43362|
|$L44796|
mov r0, #0
|$L43362|

; 407 : {
; 408 : for(y=0; y<160; y++)

mov r4, r0
|$L43365|

; 409 : {
; 410 : *pDisplay++ = (*(src-240)<<16) | *(src); // Process 4 bytes at once

ldr r12, [pc, #0x714]
ldrh r0, [r6, +r12]
mov r1, r0, lsl #16
ldrh r0, [r6]

; 411 : src-=480;

sub r6, r6, #0xF, 26
mov r2, r1, lsr #16
mov r1, r0, lsl #16
add r0, r4, #1
mov r3, r2, lsl #16
orr r2, r3, r1, lsr #16
mov r1, r0, lsl #16
str r2, [r5], #4
mov r2, r1, lsr #16
mov r0, r2, lsl #16
mov r4, r0, lsr #16
cmp r4, #0xA0
bcc |$L43365|

; 412 : }
; 413 : src+=(320*240+1);

add r0, r6, #0x96, 22
add r1, r7, #1
add r6, r0, #2
mov r0, r1, lsl #16
mov r2, r0, lsr #16
mov r1, r2, lsl #16
mov r7, r1, lsr #16
cmp r7, #0xF0
bcc |$L44796|

; 414 : }
; 415 : }

Jul 22 '05 #17
Thomas Matthews wrote:
The reason for the speed up is more that all the data is read
at once and the loop is unrolled. I figured that using the
s* variables would trigger an optimization pattern match
within the compiler. Sometimes, a little hint to the compiler
goes a long way.


I know; for really performance-critical stuff it is always interesting
to look at the assembly output of the compiler and see what it makes
of your code. Sometimes compilers can perform quite amazing
optimizations; at other times they let you down and you have to
spell out exactly what code you want them to generate. I'm glad that I
learned assembly programming in the good old days. For this kind of
stuff it can really make a difference if you understand what the
compiler is doing.

--
Peter van Merkerk
peter.van.merkerk(at)dse.nl
Jul 22 '05 #18
Gernot Frisch wrote:
Try converting your arrays from short to integer.
This should speed things up.

That's what I tried to do with my other code, that runs much faster on
my PC, but not on the ARM processor. The asm looks horribly. I have a
lot of "lsr #16" calls which I can't get rid of.


Sorry, but there is a misunderstanding here.
Use the "B2" code, but with the destination and source arrays
declared as integers:

/*****************************************************
* Use named constants. Get into this habit.
*****************************************************/
static const unsigned int MAX_COLUMNS = 320;
static const unsigned int MAX_ROWS = 240;
static const unsigned int ARRAY_SIZE = MAX_COLUMNS * MAX_ROWS;

/*****************************************************
* Declare arrays as integers even though the data range
* is a short, because int is the processor's native
* size. [1] [2]
*****************************************************/
unsigned int source[ARRAY_SIZE];
unsigned int destination[ARRAY_SIZE];
unsigned int * ldst;
unsigned int * dst;

/*****************************************************
* Since the source array will not be changed, all
* pointers to the source array are declared as a
* pointer to constant data. This is important so
* that the compiler can invoke some optimizations.
*****************************************************/
const unsigned int * src;

dst = destination + MAX_COLUMNS - 1;
/* a.k.a. dst = &destination[MAX_COLUMNS - 1]; */

src = source;

/*****************************************************
* If a variable's value never changes, prefer to use
* a named constant.
*
* When constants are declared in-line or in a function,
* prefer to declare them as static.
*
* Again, proper use of "const" allows a compiler to
* perform better optimization "magic".
*****************************************************/
static const int xpitch = 2 * MAX_COLUMNS;
static const int ypitch = -2;

const unsigned int TRANSFERS_IN_LOOP = 4;
const unsigned int INNER_ITERATIONS =
MAX_ROWS / TRANSFERS_IN_LOOP;
unsigned int x;
unsigned int y;

/*****************************************************
* From Scott Meyers' Effective C++ series, prefer
* pre-increment on indices in loops.
*****************************************************/
for (y = 0; y < MAX_COLUMNS; ++y ) /* Use Named constant */
{
ldst = dst; // Get current line address
for (x = 0; x < INNER_ITERATIONS; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
ldst += xpitch << 2;
}
dst += ypitch; // add a line to dst buffer
/* See note [3] */
}

Notes:
[1] One reason _not_ to use "int" is when the destination pointer
refers to a specific hardware memory area which has a
different bit-width than "int".
For example, if the display memory is 16-bits wide and an
int is 32-bits wide.

[2] Don't use dynamic arrays unless the size is not known
until run-time or they are too large to be allocated
as an automatic variable. If the arrays are large
prefer to declare them as static inside blocks and
functions.

[3] You could also try unrolling the outer loop also. The performance
gain may not be as much as unrolling the inner loop. The objective
is to reduce the ratio of processing to branching.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Jul 22 '05 #19
My source was a short array.
unsigned int source[ARRAY_SIZE];
const unsigned int * src;
unsigned short s1;
src = source;
s1 = *src++;


The src++ will now move the pointer by 4 bytes (sizeof(int)), won't it? That's a problem, isn't it?
I would have to use:

unsigned long sx;
unsigned short s2;
sx = *src++;
s1 = sx >> 0x10;
s2 = sx & 0xFFFF;

Right?

Thank you for your patience. You're helping me a lot in understanding
asm and speed issues.

-Gernot

Jul 22 '05 #20
