Moving bytes in memory

Alessio Sangalli

Hi, I am facing some performance issues in an algorithm to "translate"
some YUV data in another format. I'll make everything very simple:

I have 4 blocks of Y data, one of U data and one of V data. Repeat the
previous structure a few hundreds times to get a whole image.
Every block consists in a 8x8 grid.

Let's consider Y data only:
I have to convert it in a way that I take chunks of 8bytes and put in
different memory locations to build up a "planar" or "raster"
representation of the data. The following code processes one block:

unsigned char* destination;
unsigned char* source;
int xsize, ysize, x, y, i;

[...]

for(y=0; y<ysize; y+=16)
for(x=0; x<xsize; x+=16)
for(i=0; i<8; i++)
{
memcpy(destination+(y+i)*xsize+x, yuvcurr, sizeof(unsigned char)*8);
yuvcurr+=8;
}
I noticed that the code above is *much* slower (ARM9, gcc 4.0.0, uClibc)
than the following trick:

unsigned int* destination;
unsigned int* source;
int xsize, ysize, x, y, i;

[...]

for(y=0; y<ysize; y+=16)
for(x=0; x<xsize; x+=16)
for(i=0; i<8; i++)
{
dest = (unsigned int*)(final+(y+i)*xsize+x);
*(dest++) = *(source++);
*(dest++) = *(source++);
}
Basically I know I have to copy 8 bytes or two words and I do that
instead of calling memcpy.
The first solution roughly gives me 70fps, while the second one 230.

Any comment on this? Am I missing something, the memcpy implementation
is mislead by something?

bye
Alessio

Jul 18 '08 #1

Subscribe Post Reply

3881

Eric Sosman

Alessio Sangalli wrote:

Hi, I am facing some performance issues in an algorithm to "translate"
some YUV data in another format. I'll make everything very simple:
[...]

It might be simpler if you showed the actual code instead
of an paraphrase teeming with undeclared variables ...

--
Eric Sosman
es*****@ieee-dot-org.invalid

Jul 18 '08 #2

Alessio Sangalli

Eric Sosman wrote:

It might be simpler if you showed the actual code instead
of an paraphrase teeming with undeclared variables ...

every IRC channel/newsgroup/forum has its own rules. I am now using a
pastebin service. What is commented is the "alternate" way to do it. The
"memcpy" version is at least 3 times slower.

http://pastebin.ca/1075068

bye
as

Jul 18 '08 #3

Willem

Alessio Sangalli wrote:
) Basically I know I have to copy 8 bytes or two words and I do that
) instead of calling memcpy.
) The first solution roughly gives me 70fps, while the second one 230.
)
) Any comment on this? Am I missing something, the memcpy implementation
) is mislead by something?

The only way the memcpy would be able to beat copying by hand of 8 bytes,
is if it were inlined by the compiler and then optimized heavily, or if
the compiler 'knows' about memcpy and can optimize it accordingly.
(And even then, it could only beat your code by a small margin, using
128-bit instructions.)
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Jul 18 '08 #4

Peter Nilsson

Alessio Sangalli wrote:

Eric Sosman wrote:

It might be simpler if you showed the actual code instead
of an paraphrase teeming with undeclared variables ...

every IRC channel/newsgroup/forum has its own rules. I am now using a
pastebin service. What is commented is the "alternate" way to do it. The
"memcpy" version is at least 3 times slower.

http://pastebin.ca/1075068

So unroll everything! [Completely untested...]

void foo( unsigned char *d,
unsigned char *s,
size_t xsize,
size_t ysize )
{
unsigned char *s2; /* secondary source (s + 64) */
unsigned char *y, *yend;
unsigned char *x, *xend;

#define copy8(d,s) \
do { \
*d++ = *s++; \
*d++ = *s++; \
*d++ = *s++; \
*d++ = *s++; \
*d++ = *s++; \
*d++ = *s++; \
*d++ = *s++; \
*d++ = *s++; \
} \
while (0)

for (y = d, yend = y + ysize * xsize; y < yend; y += 16 * xsize)
for (x = y, xend = x + xsize; x < xend; )
{
s2 = s + 64;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
s += 64;

s2 = s + 64;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16;
copy8(x,s); copy8(x,s2); x += xsize - 16 * xsize;
s += 64;
}
}

--
Peter

Jul 18 '08 #5

James Dow Allen

On Jul 18, 7:27*am, Alessio Sangalli
<alesanRemoveMePle...@manoweb.comwrote:

Hi, I am facing some performance issues ...
I have 4 blocks of Y data, one of U data and one of V data
... I take chunks of 8bytes and put in
different memory locations
...
* * * memcpy(...)
...
the code above is *much* slower ... than
...
* * * *(dest++) = *(source++);
* * * *(dest++) = *(source++);
...

If you're very concerned with performance, I'm surprised
you bothered to try a memcpy() solution.
Regulars in this ng will babble over and over and over
and over again, albeit correctly, that Premature Optimization
is an evil, but replacing a function call with a
(more readable!) two-line fix in the inner loop known to
dominate performance hardly constitutes Premature
Optimization!

Please! To listen to some c.l.c'ers you'd think
any optimized code should be replaced forthwith
with an unoptimized version! (One regular touts
his hashlib every week or two, even though it
contains beaucoup code with no discernable purpose
except to slow it down!)

What I used to do, sometimes, in loops like yours,
was to examine the generated object code, see if it
could be improved, then figure how to modify the C
code to produce the optimal object code. (For
example, changing an index from signed to unsigned
might do good, IIRC.)

Do you need to convert YUV to/from RGB? If so, that
conversion is likely to be quite time-consuming.
There are some very interesting ways to speed
*that* up.

The World's Fastest Jpeg Compressor(tm) has DCT,
Huffman, Zigzag, color conversion, etc. that are all
quite fast, and ends up spending, believe it or not,
about 40% of (decompression) time in the very inner
loop similar to that you are describing!

Eric Sosman wrote:

It might be simpler if you showed the actual code instead
of an paraphrase teeming with undeclared variables ...

Eric's comments usually have great value and technical
accuracy. Not counting a spelling deviation on dest(ination)
I counted one (1) undeclared variable in the sample
code, hardly justifying the term "teeming", and the code
and question were self-explanatory anyway. Eric: are
you applying for premier membership in the c.l.c
pedant's club?

James Dow Allen

Jul 18 '08 #6

Kalle Olavi Niemitalo

Alessio Sangalli <al******************@manoweb.comwrites:

dest = (unsigned int*)(final+(y+i)*xsize+x);
*(dest++) = *(source++);
*(dest++) = *(source++);

This assumes the data matches the alignment requirements of
unsigned int. In general, memcpy cannot assume that and must
either check the alignment every time or copy byte by byte,
slowing it down.

http://www.arm.com/support/faqdev/4972.html mentions that ARM's
compiler redirects memcpy calls to a different function
(__rt_memcpy_w or __aeabi_memcpy4) if the pointer types in
argument expressions imply word-alignment, or even inlines the
copy if the size is also an aligned small constant. I don't know
if GCC has such a feature.

GCC probably cannot merge your two *(dest++) = *(source++);
statements into one LDM/STM pair because doing so would change
the result if dest + 1 == source. Replacing unsigned int with
e.g. uint64_t, so that only one assignment is needed, might
enable the use of LDM/STM; in principle this could require even
stricter alignment, but I guess ARM has __alignof(uint64_t) ==
__alignof(unsigned int). Alternatively, __restrict might help.

Jul 18 '08 #7

Bart

On Jul 18, 1:27*am, Alessio Sangalli
<alesanRemoveMePle...@manoweb.comwrote:

* * * memcpy(destination+(y+i)*xsize+x, yuvcurr, sizeof(unsigned char)*8);

I noticed that the code above is *much* slower (ARM9, gcc 4.0.0, uClibc)
than the following trick:

* * * dest = (unsigned int*)(final+(y+i)*xsize+x);
* * * *(dest++) = *(source++);
* * * *(dest++) = *(source++);
* * }

Basically I know I have to copy 8 bytes or two words and I do that
instead of calling memcpy.
The first solution roughly gives me 70fps, while the second one 230.

Any comment on this? Am I missing something, the memcpy implementation
is mislead by something?

Copying 8 bytes can be done in typically 4 instructions on a 32-bit
machine.

How many instructions do you think it takes to push the parameters,
call memcpy, and for it to sort out all it's logic, do the work, and
then return? In fact I'm surprised it's only 3 times slower. Maybe
your compiler is inlining, although it's not doing a very good job
then.

--
Bartc

Jul 18 '08 #8

Alessio Sangalli

James Dow Allen wrote:

If you're very concerned with performance, I'm surprised
you bothered to try a memcpy() solution.

Well it's textbook knowledge, everybody knows that memcpy is supposed to
be the fastest way to copy memory ;)

Do you need to convert YUV to/from RGB? If so, that

No, I simply need to unpack the output of the JPEG decoder (hardware)
and put it in a different (raster) format for the scaler (hardware).
Thank you and to everybody that helped me to open up my eyes on memcpy.

bye
as

Jul 18 '08 #9

Flash Gordon

Alessio Sangalli wrote, On 18/07/08 21:07:

James Dow Allen wrote:

>If you're very concerned with performance, I'm surprised
you bothered to try a memcpy() solution.

Well it's textbook knowledge, everybody knows that memcpy is supposed to
be the fastest way to copy memory ;)

<snip>

You have to make sure you have included the correct headers *and*
enabled the optimiser or memcpy might not be inlined and optimised
(including telling it the oldest processor you want to support, if using
gcc on an x86 for example you might not need to support an 80386!).
However it is still not guaranteed to be faster, it depends on the
quality of the implementation.

I would try to make sure that the data will match the strictest
alignment the processor supports. This *might* help memcpy and *will*
allow you to make use of fancier tricks if you need to.

You have measured and found a problem, so trying to improve the
situation is not premature optimisation. However changing the code might
not be the *only* way to improve it.
--
Flash Gordon

Jul 18 '08 #10

Chris Torek

In article <g5**********@online.de>
Alessio Sangalli <al******************@manoweb.comwrote:

>I noticed that [using memcpy] is *much* slower (ARM9, gcc 4.0.0, uClibc)
than [an inline expansion that takes advantage of some knowledge about

the data being copied.]

GCC-on-ARM is, shall we say, "not a very mature product".[%] But
if you are willing to do some compiler-source work, it is likely
that you can make your GCC emit a reasonably optimal code sequence
for this particular code. (You will want to add special handling
for memcpy of known sizes at known-or-implied alignments.) If you
then submit this back to the GCC maintainers, it may eventually
wind up in future distributions. Until then, you will have to
maintain this change yourself.

It is up to you to decide whether "maintaining changes to GCC until
they are picked up by the mainline" is worthwhile. The alternative
is that for this particular case, where you have already profiled
or otherwise timed your code and found a bottleneck that can be
worked around some other way, you write a little bit of "special"
code just for this ARM system, and maintain this special code in
your own system. Which one is more cost-effective for you? (The
answer depends on *you*, i.e., your own situation, so this is not
a rhetorical question.)

[% In the WR Linux group, we have a bit of kernel code that gcc
compiles incorrectly on ARM, but not on other machines. We had to
decide what to do about this until we could get the compiler bug
fixed. I was not involved in the decision, I just saw the issue
fly by in email.]
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html

Jul 20 '08 #11

by: HeroOfSpielburg | last post by:

hi, I know this is a much-travelled subject, but I was wondering what people's thoughts were on the bare minimum (and conversely the grand scheme) for augmenting standard memory references to...

C / C++

Memory leak? Private Bytes/Virtual Bytes

by: Ronald | last post by:

Hi! As I ran in to errors like "Ran out of memory" and "Out of memory" I supposed my website might have memoryleaks. To trace a potential memory leak I isolated the IIS process and monitored the...

ASP / Active Server Pages

Doubt In Bytes Allocated

by: ranjeet.gupta | last post by:

Dear All I am not able o understand the exact number of bytes allocation done by the two fucntion given below, It is said that the fuction Condition_String1 allocates the 240 bytes while...

C / C++

Moving a string to char* and viceversa.

by: Tim Conner | last post by:

Hi, I am an ex-delphi programmer, and I having a real hard time with the following simple code (example ): Which is the equivalent to the following code ? var chars : PChar; sBack, s :...

C# / C Sharp

moving data in a file without using system memory

by: ulyses | last post by:

Let's assume I have following file: 2938929384902491233..... 923949919199191919112.... File contains INTs only. What is more they are huge. For example first row in file may contain integer...

C / C++

where do the extra bytes go while using Malloc ?

by: karthikbalaguru | last post by:

Hi, In the case of heap , to keep track of a single chunk of memory it requires 8 bytes of information. That is, it requires 4 bytes to hold the size, and 4 bytes to hold the pointer to the next...

C / C++

Page File counter and Private Bytes Counter

by: George2 | last post by:

Hello everyone, I am not sure whether I am wrong or the Windows Internals Book 4th version is wrong. Here is what the book says in Chapter 7, Memory Management from Page 444 to Page 445 ...

C / C++

how to find number of bytes freed

by: sh.vipin | last post by:

is there any way to find out number of bytes freed on a particular free() call in C

C / C++

Cloud Servers without Credit Card and Email Registration: A Simpler Way to Get on the Cloud

by: CloudSolutions | last post by:

Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...

General

Wordpress or something else?

by: Faith0G | last post by:

I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

Content Management Systems

One-click Importing Excel Data into a*Database

by: ryjfgjl | last post by:

In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...

Microsoft Excel

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

Moving bytes in memory

Similar topics