Bytes | Software Development & Data Engineering Community

Storage of char in 64 bit machine

Hi all,

I have a simple definition in a C file, something like this.

main()
{
char a;
.......
int k;
}

Since a char is 8 bits, how is it stored on a 64-bit machine? If it
is word aligned, what happens to the rest of the bytes? And will
retrieval of the char be expensive, with respect to either reads or
writes?

Thanx and Regards,
Aruna

Aug 14 '06
we******@gmail.com wrote:
CBFalconer wrote:
>Stephen Sprunk wrote:
... snip ...
>>I am curious if the following has a substantially different speed
from your version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
/* catch people calling this function with bad params */
assert((strlen(c1)==3) && (strlen(c2)==3));
/* Note: & is safe here since == always returns 0 or 1,
and it's likely to be faster than && */
if ((c1[0]==c2[0])&(c1[1]==c2[1])&(c1[2]==c2[2]))
printf("Same currency %s\n", c1);
else
printf("%s and %s are different\n", c1, c2);
}
-------

This version looks, to me, like it's fully portable and should be
substantially faster than strcmp(), at least when NDEBUG is defined.

Let's cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

A modern CPU will perform these operations substantially in *PARALLEL*.
In fact the full schedule in an OOO machine like an Athlon (3-way
subscalar, with free offsets) looks something like this:

Cycle 1: r0 <- load(c1) ||| r1 <- load (c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load (c2+4)
Cycle 3: f1 <- cmp(r0, r1) ||| r4 <- load (c1+8) ||| r5 <- load (c2+8)
Cycle 4: r6 <- f1 ||| f2 <- cmp (r2, r3)
Cycle 5: r7 <- f2 ||| f3 <- cmp (r4, r5)
Cycle 6: r8 <- and(r6, r7) ||| r9 <- f3
Cycle 7: ra <- and(r8, r9) ||| return

Every cycle performs at least two operations with only cycles 4 and 5
being somewhat less efficient due to the way the x86 architecture works
(you have to copy from the flags to the register.) This tells us that
you should instead use the expression:

static int cneq (const char * c1, const char * c2) {
return (c1[0] - c2[0]) | (c1[1] - c2[1]) | (c1[2] - c2[2]);
}

This leads us to:

Cycle 1: r0 <- load(c1) ||| r1 <- load(c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load(c2+4)
Cycle 3: r6 <- sub(r0, r1) ||| r4 <- load(c1+8) ||| r5 <- load(c2+8)
Cycle 4: r7 <- sub(r2, r3)
Cycle 5: r8 <- sub(r4, r5) ||| r9 <- ior(r6, r7)
Cycle 6: ra <- ior(r8, r9) ||| return

which is a cycle faster.
>Now, let's analyze a normal comparison, without the limitations on
length etc. operating on the same length 3 strings:

int ceq(const char *c1, const char *c2)
{
while (*c1 && (*c1 == *c2)) {
c1++; c2++;
}
return *c1 == *c2;
}

In the best case, *c1 != *c2, assuming reasonable register
allocations, there will be 2 dereferences, one zero test, and one
comparison.

And *one branch prediction*. In any event, let's take a look:

Cycle 1: r0 <- load(c1) ||| r1 <- load (c1)
Cycle 2: r2 <- load(c2)
Cycle 3: f0 <- cmp(r0, 0) ||| ifnz(f0) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 4/14: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 5/15: noop
Cycle 6/16: f2 <- cmp(r3, r4)
Cycle 7/17: r5 <- f2 ||| return

{ or: }

Cycle 4/14: f1 <- cmp(r1, r2) ||| ifnz(f1) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 5/15: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 6/16: noop
Cycle 7/17: f2 <- cmp(r3, r4)
Cycle 8/18: r5 <- f2 ||| return

etc. So your *best* case costs basically 7 clocks anyways. (Once you
either start predicting wrong, or enter the loop, you are clearly much
worse.) These are highly idealized models of the Athlon machine, so
real results may be different. It might be interesting to see the real
world results.
>[...] The loop is never executed. Note that for random
strings this is the MOST LIKELY result. Much faster.

It highly depends on the random distribution. The branch prediction
penalty could *easily* dominate. For example, if there were only two
or three really common strings (USD, Euros, and Yen, say) but no
discernible pattern in those actual cases, then you could easily tend
to get the branch prediction wrong around 30% of the time, which
translates to an average of 3 clocks extra.
>In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.

I haven't done the bench, and clearly you haven't either. Care to put
money on this claim?
No. Clearly you are much more familiar with the internals of
modern CPUs than I am. However, considering the chances of
entering the loop at all, a priori (for random strings) there is
only one chance in 256 (for 8 bit bytes) of entering. In actuality
the chance is going to be much higher, maybe one in 100, yet not to
be sneezed at. In the probable case there are two decisions,
whether to enter the loop, and what to return. By putting these in
the appropriate default path, as an intelligent compiler can
(possibly by generating hints), all delays in the routine can be
avoided most of the time. In fact, by putting those decisions in
the calling sequence all control transfers can be avoided most of
the time! You can even express that in C as a conditional calling
macro and a specialized subroutine.
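[Editorial sketch of that last idea; the names `CEQ` and `ceq_rest` are mine, not from the thread: the macro performs the common first-character test inline, and only the rare case of a matching first character pays for a control transfer.]

```c
/* Rare path: finish the comparison once the first characters match. */
static int ceq_rest(const char *c1, const char *c2)
{
    while (*c1 && *c1 == *c2) {
        c1++;
        c2++;
    }
    return *c1 == *c2;
}

/* Common path, inline: for random strings the first characters usually
   differ, so most invocations never call ceq_rest() at all. */
#define CEQ(c1, c2) ((*(c1) == *(c2)) && ceq_rest((c1), (c2)))
```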

I tend to think of much simpler processors, i.e. the kind that are
really prevalent in the embedded world, and which make up the vast
majority. These are the ones that have relatively limited
performance, and need careful tending to get maximized results.

This, while interesting, is OT for c.l.c, yet it is the sort of
thing with which every programmer should familiarize themselves.
Otherwise it can be impossible to make some fundamental decisions
on methods and techniques, at least in cases where it matters.

--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
Aug 19 '06 #61
CBFalconer wrote:
we******@gmail. com wrote:
CBFalconer wrote:
In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.
I haven't done the bench, and clearly you haven't either. Care to put
money on this claim?

No. Clearly you are much more familiar with the internals of
modern CPUs than I am. However, considering the chances of
entering the loop at all, a priori (for random strings) there is
only one chance in 256 (for 8 bit bytes) of entering.
The point of my analysis was to explain that this idea of yours is
wrong. Short-circuiting only works if there is a substantial amount of
work that you are saving by doing so. As the clock counts get smaller,
it's generally a better idea to make code straight-line and
parallelizable and to avoid control-flow.
[...] In actuality
the chance is going to be much higher, maybe one in 100, yet not to
be sneezed at.
Huh? The O.P. has explained that these are currencies. So presumably
he's tracking transactions at a bank or something like that. Not that
I am an expert on such things, but I would imagine that there are
extreme biases in the input data that he sees. I.e., he will see a lot
more USD, Euro and Yen than he will of the Nicaraguan Cordoba for
example. If it's pretty much *all* USD, then the branch predictors
start working for you again. But as soon as you see a mix with a
handful of very significant entries, the branch predictors will just
end up failing. My intuition is that for certain international
banks or European banks (that are mixing the British pound and the
euro, for example) this is, in fact, a worst-case scenario.

You generally can't think of probabilities of real world data in
simplistic terms such as expectations about the range of the data type.
Even approximating them this way can so easily lead you astray. This
is the same reason that it's so important to really look at worst case
scenarios in generic code.
[...] In the probable case there are two decisions,
whether to enter the loop, and what to return. By putting these in
the appropriate default path, as an intelligent compiler can
(possibly by generating hints), all delays in the routine can be
avoided most of the time. In fact, by putting those decisions in
the calling sequence all control transfers can be avoided most of
the time! You can even express that in C as a conditional calling
macro and a specialized subroutine.

I tend to think of much simpler processors, i.e. the kind that are
really prevalent in the embedded world, and which make up the vast
majority. These are the ones that have relatively limited
performance, and need careful tending to get maximized results.
You mean embedded processors from the past? MIPS, XScale, and PowerPC
are all superscalar and fully pipelined processors. There are
embodiments of these which are deeply pipelined and support some degree
of OOO. While some are not able to completely re-order the
instructions, good compilers would *try* to do that for them anyways.
Over time standard processors become embedded processors.
[...]
This, while interesting, is OT for c.l.c, yet it is the sort of
thing with which every programmer should familiarize themselves.
Which do you think they should learn? Your incorrect/inapplicable
analysis, or my somewhat detailed analysis? I think people should
first learn to benchmark as our other poster here seems keen on. They
can learn what I know if they really feel up to it, or they can just
stick with benchmarking and learn some rules of thumb along the way.
Otherwise it can be impossible to make some fundamental decisions
on methods and techniques, at least in cases where it matters.
If you want to make *correct* decisions, and somehow benchmarking is
not a good option for you, then you have little choice but to engage in
the kind of analysis that I do. But I too have to eventually put
things to the acid test -- because in real processors there are often
factors that are just not easy to see.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 19 '06 #62
Walter Roberson wrote:
In article <1940995.IL8L2SCWzz@misha>,
Mikhail Teterin <us****@aldan.algebra.com> wrote:
>We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".

No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

A couple of people then {perhaps unwisely} said "a good strcmp() can
probably do just as well", and you produced a benchmark that appeared
to show that your code is measurably faster... at least for the systems
you tried.
I just tried the benchmark on my machine and got the same result.
However, when I changed bench.c to call strcmp directly, which is how I
would actually do the string comparison inline the results switched to
strcmp being far faster. When, to be fair, I changed the integer
comparison to being inline I got the *same* results for both. So I don't
think the comment about good strcmp implementations was unwise, I think
it was justifiable at least in part. Under the right conditions the
strcmp method *does* perform as efficiently because the compiler knows
what strcmp does, inlines it, and then optimises the inlined code in
context.

<OT>
I believe that this is the reason gcc has a number of "library"
functions implemented as built-ins rather than just calling the library.
The built ins *are* inlined before the optimiser is invoked.
</OT>
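[As a concrete illustration of that point — my example, not Flash Gordon's: gcc treats `strcmp` as a built-in, and compiling the same file with `-fno-builtin-strcmp` forces a real library call, which makes the inlining effect visible in the generated assembly of a function like this.]

```c
#include <string.h>

/* With optimisation enabled, gcc can expand this strcmp inline and
   specialise it for the known 4-byte literal; compiling with
   -fno-builtin-strcmp leaves an actual call to the library routine. */
int is_usd(const char *cur)
{
    return strcmp(cur, "USD") == 0;
}
```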
This was followed by a collective "So what if it executes faster?
The time difference is small enough to be inconsequential for most
systems, and in the meantime you've written unclear code that will
likely break on future systems. Paul Hsieh is the only one who
said that it's worth exploring, and it isn't clear that he is taking
into account relative costs of maintenance and execution time.
There is also the point that even if it saves 3 hours each run it may
still not be worth it. My customers still have jobs that are run as
overnight jobs for very good reasons, they don't care how long it takes
as long as it is finished by morning. In other situations a 1ms saving
can be critical. In other words, you have to consider how valuable
execution time is as well as how much you save and what it costs you.
>I think, this is a considerable progress for one thread and will shut up for
a while...

You haven't really answered to the costs vs portability concerns.
You mooted about banks and funds, but I pointed out that if
the execution time of the comparison was really such a key factor
then you wouldn't have used this approach at all.

Sooo... the space we are in right now is "Here's a non-portable
and less-clear hack that's often faster" and you wanting general
approval for your cleverness... or at the very least, you wanting
people to be content to have such hacks be topical here.
The reason I believe most people object to hacks like these is, I
believe, because they make it needlessly hard to understand the code
without any corresponding *significant* benefit. Note that even when
they do improve performance (some tricks people have posted would make
performance *worse* on some systems) the improved performance is often
not enough for the user to notice.
The only progress that I see that you've made is Paul Hsieh's
"don't take the naysaying too seriously". Paul is [to me] obviously
fairly knowledgable in some areas, but it -looks- to me as if he
often ends up in arguments, and I get the -impression- that a fair
number of people don't pay careful attention to him (possibly
because of the -way- he says things...)
What gets me is his apparent assumption that just because optimisations
will work and produce better results on the systems he has experience
with means that they are generally worth while. See my response to his
post pointing out that his *assumptions* about performance are wrong
even for a bog-standard notebook running Linux.
--
Flash Gordon
Still sigless on this computer.
Aug 19 '06 #63
we******@gmail.com wrote:
Mikhail Teterin wrote:
>Stephen Sprunk wrote:
>>The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).

You shouldn't fall for the "strawman" argument, especially when it's
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.
Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.
The only
chance the compiler has is to support cross-file inlining,
True, but in reality you would be unlikely to wrap up your strcmp call
inside another function. I, at least, would just call strcmp directly to
compare strings.
constant
propagation, and a kind of global string analysis that just isn't going
to yield fruitful optimizations in general. And that would be just to
try to pull even with integer comparison. If you used memcmp() instead
of strcmp(), things *might* be different.
Wrong. Calling strcmp directly, since that is how one would actually use
it, and inlining the other comparison for fairness, I get identical
performance for both integer and string comparison.

Note that the whole point of this is to demonstrate that if you write
clear code and give the compiler a fair chance (not hobbling it by
introducing an additional level of function call you would not use in
the real world) the compiler *can* do amazing pieces of optimisation.
Benching this is unnecessary. *Any* inner loop that contains a str*
function is always going to be way slower than any reasonably
reconsidered alternative.
Wrong. See above.
That's a rule of thumb people should just
always be aware of.
Like many rules of thumb from years ago it is wrong today.
In many cases, the std C library is actually
behind by actual orders of complexity. Here it's slower because of the
poorly parallelizable str* function design (it ends up
pointlessly scanning for 4 '\0' characters no matter what).
Obviously it does manage to do something far cleverer on some systems if
you give it a chance, because it does on my system, which is nothing special.
>[...] It seems, that for the limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as such
is hugely beneficial. And, contrary to authoritative assertions posted in
this thread, compiler is NOT able to detect such cases.
You are obviously wrong. See comments above.
These "authorities" are overstepping their bounds. They have decided
that this is a newsgroup just about the C standard. Practical C
programming is way beyond both the scope of this newsgroup and those so
called authorities. Their brains have, over the years, become
hardwired in a very specific way -- if you ask the question "how do I
<snip>

and yours is hardwired into believing you can always outstrip the
compiler by doing dirty tricks even though optimisers have moved on a
long way as illustrated when you actually give the compiler a chance to
optimise by using strcmp directly instead of indirectly.
The issue with your "trick" is that it's not portable.
That is only part of the issue. It also makes the code needlessly hard
to read or maintain and is *not* guaranteed to provide the suggested
performance gain.
But that's
typical of anything useful that most people do in the C language.
As has been pointed out before, there are vast amounts of very useful
things that *can* and *are* done in standard portable C.
Otherwise, yes of course it will lead to an enormous performance
improvement (after all you are removing a str* from an inner loop.)
You show your lack of knowledge about modern optimisers when they are
given a chance by writing clear code.
To make it portable, you probably want to try to isolate the *system*
through ifdef's. The reason is that you need to isolate *systems* that
align things on at most 32-bit boundaries. The exceptions will be
characterized by system. Then there's the whole sizeof(int32_t) issue
-- you probably just want to deal with 3 cases of 1, 2, 4 and forget
the other ones. Then, since you know that these strings are exactly 3
characters, you should use *memcmp()*, not strcmp() (its a little
faster) on the fallback cases.
I agree that when you do need to use system specific tricks you isolate
them. However it is entirely possible for strcmp to be as fast as
memcmp; see my comments about performance elsewhere in this post.
--
Flash Gordon
Still sigless on this computer
Aug 19 '06 #64
Mikhail Teterin wrote:
Stephen Sprunk wrote:
>The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.

Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).

FreeBSD6.1/i386 using `gcc version 3.4.4 [FreeBSD] 20050518'
with "-pedantic -Wall -O5 -pipe -march=pentium4 -DNDEBUG -DNODEBUG
-fomit-frame-pointer":

./bench 100000000
str: used 1119486 microseconds
int: used 406449 microseconds
<snip>
Of the above, Sun's cc/libmil definitely has the special "knowledge
and treatment of common functions/idioms" such as strcmp(), but even there
using strcmp() was 5 times slower...
Try giving the compiler a fighting chance of optimising the code as a
whole. After all, you would not normally wrap up your strcmp calls in
another layer of function call, would you?
It seems, that for the limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as such
is hugely beneficial. And, contrary to authoritative assertions posted in
this thread, compiler is NOT able to detect such cases.
Oh yes it is if you actually give the optimiser a fighting chance. Try
this version of your code (yes people, I know it contains Unix
specifics, I could not be bothered to remove them from Mikhail's code):

/* bench.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

#include "comp.h"

static void
reporttime(label, pu1, pu2)
const char *label;
const struct rusage *pu1, *pu2;
{

printf("%s: used %ld microseconds\n", label,
(pu2->ru_utime.tv_sec -
pu1->ru_utime.tv_sec)*1000000 +
(pu2->ru_utime.tv_usec -
pu1->ru_utime.tv_usec));
}

int
main(int argc, char *argv[])
{
xCUR cur1 = { "USD" }, cur2 = { "UHR" };
long i, iterations, t1 = 0, t2 = 0;
struct rusage ru1, ru2;

switch (argc) {
case 1:
iterations = 10000000;
break;
case 2:
iterations = atol(argv[1]);
if (iterations > 0)
break;
/* FALLTHROUGH */
default:
fprintf(stderr, "Usage:\n\t %s [iterations]\n", argv[0]);
exit(EX_USAGE);
}

if (getrusage(RUSAGE_SELF, &ru1)) {
perror("getrusage");
exit(EX_OSERR);
}

for (i = 0; i < iterations; i++)
t1 += (strcmp(cur1.acCUR, cur2.acCUR) == 0);

if (getrusage(RUSAGE_SELF, &ru2)) {
perror("getrusage");
exit(EX_OSERR);
}

reporttime("str", &ru1, &ru2);

if (getrusage(RUSAGE_SELF, &ru1)) {
perror("getrusage");
exit(EX_OSERR);
}

for (i = 0; i < iterations; i++)
t2 += (cur1.iCUR == cur2.iCUR);

if (getrusage(RUSAGE_SELF, &ru2)) {
perror("getrusage");
exit(EX_OSERR);
}
reporttime("int", &ru1, &ru2);
printf("Force strcmp to happen by printing %ld %ld\n", t1, t2);
return EX_OK;
}

/* comp.h */
#include <inttypes.h>
#include <limits.h>

typedef union {
char acCUR[4];
#if CHAR_BIT == 8
uint32_t iCUR;
#elif CHAR_BIT == 16
uint64_t iCUR;
#else
#error "I'm not prepared for this platform's value of CHAR_BIT"
#endif
} xCUR;

int comp_str(const char cur1[4], const char cur2[4]);
int comp_int(const xCUR *cur1, const xCUR *cur2);
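[The header above declares `comp_str` and `comp_int`, but their definitions never appear in the thread. A plausible comp.c — my reconstruction, with the union repeated so the sketch stands alone — would be:]

```c
/* comp.c -- possible definitions of the two helpers declared in
   comp.h. The union is repeated here so this file is self-contained. */
#include <string.h>
#include <limits.h>
#include <inttypes.h>

typedef union {
    char acCUR[4];
#if CHAR_BIT == 8
    uint32_t iCUR;
#else
#error "Sketch assumes 8-bit chars"
#endif
} xCUR;

/* String view: ordinary strcmp equality. */
int comp_str(const char cur1[4], const char cur2[4])
{
    return strcmp(cur1, cur2) == 0;
}

/* Integer view: compare all four bytes in one 32-bit operation. */
int comp_int(const xCUR *cur1, const xCUR *cur2)
{
    return cur1->iCUR == cur2->iCUR;
}
```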
--
Flash Gordon
Still sigless on this computer
Aug 19 '06 #65
Flash Gordon wrote:
we******@gmail.com wrote:
Mikhail Teterin wrote:
Stephen Sprunk wrote:
The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).
You shouldn't fall for the "strawman" argument, especially when it's
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.

Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.
I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.

The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yielded a 7x performance improvement.

If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 19 '06 #66

"Edmond Dantes" <ed****@le-comte-de-monte-cristo.biz> wrote
Walter Roberson wrote:
>In article <1940995.IL8L2SCWzz@misha>,
Mikhail Teterin <us****@aldan.algebra.com> wrote:
>>>We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".

No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

There must be an equivalent of the old aphorism, "penny-wise,
pound-foolish"
that can be applied to optimizing code.

Today, with rare exception, there is hardly the need to optimize assembly
code by hand. I could see it for, say, an embedded system running on
limited resources where it may make sense. But in general? Nope.
If you've got a very slow computer or a very fast computer, often you need
to optimise.
If you are using a 3GHz PC to run a medium-sized company accounts on a
spreadsheet, it is pretty pointless.
--
www.personal.leeds.ac.uk/~bgy1mm
freeware games to download.

Aug 19 '06 #67
>CBFalconer wrote:
>Let's cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.
In article <11**********************@i3g2000cwc.googlegroups.com>
<we******@gmail.com> wrote:
>A modern CPU will perform these operations substantially in *PARALLEL*.
Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)

All of this is somewhat moot, since the original code could be
rewritten to use memcmp():

if (memcmp(c1, c2, 4) == 0)

which could then be (but is not, in the versions I tested) optimized
by the compiler into a "load 4 bytes and compare" instruction --
in fact, the very one the OP was getting via his union trick. All
that is required is a little work on the compiler: make memcmp()
a builtin like memcpy(), and then notice small constant sizes (and
aligned addresses, if required by the CPU) and that the result is
tested only for equality to zero. That is, given:

if (memcmp(c1, c2, k) == 0) /* or != 0 */

where k is a small constant, we can easily generate good code inline
without a library call; but if the test is:

if (memcmp(c1, c2, k) > 0) /* or < 0, etc. */

the job becomes harder (it is now byte-order dependent, if we try
to use integer comparison instructions). Note that memcmp() is
always allowed to access all the bytes of the region, so on x86-64:

memcmp(c1, c2, 12) == 0 /* or != 0 */

can turn into:

((*(long long *)c1 - *(long long *)c2) |
(((int *)c1)[2] - ((int *)c2)[2])) == 0 /* or != 0 */

which should still be faster than a library call.
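[A present-day portable spelling of that same "load 4 bytes and compare" — my sketch, not Torek's code, assuming 8-bit chars and exactly-4-byte operands: copying through `memcpy` sidesteps alignment and strict-aliasing hazards, and optimising compilers routinely reduce each copy to a single 32-bit load.]

```c
#include <stdint.h>
#include <string.h>

/* Equality test for two 4-byte currency codes (3 chars + NUL).
   The memcpy calls are typically compiled into plain 32-bit loads,
   so this matches the union trick without its portability hazards. */
static int cur4_equal(const char *c1, const char *c2)
{
    uint32_t a, b;
    memcpy(&a, c1, sizeof a);
    memcpy(&b, c2, sizeof b);
    return a == b;
}
```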
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Aug 20 '06 #68
we******@gmail.com wrote:
Flash Gordon wrote:
>we******@gmail.com wrote:
>>Mikhail Teterin wrote:
Stephen Sprunk wrote:
The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).
You shouldn't fall for the "strawman" argument, especially when its
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.
Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.

I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.
It was not a silly theory; it was the result of profiling. It is not my
fault if none of the compilers you use are as good as the one I use.

As I said elsewhere, you assume that your experience applies to
everything else. It does not.
The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yielded a 7x performance improvement.
So? I have a compiler which produces different results thus countering
your claim that the integer comparison is *always* faster. One only
needs one counter example to disprove a claim of always, and no number
of example where integer comparison is faster will change that.
If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.
I'm not claiming it is using a "normal" way, only that it finds *a* way,
however devious. Quite possibly it is making use of the fact that it
knows the strings are exactly 4 bytes.

It is a bog-standard Ubuntu install on a Fujitsu Siemens Amilo Pro 2040V
notebook.

markg@brenda:~/tmp$ gcc --version
gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm not prepared to post long assembler dumps to this group since this
is a C group. However I do have a web site so I've shoved the dump of
the executable (I did not bother producing a .o file) up at
http://home.flash-gordon.me.uk/asm.txt for now. You are just lucky I did
it on an x86 based machine rather than a PPC based one!

Please note that I don't want to get in to discussing the intricacies of
x86 assembler or what gcc has done. I'm providing this for your interest
and as a demonstration of how a compiler/optimiser can be devious
potentially meaning that you don't get the performance boost you expect
on all systems under all conditions. The fact that you don't see the
same effect just shows how unpredictable these things are.

By the way, when I left the integer trick as a function the strcmp code
was *faster* than the int comparison trick. So to use the int comparison
trick you would have to implement it as a macro rather than a function
in a library to give it a good chance.
--
Flash Gordon
Still sigless on this computer.
Aug 20 '06 #69
"Chris Torek" <no****@torek.n etwrote in message
news:ec******** *@news4.newsguy .com...
CBFalconer wrote:
>>Lets cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

In article <11**********************@i3g2000cwc.googlegroups.com>
<we******@gmail.com> wrote:
>>A modern CPU will perform these operations substantially in
*PARALLEL*.

Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)
True, most novices probably would have used && instead of &; heck,
Mikhail seems fairly bright but he managed to turn the &s I wrote into
&&s. I used & intentionally to prevent short-circuit evaluation, which
would actually be harmful in this scenario.

Most modern CPUs will do speculative loads for the six chars, but doing
the tests in series, with branches in between, will still kill
performance due to lower IPC and horrible branch misprediction penalties
(up to 20 cycles per branch on some CPUs). Not that this is strictly
on-topic anyway, but IMHO discussing how to make portable code as fast
as unportable code should be.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

--
Posted via a free Usenet account from http://www.teranews.com

Aug 20 '06 #70
