Bytes | Developer Community
Storage of char in 64 bit machine

Hi all,

I have a simple definition in a C file, something like this.

main()
{
char a;
.......
int k;
}

Since a character is 8 bits, how is it stored on a 64-bit
machine? If it is word-aligned, what happens to the rest of the bytes?
Will retrieval of the char be expensive? Is it
expensive w.r.t. read or write?

Thanx and Regards,
Aruna

Aug 14 '06
On Wed, 16 Aug 2006 20:03:57 -0400, in comp.lang.c , Mikhail Teterin
<us****@aldan.algebra.com> wrote:
>Keith Thompson wrote:
>I, for one, will ignore any attachments posted to this newsgroup.

I have not spent much time on this group, but, so far, I have not seen
anything that would make me truly saddened by your decision...
As you say, you've not spent much time here if you think Keith's
contributions are worth little.
>I will consider changing this policy only if there's a general consensus
among the regulars that text-only attachments are acceptable.

You are confirming my impression that the actual merits of text-only
attachments are secondary (or even fully irrelevant) to your decision, with
the annoyance over my violating the unwritten and unofficial "rules of the
club" being the primary...
You're incorrect.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Aug 17 '06 #51
In article <1940995.IL8L2SCWzz@misha>,
Mikhail Teterin <us****@aldan.algebra.com> wrote:
>We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".
No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

A couple of people then {perhaps unwisely} said "a good strcmp() can
probably do just as well", and you produced a benchmark that appeared
to show that your code is measurably faster... at least for the systems
you tried.

This was followed by a collective "So what if it executes faster?
The time difference is small enough to be inconsequential for most
systems, and in the meantime you've written unclear code that will
likely break on future systems." Paul Hsieh is the only one who
said that it's worth exploring, and it isn't clear that he is taking
into account the relative costs of maintenance and execution time.
>I think, this is a considerable progress for one thread and will shut up for
a while...
You haven't really answered the costs-vs-portability concerns.
You mooted about banks and funds, but I pointed out that if
the execution time of the comparison was really such a key factor
then you wouldn't have used this approach at all.

Sooo... the space we are in right now is "Here's a non-portable
and less-clear hack that's often faster" and you wanting general
approval for your cleverness... or at the very least, you wanting
people to be content to have such hacks be topical here.

The only progress that I see that you've made is Paul Hsieh's
"don't take the naysaying too seriously". Paul is [to me] obviously
fairly knowledgeable in some areas, but it -looks- to me as if he
often ends up in arguments, and I get the -impression- that a fair
number of people don't pay careful attention to him (possibly
because of the -way- he says things...)
--
"It is important to remember that when it comes to law, computers
never make copies, only human beings make copies. Computers are given
commands, not permission. Only people can be given permission."
-- Brad Templeton
Aug 17 '06 #52
Mikhail Teterin <us****@aldan.algebra.com> writes:
We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".

I think, this is a considerable progress for one thread and will shut up for
a while...
Looking back in this thread's history, you weren't the OP, but you did
bring up something that triggered a lot of discussion:

| So, comparing, say, 4-char arrays (like currency codes) can NOT be done in
| the following way?
|
| typedef union {
| char acCUR[4];
| int32_t iCUR;
| } xCUR;
|
| int
| CurEqual(xCUR *c1, xCUR *c2)
| {
| if (c1->iCUR == c2->iCUR)
| printf("Same currency %s\n", c1->acCUR);
| else
| printf("%s and %s are different\n",
| c1->acCUR, c2->acCUR);
| }
|
| Having to call a strcmp() in such cases seems like a bad waste to me, but I
| don't see, how the compiler could possibly optimize such a code without the
| trick above...

In this case, it seems to me that there are solutions better than
either using strcmp() or pretending that a char[4] is an int32_t.

Presumably you want the operation of comparing two currency codes to
be as fast as possible, because that operation is a bottleneck in your
application.

You can probably achieve this by storing and comparing the currency
codes *as integers*. One simple way to do this is to compute the
numeric codes arithmetically from the string values. I think you said
that all the currency codes are two characters; if so, it's as simple
as:

numeric_code = (string_code[0] << CHAR_BIT) + string_code[1];

and the reverse conversion is also quite simple.

Note that the values of the numeric codes could vary with different
character sets, and if you store them in files, they could vary with
system byte ordering.

This avoids the need for strcmp() *and* it doesn't depend on type
punning.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Aug 18 '06 #53
Lew Pitcher wrote:
Assume the code fragment...

{ /* A */
char aa; int ab, ac;
char ad;
{ /* B */
char ae;
}
}

In the nesting level I've called "A", the compiler /may/ optimize the
allocations of aa, ab, ac, and ad so that aa and ad are adjacent in
"memory".
However, because variable ae is declared within a different "level" of
the code, I doubt that most compilers would "optimize" its allocation
to occupy another 8 bits within that 64-bit allocation that aa and ad
potentially occupy.
Every compiler I've ever used allocates all local variables,
on all "levels", at the start of the function (by advancing the
stack pointer a set amount). It doesn't declare new stack
frames for code blocks within the same function, or
anything like that.

Aug 18 '06 #54
In article <1940995.IL8L2SCWzz@misha> Mikhail Teterin <us****@aldan.algebra.com> writes:
We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".
You quoted a gain of 0.00000000713037 seconds per comparison in EOD
processing at a bank. Even if in that EOD processing 1000 million of
such comparisons are made, your net gain is about 7 seconds. I think
that is peanuts compared to the total time required for EOD processing.

I would think that in the EOD process the comparisons are a small fraction
of the total time involved. Say 10% (which is, eh, quite large in my
opinion). If we reduce that by a factor 5, the total time will be
reduced by only 8%.

But even *if* it is significant, there are better methods to increase
speed. Set up a table with all existing currencies, and use indexing.
When a transaction is entered, look up the index numbers for the
transaction (takes about 0.00003 seconds if there are two currencies
involved), and use those indices for the remainder.

Even when loading a full table of conversion rates between two currencies,
the finding of an index would be peanuts (0.0003 seconds gain with your
method). I would think that reading the rates themselves would take
much more time.

This is called micro-optimisation: optimising a part of the program
that does not use a significant amount of time compared to the
remainder. You should start at the bottlenecks, and for that you
have to measure what the bottlenecks are. Profile the program.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 18 '06 #55
Walter Roberson wrote:
In article <1940995.IL8L2SCWzz@misha>,
Mikhail Teterin <us****@aldan.algebra.com> wrote:
>>We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".

No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".
There must be an equivalent of the old aphorism, "penny-wise, pound-foolish"
that can be applied to optimizing code.

Today, with rare exception, there is hardly the need to optimize assembly
code by hand. I could see it for, say, an embedded system running on
limited resources where it may make sense. But in general? Nope.

--
-- Edmond Dantes, CMC
Aug 18 '06 #56
"Mikhail Teterin" <us****@aldan.algebra.com> wrote in message
news:2432497.Jq3QulhsxS@misha...
Stephen Sprunk wrote:
>The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.

Well, here are some benchmarks comparing the use of strcmp() to
compare short character strings (4 characters).
....
It seems that for limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as
such is hugely beneficial. And, contrary to authoritative assertions
posted in this thread, the compiler is NOT able to detect such cases.
Okay, I retract my statement. The compiler does a good job with strcmp(),
but it'll never be as fast as your version because the compiler has no way
to know that the strings will _always_ be exactly of length 3, so it has to
test for '\0' several times. That difference of knowledge will easily
account for the 3x-5x difference in execution speed.

I am curious if the following has a substantially different speed from your
version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
/* catch people calling this function with bad params */
assert((strlen(c1)==3) && (strlen(c2)==3));

/* Note: & is safe here since == always returns 0 or 1, and
it's likely to be faster than && */
if ((c1[0]==c2[0])&(c1[1]==c2[1])&(c1[2]==c2[2]))
printf("Same currency %s\n", c1);
else
printf("%s and %s are different\n", c1, c2);
}
-------

This version looks, to me, like it's fully portable and should be
substantially faster than strcmp(), at least when NDEBUG is defined.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking


Aug 18 '06 #57
Stephen Sprunk wrote:
if ((c1[0]==c2[0])&(c1[1]==c2[1])&(c1[2]==c2[2]))
Yes, it is faster than strcmp(), of course. But still slower than the
int-compare. I added two more functions to the test:

int
comp_mem(const char cur1[4], const char cur2[4])
{
return memcmp(cur1, cur2, 4) == 0; /* note: sizeof cur1 would be the pointer size here, not 4 */
}

int
comp_Sprunk(const char cur1[4], const char cur2[4])
{
return cur1[0] == cur2[0] && cur1[1] == cur2[1]
&& cur1[2] == cur2[2];
}
Here are the results:

FreeBSD/i386 (Pentium D @3GHz):
str: used 1091883 microseconds
int: used 404868 microseconds
Sprunk: used 590683 microseconds
memcmp: used 4424082 microseconds

FreeBSD/amd64 (opteron 244):
str: used 1402967 microseconds
int: used 392841 microseconds
Sprunk: used 617995 microseconds
memcmp: used 1571644 microseconds

Solaris8/sparc (@900MHz)
str: used 730000 microseconds
int: used 130000 microseconds
Sprunk: used 160000 microseconds
memcmp: used 560000 microseconds

AIX/powerpc
str: used 3200000 microseconds
int: used 710000 microseconds
Sprunk: used 820000 microseconds
memcmp: used 2990000 microseconds

Linux/i386 (Xeon @3GHz)
str: used 1060000 microseconds
int: used 390000 microseconds
Sprunk: used 530000 microseconds
memcmp: used 3520000 microseconds

What puzzles me, actually, is the poor performance of memcmp() vs. strcmp(),
but that's it -- no other surprises...
This version looks, to me, like it's fully portable and should be
substantially faster than strcmp(), at least when NDEBUG is defined.
Agreed about the strcmp, but portability is not something my version
actually lacks either. I'm unaware of a general-purpose platform with
a CHAR_BIT other than 8. And if int32_t is missing, well, that's a
compile-time error, not a run-time disaster...

-mi
Aug 18 '06 #58
Stephen Sprunk wrote:
>
... snip ...
>
I am curious if the following has a substantially different speed
from your version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
/* catch people calling this function with bad params */
assert((strlen(c1)==3) && (strlen(c2)==3));
/* Note: & is safe here since == always returns 0 or 1,
and it's likely to be faster than && */
if ((c1[0]==c2[0])&(c1[1]==c2[1])&(c1[2]==c2[2]))
printf("Same currency %s\n", c1);
else
printf("%s and %s are different\n", c1, c2);
}
-------

This version looks, to me, like it's fully portable and should be
substantially faster than strcmp(), at least when NDEBUG is defined.
Let's cut that down to the essentials and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

Now, let's analyze a normal comparison, without the limitations on
length etc., operating on the same length-3 strings:

int ceq(const char *c1, const char *c2)
{
while (*c1 && (*c1 == *c2)) {
c1++; c2++;
}
return *c1 == *c2;
}

In the best case, *c1 != *c2, assuming reasonable register
allocations, there will be 2 dereferences, one zero test, and one
comparison. The loop is never executed. Note that for random
strings this is the MOST LIKELY result. Much faster.

In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.

You will have to do an unholy number of calls to these routines to
measure the difference with most clock systems' resolution. Have
fun. Don't forget that execution speed is data-dependent, so you
will need to generate bags and bags of random strings. I suggest
the Mersenne Twister to do that, but you will have to compensate
for string generation time.

--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
Aug 19 '06 #59
CBFalconer wrote:
Stephen Sprunk wrote:
... snip ...
I am curious if the following has a substantially different speed
from your version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
/* catch people calling this function with bad params */
assert((strlen(c1)==3) && (strlen(c2)==3));
/* Note: & is safe here since == always returns 0 or 1,
and it's likely to be faster than && */
if ((c1[0]==c2[0])&(c1[1]==c2[1])&(c1[2]==c2[2]))
printf("Same currency %s\n", c1);
else
printf("%s and %s are different\n", c1, c2);
}
-------

This version looks, to me, like it's fully portable and should be
substantially faster than strcmp(), at least when NDEBUG is defined.

Let's cut that down to the essentials and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.
A modern CPU will perform these operations substantially in *PARALLEL*.
In fact the full schedule in an OOO machine like an Athlon (3-way
superscalar, with free offsets) looks something like this:

Cycle 1: r0 <- load(c1) ||| r1 <- load (c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load (c2+4)
Cycle 3: f1 <- cmp(r0, r1) ||| r4 <- load (c1+8) ||| r5 <- load (c2+8)
Cycle 4: r6 <- f1 ||| f2 <- cmp (r2, r3)
Cycle 5: r7 <- f2 ||| f3 <- cmp (r4, r5)
Cycle 6: r8 <- and(r6, r7) ||| r9 <- f3
Cycle 7: ra <- and(r8, r9) ||| return

Every cycle performs at least two operations with only cycles 4 and 5
being somewhat less efficient due to the way the x86 architecture works
(you have to copy from the flags to the register.) This tells us that
you should instead use the expression:

static int cneq (const char * c1, const char * c2) {
return (c1[0] - c2[0]) | (c1[1] - c2[1]) | (c1[2] - c2[2]);
}

This leads us to:

Cycle 1: r0 <- load(c1) ||| r1 <- load(c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load(c2+4)
Cycle 3: r6 <- sub(r0, r1) ||| r4 <- load(c1+8) ||| r5 <- load(c2+8)
Cycle 4: r7 <- sub(r2, r3)
Cycle 5: r8 <- sub(r4, r5) ||| r9 <- ior(r6, r7)
Cycle 6: ra <- ior(r8, r9) ||| return

which is a cycle faster.
Now, lets analyze a normal comparison, without the limitations on
length etc. operating on the same length 3 strings:

int ceq(const char *c1, const char *c2)
{
while (*c1 && (*c1 == *c2)) {
c1++; c2++;
}
return *c1 == *c2;
}

In the best case, *c1 != *c2, assuming reasonable register
allocations, there will be 2 dereferences, one zero test, and one
comparison.
And *one branch prediction*. In any event, let's take a look:

Cycle 1: r0 <- load(c1) ||| r1 <- load (c1)
Cycle 2: r2 <- load(c2+4)
Cycle 3: f0 <- cmp(r0, 0) ||| ifnz(f0) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 4/14: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 5/15: noop
Cycle 6/16: f2 <- cmp(r3, r4)
Cycle 7/17: r5 <- f2 ||| return

{ or: }

Cycle 4/14: f1 <- cmp(r1, r2) ||| ifnz(f1) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 5/15: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 6/16: noop
Cycle 7/17: f2 <- cmp(r3, r4)
Cycle 8/18: r5 <- f2 ||| return

etc. So your *best* case costs basically 7 clocks anyways. (Once you
either start predicting wrong, or enter the loop, you are clearly much
worse.) These are highly idealized models of the Athlon machine, so
real results may be different. It might be interesting to see the real
world results.
[...] The loop is never executed. Note that for random
strings this is the MOST LIKELY result. Much faster.
It highly depends on the random distribution. The branch prediction
penalty could *easily* dominate. For example, if there were only two
or three really common strings (USD, Euros, and Yen, say) but no
discernible pattern in those actual cases, then you could easily tend
to get the branch prediction wrong around 30% of the time, which
translates to an average of 3 clocks extra.
In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.
I haven't done the bench, and clearly you haven't either. Care to put
money on this claim?

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 19 '06 #60
we******@gmail.com wrote:
CBFalconer wrote:
>Stephen Sprunk wrote:
... snip ...
>In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.

I haven't done the bench, and clearly you haven't either. Care to put
money on this claim?
No. Clearly you are much more familiar with the internals of
modern CPUs than I am. However, considering the chances of
entering the loop at all, a priori (for random strings) there is
only one chance in 256 (for 8 bit bytes) of entering. In actuality
the chance is going to be much higher, maybe one in 100, yet not to
be sneezed at. In the probable case there are two decisions,
whether to enter the loop, and what to return. By putting these in
the appropriate default path, as an intelligent compiler can
(possibly by generating hints), all delays in the routine can be
avoided most of the time. In fact, by putting those decisions in
the calling sequence all control transfers can be avoided most of
the time! You can even express that in C as a conditional calling
macro and a specialized subroutine.

I tend to think of much simpler processors, i.e. the kind that are
really prevalent in the embedded world, and which make up the vast
majority. These are the ones that have relatively limited
performance, and need careful tending to get maximized results.

This, while interesting, is OT for c.l.c, yet it is the sort of
thing with which every programmer should familiarize themselves.
Otherwise it can be impossible to make some fundamental decisions
on methods and techniques, at least in cases where it matters.

--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
Aug 19 '06 #61
CBFalconer wrote:
we******@gmail.com wrote:
CBFalconer wrote:
In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.
I haven't done the bench, and clearly you haven't either. Care to put
money on this claim?

No. Clearly you are much more familiar with the internals of
modern CPUs than I am. However, considering the chances of
entering the loop at all, a priori (for random strings) there is
only one chance in 256 (for 8 bit bytes) of entering.
The point of my analysis was to explain that this idea of yours is
wrong. Short-circuiting only works if there is a substantial amount of
work that you are saving by doing so. As the clock counts get smaller,
it's generally a better idea to make code straight-line and
parallelizable and avoid any control flow.
[...] In actuality
the chance is going to be much higher, maybe one in 100, yet not to
be sneezed at.
Huh? The O.P. has explained that these are currencies. So presumably
he's tracking transactions at a bank or something like that. Not that
I am an expert on such things, but I would imagine that there are
extreme biases in the input data that he sees. I.e., he will see a lot
more USD, Euro and Yen than he will of the Nicaraguan Cordoba for
example. If it's pretty much *all* USD, then the branch predictors
start working for you again. But as soon as you see a mix with a
handful of very significant entries, the branch predictors will just
end up failing. My intuition is that in fact for certain international
banks or European banks (that are mixing the British pound and the
euro, for example) this is, in fact, a worst-case scenario.

You generally can't think of probabilities of real world data in
simplistic terms such as expectations about the range of the data type.
Even approximating them this way can so easily lead you astray. This
is the same reason that it's so important to really look at worst-case
scenarios in generic code.
[...] In the probable case there are two decisions,
whether to enter the loop, and what to return. By putting these in
the appropriate default path, as an intelligent compiler can
(possibly by generating hints), all delays in the routine can be
avoided most of the time. In fact, by putting those decisions in
the calling sequence all control transfers can be avoided most of
the time! You can even express that in C as a conditional calling
macro and a specialized subroutine.

I tend to think of much simpler processors, i.e. the kind that are
really prevalent in the embedded world, and which make up the vast
majority. These are the ones that have relatively limited
performance, and need careful tending to get maximized results.
You mean embedded processors from the past? MIPS, XScale, and PowerPC
are all superscalar and fully pipelined processors. There are
embodiments of these which are deeply pipelined and support some degree
of OOO. While some are not able to completely re-order the
instructions, good compilers would *try* to do that for them anyways.
Over time standard processors become embedded processors.
[...]
This, while interesting, is OT for c.l.c, yet it is the sort of
thing with which every programmer should familiarize themselves.
Which do you think they should learn? Your incorrect/inapplicable
analysis, or my somewhat detailed analysis? I think people should
first learn to benchmark, as our other poster here seems keen on. They
can learn what I know if they really feel up to it, or they can just
stick with benchmarking and learn some rules of thumb along the way.
Otherwise it can be impossible to make some fundamental decisions
on methods and techniques, at least in cases where it matters.
If you want to make *correct* decisions, and somehow benchmarking is
not a good option for you, then you have little choice but to engage in
the kind of analysis that I do. But I too have to eventually put
things to the acid test -- because in real processors there are often
factors that are just not easy to see.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 19 '06 #62
Walter Roberson wrote:
In article <1940995.IL8L2SCWzz@misha>,
Mikhail Teterin <us****@aldan.algebra.com> wrote:
>We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".

No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

A couple of people then {perhaps unwisely} said "a good strcmp() can
probably do just as well", and you produced a benchmark that appeared
to show that your code is measurably faster... at least for the systems
you tried.
I just tried the benchmark on my machine and got the same result.
However, when I changed bench.c to call strcmp directly, which is how I
would actually do the string comparison inline, the results switched to
strcmp being far faster. When, to be fair, I changed the integer
comparison to being inline, I got the *same* results for both. So I don't
think the comment about good strcmp implementations was unwise; I think
it was justifiable at least in part. Under the right conditions the
strcmp method *does* perform as efficiently, because the compiler knows
what strcmp does, inlines it, and then optimises the inlined code in
context.

<OT>
I believe that this is the reason gcc has a number of "library"
functions implemented as built-ins rather than just calling the library.
The built ins *are* inlined before the optimiser is invoked.
</OT>
This was followed by a collective "So what if it executes faster?
The time difference is small enough to be inconsequential for most
systems, and in the meantime you've written unclear code that will
likely break on future systems." Paul Hsieh is the only one who
said that it's worth exploring, and it isn't clear that he is taking
into account relative costs of maintenance and execution time.
There is also the point that even if it saves 3 hours each run it may
still not be worth it. My customers still have jobs that are run as
overnight jobs for very good reasons, they don't care how long it takes
as long as it is finished by morning. In other situations a 1ms saving
can be critical. In other words, you have to consider how valuable
execution time is as well as how much you save and what it costs you.
>I think, this is a considerable progress for one thread and will shut up for
a while...

You haven't really answered to the costs vs portability concerns.
You mooted about banks and funds, but I pointed out that if
the execution time of the comparison was really such a key factor
then you wouldn't have used this approach at all.

Sooo... the space we are in right now is "Here's a non-portable
and less-clear hack that's often faster" and you wanting general
approval for your cleverness... or at the very least, you wanting
people to be content to have such hacks be topical here.
The reason most people object to hacks like these is, I believe,
that they make it needlessly hard to understand the code
without any corresponding *significant* benefit. Note that even when
they do improve performance (some tricks people have posted would make
performance *worse* on some systems) the improved performance is often
not enough for the user to notice.
The only progress that I see that you've made is Paul Hsieh's
"don't take the naysaying too seriously". Paul is [to me] obviously
fairly knowledgable in some areas, but it -looks- to me as if he
often ends up in arguments, and I get the -impression- that a fair
number of people don't pay careful attention to him (possibly
because of the -way- he says things...)
What gets me is his apparent assumption that because optimisations
work and produce better results on the systems he has experience
with, they are generally worthwhile. See my response to his
post pointing out that his *assumptions* about performance are wrong
even for a bog-standard notebook running Linux.
--
Flash Gordon
Still sigless on this computer.
Aug 19 '06 #63
we******@gmail.com wrote:
Mikhail Teterin wrote:
>Stephen Sprunk wrote:
>>The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).

You shouldn't fall for the "strawman" argument, especially when its
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.
Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.
The only
chance the compiler has is to support cross-file inlining,
True, but in reality you would be unlikely to wrap up your strcmp call
inside another function. I, at least, would just call strcmp directly to
compare strings.
constant
propagation, and a kind of global string analysis that just isn't going
to yield fruitful optimizations in general. And that would be just to
try to pull even with integer comparison. If you used memcmp() instead
of strcmp(), things *might* be different.
Wrong. Calling strcmp directly since that is how one would actually use
it and inlining the other comparison for fairness I get identical
performance for both integer and string comparison.

Note that the whole point of this is to demonstrate that if you write
clear code and give the compiler a fair chance (not hobbling it by
introducing an additional level of function call you would not use in
the real world) the compiler *can* do amazing pieces of optimisation.
Benching this is unnecessary. *Any* inner loop that contains a str*
function is always going to be way slower than any reasonably
reconsidered alternative.
Wrong. See above.
That's a rule of thumb people should just
always be aware of.
Like many rules of thumb from years ago it is wrong today.
In many cases, the std C library is actually
behind by actual orders of complexity. Here it's slower because of the
poor, not really parallelizable str* function design (it ends up
pointlessly scanning for 4 '\0' characters no matter what).
Obviously it does manage to do something far cleverer on some systems if
you give it a chance, because it does on my system, which is nothing special.
>[...] It seems, that for the limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as such
is hugely beneficial. And, contrary to authoritative assertions posted in
this thread, compiler is NOT able to detect such cases.
You are obviously wrong. See comments above.
These "authorities" are overstepping their bounds. They have decided
that this is a newsgroup just about the C standard. Practical C
programming is way beyond both the scope of this newsgroup and those so
called authorities. Their brains have, over the years, become
hardwired in a very specific way -- if you ask the question "how do I
<snip>

and yours is hardwired into believing you can always outstrip the
compiler by doing dirty tricks, even though optimisers have moved on a
long way, as illustrated when you actually give the compiler a chance to
optimise by using strcmp directly instead of indirectly.
The issue with your "trick" is that its not portable.
That is only part of the issue. It also makes the code needlessly hard
to read or maintain and is *not* guaranteed to provide the suggested
performance gain.
But that's
typical of anything useful that most people do in the C language.
As has been pointed out before, there are vast amounts of very useful
things that *can* and *are* done in standard portable C.
Otherwise, yes of course it will lead to an enormous performance
improvement (after all you are removing a str* from an inner loop.)
You show your lack of knowledge about modern optimisers when they are
given a chance by writing clear code.
To make it portable, you probably want to try to isolate the *system*
through ifdef's. The reason is that you need to isolate *systems* that
align things on at most 32-bit boundaries. The exceptions will be
characterized by system. Then there's the whole sizeof(int32_t) issue
-- you probably just want to deal with 3 cases of 1, 2, 4 and forget
the other ones. Then, since you know that these strings are exactly 3
characters, you should use *memcmp()*, not strcmp() (its a little
faster) on the fallback cases.
I agree that when you do need to use system-specific tricks you isolate
them. However, it is entirely possible for strcmp to be as fast as
memcmp; see my comments about performance elsewhere in this post.
--
Flash Gordon
Still sigless on this computer
Aug 19 '06 #64
Mikhail Teterin wrote:
Stephen Sprunk wrote:
>The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.

Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).

FreeBSD6.1/i386 using `gcc version 3.4.4 [FreeBSD] 20050518'
with "-pedantic -Wall -O5 -pipe -march=pentium4 -DNDEBUG -DNODEBUG
-fomit-frame-pointer":

./bench 100000000
str: used 1119486 microseconds
int: used 406449 microseconds
<snip>
Of the above, the Sun's cc/libmil definitely has the special "knowledge
and treatment of common functions/idioms" such as strcmp(), but even there
using strcmp() was 5 times slower...
Try giving the compiler a fighting chance of optimising the code as a
whole. After all, you would not normally wrap up your strcmp calls in
another layer of function call, would you?
It seems, that for the limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as such
is hugely beneficial. And, contrary to authoritative assertions posted in
this thread, compiler is NOT able to detect such cases.
Oh yes it is if you actually give the optimiser a fighting chance. Try
this version of your code (yes people, I know it contains Unix
specifics, I could not be bothered to remove them from Mikhail's code):

/* bench.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

#include "comp.h"

static void
reporttime(const char *label, const struct rusage *pu1,
    const struct rusage *pu2)
{

printf("%s: used %ld microseconds\n", label,
(pu2->ru_utime.tv_sec -
pu1->ru_utime.tv_sec)*1000000 +
(pu2->ru_utime.tv_usec -
pu1->ru_utime.tv_usec));
}

int
main(int argc, char *argv[])
{
xCUR cur1 = { "USD" }, cur2 = { "UHR" };
long i, iterations,t1=0,t2=0;
struct rusage ru1, ru2;

switch (argc) {
case 1:
iterations = 10000000;
break;
case 2:
iterations = atol(argv[1]);
if (iterations > 0)
break;
/* FALLTHROUGH */
default:
fprintf(stderr, "Usage:\n\t%s [iterations]\n", argv[0]);
exit(EX_USAGE);
}

if (getrusage(RUSAGE_SELF, &ru1)) {
perror("getrusage");
exit(EX_OSERR);
}

for (i = 0; i < iterations; i++)
t1 += (strcmp(cur1.acCUR, cur2.acCUR) == 0);

if (getrusage(RUSAGE_SELF, &ru2)) {
perror("getrusage");
exit(EX_OSERR);
}

reporttime("str", &ru1, &ru2);

if (getrusage(RUSAGE_SELF, &ru1)) {
perror("getrusage");
exit(EX_OSERR);
}

for (i = 0; i < iterations; i++)
t2 += (cur1.iCUR == cur2.iCUR);

if (getrusage(RUSAGE_SELF, &ru2)) {
perror("getrusage");
exit(EX_OSERR);
}
reporttime("int", &ru1, &ru2);
printf("Force comparisons to happen by printing %ld %ld\n",t1,t2);
return EX_OK;
}

/* comp.h */
#include <inttypes.h>
#include <limits.h>

typedef union {
char acCUR[4];
#if CHAR_BIT == 8
uint32_t iCUR;
#elif CHAR_BIT == 16
uint64_t iCUR;
#else
#error "I'm not prepared for this platform's value of CHAR_BIT"
#endif
} xCUR;

int comp_str(const char cur1[4], const char cur2[4]);
int comp_int(const xCUR *cur1, const xCUR *cur2);
--
Flash Gordon
Still sigless on this computer
Aug 19 '06 #65
Flash Gordon wrote:
we******@gmail.com wrote:
Mikhail Teterin wrote:
Stephen Sprunk wrote:
The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).
You shouldn't fall for the "strawman" argument, especially when its
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.

Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.
I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.

The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yielded a 7x performance improvement.

If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 19 '06 #66

"Edmond Dantes" <ed****@le-comte-de-monte-cristo.biz> wrote
Walter Roberson wrote:
>In article <1940995.IL8L2SCWzz@misha>,
Mikhail Teterin <us****@aldan.algebra.com> wrote:
>>>We went from "don't do this, you idiot, the compiler is much better
optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort".

No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

There must be an equivalent of the old aphorism, "penny-wise,
pound-foolish"
that can be applied to optimizing code.

Today, with rare exception, there is hardly the need to optimize assembly
code by hand. I could see it for, say, an embedded system running on
limited resources where it may make sense. But in general? Nope.
If you've got a very slow computer or a very fast computer, often you need
to optimise.
If you are using a 3GHz PC to run a medium-sized company accounts on a
spreadsheet, it is pretty pointless.
--
www.personal.leeds.ac.uk/~bgy1mm
freeware games to download.

Aug 19 '06 #67
>CBFalconer wrote:
>Lets cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.
In article <11**********************@i3g2000cwc.googlegroups.com>
<we******@gmail.com> wrote:
>A modern CPU will perform these operations substantially in *PARALLEL*.
Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)

All of this is somewhat moot, since the original code could be
rewritten to use memcmp():

if (memcmp(c1, c2, 4) == 0)

which could then be (but is not, in the versions I tested) optimized
by the compiler into a "load 4 bytes and compare" instruction --
in fact, the very one the OP was getting via his union trick. All
that is required is a little work on the compiler: make memcmp()
a builtin like memcpy(), and then notice small constant sizes (and
aligned addresses, if required by the CPU) and that the result is
tested only for equality to zero. That is, given:

if (memcmp(c1, c2, k) == 0) /* or != 0 */

where k is a small constant, we can easily generate good code inline
without a library call; but if the test is:

if (memcmp(c1, c2, k) > 0) /* or < 0, etc. */

the job becomes harder (it is now byte-order dependent, if we try
to use integer comparison instructions). Note that memcmp() is
always allowed to access all the bytes of the region, so on x86-64:

memcmp(c1, c2, 12) == 0 /* or != 0 */

can turn into:

((*(long long *)c1 - *(long long *)c2) |
(((int *)c1)[2] - ((int *)c2)[2])) == 0 /* or != 0 */

which should still be faster than a library call.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Aug 20 '06 #68
we******@gmail.com wrote:
Flash Gordon wrote:
>we******@gmail.com wrote:
>>Mikhail Teterin wrote:
Stephen Sprunk wrote:
The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).
You shouldn't fall for the "strawman" argument, especially when its
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.
Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.

I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.
It was not a silly theory; it was the result of profiling. It's not my
fault if none of the compilers you use are as good as the one I use.

As I said elsewhere. You assume that your experience applies to
everything else. It does not.
The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yielded a 7x performance improvement.
So? I have a compiler which produces different results thus countering
your claim that the integer comparison is *always* faster. One only
needs one counter example to disprove a claim of always, and no number
of example where integer comparison is faster will change that.
If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.
I'm not claiming it is using a "normal" way only that it find *a* way
however devious. Quite possibly it is making use of the fact that it
knows the strings are exactly 4 bytes.

It is a bog-standard Ubuntu install on a Fujitsu Siemens Amilo Pro 2040V
notebook.

markg@brenda:~/tmp$ gcc --version
gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm not prepared to post long assembler dumps to this group since this
is a C group. However I do have a web site so I've shoved the dump of
the executable (I did not bother producing a .o file) up at
http://home.flash-gordon.me.uk/asm.txt for now. You are just lucky I did
it on an x86 based machine rather than a PPC based one!

Please note that I don't want to get in to discussing the intricacies of
x86 assembler or what gcc has done. I'm providing this for your interest
and as a demonstration of how a compiler/optimiser can be devious
potentially meaning that you don't get the performance boost you expect
on all systems under all conditions. The fact that you don't see the
same effect just shows how unpredictable these things are.

By the way, when I left the integer trick as a function the strcmp code
was *faster* than the int comparison trick. So to use the int comparison
trick you would have to implement it as a macro rather than a function
in a library to give it a good chance.
--
Flash Gordon
Still sigless on this computer.
Aug 20 '06 #69
"Chris Torek" <no****@torek.net> wrote in message
news:ec*********@news4.newsguy.com...
CBFalconer wrote:
>>Lets cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

In article <11**********************@i3g2000cwc.googlegroups. com>
<we******@gmail.comwrote:
>>A modern CPU will perform these operations substantially in
*PARALLEL*.

Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)
True, most novices probably would have used && instead of &; heck,
Mikhail seems fairly bright but he managed to turn the &s I wrote into
&&s. I used & intentionally to prevent short-circuit evaluation, which
would actually be harmful in this scenario.

Most modern CPUs will do speculative loads for the six chars, but doing
the tests in series, with branches in between, will still kill
performance due to lower IPC and horrible branch misprediction penalties
(up to 20 cycles per branch on some CPUs). Not that this is strictly
on-topic anyways, but IMHO discussing how to make portable code as fast
as unportable code should be.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

--
Posted via a free Usenet account from http://www.teranews.com

Aug 20 '06 #70
Flash Gordon wrote:
we******@gmail.com wrote:
Flash Gordon wrote:
we******@gmail.com wrote:
Mikhail Teterin wrote:
Stephen Sprunk wrote:
The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).
You shouldn't fall for the "strawman" argument, especially when its
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.
Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.
I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.

It was not a silly theory; it was the result of profiling. It's not my
fault if none of the compilers you use are as good as the one I use.
I'm not sure you understand how unlikely that is.
As I said elsewhere. You assume that your experience applies to
everything else. It does not.
I'm waiting for a counter example -- this doesn't count as one, BTW.
The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yielded a 7x performance improvement.

So? I have a compiler which produces different results thus countering
your claim that the integer comparison is *always* faster. One only
needs one counter example to disprove a claim of always, and no number
of example where integer comparison is faster will change that.
I only mean to say that "extraordinary claims demand extraordinary
evidence". I can only ask this if I establish that your claim is
extraordinary, which I think I did.
If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.

I'm not claiming it is using a "normal" way only that it find *a* way
however devious. Quite possibly it is making use of the fact that it
knows the strings are exactly 4 bytes.
That itself would still be considered "normal". In my thinking on
this, I have incorporated a lot of consideration for what the compiler
can and cannot do. By not "normal" I am talking about tricks like some
compilers used to "detect SpecCPU" (a famous benchmark) and replace
code with hand tuned transformations that are not ordinarily within the
capabilities of any compiler.
It is a bog-standard Ubuntu install on a Fujitsu Siemens Amilo Pro 2040V
notebook.

markg@brenda:~/tmp$ gcc --version
gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm not prepared to post long assembler dumps to this group since this
is a C group. However I do have a web site so I've shoved the dump of
the executable (I did not bother producing a .o file) up at
http://home.flash-gordon.me.uk/asm.txt for now.
You don't get it do you? From my previous post, I was just shy of
calling you a liar. With this post, I now fully confirm it.

That assembly code has no significant loops in it of any kind (it does
call gets for some reason). But you aren't just missing a .o file,
that main doesn't call out anywhere. So this *is* all the code, but it
does *NOT* run the benchmark in question.
[...] You are just lucky I did
it on an x86 based machine rather than a PPC based one!
It would slow me down a bit, but it would serve the exact same purpose
-- to prove that you are lying.
Please note that I don't want to get in to discussing the intricacies of
x86 assembler or what gcc has done.
Deep analysis is not necessary. Look for the fragment of code under
<main>, look for any j?? instructions and notice that there aren't any
(so there are no loops in main). Then look at all the lines with call
hex-number <symboliclabel>. Each symboliclabel points into standard
library functions. (A quick check of these labels earlier in the code
shows they are all direct jumps into the standard library, so no tricky
shenanigans are happening here.)
[...] I'm providing this for your interest
and as a demonstration of how a compiler/optimiser can be devious
potentially meaning that you don't get the performance boost you expect
on all systems under all conditions.
This is "deviousness" of the original source construction and claims
being made, and nothing more. I.e., the assembly does not correspond
to the source you claim it does, Mr. Fraud Gordon.
[...] The fact that you don't see the
same effect just shows how unpredictable these things are.
You just don't have any idea. If forced to reveal the assembly, you
cannot trick me. You simply don't possess the skill. You *believed* I
would not check this assembly, or perhaps thought I was all talk and no
show. You don't even realize that I actually compiled your source, ran
all the benches *AND* disassembled each one to confirm my understanding
of what was going on on *5* compilers. If you understood this, perhaps
you would not have tried to perpetrate such a blatant attempt at a con.

Of course, nobody else has chimed in (all you have to do is compile and
run his source to see he's obviously wrong/lying). Perhaps this
explains your level of expectation. I don't know why you thought you
would be able to get away with this, but I'm the wrong person to try
pull a trick like this on.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 21 '06 #71
we******@gmail.com wrote:
Flash Gordon wrote:
>we******@gmail.com wrote:
>>Flash Gordon wrote:
we******@gmail.com wrote:
Mikhail Teterin wrote:
>Stephen Sprunk wrote:
>>The implementation is likely to have a very, very clever strcmp() that
>>will perform at least as well as your code (possibly doing the same thing
>>internally, if it's known to be safe) and likely even better if the
>>compiler is reasonably modern due special knowledge and treatment of
>>common functions/idioms.
>Well, here are some benchmarks comparing the use of strcmp() to compare
>short character strings (4 characters).
You shouldn't fall for the "strawman" argument, especially when its
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.
Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.
I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.
It was not a silly theory; it was the result of profiling. It's not my
fault if none of the compilers you use are as good as the one I use.

I'm not sure you understand how unlikely that is.
>As I said elsewhere. You assume that your experience applies to
everything else. It does not.

I'm waiting for a counter example -- this doesn't count as one, BTW.
I made a simple, honest mistake. I forgot that I had built another
sample of code from here in the same directory. Since I normally don't
bother specifying an output name it writes it as a.out, thus I put up
the wrong sample.
>>The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yielded a 7x performance improvement.
So? I have a compiler which produces different results thus countering
your claim that the integer comparison is *always* faster. One only
needs one counter example to disprove a claim of always, and no number
of example where integer comparison is faster will change that.

I only mean to say that "extraordinary claims demand extraordinary
evidence". I can only ask this if I establish that your claim is
extraordinary, which I think I did.
I put up the wrong dump. That is now rectified.
>>If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.
I'm not claiming it is using a "normal" way, only that it finds *a* way,
however devious. Quite possibly it is making use of the fact that it
knows the strings are exactly 4 bytes.

That itself would still be considered "normal". In my thinking on
this, I have incorporated a lot of consideration for what the compiler
can and cannot do. By not "normal" I am talking about tricks like some
compilers used to "detect SpecCPU" (a famous benchmark) and replace
code with hand tuned transformations that are not ordinarily within the
capabilities of any compiler.
Well, whatever gcc is doing on my system it is within the bounds of an
"ordinary" compiler since I do not have a custom copy of gcc.
>It is a bog-standard Ubuntu install on a Fujitsu Siemens Amilo Pro 2040V
notebook.

markg@brenda:~/tmp$ gcc --version
gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm not prepared to post long assembler dumps to this group since this
is a C group. However I do have a web site so I've shoved the dump of
the executable (I did not bother producing a .o file) up at
http://home.flash-gordon.me.uk/asm.txt for now.

You don't get it, do you? From my previous post, I was just shy of
calling you a liar. With this post, I now fully confirm it.
No, it shows that like anyone else I can make an honest mistake. Do you
never make mistakes then?
That assembly code has no significant loops in it of any kind (it does
call gets for some reason). But you aren't just missing a .o file,
that main doesn't call out anywhere. So this *is* all the code, but it
does *NOT* run the benchmark in question.
Agreed. I did the wrong code by mistake.
>[...] You are just lucky I did
it on an x86 based machine rather than a PPC based one!

It would slow me down a bit, but it would serve the exact same purpose
-- to prove that you are lying.
No. It was a simple mistake.

<snip>
>[...] The fact that you don't see the
same effect just shows how unpredictable these things are.

You just don't have any idea. If forced to reveal the assembly, you
cannot trick me. You simply don't possess the skill. You *believed* I
would not check this assembly, or perhaps thought I was all talk and no
show. You don't even realize that I actually compiled your source, ran
all the benches *AND* disassembled each one to confirm my understanding
of what was going on on *5* compilers. If you understood this, perhaps
you would not have tried to perpetrate such a blatant attempt at a con.
I have no intention of trying to trick you. I already know that you are
far better at x86 assembler than I am ever likely to be since I stopped
writing assembler a few years back and even then it was not x86.
Of course, nobody else has chimed in (all you have to do is compile and
run his source to see he's obviously wrong/lying). Perhaps this
explains your level of expectation. I don't know why you thought you
would be able to get away with this, but I'm the wrong person to try
to pull a trick like this on.
See above. I made a mistake and posted the wrong executable.

This time I've put the correct executable up as
http://home.flash-gordon.me.uk/a.out and the objdump output at
http://home.flash-gordon.me.uk/asm.txt

If you want you can look at the PPC equivalent generated on my Gentoo
box. On that I am certain it is cheating heavily since I get a reported
execution time of 0. Those are at
http://home.flash-gordon.me.uk/ppcasm.txt and
http://home.flash-gordon.me.uk/p.out

That machine (which is the web server) I could even give you a shell
account on and you could build the code yourself. You would just have to
give me a public key and I would create an account for you if you don't
believe what I'm telling you now that I've posted links to the correct code.

Sorry for the mistake, but it really was an honest one. Whatever I may
think of some of the things you post I do respect your assembler
knowledge and know that it is greater than mine.
--
Flash Gordon
Still sigless on this computer.
Aug 21 '06 #72
Flash Gordon wrote:
we******@gmail.com wrote:
Flash Gordon wrote:
As I said elsewhere. You assume that your experience applies to
everything else. It does not.
I'm waiting for a counter example -- this doesn't count as one, BTW.

I made a simple, honest mistake. I forgot that I had built another
sample of code from here in the same directory. Since I normally don't
bother specifying an output name it writes it as a.out, thus I put up
the wrong sample.
Ok fine. But this only shows the problem with your methodology.
Well, whatever gcc is doing on my system it is within the bounds of an
"ordinary" compiler since I do not have a custom copy of gcc.
Indeed, it's exploiting your bad benchmarking methodology.
That assembly code has no significant loops in it of any kind (it does
call gets for some reason). But you aren't just missing a .o file,
that main doesn't call out anywhere. So this *is* all the code, but it
does *NOT* run the benchmark in question.

Agreed. I did the wrong code by mistake.
You didn't check it either. There was a call to gets in there, and
that didn't seem to ring an alarm bell of any kind for you.
[...] If you want you can look at the PPC equivalent generated on my Gentoo
box. On that I am certain it is cheating heavily since I get a reported
execution time of 0.
And this doesn't make you stop and think even for one second?

Here are the relevant loops in the x86 build:

mov 0xffffffec(%ebp),%eax
cmp %eax,0xfffffff0(%ebp)
sete %al
movzbl %al,%eax
xor %edx,%edx

80485cf: add %eax,%ebx
add $0x1,%edx
cmp %edx,%esi
jne 80485cf
lea 0xffffffec(%ebp),%eax
lea 0xfffffff0(%ebp),%edx
mov %eax,0x4(%esp)
mov %edx,(%esp)
call 80483ac <strcmp@plt>
test %eax,%eax
sete %al
movzbl %al,%eax
xor %edx,%edx

8048694: add %eax,%edi
add $0x1,%edx
cmp %edx,%esi
jne 8048694

These are equivalent to:

int tmp = cur1.iCUR == cur2.iCUR;
for (i = 0; i < iterations; i++)
t2 += (long) tmp;

and

int tmp = strcmp(cur1.acCUR, cur2.acCUR) == 0;
for (i = 0; i < iterations; i++)
t1 += (long) tmp;

respectively. Now let's cogitate on these code fragments for a second.
Ok, so gcc 4.x has decided that higher level hoisting is a feature
worth implementing. No doubt this is why Mikhail Teterin made the
function calls external in the first place (so that gcc couldn't pull
that kind of trick.)

In any event, we can see clearly that this benchmark doesn't actually
compare the int== speed versus a call to strcmp(). (But you can
eyeball 5 instructions versus 9 instructions plus a call to an
external function in the assembly above. If you managed to put those
into an inner loop you would get results more in line with what everyone
else has been getting.) So all you've proven is that if you change
what you are measuring, you can make the measurements whatever you
like.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 23 '06 #73

Stephen Sprunk wrote:
"Chris Torek" <no****@torek.net> wrote in message
news:ec*********@news4.newsguy.com...
>CBFalconer wrote:
Lets cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.
In article <11**********************@i3g2000cwc.googlegroups.com>
<we******@gmail.com> wrote:
>A modern CPU will perform these operations substantially in
*PARALLEL*.
Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)

True, most novices probably would have used && instead of &; heck,
Mikhail seems fairly bright but he managed to turn the &s I wrote into
&&s. I used & intentionally to prevent short-circuit evaluation, which
would actually be harmful in this scenario.

Most modern CPUs will do speculative loads for the six chars, but doing
the tests in series, with branches in between, will still kill
performance due to lower IPC and horrible branch misprediction penalties
(up to 20 cycles per branch on some CPUs). Not that this is strictly
on-topic anyway, but IMHO discussing how to make portable code as fast
as unportable code should be.
Undoubtedly true for some processors, but not for all.
My benchmarks showed the && version running slightly
faster than the & version in all cases (by 10%-20%
or so, depending on the case). FWIW.

Aug 24 '06 #74
On Fri, 18 Aug 2006 00:24:09 GMT, Keith Thompson <ks***@mib.org>
wrote:
Mikhail Teterin <us****@aldan.algebra.com> writes:
| So, comparing, say, 4-char arrays (like currency codes) can NOT be done in
| the following way?
|
| typedef union {
| char acCUR[4];
| int32_t iCUR;
| } xCUR;
<snip using iCUR>
In this case, it seems to me that there are solutions better than
either using strcmp() or pretending that a char[4] is an int32_t.
You can probably achieve this by storing and comparing the currency
codes *as integers*. One simple way to do this is to compute the
numeric codes arithmetically from the string values. I think you said
that all the currency codes are two characters; if so, it's as simple
He said they are 3 characters plus null, hence the int32=4*char test
would -- if not for that danged UB -- always give the same equality
result as strcmp, i.e. there will never be any trailing possibly
garbage byte(s) ignored by strcmp but included in int32.

Unsurprisingly, as the applicable official standard for interchange,
ISO 4217, defines codes of three characters = 2 characters country
code (mostly ISO 3166) + one letter allowing for multiple (i.e.
replacement) currencies, and IME is often used internally as well.
More relevant to the discussion here, ISO 4217 also provides numeric
codes which will fit easily in a (minimal standard) 16-bit short,
which are equally official but IME not as widely used.
as:

numeric_code = (string_code[0] << CHAR_BIT) + string_code[1];
<snip>
This avoids the need for strcmp() *and* it doesn't depend on type
punning.
But I suspect it isn't very likely to be more efficient than strcmp(),
which was the stated goal. (Setting aside the usual discussions,
already had, about whether/when/how to optimize.)

- David.Thompson1 at worldnet.att.net
Aug 28 '06 #75
