>> Chris Torek wrote:
> ... the same C Standard says:
> ... What constitutes an access to an object that has
> volatile-qualified type is implementation-defined.
> which leaves the implementor a truck-sized loophole: he can simply
> define away all but one of the actual memory references, leaving
> only one of them as an "access".
Eric Sosman <Er*********@sun.com> wrote in message
news:<41**************@sun.com>...
> Hard to avoid such gaps, I think.
[example snipped, but it includes a large structure assignment]
> Most implementations, I think, would be forced to define the
> single C-level "access" to `s1' in terms of multiple hardware-level
> "accesses."
Indeed.
In article <news:2d************************@posting.google.com>
j0mbolar <j0******@engineer.com> asked:
> this begs the question, what in the world is multiple hardware-level
> accesses? and how does this affect something volatile?
The *intent* of the C Standard is clear: the hardware has some
set(s) of instruction(s) that perform hardware-level access, and
there is some mapping from "hardware access" to "C code". That
mapping is allowed to be optimized as much as possible *except*
in the presence of "volatile" qualifiers, where the mapping should
be as direct as possible.
Suppose we have a conventional load/store architecture, for instance,
in which there are only "two kinds" of "hardware access": the "load"
and the "store". In assembly these are achieved via "ld" and "st"
instructions. Only one "bus width" is supported (the 32-bit-word),
so that:
    ld  r1,(r2)
    st  r1,(r3)
"means": "do a 32-bit bus access to the address given by r2, putting
the value retrieved into r1; then do a 32-bit bus access to the
address given by r3, storing the value now in r1".
Hardware *devices* may then respond in particular (and peculiar)
ways to these two hardware-level bus transactions.
Since the C compiler for this particular machine has 32-bit "int"s,
we can do the same in C with:
    int r1;
    volatile int *r2, *r3;
    r1 = *r2;
    *r3 = r1;
and "expect" the C compiler to generate the "obvious" code (although
the register numbers might change in the process). The C Standard
gives us (C programmers) "volatile" to do it, but does not promise
us that the compiler will accede to our wishes; it is up to us to
obtain a C compiler that actually does so.
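To make the point a bit more concrete, here is a minimal sketch of a case where the qualifier matters. The helper name copy_words is mine (hypothetical, not from any real driver), and I am assuming both pointers name fixed device addresses, such as a pair of FIFO registers. Without "volatile", a compiler would be entitled to collapse the loop into a single load and a single store; with it, we are asking for one load and one store per iteration.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy n words from one fixed device address
 * to another.  Both addresses stay the same on every iteration, so
 * only the volatile qualifier keeps the repeated accesses alive.
 */
size_t copy_words(volatile int *src, volatile int *dst, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int r1 = *src;  /* we want one "ld" here per iteration */
        *dst = r1;      /* and one "st" here per iteration */
    }
    return n;
}
```

Whether the generated code really contains n loads and n stores is something to check in the compiler's assembly output, not something the Standard promises.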
What happens, though, if we have a 16-bit or 8-bit hardware device
and have to connect it to this machine? The *machine* is PHYSICALLY
INCAPABLE of doing anything other than a 32-bit-wide access. How
can we take an AMD "Lance" Ethernet device, with its two 16-bit
registers, and make it work with this (MIPS-R2000-like) CPU?
The answer in this case was to put the 16-bit registers on 32-bit
boundaries:
    struct lance_registers {
        uint16_t pad1;
        uint16_t rap;   /* Register Address Port */
        uint16_t pad2;
        uint16_t rdp;   /* Register Data Port */
    };  /* (I might have the address and data ports backwards) */
This, however, is *not* how it is done on a conventional 80x86-like
CPU, which *does* have bus transactions of several different sizes.
Here the compiler should use 16-bit bus accesses for 16-bit integers,
and 8-bit bus accesses for 8-bit integers, and the two "pad"s go
away in the structure.
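A sketch of the two layouts side by side, with the offsets checked at compile time, might look like the following. The two structure names are made up for this illustration, and I am assuming a C11 compiler (for _Static_assert) on which a structure of uint16_t members gets no hidden padding, which is the usual case but not something the Standard guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout for the 32-bit-only bus: one 16-bit register per 32-bit word. */
struct lance_registers_32bit {
    uint16_t pad1;
    uint16_t rap;   /* Register Address Port */
    uint16_t pad2;
    uint16_t rdp;   /* Register Data Port */
};

/* Layout for a bus with real 16-bit transactions: the pads go away. */
struct lance_registers_16bit {
    uint16_t rap;
    uint16_t rdp;
};

/* Each register of the padded layout lives in its own 32-bit word. */
_Static_assert(offsetof(struct lance_registers_32bit, rap) / 4 !=
               offsetof(struct lance_registers_32bit, rdp) / 4,
               "rap and rdp must not share a 32-bit word");
_Static_assert(sizeof(struct lance_registers_32bit) == 8,
               "padded layout should span two 32-bit words");
_Static_assert(sizeof(struct lance_registers_16bit) == 4,
               "unpadded layout should span two 16-bit registers");
```

The checks only verify what the compiler laid out; which half of each word the device actually decodes is a property of the board, not of the C code.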
Moreover, the 80x86 has what are called "read-modify-write" bus
cycles, as did the PDP-11 and VAX. Some PDP-11 Unibus hardware
devices *required* certain operations to use these r/m/w cycles
to obtain predictable results. To get such a bus cycle, an assembler
programmer might use the "bis" or "bic" instructions on the VAX:
    bisw2   r1,(r6)
This instruction reads from the (presumably Unibus) location given
by r6, sets the bits given by r1, and writes the result back, all
within a single bus operation using the "r/m/w" cycle. The C programmer
familiar with all this would write the code as:
    *r6 |= r1;
and "expect" to get the same bisw2 instruction (provided r6 has
type "volatile unsigned short *" or similar). Writing:
    *r6 = *r6 | r1;
would instead produce an assembler sequence like:
    movzwl  (r6),r0     # or perhaps just movw
    bisl2   r1,r0       # in which case this would be a bisw2
    movw    r0,(r6)
Again, while "volatile" is *necessary* to tell the compiler "please
do not attempt to optimize this", it is not *sufficient* -- the
compiler must actually generate different code for the "|=" operation.
A similar compiler on a load/store architecture *cannot* generate
a single instruction for this, though, because there IS NO SUCH
SINGLE INSTRUCTION (and there are no r/m/w bus cycles).
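Wrapped up as functions, the two spellings look like this. The names set_bits and set_bits_split are hypothetical, and uint16_t stands in for the "unsigned short" of the VAX example. As far as the C abstract machine is concerned the two functions compute the same value; the only difference is which bus cycles a given compiler chooses to emit, and that has to be checked in the generated assembly.

```c
#include <stdint.h>

/*
 * Compound form: on a VAX-like target we would hope for a single
 * bisw2, i.e. one read-modify-write bus cycle.
 */
void set_bits(volatile uint16_t *reg, uint16_t mask)
{
    *reg |= mask;
}

/*
 * Spelled-out form: one read access, then one separate write access.
 * On a load/store machine both functions end up looking like this.
 */
void set_bits_split(volatile uint16_t *reg, uint16_t mask)
{
    uint16_t tmp = *reg;

    tmp |= mask;
    *reg = tmp;
}
```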
The answers to j0mbolar's questions, then, are: "access" is really
defined by the hardware, and as C programmers, we have to know not
only what the hardware does, but also whether we can convince our
C compilers to generate the necessary code. When C's types and
operations "map nicely" onto the hardware, we can expect, and should
really demand, that our C compilers do the "obvious thing".
What about the cases where C's types and operations do not fit well
with the hardware-level operations? Consider the V8 SPARC's "ldstub"
(load/store unsigned byte) instruction, or V9's compare and swap;
the 80x86 compare-and-exchange instructions; and the MIPS and PowerPC
style "load linked / store conditional" pairs. The ldstub
instruction is defined as an atomic bus cycle that:
- reads a byte from memory
- stores 0xff into memory
and gives you the original byte in the register. If two devices
or processors attempt this at the "same time", and the byte is
originally not 0xff, one of them will "see" the original byte and
the other will see the 0xff. The compare and swap (aka compare
and exchange) instructions, which are more powerful, take two
registers and a memory location and atomically:
- compare the first register with the memory value
- if they are equal, change the memory value to the second register,
but if they are not equal, leave the memory value alone
- leave the result of the comparison or the original memory value
(or both) in one of the registers and/or in some condition codes
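Going back to ldstub for a moment, its effect can be pinned down with a plain C model. This is only a model: the function names are made up, the two accesses below are *not* atomic, and a real lock would need the actual instruction. It does show why the instruction is enough to build a simple lock, though: whoever gets back something other than 0xff has "won".

```c
/*
 * Model of ldstub semantics in ordinary C.  NOT atomic: the real
 * instruction does the load and the store as one bus operation.
 */
unsigned char ldstub_model(volatile unsigned char *mem)
{
    unsigned char old = *mem;   /* read a byte from memory */

    *mem = 0xff;                /* store 0xff into memory */
    return old;                 /* hand back the original byte */
}

/* Returns 1 if the caller took the lock, 0 if it was already held. */
int try_lock_model(volatile unsigned char *lock)
{
    return ldstub_model(lock) != 0xff;
}
```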
The ll/sc sequence, which is perhaps the most powerful of all,
loads a value from memory into a register, and then later stores
a new value (as given by a register) into that memory location but
only if no one else has changed it yet. (This is done through the
cache protocols -- the CPU cache uses MESI or MOESI to cooperate
with other devices, and is alerted if the value gets changed between
the two separate instructions. While CAS can be used to implement
atomic adds and mutexes, LL+SC can be used to implement atomic
queues.)
The closest one can come to writing CAS in C, for instance, is:
    tmp = *mem;
    if (tmp == r1)
        *mem = r2;
    r1 = tmp;
but the hardware does all of this in a single bus cycle. There is no C operator
that compresses this down to one operation. The LL/SC sequence
actually takes multiple bus cycles and cannot be expressed at all
in C.
Today, the usual tack for handling the "cannot be written in C at
all" instructions is to use assembly code -- either a C-callable
subroutine, or inline expansion.
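One form of "inline expansion" lets the compiler supply the instruction itself. As a hedged sketch, assuming a C11 compiler that provides <stdatomic.h> (the compilers discussed above did not), an atomic add built from a compare-and-swap retry loop could be written as below. The helper name atomic_add_via_cas is hypothetical; atomic_load and atomic_compare_exchange_weak are the standard C11 generic functions. These are library-style interfaces, not new C operators, and how they map onto CAS or LL/SC is still up to the implementation.

```c
#include <stdatomic.h>

/*
 * Hypothetical helper: atomically add delta to *mem using a CAS
 * retry loop, returning the value that was there before the add.
 */
int atomic_add_via_cas(atomic_int *mem, int delta)
{
    int old = atomic_load(mem);

    /*
     * If *mem no longer equals old, the exchange fails and old is
     * updated to the current value, so we simply try again.  The
     * "weak" form may also fail spuriously, which suits LL/SC
     * machines where the store-conditional can fail for no
     * visible reason.
     */
    while (!atomic_compare_exchange_weak(mem, &old, old + delta))
        continue;
    return old;
}
```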
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it
http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.