union {unsigned char u[10]; ...}

Yevgen Muntyan

Hey,

Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];

I tried to find it in the standard, but I only found that
value of u.u here is unspecified. Standard implies that
once u.u above is legal, then u.u[0] will be bits of first byte
u.a and so on, so here we can treat u.u in the same way as
if we did

int a = 8;
unsigned char u[sizeof a];
memcpy(u, &a, sizeof a);

But, I can't find the place which says u.u in the first
example is indeed legal and u.u value is the same as
the value of the union (which then is bytes from u.a
value, and so on).
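
The comparison being asked about can be sketched as follows. This is only an illustration, assuming the reading (spelled out in a footnote in later revisions of the standard) that the bytes of u.u are the object representation of u.a, and assuming int has no padding bits. The union is resized to sizeof(int) and the helper name union_bytes_match_memcpy is made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical union for illustration: the array is sized to cover the int. */
union U { unsigned char u[sizeof(int)]; int a; };

/* Returns 1 if reading u.u after storing u.a yields the same bytes
   as copying the int into a separate array with memcpy. */
int union_bytes_match_memcpy(int value)
{
    union U u;
    unsigned char copy[sizeof(int)];

    assert(sizeof u.u == sizeof u.a);   /* both views cover the same bytes */

    u.a = value;                        /* store through the int member */
    memcpy(copy, &value, sizeof value); /* byte copy of the same value */

    return memcmp(u.u, copy, sizeof copy) == 0;
}
```

Calling union_bytes_match_memcpy(1) from main is the direct analogue of the two snippets in the question.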

Regards,
Yevgen

Mar 13 '07 #1

Ben Pfaff

Yevgen Muntyan <mu****************@tamu.edu> writes:

Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];

See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
--
"Some programming practices beg for errors;
this one is like calling an 800 number
and having errors delivered to your door."
--Steve McConnell

Mar 13 '07 #2

Yevgen Muntyan

Ben Pfaff wrote:

Yevgen Muntyan <mu****************@tamu.edu> writes:

>Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];

See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

But character type is not a union. Moreover, it actually says

- an aggregate or union type that includes one of the aforementioned
types among its members (including, recursively, a member of a
subaggregate or contained union), or
— a character type.

i.e. character type is not in the list of types mentioned in
the above paragraph, about unions. Do I miss something here?
("aforementioned" means "listed above", right?)

Yevgen

Mar 13 '07 #3

Yevgen Muntyan

Yevgen Muntyan wrote:

Ben Pfaff wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:

>>Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];

See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

But character type is not a union. Moreover, it actually says

- an aggregate or union type that includes one of the aforementioned
types among its members (including, recursively, a member of a
subaggregate or contained union), or
— a character type.

i.e. character type is not in the list of types mentioned in
the above paragraph, about unions. Do I miss something here?
("aforementioned" means "listed above", right?)

Moreover, this paragraph is actually irrelevant. It says
you can do something like

int func (int *i);
union U {int a; double b;};
union U u;
u.a = 2;
func ((int*)&u);

but it doesn't let you do

union U u;
u.b = 2;
func ((int*)&u);

Same thing for character type, even if it was in the list:

you can have character array in the union, and *if* you
set this member value to representation of some double,
then you can pass the union around as it was double. But
it's not clear at all if you can set double member, and
then use the union as if you set character array member.

Yevgen

Mar 13 '07 #4

Ben Pfaff

Yevgen Muntyan <mu****************@tamu.edu> writes:

Ben Pfaff wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:

>>Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];

See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

But character type is not a union.

You're accessing an object of type "int" through an object of
character type. The fact that the "int" is inside a union is
immaterial.

Moreover, it actually says

- an aggregate or union type that includes one of the aforementioned
types among its members (including, recursively, a member of a
subaggregate or contained union), or

This gives permission for a different class of accesses, one that
in this case we're not interested in.
--
char a[]="\n .CJacehknorstu";int putchar(int);int main(void){unsigned long b[]
={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x11f6},*p
=b,i=24;for(;p+=!*p;*p/=4)switch(0[p]&3)case 0:{return 0;for(p--;i--;i--)case+
2:{i++;if(i)break;else default:continue;if(0)case 1:putchar(a[i&15]);break;}}}

Mar 13 '07 #5

Keith Thompson

Yevgen Muntyan <mu****************@tamu.edu> writes:

Ben Pfaff wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:

>>Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":
An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

But character type is not a union.

[snip]

u.a is of type int. u.u[0] is of type unsigned char, a character type. The
code above accesses the stored value of the object u.a using an lvalue
expression, u.u[0], which is of character type, which satisfies 6.5.

Intuitively, the rules for unions imply that u.u[0] actually does
access the first byte of u.a. Rigorously proving this from the
wording of the standard may be trickier, and it's a larger task than
I'm willing to undertake at the moment.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Mar 13 '07 #6

Yevgen Muntyan

Keith Thompson wrote:

Yevgen Muntyan <mu****************@tamu.edu> writes:
>Ben Pfaff wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:

Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":
An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
But character type is not a union.
[snip]

u.a is of type int. u.u[0] is of type char, a character type. The
code above accesses the stored value of the object u.a using an lvalue
expression, u.u[0], which is of character type, which satisfies 6.5.

I am not convinced. Consider this:

int a;
int b;
int *p = &b;
*(p - 1);

It accesses the value of a using an lvalue of type int. The problem is
of course that *(p-1) is illegal. Same thing with that union: why
is the u.u access allowed, and why is the value of u.u the same as if you
had actually set it, using u.u[0] = 8?

Intuitively, the rules for unions imply that u.u[0] actually does
access the first byte of u.a. Rigorously proving this from the
wording of the standard may be trickier, and it's a larger task than
I'm willing to undertake at the moment.

No, it's easy once you know that bytes of u.u value are the same as
bytes of value of u.

Yevgen

Mar 14 '07 #7

Yevgen Muntyan

Ben Pfaff wrote:

Yevgen Muntyan <mu****************@tamu.edu> writes:

>Ben Pfaff wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:

Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
But character type is not a union.

You're accessing an object of type "int" through an object of
character type.

Only when I have this object of character type. The very question
is why "u.u" yields an object of character type after u.a = 1;
(i.e. why it's not UB or constraint violation or something), and
if u.u is indeed allowed then why bytes in u.u value will be the
same as in u.a (first bytes, of course, ignoring sizes and padding
and whatnot).

I guess my question is actually this:

union U {int a; float b;};
u.a = something;

Is 'u.b' allowed here given that the bit representation of
u.a is a bit representation of a float object, and is u.b
value the same as if we did

int a = something;
float b;
memcpy (&b, &a, 4);

assuming 4 bytes int and float.

Best regards,
Yevgen

Mar 14 '07 #8

Ben Pfaff

Yevgen Muntyan <mu****************@tamu.edu> writes:

Consider this:

int a;
int b;
int *p = &b;
*(p - 1);

It accesses value of a using an lvalue of type int.

No, it doesn't. It yields undefined behavior. And real
compilers are likely to put "a" into a register here, defeating
this idea in practice.
--
"I've been on the wagon now for more than a decade. Not a single goto
in all that time. I just don't need them any more. I don't even use
break or continue now, except on social occasions of course. And I
don't get carried away." --Richard Heathfield

Mar 14 '07 #9

Ben Pfaff

Yevgen Muntyan <mu****************@tamu.edu> writes:

Ben Pfaff wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:

>>Ben Pfaff wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:

Why is it legal to do
>
union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
But character type is not a union.

You're accessing an object of type "int" through an object of
character type.

Only when I have this object of character type. The very question
is why "u.u" yields an object of character type after u.a = 1;
(i.e. why it's not UB or constraint violation or something), and
if u.u is indeed allowed then why bytes in u.u value will be the
same as in u.a (first bytes, of course, ignoring sizes and padding
and whatnot).

I don't really understand this question. The standard has
wording that says that u.a and u.u are at the same address, and
it has wording that says that any object may be accessed through
an lvalue of character type[*]. Put the two together, and it's
allowed.
[*] It's best to use an unsigned character type: signed character
types can have trap representations.
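
As a rough sketch of the "any object may be accessed through an lvalue of character type" point: the function below copies an int one byte at a time through unsigned char lvalues. The name copy_int_bytewise is invented for this example and is not from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Copies an int byte by byte through unsigned char lvalues. */
int copy_int_bytewise(int src)
{
    int dst;
    const unsigned char *from = (const unsigned char *)&src;
    unsigned char *to = (unsigned char *)&dst;
    size_t i;

    for (i = 0; i < sizeof src; i++)
        to[i] = from[i];    /* each access uses a character-type lvalue */

    assert(dst == src);     /* same object representation, same value */
    return dst;
}
```

unsigned char is used rather than plain or signed char for the reason given in the footnote above.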

I guess my question is actually this:

union U {int a; float b;};
u.a = something;

Is 'u.b' allowed here given that the bit representation of
u.a is a bit representation of a float object,

No. There's a special dispensation in C99 6.5 (which we've
discussed) which allows accessing any object as an array of
characters. There's no such dispensation for aliasing an int and
a float.

and is u.b value the same as if we did

int a = something;
float b;
memcpy (&b, &a, 4);

assuming 4 bytes int and float.

No, that's a different situation: memcpy accesses objects as
arrays of characters. Thus, you can use it to do this sort of
thing and then access "b" as a float, given some additional
provisos (e.g. the bits in "a" are not a trap representation when
interpreted as float, "float" is 4 bytes long, "int" is at least
4 bytes long, ...)
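
A hedged sketch of the memcpy route with the provisos made explicit. It swaps the thread's int for uint32_t from <stdint.h> so the source is exactly 32 bits; that substitution and the name float_from_bits are assumptions of this example. The size proviso is checked at run time; the trap-representation proviso is not something portable code can check.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterprets 32 bits as a float by copying bytes; no union, no pointer cast. */
float float_from_bits(uint32_t bits)
{
    float f;

    assert(sizeof f == sizeof bits);  /* proviso: float is 4 bytes here */
    memcpy(&f, &bits, sizeof f);      /* bytes are accessed as characters */
    return f;
}
```

On an implementation using IEEE 754 single precision with the same byte order for integers and floats, float_from_bits(0x3f800000u) gives 1.0f.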
--
int main(void){char p[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.\
\n",*q="kl BIcNBFr.NKEzjwCIxNJC";int i=sizeof p/2;char *strchr();int putchar(\
);while(*q){i+=strchr(p,*q++)-p;if(i>=(int)sizeof p)i-=sizeof p-1;putchar(p[i]\
);}return 0;}

Mar 14 '07 #10

Keith Thompson

Yevgen Muntyan <mu****************@tamu.edu> writes:

Keith Thompson wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:
>>Ben Pfaff wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:

Why is it legal to do
>
union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":
An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
But character type is not a union.
[snip]

u.a is of type int. u.u[0] is of type char, a character type. The
code above accesses the stored value of the object u.a using an lvalue
expression, u.u[0], which is of character type, which satisfies 6.5.

I am not convinced. Consider this:

int a;
int b;
int *p = &b;
*(p - 1);

It accesses value of a using an lvalue of type int. The problem is
of course that *(p-1) is illegal.

Another problem is that it's not necessarily accessing the value of a.

However, the standard does explicitly allow objects to be adjacent (it
has to do so to make pointer equality work consistently). In the
absence of any knowledge of how a and b are allocated in memory,
evaluating *(p - 1) invokes undefined behavior. If you happen to know
that they're adjacent, then *(p - 1) does access the value of a, and
it's legitimate (though quite silly). For example:

#include <stdio.h>
int main(void)
{
int a = 42;
int b;
int *p = &b;

if (&a + 1 == &b) {
printf("Accessing a strangely: %d\n", *(p - 1));
}
else {
printf("Accessing a normally: %d\n", a);
}
return 0;
}

This program's output will be either:
Accessing a strangely: 42
or
Accessing a normally: 42
(On one implementation I tried, I got the "normally" message; swapping
the declarations of a and b got me the "strangely" message.)

If I change the two messages so they're identical, I think the program
is actually strictly conforming; the path it follows depends on
implementation-specific behavior, but the output doesn't.

I'm not quite sure what this has to do with the question about unions,
though.

Same thing with that union: why
is u.u access allowed, and why is value of u.u is the same as if you
actually set it, using u.u[0] = 8?

I'm afraid I don't understand what you're getting at here. u.u[0]
accesses the first byte of u.a; why would it not do so?

>Intuitively, the rules for unions imply that u.u[0] actually does
access the first byte of u.a. Rigorously proving this from the
wording of the standard may be trickier, and it's a larger task than
I'm willing to undertake at the moment.

No, it's easy once you know that bytes of u.u value are the same as
bytes of value of u.

So what's the problem?

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Mar 14 '07 #11

Yevgen Muntyan

Ben Pfaff wrote:

Yevgen Muntyan <mu****************@tamu.edu> writes:

>Ben Pfaff wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:

Ben Pfaff wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
>
>Why is it legal to do
>>
>union U {unsigned char u[8]; int a;};
>union U u;
>u.a = 1;
>u.u[0];
See C99 section 6.5 "Expressions":
>
An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
But character type is not a union.
You're accessing an object of type "int" through an object of
character type.
Only when I have this object of character type. The very question
is why "u.u" yields an object of character type after u.a = 1;
(i.e. why it's not UB or constraint violation or something), and
if u.u is indeed allowed then why bytes in u.u value will be the
same as in u.a (first bytes, of course, ignoring sizes and padding
and whatnot).

I don't really understand this question. The standard has
wording that says that u.a and u.u are at the same address,

Right here, where does the standard say u.u is an object
after you assigned u.a? I am not trying to look stupid or
pedantic, I really want to understand why you can access
member of a union other than what was assigned.
Take

union U {int a; double b;}; union U u;

u.a has the same address as u.b, but you can't access
u.b after you assigned u.a (or you can but its value
is unspecified).

and
it has wording that says that any object may be accessed through
an lvalue of character type[*].

Again, this lvalue must be well-defined. In my example
with stupid pointer arithmetic we do have an lvalue of
int type, but we can't use it just because we can't
use it, not because it has wrong type.

So, once u.u is good (has the same first bytes as u, etc.)
you can use u.u to access u.a. Why is u.u "good"?

No. There's a special dispensation in C99 6.5 (which we've
discussed) which allows accessing any object as an array of
characters.

No, it's not 6.5, it's 6.2.6 that describes what happens if
you use character arrays (in particular it says you can
use unsigned char arrays safely).
The piece of 6.5 you quoted only restricts types you can use
to access object value, it does not allow yet the very access
using those types. The obvious example is accessing unsigned
int object using int type - it may overflow.

There's no such dispensation for aliasing an int and
a float.

But double u.b isn't aliasing int u.a here. It's an expression which is
either well-defined or not. Standard says value of u.b is unspecified
here; and the paragraph about aliasing does not sound to me
as the text which explicitly makes character array members of
unions special.

Best regards,
Yevgen

Mar 14 '07 #12

Old Wolf

On Mar 14, 12:01 pm, Ben Pfaff <b...@cs.stanford.edu> wrote:

Yevgen Muntyan <muntyan.removet...@tamu.edu> writes:
Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];

See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

I wonder if you can clarify something for me. The above clause does
not say that accessing a stored value via an lvalue expression of
character type is always legal. It merely says that accessing it
via other types is not legal.

Now, in N869, 6.5.2.3#5 says quite clearly that u.u can only be
accessed if it were the last member to be set, so the above code
would be UB. However the first sentence of 6.5.2.3#5 was removed
in N1124, making the above code legal again.

What did the actual C99 text say, and what did C90 have to say on the
matter?

Mar 14 '07 #13

Yevgen Muntyan

Keith Thompson wrote:

Yevgen Muntyan <mu****************@tamu.edu> writes:
>Keith Thompson wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:
Ben Pfaff wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
>
>Why is it legal to do
>>
>union U {unsigned char u[8]; int a;};
>union U u;
>u.a = 1;
>u.u[0];
See C99 section 6.5 "Expressions":
An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
But character type is not a union.
[snip]

u.a is of type int. u.u[0] is of type char, a character type. The
code above accesses the stored value of the object u.a using an lvalue
expression, u.u[0], which is of character type, which satisfies 6.5.
I am not convinced. Consider this:

int a;
int b;
int *p = &b;
*(p - 1);

It accesses value of a using an lvalue of type int. The problem is
of course that *(p-1) is illegal.

Another problem is that it's not necessarily accessing the value of a.

Well, UB here is totally enough for me, regardless of what exactly
implementation will do :)

However, the standard does explicitly allow objects to be adjacent (it
has to do so to make pointer equality work consistently). In the
absence of any knowledge of how a and b are allocated in memory,
evaluating *(p - 1) invokes undefined behavior. If you happen to know
that they're adjacent, then *(p - 1) does access the value of a, and
it's legitimate (though quite silly).

It's legitimate? Pointer arithmetic is allowed only on arrays (not sure
what correct term is, it's not those int a[2]; arrays), isn't it? I
mean, it's UB even if a and b happen to be adjacent (which itself
isn't a standard term, since standard doesn't know what it means for
objects which are not members of some aggregate, in which case we
can talk about sequences of bytes).

....

I'm not quite sure what this has to do with the question about unions,
though.

It was an example of situation where types are fine as to 6.5p7
but the expression was illegal nevertheless.

> Same thing with that union: why
is u.u access allowed, and why is value of u.u is the same as if you
actually set it, using u.u[0] = 8?

I'm afraid I don't understand what you're getting at here. u.u[0]
accesses the first byte of u.a; why would it not do so?

Because it's similar to saying

union U {int a; double b;};
union U u;
int a = 1;
u.a = a;
memcpy (someplace, &u.b, 1);

is allowed and copies first byte of a. But we can't use u.b
here, or can we?

Yevgen

Mar 14 '07 #14

Yevgen Muntyan

Old Wolf wrote:

On Mar 14, 12:01 pm, Ben Pfaff <b...@cs.stanford.edu> wrote:
>Yevgen Muntyan <muntyan.removet...@tamu.edu> writes:
>>Why is it legal to do
union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

I wonder if you can clarify something for me. The above clause does
not say that accessing a stored value via an lvalue expression of
character type is always legal. It merely says that accessing it
via other types is not legal.

Now, in N869, 6.5.2.3#5 says quite clearly that u.u can only be
accessed if it were the last member to be set, so the above code
would be UB. However the first sentence of 6.5.2.3#5 was removed
in N1124, making the above code legal again.

I could only find 6.2.6.1p6 in N1124 about this business. If there are
no other restrictions, it means we can freely access any member
of union, given that we are careful with trap representations
[1]. Maybe it's in fact right, and C99 relaxed this requirement on
unions (perhaps because no implementation was stupid enough to enforce
it)? Then my question has an obvious answer.

Anyway, are you saying it's UB in C90?

[1] In particular it means that

union U {unsigned char a; unsigned char b[2];};
union U u;
unsigned char foo;
u.a = 1;
foo = u.b[2];

is not UB, while

unsigned char foo;
unsigned char b;
foo = b;

is. It's kind of strange.

Thanks,
Yevgen

Mar 14 '07 #15

Yevgen Muntyan

Yevgen Muntyan wrote:
....

foo = u.b[2];

It should be u.b[1] of course.

Mar 14 '07 #16

Yevgen Muntyan

Yevgen Muntyan wrote:

Yevgen Muntyan wrote:
>Ben Pfaff wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:

Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];

See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

But character type is not a union. Moreover, it actually says

- an aggregate or union type that includes one of the aforementioned
types among its members (including, recursively, a member of a
subaggregate or contained union), or
— a character type.

i.e. character type is not in the list of types mentioned in
the above paragraph, about unions. Do I miss something here?
("aforementioned" means "listed above", right?)

Moreover, this paragraph is actually irrelevant.

Rationale in fact confirms this, it looks like the aliasing rules
are indeed aliasing rules, they are not "what you can put into a
union"; the union mechanics is described where unions are described.
The unions mentioned there are this:

union U {int a;};
int *b;
union U *u;

here b can be used to access u->a and vice versa. It's *not* saying
that you can do this:

union U {int a; unsigned b;};
union U u;
unsigned c;
u.a = 1;
c = u.b;

If you can do it (which isn't clear), you can do it because of how
unions work, and how integer types work, not because of aliasing rules.
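
A small sketch of that int/unsigned case, for what it's worth. It leans on two things: the "how integer types work" part (C99 6.2.5p9 says a nonnegative value has the same representation in a signed type and its corresponding unsigned type), and the assumption that reading a member other than the one last stored reinterprets the stored bytes, which is exactly the "how unions work" part under debate. The names IU and read_as_unsigned are made up here.

```c
#include <assert.h>

/* Hypothetical union pairing int with its corresponding unsigned type. */
union IU { int a; unsigned b; };

/* Stores a nonnegative int and reads it back through the unsigned member. */
unsigned read_as_unsigned(int nonneg)
{
    union IU u;

    assert(nonneg >= 0);  /* only nonnegative values share a representation */
    u.a = nonneg;
    return u.b;           /* same bytes, viewed as unsigned */
}
```

For negative values the result would depend on the representation of signed integers, so the sketch refuses them.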

Yevgen

Mar 14 '07 #17

Keith Thompson

Yevgen Muntyan <mu****************@tamu.edu> writes:

Keith Thompson wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:
>>Keith Thompson wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
Ben Pfaff wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:
>>
>>Why is it legal to do
>>>
>>union U {unsigned char u[8]; int a;};
>>union U u;
>>u.a = 1;
>>u.u[0];
>See C99 section 6.5 "Expressions":
> An object shall have its stored value accessed only by an
> lvalue expression that has one of the following types:73)
>[...]
> - a character type.
But character type is not a union.
[snip]

u.a is of type int. u.u[0] is of type char, a character type. The
code above accesses the stored value of the object u.a using an lvalue
expression, u.u[0], which is of character type, which satisfies 6.5.
I am not convinced. Consider this:

int a;
int b;
int *p = &b;
*(p - 1);

It accesses value of a using an lvalue of type int. The problem is
of course that *(p-1) is illegal.
Another problem is that it's not necessarily accessing the value of
a.

Well, UB here is totally enough for me, regardless of what exactly
implementation will do :)

>However, the standard does explicitly allow objects to be adjacent (it
has to do so to make pointer equality work consistently). In the
absence of any knowledge of how a and b are allocated in memory,
evaluating *(p - 1) invokes undefined behavior. If you happen to know
that they're adjacent, then *(p - 1) does access the value of a, and
it's legitimate (though quite silly).

It's legitimate? Pointer arithmetic is allowed only on arrays (not sure
what correct term is, it's not those int a[2]; arrays), isn't it? I
mean, it's UB even if a and b happen to be adjacent (which itself
isn't a standard term, since standard doesn't know what it means for
objects which are not members of some aggregate, in which case we
can talk about sequences of bytes).

For purposes of pointer arithmetic, any object can be treated as if it
were a single-element array. See, for example, C99 6.5.8p4:

For the purposes of these operators, a pointer to an object that
is not an element of an array behaves the same as a pointer to the
first element of an array of length one with the type of the
object as its element type.
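
To illustrate the single-element-array rule, here is a minimal sketch; as far as I can tell 6.5.6 has matching wording for the additive operators, which is what the pointer addition and subtraction below rely on. The function name one_past_distance is invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Treats a lone int as an array of length one for pointer arithmetic. */
ptrdiff_t one_past_distance(void)
{
    int a = 42;
    int *first = &a;
    int *end = &a + 1;    /* forming the one-past pointer is allowed */

    assert(first < end);  /* relational comparison within the "array" */
    return end - first;   /* 1; dereferencing end would be undefined */
}
```

Only forming, comparing and subtracting the one-past pointer is shown; nothing here reads through it.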

As for adjacency, see C99 6.5.9p6:

Two pointers compare equal if and only if both are null pointers,
both are pointers to the same object (including a pointer to an
object and a subobject at its beginning) or function, both are
pointers to one past the last element of the same array object, or
one is a pointer to one past the end of one array object and the
other is a pointer to the start of a different array object that
happens to immediately follow the first array object in the
address space.

with a footnote:

Two objects may be adjacent in memory because they are adjacent
elements of a larger array or adjacent members of a structure with
no padding between them, or because the implementation chose to
place them so, even though they are unrelated. If prior invalid
pointer operations (such as accesses outside array bounds)
produced undefined behavior, subsequent comparisons also produce
undefined behavior.

Without this special-case permission, the standard would have had to
say that, given
int a;
int b;
&a + 1 *may not* be equal to &b (or vice versa), which would require
the implementation to insert at least one byte of padding between
objects that might otherwise be adjacent.

Allowing objects to be adjacent is necessary for equality to be
defined consistently. It's for the sake of implementers' convenience;
no program should take advantage of this permission.

And it's really a very minor and obscure point; you just happened to
hit on it in your example.

...
>I'm not quite sure what this has to do with the question about unions,
though.

It was an example of situation where types are fine as to 6.5p7
but the expression was illegal nevertheless.

Be careful with the word "illegal". I think what you mean is that it
invokes undefined behavior.

>> Same thing with that union: why
is u.u access allowed, and why is value of u.u is the same as if you
actually set it, using u.u[0] = 8?
I'm afraid I don't understand what you're getting at here. u.u[0]
accesses the first byte of u.a; why would it not do so?

Because it's similar to saying

union U {int a; double b;};
union U u;
int a = 1;
u.a = a;
memcpy (someplace, &u.b, 1);

is allowed and copies first byte of a. But we can't use u.b
here, or can we?

We can't use *the value of* u.b because C99 6.7.2.1p14 says:

The value of at most one of the members can be stored in a union
object at any time.

But 6.5p7 gives special permission to access an object by an lvalue
expression of character type. As the footnote there says:

The intent of this list is to specify those circumstances in which
an object may or may not be aliased.

One way to alias an object is to make it a member of a union. (Other
ways involve various pointer tricks.)

Now I'm not sure whether you can actually prove, from the wording of
the standard, that it's permitted to store a value in one member of a
union, then access a different member, as long as the other member has
character or array-of-character type. (By "permitted", I mean not
invoking undefined behavior.) But it's a reasonably common idiom, and
I'm about 95% convinced that it's *intended* to be allowed. It's
difficult to imagine an implementation that meets the requirements of
the standard but disallows this particular kind of aliasing.

I think the question of whether the wording of the standard actually
supports this conclusion is getting into comp.std.c territory.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Mar 14 '07 #18

Old Wolf

Yevgen Muntyan wrote:

I could only find 6.2.6.1p6 in N1124 about this business. If there are
no other restrictions, it means we can freely access any member
of union, given that we are careful with trap representations

No, the aliasing rule as quoted by Ben Pfaff still applies;
you can't access a long long with a pointer to int, regardless
of whether they are in a union or not.

Anyway, are you saying it's UB in C90?

No, please re-read my post.

unsigned char foo;
unsigned char b;
foo = b;

is [UB]. It's kind of strange.

There's been some debate over whether using indeterminately
valued unsigned chars is undefined or merely unspecified. I
can't remember what the conclusion was.

Mar 14 '07 #19

Yevgen Muntyan

Keith Thompson wrote:

Yevgen Muntyan <mu****************@tamu.edu> writes:
>Keith Thompson wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:
Keith Thompson wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
>Ben Pfaff wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:
>>>
>>>Why is it legal to do
>>>>
>>>union U {unsigned char u[8]; int a;};
>>>union U u;
>>>u.a = 1;
>>>u.u[0];
>>See C99 section 6.5 "Expressions":
>> An object shall have its stored value accessed only by an
>> lvalue expression that has one of the following types:73)
>>[...]
>> - a character type.
>But character type is not a union.
[snip]
>
u.a is of type int. u.u[0] is of type char, a character type. The
code above accesses the stored value of the object u.a using an lvalue
expression, u.u[0], which is of character type, which satisfies 6.5.
I am not convinced. Consider this:

int a;
int b;
int *p = &b;
*(p - 1);

It accesses value of a using an lvalue of type int. The problem is
of course that *(p-1) is illegal.
Another problem is that it's not necessarily accessing the value of
a.
Well, UB here is totally enough for me, regardless of what exactly
implementation will do :)

>>However, the standard does explicitly allow objects to be adjacent (it
has to do so to make pointer equality work consistently). In the
absence of any knowledge of how a and b are allocated in memory,
evaluating *(p - 1) invokes undefined behavior. If you happen to know
that they're adjacent, then *(p - 1) does access the value of a, and
it's legitimate (though quite silly).
It's legitimate? Pointer arithmetic is allowed only on arrays (not sure
what correct term is, it's not those int a[2]; arrays), isn't it? I
mean, it's UB even if a and b happen to be adjacent (which itself
isn't a standard term, since standard doesn't know what it means for
objects which are not members of some aggregate, in which case we
can talk about sequences of bytes).

For purposes of pointer arithmetic, any object can be treated as if it
were a single-element array. See, for example, C99 6.5.8p4:

For the purposes of these operators, a pointer to an object that
is not an element of an array behaves the same as a pointer to the
first element of an array of length one with the type of the
object as its element type.

As for adjacency, see C99 6.5.9p6:

Two pointers compare equal if and only if both are null pointers,
both are pointers to the same object (including a pointer to an
object and a subobject at its beginning) or function, both are
pointers to one past the last element of the same array object, or
one is a pointer to one past the end of one array object and the
other is a pointer to the start of a different array object that
happens to immediately follow the first array object in the
address space.

with a footnote:

Two objects may be adjacent in memory because they are adjacent
elements of a larger array or adjacent members of a structure with
no padding between them, or because the implementation chose to
place them so, even though they are unrelated. If prior invalid
pointer operations (such as accesses outside array bounds)
produced undefined behavior, subsequent comparisons also produce
undefined behavior.

Without this special-case permission, the standard would have had to
say that, given
int a;
int b;
&a + 1 *may not* be equal to &b (or vice versa), which would require
the implementation to insert at least one byte of padding between
objects that might otherwise be adjacent.

Sorry, I don't understand if it's yes or no. I said the following:

1) int a; int *p = &a + 1; is UB.
2) In "int a; int b;" if we say "a and b are adjacent" then it has
no meaning as far as the standard is concerned. We could talk
about implementation-specific memory layout, about what addresses
actually mean, but it's not standard.

Allowing objects to be adjacent is necessary for equality to be
defined consistently.

Standard doesn't allow nor disallow independent (as in not parts
of some aggregate) objects to be adjacent, it simply doesn't
say nor care about what it means.

It's for the sake of implementers' convenience;
no program should take advantage of this permission.

And it's really a very minor and obscure point; you just happened to
hit on it in your example.

Nope, I hit an easy example of UB, I needed UB to demonstrate
that quoted paragraph from 6.5 wasn't enough for that union thing.
As for this example, an implementation could easily make
&a == &b + 1; (is it what's called direction in which stack grows?).

>...
>>I'm not quite sure what this has to do with the question about unions,
though.
It was an example of situation where types are fine as to 6.5p7
but the expression was illegal nevertheless.

Be careful with the word "illegal". I think what you mean is that it
invokes undefined behavior.

No, strictly speaking I mean non-strictly-conforming code.

>>> Same thing with that union: why
is u.u access allowed, and why is value of u.u is the same as if you
actually set it, using u.u[0] = 8?
I'm afraid I don't understand what you're getting at here. u.u[0]
accesses the first byte of u.a; why would it not do so?
Because it's similar to saying

union U {int a; double b;};
union U u;
int a = 1;
u.a = a;
memcpy (someplace, &u.b, 1);

is allowed and copies first byte of a. But we can't use u.b
here, or can we?

We can't use *the value of* u.b because C99 6.7.2.1p14 says:

The value of at most one of the members can be stored in a union
object at any time.

But 6.5p7 gives special permission to access an object by an lvalue
expression of character types.

It doesn't give special permission to access union member other
than that was previously set, at least it's absolutely not obvious
if it does.

As the footnote there says:

The intent of this list is to specify those circumstances in which
an object may or may not be aliased.

Exactly, has nothing to do with this particular thing: whether
you can freely access different union members.

One way to alias an object is to make it a member of a union. (Other
ways involve various pointer tricks.)

Now I'm not sure whether you can actually prove, from the wording of
the standard, that it's permitted to store a value in one member of a
union, then access a different member, as long as the other member has
character or array-of-character type. (By "permitted", I mean not
invoking undefined behavior.) But it's a reasonably common idiom,

It's also common idiom to do this:

union U {void **ptr; int **iptr;};
void func (void **ptr);
....
union U u;
u.iptr = &ip; /* ip is some int pointer */
func (u.ptr);

to avoid gcc warnings about strict aliasing when
you do just func((void*)&ip);. Similar thing is used to pass
character data around (when function takes unsigned char **
to store "any data" at given address). Or, struct hack -
common idiom, nobody knows if it's legal.

and
I'm about 95% convinced that it's *intended* to be allowed. It's
difficult to imagine an implementation that meets the requirements of
the standard but disallows this particular kind of aliasing.

I'd think that it's simple: either you can access union members freely
(i.e. standard permits it), or not. In latter case an implementation
could explode when you do it, similar to famous implementations
which check array bounds (none does, and struct hack works).

I think the question of whether the wording of the standard actually
supports this conclusion is getting into comp.std.c territory.

I believe rationale explains the intent of that 6.5 wording, and
the intent certainly wasn't to allow accessing character array
union members. If it's allowed, then it must be elsewhere.

Best regards,
Yevgen

Mar 14 '07 #20

Yevgen Muntyan

Old Wolf wrote:

Yevgen Muntyan wrote:
>I could only find 6.2.6.1p6 in N1124 about this business. If there are
no other restrictions, it means we can freely access any member
of union, given that we are careful with trap representations

No, the aliasing rule as quoted by Ben Pfaff still applies;
you can't access a long long with a pointer to int , regardless
of whether they are in a union or not.

Hm, do you mean something like this:

union U {int a; long long b;};

void func (int *p)
{
union U *up = (union U *) p;
long long *l = &up->b;
...
}

union U u;
u.b = 8;
func (&u.a);

It would access long long using int pointer. But then it would
only mean that given code is UB, not that you can't do

u.a = 8; u.b;

It's similar to the following:

void func (int *ip)
{
long long *lp = (long long*) ip;
*lp = 8;
}

long long a;
func ((int*) &a);

It's legal to convert long long pointer to int pointer and back
again (assuming everything is good with alignment), but this
code invokes UB according to 6.5?

Perhaps I am missing something again, I am totally confused
by this stuff.

>Anyway, are you saying it's UB in C90?

No, please re-read my post.

Sorry, I thought N869 was some C89 standard draft.

Yevgen

Mar 14 '07 #21

Jack Klein

On Wed, 14 Mar 2007 01:55:14 GMT, Yevgen Muntyan
<mu****************@tamu.edu> wrote in comp.lang.c:

Keith Thompson wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
Keith Thompson wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
Keith Thompson wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
Ben Pfaff wrote:
>Yevgen Muntyan <mu****************@tamu.edu> writes:
>>
>>Why is it legal to do
>>>
>>union U {unsigned char u[8]; int a;};
>>union U u;
>>u.a = 1;
>>u.u[0];
>See C99 section 6.5 "Expressions":
> An object shall have its stored value accessed only by an
> lvalue expression that has one of the following types:73)
>[...]
> - a character type.
But character type is not a union.
[snip]

u.a is of type int. u.u[0] is of type char, a character type. The
code above accesses the stored value of the object u.a using an lvalue
expression, u.u[0], which is of character type, which satisfies 6.5.
I am not convinced. Consider this:

int a;
int b;
int *p = &b;
*(p - 1);

It accesses value of a using an lvalue of type int. The problem is
of course that *(p-1) is illegal.
Another problem is that it's not necessarily accessing the value of
a.
Well, UB here is totally enough for me, regardless of what exactly
implementation will do :)

However, the standard does explicitly allow objects to be adjacent (it
has to do so to make pointer equality work consistently). In the
absence of any knowledge of how a and b are allocated in memory,
evaluating *(p - 1) invokes undefined behavior. If you happen to know
that they're adjacent, then *(p - 1) does access the value of a, and
it's legitimate (though quite silly).
It's legitimate? Pointer arithmetic is allowed only on arrays (not sure
what correct term is, it's not those int a[2]; arrays), isn't it? I
mean, it's UB even if a and b happen to be adjacent (which itself
isn't a standard term, since standard doesn't know what it means for
objects which are not members of some aggregate, in which case we
can talk about sequences of bytes).
For purposes of pointer arithmetic, any object can be treated as if it
were a single-element array. See, for example, C99 6.5.8p4:

For the purposes of these operators, a pointer to an object that
is not an element of an array behaves the same as a pointer to the
first element of an array of length one with the type of the
object as its element type.

As for adjacency, see C99 6.5.9p6:

Two pointers compare equal if and only if both are null pointers,
both are pointers to the same object (including a pointer to an
object and a subobject at its beginning) or function, both are
pointers to one past the last element of the same array object, or
one is a pointer to one past the end of one array object and the
other is a pointer to the start of a different array object that
happens to immediately follow the first array object in the
address space.

with a footnote:

Two objects may be adjacent in memory because they are adjacent
elements of a larger array or adjacent members of a structure with
no padding between them, or because the implementation chose to
place them so, even though they are unrelated. If prior invalid
pointer operations (such as accesses outside array bounds)
produced undefined behavior, subsequent comparisons also produce
undefined behavior.

Without this special-case permission, the standard would have had to
say that, given
int a;
int b;
&a + 1 *may not* be equal to &b (or vice versa), which would require
the implementation to insert at least one byte of padding between
objects that might otherwise be adjacent.

Sorry, I don't understand if it's yes or no. I said the following:

1) int a; int *p = &a + 1; is UB.

No, it's not. Reread C99 6.5.8p4 that Keith quoted above. You may
form a pointer to one past a single element, even if it is not part of
an array. &a + 1 is a valid pointer value to create and hold, just
not to dereference.
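
A minimal sketch of that rule, assuming a hosted C99 implementation. The function name one_past_pointer_is_usable is made up for illustration; the one-past pointer is formed, compared and subtracted, but never dereferenced:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when a one-past pointer formed from a lone int behaves like
   the end pointer of a one-element array. The pointer is only compared
   and subtracted; it is never dereferenced. */
int one_past_pointer_is_usable(void)
{
    int a = 0;
    int *end = &a + 1;        /* valid to form and to hold */
    ptrdiff_t n = end - &a;   /* distance inside the one-element "array" */

    return n == 1 && end != &a && end - 1 == &a && end > &a;
}
```

Writing *end here would be the undefined part; everything the function actually does stays inside what the quoted wording allows.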

2) In "int a; int b;" if we say "a and b are adjacent" then it has
no meaning as far as the standard is concerned. We could talk
about implementation-specific memory layout, about what addresses
actually mean, but it's not standard.

Yes it does have meaning as far as the standard is concerned, because
it mentions the possibility, but not the requirement, for objects to
be adjacent.

Allowing objects to be adjacent is necessary for equality to be
defined consistently.

Standard doesn't allow nor disallow independent (as in not parts
of some aggregate) objects to be adjacent, it simply doesn't
say nor care about what it means.

Yes it does. That is why the wording was added to the standard.
Earlier versions of the standard did not have the clause "or one is a
pointer to one past the end of one array object and the other is a
pointer to the start of a different array object that happens to
immediately follow the first array object in the address space." as
part 6.5.9p6.

It was inadvertent, but that meant that if you wrote a function like
this:

void func(void)
{
int a, b;

if (((&a + 1) == &b) || ((&b + 1) == &a))
puts("this implementation violates the standard!");
}

....many, in fact most if not all, implementations would output the
message that they violated the standards. For an implementation to
comply with the earlier wording, it must put at least one byte (and
most likely sizeof(int) bytes) of wasted space between 'a' and 'b' in
memory, or in fact between the end of any object and the start of any
other object.

It's for the sake of implementers' convenience;
no program should take advantage of this permission.

And it's really a very minor and obscure point; you just happened to
hit on it in your example.

Nope, I hit an easy example of UB, I needed UB to demonstrate
that quoted paragraph from 6.5 wasn't enough for that union thing.
As for this example, an implementation could easily make
&a == &b + 1; (is it what's called direction in which stack grows?).

...
I'm not quite sure what this has to do with the question about unions,
though.
It was an example of situation where types are fine as to 6.5p7
but the expression was illegal nevertheless.
Be careful with the word "illegal". I think what you mean is that it
invokes undefined behavior.

No, strictly speaking I mean non-strictly-conforming code.

>> Same thing with that union: why
is u.u access allowed, and why is value of u.u is the same as if you
actually set it, using u.u[0] = 8?
I'm afraid I don't understand what you're getting at here. u.u[0]
accesses the first byte of u.a; why would it not do so?
Because it's similar to saying

union U {int a; double b;};
union U u;
int a = 1;
u.a = a;
memcpy (someplace, &u.b, 1);

is allowed and copies first byte of a. But we can't use u.b
here, or can we?
We can't use *the value of* u.b because C99 6.7.2.1p14 says:

The value of at most one of the members can be stored in a union
object at any time.

But 6.5p7 gives special permission to access an object by an lvalue
expression of character types.

It doesn't give special permission to access union member other
than that was previously set, at least it's absolutely not obvious
if it does.

It does not need to give special permission for union members. An
instance of a union is an object. Each of its members is an object.
6.5p7 does not say an object other than a union or a union member. It
applies to ANY object.

There is no possible confusion unless you try to maintain that members
of a union are not objects.

As the footnote there says:

The intent of this list is to specify those circumstances in which
an object may or may not be aliased.

Exactly, has nothing to do with this particular thing: whether
you can freely access different union members.

One way to alias an object is to make it a member of a union. (Other
ways involve various pointer tricks.)

Now I'm not sure whether you can actually prove, from the wording of
the standard, that it's permitted to store a value in one member of a
union, then access a different member, as long as the other member has
character or array-of-character type. (By "permitted", I mean not
invoking undefined behavior.) But it's a reasonably common idiom,

It's also common idiom to do this:

union U {void **ptr; int **iptr;};
void func (void **ptr);
...
union U u;
u.iptr = &ip; /* ip is some int pointer */
func (u.ptr);

to avoid gcc warnings about strict aliasing when
you do just func((void*)&ip);. Similar thing is used to pass
character data around (when function takes unsigned char **
to store "any data" at given address). Or, struct hack -
common idiom, nobody knows if it's legal.

and
I'm about 95% convinced that it's *intended* to be allowed. It's
difficult to imagine an implementation that meets the requirements of
the standard but disallows this particular kind of aliasing.

I'd think that it's simple: either you can access union members freely
(i.e. standard permits it), or not. In latter case an implementation
could explode when you do it, similar to famous implementations
which check array bounds (none does, and struct hack works).

I think the question of whether the wording of the standard actually
supports this conclusion is getting into comp.std.c territory.

I believe rationale explains the intent of that 6.5 wording, and
the intent certainly wasn't to allow accessing character array
union members. If it's allowed, then it must be elsewhere.

Please explain your conclusion that such an access is undefined.

Let's start here:

---snippet one---
union { int i; } un = { INT_MAX };

unsigned char *ucp = (unsigned char *)&un.i;

unsigned char uc = ucp[0];
---end---

The above access to the first byte of the object is valid under 6.5p7,
since un.i is an object.

Now let's change the code:

---snippet two---
union { int i; unsigned char u[sizeof(int)]; } un = { INT_MAX };

unsigned char *ucp = (unsigned char *)&un.i;
unsigned char uc = ucp[0];

ucp = un.u;
uc = ucp[0];

---end---

Has undefined behavior on either the last line, or the last two lines.

6.5p7 guarantees that snippet 1 is valid. In snippet 2, un.i is still
an object, so the first access via converted pointer is still valid.
For some reason you think that the standard is ambiguous about the
second access, via the unconverted address of un.u, which results in
the same address being accessed as the same type of lvalue.
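
A minimal sketch that ties snippet two back to the memcpy version from the original question. The names int_bytes and union_bytes_match_memcpy are invented for illustration, and the comparison assumes int has no padding bits (true on common implementations, not guaranteed by the standard):

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical union mirroring snippet two: an int overlaid with its bytes. */
union int_bytes {
    int i;
    unsigned char u[sizeof(int)];
};

/* Stores value through the int member, then reads the bytes back through
   the unsigned char member and compares them with a memcpy of the same
   value. Returns 1 when both views hold identical bytes. */
int union_bytes_match_memcpy(int value)
{
    union int_bytes un;
    unsigned char copy[sizeof(int)];

    un.i = value;
    memcpy(copy, &value, sizeof value);

    return memcmp(un.u, copy, sizeof copy) == 0;
}
```

The byte values themselves depend on the implementation's byte order and representation of int; the sketch only checks that the two ways of looking at the bytes agree.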

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html

Mar 14 '07 #22

Jack Klein

On 13 Mar 2007 17:39:09 -0700, "Old Wolf" <ol*****@inspire.net.nz>
wrote in comp.lang.c:

On Mar 14, 12:01 pm, Ben Pfaff <b...@cs.stanford.edu> wrote:
Yevgen Muntyan <muntyan.removet...@tamu.edu> writes:
Why is it legal to do

union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.

I wonder if you can clarify something for me. The above clause does
not say that accessing a stored value via an lvalue expression of
character type is always legal. It merely says that accessing it
via other types is not legal.

Now, in N869, 6.5.2.3#5 says quite clearly that u.u can only be
accessed if it were the last member to be set, so the above code
would be UB. However the first sentence of 6.5.2.3#5 was removed
in N1124, making the above code legal again.

This wording was removed before the ratification of C99. The standard
wording is the same as N1124.

What did the actual C99 text say, and what did C90 have to say on the
matter?

Wording was changed in the standard specifically to make it clear that
a program can access any memory that it has a right to as an array of
unsigned char, without causing UB. In addition to the change in
6.5.2.3p5 that you mentioned, there is at least one other instance:

N869 3.18
"undefined behavior
behavior, upon use of a non portable or erroneous program construct,
of erroneous data, or of indeterminately valued objects, for which
this International Standard imposes no requirements"

C99 3.4.3:
"undefined behavior
behavior, upon use of a non portable or erroneous program construct or
of erroneous data, for which this International Standard imposes no
requirements"

Note the removal of the clause "or of indeterminately valued objects",
because it was pointed out that this would brand accessing raw
allocated memory or the byte representation of an uninitialized object
as an array of unsigned char undefined. When in fact it is not.
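
A minimal sketch of that kind of access, assuming nothing beyond standard C. The helper names same_representation and ints_have_same_bytes are made up for illustration; the second one additionally assumes int has no padding bits:

```c
#include <assert.h>
#include <stddef.h>

/* Walks two objects of any type as arrays of unsigned char and reports
   whether their object representations are identical. */
int same_representation(const void *p, const void *q, size_t n)
{
    const unsigned char *a = p;
    const unsigned char *b = q;
    size_t i;

    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}

/* Convenience wrapper: compares the bytes of two int values. */
int ints_have_same_bytes(int x, int y)
{
    return same_representation(&x, &y, sizeof x);
}
```

Every read goes through an lvalue of type unsigned char, which is the case the character-type bullet of 6.5p7 covers.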


Mar 14 '07 #23

Yevgen Muntyan

Jack Klein wrote:

On Wed, 14 Mar 2007 01:55:14 GMT, Yevgen Muntyan
<mu****************@tamu.edu> wrote in comp.lang.c:

>Keith Thompson wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:
Keith Thompson wrote:
Yevgen Muntyan <mu****************@tamu.edu> writes:
>Keith Thompson wrote:
>>Yevgen Muntyan <mu****************@tamu.edu> writes:
>>>Ben Pfaff wrote:
>>>>Yevgen Muntyan <mu****************@tamu.edu> writes:
>>>>>
>>>>>Why is it legal to do
>>>>>>
>>>>>union U {unsigned char u[8]; int a;};
>>>>>union U u;
>>>>>u.a = 1;
>>>>>u.u[0];
>>>>See C99 section 6.5 "Expressions":
>>>> An object shall have its stored value accessed only by an
>>>> lvalue expression that has one of the following types:73)
>>>>[...]
>>>> - a character type.
>>>But character type is not a union.
>>[snip]
>>>
>>u.a is of type int. u.u[0] is of type char, a character type. The
>>code above accesses the stored value of the object u.a using an lvalue
>>expression, u.u[0], which is of character type, which satisfies 6.5.
>I am not convinced. Consider this:
>>
>int a;
>int b;
>int *p = &b;
>*(p - 1);
>>
>It accesses value of a using an lvalue of type int. The problem is
>of course that *(p-1) is illegal.
Another problem is that it's not necessarily accessing the value of
a.
Well, UB here is totally enough for me, regardless of what exactly
implementation will do :)

However, the standard does explicitly allow objects to be adjacent (it
has to do so to make pointer equality work consistently). In the
absence of any knowledge of how a and b are allocated in memory,
evaluating *(p - 1) invokes undefined behavior. If you happen to know
that they're adjacent, then *(p - 1) does access the value of a, and
it's legitimate (though quite silly).
It's legitimate? Pointer arithmetic is allowed only on arrays (not sure
what correct term is, it's not those int a[2]; arrays), isn't it? I
mean, it's UB even if a and b happen to be adjacent (which itself
isn't a standard term, since standard doesn't know what it means for
objects which are not members of some aggregate, in which case we
can talk about sequences of bytes).
For purposes of pointer arithmetic, any object can be treated as if it
were a single-element array. See, for example, C99 6.5.8p4:

....

>>>
Without this special-case permission, the standard would have had to
say that, given
int a;
int b;
&a + 1 *may not* be equal to &b (or vice versa), which would require
the implementation to insert at least one byte of padding between
objects that might otherwise be adjacent.
Sorry, I don't understand if it's yes or no. I said the following:

1) int a; int *p = &a + 1; is UB.

No, it's not. Reread C99 6.5.8p4 that Keith quoted above. You may
form a pointer to one past a single element, even if it is not part of
an array. &a + 1 is a valid pointer value to create and hold, just
not to dereference.

Yes, I totally missed it. You and Keith are absolutely right.
Anyway, while I was wrong that UB starts right in &a + 1, the example
is still good to demonstrate that right type is not enough:

int a, b = 2;
if (&a + 1 == &b)
{
printf ("%d\n", *(&a + 1));
*(&a + 1) = 8;
}

it is UB, right? I can't believe any implementation can do anything
but changing b here, but it's what standard says, or not?

....

>>Allowing objects to be adjacent is necessary for equality to be
defined consistently.
Standard doesn't allow nor disallow independent (as in not parts
of some aggregate) objects to be adjacent, it simply doesn't
say nor care about what it means.

Yes it does. That is why the wording was added to the standard.
Earlier versions of the standard did not have the clause "or one is a
pointer to one past the end of one array object and the other is a
pointer to the start of a different array object that happens to
immediately follow the first array object in the address space." as
part 6.5.9p6.

It was inadvertent, but that meant that if you wrote a function like
this:

void func(void)
{
int a, b;

if (((&a + 1) == &b) || ((&b + 1) == &a))
puts("this implementation violates the standard!");
}

...many, in fact most if not all, implementations would output the
message that they violated the standards. For an implementation to
comply with the earlier wording, it must put at least one byte (and
most likely sizeof(int) bytes) of wasted space between 'a' and 'b' in
memory, or in fact between the end of any object and the start of any
other object.

Now I got Keith's words about one byte padding.

.....

>>We can't use *the value of* u.b because C99 6.7.2.1p14 says:

The value of at most one of the members can be stored in a union
object at any time.

But 6.5p7 gives special permission to access an object by an lvalue
expression of character types.
It doesn't give special permission to access union member other
than that was previously set, at least it's absolutely not obvious
if it does.

It does not need to give special permission for union members. An
instance of a union is an object. Each of its members is an object.
6.5p7 does not say an object other than a union or a union member. It
applies to ANY object.

There is no possible confusion unless you try to maintain that members
of a union are not objects.

Well, it looks to me like uninitialized variables: they are objects, but
trying to use them like normal objects (e.g. using the value) may not be
allowed. Or pointers after you call free() on them, total mystery.

No, it's not obvious to me why we can do

union U {int a; double b;};
union U u;
u.a = 8;
u.b *= 2;

Either we can do this (avoiding things like trap representation,
i.e. making sure bit representation is valid for both types) and
we can use union {unsigned char ar[MANY]; Anything a;};
or we can't do this, and then same thing applies to unsigned char
case. From what I read, we actually can do it (i.e. I haven't
seen a prohibition), but it doesn't sound "right" to me. Maybe
it's just that I don't trust the standard and always expect traps.
In any case, it's not as simple as "u.b is an object therefore
we can access it".
Why I think we can do it is simple: it's not explicitly prohibited
(like in case of access outside array boundaries), and there is
only one way to make it work (this is proven using the fact that
value of union object consists of the same bytes as value of
its members, modulo padding bytes). But it's wrong, it must
be wrong!

....

>>I think the question of whether the wording of the standard actually
supports this conclusion is getting into comp.std.c territory.
I believe rationale explains the intent of that 6.5 wording, and
the intent certainly wasn't to allow accessing character array
union members. If it's allowed, then it must be elsewhere.

Please explain your conclusion that such an access is undefined.

Not necessarily undefined. Maybe unspecified. But my conclusion is
not that it's undefined (illegal); my conclusion is "I can't see why
it's allowed". Maybe it's because of C++, and C is more relaxed?

....

---snippet two---
union { int i; unsigned char u[sizeof(int)]; } un = { INT_MAX };

unsigned char *ucp = (unsigned char *)&un.i;
unsigned char uc = ucp[0];

ucp = un.u;
uc = ucp[0];

---end---

Has undefined behavior on either the last line, or the last two lines.

It has or has not? And is it different from the original one,
where you'd do 'uc = un.u[0];'?

6.5p7 guarantees that snippet 1 is valid. In snippet 2, un.i is still
an object, so the first access via converted pointer is still valid.
For some reason you think that the standard is ambiguous about the
second access, via the unconverted address of un.u, which results in
the same address being accessed as the same type of lvalue.

Well, addresses are the same, yes. But in the example with int a, b;
we also have the same addresses; still we can't dereference one of two
pointers, even though they compare equal. Indeterminate values: address
is no problem at all, accessing value is UB.

Yevgen

Mar 14 '07 #24

Old Wolf

On Mar 14, 3:23 pm, Yevgen Muntyan <muntyan.removet...@tamu.edu>
wrote:

Old Wolf wrote:

No, the aliasing rule as quoted by Ben Pfaff still applies;
you can't access a long long with a pointer to int , regardless
of whether they are in a union or not.

Hm, do you mean something like this:

union U {int a; long long b;};

void func (int *p)
{
union U *up = (union U *) p;
long long *l = &up->b;
}

union U u;
u.b = 8;
func (&u.a);

It would access long long using int pointer. But then it would
only mean that given code is UB, not that you can't do

u.a = 8; u.b;

Sorry, I was unclear with my wording. The text quoted by Ben Pfaff
is more specific; u.b is UB because you are accessing the content
of u.a by using an lvalue whose type is not compatible with the
type of u.a . (The meaning of 'compatible' is precisely defined
by the standard text).

It's similar to the following:

void func (int *ip)
{
long long *lp = (long long*) ip;
*lp = 8;
}

long long a;
func ((int*) &a);

It's legal to convert long long pointer to int pointer and back
again (assuming everything is good with alignment), but this
code invokes UB according to 6.5?

I think the code is OK, w.r.t. 6.5. The typecast is not the
problem. The problem would only occur if you tried to access
the contents of 'a' using an lvalue of type int. In this code,
although you construct a pointer of type int, you never
dereference it, so no problem.
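
A minimal sketch of that reading, assuming the converted pointer is correctly aligned for int (typical, since long long is usually at least as strictly aligned as int, but not something the standard promises). The names store_eight and round_trip_demo are invented for illustration:

```c
#include <assert.h>

/* Receives the address of a long long disguised as int *, converts it
   back, and writes through the long long lvalue only. The int pointer
   itself is never dereferenced. */
static void store_eight(int *ip)
{
    long long *lp = (long long *) ip;
    *lp = 8;
}

/* Passes a long long through the int * parameter and returns what was
   stored in it. */
long long round_trip_demo(void)
{
    long long a = 0;

    store_eight((int *) &a);
    return a;
}
```

The object a is only ever accessed as a long long, so the aliasing list in 6.5p7 is not involved; writing *ip = 8 inside store_eight would be the problematic access.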

Note, this code does have other problems, namely the assumption
that the long long pointer can be converted to the int pointer
and back, without loss of information. Of course the standard
does not guarantee this (I don't recall if a diagnostic is
required for the above code, or whether it is ID or just UB).

Perhaps I am missing something again, I am totally confused
by this stuff.

Me too :)

I don't really see the reason for the type-punning rule, as long
as alignment is OK and the representation is not a trap representation.

Mar 14 '07 #25

Yevgen Muntyan

Jack Klein wrote:

On 13 Mar 2007 17:39:09 -0700, "Old Wolf" <ol*****@inspire.net.nz>
wrote in comp.lang.c:

>On Mar 14, 12:01 pm, Ben Pfaff <b...@cs.stanford.edu> wrote:
>>Yevgen Muntyan <muntyan.removet...@tamu.edu> writes:
Why is it legal to do
union U {unsigned char u[8]; int a;};
union U u;
u.a = 1;
u.u[0];
See C99 section 6.5 "Expressions":

An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:73)
[...]
- a character type.
I wonder if you can clarify something for me. The above clause does
not say that accessing a stored value via an lvalue expression of
character type is always legal. It merely says that accessing it
via other types is not legal.

Now, in N869, 6.5.2.3#5 says quite clearly that u.u can only be
accessed if it were the last member to be set, so the above code
would be UB. However the first sentence of 6.5.2.3#5 was removed
in N1124, making the above code legal again.

This wording was removed before the ratification of C99. The standard
wording is the same as N1124.

>What did the actual C99 text say, and what did C90 have to say on the
matter?

Wording was changed in the standard specifically to make it clear that
a program can access any memory that it has a right to as an array of
unsigned char, without causing UB.

The removed sentence said it's implementation-defined, not UB, which
makes quite a difference. It was clearly said you can access any union
member without causing UB. The intent could be to make it UB, or
make it unspecified, or make it well-defined, or anything else (unless
you know more than that text and current text).

Yevgen

Mar 14 '07 #26

Yevgen Muntyan

Old Wolf wrote:

On Mar 14, 3:23 pm, Yevgen Muntyan <muntyan.removet...@tamu.edu>
wrote:
>Old Wolf wrote:

>>No, the aliasing rule as quoted by Ben Pfaff still applies;
you can't access a long long with a pointer to int , regardless
of whether they are in a union or not.
Hm, do you mean something like this:

union U {int a; long long b;};

void func (int *p)
{
union U *up = (union U *) p;
long long *l = &up->b;
}

union U u;
u.b = 8;
func (&u.a);

It would access long long using int pointer. But then it would
only mean that given code is UB, not that you can't do

u.a = 8; u.b;

Sorry, I was unclear with my wording. The text quoted by Ben Pfaff
is more specific; u.b is UB because you are accessing the content
of u.a by using an lvalue whose type is not compatible with the
type of u.a . (The meaning of 'compatible' is precisely defined
by the standard text).

I see now (or I hope so).
So, we can do u.char_array[0] because it's not prohibited in this
paragraph, not prohibited elsewhere, and there is no behavior for it
other than the obvious one, which wouldn't contradict the description
of union objects.

>It's similar to the following:

void func (int *ip)
{
long long *lp = (long long*) ip;
*lp = 8;
}

long long a;
func ((int*) &a);

It's legal to convert long long pointer to int pointer and back
again (assuming everything is good with alignment), but this
code invokes UB according to 6.5?

I think the code is OK, w.r.t. 6.5. The typecast is not the
problem. The problem would only occur if you tried to access
the contents of 'a' using an lvalue of type int. In this code,
although you construct a pointer of type int, you never
dereference it, so no problem.

Note, this code does have other problems, namely the assumption
that the long long pointer can be converted to the int pointer
and back, without loss of information. Of course the standard
does not guarantee this (I don't recall if a diagnostic is
required for the above code, or whether it is ID or just UB).

6.3.2.3p7:
A pointer to an object or incomplete type may be converted to a pointer
to a different object or incomplete type. If the resulting pointer is
not correctly aligned for the pointed-to type, the behavior is
undefined. Otherwise, when converted back again, the result shall
compare equal to the original pointer.

So converting forth and back is fine, "assuming everything is good with
alignment".

>Perhaps I am missing something again, I am totally confused
by this stuff.

Me too :)

I don't really see the reason for the type-punning rule, as long
as alignment is OK and the representation is not a trap representation.

Perhaps it allows some great optimizations? At least it seems to be the
reason why gcc-4 broke a lot of code :)

Yevgen

Mar 14 '07 #27

Yevgen Muntyan

Old Wolf wrote:

On Mar 14, 3:23 pm, Yevgen Muntyan <muntyan.removet...@tamu.edu>
wrote:
>Old Wolf wrote:

>>No, the aliasing rule as quoted by Ben Pfaff still applies;
you can't access a long long with a pointer to int , regardless
of whether they are in a union or not.
Hm, do you mean something like this:

union U {int a; long long b;};

void func (int *p)
{
union U *up = (union U *) p;
long long *l = &up->b;
}

union U u;
u.b = 8;
func (&u.a);

It would access long long using int pointer. But then it would
only mean that given code is UB, not that you can't do

u.a = 8; u.b;

Sorry, I was unclear with my wording. The text quoted by Ben Pfaff
is more specific; u.b is UB because you are accessing the content
of u.a by using an lvalue whose type is not compatible with the
type of u.a . (The meaning of 'compatible' is precisely defined
by the standard text).

Actually, why is it an access to the content of u.a? If u.b is always
allowed, then u.b accesses the value of u.a after "u.a = 1;" in the
same way as after "u.b = 1;".
If "do not try to get value of a union member if you did not
set it explicitly" is not a law, then

union U {long long a; int b;};
u.a = 0; u.b *= 2;

is valid (given we avoid trap representation and padding bytes
problems), similar to

memset (&u, 0, sizeof u); u.b *= 2;

(assuming there are no padding bits in long long). But there is no
such rule in the standard, is there? It was implementation-defined in
N869, and disappeared in N1124. Something is wrong here.

Yevgen

Mar 14 '07 #28

Chris Torek

In article <ln************@nuthaus.mib.org>
Keith Thompson <ks***@mib.org> wrote:

>... If you happen to know that they're adjacent, then *(p - 1) does
access the value of a, and it's legitimate (though quite silly).
For example:

#include <stdio.h>
int main(void)
{
int a = 42;
int b;
int *p = &b;

if (&a + 1 == &b) {
printf("Accessing a strangely: %d\n", *(p - 1));
}
else {
printf("Accessing a normally: %d\n", a);
}
return 0;
}

I am ... "uncomfortable" with calling this code well-defined,
if only because the act of computing (p - 1) (regardless of
the unary "*" operator) is itself undefined.

In other words, suppose &a+1 == &b (and note that "&a + 1" is
well-defined, as is the equality operator), but suppose also that
the implementation has some sort of Mysterious And Complicated
pointer checking at runtime, by which it finds that p points to
&b, which in turn is a "standalone" object, i.e., not a member of
an array or structure or whatever. (This step probably has to look
in a "side table" left behind by the compiler. The format of the
table itself would be rather complicated too: in the general case,
it has to relate data addresses to offsets within activation records
["stack frames"] and thence to information about declarations in
the original source code. This kind of checking would generally
be easier in an interpreter, where pointer values probably carry
actual variable names as part of their encoding.[%]) Given all this
information, the runtime code can decide that "p - 1" is invalid.
Thus, when you compute p - 1, it may trap at runtime.

[% For malloc() objects, which have no names, the interpreter simply
constructs a "dynamic name" at malloc() time. If the interpreter
constructs new names even when an internal malloc-able address is
recycled, this can also catch the use of a free()d pointer, even
if the memory has been handed out again later. That is:

T1 *p;
T2 *q;
p = malloc(...); /* name: "<malloc:0000>" address: 0x1234 */
...
free(p); /* 0x1234 back into pool */
...
q = malloc(...); /* name: "<malloc:0001>" address: 0x1234 */
...
some_operation(*p) /* address 0x1234 is valid but has name <malloc:0001>,
while the name associated with p is <malloc:0000>,
so this is referring to a free'd pointer. */

A good C interpreter that did this would be a useful debugging tool. :-) ]
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Mar 14 '07 #29

Keith Thompson

Yevgen Muntyan <mu****************@tamu.edu> writes:
[...]

Sorry, I don't understand if it's yes or no. I said the following:

1) int a; int *p = &a + 1; is UB.

Sort of.

2) In "int a; int b;" if we say "a and b are adjacent" then it has
no meaning as far as the standard is concerned. We could talk
about implementation-specific memory layout, about what addresses
actually mean, but it's not standard.

As I wrote upthread, the standard specifically addresses this. It's
possible for a and b to be adjacent (i.e., either &a + 1 == &b, or
&b + 1 == &a), and if they happen to be adjacent, then it's possible
for a program to detect it. It's not meaningless at all.

>Allowing objects to be adjacent is necessary for equality to be
defined consistently.

The standard neither allows nor disallows independent (as in not parts
of some aggregate) objects to be adjacent; it simply doesn't
say or care what that means.

That's incorrect. The standard doesn't suggest that any two
independent objects should or shouldn't be adjacent, but it
specifically allows for the possibility. Am I misunderstanding your
point here?

[...]

>>>I'm not quite sure what this has to do with the question about unions,
though.
It was an example of situation where types are fine as to 6.5p7
but the expression was illegal nevertheless.
Be careful with the word "illegal". I think what you mean is that it
invokes undefined behavior.

No, strictly speaking I mean non-strictly-conforming code.

I don't *think* that's what you mean. There's a big difference
between non-strictly-conforming and "illegal", for any reasonable
definition of the term. For example, this:

printf("INT_MAX = %d\n", INT_MAX);

is non-strictly-conforming, but it's perfectly legal.

[...]

>As the footnote there says:
The intent of this list is to specify those circumstances in which
an object may or may not be aliased.

Exactly, has nothing to do with this particular thing: whether
you can freely access different union members.

The standard doesn't have a definition of the term "aliasing". In my
opinion, the union example is a form of aliasing.

[...]

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Mar 14 '07 #30

Yevgen Muntyan

Keith Thompson wrote:

Yevgen Muntyan <mu****************@tamu.edu> writes:
[...]

>>Allowing objects to be adjacent is necessary for equality to be
defined consistently.
The standard neither allows nor disallows independent (as in not parts
of some aggregate) objects to be adjacent; it simply doesn't
say or care what that means.

That's incorrect. The standard doesn't suggest that any two
independent objects should or shouldn't be adjacent, but it
specifically allows for the possibility. Am I misunderstanding your
point here?

No, I misunderstood you and the standard and everything here :)

[...]

>>>>I'm not quite sure what this has to do with the question about unions,
though.
It was an example of situation where types are fine as to 6.5p7
but the expression was illegal nevertheless.
Be careful with the word "illegal". I think what you mean is that it
invokes undefined behavior.
No, strictly speaking I mean non-strictly-conforming code.

I don't *think* that's what you mean. There's a big difference
between non-strictly-conforming and "illegal", for any reasonable
definition of the term.

Um, I am mixing a lot of stuff here. In the original issue
(union {unsigned char u[10];};) I care only about strict conformance.
Say, if it's implementation-defined what happens when you access a member
of a union which wasn't previously set, then it's perfectly legal but
totally defeats the purpose of the union-with-char-array trick.
And when I answered your question, I answered a completely unrelated
question. So, the answer to your question is something like: UB or something
equally bad. Not just UB, but also a constraint violation, a syntax error,
a documented crash at runtime, or whatever else bad can happen, not sure.
The example did show UB (incorrectly, but we can easily correct it by
adding a dereference of the pointer).

[...]

>>As the footnote there says:
The intent of this list is to specify those circumstances in which
an object may or may not be aliased.
Exactly, has nothing to do with this particular thing: whether
you can freely access different union members.

The standard doesn't have a definition of the term "aliasing". In my
opinion, the union example is a form of aliasing.

I don't know. I do know that union member access is less clear than
before :)

Yevgen

Mar 14 '07 #31
