By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
448,691 Members | 1,123 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 448,691 IT Pros & Developers. It's quick & easy.

Efficient string tricks: legal and defined?

P: n/a
I'm trying to write a space efficient string (nul terminated array
of char), storing the characters directly unless the number of
characters is too large to be so stored, and storing a pointer to
other storage otherwise.

Rather than lose space to padding by just using a sentinal bool to
indicate which member of the union has been written to, I want to
overload the last character of the directly stored string as the
sentinal.

I think what I'm doing is defined and legal, but I'd like you to
pick it apart in case I've missed something.

First, is it in fact undefined to write/read any possible padding
bits in a struct? If (as I suspect) it's not legal, am I in fact
avoiding doing so?

Basically, I have a union of char* and char[ n ], which is the
first member of a struct which has a char[ n ] as the second
member. In each case, n is sizeof( char* ).

Any string which, with nul terminator, is <= 2 * sizeof( char* ),
is stored directly. Since the nul terminator is the last character,
in this case the last character is 0 (or is forced to be).

Any string which, with nul terminator, is longer than 2 * sizeof(
char ), is copied to malloc'd memory, and the union's first member
is pointed to it. Then last character of the struct is set to 1.

Since the last character of the struct is NOT a part of the union,
it's legal to access it regardless of what member of the union has
been written to, to discover know which member of the union was
written to and can legally be read.

In case there is padding at the end of the struct, we calculate the
size we can directly write by finding the offset of the last
element of its last member, and not with the more obvious sizeof(
struct pstring ):
offsetof( struct pstring, m[ sizeof( char* ) ] )

(As m[ sizeof( char* ) ] is one past the end of the array, this
should be legal.)

Complete code can be found below.

Thanks,
Tom

#include <stdlib.h>
#include <stddef.h>
#include <string.h>

union pOrA {
char* p ;
char a[ sizeof( char* ) ] ;
} ;

struct pstring {
/* pstring? A VERY UNINFORMATIVE NAME */
union pOrA poa ;
/* 1: does the standard guarantee there is no padding here? */
char m[ sizeof( char* ) ] ;
/* 2: does the standard guarantee there is no padding here? */
} ;

void write_pstring( struct pstring* p, const char* src, size_t slen
) {
char b ;
char* d ;

/* this fails if pstring is padded at the end!
if( b = ( slen >= sizeof( pstring ) ) ) { /*

/* this may be ok: */
if( b =
( slen >= offsetof( struct pstring, m[ sizeof( char* ) ] ) )
) {
d = p->poa.p = malloc( slen + 1 ) ;
} else {
d = (char*)p ;
}
strncpy( d, src, slen ) ;
d[ slen ] = 0 ;
p->m[ sizeof( char* ) - 1 ] = b ;
}

const char* read_pstring( struct pstring* p ) {
if( p->m[ sizeof( char* ) - 1 ] ) {
return p->poa.p ;
} else {
return &p->poa.a[ 0 ] ;
}
}

int main() {
struct pstring ps[ 16 ] ;
const char* src = "abcdefghijklmnopqrstuvwxyz" ;
int srclen = strlen( src ) ;
int i ;

for( i = 0 ;
i <= srclen && i < sizeof( ps ) / sizeof( ps[ 0 ] ) ;
++i ) {
write_pstring( &ps[ i ], src, i ) ;
printf( "ps[ %i ]: |%s|\n", i, read_pstring( &ps[ i ] ) ) ;
}
}
Nov 13 '05 #1
Share this Question
Share on Google+
2 Replies


P: n/a
Thomas Paul Diffenbach <usenet@..my-last-name...org> wrote:
# I'm trying to write a space efficient string (nul terminated array
# of char), storing the characters directly unless the number of
# characters is too large to be so stored, and storing a pointer to
# other storage otherwise.
#
# Rather than lose space to padding by just using a sentinal bool to
# indicate which member of the union has been written to, I want to
# overload the last character of the directly stored string as the
# sentinal.

One of two issues I can think of is that the last char of a string does not
necessarily correspond to the least significant byte of an address:
on some machines it can correspond to the msb. If you want to be able
to run on those machines, you need to deal with that.

If you pack the string into an integer, then on any rational
architecture the least signficant byte of an integer and an
address will correspond. You can pack the string with something
like

typedef unsigned ... PackedString;
/*... is a large enough int such that sizeof(PackedString)>=sizeof(char*).*/
PackedString ps; int n; unsigned char *sp = (unsigned char*)originalString;
for (n=0,ps=0; n<sizeof(char*) && *sp; n++,sp++) {
ps = (ps<<CHAR_BIT) | *sp;
}

The second issue is whether you have enough spare bits to distinguish
the two values. If you restrict the packed string length to sizeof(char*)-1
instead sizeof(char*), you have CHAR_BIT spare bits in the PackedString that
you can encode a type.

If you restrict yourself to ASCII, with codes 0 through 127, then you only need
7 bits to char; on most machines CHAR_BIT is 8. If you shift (ps<<7) instead on
such machines, you can pack sizeof(char*) characters and still have spare bits.
On most machines today, sizeof(char*)=4, CHAR_BIT=8, and so you would have 4 spare
bits.

I don't think there's a good general way to find spare bits in an arbitrary
char* address, but on most machines today, malloc returns addresses aligned
on an even number of bytes, or even an multiple of 4. That means that on most
modern machines, if you only malloc the string, the bottom one or two bits of
the address will always be zero. That gives you one or two bits for type
encoding.

So on most modern machines, if you restrict your packed string width to
sizeof(char*)-1 or the character width to less than CHAR_BIT, and you
restrict your string allocation to a malloc that returns even address,
then you have at least one bit available that you can use as a type discriminator.
(The above packing leaves the spare bits at the most significant end instead
of the least signficant end like addresses. All you need is another << shift
to move the packed string up and leave a hole at the lsb.)

Is this legal? All things are permitted that do not cause a cpu fault.

Is this portable? It is widely portable, but not universally portable.
You should explicitly state your assumptions in the documentation so that
people at least realise there is a porting issue. You can also encode tests
to verify your assumption. If you're assume 7<CHAR_BIT, you can code that
test in the program with something like
#if 7>=CHAR_BIT
#error Expected CHAR_BIT>7.
#endif

You can test the assumption about malloc by testing the returned address.
p = malloc(n);
if (!p) ...
if (((int)p) & 1) {
fputs("Expected malloc to return even addresses.\n",stderr);
abort();
}

(Using least signficant bits to encode types is an old and established
programming trick.)

--
Derk Gwen http://derkgwen.250free.com/html/index.html
One of the drawbacks of being a martyr is that you have to die.
Nov 13 '05 #2

P: n/a
Thomas Paul Diffenbach wrote:

I'm trying to write a space efficient string (nul terminated array
of char), storing the characters directly unless the number of
characters is too large to be so stored, and storing a pointer to
other storage otherwise.

Rather than lose space to padding by just using a sentinal bool to
indicate which member of the union has been written to, I want to
overload the last character of the directly stored string as the
sentinal.

I think what I'm doing is defined and legal, but I'd like you to
pick it apart in case I've missed something.
Warning: I've only glanced at it, and haven't had time
to study it carefully. But ...
First, is it in fact undefined to write/read any possible padding
bits in a struct? If (as I suspect) it's not legal, am I in fact
avoiding doing so?
As far as I can tell, no harm can come to you from writing
the padding bytes. However, trouble *can* arise when you read
them back later on, because padding bytes are not guaranteed
to retain their stored values. Make a copy of the struct, and
it's not guaranteed that the padding bytes get copied. Store
something in one of the struct elements, and the padding bytes
may get clobbered as a side-effect. The long and short is that
you can't rely on padding bytes' values to guide your program.
[...]
struct pstring {
/* pstring? A VERY UNINFORMATIVE NAME */
union pOrA poa ;
/* 1: does the standard guarantee there is no padding here? */
No; there may be padding here.
char m[ sizeof( char* ) ] ;
/* 2: does the standard guarantee there is no padding here? */
No; there may be padding here.
} ;
[remainder snipped]


--
Er*********@sun.com
Nov 13 '05 #3

This discussion thread is closed

Replies have been disabled for this discussion.