How is strlen implemented?

"roy" <ro*****@hotmail.com> writes:

Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

It's helpful to provide some context when you post a followup. I
happen to have read the previous articles just before I read this one,
but I could as easily have seen your followup first.

If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.

As for your question, strlen()'s argument isn't a char array, it's a
pointer to a char. Normally the pointer should point to the first
element of a "string" (i.e., a sequence of characters marked by a '\0'
terminator). strlen() doesn't know how many characters are
actually in the array. By calling strlen(), you're promising that
there's a '\0' terminator somewhere within the array; if you break
that promise, there's no telling what will happen.

A typical implementation of strlen() will simply traverse the elements
of what it assumes to be your array until it finds a '\0' character.
If it doesn't find a '\0' character within the array, it has no way of
knowing it should stop searching, so it will just continue until it
finds a '\0'. As soon as it passes the end of the array, it invokes
undefined behavior. It might happen to find a '\0' character (which
is what happened in your case). Or it might run past the memory owned
by your program and trigger a segmentation fault or something similar.
Or, as far as the C standard is concerned, it might make demons fly
out your nose.

So don't do that.
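
If the caller does know the size of the array, one way to keep that promise is to scan with an explicit limit. The following is only a sketch; the name bounded_len and its cap parameter are made up for illustration and are not part of the standard library.

```c
#include <stddef.h>

/* Hypothetical helper: examine at most cap bytes starting at p.
   Returns the string length if a '\0' is found within cap bytes,
   or cap itself if no terminator was seen. */
size_t bounded_len(const char *p, size_t cap)
{
    size_t n = 0;

    while (n < cap && p[n] != '\0')
        n++;
    return n;
}
```

A result equal to cap means "no terminator in the part I was allowed to look at", which is exactly the case in which calling strlen() would be an error.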

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #8

Joe Wright

Jason wrote:

roy wrote:
Hi,

I was wondering how strlen is implemented.
What if the input string doesn't have a null terminator, namely the
'\0'?
Thanks a lot
Roy

strlen will read from the char* until it finds a '\0' char. If your
string does not use the '\0' as a terminator, then you should avoid
most of the <string.h> functions.

-Jason

More precisely, if your char array does not have a 0 terminator, it is
not a string.
--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Nov 14 '05 #9

Richard Tobin

In article <11**********************@l41g2000cwc.googlegroups .com>,
roy <ro*****@hotmail.com> wrote:

Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.

Bear in mind that a char array usually *does* have a null terminator.

If it doesn't, it's quite likely to be followed in memory by a zero
byte, which is the representation of nul on almost all systems, so it
will often work by luck.

Debugging systems often have an option to initialize variables to
non-zero values, precisely to stop this kind of "luck" from obscuring
real errors. Some readers will remember the many bugs that were
revealed when dynamic linking was added to SunOS, causing
uninitialized variables in main() to no longer be zero.
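
The same idea can be applied by hand while testing. This is a rough sketch, not what any particular debugging system does; POISON, poison() and has_terminator() are names invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define POISON 0xAA   /* assumed fill value; any nonzero byte works */

/* Fill a buffer with a nonzero pattern so that a forgotten
   terminator is not hidden by leftover zero bytes. */
void poison(char *buf, size_t size)
{
    memset(buf, POISON, size);
}

/* Returns 1 if buf[0..size-1] contains a '\0', else 0. */
int has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}
```

Poisoning a buffer before filling it makes a missing '\0' show up in has_terminator() instead of being masked by whatever zeros happened to be in memory.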

-- Richard

Nov 14 '05 #10

There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Gregory Pietsch */

Nov 14 '05 #11

Joe Estock

Gregory Pietsch wrote:

There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Gregory Pietsch */

Interesting seeing \0 so widely in use. On most systems, NULL is defined
as \0, however there are a few special cases where it is not. Shouldn't
we be using NULL instead of \0?

Joe Estock

Nov 14 '05 #12

Joe Wright

Joe Estock wrote:

Gregory Pietsch wrote:
There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Gregory Pietsch */

Interesting seeing \0 so widely in use. On most systems, NULL is defined
as \0, however there are a few special cases where it is not. Shouldn't
we be using NULL instead of \0?

Joe Estock

No Joe, NULL is the 'null pointer constant' while '\0' is a constant
character (with int type) and value zero. This is often called the null
character or the NUL character. Never NULL character.
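
A small sketch of the difference, assuming it is compiled as C (in C++ a character constant has type char, so the first function would give a different answer). The function names are made up for the example.

```c
#include <stddef.h>

/* '\0' is an int constant with value zero, so in C it is int-sized. */
int nul_is_int_sized(void)
{
    return sizeof '\0' == sizeof(int);
}

/* Count the characters before the terminating null character. */
size_t count_chars(const char *s)
{
    size_t n = 0;

    if (s == NULL)          /* NULL: "there is no object here" */
        return 0;
    while (s[n] != '\0')    /* '\0': "the string ends here" */
        n++;
    return n;
}
```

Returning 0 for a null pointer is a choice made for this example only; the real strlen() makes no such promise, and passing it a null pointer is undefined behavior.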

--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Nov 14 '05 #13

Minti

Chris Torek wrote:

In article <11*********************@g14g2000cwa.googlegroups. com>
roy <ro*****@hotmail.com> wrote:
I was wondering how strlen is implemented.
What if the input string doesn't have a null terminator, namely the
'\0'?

Q: What if a tree growing in a forest is made of plastic?
A: Then it is not a tree, or at least, it is not growing.

If something someone else is calling a "string" does not have the
'\0' terminator, it is not a string, or at least, not a C string.
In C, the word "string" means "data structure consisting of zero
or more characters, followed by a '\0' terminator". No terminator,
no string.

Since strlen() requires a string, it may assume it gets one.

There are functions that work on "non-stringy arrays"; in particular,
the mem* functions -- memcpy(), memmove(), memcmp(), memset(),
memchr() -- but they take more than one argument. If you have an
array that always contains exactly 40 characters, and it is possible
that none of them is '\0' but you want to find out whether there
is a '\0' in those 40 characters, you can use memchr():

char *p = memchr(my_array, '\0', 40);

memchr() stops when it finds the first '\0' or has used up the
count, whichever occurs first. (It then returns a pointer to the
found character, or NULL if the count ran out.) The strlen()
function has an effect much like memchr() with an "infinite" count,
except that because the count is "infinite", it "always" finds the
'\0':

size_t much_like_strlen(const char *p) {
const char *q = memchr(p, '\0', INFINITY);
return q - p;
}

except of course C does not really have a way to express "infinity"
here. (You can approximate it with (size_t)-1, though.)
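
For the fixed 40-character array case, the memchr() call can be wrapped up roughly like this. It is a sketch only; FIELD_SIZE and field_len() are names invented here.

```c
#include <stddef.h>
#include <string.h>

#define FIELD_SIZE 40   /* the fixed array size from the example above */

/* Returns the length of the string stored in field, or FIELD_SIZE
   if none of the FIELD_SIZE bytes is '\0' (i.e. it holds no string). */
size_t field_len(const char field[FIELD_SIZE])
{
    const char *end = memchr(field, '\0', FIELD_SIZE);

    return end != NULL ? (size_t)(end - field) : FIELD_SIZE;
}
```

Unlike strlen(), this never reads past the 40 bytes it was told about, whether or not a terminator is present.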

Pardon me Chris, but I really don't get the drift of what you are
trying to convey. These strings are also "stringy", I don't see how
these are "non-stringy".

IOW you are assuming that these "non-stringy" arrays are also supposed
to end with a null character. "Stringy" I say.

--
Imanpreet Singh Arora

Nov 14 '05 #14

Chris Torek

Chris Torek wrote:

There are functions that work on "non-stringy arrays"; in particular,
the mem* functions ... If you have an array that always contains
exactly 40 characters, and it is possible that none of them is '\0'
but you want to find out whether there is a '\0' in those 40
characters, you can use memchr() ...

In article <11**********************@o13g2000cwo.googlegroups .com>,
Minti <im*******@gmail.com> wrote:

Pardon me Chris, but I really don't get the drift of what you are
trying to convey. These strings are also "stringy", I don't see how
these are "non-stringy".

If there is no '\0' byte in all 40 characters, it is not a string.
If there is a '\0' byte somewhere within those 40 characters, it
*is* a string -- and any characters after the first such '\0' are
not part of the string (but remain part of the array).

IOW you are assuming that these "non-stringy" arrays are also supposed
to end with a null character. "Stringy" I say.

In other words, I am saying that these arrays do not contain strings
if and only if they do not contain a '\0'. Note that strncpy()
sometimes makes such arrays (which is one reason some people invented
strlcpy()).
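
To make the strncpy() remark concrete, here is a sketch of the usual workaround. The helper name copy_terminated() is made up; it is not strlcpy() and does not claim to match its interface.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst (capacity cap, cap > 0) with strncpy(), then
   force a terminator so dst always holds a string.
   Returns 1 if src fit completely, 0 if it was truncated. */
int copy_terminated(char *dst, size_t cap, const char *src)
{
    strncpy(dst, src, cap);          /* may leave dst without a '\0' */
    if (dst[cap - 1] != '\0') {      /* src had cap or more characters */
        dst[cap - 1] = '\0';
        return 0;
    }
    return 1;
}
```

The check works because strncpy() pads the rest of dst with '\0' bytes when src is shorter than cap, so the last byte is nonzero only when no terminator was written.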

If I may draw an analogy: in mathematics, a statement is false if
there is even a single counterexample. Hence "x * (1/x) = 1" is
a false statement mathematically, because it does not hold for x=0.
(But note that if we limit it, "x * (1/x) = 1 provided x \ne 0",
the statement becomes true for x \elem real, while it remains false
for x \elem integer, and so on.) (Note that details like "x is a
real number" also matter in computing, where float and double do
not really give us "real numbers", but rather approximations.)
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Nov 14 '05 #15

"Gregory Pietsch" <GK**@flash.net> writes:

There has to be a null terminator somewhere.

To clarify: This doesn't mean that there's a guarantee that there will
be a null terminator somewhere. It means that if there isn't a null
terminator anywhere, you must not call strlen(). The burden is on the
caller.

(I briefly read your statement the other way.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #16

Mark McIntyre

On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<ro*****@hotmail.com> wrote:

Thanks. Maybe my question should be "what if the input is a char array
without a null terminator".

Your question was already answered. However, a quote from the ISO
Standard may help:

7.21.6.3 The strlen function

3. The strlen function returns the number of characters that precede
the terminating null character.

Clearly if there's no terminating null, this function can't return
anything meaningful. It may in fact not return at all, and it's not
uncommon for it to return absurd numbers such as 5678905 (or, if the
size_t result is converted to a signed type, -456).
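
A short sketch of what "the characters that precede the terminating null character" means in practice. The helper uncounted_bytes() is a made-up name, and it assumes the array really does contain a '\0' somewhere.

```c
#include <stddef.h>
#include <string.h>

/* Number of bytes in the array that are NOT counted by strlen():
   the terminating null character plus anything stored after it.
   The array must contain at least one '\0' within its size bytes. */
size_t uncounted_bytes(const char *array, size_t size)
{
    return size - strlen(array);
}
```

For an array initialized from "ab\0cd", sizeof gives 6 while strlen() gives 2: the count stops at the first null character, and the result has type size_t, which is unsigned.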

But from my experimental results, it seems
that strlen can still return the number of characters of a char array.

How can it do that? It's /required/ to search for the terminating null.
Your compiler is either not standard compliant, or it's exhibiting
random behaviour.

I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky
--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>


Nov 14 '05 #17

Mark McIntyre <ma**********@spamcop.net> writes:

On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<ro*****@hotmail.com> wrote:

[...]

But from my experimental results, it seems
that strlen can still return the number of characters of a char array.

How can it do that? It's /required/ to search for the terminating null.
Your compiler is either not standard compliant, or it's exhibiting
random behaviour.

strlen() is almost certainly finding a zero byte immediately after his
array. I'd expect that to be a very common manifestation of the
undefined behavior in this case.

I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #18

roy wrote:

Hi,

I was wondering how strlen is implemented.
What if the input string doesn't have a null terminator, namely the
'\0'?
Thanks a lot
Roy

I found some C functions coded in assembler for the 8086 way back when.

;
; -------------------------------------------------------
; int strlen(s)
; char *s;
; Purpose: Returns the length of the string, not
; including the NULL character
; -------------------------------------------------------
;
ifndef pca
include macro2.asm
include libdef.asm
endif
;
idt strlen
def strlen
strlen: qenter bx,di
mov di,parm1[bx]
; cmp di,zero
; jz null
mov ax,ds
mov es,ax
mov cx,-1
xor al,al
cld
repnz scasb
not cx
dec cx
mov ax,cx
exitf
;null xor ax,ax
; exitf
modend strlen

I guess its C equivalent is:

unsigned
strlen( char *string )
{
    unsigned rv = -1;

    while( *string ) rv--, string++;

    rv = (-rv) - 1;
    return rv;
}

of course I'd just write it like this:

size_t
strlen( const char *string )
{
    size_t rv = 0;
    while ( *string++ ) rv++;
    return rv;
}
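
For anyone puzzled by the "not cx / dec cx" pair in the assembler listing, here is a sketch that follows the register arithmetic step by step. It uses a size_t where the 8086 code uses the 16-bit CX register, and the name scasb_style_len() is invented for the example.

```c
#include <stddef.h>

/* Mirror of the register arithmetic in the listing above, using an
   unsigned counter in place of CX. The counter is decremented once per
   byte examined, including the terminating '\0' (as REPNZ SCASB does). */
size_t scasb_style_len(const char *s)
{
    size_t cx = (size_t)-1;      /* mov cx,-1 */

    do {
        cx--;                    /* one decrement per byte scanned */
    } while (*s++ != '\0');      /* stop after the zero byte is seen */

    cx = ~cx;                    /* not cx -> number of bytes scanned */
    cx--;                        /* dec cx -> drop the terminator */
    return cx;
}
```

Starting from all ones and counting down means the complement of the counter is the number of bytes scanned; subtracting one removes the terminator from the count.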

Nov 14 '05 #19

Keith Thompson wrote:

Mark McIntyre <ma**********@spamcop.net> writes:
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<ro*****@hotmail.com> wrote:

[...]
But from my experimental results, it seems
that strlen can still return the number of characters of a char array.

How can it do that? It's /required/ to search for the terminating null.
Your compiler is either not standard compliant, or it's exhibiting
random behaviour.

strlen() is almost certainly finding a zero byte immediately after his
array. I'd expect that to be a very common manifestation of the
undefined behavior in this case.

I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Nov 14 '05 #20

Stan Milam <st*****@swbell.net> writes:

Keith Thompson wrote:
Mark McIntyre <ma**********@spamcop.net> writes:
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<ro*****@hotmail.com> wrote:

[...]
But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
[...]
I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

An implementation of strlen() that was able to detect the case where
the argument points to the first element of an array that doesn't
contain any '\0' characters would most likely add significant overhead
to *all* operations. The obvious way to implement it is to make all
pointers "fat", so each pointer includes both the base address and
bounds information; strlen() would then have to check the bounds.
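
To give a feel for what that would involve, here is a sketch of a "fat" pointer done by hand. No real implementation is being described; struct fat_ptr and fat_strlen() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical "fat" pointer: an address plus the number of bytes
   that may legally be read starting at that address. */
struct fat_ptr {
    const char *base;
    size_t      size;
};

/* Bounds-checked length. Stores the length in *len and returns 1 if a
   '\0' is found within the bounds; returns 0 (leaving *len untouched)
   if the scan would have run off the end of the object. */
int fat_strlen(struct fat_ptr p, size_t *len)
{
    size_t i;

    for (i = 0; i < p.size; i++) {
        if (p.base[i] == '\0') {
            *len = i;
            return 1;
        }
    }
    return 0;
}
```

Every pointer would need to carry the extra size field and every access would need the extra comparison, which is where the overhead comes from.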

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #21

James McIninch

<posted & mailed>

By definition, a character array without a null terminator is not a string.

Calling strlen on something that isn't a string will cause undefined behavior
(an error).

roy wrote:

Hi,

I was wondering how strlen is implemented.
What if the input string doesn't have a null terminator, namely the
'\0'?
Thanks a lot
Roy

--
Remove '.nospam' from e-mail address to reply by e-mail

Nov 14 '05 #22

Mark McIntyre

On Sun, 24 Apr 2005 00:32:23 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:

Mark McIntyre <ma**********@spamcop.net> writes:
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<ro*****@hotmail.com> wrote:

[...]
But from my experimental results, it seems
that strlen can still return the number of characters of a char array.

How can it do that? It's /required/ to search for the terminating null.
Your compiler is either not standard compliant, or it's exhibiting
random behaviour.

strlen() is almost certainly finding a zero byte immediately after his
array. I'd expect that to be a very common manifestation of the
undefined behavior in this case.

That comes under my definition of 'random' - it's by chance finding a
null just shortly after the string, possibly due to some debugging
mode 'helpfulness'.

Of course, if the string were zero length, then....
:-)

--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>

Nov 14 '05 #23

Emmanuel Delahaye

roy wrote on 23/04/05 :

Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

If the string is malformed (missing terminating 0), the behaviour is
undefined. Anything could happen.

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

..sig under repair

Nov 14 '05 #24

Emmanuel Delahaye

Stan Milam wrote on 24/04/05 :

So, you are saying this is a poorly implemented compiler?

What would be a better implementation? If the limit is not there,
anything can happen. Blame the coder, not the compiler.

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"Clearly your code does not meet the original spec."
"You are sentenced to 30 lashes with a wet noodle."
-- Jerry Coffin in a.l.c.c++

Nov 14 '05 #25

Emmanuel Delahaye

Joe Estock wrote on 23/04/05 :

Interesting seeing \0 so widely in use. On most systems, NULL is defined as
\0, however there are a few special cases where it is not. Shouldn't we be
using NULL instead of \0?

No, because here, we are talking about the null character that is 0 or
'\0' (but I'm too lazy to type the latter).

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"C is a sharp tool"

Nov 14 '05 #26

Stan Milam wrote:

Keith Thompson wrote:

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Okay guys, that was a joke.

Nov 14 '05 #27

I checked my libraries, and the following may be faster than the above:

#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE
#include <limits.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#ifdef _ALIGN
#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))
#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)
#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) & 0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE
#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
unsigned long *aligned_addr;

if (!UNALIGNED1(s)) {
aligned_addr = (unsigned long *) s;
while (!DETECTNULL(*aligned_addr))
aligned_addr++;
/* The block of bytes currently pointed to by aligned_addr
contains a null. We catch it using the bytewise search. */
s = (const char *) aligned_addr;
}
#endif
while (*s)
s++;
return (size_t) (s - t);
}

/* Gregory Pietsch */
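
The zero-byte test in the 32-bit DETECTNULL macro can be tried on its own. This is a sketch assuming uint32_t is available from <stdint.h>; has_zero_byte() and load4() are names made up here, and load4() uses memcpy() instead of the pointer cast in the code above.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes of x is zero. This is the same
   expression as the 32-bit DETECTNULL macro, written for uint32_t. */
uint32_t has_zero_byte(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}

/* Load four bytes into a uint32_t with memcpy(), which avoids the
   alignment and aliasing questions raised by casting the pointer. */
uint32_t load4(const char *p)
{
    uint32_t w;

    memcpy(&w, p, sizeof w);
    return w;
}
```

The test only says that some byte in the word is zero, not which one, which is why the bytewise loop still has to finish the job.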

Nov 14 '05 #28

NULL is usually reserved for the null pointer. Here, we're checking for
the null character, '\0'.

Gregory Pietsch

Nov 14 '05 #29

Flash Gordon

Gregory Pietsch wrote:

I checked my libraries,

Do you mean your personal libraries or your implementation's? Remember
that the implementation is allowed to do things you are not allowed to do.

and the following may be faster than the above:

What above? Please quote enough of the message you are replying to for
us to see what you are talking about. There is an option that gets
Google to do the right thing and if you search the group I'm sure you
will find the instructions. It's in someone's sig, but I can't remember who.

#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE

An implementation could declare that or not for any reason it wants.

#include <limits.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#ifdef _ALIGN

Again, a compiler could declare that or not as it saw fit.

#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))

There is no guarantee that this will tell you if it is aligned. Some
people around here have worked on word addressed systems where the byte
within the word was flagged in the *high* bits of the address.

#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

Misleading name, I initially read that as a screwy attempt to detect a
NULL pointer. DETECTNULCHAR would be better.

#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) & 0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE

Isn't that macro you are defining in the implementation name space?
Anything could happen.

#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
unsigned long *aligned_addr;

if (!UNALIGNED1(s)) {
aligned_addr = (unsigned long *) s;
while (!DETECTNULL(*aligned_addr))
aligned_addr++;

The above could read bytes off the end of a properly nul terminated
string. For example,
size_t len = strlen("a");

/* The block of bytes currently pointed to by aligned_addr
contains a null. We catch it using the bytewise search. */
s = (const char *) aligned_addr;
}
#endif
while (*s)
s++;
return (size_t) (s - t);

No need to cast the result of the subtraction. The compiler already
knows it is returning a size_t so will do the conversion anyway.

}

/* Gregory Pietsch */

--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

Nov 14 '05 #30

Lawrence Kirby

On Sun, 24 Apr 2005 05:02:10 +0000, Keith Thompson wrote:

Stan Milam <st*****@swbell.net> writes:
Keith Thompson wrote:
Mark McIntyre <ma**********@spamcop.net> writes:
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<ro*****@hotmail.com> wrote:
[...]

But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
[...]
I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky
No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

Lawrence

Nov 14 '05 #31

Flash Gordon wrote:

Gregory Pietsch wrote:
I checked my libraries,

Do you mean your personal libraries or your implementation's? Remember
that the implementation is allowed to do things you are not allowed to do.

It was my implementation, based on unravelling the "while(*s)s++" loop.

and the following may be faster than the above:

What above? Please quote enough of the message you are replying to
for us to see what you are talking about. There is an option that gets
Google to do the right thing and if you search the group I'm sure you
will find the instructions. It's in someone's sig, but I can't remember who.

#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE

An implementation could declare that or not for any reason it wants.

If _OPTIMIZED_FOR_SIZE is not defined, the implementation tries to unravel
the "while(*s)s++" loop somewhat.

#include <limits.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#ifdef _ALIGN

Again, a compiler could declare that or not as it saw fit.

There's no way to portably detect whether a pointer-to-char is aligned
on a long boundary, is there?

#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))

There is no guarantee that this will tell you if it is aligned. Some
people around here have worked on word addressed systems where the
byte within the word was flagged in the *high* bits of the address.

I bet that makes for some funky internal pointer arithmetic!

#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

Misleading name, I initially read that as a screwy attempt to detect
a NULL pointer. DETECTNULCHAR would be better.

#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) & 0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE

Isn't that macro you are defining in the implementation name space?
Anything could happen.

I tried two types of optimizations, one for time (try to unravel the
loop) and one for size. If I get a kind of system where casting
a pointer-to-char to a pointer-to-unsigned-long doesn't make much
sense, #defining _OPTIMIZED_FOR_SIZE allows me to leave out code that
wouldn't work in that situation.

#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
unsigned long *aligned_addr;

if (!UNALIGNED1(s)) {
aligned_addr = (unsigned long *) s;
while (!DETECTNULL(*aligned_addr))
aligned_addr++;

The above could read bytes off the end of a properly nul terminated
string. For example,
size_t len = strlen("a");

I'm testing for having a null character somewhere among the characters
that make up the area that aligned_addr points to. If I don't get a
sane environment (as indicated by the _OPTIMIZED_FOR_SIZE macro), this
code isn't even compiled in.

Here's the general idea: suppose, for example, sizeof(unsigned long) is
4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
don't care if *aligned_addr is big-end-aligned or little-end-aligned.
Oh, well, is there a better way to unravel "while(*s)s++"?

/* The block of bytes currently pointed to by aligned_addr
contains a null. We catch it using the bytewise search. */
s = (const char *) aligned_addr;
}
#endif
while (*s)
s++;
return (size_t) (s - t);

No need to cast the result of the subtraction. The compiler already
knows it is returning a size_t so will do the conversion anyway.

The cast is only for my eyes. ;-)

}

/* Gregory Pietsch */

--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

Gregory Pietsch

Nov 14 '05 #32

Lawrence Kirby <lk****@netactive.co.uk> writes:

On Sun, 24 Apr 2005 05:02:10 +0000, Keith Thompson wrote:

[...]

First, strlen() is part of the runtime library, not part of the
compiler.

It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

You're right. I should have said that strlen() is *typically
implemented as* part of the runtime library, not part of the compiler.
(I don't know how many compilers generate inline code, and therefore
how accurate "typically" is.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #33

Christian Bau

In article <11**********************@l41g2000cwc.googlegroups .com>,
"roy" <ro*****@hotmail.com> wrote:

Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

You are not lucky, you are unlucky.

If you were lucky, your program would crash as soon as you try this, and
then you would know there is a bug that needs fixing. If you are
unlucky, you get a result that doesn't show the bug.

Nov 14 '05 #34

Tim Prince

"Keith Thompson" <ks***@mib.org> wrote in message
news:ln************@nuthaus.mib.org...

Lawrence Kirby <lk****@netactive.co.uk> writes:
On Sun, 24 Apr 2005 05:02:10 +0000, Keith Thompson wrote:

[...]
First, strlen() is part of the runtime library, not part of the
compiler.

It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

You're right. I should have said that strlen() is *typically
implemented as* part of the runtime library, not part of the compiler.
(I don't know how many compilers generate inline code, and therefore
how accurate "typically" is.)

Several common compilers, both commercial and free software, have both
in-line and library implementations, as provided for in standard C (both C89
and C99). In normal usage, not allowing for both possibilities would open
up the possibility of Undefined Behavior.
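
A sketch of the two standard ways to reach the library function itself, regardless of whether <string.h> also supplies a macro or the compiler expands calls in-line. The wrapper names are invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Parenthesizing the name suppresses any function-like macro that
   <string.h> may provide, so this calls the actual library function. */
size_t len_via_function(const char *s)
{
    return (strlen)(s);
}

/* Taking the address likewise refers to the real function, since a
   function-like macro only expands when followed by '('. */
size_t len_via_pointer(const char *s)
{
    size_t (*fn)(const char *) = strlen;

    return fn(s);
}
```

This is also why the implementations posted earlier in the thread write the definition as (strlen) with parentheses.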

Nov 14 '05 #35

Chris Torek

In article <11**********************@z14g2000cwz.googlegroups .com>,
Gregory Pietsch <GK**@flash.net> wrote:

There's no way to portably detect whether a pointer-to-char is aligned
on a long boundary, is there?

No (at least, not if by "portable" you mean what we usually do in
comp.lang.c :-) ... there are versions that are "portable" to those
systems that define an alignment function or macro, such as all
the BSD variants).
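
For systems with flat byte addressing, a best-effort check might look like the sketch below. It is not one of the BSD macros; looks_unaligned() is a made-up name, and it assumes a C11 compiler (for _Alignof) and that uintptr_t exists, which the standard leaves optional.

```c
#include <stdint.h>

/* Best-effort test: nonzero if p looks misaligned for unsigned long.
   Assumes the low bits of the converted pointer value reflect the
   byte position, which is common but not guaranteed by the standard. */
int looks_unaligned(const void *p)
{
    return ((uintptr_t)p % _Alignof(unsigned long)) != 0;
}
```

On a word-addressed machine that keeps the byte offset in the high bits of a pointer, this would give the wrong answer, which is exactly the objection raised earlier.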

[code using things like]

#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

I tried two types of optimizations, one for time (try to unravel the
loop) and one for size. ... Here's the general idea: suppose, for example, sizeof(unsigned long) is
4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
don't care if *aligned_addr is big-end-aligned or little-end-aligned.
Oh, well, is there a better way to unravel "while(*s)s++"?

Maybe, maybe not. It is quite CPU-dependent.

For whatever it is worth (perhaps not much at this point), I tried
the above trick in SPARC assembly code when I was writing the 4.4BSD
C library routines for the SPARC. (I wrote many of the "portable"
routines as well; we set things up so that when you built for VAX,
Tahoe, or SPARC, you got either the machine-specific version or the
generic, depending on whether we had written a machine-specific
version.)

The result was that the fancy version using "four byte at a time"
scans (on aligned pointers) was significantly *slower* than the
dumb, simple, one-byte-at-a-time version, even for relatively long
strings. I was a bit surprised; and the results might be different
on a more modern CPU (this was back in 1991 or so).

(I wrote the whole thing in assembly -- well, in C at first, compiled
to assembly, then hand-edited -- so I know it was not the compiler
doing anything tricky, either.)

It turns out that in most C programs, most strings are very short.
The "Dhrystone" tests that many people used to use to compare C
library implementations use strings that are significantly longer
than average, and overemphasize the time behavior of strlen(),
strcpy(), and strcmp() on relatively long strings. Even for these
longer strings, the "optimized" strlen() was still slower.

Of course, this "most C strings are short" rule of thumb may come
about because most C libraries are optimized for short strings
because most strings are short because most C libraries are optimized
for short strings, etc. :-) In other words, if you have a lot of
long strings, and you do program optimization, you will avoid
calling strlen() on them so much.

Even if one breaks this initial chicken-and-egg loop (by calling
strlen() repeatedly on long strings), and then optimizes the heck
out of strlen(), one can probably still speed up one's programs by
fixing the repeated calls to strlen(). There is another rule of
thumb that applies beyond just C programming, or even computers:

The shortest, fastest, cheapest, and most reliable parts of
any system are the ones that are not there.

(This is another way of putting the "KISS" principle. Of course,
marketing usually gets in the way of this idea. :-) )
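
As a small illustration of "fixing the repeated calls", here is a
sketch with two hypothetical helpers (the names are made up for this
example). The first calls strlen() once instead of in the loop
condition; the second does not call it at all, because the loop can
stop at the '\0' directly.

```c
#include <stddef.h>
#include <string.h>

/* Calls strlen() once, before the loop, rather than in the loop
   condition where it could be re-evaluated on every iteration. */
size_t count_char_hoisted(const char *s, char c)
{
    size_t n = strlen(s);
    size_t hits = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (s[i] == c)
            hits++;
    return hits;
}

/* Does not call strlen() at all: the loop stops at the '\0' itself. */
size_t count_char(const char *s, char c)
{
    size_t hits = 0;

    for (; *s != '\0'; s++)
        if (*s == c)
            hits++;
    return hits;
}
```

A compiler may or may not hoist a strlen() call out of a loop
condition by itself, so writing the loop this way avoids depending
on that.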
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Nov 14 '05 #36

Peter Ammon

Keith Thompson wrote:

Stan Milam <st*****@swbell.net> writes:
Keith Thompson wrote:
Mark McIntyre <ma**********@spamcop.net> writes:

On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<ro*****@hotmail.com> wrote:

[...]
>But from my experimental results, it seems
>that strlen can still return the number of characters of a char array.
[...]
>I am just not sure whether I am just lucky or sth else happened inside
>strlen.

lucky

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

An implementation of strlen() that was able to detect the case where
the argument points to the first element of an array that doesn't
contain any '\0' characters would most likely add significant overhead
to *all* operations. The obvious way to implement it is to make all
pointers "fat", so each pointer includes both the base address and
bounds information; strlen() would then have to check the bounds.
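
The "fat pointer" idea can be imitated by hand to show where the
overhead comes from. This is only a sketch: struct fat_ptr and
checked_strlen are hypothetical names, not anything a real
implementation provides, and the caller has to fill in the size.

```c
#include <stddef.h>

/* Hypothetical "fat" pointer: the address plus the array's size. */
struct fat_ptr {
    const char *base;
    size_t      size;   /* number of chars in the array */
};

/* Stores the length in *len and returns 0 if a '\0' is found within
   the bounds; returns -1 (leaving *len alone) if there is none. */
int checked_strlen(struct fat_ptr fp, size_t *len)
{
    size_t i;

    for (i = 0; i < fp.size; i++) {
        if (fp.base[i] == '\0') {
            *len = i;
            return 0;
        }
    }
    return -1;
}
```

Every character now costs a bounds comparison as well as the test
for '\0', and every pointer carries an extra size_t.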

A simpler way would be to insert a padding byte containing zero after
every char array.

-Peter

--
Pull out a splinter to reply.

Nov 14 '05 #37

CBFalconer

Chris Torek wrote:

.... snip ...
Even if one breaks this initial chicken-and-egg loop (by calling
strlen() repeatedly on long strings), and then optimizes the heck
out of strlen(), one can probably still speed up one's programs by
fixing the repeated calls to strlen(). There is another rule of
thumb that applies beyond just C programming, or even computers:

The shortest, fastest, cheapest, and most reliable parts of
any system are the ones that are not there.

(This is another way of putting the "KISS" principle. Of course,
marketing usually gets in the way of this idea. :-) )

My suggestion is to try to return the length from most routines
that uncover it, in place of an insipid pointer to the original
string. strlcpy and strlcat follow this practice. So do printf
and sprintf.
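
A minimal sketch of that style, using a hypothetical helper (the name
copy_returning_len is made up here, and unlike strlcpy it takes no
size argument, so the caller must guarantee the destination is large
enough):

```c
#include <stddef.h>

/* Copies src into dst, including the '\0', and returns the number of
   characters copied, not counting the '\0'. */
size_t copy_returning_len(char *dst, const char *src)
{
    size_t i = 0;

    while ((dst[i] = src[i]) != '\0')
        i++;
    return i;
}
```

The caller can then append without rescanning, for example
n = copy_returning_len(buf, "abc"); followed by
n += copy_returning_len(buf + n, "def");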

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Nov 14 '05 #38

pete

Stan Milam wrote:

Stan Milam wrote:
Keith Thompson wrote:

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Okay guys, that was a joke.

No, it wasn't.
Your posts in the "C FAQ 3.1" thread show that you don't see
the beauty of the concept of undefined behavior.

If you're going to write bad code,
then the C standard committee doesn't care about
what happens as a consequence.

This philosophy was in C originally,
and is maintained in the current C99 standard.

It's not that R was in too much of a hurry specifying C,
so that he didn't have enough time
to also specify what garbage code should do,
but rather it's the case that compiler writers
are in too much of a hurry writing compilers
to want to care about how to translate garbage code.

--
pete

Nov 14 '05 #39

pete

Gregory Pietsch wrote:

There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(char *s)
{
    char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.
N869
6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference
of the subscripts of the two array elements. The size of the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.

--
pete

Nov 14 '05 #40

CBFalconer

pete wrote:

Gregory Pietsch wrote:

There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(char *s)
{
    char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.

N869
6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference of
the subscripts of the two array elements. The size of the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.

Huh? size_t and ptrdiff_t are both integral types, the first being
unsigned, and the second signed. The code above ensures that the
ptrdiff_t value is not negative. I fail to see anything undefined
if we ignore the fact that strlen can only be defined in the
implementation.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #41

pete

CBFalconer wrote:

pete wrote:
Gregory Pietsch wrote:

There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(char *s)
{
    char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.

N869
6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference of
the subscripts of the two array elements. The size of the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.

Huh?

A string longer than PTRDIFF_MAX breaks the code.

It's supposed to be an example of a standard library function
written in portable C code, right?
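
One way to sidestep the question, sketched under a hypothetical name
so that it does not clash with the library's strlen(): count with a
size_t index, so that no pointer subtraction and no ptrdiff_t value
is involved. The parameter is also declared const char *, as in the
standard prototype.

```c
#include <stddef.h>

/* Length of the string s, counted in a size_t; no (p - s) is formed. */
size_t my_strlen(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}
```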

--
pete

Nov 14 '05 #42

Joe Wright

pete wrote:

CBFalconer wrote:
pete wrote:
Gregory Pietsch wrote:

There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(char *s)
{
    char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.

N869
6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference of
the subscripts of the two array elements. The size of the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.

Huh?

A string longer than PTRDIFF_MAX breaks the code.

It's supposed to be an example of a standard library function
written in portable C code, right?

Assuming ptrdiff_t is long and 32 bits on a 32-bit machine, a string of
2,147,483,648 bytes will probably break lots of things before you ever
get to run strlen() on it.

Show us a case where (p - s) can be out-of-bounds.

--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Nov 14 '05 #43

pete

Joe Wright wrote:

pete wrote:
CBFalconer wrote:
pete wrote:

Gregory Pietsch wrote:

>There has to be a null terminator somewhere.
>
>Here's a short implementation:
>
>#include <string.h>
>size_t (strlen)(char *s)
>{
> char *p = s;
>
> while (*p != '\0')
> p++;
> return (size_t)(p - s);
>}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.

N869
6.5.6 Additive operators
[#9] If the result is not representable in an object of that
type, the behavior is undefined.

Show us a case where (p - s) can be out-of-bounds.

What do you think that part of the standard means?

--
pete

Nov 14 '05 #44

CBFalconer

pete wrote:

CBFalconer wrote:
pete wrote:
Gregory Pietsch wrote:

There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(char *s)
{
    char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.

N869
6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference of
the subscripts of the two array elements. The size of the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.

Huh?

A string longer than PTRDIFF_MAX breaks the code.

It's supposed to be an example of a standard library function
written in portable C code, right?

But the string exists, thus the ptrdiff_t value exists by
_definition_. The exit values of p and s are valid, point to the
same entity, so the difference exists.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #45

Richard Tobin

In article <42***************@yahoo.com>,
CBFalconer <cb********@worldnet.att.net> wrote:

6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference of
the subscripts of the two array elements. The size of the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.

But the string exists, thus the ptrdiff_t value exists by
_definition_.

I thought the point of quoting the above paragraph was to show that
there can be cases where the difference between two pointers in the
same array *doesn't* exist, in that attempting to calculate it may
produce undefined behaviour. If no array can exist that's bigger than
the biggest ptrdiff_t value, what's the point of the last sentence of
the paragraph?
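
Whether an object can be larger than PTRDIFF_MAX on a given
implementation can at least be probed from the C99 <stdint.h> limits.
The following is a sketch with hypothetical function names; it only
compares the two maxima and says nothing about whether such an object
could really be allocated.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 when size_t can express a size that ptrdiff_t cannot
   represent, 0 otherwise. */
int size_can_exceed_ptrdiff(void)
{
    return (uintmax_t)SIZE_MAX > (uintmax_t)PTRDIFF_MAX;
}

/* Prints both limits for the implementation it is compiled on. */
void print_limits(void)
{
    printf("PTRDIFF_MAX = %jd\n", (intmax_t)PTRDIFF_MAX);
    printf("SIZE_MAX    = %ju\n", (uintmax_t)SIZE_MAX);
}
```

Where the first function returns 1, the type system at least leaves
room for an array whose end-to-start difference does not fit in
ptrdiff_t, which is the situation the last sentence of the quoted
paragraph addresses.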

-- Richard

Nov 14 '05 #46