strcmp but with '\n' as the terminator

Hi there,
I am reading a file into a char array, and I want to find if a string exists
in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there a
similar function that will do this, or will I have to write my own?
Thanks
Allan
Nov 13 '05 #1
Allan Bruce wrote:
Hi there,
I am reading a file into a char array, and I want to find if a string
exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there a
similar function that will do this, or will I have to write my own?


Assuming that for some strange reason you haven't yet switched over to my
"stretchy string" routines (which abuse the word "string", since they work
on non-null-terminated data), the easiest way to do what you want, if the
"string" is writeable, is to find the \n, change it to \0, do the strstr,
and then change it back again. If you're doing this a lot, though, you
should beware, as it's not a very efficient solution; in which case, you'd
want to write your own, I guess (unless someone has a better idea).
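
A minimal sketch of that find/replace/restore trick, assuming the buffer is writable and that you know how many valid bytes remain after the start of the line (the name find_in_line and its parameters are made up for illustration):

```c
#include <stddef.h>
#include <string.h>

/* Search one '\n'-terminated line for needle.
 * line      - start of the line inside a writable buffer
 * remaining - number of valid bytes from line to the end of the buffer
 * Returns a pointer to the match, or NULL if there is no '\n' within
 * remaining bytes or needle does not occur on this line. */
char *find_in_line(char *line, size_t remaining, const char *needle)
{
    char *nl = memchr(line, '\n', remaining); /* bounded, cannot overrun */
    char *hit;

    if (nl == NULL)
        return NULL;
    *nl = '\0';                  /* temporarily make the line a C string */
    hit = strstr(line, needle);
    *nl = '\n';                  /* put the newline back */
    return hit;
}
```

memchr is used rather than strchr to locate the '\n' because strchr itself needs a terminating '\0' somewhere.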

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #2

"Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote in message
news:bf**********@news.freedom2surf.net...
Hi there,
I am reading a file into a char array, and I want to find if a string exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there a
similar function that will do this, or will I have to write my own?


If you are looking for a substring in a string you can use
strstr()

Syntax:

#include <string.h>
char *strstr(const char *s1, const char *s2);

Description:

Scans a string for the occurrence of a given substring.

strstr scans s1 for the first occurrence of the substring s2.

Return Value

strstr returns a pointer to the element in s1, where s2 begins (points to s2
in s1). If s2 does not occur in s1, strstr returns null.

HTH
cw
Nov 13 '05 #3
Richard Heathfield wrote:
Allan Bruce wrote:

Hi there,
I am reading a file into a char array, and I want to find if a string
exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there a
similar function that will do this, or will I have to write my own?

Assuming that for some strange reason you haven't yet switched over to my
"stretchy string" routines (which abuse the word "string", since they work
on non-null-terminated data), the easiest way to do what you want, if the
"string" is writeable, is to find the \n, change it to \0, do the strstr,
and then change it back again. If you're doing this a lot, though, you
should beware, as it's not a very efficient solution; in which case, you'd
want to write your own, I guess (unless someone has a better idea).


Maybe he can ignore the \n and just use strstr() instead? He won't get
exact matches for the whole line, but he will "find if a string exists
in a given line". ;-)
--
boa

libclc home: http://libclc.sourceforge.net

Nov 13 '05 #4
In 'comp.lang.c', "Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote:
I am reading a file into a char array, and I want to find if a string
exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there
a similar function that will do this, or will I have to write my own?


The usual trick is to remove the '\n' from the read line:

#include <string.h>

<...>

{
    char *p = strchr (line, '\n'); /* search ... */

    if (p)
    {
        *p = 0; /* ... and kill. */
    }
}

--
-ed- em**********@noos.fr [remove YOURBRA before answering me]
The C-language FAQ: http://www.eskimo.com/~scs/C-faq/top.html
FAQ de f.c.l.c : http://www.isty-info.uvsq.fr/~rumeau/fclc/
Nov 13 '05 #5

"Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote in message
news:bf**********@news.freedom2surf.net...

"code_wrong" <ta*@tac.ouch.co.uk> wrote in message
news:bf**********@newsg2.svr.pol.co.uk...

"Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote in message
news:bf**********@news.freedom2surf.net...
Hi there,
I am reading a file into a char array, and I want to find if a string exists
in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is
there a similar function that will do this, or will I have to write my own?
If you are looking for a substring in a string you can use
strstr()

Syntax:

#include <string.h>
char *strstr(const char *s1, const char *s2);

Description:

Scans a string for the occurrence of a given substring.

strstr scans s1 for the first occurrence of the substring s2.

Return Value

strstr returns a pointer to the element in s1, where s2 begins (points to s2
in s1). If s2 does not occur in s1, strstr returns null.

HTH
cw


I have used strstr mainly, not strcmp as my post indicates! (doh)
But the problem is still that strstr requires a null terminator and not
'\n'.
Any ideas?


Which function are you using to read your file?
If you read your file with fgets() then you will get null terminated strings
to play with.
Of course there is still the newline character to take into account, but
that will not matter if you use strstr() to check for substrings.
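
A small sketch of the fgets() route, assuming the file is read line by line. The function name file_contains and the 256-byte line buffer are arbitrary choices for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if any line read from fp contains needle, 0 otherwise. */
int file_contains(FILE *fp, const char *needle)
{
    char line[256];

    while (fgets(line, sizeof line, fp) != NULL) {
        /* fgets always stores a terminating '\0', so strstr is safe;
         * the trailing '\n', if present, is just another character. */
        if (strstr(line, needle) != NULL)
            return 1;
    }
    return 0;
}
```

One limitation of this sketch: a line longer than the buffer is read in pieces, so a match that straddles two pieces would be missed.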

cw
Nov 13 '05 #6
Bj[o]rn Augestad wrote:
Richard Heathfield wrote: <all snipped>
Allan Bruce wrote:

Hi there,
I am reading a file into a char array, and I want to find if a string
exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Is there a
similar function that will do this, or will I have to write my own?

<snip>
Maybe he can ignore the \n and just use strstr() instead? He won't get
exact matches for the whole line, but he will "find if a string exists
in a given line". ;-)


Well, he did say quite clearly that there was no '\0' at the end of the
data. Or did I misunderstand him?

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #7
Emmanuel Delahaye wrote:
In 'comp.lang.c', "Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote:
I am reading a file into a char array, and I want to find if a string
exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there
a similar function that will do this, or will I have to write my own?


The usual trick is to remove the '\n' from the read line:

#include <string.h>

<...>

{
char *p = strchr (line, '\n'); /* search ... */


Undefined behaviour if line has no terminating null character, as the OP has
pointed out twice now.

<snip>

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #8
Richard Heathfield wrote:
Bj[o]rn Augestad wrote:

Richard Heathfield wrote: <all snipped>
Allan Bruce wrote:

Hi there,
I am reading a file into a char array, and I want to find if a string
exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Is there a
similar function that will do this, or will I have to write my own?

<snip>
Maybe he can ignore the \n and just use strstr() instead? He won't get
exact matches for the whole line, but he will "find if a string exists
in a given line". ;-)

Well, he did say quite clearly that there was no '\0' at the end of the
data. Or did I misunderstand him?


I don't know. :-)

I was just assuming(I know, I know...) that the OP was reading a file
line by line using fgets() and then tried to match some string with the
line read, but ran into problems because of the trailing \n.

Only time and some source code will tell. ;-)

--
boa

libclc home: http://libclc.sourceforge.net

Nov 13 '05 #9

"Bjørn Augestad" <bo*@metasystems.no.spam.to.me> wrote in message
news:RW********************@juliett.dax.net...
Richard Heathfield wrote:
Bj[o]rn Augestad wrote:

Richard Heathfield wrote: <all snipped>

Allan Bruce wrote:

>Hi there,
>I am reading a file into a char array, and I want to find if a string
>exists in a given line.
>I cant use strcmp since the line ends with '\n' and not '\0'.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>Is there a
>similar function that will do this, or will I have to write my own?

<snip>
Maybe he can ignore the \n and just use strstr() instead? He won't get
exact matches for the whole line, but he will "find if a string exists
in a given line". ;-)

Well, he did say quite clearly that there was no '\0' at the end of the
data. Or did I misunderstand him?


I don't know. :-)

I was just assuming(I know, I know...) that the OP was reading a file
line by line using fgets() and then tried to match some string with the
line read, but ran into problems because of the trailing \n.

Only time and some source code will tell. ;-)

--
boa

libclc home: http://libclc.sourceforge.net


I am using this to read the file:
// find how big the file is
fseek(fptr, 0, SEEK_END);
size = ftell(fptr);

//allocate memory for string
if ( (contents = new char[size]) == NULL)
return 0;

Basically reading it in one big chunk, since I am doing some things to the
code that take a long time so I wanted to keep the file open for as little
time as possible.
I use strstr() to find some matches and also use strcmp() to see if some are
true. For example, a line may be:
# Material: Porsche_Body
Now this will be stored with a '\n' at the end but no '\0'.
In this example I wish to search for "Material:" using strstr(), but if it
doesn't exist then strstr() causes undefined behaviour. If strstr() is
successful, then I want to see if the material name matches what I already
have loaded using strcmp(), but since the '\0' isn't there - problems. I
could count how many chars there are until the '\n' and then use strncmp() I
suppose, but that doesn't get around the strstr() problem, and I want to know
for the future how to use strcmp with a '\n' terminator.
From the gist of it, I should program my own function, or better still a
macro.
Am I correct?
Thanks
Allan
Nov 13 '05 #10
Allan Bruce wrote:

<snip>
Now this will be stored with a '\n' at the end but no '\0'.
In this example I wish to search for "Material:" using strstr() but if it
doesnt exist then strstr() is causing undefined behaviour. If strstr() is
successful, then I want to see if the material name matches what I already
have loaded using strcmp() but since the '\0' isnt there - problems. I
count how many chars until the '\n' and then use strncmp I suppose, but
that doesnt get around the strstr() and I want to know for future how to
use strcmp with '\n' terminator.
From the gist of it, I should program my own function, or better still
macro.
Am I correct?


The simplest solution is to ensure that the string is null-terminated, by
allocating one byte more than you need for the data, and writing a '\0'
character into that byte.
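
A sketch of that approach in C, assuming the stream was opened in binary mode so that the ftell() value is a byte count. The function name read_whole_file is hypothetical, and malloc() stands in for the new[] in the snippet above:

```c
#include <stdio.h>
#include <stdlib.h>

/* Read all of fp into a freshly allocated, '\0'-terminated buffer.
 * On success returns the buffer (caller frees it) and, if len_out is
 * not NULL, stores the number of bytes read. Returns NULL on failure. */
char *read_whole_file(FILE *fp, size_t *len_out)
{
    long size;
    size_t got;
    char *contents;

    if (fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    size = ftell(fp);
    if (size < 0 || fseek(fp, 0L, SEEK_SET) != 0)
        return NULL;

    contents = malloc((size_t)size + 1);    /* one extra byte for '\0' */
    if (contents == NULL)
        return NULL;

    got = fread(contents, 1, (size_t)size, fp);
    contents[got] = '\0';                   /* terminate what was read */
    if (len_out != NULL)
        *len_out = got;
    return contents;
}
```

With the terminator in place, strstr() on the buffer is well defined, although it will search across line boundaries unless the lines are split first.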

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #11

"Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote in message
news:bf**********@news.freedom2surf.net...
Hi there,
I am reading a file into a char array, and I want to find if a string exists in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there a
similar function that will do this, or will I have to write my own?
Thanks
Allan


You can use use strncmp for that, but only if you know how big the total
buffer is so you know where to stop.
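
For the strcmp-like case, here is a rough sketch that compares a '\n'-terminated line against an ordinary C string, assuming the caller knows how many bytes are left in the buffer. The name line_equals is made up, and memcmp is used in place of strncmp since both lengths are known by then:

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the text from line up to (not including) the next '\n',
 * or up to the end of the buffer if there is no '\n', is exactly the
 * C string s; otherwise return 0. */
int line_equals(const char *line, size_t remaining, const char *s)
{
    const char *nl = memchr(line, '\n', remaining);
    size_t line_len = (nl != NULL) ? (size_t)(nl - line) : remaining;

    return strlen(s) == line_len && memcmp(line, s, line_len) == 0;
}
```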
Nov 13 '05 #12

"Serve Laurijssen" <ik@thuis.nl> wrote in message
news:bf**********@news4.tilbu1.nb.home.nl...

"Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote in message
news:bf**********@news.freedom2surf.net...
Hi there,
I am reading a file into a char array, and I want to find if a string exists
in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is there a
similar function that will do this, or will I have to write my own?
Thanks
Allan


You can use use strncmp for that, but only if you know how big the total
buffer is so you know where to stop.


ah what the hell, here's some sample code.
Had a little bit too much whiskey, so trying to find all bugs is left as an
exercise for you :)

#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* for strlen and strncmp */

int main(void)
{
    int i;
    char buf[10] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j' };
    char *findstr = "ijk";

    for (i = 0; i <= (sizeof buf / sizeof *buf)-strlen(findstr); i++)
    {
        if (strncmp(buf+i, findstr, 2) == 0)
            printf("%.3s\n", buf+i);
    }
    puts("done");
    return 0;
}
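
For comparison, a hedged sketch of a bounded, strstr-like search that stops at the end of the line. The name line_find is hypothetical. It checks the needle length before looping, so the subtraction of one unsigned length from another, as in the loop condition above, never has a chance to wrap around:

```c
#include <stddef.h>
#include <string.h>

/* Find needle within the line that starts at line and ends at the next
 * '\n' (or at the end of the buffer if there is none). remaining is the
 * number of valid bytes from line onwards. Returns a pointer to the
 * first match on that line, or NULL. */
const char *line_find(const char *line, size_t remaining, const char *needle)
{
    const char *nl = memchr(line, '\n', remaining);
    size_t line_len = (nl != NULL) ? (size_t)(nl - line) : remaining;
    size_t n = strlen(needle);
    size_t i;

    if (n > line_len)
        return NULL;
    for (i = 0; i + n <= line_len; i++) {
        if (memcmp(line + i, needle, n) == 0)
            return line + i;
    }
    return NULL;
}
```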
Nov 13 '05 #13

"Serve Laurijssen" <ik@thuis.nl> wrote in message
news:bf**********@news4.tilbu1.nb.home.nl...

"Serve Laurijssen" <ik@thuis.nl> wrote in message
news:bf**********@news4.tilbu1.nb.home.nl...

"Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote in message
news:bf**********@news.freedom2surf.net...
Hi there,
I am reading a file into a char array, and I want to find if a string exists
in a given line.
I cant use strcmp since the line ends with '\n' and not '\0'. Is
there a similar function that will do this, or will I have to write my own?
Thanks
Allan
You can use use strncmp for that, but only if you know how big the total
buffer is so you know where to stop.


ah what the hell, here's some sample code.
Had a little bit too much whiskey, so trying to find all bugs is left as
an exercise for you :)

#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* for strlen and strncmp */

int main(void)
{
    int i;
    char buf[10] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j' };
    char *findstr = "ijk";

    for (i = 0; i <= (sizeof buf / sizeof *buf)-strlen(findstr); i++)
    {
        if (strncmp(buf+i, findstr, 2) == 0)
            printf("%.3s\n", buf+i);
    }
    puts("done");
    return 0;
}


Thanks, will have a look.
I have to say one thing though, I hope you are drinking Scotch since it is
the best, but I may be biassed, but if you are its spelt whisky. Only the
Irish could spell it differently! (Sorry to my Irish mates...)
Allan
Nov 13 '05 #14
In 'comp.lang.c', "Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote:
I have to say one thing though, I hope you are drinking Scotch since it is
the best, but I may be biassed, but if you are its spelt whisky. Only the
best what? Do you mean pure malt? Forget those blended piece of shit.
Irish could spell it differently! (Sorry to my Irish mates...)


What the hell? Do you mean whiskey? The Whiskey!

--
-ed- em**********@noos.fr [remove YOURBRA before answering me]
The C-language FAQ: http://www.eskimo.com/~scs/C-faq/top.html
FAQ de f.c.l.c : http://www.isty-info.uvsq.fr/~rumeau/fclc/
Nov 13 '05 #15

"Emmanuel Delahaye" <em**********@noos.fr> wrote in message
news:Xn************************@130.133.1.4...
In 'comp.lang.c', "Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote:
I have to say one thing though, I hope you are drinking Scotch since it is
the best, but I may be biassed, but if you are its spelt whisky. Only the
best what? Do you mean pure malt? Forget those blended piece of shit.
Irish could spell it differently! (Sorry to my Irish mates...)


What the hell? Do you meant whiskey? The Whiskey!

--
-ed- em**********@noos.fr [remove YOURBRA before answering me]
The C-language FAQ: http://www.eskimo.com/~scs/C-faq/top.html
FAQ de f.c.l.c : http://www.isty-info.uvsq.fr/~rumeau/fclc/


I do mean single malt, I am an Islay man, so I prefer Ardbeg, Laphroaig and
Lagavulin, and they are all spelt whisky! not whiskey! I am Scottish and
take pride in my whisky, not the Irish variety which generally tastes like
cats pee, sorry Bushmills but it does.
Allan
Nov 13 '05 #16
In article <bf**********@news.freedom2surf.net>
Allan Bruce <al*****@TAKEAWAYf2s.com> writes:
I [have an entire, probably large] file [in] a char array, and I want
to find if a string exists in a given line.


Elsethread, you note you mean "strstr", not "strcmp".

I suspect (per insertion above and given some guesses as to what
you are doing) that you want to check a relatively large number
of lines as well, not just one specific line.

If the file is large and the number of lines is large, you may find
it worthwhile to implement a Boyer-Moore search. By writing your
own, you can choose any sort of termination conditions you like,
including stopping at newlines.

Boyer-Moore is, in the ideal case, O(N/M) where N is the length of
the search space -- in this case, some presumably large number of
lines -- and M is the length of the string to be found. It has
some overhead setup that makes it a bad idea unless N is noticeably
larger than M. Since strstr() does not get much information, and
C strings are generally short, strstr() is unlikely to use Boyer-Moore
-- but you know more about what you are searching, so you can.
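
As a rough illustration only, here is the Horspool simplification of Boyer-Moore (bad-character shift table only, not the full algorithm). It works on a buffer of known length, so no '\0' is needed, and as long as the pattern contains no '\n' a match can never span two lines. The name bmh_search is made up:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search of text[0..text_len) for pat[0..pat_len).
 * Returns a pointer to the first match, or NULL if there is none or the
 * pattern is empty. */
const char *bmh_search(const char *text, size_t text_len,
                       const char *pat, size_t pat_len)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos;

    if (pat_len == 0 || pat_len > text_len)
        return NULL;

    /* Characters not in the pattern allow a shift by its full length. */
    for (i = 0; i <= UCHAR_MAX; i++)
        shift[i] = pat_len;
    /* Others shift by their distance from the end of the pattern
     * (the last pattern character is deliberately excluded). */
    for (i = 0; i + 1 < pat_len; i++)
        shift[(unsigned char)pat[i]] = pat_len - 1 - i;

    pos = 0;
    while (pos + pat_len <= text_len) {
        if (memcmp(text + pos, pat, pat_len) == 0)
            return text + pos;
        /* Shift according to the text character under the pattern's end. */
        pos += shift[(unsigned char)text[pos + pat_len - 1]];
    }
    return NULL;
}
```

Once a match is found, the enclosing line can be recovered by scanning back and forward for '\n' from the returned pointer.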
--
In-Real-Life: Chris Torek, Wind River Systems (BSD engineering)
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://67.40.109.61/torek/index.html (for the moment)
Reading email is like searching for food in the garbage, thanks to spammers.
Nov 13 '05 #17
On Sat, 19 Jul 2003 12:08:42 -0400, Allan Bruce wrote:
I am using this to read the file:
// find how big the file is
fseek(fptr, 0, SEEK_END);
size = ftell(fptr);

//allocate memory for string
if ( (contents = new char[size]) == NULL)
return 0;

Basically reading it in one big chunk, since I am doing some things to
the code that take a long time so I wanted to keep the file open for as
little time as possible.
I use strstr() to find some matches and also use strcmp() to see if some
are true for example, a line may be:

# Material: Porsche_Body

Now this will be stored with a '\n' at the end but no '\0'. In this
example I wish to search for "Material:" using strstr() but if it doesnt


I find the best way to parse stuff like this is to use a little "State
Machine" which in this context is just a loop with a switch like:

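int ch, state = 0;
size_t i = 0, v = 0; /* write positions in ident[] and value[] */

while ((ch = fgetc(in)) != EOF) {
    switch (state) {
    case 0: /* look for the '#' that starts an entry */
        if (ch == '#') {
            i = v = 0;
            state = 1;
        }
        break;
    case 1: /* skip whitespace before the identifier */
        if (isspace(ch)) {
            break;
        }
        state = 2;
        /* fall through */
    case 2: /* collect the identifier up to the ':' */
        if (ch == ':') {
            ident[i] = '\0'; /* Material */
            state = 3;
            break;
        }
        ident[i++] = ch;
        break;
    case 3: /* skip whitespace before the value */
        if (isspace(ch)) {
            break;
        }
        state = 4;
        /* fall through */
    case 4: /* collect the value up to the end of the line */
        if (ch == '\r' || ch == '\n') {
            value[v] = '\0'; /* Porsche_Body */
            state = 0;
            break;
        }
        value[v++] = ch;
        break;
    }
}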

The advantage in doing this is that it tends to scale a lot better. As
your file format changes or if you encounter content that you previously
thought you would never need to parse you can refactor easily without
the code growing exponentially more complex. For example if you suddenly
started getting files that had identifiers (e.g. Material) without values
you could add a test for '\r' and '\n' in case 3 to make value[0] =
'\0' and reset the state to 0.

This is all very crudely described of course. You would need to adjust
the technique to your needs.
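
To make that concrete, here is a self-contained sketch of the same loop wrapped in a function, with the case 3 change for identifiers without values included. The names next_pair and FIELD_MAX are invented for illustration, named states replace the numbers, and over-long fields are simply truncated:

```c
#include <ctype.h>
#include <stdio.h>

#define FIELD_MAX 64

/* Read fp up to the end of the next "# ident: value" line.
 * Returns 1 and fills ident and value (FIELD_MAX bytes each) when a
 * pair was read, 0 if end of file is reached first. */
int next_pair(FILE *fp, char *ident, char *value)
{
    enum { START, SKIP1, IDENT, SKIP2, VALUE } state = START;
    size_t i = 0, v = 0;
    int ch;

    while ((ch = fgetc(fp)) != EOF) {
        switch (state) {
        case START:                    /* look for '#' */
            if (ch == '#')
                state = SKIP1;
            break;
        case SKIP1:                    /* whitespace before identifier */
            if (ch == '\n') {          /* bare '#' line: start over */
                state = START;
                break;
            }
            if (isspace(ch))
                break;
            state = IDENT;
            /* fall through */
        case IDENT:
            if (ch == ':') {
                ident[i] = '\0';
                state = SKIP2;
            } else if (ch == '\n') {   /* no colon on this line */
                i = 0;
                state = START;
            } else if (i < FIELD_MAX - 1) {
                ident[i++] = (char)ch;
            }
            break;
        case SKIP2:                    /* whitespace before value */
            if (ch == '\r' || ch == '\n') {
                value[0] = '\0';       /* identifier without a value */
                return 1;
            }
            if (isspace(ch))
                break;
            state = VALUE;
            /* fall through */
        case VALUE:
            if (ch == '\r' || ch == '\n') {
                value[v] = '\0';
                return 1;
            }
            if (v < FIELD_MAX - 1)
                value[v++] = (char)ch;
            break;
        }
    }
    if (state == SKIP2 || state == VALUE) {  /* last line had no newline */
        value[v] = '\0';
        return 1;
    }
    return 0;
}
```

Calling next_pair() in a loop until it returns 0 walks through every pair in the file.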

Mike
Nov 13 '05 #18
Allan Bruce <al*****@takeawayf2s.com> scribbled the following:
"Emmanuel Delahaye" <em**********@noos.fr> wrote in message
news:Xn************************@130.133.1.4...
In 'comp.lang.c', "Allan Bruce" <al*****@TAKEAWAYf2s.com> wrote:
> I have to say one thing though, I hope you are drinking Scotch since it is
> the best, but I may be biassed, but if you are its spelt whisky. Only the

best what? Do you mean pure malt? Forget those blended piece of shit.
> Irish could spell it differently! (Sorry to my Irish mates...)


What the hell? Do you meant whiskey? The Whiskey!

I do mean single malt, I am an Islay man, so I prefer Ardbeg, Laphroaig and
Lagavulin, and they are all spelt whisky! not whiskey! I am Scottish and
take pride in my whisky, not the Irish variety which generally tastes like
cats pee, sorry Bushmills but it does.


The malt is spoiled by the too high alcohol content. It ends up tasting
like alcohol. If you like malt, drink beer.

--
/-- Joona Palaste (pa*****@cc.helsinki.fi) ---------------------------\
| Kingpriest of "The Flying Lemon Tree" G++ FR FW+ M- #108 D+ ADA N+++|
| http://www.helsinki.fi/~palaste W++ B OP+ |
\----------------------------------------- Finland rules! ------------/
"Immanuel Kant but Genghis Khan."
- The Official Graffitist's Handbook
Nov 13 '05 #19
Emmanuel Delahaye wrote:

<snip>
Forget those blended piece of shit.


Was this entirely necessary? This /is/ a technical newsgroup, after all.

I'd hate to have to plonk you.

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #20
Paul Hsieh wrote:

<snip>
Well, I think the first thing is to realize that the C library is just
pure digital diarrhea, especially for strings.

Please do not present your opinions, however dearly held, as if they are
facts.

The implicit requirement to
scan for the end of the string implicit in most of the string library
belies is propensity for being slow, a haven for buffer overflows, and
generally just the wrong set of primitives for string manipulation.


And yet a goodly number of C programmers manage perfectly well with
null-terminated strings in their fast, well-written code. Don't assume your
own experience is universal, and don't blame the library for /your/ buffer
overflows. If you don't like C, and you clearly don't, then why not just
use something else instead?

<snip>

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #21

"Paul Hsieh" <qe*@pobox.com> wrote in message

Well, I think the first thing is to realize that the C library is just pure
digital diarrhea, especially for strings. The implicit requirement to scan
for the end of the string implicit in most of the string library belies is
propensity for being slow, a haven for buffer overflows, and generally just
the wrong set of primitives for string manipulation.

The functions are easy to implement, which is often important. Usually
performance in string manipulation isn't too important, since a string is
usually either input or output, and IO overhead is so large that a bit of
processing inefficiency isn't noticeable.
Finally, for better or for worse C has built in support for NUL-terminated
strings.

If your whole purpose is just to read the string off the disk as fast as
possible, then parse later, then I would recommend trying out my string
library.

I'm sure your string library is well-written, is an improvement over the C
string library, deserves to be accepted as an ANSI standard, etc.
However the problem is that it hasn't yet gained wide acceptance, so anyone
trying to understand how the code works first has to read and understand
your library documentation.
I won't say "don't use it", it might well be an advantage, particularly for
a large string-intensive project. However the issue isn't as clear-cut as you
seem to suggest.
Nov 13 '05 #22
Paul Hsieh <qe*@pobox.com> wrote:
[...]
If your whole purpose is just to read the string off the disk as fast as
possible, then parse later, then I would recommend trying out my string
library (http://bstring.sourceforge.net) as it should lead to a much
simpler solution:

#include "bstrlib.h"
#include "bstraux.h"

bstring b = bread ((bNread) fread, fptr); /* Read the whole file */


bNread is defined thus in bstrlib.h:

typedef size_t (* bNread) (void *buff, size_t elsize, size_t nelem,
void *parm);

...and bread in part:

bstring bread (bNread readPtr, void * parm) {
/* ... */
l = readPtr ((void *) (buff->data + i), 1, n - i, parm);

When you make the bread() call in your example, bread has undefined
behaviour here, because it calls the fread function through an
incorrectly-typed function pointer. bread is passing a void *, but
fread is prototyped as accepting a FILE * as that formal parameter, and
the incorrectly-typed function pointer means there is no conversion.

To do this correctly, you must define an intermediate function:

#include <stdio.h>
#include "bstrlib.h"
#include "bstraux.h"

size_t bfread (void *buff, size_t elsize, size_t nelem, void *parm)
{
return fread(buff, elsize, nelem, parm);
}

...and now the call:

bstring b = bread (bfread, fptr); /* Read the whole file */

is OK (the conversion from FILE * to void * happens at the call to
bread, coerced by the bread prototype, and the conversion from void *
back to FILE * occurs at the call to fread, coerced by the fread
prototype).
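
The same pattern can be tried without bstrlib at all. In this sketch reader_fn is a hypothetical typedef with the same shape as bNread, and fill() is a stand-in for a library routine that reads through the callback; neither is part of any real library:

```c
#include <stdio.h>

/* Hypothetical callback type, shaped like bNread. */
typedef size_t (*reader_fn)(void *buff, size_t elsize, size_t nelem,
                            void *parm);

/* Wrapper whose type matches reader_fn exactly; the void * is converted
 * back to FILE * before fread is called. */
size_t file_reader(void *buff, size_t elsize, size_t nelem, void *parm)
{
    return fread(buff, elsize, nelem, (FILE *)parm);
}

/* Stand-in for a library routine: fill buf with up to cap bytes obtained
 * through the callback. Returns the number of bytes stored. */
size_t fill(reader_fn rd, void *parm, char *buf, size_t cap)
{
    size_t total = 0, got;

    while (total < cap &&
           (got = rd(buf + total, 1, cap - total, parm)) > 0)
        total += got;
    return total;
}
```

Because file_reader has the callback's exact type, passing it to fill() needs no cast, and the call through the pointer is well defined.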

Perhaps you should include this wrapper function for fread in your
library, since reading from FILE * is likely to be quite common?

I think you should also reflect on the fact that even the creator of
bstring can't seem to post simple examples using it that don't have
errors.

- Kevin.

Nov 13 '05 #23
In article <ne********************@tomato.pcug.org.au>, kevin@-nospam-
pcug.org.au says...
Paul Hsieh <qe*@pobox.com> wrote:
[...]
If your whole purpose is just to read the string off the disk as fast as
possible, then parse later, then I would recommend trying out my string
library (http://bstring.sourceforge.net) as it should lead to a much
simpler solution:

#include "bstrlib.h"
#include "bstraux.h"

bstring b = bread ((bNread) fread, fptr); /* Read the whole file */
bNRead is defined thus in bstrlib.h: [...]


Yes, I am aware of this issue; I have made a note of this in my documentation.
I am blatantly recommending that people break the ANSI rules for this. BTW,
can you name me at least one platform where this actually ends up being an
issue? I would like to make a note of it in my documentation, but I don't seem
to have ever encountered a platform which implemented pointers for one type
different from pointers of another type. In fact I am suspicious as to whether
such a platform exists.
[...] Perhaps you should include this wrapper function for fread in your
library, since reading from FILE * is likely to be quite common?
A possibility, but then it would force your program to link with file
manipulation functions (or at least fgetc and fread.)

I know this is difficult to understand, but bstring is a *STRING LIBRARY*. It
is *NOT* a file library, and makes absolutely no impositions on the
implementation of file streams whatsoever, while still being able to use them.

I.e., if someone decided that C's file functions are worthless in the same way
that I decided that C's string library functions are worthless, and used the
same philosophy that I did with bstring, then that library should be able to
work together with my library with no issue (even without having awareness of
the existence of bstrlib). I.e., neither of us would have to go through hoops
to interoperate, since we would have both exposed C-library semantic compatible
mechanisms.

Same thing is true of regexp's or other parsing libraries -- bstring will work
well with them.
I think you should also reflect on the fact that even the creator of
bstring can't seem to post simple examples using it that don't have
errors.


Apparently that's par for the course for source code posted here. Obviously,
the original poster should download the documentation and read it before using the
bstring library where this function prototype coercion ANSI problem is
explained.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sourceforge.net/
Nov 13 '05 #24
On Sun, 20 Jul 2003 07:50:23 +0000, Richard Heathfield wrote:
And yet a goodly number of C programmers manage perfectly well with
null-terminated strings in their fast, well-written code. Don't assume your


Please give examples, because...

http://www.and.org/vstr/security.html#reason

...doesn't agree with you at all, and it does provide examples where very
competent people didn't manage it.

--
James Antill -- ja***@and.org
Need an efficent and powerful string library for C?
http://www.and.org/vstr/

Nov 13 '05 #25
In article <bf**********@newsg4.svr.pol.co.uk>, ma*****@55bank.freeserve.co.uk
says...
"Paul Hsieh" <qe*@pobox.com> wrote in message
Well, I think the first thing is to realize that the C library is just
pure digital diarrhea, especially for strings. The implicit requirement to
scan for the end of the string implicit in most of the string library
belies is propensity for being slow, a haven for buffer overflows, and
generally just the wrong set of primitives for string manipulation.
The functions are easy to implement, which is often important.


And very difficult to implement for speed. I've optimized strlen () by itself
several times based on various ideas I've had or which have been given to me.
I can beat the strlen performance of nearly every C library written by a huge
margin, and still it's an embarrassment as being necessarily O(n), when there is
no sensible reason not to be O(1).
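
To show what an O(1) length means in practice, here is a generic sketch of a string that carries its length alongside its data. It is not bstrlib's actual layout, and every name in it is invented for illustration:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct counted_str {
    size_t len;   /* bytes in data, not counting the trailing '\0' */
    char *data;   /* kept '\0'-terminated so it still works with <string.h> */
};

/* Copy the C string src into s. Returns 0 on success, -1 on failure. */
int counted_str_init(struct counted_str *s, const char *src)
{
    size_t n = strlen(src);      /* the O(n) scan is paid once, here */
    char *p = malloc(n + 1);

    if (p == NULL)
        return -1;
    memcpy(p, src, n + 1);
    s->len = n;
    s->data = p;
    return 0;
}

/* O(1): no scan for a terminator. */
size_t counted_str_length(const struct counted_str *s)
{
    return s->len;
}

void counted_str_free(struct counted_str *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```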
[...] Usually
performance in string manipulation isn't too important, since a string is
usually either input or output, and IO overhead is so large that a bit of
processing inefficiency isn't noticeable.
You obviously haven't been introduced to the world of XML, HTML, or ASN1. You
probably have never considered how to implement a fast and space efficient
spell checker, text editor, or database either. Of course you could just
concede that C is the wrong tool for those jobs ...
Finally, for better or for worse C has built in support for NUL-terminated
strings.
If your whole purpose is just to read the string off the disk as fast as
possible, then parse later, then I would recommend trying out my string
library.
I'm sure your string library is well-written, is an improvement over the C
string library, deserves to be accepted as an ANSI standard, etc.


Well, I don't quite see things that way. The C ANSI committee are the group of
people who did *NOT* deprecate gets() and added C++ namespace conflicting
complex numbers which is suitable for numerical computationalist, and worthless
to number theorists (i.e., no complex integers) when they had the opportunity
in the C99 Spec. Whether or not my library is endorsed or considered by them
... I mean these people are totally irrational, what motivation would or should
I have to submit my library to the ANSI C committee?

And you don't have to *speculate* as to whether or not its well written or not;
the source is fairly small, you can look at it yourself.
However the problem is that it hasn't yet gained wide acceptance, so anyone
trying to understand how the code works first has to read and understand
your library documentation.
The library is only 7 months old, there is a lot of competition from other
string libraries out there and apparently I'm not much of an advertiser. This
seems like a poor rationale for deciding whether or not an extension should be
added to the C standard -- and it clearly was not used for adding floating
point complex numbers.
I won't say "don't use it", it might well be an advantage, particularly for
a large string-intensive project. However the issue isn't a clear-cut as you
seem to suggest.


You cannot buffer overflow with my library unless you are trying really really
hard to do so. With the C library its fairly difficult *NOT* to buffer
overflow. I.e., I would claim that using my library is suitable for *ANY*
amount of string manipulation, if for no other reason than to mitigate the
buffer overflow problem that goes hand in hand with the C library's string
functions.

Squashing this bug alone would probably save Microsoft alone millions in
development costs.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sourceforge.net/
Nov 13 '05 #26
James Antill wrote:
On Sun, 20 Jul 2003 07:50:23 +0000, Richard Heathfield wrote:
And yet a goodly number of C programmers manage perfectly well with
null-terminated strings in their fast, well-written code. Don't assume
your


Please give examples, because...

http://www.and.org/vstr/security.html#reason

...doesn't agree with you at all, and it does provide examples where very
competent people didn't manage it.


On the contrary, the page doesn't disagree with me at all. It advocates good
practice with respect to buffer management, and I certainly agree with
that. It also points out the limitations of fixed size buffers, and I agree
there too.

As for null-terminated strings, why, the page doesn't even mention the term.

I think you've misunderstood the intent of the author of that page. Why
don't you ask him what he really meant to say? ;-)

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #27

"Paul Hsieh" <qe*@pobox.com> wrote in message

You obviously haven't been introduced to the world of XML, HTML, or
ASN1. You probably have never considered how to implement a fast and
space efficient spell checker, text editor, or database either.
Of course you could just concede that C is the wrong tool for those jobs ...
I'm a games programmer so I don't generally do text-intensive apps.
However I use Internet Explorer. With a dial-up connection such as I have at
home it often takes about two seconds for a page to load. With such an
overhead, no amount of efficiency in the string library is going to make any
noticeable difference.
With a spell checker, the problem would seem to be searching the dictionary.
I don't see how avoiding NUL-terminated strings is going to make a vast
improvement.
I have written a text editor - it was an assignment. I stored the text as a
linked list of lines, and performance was fine. If you try to store
everything as one ASCIIZ string then you are admittedly asking for trouble.
I have been warned "never to bullshit your way into a database job" so I'll
withhold comment on this.
... I mean these people are totally irrational, what motivation would or
should I have to submit my library to the ANSI C committee?
Because then everyone would use your library, you would be famous, and you
could write a book "Notes on Using the Extended String Library" and make a
nice amount of money. As happened with the Standard Template Library.
the buffer overflow problem that goes hand in hand with the C library's
string functions.

Squashing this bug alone would probably save Microsoft alone millions in
development costs.

If you program in C you have to get used to the fact that arrays don't have
bounds checking.
It is also tempting to write code that uses fixed "big enough" buffers. I
will often do this for an internal tool. Since the program is meant for
internal use only, no-one is going to try to find weaknesses in it to
exploit.
In practice I haven't found string buffer overflows to be much of a problem.
Nov 13 '05 #28
Paul Hsieh <qe*@pobox.com> wrote:
In article <ne********************@tomato.pcug.org.au>, kevin@-nospam-
pcug.org.au says...
Paul Hsieh <qe*@pobox.com> wrote:
[...]
> If your whole purpose is just to read the string off the disk as fast as
> possible, then parse later, then I would recommend trying out my string
> library (http://bstring.sourceforge.net) as it should lead to a much
> simpler solution:
>
> #include "bstrlib.h"
> #include "bstraux.h"
>
> bstring b = bread ((bNread) fread, fptr); /* Read the whole file */


bNread is defined thus in bstrlib.h: [...]


Yes, I am aware of this issue; I have made a note of this in my documentation.
I am blatantly recommending that people break the ANSI rules for this. BTW,
can you name me at least one platform where this actually ends up being an
issue? I would like to make a note of it in my documentation, but I don't
seem to have ever encountered a platform which implemented pointers for one
type different from pointers of another type. In fact I am suspicious as
to whether such a platform exists.


The Data General Eclipse had different representations for word and byte
pointers - converting from one to the other required a shift and mask,
so accessing one as the other without conversion would lead to
unpredictable results.

Since you ask about any kind of pointers, not just pointers to object
types, there are several platforms where function pointers are larger
(in some cases, *much* larger) than object pointers - this is why
conversions between function pointer and object pointer types aren't
defined.

Anyway, once upon a time all the world was a VAX - I don't plan on
repeating the mistakes of the past. Why should people write this
erroneous code, when there is a simple, correct alternative?
[...] Perhaps you should include this wrapper function for fread in your
library, since reading from FILE * is likely to be quite common?


A possibility, but then it would force your program to link with file
manipulation functions (or at least fgetc and fread.)


OK - you'd like it to be portable to non-hosted implementations. Fair
enough. Since the requisite function is a one-liner anyway, you can
just include it in the documentation for people to copy and paste if
they need it.
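
The wrapper idea above can be sketched roughly as follows. The callback type `read_cb` and the helper `read_some` are hypothetical names made up for illustration; they are modelled on a generic "read n elements" callback that takes a `void *` stream parameter, and are not copied from bstrlib.h. The point is only that a function whose type matches the callback exactly can convert the `void *` back to `FILE *` itself, so no function pointer cast is needed.

```c
#include <stdio.h>

/* Hypothetical callback type (an assumption for illustration, not taken
 * from bstrlib.h): reads up to nelem elements of elsize bytes each. */
typedef size_t (*read_cb)(void *buf, size_t elsize, size_t nelem, void *parm);

/* Wrapper whose type matches read_cb exactly; it converts the generic
 * parameter back to FILE * before calling fread. */
static size_t fread_wrapper(void *buf, size_t elsize, size_t nelem, void *parm)
{
    return fread(buf, elsize, nelem, (FILE *)parm);
}

/* Hypothetical consumer: reads up to cap bytes through the callback. */
static size_t read_some(read_cb fn, void *parm, char *dst, size_t cap)
{
    return fn(dst, 1, cap, parm);
}
```

A caller would then pass `fread_wrapper` and the `FILE *` as two separate arguments instead of casting `fread` to the callback type.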

- Kevin.

Nov 13 '05 #29
In article <ne********************@tomato.pcug.org.au>
Kevin Easton <kevin@-nospam-pcug.org.au> writes, in part:
The Data General Eclipse had different representations for word and byte
pointers - converting from one to the other required a shift and mask ...
Just a shift, actually -- the bits are carefully arranged with ring
and segment numbers offset by one bit, so that a one-bit shift
serves to convert one into the other. The "word" is a 16-bit word,
so a byte pointer has one extra low-order bit that must be introduced
or discarded as necessary. The top bit of a word pointer is a
special "indirect" bit that is not used in C at all (so it can be
discarded without loss of information).
Anyway, once upon a time all the world was a VAX - I don't plan on
repeating the mistakes of the past.


Indeed, we see this happening today with the introduction of 64-bit
architectures. All of the "ILP32 vs LP64" items that were posted
a short while ago are wonderful examples of assuming "all the
world's an i386 or other 32-bit, byte-oriented processor". The C
language proper does not assume this, and if you (the generic "you")
also avoid assuming it, your code will work on both ILP32 *and*
LP64 machines, with no source-level changes required.

(As usual, those who do not learn from history are doomed to repeat
it. :-) )
--
In-Real-Life: Chris Torek, Wind River Systems (BSD engineering)
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://67.40.109.61/torek/index.html (for the moment)
Reading email is like searching for food in the garbage, thanks to spammers.
Nov 13 '05 #30
James Antill wrote:
On Sun, 20 Jul 2003 20:39:08 +0000, Richard Heathfield wrote:
James Antill wrote:
On Sun, 20 Jul 2003 07:50:23 +0000, Richard Heathfield wrote:

And yet a goodly number of C programmers manage perfectly well with
null-terminated strings in their fast, well-written code. Don't assume
your

Please give examples, because...

http://www.and.org/vstr/security.html#reason

...doesn't agree with you at all, and it does provide examples where
very competent people didn't manage it.
On the contrary, the page doesn't disagree with me at all. It advocates
good practice with respect to buffer management, and I certainly agree
with that. It also points out the limitations of fixed size buffers, and
I agree there too.


Hmm, maybe I misunderstood then. It seemed like you were saying that
using the plain string.h functions is often a good solution to string
related problems in C.


No, they are a fairly decent basis for normal programming situations. We've
all encountered situations that they don't match up to, and we've all
written code to work around those situations. But if I've got a buffer
that's yay big, and a null-terminated string that's no bigger than yay big
minus one, and I need to copy the string into the buffer, strcpy works for
me every time.
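
As a minimal sketch of that precondition (the helper name `copy_if_fits` is hypothetical, not a standard function), the length check can be made explicit before the strcpy call:

```c
#include <string.h>

/* Copies src into dst only if it fits, terminator included.
 * Returns 1 on success, 0 if src is too long (dst is left untouched). */
static int copy_if_fits(char *dst, size_t dstsize, const char *src)
{
    if (strlen(src) >= dstsize)
        return 0;            /* string is bigger than "yay big minus one" */
    strcpy(dst, src);        /* safe: length was checked above */
    return 1;
}
```

The cost is one extra scan of src, which is the trade-off discussed elsewhere in this thread.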
As for null-terminated strings, why, the page doesn't even mention the
term.
It does at...

http://www.and.org/vstr/security.html#io

...but the words null-terminated aren't used.


Well, that would explain why I couldn't find them. :-)

If the null-terminated string model is not appropriate for your data, then
obviously you have to use something else (and in fact my own "string"
library completely ignores null characters, treating '\0' as just another
value).
I think you've misunderstood the intent of the author of that page. Why
don't you ask him what he really meant to say? ;-)


The problem is that he never seems to argue with what I say ;)


Perhaps he's killfiled you? <g,d&r>

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #31
In article <MP************************@news.sf.sbcglobal.net> ,
Paul Hsieh <qe*@pobox.com> wrote:
do******@address.co.uk.invalid says...
Paul Hsieh wrote:
> The implicit requirement to
> scan for the end of the string implicit in most of the string library
> belies its propensity for being slow, a haven for buffer overflows, and
> generally just the wrong set of primitives for string manipulation.
And yet a goodly number of C programmers manage perfectly well with
null-terminated strings in their fast, well-written code.


By what standards can you say any of that? Buffer overflows are the #1
occurring bug, and the vast majority of them occur in the C string library.


Accidents are the #1 cause of death for people under the age of 35 in
the United States[1][2], and the vast majority of them are motor vehicle
accidents[3]. Does that mean we should stop using motor vehicles?
As to being fast -- that's impossible unless the functions are absolutely
trivial. The C-library basically imposes an additional minimum O(n) on all
non-trivial string manipulations.
Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time? I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.

C still exposes the best core
speed for someone willing to work around the compiler and pretty much the only
useful language with inline assembly language. So I am stuck with it.


Really? If you go right to assembly, you stop having to work around
the compiler (since there's no longer a compiler to work around),
and claiming that it doesn't give you inline assembly would quickly
(and rightly) be dismissed as quibbling over details.
dave

[1] http://www.cdc.gov/nchs/fastats/pdf/nvsr49_11tb1.pdf
This is a breakdown of causes of death by age, race, and sex.
The first set of detailed breakdowns is all races, both sexes, by age.
Accidents are highest up to age 34 and second-highest for 35-44.

[2] The US being the first country that Google returned results for

[3] http://www.cdc.gov/nchs/fastats/pdf/nvsr49_11.pdf
Perhaps not "vast majority", but enough to get the top ranking in
the subdivision of accidental deaths.
--
Dave Vandervies dj******@csclub.uwaterloo.ca
The q in qsort stands for quirky. You really don't know how it will do its
job.
--Daniel Fox in comp.lang.c
Nov 13 '05 #32
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:
In article <MP************************@news.sf.sbcglobal.net> ,
Paul Hsieh <qe*@pobox.com> wrote:

[...]
As to being fast -- that's impossible unless the functions are absolutely
trivial. The C-library basically imposes an additional minimum O(n) on all
non-trivial string manipulations.


Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time? I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.


Concatenation of a string of length m with a string of length n is
O(n+m) using strcat, but O(m) if you use a string type that has its
length explicitly stored, rather than indicated by a sentinel.

In most other cases you will end up doing a copy of the string that you
need to find the length of, so the overall time complexity doesn't
change by avoiding the scan to find string length - but the constant
factors can often be reduced quite significantly (consider an operation
like search-and-replace).
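
A rough sketch of the stored-length approach, assuming a hypothetical `struct cstr` type (it is not any particular library's string type): the append never scans the existing contents, so its cost depends only on the length of the appended piece.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string type; len excludes the terminating '\0'. */
struct cstr {
    char  *data;
    size_t len;
    size_t cap;   /* allocated bytes, including room for '\0' */
};

/* Appends srclen bytes from src (which must not be NULL) to s.
 * Returns 0 on success, -1 on overflow or allocation failure. */
static int cstr_append(struct cstr *s, const char *src, size_t srclen)
{
    if (srclen > SIZE_MAX - s->len - 1)
        return -1;                        /* total length would overflow */
    size_t need = s->len + srclen + 1;
    if (need > s->cap) {
        size_t newcap = s->cap ? s->cap : 16;
        while (newcap < need) {
            if (newcap > SIZE_MAX / 2) { newcap = need; break; }
            newcap *= 2;
        }
        char *p = realloc(s->data, newcap);
        if (p == NULL)
            return -1;
        s->data = p;
        s->cap = newcap;
    }
    memcpy(s->data + s->len, src, srclen); /* no scan for the old end */
    s->len += srclen;
    s->data[s->len] = '\0';                /* stay usable as a C string */
    return 0;
}
```

Starting from `struct cstr s = {NULL, 0, 0};`, repeated calls grow the buffer geometrically; the caller frees `s.data` when done.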

- Kevin.

Nov 13 '05 #33
Mark McIntyre <ma**********@spamcop.net> wrote:
On Tue, 22 Jul 2003 00:13:30 GMT, in comp.lang.c , Kevin Easton
<kevin@-nospam-pcug.org.au> wrote:
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:
In article <MP************************@news.sf.sbcglobal.net> ,
Paul Hsieh <qe*@pobox.com> wrote:

[...]
As to being fast -- that's impossible unless the functions are absolutely
trivial. The C-library basically imposes an additional minimum O(n) on all
non-trivial string manipulations.

Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time? I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.


Concatenation of a string of length m with a string of length n is
O(n+m) using strcat, but O(m) if you use a string type that has its
length explicitly stored, rather than indicated by a sentinel.


Sure, but the difference between O(m+n) and O(m) is negligible for any
realistic n,m associated with strings.


Consider repeated concatenation of strings onto a destination - if we
concatenate 20 strings, each character in the original buffer is
inspected at least 20 times, each character of the second string at
least 19 times, ...
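
To put a number on this, here is a small sketch (the function `scan_cost` is purely illustrative) that adds up how many bytes strcat has to walk past just to find the end of the destination, assuming k appends of pieces that are each len bytes long:

```c
/* Total bytes skipped while searching for the terminator of dst,
 * summed over k successive strcat calls of len-byte pieces. */
static unsigned long scan_cost(unsigned long k, unsigned long len)
{
    unsigned long total = 0;
    unsigned long dstlen = 0;   /* current length of the destination */
    unsigned long i;

    for (i = 0; i < k; i++) {
        total += dstlen;        /* strcat rescans everything so far */
        dstlen += len;
    }
    return total;
}
```

For 20 pieces of 10 bytes this gives 10 * (0 + 1 + ... + 19) = 1900 bytes of pure scanning on top of the 200 bytes actually copied; with a remembered end pointer the scanning term is zero.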

- Kevin.

Nov 13 '05 #34
Richard Heathfield <in*****@address.co.uk.invalid> wrote:
Kevin Easton wrote:
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:
In article <MP************************@news.sf.sbcglobal.net> ,
Paul Hsieh <qe*@pobox.com> wrote:

[...]
As to being fast -- that's impossible unless the functions are absolutely
trivial. The C-library basically imposes an additional minimum O(n) on
all non-trivial string manipulations.

Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time? I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.


Concatenation of a string of length m with a string of length n is
O(n+m) using strcat, but O(m) if you use a string type that has its
length explicitly stored, rather than indicated by a sentinel.


Using no additional overhead [1], remember how many bytes you've copied
using strcat, and offset the dest pointer by that many on the next copy.

[1] in comparison to the "store the length" method.


strcat doesn't tell you how many bytes you've copied. If you know the
length of the destination string to start with, you'd just use memcpy
anyway (which is what the stored-length method comes down to).
In most other cases you will end up doing a copy of the string that you
need to find the length of, so the overall time complexity doesn't
change by avoiding the scan to find string length - but the constant
factors can often be reduced quite significantly (consider an operation
like search-and-replace).


This can all be managed perfectly satisfactorily using C strings and
temporary variables.


Satisfactorily, yes. But it's undeniably more efficient to keep the
lengths of the pertinent strings around and use directed copies rather
than scan-for-sentinel copies, which means you end up not using a fair
number of the standard C string functions.

I'm not saying the difference is incredibly noticeable in all or even
many cases, but sometimes it is.

- Kevin.

Nov 13 '05 #35
On Tue, 22 Jul 2003 15:55:58 +0000, Dan Pop wrote:
In <ne********************@tomato.pcug.org.au> Kevin Easton <kevin@-nospam-pcug.org.au> writes:
Consider repeated concatenation of strings onto a destination - if we
concatenate 20 strings, each character in the original buffer is
inspected at least 20 times, each character of the second string at
least 19 times, ...


This can be trivially avoided by using sprintf instead of strcat :-)


Errm, what?

Say you have...

List *scan = NULL;
char buf[4096]; /* we "know" this is long enough */

buf[0] = 0;
scan = beg;
while (scan)
{
    strcat(buf, scan->data);
    scan = scan->next;
}

...how does sprintf() help? Ok, so you can do something like...

ptr = buf;
while (scan)
{
    ptr += sprintf(ptr, "%s", scan->data); /* assume sprintf() has an ISO
                                            * return value */
    scan = scan->next;
}

...but then you might as well just do...

ptr = buf;
while (scan)
{
    size_t len = strlen(scan->data);

    memcpy(ptr, scan->data, len);
    ptr += len;

    scan = scan->next;
}

...and after you do that more than once you realize that you want...

char *my_stpcpy(char *dst, const char *src)
{
    size_t len = strlen(src);

    memcpy(dst, src, len);
    dst += len;

    return (dst);
}

ptr = buf;
while (scan)
{
    ptr = stpcpy(ptr, scan->data);
    scan = scan->next;
}

...at which point you've just _reinvented the wheel_ for about the
millionth time, creating your own clumsy string API. All because the C
library string APIs are deficient ... which is pretty much what was
argued.
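
For comparison, a bounds-checked variant of the same idea might look like the sketch below. The name `append_str` and the "end pointer" convention are assumptions made for this example; unlike the helper above, it also copies the terminating '\0' so the buffer is a valid C string after every call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src (with its '\0') to dst if it fits before
 * end, and returns a pointer to the new terminator. Returns NULL and leaves
 * the buffer unchanged if there is not enough room. end points one past the
 * last byte of the buffer that dst points into. */
static char *append_str(char *dst, const char *end, const char *src)
{
    size_t len = strlen(src);

    if (dst >= end || len >= (size_t)(end - dst))
        return NULL;
    memcpy(dst, src, len + 1);   /* len characters plus the terminator */
    return dst + len;
}
```

In the loop, `ptr = append_str(ptr, buf + sizeof buf, scan->data);` would replace the unchecked copy, with a NULL check to stop when the buffer is full.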

--
James Antill -- ja***@and.org
Need an efficient and powerful string library for C?
http://www.and.org/vstr/

Nov 13 '05 #36
On Tue, 22 Jul 2003 07:38:33 +0000, Richard Heathfield wrote:
Kevin Easton wrote:
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:
In article <MP************************@news.sf.sbcglobal.net> ,
Paul Hsieh <qe*@pobox.com> wrote: [...]
As to being fast -- that's impossible unless the functions are absolutely
trivial. The C-library basically imposes an additional minimum O(n) on
all non-trivial string manipulations.

Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time? I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.


Concatenation of a string of length m with a string of length n is
O(n+m) using strcat, but O(m) if you use a string type that has its
length explicitly stored, rather than indicated by a sentinel.


Using no additional overhead [1], remember how many bytes you've copied
using strcat, and offset the dest pointer by that many on the next copy.


So you create a local stpcpy(), strconcat() and etc. or variants thereof
that take a pointer to a length and the beginning of the c style string.
Then you just have to deal with all the problems of using a string API
that limits the length of data:

But this says nothing about how good or bad the C-library string API is.

memcpy: Useful, but requires the programmer to keep track of metadata for
dst.
memmove: Same as memcpy.
strcpy: Most commonly used for buffer overflows, as with all the str*
functions to create data the two inputs cannot be the same.
strncpy: Most broken interface ever
strcat: O(n)
strncat: O(n), plus dst must be a valid NUL-terminated C-style string
memcmp: Useful, but requires the programmer to keep track of metadata for
both arguments and properly merge them (you can "fix" having to
merge the metadata by using strncpy() but I wouldn't recommend
this).
strcmp: Useful, assuming you have valid c style strings.
strcoll: Same as strcmp
strncmp: Same as memcmp
strxfrm: Can be used as a non-broken strncpy() if you don't mind confusing
everyone (and you don't use LC_COLLATE).
memchr: Same as memcpy
strchr, strcspn, strpbrk, strrchr, strspn, strstr, strlen: Same as strcmp
strtok: Often used badly, destroys its input ... sometimes even horribly
abused as a side band parameter to functions.
memset: Same as memcpy

So there are some useful functions for dealing with C style strings that
exist, but as I've said the only sane way to create those strings is to
abuse strxfrm() or write your own using memcpy()/memmove().
And then after you've created those functions so you can move data to
limited sized buffers without going insane, you still have all the
problems of having limited size buffers...

http://www.and.org/vstr/security.html#alloc
[1] in comparison to the "store the length" method.


_4 bytes of metadata_
and if you want to dynamically allocate the string this is probably less
than 25% of a zero length (1 byte long[2]) string.
And if you aren't dynamically allocating the string, you are almost
certainly going to have the fixed size buffer greater than 16 bytes long,
so you again have less than 25% overhead.

But yeh it's not impossible to do that, you might only need to create
one or two extra functions and it's possible you won't have any security
problems because of it. I might even put Richard on the list of people
that can do all of that, however that's a very short list.

[2] Malloc implementations I've seen require at least 16 bytes of overhead
per object, so you get 16 + 4 + 1 vs. 16 + 1

--
James Antill -- ja***@and.org
Need an efficient and powerful string library for C?
http://www.and.org/vstr/

Nov 13 '05 #37
James Antill wrote:
On Tue, 22 Jul 2003 07:38:33 +0000, Richard Heathfield wrote:
<snip>
So you create a local stpcpy(), strconcat() and etc. or variants thereof
strconcat is out, since it invades implementation namespace.
that take a pointer to a length and the beginning of the c style string.
Why bother? Just remember the length in an auto variable. Much of the time,
this is sufficient.
Then you just have to deal with all the problems of using a string API
that limits the length of data:
Have I misunderstood you? I'm not aware of any imposed limit on string
length in the C string model.

But this says nothing about how good or bad the C-library string API is.

memcpy: Useful, but requires the programmer to keep track of metadata for
dst.
Right.
memmove: Same as memcpy.
Right, in this regard at least!
strcpy: Most commonly used for buffer overflows,
That's a little unfair on strcpy. If the programmer is careful (as all
programmers should be), strcpy is perfectly safe.
as with all the str*
functions to create data the two inputs cannot be the same.
Why would you want to copy a string onto itself?

<some very good points snipped>
So there are some useful functions for dealing with C style strings that
exist, but as I've said the only sane way to create those strings is to
abuse strxfrm() or write your own using memcpy()/memmove().
strcpy still works fine for me.
And then after you've created those functions so you can move data to
limited sized buffers without going insane, you still have all the
problems of having limited size buffers...
So don't use limited size buffers.

<snip>
But yeh it's not impossible to do that, you might only need to create
one or two extra functions and it's possible you won't have any security
problems because of it. I might even put Richard on the list of people
that can do all of that, however that's a very short list.


If the list is indeed so short, the programming industry needs to be very
very worried. It's not difficult to get this right.
--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #38
Mark McIntyre <ma**********@spamcop.net> wrote:
On Tue, 22 Jul 2003 01:08:26 GMT, in comp.lang.c , Kevin Easton
<kevin@-nospam-pcug.org.au> wrote:
Mark McIntyre <ma**********@spamcop.net> wrote:
On Tue, 22 Jul 2003 00:13:30 GMT, in comp.lang.c , Kevin Easton
<kevin@-nospam-pcug.org.au> wrote:
Sure, but the difference between O(m+n) and O(m) is negligible for any
realistic n,m associated with strings.


Consider repeated concatenation of strings onto a destination - if we
concatenate 20 strings, each character in the original buffer is
inspected at least 20 times, each character of the second string at
least 19 times, ...


Well, firstly I still contend that this is relatively speaking
insignificant except in critical sections of code (eg tight loops, but
erm what are you doing manipulating strings in tight loops? :-) ), and
secondly I contend that in such sections, strcat is a poor choice
anyway, memcpy is probably more appropriate.


For some programs, string manipulation is the meat of their job. Anyway
- you've hit the nail on the head - using memcpy _is_ probably more
appropriate, and using it in the best way involves keeping the length of
strings around (or a pointer to the end, which amounts to the same
thing). The question is why builtin C strings use a sentinel method
rather than a length/end-pointer method to indicate their extent[%] - are
there any downsides to the latter?

- Kevin.

[%] Obviously the horse has not only well and truly bolted, but gone on
to live a long and happy life roaming the countryside and long since
died peacefully. So the question is merely of academic interest at this
point.
Nov 13 '05 #39
James Antill wrote:
strcpy: Most commonly used for buffer overflows,
That's a little unfair on strcpy. If the programmer is careful (as all
programmers should be), strcpy is perfectly safe.


That's a little optimistic, there are very few cases where you couldn't
just as easily use memcpy() ... that aren't errors.


I disagree (although of course that might just mean that I have less
experience of fighting malware than you do). I find strcpy to have
expressive power, which is why I prefer it to memcpy when strings are
involved.
as with all the str*
functions to create data the two inputs cannot be the same.
Why would you want to copy a string onto itself?


I've seen code like...

strcpy(s1, s1 + 1);


Um, yes, I've seen code like that too. My LART had memmove written on it (on
the bit just surrounding the sticky-out nail), in large letters. Once the
blood had stopped flowing out quite so freely.
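
A minimal sketch of the memmove fix (the function name `drop_first_char` is made up for the example): strcpy(s1, s1 + 1) has undefined behaviour because source and destination overlap, whereas memmove is specified to handle overlapping regions.

```c
#include <string.h>

/* Removes the first character of s in place. */
static void drop_first_char(char *s)
{
    size_t len = strlen(s);

    if (len > 0)
        memmove(s, s + 1, len);  /* len bytes: the remaining chars plus '\0' */
}
```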
<snip>
As for the rest of the industry, they seem to be desperately trying to
change language, once every 5 years ... which seems like buying a Ford
because the car stereo in your Mercedes doesn't play tapes, but they're
having fun I guess :).


Crazy world. 'Twas ever thus.

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #40
In <pa****************************@and.org> "James Antill" <ja***********@and.org> writes:
On Tue, 22 Jul 2003 15:55:58 +0000, Dan Pop wrote:
In <ne********************@tomato.pcug.org.au> Kevin Easton <kevin@-nospam-pcug.org.au> writes:
Consider repeated concatenation of strings onto a destination - if we
concatenate 20 strings, each character in the original buffer is
inspected at least 20 times, each character of the second string at
least 19 times, ...
This can be trivially avoided by using sprintf instead of strcat :-)


Errm, what?

Say you have...

List *scan = NULL;
char buf[4096]; /* we "know" this is long enough */

buf[0] = 0;
scan = beg;
while (scan)
{
strcat(buf, scan->data);
scan = scan->next;
}

...how does sprintf() help? Ok, so you can do something like...

ptr = buf;
while (scan)
{
ptr += sprintf(ptr, "%s", scan->data); /* assume sprintf() has an ISO
* return value*/


We normally assume that standard library functions return what the
standard says they do. Without this assumption, the standard library
becomes (next to) useless.
scan = scan->next;
}

...but then you might as well just do...

ptr = buf;
while (scan)
{
size_t len = strlen(scan->data);

memcpy(ptr, scan->data, len);
ptr += len;

scan = scan->next;
}
Except that it requires more code and is, therefore, less readable and
that it requires one more statement, after the loop, to properly terminate
the string.
...and after you do that more than once you realize that you want...

char *my_stpcpy(char *dst, const char *src)
You can simply name it stpcpy(), especially since this is the name you
use below :-)
{
size_t len = strlen(src);

memcpy(dst, src, len);
dst += len;

return (dst);
}

ptr = buf;
while (scan)
{
ptr = stpcpy(ptr, scan->data);
scan = scan->next;
}

...at which point you've just _reinvented the wheel_ for about the
millionth time, creating your own clumsy string API.
Which is pointless, considering that the sprintf-based solution achieves
the same thing, with the same source code complexity, while staying with
the standard API.
All because the c
library string APIs are deficient ... which is pretty much what was
argued.


The only deficiency I can see is that strcpy and strcat (and friends)
have a (mostly) useless return value. For the rare cases when this is
a problem, sprintf provides a solution without needing to reinvent
anything and without having to take the overhead of repetitive strcat()
calls (sprintf has its own overhead, but it is constant per call).
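
A sketch of that printf-family approach, with two assumptions labelled up front: it uses snprintf (C99) rather than sprintf so the buffer size is respected, and the function name `join_strings` and its array-of-strings interface are invented for the example. The return value of each call is used to advance the write position, so the destination is never rescanned.

```c
#include <stdio.h>

/* Joins count strings into buf (cap bytes) without rescanning buf.
 * Returns the total length, or -1 on truncation or output error. */
static int join_strings(char *buf, size_t cap,
                        const char *const *items, size_t count)
{
    size_t used = 0;
    size_t i;

    if (cap == 0)
        return -1;
    buf[0] = '\0';
    for (i = 0; i < count; i++) {
        int n = snprintf(buf + used, cap - used, "%s", items[i]);

        if (n < 0 || (size_t)n >= cap - used)
            return -1;           /* error, or the piece did not fit */
        used += (size_t)n;       /* advance past what was just written */
    }
    return (int)used;
}
```

Each snprintf call terminates the string itself, so no extra statement is needed after the loop.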

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #41
dj******@csclub.uwaterloo.ca (Dave Vandervies) wrote in message news:<bf**********@tabloid.uwaterloo.ca>...
In article <MP************************@news.sf.sbcglobal.net> ,
Paul Hsieh <qe*@pobox.com> wrote:
do******@address.co.uk.invalid says...
Paul Hsieh wrote:
> The implicit requirement to
> scan for the end of the string implicit in most of the string library
> belies its propensity for being slow, a haven for buffer overflows, and
> generally just the wrong set of primitives for string manipulation.

And yet a goodly number of C programmers manage perfectly well with
null-terminated strings in their fast, well-written code.
By what standards can you say any of that? Buffer overflows are the #1
occurring bug, and the vast majority of them occur in the C string
library.


Accidents are the #1 cause of death for people under the age of 35 in
the United States[1][2], and the vast majority of them are motor vehicle
accidents[3]. Does that mean we should stop using motor vehicles?


But there is only so much you can do about people who have accidents.
Furthermore things *ARE* done to minimize them. That's why cars have
bumpers, crumple zones, air bags and seat belts. That's why microwave
ovens can't operate without the door being closed. That's why razors
have bizarrely shaped enclosures around the blade. The infrastructure
evolves around the need to minimize accidents even if you could argue
that the accident was really the fault of the victim.

Compare this with the C language. In order to make it work and be
adopted, in 1989, compromises were made and lots of questionable
practices were rubber stamped. Ok fine -- for 1989 it was a good
decision because it allowed the language to be rapidly and widely
implemented and adopted. But in the 20+ year lifetime of this language,
we now know this language has serious problems. Nearly every hack,
most general program failures and every buffer overflow->stack hijack
attack can be traced back to the C standard.

Ok -- so what is to be done about this sad state of affairs? Simple,
do *something* whenever there is a standards revision. 1999 was the C
committee's perfect opportunity to do something, *ANYTHING* to try to
mitigate these problems. Even the single solitary act of deprecating
gets() would have at least been a signal that they were thinking about
these issues.

But no, they added in complex numbers that worsen C++ compatibility,
and numerous other irrelevancies to codify "standard practice" for no
good reason. Not surprisingly, C99 has gotten no serious support from
any major vendor -- the closest thing is gcc, and they are still
working on it.
As to being fast -- that's impossible unless the functions are absolutely
trivial. The C-library basically imposes an additional minimum O(n) on all
non-trivial string manipulations.


Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time?


My claim is that there is an *additional* O(n) paid. For those in
theoretical Comp. Sci., this may mean nothing to you if the operation
is O(n) anyways (especially if we ignore the fact that many operations
have an "m" as well as "n"), but Buffer Overflows, paging, and cache
thrashing probably don't mean anything to you either. In which case
real world performance won't mean anything to you either.
[...] I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.


I don't claim there is no additional overhead. But all the overhead
is O(1).
C still exposes the best core speed for someone willing to work around
the compiler and pretty much the only useful language with inline assembly
language. So I am stuck with it.


Really? If you go right to assembly, you stop having to work around
the compiler (since there's no longer a compiler to work around), [...]


Look, I don't care whether or not you understand why C (+ assembly
sometimes) is the only real option for writing maintainable and high
performance software.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sourceforge.net/
Nov 13 '05 #42
Richard Heathfield <in*****@address.co.uk.invalid> wrote in message news:<3f******@news2.power.net.uk>...
Kevin Easton wrote:
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:
In article <MP************************@news.sf.sbcglobal.net> ,
Paul Hsieh <qe*@pobox.com> wrote: [...]As to being fast -- that's impossible unless the functions are absolutely
trivial. The C-library basically imposes an additional minimum O(n) on
all non-trivial string manipulations.

Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time? I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.


Concatenation of a string of length m with a string of length n is
O(n+m) using strcat, but O(m) if you use a string type that has its
length explicitly stored, rather than indicated by a sentinel.
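
To make the cost difference concrete, here is a minimal sketch, assuming a
made-up counted_str type (it is not the API of any library mentioned in
this thread). Because the length is stored, an append starts writing at
data + len directly, so its cost depends only on the bytes being added;
strcat has to walk the destination first to find the terminator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type, for illustration only. */
struct counted_str {
    char  *data;   /* caller-supplied buffer, kept NUL-terminated */
    size_t len;    /* bytes in use, excluding the terminator */
    size_t cap;    /* total size of the buffer in bytes (len < cap) */
};

/* Append n bytes from src. Returns 0 on success, -1 if it would not fit.
   No scan of the existing contents is needed: the end is data + len. */
int counted_append(struct counted_str *s, const char *src, size_t n)
{
    if (n >= s->cap - s->len)   /* need room for n bytes plus '\0' */
        return -1;
    memcpy(s->data + s->len, src, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}
```

The capacity check is part of the same call, which is the other half of the
argument: the bound travels with the string instead of living in the
caller's head.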


Using no additional overhead [1], remember


Remember?!?!? Remember where? In one of your processor's 6 precious
registers? You also have to *remember* how much memory you have
allocated and make sure you don't spill over as well, BTW. Oh yes,
and if you are communicating with a library are you going to pass
these remembered quantities around along with the string data? Or
will you let it work it all out with strlen by itself? Of course it's
kind of hard to deduce the actual amount of memory from this
information so you either have to figure it all out from the caller
(thus duplicating some of the logic of the library) or you have to
pass it as a parameter (burning an additional register or stack.)

Or you could screw it and just buffer overflow like everyone else
does.
In most other cases you will end up doing a copy of the string that you
need to find the length of, so the overall time complexity doesn't
change by avoiding the scan to find string length - but the constant
factors can often be reduced quite significantly (consider an operation
like search-and-replace).


This can all be managed perfectly satisfactorily using C strings and
temporary variables.


Which my library (and others) is living proof of, of course. Of
course trying to do it all by hand yourself without a centralized
library ... well you read about the weekly buffer overflow attacks
that get reported to www.securityfocus.com or Risks Digest or
www.news.com to see what happens when you try to do that.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sourceforge.net/
Nov 13 '05 #43
"James Antill" <ja***********@and.org> wrote:
But this says nothing about how good or bad the C-library string API is.

memcpy: Useful, but requires the programmer to keep track of metadata for
dst.
memmove: Same as memcpy.
strcpy: Most commonly used for buffer overflows, as with all the str*
functions to create data the two inputs cannot be the same.
strncpy: Most broken interface ever
strcat: O(n)
strncat: O(n). Plus dst must be a valid NUL-terminated C-style string
memcmp: Useful, but requires the programmer to keep track of metadata for
both arguments and properly merge them (you can "fix" having to
merge the metadata by using strncpy() but I wouldn't recommend
this).
strcmp: Useful, assuming you have valid c style strings.
strcoll: Same as strcmp
strncmp: Same as memcmp
strxfrm: Can be used as a non-broken strncpy() if you don't mind confusing
everyone (and you don't use LC_COLLATE).
memchr: Same as memcpy
strchr, strcspn, strpbrk, strrchr, strspn, strstr, strlen: Same as strcmp
strtok: Often used badly, destroys its input ... sometimes even horribly
abused as a side band parameter to functions.
memset: Same as memcpy
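
As one illustration of the strncpy complaint: strncpy(dst, src, n) writes no
terminator when strlen(src) >= n, and pads with zero bytes when src is
shorter. A minimal sketch of a bounded copy that always terminates and
reports truncation might look like this (copy_bounded is a made-up name,
not a standard function):

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst (dstsize bytes), always NUL-terminating when
   dstsize > 0. Returns 0 if the whole string fit, -1 if it was
   truncated or dstsize is 0. */
int copy_bounded(char *dst, size_t dstsize, const char *src)
{
    size_t n = strlen(src);
    if (dstsize == 0)
        return -1;
    if (n >= dstsize) {
        memcpy(dst, src, dstsize - 1);   /* keep what fits */
        dst[dstsize - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, n + 1);             /* includes the terminator */
    return 0;
}
```

Note that the caller still has to pass dstsize correctly, which is exactly
the "keep track of metadata" burden the list describes.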
Oooh! Nice list. I wonder where you got the idea for doing this from
.... ;)
[2] Malloc implementations I've seen require at least 16 bytes of overhead
per object, so you get 16 + 4 + 1 vs. 16 + 1


Yeah, and more importantly, people trying to mitigate buffer overflows
by allocating for the worst case will, of course, waste *far more* in
overhead on average.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sourceforge.net/
Nov 13 '05 #44
Paul Hsieh wrote:
Richard Heathfield <in*****@address.co.uk.invalid> wrote in message
news:<3f******@news2.power.net.uk>...
Kevin Easton wrote:
> Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:
>> In article <MP************************@news.sf.sbcglobal.net> ,
>> Paul Hsieh <qe*@pobox.com> wrote: [...]
>>>As to being fast -- that's impossible unless the functions are
>>>absolutely
>>>trivial. The C-library basically imposes an additional minimum O(n)
>>>on all non-trivial string manipulations.
>>
>> Can you give an example of a nontrivial string manipulation that
>> doesn't
>> already have O(n) time? I strongly suspect that anything you can come
>> up with could be done with the C string library with no additional
>> overhead.
>
> Concatenation of a string of length m with a string of length n is
> O(n+m) using strcat, but O(m) if you use a string type that has its
> length explicitly stored, rather than indicated by a sentinel.


Using no additional overhead [1], remember


Remember?!?!? Remember where?


In a size_t object.
In one of your processor's 6 precious
registers?
<shrug> The number of registers my processors have is not something that
concerns me when I'm writing portable code. For all I know, the program
might be running on Peter Seebach.
You also have to *remember* how much memory you have
allocated and make sure you don't spill over as well, BTW.
Thanks for reminding me. It had quite slipped my mind.
Oh yes,
and if you are communicating with a library are you going to pass
these remembered quantities around along with the string data?
That would be wise, don't you agree?
Or
will you let it work it all out with strlen by itself?
That depends on the library, of course.
Of course it's
kind of hard to deduce the actual amount of memory from this
information so you either have to figure it all out from the caller
(thus duplicating some of the logic of the library) or you have to
pass it as a parameter (burning an additional register or stack.)
Yes. This is called "programming".
Or you could screw it and just buffer overflow like everyone else
does.


Can't be bothered.
This can all be managed perfectly satisfactorily using C strings and
temporary variables.


Which my library (and others) is living proof of, of course. Of
course trying to do it all by hand yourself without a centralized
library ... well you read about the weekly buffer overflow attacks
that get reported to www.securityfocus.com or Risks Digest or
www.news.com to see what happens when you try to do that.


I've never seen any of my production programs reported there yet.

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #45
In <ne********************@tomato.pcug.org.au> Kevin Easton <kevin@-nospam-pcug.org.au> writes:
thing). The question is why builtin C strings use a sentinel method
rather than a length/end-pointer method to indicate their extent[%] - are
there any downsides to the latter?


On the PDP11, strcpy is simpler and faster with null-terminated strings;
here's the complete implementation, assuming the arguments are passed in
registers (DST is the register receiving the first argument, SRC is the
register receiving the second argument and R0 contains the return value):

STRCPY: MOV   DST, R0
LOOP:   MOVB  (SRC)+, (DST)+
        BNE   LOOP
        RET
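
For readers who don't read PDP-11 assembly, here is a rough C rendering of
the same loop. It is only a sketch: my_strcpy is a made-up name chosen to
avoid clashing with the library function, and the comments map each line
back to the instructions above.

```c
/* Copy bytes, including the terminating '\0', from src to dst.
   The loop condition is the value just copied, mirroring MOVB/BNE. */
char *my_strcpy(char *dst, const char *src)
{
    char *ret = dst;                     /* MOV DST, R0 */
    while ((*dst++ = *src++) != '\0')    /* MOVB (SRC)+, (DST)+ */
        ;                                /* BNE LOOP */
    return ret;                          /* RET */
}
```

The point is that the copy and the end-of-string test are the same
operation; a counted copy needs a separate counter to decrement and test.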

But the real reason must be searched elsewhere. Languages using counted
strings provide a higher level API for string manipulation, i.e. they
take care of allocation issues in a transparent fashion and the character
count specifies not only the string length but also the size of the
space allocated to the string. If you copy a string, space for the
destination string will be automatically allocated, if you shrink a
string, the additional bytes will be automatically reclaimed by the
run time system. OTOH, such languages don't have pointers that can
point in the middle of a string and be effectively used as substrings.

The last sentence above also hints at the advantage of C strings:
flexibility with minimum overhead:

char *path = "/foo/bar/baz.c";
char *file = strrchr(path, '/');
if (file == NULL) file = path;
else file++;

With counted strings, the above is impossible: a new string has to
be created to hold the file name.

C strings are well suited to a language like C, the only glitch is the
return value of strcpy and strcat: a pointer to the null character in the
destination string would be a lot more useful when concatenating together
many short strings.
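
A minimal sketch of the return value described above, assuming a made-up
function name copy_end (some systems offer a non-standard stpcpy with this
behaviour). Each call returns a pointer to the terminator it just wrote, so
the next piece can be appended there without rescanning:

```c
/* Like strcpy, but returns a pointer to the terminating '\0' written
   into dst, so the next append can start there without a rescan.
   The caller must make sure dst has room for everything appended. */
char *copy_end(char *dst, const char *src)
{
    while ((*dst = *src) != '\0') {
        dst++;
        src++;
    }
    return dst;
}
```

Chained use would be p = copy_end(buf, "foo"); p = copy_end(p, "/");
p = copy_end(p, "bar"); and the total work stays proportional to the bytes
copied rather than growing with each strcat rescan.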

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #46
Dan Pop <Da*****@cern.ch> wrote:
[...]
But the real reason must be searched elsewhere. Languages using counted
strings provide a higher level API for string manipulation, i.e. they
take care of allocation issues in a transparent fashion and the character
count specifies not only the string length but also the size of the
space allocated to the string. If you copy a string, space for the
destination string will be automatically allocated, if you shrink a
string, the additional bytes will be automatically reclaimed by the
run time system. OTOH, such languages don't have pointers that can
point in the middle of a string and be effectively used as substrings.

The last sentence above also hints at the advantage of C strings:
flexibility with minimum overhead:

char *path = "/foo/bar/baz.c";
char *file = strrchr(path, '/');
if (file == NULL) file = path;
else file++;

With counted strings, the above is impossible: a new string has to
be created to hold the file name.
I was thinking about something more like a struct-that-isn't (similar to
_Complex, in some ways?) - where

_String path = "/foo/bar/baz.c";

creates path with a pointer to the start of the string literal and a
length of 14 - when you do:

_String file = strrchr(path, '/');

strrchr would return a _String with an internal pointer to the last / of
the string literal, and a length of 6 (so both _String objects reference
the same memory - more like augmented pointers than fully encapsulated
strings).
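
In today's C this idea can only be approximated with an ordinary struct.
The following is a sketch under that assumption: str_view and view_rchr are
made-up names, and the view does not own the memory it points into.

```c
#include <stddef.h>

/* Hypothetical pointer-plus-length view; does not own its memory. */
struct str_view {
    const char *ptr;
    size_t      len;
};

/* Return the suffix of s starting at the last occurrence of c,
   or a view with ptr == NULL if c does not occur. */
struct str_view view_rchr(struct str_view s, char c)
{
    struct str_view r = { NULL, 0 };
    size_t i = s.len;
    while (i > 0) {
        i--;
        if (s.ptr[i] == c) {
            r.ptr = s.ptr + i;      /* shares memory with s */
            r.len = s.len - i;
            break;
        }
    }
    return r;
}
```

With path set to { "/foo/bar/baz.c", 14 }, view_rchr(path, '/') yields a
view of length 6 starting at the last '/', and no new string is allocated.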
C strings are well suited to a language like C, the only glitch is the
return value of strcpy and strcat: a pointer to the null character in the
destination string would be a lot more useful when concatenating together
many short strings.


It would - it would also have been nice to have the limit-pointer
versions like strlcat().

- Kevin.

Nov 13 '05 #47
In <ne********************@tomato.pcug.org.au> Kevin Easton <kevin@-nospam-pcug.org.au> writes:
Dan Pop <Da*****@cern.ch> wrote:
[...]
But the real reason must be searched elsewhere. Languages using counted
strings provide a higher level API for string manipulation, i.e. they
take care of allocation issues in a transparent fashion and the character
count specifies not only the string length but also the size of the
space allocated to the string. If you copy a string, space for the
destination string will be automatically allocated, if you shrink a
string, the additional bytes will be automatically reclaimed by the
run time system. OTOH, such languages don't have pointers that can
point in the middle of a string and be effectively used as substrings.

The last sentence above also hints at the advantage of C strings:
flexibility with minimum overhead:

char *path = "/foo/bar/baz.c";
char *file = strrchr(path, '/');
if (file == NULL) file = path;
else file++;

With counted strings, the above is impossible: a new string has to
be created to hold the file name.


I was thinking about something more like a struct-that-isn't (similar to
_Complex, in some ways?) - where

_String path = "/foo/bar/baz.c";

creates path with a pointer to the start of the string literal and a
length of 14 - when you do:

_String file = strrchr(path, '/');

strrchr would return a _String with an internal pointer to the last / of
the string literal, and a length of 6 (so both _String objects reference
the same memory - more like augmented pointers than fully encapsulated
strings).


If you think about it deeper, you'll realise that it would take too much
complexity hidden behind a single language feature. You have to support
all the pointer operations on the _String type, but also provide special
operations for manipulating the pointer component and the length component
separately (e.g. you need to point your _String to some allocated memory
block or to truncate your _String). The semantics of == are also
"interesting". The more I think about it, the more I see the
complexities of C++ creeping into C ;-)

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #48
On Wed, 23 Jul 2003 07:47:08 +0000, Richard Heathfield wrote:
James Antill wrote:
strcpy: Most commonly used for buffer overflows,

That's a little unfair on strcpy. If the programmer is careful (as all
programmers should be), strcpy is perfectly safe.
That's a little optimistic, there are very few cases where you couldn't
just as easily use memcpy() ... that aren't errors.


I disagree (although of course that might just mean that I have less
experience of fighting malware than you do). I find strcpy to have


I meant that often you need to find the length and sanity check it
anyway, so you almost always have all the inputs you need for a call to
memcpy()
expressive power, which is why I prefer it to memcpy when strings are
involved.


This is nice, like using NULL instead of 0, the problem comes when you
have a length metadata variable that is implicitly part of the call (Ie.
things change if/when you alter it) ... but doesn't appear in the
arguments.
as with all the str*
functions to create data the two inputs cannot be the same.

Why would you want to copy a string onto itself?


I've seen code like...

strcpy(s1, s1 + 1);


Um, yes, I've seen code like that too. My LART had memmove written on it (on
the bit just surrounding the sticky-out nail), in large letters. Once the
blood had stopped flowing out quite so freely.
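
For the record, the reason strcpy(s1, s1 + 1) is wrong is that strcpy has
undefined behaviour when source and destination overlap, while memmove is
specified to handle overlap. A minimal sketch of the in-place version
(drop_first is a made-up helper name):

```c
#include <string.h>

/* Remove the first character of s in place. The source and destination
   overlap, so memmove is required; strcpy(s, s + 1) is undefined. */
void drop_first(char *s)
{
    size_t n = strlen(s);
    if (n > 0)
        memmove(s, s + 1, n);   /* n bytes: the n-1 remaining chars plus '\0' */
}
```

This also shows the earlier point about metadata: once the length is known
for the memmove, there is nothing left for strcpy to discover.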


*breaks into song* ... "If I had a LART, I'd LART all over this world."

Of course there's six string functions to add data (including s(n)printf())
and only one memmove().

--
James Antill -- ja***@and.org
Need an efficient and powerful string library for C?
http://www.and.org/vstr/

Nov 13 '05 #49
In <pa****************************@and.org> "James Antill" <ja***********@and.org> writes:
On Wed, 23 Jul 2003 10:49:39 +0000, Dan Pop wrote:
All because the c
library string APIs are deficient ... which is pretty much what was
argued.
The only deficiency I can see is that strcpy and strcat (and friends)
have a (mostly) useless return value. For the rare cases when this is


That's the only deficiency?
Maybe you meant that's the only deficiency in the example. Arbitrary
sized source, source with NUL characters, substituting data, removing
parts of the data or dynamically working out what size the destination
needs to be to hold all the data. These are all handled poorly or not at
all.


You're badly missing the point of C strings. They are not supposed to
provide a general solution to *any* text manipulation problem. If you
need Perl, you know where to find it.
a problem, sprintf provides a solution without needing to reinvent
anything and without having to take the overhead of repetitive strcat()
calls (sprintf has its own overhead, but it is constant per call).
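
A minimal sketch of the technique being described, using the bounded
snprintf from C99 instead of sprintf so that truncation can be detected.
The helper name join_path and the "%s/%s" layout are illustrative
assumptions, not something from the posts above:

```c
#include <stdio.h>

/* Build "dir/name" in out (outsize bytes) with one formatted write
   instead of a strcpy followed by repeated strcat calls.
   Returns 0 on success, -1 on truncation or output error. */
int join_path(char *out, size_t outsize, const char *dir, const char *name)
{
    int n = snprintf(out, outsize, "%s/%s", dir, name);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

Each piece is copied once and the destination is never rescanned; the price
is the format-parsing overhead per call that the paragraph mentions.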


1. A lot of people don't normally see sprintf()/snprintf() used like this,
and so it's much easier for them to understand something that looks like
strcpy()/strncpy() with the correct semantics.


Arguments based on people's incompetence are bogus. Especially in a case
like this, where it is trivial to figure out what happens, even if you
aren't familiar with the technique.
2. People who sometimes use sprintf()/snprintf() in this way screw it up
enough that I would recommend something easier to use.
See above. People can easily misuse each and every feature of the
language and its library.
3. The constant overhead for sprintf() is non-trivial, so you might as
well use the stpcpy() solution anyway ...
Only if, after profiling, you have determined that this is the performance
bottleneck of your application. Only fools microoptimise before
determining whether it is necessary or not. Unlike sprintf(), stpcpy()
is not a standard library function. Therefore, its usage reduces the
code readability, which is not acceptable without a *good* reason.
or think ahead and use something
better where other people have already written/tested the extra functions
for you.


Same comment as above: using extra functions reduces the code readability.
So, there must be a compelling reason for using them.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #50
