st********@gmail.com said:
<snip>
Pretty much what I'm trying to do is create a tokenizer or a (strtok)
and yes I would like to build an array of
pointers that points to each word.
Okay. (To me, the term "double pointer" means the same as "pointer to
double", just as "char pointer" means "pointer to char", so your use of it
to mean "pointer to pointer", whilst far from unusual, does seem rather
strange to me. Just a thought.)
Please advise.
Divide and conquer.
Let's start by counting the number of words in the input string - and we
will define 'word' as 'a sequence, as long as possible, of one or more
contiguous characters containing no whitespace characters'. Whilst we
*could* count as we go along, it's easier if we know in advance how many
pointers we will need. So:
#include <stddef.h>
#include <ctype.h>
#define NOT_IN_WORD 0
#define IN_WORD 1
size_t wc(const char *s)
{
size_t count = 0;
int state = NOT_IN_WORD;
int space = 0;
while(*s != '\0')
{
space = isspace((unsigned char)*s);
if(state == NOT_IN_WORD && !space)
{
++count;
state = IN_WORD;
}
else if(state == IN_WORD && space)
{
state = NOT_IN_WORD;
}
++s;
}
return count;
}
Now let's test that. I added a simple driver, which looks like this:
#include <stdio.h>
int main(int argc, char **argv)
{
while(argc > 0)
{
size_t count = wc(argv[--argc]);
printf("%lu\n", (unsigned long)count); /* %d is wrong for size_t */
}
return 0;
}
Well, perhaps not the greatest test program in the world, but it allowed me
to drive the code and convince myself that I'd got it right. Okay, now we
have a way - wc() - to count the number of words in a string. Fine - now
we'd like to point at them. Remember, divide and conquer - so we'd like a
function that can build a sequence of these pointers, rather than have all
the intestines of this idea clogging up main.
Because we want to *tokenise* the string, in this simple solution we will
allow our function to modify the string itself. Note that this isn't
always what you want - it's "lazy" tokenisation, where we simply point
into the string at various points. That's fine as long as the string
persists (and as long as we don't mind hacking at it!).
Here, then, is a function to do that. Observe its similarities and
differences with regard to wc() - which it itself calls, by the way:
#include <stdlib.h>
#include <assert.h>
char **wl_build(char *s)
{
int space = 0;
int state = NOT_IN_WORD;
size_t thisword = 0;
size_t wordcount = wc(s);
char **wl = malloc((wordcount + 1) * sizeof *wl);
if(wl != NULL)
{
while(*s != '\0')
{
space = isspace((unsigned char)*s);
if(state == NOT_IN_WORD && !space)
{
wl[thisword++] = s;
state = IN_WORD;
}
else if(state == IN_WORD && space)
{
*s = '\0'; /* terminate the token */
state = NOT_IN_WORD;
}
++s;
}
wl[thisword] = NULL; /* list is NULL-terminated */
assert(thisword == wordcount);
}
return wl;
}
The alternative to doing this whole state machine thing twice is to count
and reallocate as we go. Possible, and microscopically faster, but perhaps
more effort than we'd like to go to, and the code would be messier, harder
to follow, and harder to maintain.
Once we have a way of creating such a list, we ought to have a way to
destroy it:
void wl_destroy(char ***wl)
{
if(wl != NULL)
{
free(*wl);
*wl = NULL;
}
}
And of course we will want a demonstration of how to use it. One easy way
to do that is to write a print function for it:
#include <stdio.h>
void wl_print(char **wl)
{
while(*wl != NULL)
{
printf("%s\n", *wl++);
}
}
Putting it all together, we throw away our old main(), and write a new one:
int main(int argc, char **argv)
{
while(argc > 0)
{
char **wl = wl_build(argv[--argc]);
if(wl != NULL) /* wl_build can fail if malloc fails */
{
wl_print(wl);
}
wl_destroy(&wl);
}
return 0;
}
Now, that's all very well if we're content to do lazy evaluation. But what
if we're not? Let's say we need these tokens to persist beyond the
lifetime of the string. Oh, and hey, let's say the string is sacrosanct,
too - we can look, but we'd better not touch.
Very few modifications are required, as it happens, to what we already
have, but all the mods are important. Perhaps most significant is the
addition of a token-duplicating function, since it is this that allows us
to create token copies that will persist after their originals have either
ceased to exist or at least changed in some way.
The creation of token copies allows us to leave the original string
unmodified, but it slightly complicates the destruction of the word list.
#include <stddef.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#define NOT_IN_WORD 0
#define IN_WORD 1
size_t wc(const char *s)
{
size_t count = 0;
int state = NOT_IN_WORD;
int space = 0;
while(*s != '\0')
{
space = isspace((unsigned char)*s);
if(state == NOT_IN_WORD && !space)
{
++count;
state = IN_WORD;
}
else if(state == IN_WORD && space)
{
state = NOT_IN_WORD;
}
++s;
}
return count;
}
char *token_duplicate(const char *p, size_t len)
{
char *new = malloc(len + 1);
if(new != NULL)
{
memcpy(new, p, len);
new[len] = '\0';
}
return new;
}
#include <assert.h>
char **wl_build(const char *s)
{
int space = 0;
int state = NOT_IN_WORD;
size_t thisword = 0;
size_t wordcount = wc(s);
const char *tokenstart = NULL;
size_t tokenlen = 0;
char **wl = malloc((wordcount + 1) * sizeof *wl);
if(wl != NULL)
{
while(*s != '\0')
{
space = isspace((unsigned char)*s);
if(state == NOT_IN_WORD && !space)
{
tokenstart = s;
tokenlen = 0;
state = IN_WORD;
}
else if(state == IN_WORD)
{
++tokenlen;
if(space)
{
/* NB: token_duplicate() yields NULL on memory exhaustion;
production code would want to check for that */
wl[thisword++] = token_duplicate(tokenstart, tokenlen);
state = NOT_IN_WORD;
}
}
++s;
}
/* the last token may be null-terminated rather than space-terminated */
if(state == IN_WORD)
{
wl[thisword++] = token_duplicate(tokenstart, ++tokenlen);
}
wl[thisword] = NULL; /* list is NULL-terminated */
assert(thisword == wordcount);
}
return wl;
}
void wl_destroy(char ***wl)
{
if(wl != NULL)
{
if(*wl != NULL)
{
char **w = *wl;
while(*w != NULL)
{
free(*w++);
}
}
free(*wl);
*wl = NULL;
}
}
void wl_print(char **wl)
{
while(*wl != NULL)
{
printf("%s\n", *wl++);
}
}
int main(int argc, char **argv)
{
while(argc > 0)
{
char **wl = wl_build(argv[--argc]);
if(wl != NULL) /* wl_build can fail if malloc fails */
{
wl_print(wl);
}
printf("Integrity of arg: %s\n", argv[argc]);
wl_destroy(&wl);
}
return 0;
}
--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999