Bytes | Software Development & Data Engineering Community

string tokenizing

I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:

#include <algorithm>
#include <cctype>
#include <climits>
#include <deque>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int
main()
{
string delim;
int c;

/* fill delim */
for(c=0; c < CHAR_MAX; c++){ // I tried #include <limits>, but failed...
if((isspace(c) || ispunct(c))
&& !(c == '_' || c == '#'))
delim += c;
}

string buf;
string::size_type op, np;
deque<string> tok;

while(std::getline(cin, buf) && !cin.fail()){
op = 0;
while((np=buf.find_first_of(delim, op)) != buf.npos){
tok.push_back(string(&buf[op], np-op));
if((op=buf.find_first_not_of(delim, np)) == buf.npos)
break;
}
tok.push_back(string(&buf[op]));

cout << buf << endl;
copy(tok.begin(), tok.end(), ostream_iterator<string>(cout, "\n"));
cout << endl;
tok.clear();
}
return 0;
}

The inner loop basically finds tokens delimited by any character in
delim where multiple delimiters may appear between tokens (algorithm
follows some advice found on clc++). However, the method seems a little
clumsy, especially with respect to temporary objects. (Also, it does not
seem to work correctly. For example, the last token gets corrupted in
the second outer loop iteration.)

Also, it would be very nice to have a function like

int tokenize(const string& s, container<string>& c);

which returns the number of tokens, inserted into the container.
However, how do you write this so c is any container model? I'm not sure
you can since they don't share a base class. Is there any better way?

Certainly, this is easy to do with a mix of C and C++:

for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
tok.push_back(t);

where buf and delim are essentially char*'s. However, this seems
unsatisfactory as well.

/david

--
Andre, a simple peasant, had only one thing on his mind as he crept
along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
-- unknown
Jul 19 '05 #1
"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:
Remarks below.

#include <algorithm>
#include <cctype>
#include <climits>
#include <deque>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int
main()
{
string delim;
int c;

/* fill delim */
for(c=0; c < CHAR_MAX; c++)

Is there a particular reason you're excluding the
value 'CHAR_MAX' from the loop? (using < instead of <=)
{ // I tried #include <limits>, but failed...
What happened?

#include <limits>

std::numeric_limits<char>::max();

should work.

More below.
if((isspace(c) || ispunct(c))
&& !(c == '_' || c == '#'))
delim += c;
}

string buf;
string::size_type op, np;
deque<string> tok;

while(std::getline(cin, buf) && !cin.fail()){
op = 0;
while((np=buf.find_first_of(delim, op)) != buf.npos){
tok.push_back(string(&buf[op], np-op));
if((op=buf.find_first_not_of(delim, np)) == buf.npos)
break;
}
tok.push_back(string(&buf[op]));

cout << buf << endl;
copy(tok.begin(), tok.end(), ostream_iterator<string>(cout, "\n"));
cout << endl;
tok.clear();
}
return 0;
}

The inner loop basically finds tokens delimited by any character in
delim where multiple delimiters may appear between tokens (algorithm
follows some advice found on clc++). However, the method seems a little
clumsy, especially with respect to temporary objects. (Also, it does not
seem to work correctly. For example, the last token gets corrupted in
the second outer loop iteration.)

Also, it would be very nice to have a function like

int tokenize(const string& s, container<string>& c);

which returns the number of tokens, inserted into the container.
However, how do you write this so c is any container model? I'm not sure
you can since they don't share a base class. Is there any better way?
I find your code interesting, so I'll probably play around
with it for a bit, and let you know if I have any ideas.

But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.

template <typename T>
typename T::size_type tokenize(const std::string& s,
typename T::iterator beg,
typename T::iterator end)
{
}

For inserting new elements, you can use an iterator
adapter, e.g. std::insert_iterator. You could even
use an output stream as a 'container' using
ostream_iterator.

Certainly, this is easy to do with a mix of C and C++:

for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
tok.push_back(t);
This contradicts your parameter type of const reference to string,
since 'strtok()' modifies its argument.
where buf and delim are essentially char*'s. However, this seems
unsatisfactory as well.


Yes, 'strtok()' can be problematic, if only for the reason that
it modifies its argument, necessitating creation of a copy if
you want to keep the argument const.

HTH,
-Mike
Jul 19 '05 #2
"Mike Wahler" <mk******@mkwahler.net> wrote in message
news:Fk*****************@newsread4.news.pas.earthlink.net...
template <typename T>
T::size_type tokenize(const std::string& s,
T::iterator beg,
T::iterator end)
{
}


Oops, I meant to make those iterator parameters const refs
as well

const T::iterator& beg, const T::iterator& end

-Mike
Jul 19 '05 #3
Mike Wahler wrote:

[snip]
{ // I tried #include <limits>, but failed...
What happened?


foo.cc:9: limits: No such file or directory

For some reason, my compiler can't find the file. Otherwise, I agree
with you...
#include <limits>

std::numeric_limits<char>::max();

should work.
[snip] But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>. template <typename T>
T::size_type tokenize(const std::string& s,
T::iterator beg,
T::iterator end)
{
}


This is a nice idea! I knew you should be able to do this, but I
couldn't see how. Here is the refactored code:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
{
string word;
string::size_type sp, ep; // start/end position

sp = 0;
do{
sp = buf.find_first_not_of(delim, sp);
ep = buf.find_first_of(delim, sp);
if(sp != ep){
if(ep == buf.npos)
ep = buf.length();
word = buf.substr(sp, ep-sp);
*ii++ = lc(word); // lc() is a helper not shown in the post
sp = buf.find_first_not_of(delim, ep+1);
}
}while(sp != buf.npos);

if(sp != buf.npos){
word = buf.substr(sp, buf.length()-sp);
*ii++ = lc(word);
}
}

called as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

The original spec returned the number of tokens parsed. Now I have to
settle for checking

if(tokens.size() > 0){ ... }

/david

Jul 19 '05 #4
David Rubin wrote:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
{
string::size_type sp, ep; // start/end position

ep = -1;

do{
sp = buf.find_first_not_of(delim, ep+1);
ep = buf.find_first_of(delim, sp);
if(sp != ep){
if(ep == buf.npos)
ep = buf.length();
*ii++ = buf.substr(sp, ep-sp);
}
}while(sp != buf.npos);
}

That's better. The 'ep+1' is a small optimization. I'm not sure it makes
any difference, and really, starting with ep=0 makes the code a bit
clearer.

/david

Jul 19 '05 #5

"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
Mike Wahler wrote:

[snip]
{ // I tried #include <limits>, but failed...
What happened?


foo.cc:9: limits: No such file or directory

For some reason, my compiler can't find the file.


Configuration problem, installation problem, or perhaps
simply not provided by your implementation. Which one
are you using? More below.
Otherwise, I agree
with you...
#include <limits>

std::numeric_limits<char>::max();

should work.
[snip] But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.

template <typename T>
T::size_type tokenize(const std::string& s,
T::iterator beg,
T::iterator end)
{
}


This is a nice idea!

Yes it is. But not my idea. I "stole" it from
the standard library, the design of which is imo
rich with Good Ideas.

If you don't have the Josuttis book, get it.
www.josuttis.com/libbook

I knew you should be able to do this, but I
couldn't see how. Here is the refactored code:
[snip code] (I didn't look at it very closely, so if there are
obvious errors, I didn't see them)
called as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

The original spec returned the number of tokens parsed. Now I have to
settle for checking

if(tokens.size() > 0){ ... }


or

if(!tokens.empty())

which *might* improve performance, and imo is
more expressive.

-Mike
Jul 19 '05 #6
"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
David Rubin wrote:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
You might squeeze out some more performance by making
that last parameter a const reference.

{
string::size_type sp, ep; // start/end position

ep = -1;

do{
sp = buf.find_first_not_of(delim, ep+1);
ep = buf.find_first_of(delim, sp);
if(sp != ep){
if(ep == buf.npos)
ep = buf.length();
*ii++ = buf.substr(sp, ep-sp);
}
}while(sp != buf.npos);
}

That's better. The 'ep+1' is a small optimization. I'm not sure it makes
any difference, and really, starting with ep=0 makes the code a bit
clearer.


"I love it when a plan comes together."
-George Peppard, as "Hannibal" in "The A Team"

:-)

-Mike
Jul 19 '05 #7
Mike Wahler wrote:

tokenize(const string& buf, const string& delim, InsertIter ii)

You might squeeze out some more performance by making
that last parameter a const reference.


Sorry to bother you. I was wondering how making it a const reference
would help performance?
Jul 19 '05 #8
"SomeDumbGuy" <ab***@127.0.0.1> wrote in message
news:bb******************@nwrddc02.gnilink.net...
Mike Wahler wrote:

tokenize(const string& buf, const string& delim, InsertIter ii)

You might squeeze out some more performance by making
that last parameter a const reference.


Sorry to bother you. I was wondering how making it a const reference
would help performance?


Note the word "might". Depending upon the implementation,
an iterator might be a simple pointer, but it also could
be an elaborate large structure, in which case passing
by reference could be faster than passing a copy by
value, and would still probably be just as fast as
pass by value if the iterator is represented with a
pointer type.

Since OP's quest was to 'generalize' for any container
type (via a template), we cannot know what the actual
representation of the iterator will be.

Also the same iterator type might be implemented in different
ways among library implementations, some large and complex,
some not. A reference can prevent possible performance
degradation, without having to be concerned whether it's
actually an issue.

The 'const' part of my suggestion doesn't really have anything
to do with performance. I say 'const' reference, since the
function does not need to modify the iterator's value. This
is the recommendation for any parameter of non-built-in type
which a function does not modify, especially when the parameter
type is a template argument, since you can't know how large and
complex it might be.

The "traditional wisdom" concerning parameter types is
essentially:

For non-built-in types or "unknown" types (specified by
a template argument), pass by const reference by default.

If the function needs to modify the caller's argument,
regardless of type, pass by nonconst reference.

If the parameter is always a built-in type and the function
need not modify the caller's argument, pass by value or
const reference.

Or something like that. :-)

Does that help?

-Mike
Jul 19 '05 #9
In article <3F***************@nomail.com>, bo***********@nomail.com
says...
I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:


This may look a bit odd at first, but it works quite nicely. Its basic
idea is to hijack the tokenizer built into the standard iostream
classes, and put it to our purposes. The main thing necessary to do
that is to create a facet that classifies our delimiters as "space" and
everything else as not-"space". Once we have a stream using our facet,
we can simply read tokens from the stream and use them as we please.

Here's the code:

#include <iostream>
#include <deque>
#include <sstream>
#include <iterator>

#include "facet"

template <class T>
class delims : public ctype_table<T>
{
public:
delims(size_t refs = 0)
: ctype_table<T>(ctype_table<T>::empty)
{
for (int i=0; i<table_size; i++)
if((isspace(i) || ispunct(i)) && !(i == '_' || i == '#'))
table()[widen(i)] = mask(space);
}
};

int main() {
std::string buf;
std::locale d(std::locale::classic(), new delims<char>);

while ( std::getline(std::cin, buf)) {
// deque to hold tokens.
std::deque<std::string> tok;

// create istringstream from the input string and have it use our facet.
std::istringstream is(buf);
is.imbue(d);

std::istream_iterator<std::string> in(is), end;

// copy tokens from our stream into a deque
std::copy(in, end,
std::back_inserter<std::deque<std::string> >(tok));

// show tokens, one per line.
std::copy(tok.begin(), tok.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}

return 0;
}

Note that if we knew we were going to use std::cin for this, we could
just imbue std::cin with the locale we created, and read directly from
it into the deque:

// same facet as above.

int main() {
std::locale d(std::locale::classic(), new delims<char>);
std::cin.imbue(d);

std::istream_iterator<std::string> in(std::cin), end;
std::ostream_iterator<std::string> out(std::cout, "\n");

std::copy(in, end, out);
return 0;
}

Of course, even if you're planning to put the tokens into some
container, you can still imbue the input stream with the locale instead
of reading from stream to string, then imbuing a stringstream with the
locale.

The facet header I'm using contains some code I posted a while back --
it looks like this:

#include <locale>
#include <algorithm>

template<class T>
class table {
typedef typename std::ctype<T>::mask tmask;

tmask *t;
public:
table() : t(new tmask[std::ctype<T>::table_size]) {}
~table() { delete [] t; }
tmask *the_table() { return t; }
};

template<class T>
class ctype_table : table<T>, public std::ctype<T> {
protected:
typedef typename std::ctype<T>::mask tmask;

enum inits { empty, classic };

ctype_table(size_t refs = 0, inits init=classic)
: std::ctype<T>(the_table(), false, refs)
{
if (classic == init)
std::copy(classic_table(),
classic_table()+table_size,
the_table());
else
std::fill_n(the_table(), table_size, mask());
}
public:
tmask *table() {
return the_table();
}
};

This handles most of the dirty work of creating a facet, so about all
you're left with is specifying what characters you want treated as
delimiters, and then telling the stream to use a locale that includes
the facet.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Jul 19 '05 #10
Mike Wahler wrote:
You might squeeze out some more performance by making
that last parameter a const reference.
Sorry to bother you. I was wondering how making it a const reference
would help performance?

Note the word "might". Depending upon the implementation,
an iterator might be a simple pointer, but it also could
be an elaborate large structure, in which case passing
by reference could be faster than passing a copy by
value, and would still probably be just as fast as
pass by value if the iterator is represented with a
pointer type.


The "traditional wisdom" concerning parameter types is
essentially:

For non-built-in types or "unknown" types (specified by
a template argument), pass by const reference by default.

If the function needs to modify the caller's argument,
regardless of type, pass by nonconst reference.

If the parameter is always a built-in type and the function
need not modify the caller's argument, pass by value or
const reference.

Or something like that. :-)

Does that help?


Yes. :)

Jul 19 '05 #11
Jerry Coffin wrote:
This may look a bit odd at first, but it works quite nicely. Its basic
idea is to hijack the tokenizer built into the standard iostream
classes, and put it to our purposes. The main thing necessary to do
that is to create a facet that classifies our delimiters as "space" and
everything else as not-"space". Once we have a stream using our facet,
we can simply read tokens from the stream and use them as we please.
Interesting idea. It seems rather complex compared to the code I posted
though. One nice thing about your solution is that you encapsulate
delims in a class. I suppose you could extend the class a bit with
various methods to extend the delimiters (equivalent to isspace,
ispunct, etc), although this is reasonably accomplished via subclassing.
What are the other benefits of this approach compared to mine?

Also, I have a few questions...

[snip] int main() {
std::locale d(std::locale::classic(), new delims<char>);
Can you explain the role of locales in a little more detail? Can't you
just skip this part in most cases?
std::cin.imbue(d);

std::istream_iterator<std::string> in(std::cin), end;
std::ostream_iterator<std::string> out(std::cout, "\n");

std::copy(in, end, out);
How does end function here? I've only seen copy specified with iterators
associated with a container. In this case, in and end seem to have no
association with each other.
return 0;
}


/david

Jul 19 '05 #12
Mike Wahler wrote:

"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
David Rubin wrote:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
You might squeeze out some more performance by making
that last parameter a const reference.


Actually, I tried defining this function as

template <typename InsertIter>
void tokenize(const string& buf, const string& delim, InsertIter& ii);

(and const InsertIter&) with little success. I got compiler errors both
times. For example, with the non-const reference, I got:

; g++ tokenize.cc
tokenize.cc: In function `int main()':
tokenize.cc:31: error: could not convert `
insert_iterator<std::deque<std::basic_string<char,
std::char_traits<char>,
std::allocator<char> >, std::allocator<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > >((&tokens),
std::deque<_Tp, _Alloc>::begin() [with _Tp = std::string, _Alloc =
std::allocator<std::string>]())' to `
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> > >&'
tokenize.cc:12: error: in passing argument 3 of `void tokenize(const
std::string&, const std::string&, InsertIter&) [with InsertIter =
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> >]'


when invoked (12) as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens,
tokens.begin()));

(g++ 3.3.1, Solaris 2.6).

Jul 19 '05 #13
"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
[...]
(note: I cannot use boost).
[...]


Why can't you use Boost? Legal department?

Dave

Jul 19 '05 #14

"David Rubin" <bo***********@nomail.com> wrote in message
news:3F**************@nomail.com...
Mike Wahler wrote:

"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
David Rubin wrote:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)


You might squeeze out some more performance by making
that last parameter a const reference.


Actually, I tried defining this function as

template <typename InsertIter>
void tokenize(const string& buf, const string& delim, InsertIter& ii);

(and const InsertIter&) with little success. I got compiler errors both
times. For example, with the non-const reference, I got:

; g++ tokenize.cc
tokenize.cc: In function `int main()':
tokenize.cc:31: error: could not convert `
insert_iterator<std::deque<std::basic_string<char,
std::char_traits<char>,
std::allocator<char> >, std::allocator<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > >((&tokens),
std::deque<_Tp, _Alloc>::begin() [with _Tp = std::string, _Alloc =
std::allocator<std::string>]())' to `
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> >
>&'

tokenize.cc:12: error: in passing argument 3 of `void tokenize(const
std::string&, const std::string&, InsertIter&) [with InsertIter =
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> >
>]'


when invoked (12) as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens,
tokens.begin()));


If it's really an issue for you, I'll take a look and
see if I can locate the problem.

-Mike
Jul 19 '05 #15
Mike Wahler wrote:
[...] I got compiler errors both
times. For example, with the non-const reference, I got:
[snip] If it's really an issue for you, I'll take a look and
see if I can locate the problem.


I'd appreciate it. I just don't understand what the compiler is doing.
Thanks.

/david

Jul 19 '05 #16

"David Rubin" <bo***********@nomail.com> wrote in message
news:3F**************@nomail.com...
Mike Wahler wrote:
[...] I got compiler errors both
times. For example, with the non-const reference, I got:

[snip]
If it's really an issue for you, I'll take a look and
see if I can locate the problem.


I'd appreciate it. I just don't understand what the compiler is doing.


I didn't get exactly the same diagnostics, but I did get
complaints about 'insert_iterator::operator++()', because
it modifies the iterator, so we cannot make it a const
reference, but it works for me as a nonconst reference:

#include <algorithm>
#include <deque>
#include <iterator>
#include <string>
template <typename InsertIter>
void tokenize(const std::string& buf,
const std::string& delim,
InsertIter& ii)
{
std::string::size_type sp(0); /* start position */
std::string::size_type ep(-1); /* end position */

do
{
sp = buf.find_first_not_of(delim, ep + 1);
ep = buf.find_first_of(delim, sp);

if(sp != ep)
{
if(ep == buf.npos)
ep = buf.length();

*ii++ = buf.substr(sp, ep-sp);
}

} while(sp != buf.npos);
}

int main()
{
std::string buf("We* are/parsing [a---string");
std::string delim(" */[-");
std::deque<std::string> tokens;

tokenize(buf, delim, std::inserter(tokens, tokens.begin()));

std::copy(tokens.begin(), tokens.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));

return 0;

}
Output:

We
are
parsing
a
string
HTH,
-Mike

Jul 19 '05 #17
"Mike Wahler" <mk******@mkwahler.net> wrote in message
news:CG*****************@newsread4.news.pas.earthlink.net...

std::copy(tokens.begin(), tokens.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));


I forgot the obvious #include <iostream> :-)

(I suppose my implementation brought the decls in via
#include <iterator>, so didn't complain about no 'cout')

-Mike
Jul 19 '05 #18
In article <3F***************@nomail.com>, bo***********@nomail.com
says...

[ ... ]
Interesting idea. It seems rather complex compared to the code I posted
though.
Yes and no -- the framework for creating a facet is a bit more complex
than I'd like, but creating a facet is fairly easy and using it is
pretty nearly dead simple.
One nice thing about your solution is that you encapsulate
delims in a class. I suppose you could extend the class a bit with
various methods to extend the delimiters (equivalent to isspace,
ispunct, etc), although this is reasonably accomplished via subclassing.
What are the other benefits of this approach compared to mine?
Ease of use -- it supports normal stream extractors. Since the example
at hand was taking the tokens from a stream and putting them into a
deque, I used iterators and std::copy to copy from one to the other. It
could have been done with normal stream extraction though:

std::string token;

while (stream >> token)
std::cout << token << std::endl;

would have worked just fine for the job at hand as well. For that
matter, to read them and put them into the deque, we could have done
something like:

std::string token;
std::deque<std::string> tok;

while ( stream >> token)
tok.push_back(token);
To make a long story short, once you've imbued a stream with the facet
that defines the delimiters between tokens, you don't have to learn
anything new at all -- you're just extracting strings from a stream.
Also, I have a few questions...

[snip]
int main() {
std::locale d(std::locale::classic(), new delims<char>);


Can you explain the role of locales in a little more detail? Can't you
just skip this part in most cases?


A locale is basically a collection of facets (one facet each of a number
of different types). A stream, however, only knows about a complete
locale, not individual facets. Therefore, we create an otherwise
normal locale, but have it use our ctype facet. The rest of it probably
doesn't matter for the job at hand, but (at least TTBOMK) there's no
function that tells a stream to use one facet at a time.
std::cin.imbue(d);

std::istream_iterator<std::string> in(std::cin), end;
std::ostream_iterator<std::string> out(std::cout, "\n");

std::copy(in, end, out);


How does end function here? I've only seen copy specified with iterators
associated with a container. In this case, in and end seem to have no
association with each other.


An istream iterator that's not associated with a stream is basically the
equivalent of EOF -- another istream_iterator will become equal to it
when the end of the stream is encountered.

That's strictly related to using istream_iterator's though -- it's
completely independent of changing the facet to get the parsing we want.
Just for example, if we had a file of numbers (integers for the moment)
and we wanted to read those numbers into a vector, we could do it
similarly:

std::vector<int> v;

std::istream_iterator<int> in(std::cin), end;

std::copy(in, end, std::back_inserter<std::vector<int> >(v));

Similarly, if we have a file of floating point values that we wanted to
read into a list, we could do:

std::list<double> ld;

std::istream_iterator<double> in(std::cin), end;

std::copy(in, end, std::back_inserter<std::list<double> >(ld));

and so on.

To summarize: an istream_iterator allows us to treat the contents of a
stream like a collection, so we can apply a standard algorithm to its
contents (it's basically an input_iterator, so we can't, for example,
apply an algorithm that requires a random access iterator though).

Jul 19 '05 #19
Mike Wahler wrote:

[snip]
template <typename InsertIter>
void tokenize(const std::string& buf,
const std::string& delim,
InsertIter& ii)
[snip] int main()
{
std::string buf("We* are/parsing [a---string");
std::string delim(" */[-");
std::deque<std::string> tokens;

tokenize(buf, delim, std::inserter(tokens, tokens.begin()));


This only works with my compiler if I do

std::insert_iterator<std::deque<std::string> > ii(tokens,
tokens.begin());
tokenize(buf, delim, ii);

Otherwise, I get the same errors as before. I guess this means my
compiler is broken?

; g++ --version
g++ (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)

/david

Jul 19 '05 #20

"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
Mike Wahler wrote:

[snip]
template <typename InsertIter>
void tokenize(const std::string& buf,
const std::string& delim,
InsertIter& ii)
[snip]
int main()
{
std::string buf("We* are/parsing [a---string");
std::string delim(" */[-");
std::deque<std::string> tokens;

tokenize(buf, delim, std::inserter(tokens, tokens.begin()));


This only works with my compiler if I do

std::insert_iterator<std::deque<std::string> > ii(tokens,
tokens.begin());


But you *did* get it to work with the reference parameter
for the iterator, right?

Interesting about having to 'spell it out' like that.
Both ways worked for me, (I also tried 'std::back_inserter' and
std::back_insert_iterator, both of which also worked for me as well).

tokenize(buf, delim, ii);

Otherwise, I get the same errors as before. I guess this means my
compiler is broken?


Seems so to me. Have you checked for a newer version
of g++?

-Mike
Jul 19 '05 #21
"Mike Wahler" <mk******@mkwahler.net> writes:
"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...
Mike Wahler wrote:

[snip]
template <typename InsertIter>
void tokenize(const std::string& buf,
const std::string& delim,
InsertIter& ii)


[snip]
int main()
{
std::string buf("We* are/parsing [a---string");
std::string delim(" */[-");
std::deque<std::string> tokens;

tokenize(buf, delim, std::inserter(tokens, tokens.begin()));


This only works with my compiler if I do

std::insert_iterator<std::deque<std::string> > ii(tokens,
tokens.begin());


But you *did* get it to work with the reference parameter
for the iterator, right?

Interesting about having to 'spell it out' like that.
Both ways worked for me, (I also tried 'std::back_inserter' and
std::back_insert_iterator, both of which also worked for me as well).

tokenize(buf, delim, ii);

Otherwise, I get the same errors as before. I guess this means my
compiler is broken?


Seems so to me. Have you checked for a newer version
of g++?


FYI: both g++ 3.3.1 and the Intel compiler V7.0 reject the original version.
I guess the problem is that you are forming a non-const reference to a
temporary, which isn't allowed by the standard.
Changing the signature of tokenize to

void tokenize(const std::string& buf,
const std::string& delim,
const InsertIter& ii)

and copying ii inside tokenize to a local helper variable for the loop works.

HTH & kind regards
frank

--
Frank Schmitt
4SC AG phone: +49 89 700763-0
e-mail: frankNO DOT SPAMschmitt AT 4sc DOT com
Jul 19 '05 #22
"Frank Schmitt" <in*****@seesignature.info> wrote in message
news:4c************@scxw21.4sc...

FYI: both g++ 3.3.1 and the Intel compiler V7.0 reject the original version. I guess the problem is that you are forming a non-const reference to a
temporary, which isn't allowed by the standard.
Thanks. For some reason, that issue always seems to trip
me up. Perhaps I need to write this down a hundred times
on the chalkboard. :-)
Changing the signature of tokenize to

void tokenize(const std::string& buf,
const std::string& delim,
const InsertIter& ii)

and copying ii inside tokenize to a local helper variable for the loop

works.

Thanks,
-Mike
Jul 19 '05 #23
Frank Schmitt wrote:

[snip]
FYI: both g++ 3.3.1 and the Intel compiler V7.0 reject the original version.
I guess the problem is that you are forming a non-const reference to a
temporary, which isn't allowed by the standard.
This is what I suspected, but I was thrown off because the diagnostic
was so abstruse. I was expecting something more along the lines of
"cannot form a non-const reference to a temporary." Anyway, it's
interesting that MSVC++ 6.0 compiles the code without warning. Using MS
as a point of reference always raises the question of which is correct :-)
Changing the signature of tokenize to

void tokenize(const std::string& buf,
const std::string& delim,
const InsertIter& ii)

and copying ii inside tokenize to a local helper variable for the loop works.


What is the point of copying ii inside tokenize if you can just remove
the reference argument altogether and use pass-by-value? Isn't that the
same?

Much thanks,

/david

--
Andre, a simple peasant, had only one thing on his mind as he crept
along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
-- unknown
Jul 19 '05 #24
David Rubin <bo***********@nomail.com> writes:
Frank Schmitt wrote:
Changing the signature of tokenize to

void tokenize(const std::string& buf,
const std::string& delim,
const InsertIter& ii)

and copying ii inside tokenize to a local helper variable
for the loop works.


What is the point of copying ii inside tokenize if you can just remove
the reference argument altogether and use pass-by-value? Isn't that the
same?


Yes, of course - I guess I just have gotten so used to const references instead
of pass-by-value that it has become a reflex for me not to consider
pass-by-value :-)

kind regards
frank

--
Frank Schmitt
4SC AG phone: +49 89 700763-0
e-mail: frankNO DOT SPAMschmitt AT 4sc DOT com
Jul 19 '05 #25
Jerry Coffin wrote:
The facet header I'm using contains some code I posted a while back --
it looks like this:

#include <locale>
#include <algorithm>

template<class T>
class table {
    typedef typename std::ctype<T>::mask tmask;

    tmask *t;
public:
    table() : t(new std::ctype<T>::mask[std::ctype<T>::table_size]) {}
    ~table() { delete [] t; }
    tmask *the_table() { return t; }
};

template<class T>
class ctype_table : table<T>, public std::ctype<T> {
protected:
    typedef typename std::ctype<T>::mask tmask;

    enum inits { empty, classic };

    ctype_table(size_t refs = 0, inits init=classic)
        : std::ctype<T>(the_table(), false, refs)
    {
        if (classic == init)
            std::copy(classic_table(),
                      classic_table()+table_size,
                      the_table());
        else
            std::fill_n(the_table(), table_size, mask());
    }
public:
    tmask *table() {
        return the_table();
    }
};


I decided after a while that I liked your approach (using istringstream and
locale) better than my tokenize function. I was able to replace the
above code with

#include <algorithm>
#include <locale>

class ctype_table : public std::ctype<char> {
private:
mask tab[table_size];

protected:
enum Init {empty, classic};

ctype_table(Init type=classic) : std::ctype<char>(tab)
{
if (type == classic)
std::copy(classic_table(), classic_table()+table_size, tab);
else
std::fill_n(tab, table_size, space);
}

public:
mask *table() { return tab; }
};

You want to derive from std::ctype<char> rather than std::ctype<T> since
only the char specialization contains the functions and constants you
are using. Also, by deriving from std::ctype<T> [T=char], you can use
type mask and constant space freely (I think 'mask()' is a typo in your
code). Additionally, you don't really need the refs argument (at least
for my application). Lastly, I found on my platform that creating a
static table (tab) results in a smaller executable than allocating tab
off the heap (I assume this was the motivation for privately inheriting
from table).
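For readers following along, here is a minimal sketch of how a facet like this gets used with an istringstream (the `split` helper and the delimiter set are illustrative, not code from the thread):

```cpp
#include <algorithm>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// A ctype<char> facet whose table classifies ONLY the named delimiter
// characters as whitespace, so operator>> splits tokens at them.
class delim_ctype : public std::ctype<char> {
    mask tab[table_size];
public:
    explicit delim_ctype(const std::string& delims) : std::ctype<char>(tab) {
        std::fill_n(tab, table_size, mask());  // classify nothing...
        for (std::string::size_type i = 0; i < delims.size(); ++i)
            tab[static_cast<unsigned char>(delims[i])] = space;  // ...except delimiters
    }
};

// Illustrative helper: imbue a string stream with the facet and let
// operator>> do the tokenizing. The locale takes ownership of the
// facet (refs defaults to 0), so no explicit delete is needed.
std::vector<std::string> split(const std::string& text, const std::string& delims)
{
    std::istringstream in(text);
    in.imbue(std::locale(in.getloc(), new delim_ctype(delims)));
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok)
        tokens.push_back(tok);
    return tokens;
}
```

For example, `split("We* are/parsing [a---string", " */[-")` yields the five tokens We, are, parsing, a, string.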

/david

--
Andre, a simple peasant, had only one thing on his mind as he crept
along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
-- unknown
Jul 19 '05 #26
David Rubin <bo***********@nomail.com> wrote in message news:<3F***************@nomail.com>...

[ ... ]
I decided after a while that I liked your approach (using istringstream and
locale) better than my tokenize function. I was able to replace the
above code with
[ code elided ]
You want to derive from std::ctype<char> rather than std::ctype<T> since
only the char specialization contains the functions and constants you
are using.
I'm on my laptop right now, so I don't have the standard handy to
check with, but I don't remember using anything that shouldn't work
with wchar_t, etc., as well.
Also, by deriving from std::ctype<T> [T=char], you can use
type mask and constant space freely (I think 'mask()' is a typo in your
code).
The use of mask() was intentional, and you'll almost certainly get all
sorts of strange errors if you try to substitute just "mask" where I
used mask(). Where I used mask(), it was to create a
default-initialized mask object with which to initialize the objects
in the array. Using mask instead would result only in compiler
errors, because you're specifying a type where it wants an object.
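The distinction can be shown in a couple of lines (a sketch; the numeric values of the classification bits are implementation-defined, but value-initialization yields a mask with no bits set):

```cpp
#include <locale>

// 'mask' names a type; 'mask()' is an expression that value-initializes
// an object of that type, i.e. a mask with NO classification bits set.
// Such a mask classifies a character as nothing at all -- not a space,
// not a digit, not anything -- which is what an "empty" table entry
// should say.
typedef std::ctype_base::mask cmask;

const cmask no_class = cmask();                 // no bits set
const cmask is_space = std::ctype_base::space;  // the whitespace bit
```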
Additionally, you don't really need the refs argument (at least
for my application).
For this application, that's probably right. That part of the code
was written with an eye to generality, not specifically for this
application.
Lastly, I found on my platform that creating a
static table (tab) results in a smaller executable than allocating tab
off the heap (I assume this was the motivation for privately inheriting
from table).


The result can be smaller code, or a _lot_ smaller code -- like none
at all. The header is not required to initialize table_size, and with
an implementation that doesn't initialize it _in the header_, your
code won't compile.

The private inheritance was because table only exists to ensure that
the initialization gets done in the right order. There's no reason to
support casting back to table or anything like that.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Jul 19 '05 #27
Jerry Coffin wrote:

David Rubin <bo***********@nomail.com> wrote in message news:<3F***************@nomail.com>...

[ ... ]
I decided after a while that I liked your approach (using istringstream and
locale) better than my tokenize function. I was able to replace the
above code with
[ code elided ]
You want to derive from std::ctype<char> rather than std::ctype<T> since
only the char specialization contains the functions and constants you
are using.


I'm on my laptop right now, so I don't have the standard handy to
check with, but I don't remember using anything that shouldn't work
with wchar_t, etc., as well.


For example, table_size, classic_table(), and the constructor taking a
const mask* argument are only defined in std::ctype<char> AFAIK.
Also, by deriving from std::ctype<T> [T=char], you can use
type mask and constant space freely (I think 'mask()' is a typo in your
code).


The use of mask() was intentional, and you'll almost certainly get all
sorts of strange errors if you try to substitute just "mask" where I
used mask(). Where I used mask(), it was to create a
default-initialized mask object with which to initialize the objects
in the array. Using mask instead, would result only in compiler
errors because you're specifying a type where it wants an object.


I was suggesting that you use 'space' rather than plain 'mask', which, of
course, would give a compile error. My understanding is that mask()
will create a mask temporary initialized to zero (since it's an integer
type). My *guess* (although there is no guarantee) is that mask() is
equivalent to space in most implementations. Even if it's not, an
'empty' table would then be full of spaces rather than some
implementation-defined value.
Additionally, you don't really need the refs argument (at least
for my application).


For this application, that's probably right. That part of the code
was written with an eye to generality, not specifically for this
application.


Agreed, but then you don't include a 'delete-when-done' argument, and
you reversed refs and init when you call the std::ctype<T> constructor
in your implementation of delim.
Lastly, I found on my platform that creating a
static table (tab) results in a smaller executable than allocating tab
off the heap (I assume this was the motivation for privately inheriting
from table).


The result can be smaller code, or a _lot_ smaller code -- like none
at all. The header is not required to initialize table_size, and with
an implementation that doesn't initialize it _in the header_, your
code won't compile.


This is a subtle point. I don't have the standard in front of me, but
isn't this covered by C++PL3ed, 12.2.2:

Class objects are constructed from the bottom up: first the base,
then the members, and then the derived class itself.

This suggests to me that table_size is initialized (at least) by the
base class constructor, and is therefore available when the tab member
is "constructed"...
The private inheritance was because table only exists to ensure that
the initialization gets done in the right order. There's no reason to
support casting back to table or anything like that.


...Otherwise, I agree.

/david

--
Andre, a simple peasant, had only one thing on his mind as he crept
along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
-- unknown
Jul 19 '05 #28
In article <3F***************@nomail.com>, bo***********@nomail.com
says...

[ ... ]
I'm on my laptop right now, so I don't have the standard handy to
check with, but I don't remember using anything that shouldn't work
with wchar_t, etc., as well.
For example, table_size, classic_table(), and the constructor taking a
const mask* argument are only defined in std::ctype<char> AFAIK.


Doing some looking, you're right. I may need to re-think the code a
bit.

[ ... ]
I was suggesting that you use 'space' rather than 'mask', which, of
course, will give you a compile error. My understanding is that mask()
will create a mask temporary initialized to zero (since it's an integer
type). My *guess* (although there is no guarantee) is that mask() is
equivalent to space in most implementations.
Your guess is wrong, AFAIK. mask() creates a value that basically says
the character doesn't fit _any_ classification. I.e. it's not a space
or a digit or alphabetic, or control, or anything else. mask is
required to be a bitmask type, and if no bits are set, it doesn't
classify the character as anything at all.
Even if it's not, an
'empty' table would then be full of spaces rather than some
implementation-defined value.
...which would utterly _ruin_ its usefulness. The whole idea is to
produce a table that ONLY classifies a character as a space (for
example) if you say it should be a space. Filling the table
with a value that said everything was a space would produce utterly
useless results -- when you extract from an istream, it skips across
anything its locale says is a space character, so doing this would
produce a ctype that always skipped across all input.

[ ... ]
This is a subtle point. I don't have the standard in front of me, but
isn't this covered by C++PL3ed, 12.2.2:

Class objects are constructed from the bottom up: first the base,
then the members, and then the derived class itself.

This suggests to me that table_size is initialized (at least) by the
base class constructor, and is therefore available when the tab member
is "constructed"...


Theoretically that might cover it. Practically speaking, a number of
compilers fail when/if you try to use table_size as the size of an
array. Since I don't care to ignore those compilers, my alternative is
to write code that works with them.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Jul 19 '05 #29
