On May 9, 1:18 pm, Jacek Dziedzic
<jacek.dziedzic.n.o.s.p....@gmail.com> wrote:
Quote:
I need a routine like:
std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if 's' contains too few words, return ""
// 'words' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t\t".
}
Quote:
I am currently using something like:
Quote:
std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >> s;
if(!ss.good()) return ""; // eof
Just a detail, but good() may be false after reading the last
word correctly. What you want here is:
if ( ! ss ) {
return "" ;
}
(If you do this, you don't have to add the extra space at the
end of the initializer of ss.)
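To see the difference, here is a small sketch (the names ReadState and read_one_word are mine, purely for illustration) that reads one word and reports the stream state afterwards. With no trailing whitespace, the extraction succeeds but eofbit is set, so good() is false while the stream still tests true.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, for illustration only: reads one word from `text`
// and records the stream state right after the extraction.
struct ReadState {
    std::string word ;
    bool good ;   // ss.good(): false as soon as any bit is set, eofbit included
    bool ok ;     // !ss.fail(): the condition that `if ( ss )` tests
} ;

ReadState read_one_word( std::string const& text )
{
    std::istringstream ss( text ) ;
    ReadState result ;
    ss >> result.word ;
    result.good = ss.good() ;
    result.ok = !ss.fail() ;
    return result ;
}
```

For the input "last", this should report the word "last" with good false and ok true; only an input with no word at all makes ok false.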
Even better would be to move the condition up into the loop
condition. Something like:
    std::istringstream ss( source ) ;
    std::string s ;
    while ( ss >> s && n > 0 ) {
        -- n ;
    }
    return ss ? s : std::string() ;
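Wrapped into the function from the question, that might look as follows. This is only a sketch; the extraction is placed before the counter test so that n == 0 still reads one word, and no trailing space is appended to the source.

```cpp
#include <sstream>
#include <string>

// Returns the n-th (0-based) whitespace-separated word of `source`,
// or an empty string if `source` contains too few words.
std::string nth_word( std::string const& source, unsigned int n )
{
    std::istringstream ss( source ) ;
    std::string s ;
    // Extract first, then test the counter: the loop stops either when
    // the wanted word is in `s` or when an extraction fails.
    while ( ss >> s && n > 0 ) {
        -- n ;
    }
    // The stream tests false only if the last extraction found no word.
    return ss ? s : std::string() ;
}
```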
Quote:
which is fine, except it performs poorly. Before I'm flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files on the order of GB and it takes annoyingly
long...
Quote:
I'm thinking of a combination of find_first_not_of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?
For starters, you say you're processing a text file. Do you
often call nth_word on the same string, with different values of
n? If so, you're doing a lot of duplicate work; if you extract
every word, you've basically changed an O(n) algorithm into an
O(n^2). If performance is an issue, that's the first thing I'd
consider.
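As a rough illustration, the sketch below pulls every word out of a line in a single pass; the helper name all_words is an assumption for this example. Indexing the result costs nothing extra, whereas calling nth_word( line, k ) for each k rescans the line from the start every time.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: one pass over the line, so the cost is
// proportional to the length of the line, not to (words * length).
std::vector< std::string > all_words( std::string const& line )
{
    std::vector< std::string > words ;
    std::istringstream ss( line ) ;
    std::string s ;
    while ( ss >> s ) {
        words.push_back( s ) ;
    }
    return words ;
}
```

A caller that needs several fields of each line would call all_words once per line and then index the vector, after checking words.size().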
Other than that, std::istringstream does a lot of copying. I've
occasionally used something like:
    template< std::ctype_base::mask mask >
    class CTypeIs : public std::unary_function< char, bool >
    {
    public:
        typedef std::ctype< char >
                            CType ;
        CTypeIs( std::locale const& l = std::locale() )
            :   myCType( &std::use_facet< CType >( l ) )
        {
        }
        bool operator()( char ch ) const
        {
            return myCType->is( mask, ch ) ;
        }
    private:
        CType const*        myCType ;
    } ;
    void
    split(
        std::vector< std::string >&
                            dest,
        std::string const&  source )
    {
        static CTypeIs< std::ctype_base::space > const
                            isBlank ;
        dest.clear() ;
        std::string::const_iterator
                            end = source.end() ;
        std::string::const_iterator
                            current =
            std::find_if( source.begin(), end,
                          std::not1( isBlank ) ) ;
        while ( current != end ) {
            std::string::const_iterator
                                start = current ;
            current = std::find_if( current, end, isBlank ) ;
            dest.push_back( std::string( start, current ) ) ;
            current = std::find_if( current, end,
                                    std::not1( isBlank ) ) ;
        }
    }
to break up a line into fields.
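For readers whose compiler no longer provides std::unary_function and std::not1, the same loop can be sketched with lambdas. This is not the code from my tools library, just an equivalent under the assumption that C++11 or later is available; the name split_words is made up for the example.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Sketch of the same splitting loop, with two lambdas standing in for
// CTypeIs and std::not1. The facet is taken from a copy of the global locale.
void split_words( std::vector< std::string >& dest, std::string const& source )
{
    std::locale loc ;   // copy of the current global locale, keeps the facet alive
    std::ctype< char > const& ct = std::use_facet< std::ctype< char > >( loc ) ;
    auto isBlank = [&ct]( char ch ) { return ct.is( std::ctype_base::space, ch ) ; } ;
    auto notBlank = [&ct]( char ch ) { return !ct.is( std::ctype_base::space, ch ) ; } ;

    dest.clear() ;
    std::string::const_iterator end = source.end() ;
    // Skip leading whitespace, then alternate: end of word, start of next word.
    std::string::const_iterator current = std::find_if( source.begin(), end, notBlank ) ;
    while ( current != end ) {
        std::string::const_iterator start = current ;
        current = std::find_if( current, end, isBlank ) ;
        dest.push_back( std::string( start, current ) ) ;
        current = std::find_if( current, end, notBlank ) ;
    }
}
```

Called on "These are four\t\twords\t\t", this should fill dest with four strings; leading, trailing and repeated whitespace produce no empty fields.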
CTypeIs is, of course, in my usual tools library, so I don't
have to write it every time. Note too that it does nothing to
guarantee the lifetime of the facet it is using---this is up to
the caller, but in practice is almost never a problem. Also,
the actual object is a local static variable, and so won't be
constructed until the function is first called, which gives
the user time to set the global locale.
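The timing can be checked with a small sketch; LocaleSnapshot, snapshots_taken and snapshot_name are names invented for this illustration. The object records the name of the global locale when it is constructed, and the counter shows that construction happens on the first call and only then.

```cpp
#include <locale>
#include <string>

int snapshots_taken = 0 ;   // counts constructions, for the demonstration only

// Illustration only: remembers the name of the global locale in effect
// at the moment the object is constructed.
struct LocaleSnapshot {
    std::string name ;
    LocaleSnapshot() : name( std::locale().name() ) { ++ snapshots_taken ; }
} ;

std::string const& snapshot_name()
{
    // A function-local static is initialized the first time control
    // passes through its declaration, not at program start-up.
    static LocaleSnapshot const snapshot ;
    return snapshot.name ;
}
```

Any call to std::locale::global made before the first call to snapshot_name is therefore seen by the object; later changes are not.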
Depending on what you are doing with the words, building the
vector may or may not be necessary as well. In the end, if you
can work directly with the two iterators which define the word,
you can avoid any copy whatsoever.
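A sketch of that idea, applied to the original nth_word problem: the function below returns the two iterators delimiting the n-th word and never builds a string. The names WordRange, is_blank, not_blank and nth_word_range are assumptions for this example, and it uses std::isspace from the C library rather than a locale facet, to keep the sketch short.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

typedef std::pair< std::string::const_iterator,
                   std::string::const_iterator > WordRange ;

// std::isspace needs a value representable as unsigned char.
static bool is_blank( char ch )
{
    return std::isspace( static_cast< unsigned char >( ch ) ) != 0 ;
}

static bool not_blank( char ch )
{
    return !is_blank( ch ) ;
}

// Hypothetical helper: returns [first, second) for the n-th (0-based) word
// of `source`, or the empty range (end, end) if there are too few words.
// The iterators are only valid while `source` is alive and unmodified.
WordRange nth_word_range( std::string const& source, unsigned int n )
{
    std::string::const_iterator end = source.end() ;
    std::string::const_iterator start = std::find_if( source.begin(), end, not_blank ) ;
    while ( start != end ) {
        std::string::const_iterator stop = std::find_if( start, end, is_blank ) ;
        if ( n == 0 ) {
            return WordRange( start, stop ) ;
        }
        -- n ;
        start = std::find_if( stop, end, not_blank ) ;
    }
    return WordRange( end, end ) ;
}
```

The caller can compare or parse the range in place, and construct a std::string from the two iterators only when a copy is really needed.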
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34