473,573 Members | 2,922 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

_efficiently_ obtain n-th word from a std::string


Hi!

I need a routine like:

std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if 's' contains too few words, return ""
// 'words' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t \t".
}

I am currenlty using something like:

std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
}
return s;
}

which is fine, except it performs poorly. Before I'm flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...

I'm thinking of a combination of find_first_not_ of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?

TIA,
- J.
May 9 '07 #1
11 2883
Jacek Dziedzic wrote:
>
Hi!

I need a routine like:

std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if 's' contains too few words, return ""
// 'words' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t \t".
}

I am currenlty using something like:

std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
}
return s;
}

which is fine, except it performs poorly. Before I'm flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...

I'm thinking of a combination of find_first_not_ of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?
Nah. If efficiency is critical, I have found the istream interface a
bad solution.

Nothing like a state machine.

// warning - brain dump - not compiled ever - not tested - use at your
own peril

#include <cctype>

std::string nth_word(const std::string& source, unsigned int n)
{

enum state_type { inspace, inword };

std::string::co nst_iterator l_iter = source.begin();
const std::string::co nst_iterator l_end = source.end();

if ( l_end == l_iter )
{
return "";
}
std::string::co nst_iterator l_beg_word = l_iter;

state_type state = ( std::isspace( * l_iter ) ) ? inspace : inword;

int word_count = state == inspace ? 0 : 1;

++ l_iter;

while ( l_end != l_iter )
{

switch ( state )
{
case inspace :

if ( std::isspace( * l_iter ) ) break;
state = inword;
l_beg_word = l_iter;
++ word_count;
break;

case inword :

if ( ! std::isspace( * l_iter ) ) break;
state = inspace;

if ( n == word_count )
{
return std::string( l_beg_word, l_iter );
}
break;
}

++ l_iter;
}

switch ( state )
{
case inspace :
return "";
case inword :
if ( n == word_count )
{
return std::string( l_beg_word, l_iter );
}
return "";
}

return "";
}
May 9 '07 #2
Gianni Mariani wrote:
Jacek Dziedzic wrote:
....
> which is fine, except it performs poorly. Before I'm flamed
....

oh - I forgot to mention, if you extend the state machine a little, you
can apply the state machine to an entire file. Then you can mmap the
entire file and never copy the file into and out of buffers.
May 9 '07 #3
Hi

Jacek Dziedzic wrote:
I need a routine like:

std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if 's' contains too few words, return ""
// 'words' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t \t".
}

I am currenlty using something like:

std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
}
return s;
}

which is fine, except it performs poorly. Before I'm flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...
I'm afraid that your efforts will not help much.
If your n-th word starts at position i in your text file, you will have
to inspect all preceding positions to know that (well, not entirely
true, in some circumstances you can skip a character, but that doesn't
really matter). If your text contains billions of characters, that might
take very long.

The single overhead that could be avoided is copying strings that you
don't need at all. It would of course be enough to iterate over the
string and count the number of words until you have found the n-th word.
If you go architecture dependent, mmx or other streaming instructions
might be able to help you (I really don't know... mmx has comparison
instructions for 8 packed bytes, but I wonder if this will provide any
speed-up, as you would have to check results for each byte)
I'm thinking of a combination of find_first_not_ of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?
I would not recommend any of those...

Use a string::iterato r. Inspect the individual characters. Have a
current state (word, non-word), update the state according to what you
read. Count toggles. Construct a string from the n-th word.

Markus
May 9 '07 #4
Markus Moll wrote:
Hi

Jacek Dziedzic wrote:
> I need a routine like:

std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if 's' contains too few words, return ""
// 'words' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t \t".
}

I am currenlty using something like:

std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
}
return s;
}

which is fine, except it performs poorly. Before I'm flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...

I'm afraid that your efforts will not help much.
If your n-th word starts at position i in your text file, you will have
to inspect all preceding positions to know that (well, not entirely
true, in some circumstances you can skip a character, but that doesn't
really matter). If your text contains billions of characters, that might
take very long.
Fortunately I'm only splitting individual lines into words.
I don't need to search through the file or anything like that,
only something like

for all lines in the file {
depending on word 0 in the line, extract some other word from the line
convert this word to a double, operate on it, output result
}
The single overhead that could be avoided is copying strings that you
don't need at all. It would of course be enough to iterate over the
string and count the number of words until you have found the n-th word.
If you go architecture dependent, mmx or other streaming instructions
might be able to help you (I really don't know... mmx has comparison
instructions for 8 packed bytes, but I wonder if this will provide any
speed-up, as you would have to check results for each byte)
I'd like that to be portable.
> I'm thinking of a combination of find_first_not_ of and
find_first_o f, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?

I would not recommend any of those...

Use a string::iterato r. Inspect the individual characters. Have a
current state (word, non-word), update the state according to what you
read. Count toggles. Construct a string from the n-th word.
So, Gianni's approach. I will try that, thanks.

- J.
May 9 '07 #5
Gianni Mariani wrote:
Gianni Mariani wrote:
>Jacek Dziedzic wrote:
...
>> which is fine, except it performs poorly. Before I'm flamed
...

oh - I forgot to mention, if you extend the state machine a little, you
can apply the state machine to an entire file. Then you can mmap the
entire file and never copy the file into and out of buffers.
thanks, I'll try that,
- J.
May 9 '07 #6
Jacek Dziedzic wrote:
Markus Moll wrote:
.....
>I'm afraid that your efforts will not help much.
If your n-th word starts at position i in your text file, you will
have to inspect all preceding positions to know that (well, not
entirely true, in some circumstances you can skip a character, but
that doesn't really matter). If your text contains billions of
characters, that might take very long.

Fortunately I'm only splitting individual lines into words.
I don't need to search through the file or anything like that,
only something like

for all lines in the file {
depending on word 0 in the line, extract some other word from the line
convert this word to a double, operate on it, output result
}
sounds like a job for a database.

select v4 where v0 = "THIS ONE";

Index on v0 and you can find all your values very quickly.

Try dumping your gigabytes of data into Postgresql or maybe sqllite.

I've used postgresql with a few hunded million rows and a few
"appropriat e" indexes and the answers pop out rather quickly.
May 9 '07 #7
Gianni Mariani wrote:
Gianni Mariani wrote:
>Jacek Dziedzic wrote:
...
>> which is fine, except it performs poorly. Before I'm flamed
...

oh - I forgot to mention, if you extend the state machine a little, you
can apply the state machine to an entire file. Then you can mmap the
entire file and never copy the file into and out of buffers.
And i the performances are really the problem, this would be much
better, even if the program structure can suffer a little bit. The
string is totally useless, and each new string is a "new", end each
"new" is waste of time.

Try using an istream_iterato r to directly iterate through your input
file in order to implement your state machine.

Regards,

Zeppe
May 9 '07 #8
Gianni Mariani wrote:
Jacek Dziedzic wrote:
>Markus Moll wrote:
....
>>I'm afraid that your efforts will not help much.
If your n-th word starts at position i in your text file, you will
have to inspect all preceding positions to know that (well, not
entirely true, in some circumstances you can skip a character, but
that doesn't really matter). If your text contains billions of
characters, that might take very long.

Fortunately I'm only splitting individual lines into words.
I don't need to search through the file or anything like that,
only something like

for all lines in the file {
depending on word 0 in the line, extract some other word from the line
convert this word to a double, operate on it, output result
}

sounds like a job for a database.

select v4 where v0 = "THIS ONE";

Index on v0 and you can find all your values very quickly.

Try dumping your gigabytes of data into Postgresql or maybe sqllite.

I've used postgresql with a few hunded million rows and a few
"appropriat e" indexes and the answers pop out rather quickly.
Thanks for the advice, but that would be way overkill.
What I do is perform postprocessing on raw results of
computer simulations. I usually have to process several
tens of files, each approx 1GB in size. These files are
only processed _once_ (if I don't mess up) so storing them
inside a database would not really help. Anyway, people
around me would give me weird looks if I wanted to put
all this GBs of numbers inside a database, I am a part
of a small minority of non-Fortran users here :), co C++
is weird enough :/.

thanks,
- J.
May 9 '07 #9
On May 9, 1:18 pm, Jacek Dziedzic
<jacek.dziedzic .n.o.s.p....@gm ail.comwrote:
I need a routine like:
std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if 's' contains too few words, return ""
// 'words' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t \t".
}
I am currenlty using something like:
std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
Just a detail, but good() may be false after reading the last
word correctly. What you want here is:

if ( ! ss ) {
return "" ;
}

(If you do this, you don't have to add the extra space at the
end of the initializer of ss.)

Even better would be to move the condition up into the loop
condition. Something like:

std::istringstr eam ss( source ) ;
std::string s ;
while ( n 0 && ss >s ) {
-- n ;
}
return ss ? s : std::string() ;
}
return s;
}
which is fine, except it performs poorly. Before I'm flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...
I'm thinking of a combination of find_first_not_ of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?
For starters, you say you're processing a text file. Do you
offen call nth_word on the same string, with different values of
n? If so, you're doing a lot of duplicate work; if extract
every word, you've basically changed an O(n) algorithm into an
O(n^2). If performance is an issue, that's the first thing I'd
consider.

Other than that, std::istringstr eam does a lot of copying. I've
occasionally used something like:

template< std::ctype_base ::mask mask >
class CTypeIs : public std::unary_func tion< char, bool >
{
public:
typedef std::ctype< char >
CType ;
CTypeIs( std::locale const& l = std::locale() )
: myCType( &std::use_facet < CType >( l ) )
{
}

bool operator()( char ch ) const
{
return myCType->is( mask, ch ) ;
}

private:
CType const* myCType ;
} ;

void
split(
std::vector< std::string >&
dest,
std::string const& source )
{
static CTypeIs< std::ctype_base ::space const
isBlank ;
dest.clear() ;
std::string::co nst_iterator
end = source.end() ;
std::string::co nst_iterator
current =
std::find_if( source.begin(), end,
std::not1( isBlank ) ) ;
while ( current != end ) {
std::string::co nst_iterator
start = current ;
current = std::find_if( current, end, isBlank ) ;
dest.push_back( std::string( start, current ) ) ;
current = std::find_if( current, end,
std::not1( isBlank ) ) ;
}
}

to break up a line into fields.

CTypeIs is, of course, in my usual tools library, so I don't
have to write it every time. Note too that it does nothing to
guarantee the lifetime of the facet it is using---this is up to
the caller, but in practice is almost never a problem. Also,
the actual object is a local variable, and so won't be
constructed until the function is actually called. Thus giving
the user time to set the global locale.

Depending on what you are doing with the words, building the
vector may or may not be necessary as well. In the end, if you
can work directly with the two iterators which define the word,
you can avoid any copy what so ever.

--
James Kanze (GABI Software) email:ja******* **@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientier ter Datenverarbeitu ng
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 10 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
23059
by: HB2 | last post by:
Is it possible to obtain the IP address of my desktop using a Visual Basic command?
6
22113
by: G. | last post by:
This is an obvious bug in the String.Replace function: //load a XML string into a document XmlDocument doc = new XmlDocument(); doc.LoadXml("<test id='' />"); //Obtain the string again...Converts my '' to ""...strange, but ok. String strXML = doc.OuterXml; //trying to convert back the single quotes to double quotes
4
2658
by: Ed Willis | last post by:
We have several offices that have a DSL or Cable modem where the IP address is dynamic and changes often. I created a VB app that obtains the IP address but it obtains the IP address within the firewall and we need the public IP address of the machine outside the firewall so we can use Remote Desktop connection for problem resolution etc. ...
4
1801
by: Daniel R. Rossnagel | last post by:
I can obtain Where clauses, order by, group by, of a consultation SQL, by means of some class similar to codemodel
0
6685
by: godsmustbcrazy | last post by:
Here is my problem. Brand new SQL Server 2005 Installation 1. Create a database "Test" 2. Create a database user "Domain\user" and set user mapping to "Test" with datareader, datawriter permissions 3. Look at Test ->Properties->Permissions activate "Domain\user" and click effective permissions and I get this error message
1
4437
by: VictorG | last post by:
Hello, The below C# code works fine in obtaining the windows user's account SID when the user is local to the machine. It throws a "Not Found" exception when trying top obtain the SID for a user who is on a domain, but logged in locally. Specifically, for a corporate domain - the user logs into the local desktop and has a local profile -...
1
1292
by: =?Utf-8?B?TWFydGluVA==?= | last post by:
I have a problem to obtain the children in IIS when I create a folder directly in c:\inetput\wwwroot\*. If a create with Visual Studio it's OK. This is the code I use to obtain the object IIS. Dim objAdmnIIS As System.DirectoryServices.DirectoryEntry Dim colctRepert As String() Dim objRpertEnf As System.DirectoryServices.DirectoryEntry Dim...
8
1326
by: ThunderMusic | last post by:
Hi, We need to serialize (binary) objects containing generic collections. The thing is, when we try to get the objects back (deserialize) with a different instance of the application, we receive an exception stating the constructor of the generic class does not exist. So Is there a way to obtain the resulting classes of the generics we use so...
3
1730
by: JurgenvonOerthel | last post by:
I want to replace one element in a list<stringby a list<stringand I need to obtain an iterator to the first element in the inserted list. In code: void replace_element_by_list(list<string&my_list, list<string>::iterator &iter_to_remove, const list<string> &list_to_insert) {
0
1362
by: Andrea | last post by:
Hi, I've a problem reading a field containing a crypted string. I read the field via JDBC but when I decrypt the string I obtain a sequence of strange characters (if I output them I obtain a sequence of ? characters). This problem occurs only with a database DB2 8.2 that has the following encoding: Database territory = US Database code...
0
7694
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language...
0
8202
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that...
1
7787
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...
0
8065
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
0
6419
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
0
3732
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
0
3733
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
2213
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1303
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.