Bytes | Software Development & Data Engineering Community
efficient string tokenizer

I'm looking for a fast way to split a string into individual tokens.
The typical format of the string is token1|token2|token3|...|tokenN|
and the number of tokens varies (which is why I use a vector to hold
the tokens). My current implementation uses the following method to
split up the string:
vector<char*> tokenize(char* str, char delim)
{
    vector<char> theString;
    vector<char*> tokenList;
    for (int i = 0; i < (int)strlen(str); i++)
    {
        if (str[i] == delim)
        {
            char *pChar = new char[theString.size() + 1];
            memset(pChar, 0, theString.size() + 1);

            for (int f = 0; f < (int)theString.size(); f++)
            {
                pChar[f] = theString[f];
            }
            tokenList.push_back(pChar);
            theString.clear();
        }
        else
        {
            theString.push_back(str[i]);
        }
    }
    theString.clear();
    return tokenList;
}

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?
Jul 22 '05 #1
10 replies, 8109 views
Alex wrote:
[quoted original post snipped]

The current method constructs two vectors of presumably 0 size,
scans the memory block, and on each iteration pushes an element onto
the end of a vector (which could cause either vector to grow
multiple times in the course of this function). Also, every
iteration of the loop calls strlen(), and every time the delimiter is
encountered memory is dynamically allocated for 'pChar'
(which I assume is deleted later). Finally, a copy may well occur when
the function returns, since tokenList has to be copied into its
destination variable.

First, I would remove the call to strlen() from the comparison, and use
a variable that stores the value returned by it. Technically, you don't
even need the call to strlen() since you are scanning the string
anyways. You could also reduce the number of times the vector has to
grow by reserve()ing some memory, and you would also gain some speed
from eliminating the dynamic allocation of memory. Maybe you could pass
a reference or pointer to the destination vector<char*> to the function
to prevent invoking a copy routine.

I hope I was of help, a lot of my C++ knowledge seems to hide itself in
the depths of my memory when I am not currently working with it.

Good luck,
Ernest
Jul 22 '05 #2
"Alex" <ab****@ncsu.edu> wrote in message
news:a7**************************@posting.google.c om...
[quoted original post snipped]
I'd remove the memset() call since it's setting all of the memory to zero
when you really only need the last element of the array set to zero. Also,
I'd move the call to strlen() so it isn't called at every iteration of the
loop. If you wanted to go over the top, you might write code like this:

inline char* toarray( const vector<char>& v ) {
    char* ret = new char[ v.size() + 1 ];
    if ( !v.empty() ) {
        memcpy( ret, &v[0], v.size() );
    }
    ret[ v.size() ] = '\0';
    return ret;
}

vector<char*> tokenize( char* in, char delim ) {
    vector<char*> tokens;
    vector<char> currToken;

    size_t len = strlen( in );
    // some reasonable, arbitrary guesses for sizes
    tokens.reserve( len );
    currToken.reserve( 8 );

    for ( size_t i = 0; i < len; ++i ) {
        if ( in[i] == delim ) {
            tokens.push_back( toarray(currToken) );
            currToken.clear();
        }
        else {
            currToken.push_back( in[i] );
        }
    }

    tokens.push_back( toarray(currToken) );
    return tokens;
}

But I probably wouldn't bother with all of that. Also, I think it would be
simpler if the code used std::string or std::vector<char> instead of C-style
strings, because C-style strings can be a headache and code that uses
std::vector<char*> could accidentally leak memory when an exception occurs.

--
David Hilsee
Jul 22 '05 #3
On 31 Jul 2004 21:36:51 -0700, Alex <ab****@ncsu.edu> wrote:
[quoted original post snipped]
Quite a few.

1) Don't call strlen every time around the loop (in fact don't call strlen
at all).

2) Don't use a temporary vector (theString).

3) Don't use memset, when you are simply overwriting the same characters
later on.

4) Don't return vectors by value, pass a reference instead.

5) Consider using vector<string> instead of vector<char*>; it's easier
to handle and likely to be more efficient as well if you have an
implementation that uses the short string optimisation.

Overall this should be a lot more efficient:

void tokenize(const char* str, char delim, vector<string>& tokenList)
{
    tokenList.clear();
    int start = -1;
    for (int i = 0; str[i]; ++i)
    {
        if (str[i] == delim)
        {
            tokenList.push_back(string(str + start + 1, str + i));
            start = i;
        }
    }
}

This is untested code.

john
Jul 22 '05 #4
On 31 Jul 2004 21:36:51 -0700, ab****@ncsu.edu (Alex) wrote:
[quoted original post snipped]
Here is yet another alternative to go alongside all the other
suggestions:
// Tokenising an input string using a stringstream

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>

std::vector<std::string> tokenize(const std::string& tok_str,
                                  char delim) {
    std::vector<std::string> all_toks;
    std::stringstream my_sstream(tok_str);
    std::string temp;

    while (getline(my_sstream, temp, delim)) {
        all_toks.push_back(temp);
    } // end while

    return all_toks;
} // end tokenize()

int main() {
    const char SEPARATOR = '|';
    const std::string THE_STRING =
        "tokenA|tokenB|tokenC|tokenD|tokenE";
    std::vector<std::string> my_tokens;

    my_tokens = tokenize(THE_STRING, SEPARATOR);

    for (std::vector<std::string>::size_type i = 0; i < my_tokens.size(); ++i) {
        std::cout << "Token " << i << " = \""
                  << my_tokens[i] << "\"\n";
    } // end for

    system("pause");
    return EXIT_SUCCESS;
} // end main()

Points:
1 Using getline() avoids reinventing the wheel.
2 This option makes a copy of the input string into the stringstream,
which is to be avoided if the string is large.
3 Better to have the vector of parsed tokens as a reference
parameter; that avoids yet another copy.
4 IIRC a function containing declarations cannot usually be inlined.
rossum

--

The ultimate truth is that there is no Ultimate Truth
Jul 22 '05 #5
On Mon, 02 Aug 2004 02:36:37 +0100, rossum <ro******@coldmail.com> wrote:
[quoted posts snipped]
It also has different behaviour from the original code.

tokens = tokenise("abc|123", '|');

produces only one token with the original code but two tokens with yours.

I wouldn't be surprised if the original specification was mistaken, however.

john
Jul 22 '05 #6
On 31 Jul 2004 21:36:51 -0700, ab****@ncsu.edu (Alex) wrote:
[quoted post trimmed]
char *pChar = new char[theString.size() + 1];
That line above is likely to be the main performance bottleneck -
allocating new memory for each token. There are approaches that avoid
this.
memset(pChar, 0, theString.size() + 1);

for (int f = 0; f < (int)theString.size(); f++)
{
pChar[f] = theString[f];
}
tokenList.push_back(pChar);
theString.clear();
The use of theString is another bottleneck - likely it causes further
memory allocation each time around.
[rest of quoted post snipped]


Inlining wouldn't help in any case - you'd save a few instructions vs.
the thousands that the method might have to execute (memory allocation
is generally expensive). Here's a high-performance variant:

#include <vector>
#include <cstring>
#include <cstddef>

// NB this modifies the passed string!
std::vector<char*> tokenize(char* str, char delim)
{
    std::vector<char*> tokenList;
    tokenList.reserve(10);
    std::size_t const length = std::strlen(str);
    std::size_t start = 0;
    for (std::size_t i = 0; i != length; ++i)
    {
        if (str[i] == delim)
        {
            str[i] = '\0';
            tokenList.push_back(str + start);
            start = i + 1;
        }
    }
    tokenList.push_back(str + start);
    return tokenList;
}

#include <iostream>
#include <algorithm>
#include <iterator>

int main()
{
    char const s[] = "1|2|3|10|| ";
    char* p = new char[sizeof s];
    std::strcpy(p, s);
    std::vector<char*> tokens = tokenize(p, '|');
    std::copy(
        tokens.begin(),
        tokens.end(),
        std::ostream_iterator<char const*>(std::cout, "*")
    );
    std::cout << '\n';
    delete[] p;
}

Less error prone implementations are possible if you write a class to
represent a token, but overall I like the above solution as long as
sufficient care is taken. Another point is that if the string ends in
a token, you get an empty token at the end of the vector, which may or
may not be desirable. You might also improve performance on compilers
that don't implement the named return value optimization by passing
the tokenList vector in by reference rather than returning it.
Alternatively, you could use an output iterator as the last parameter.

Finally, it might be worth changing the signature to:

std::vector<char*> tokenize(char*& str, char delim);

since this will rule out the biggest danger: accidentally passing a
string literal directly to the function.

Tom
Jul 22 '05 #7
On Mon, 02 Aug 2004 06:58:35 +0100, "John Harrison"
<jo*************@hotmail.com> wrote:
[quoted posts snipped]

Well spotted John, I hadn't noticed that in my brief testing.

Thanks,

rossum

--

The ultimate truth is that there is no Ultimate Truth
Jul 22 '05 #8
On Mon, 02 Aug 2004 13:12:21 +0100, tom_usenet
<to********@hotmail.com> wrote:

[...]
You might also improve performance on compilers
that don't implement the named return value optimization by passing
the tokenList vector in by reference rather than returning it.


Could you elaborate on 'named return value optimization'?

Mark
--
[ C++ FAQ: http://www.parashift.com/c++-faq-lite/ ]
Jul 22 '05 #9
Thanks to everyone who responded. All your responses proved to be
very helpful. The new tokenizer I am using is a hybrid of all the
suggestions and now takes roughly 0.76 ms, whereas the original took
3.69 ms. The final method I am using is the following:

void tokenizer(const char* str, char delim, vector<char*>& tokenList)
{
    tokenList.clear();
    int start = 0;
    for (int i = 0; str[i]; i++)
    {
        if (str[i] == delim)
        {
            char* token = new char[i - start + 1];
            memcpy(token, &str[start], i - start);
            token[i - start] = '\0';
            tokenList.push_back(token);
            start = i + 1;  // advance past the delimiter, or every later
                            // token would begin with the delimiter itself
        }
    }
}
Jul 22 '05 #10
On Mon, 02 Aug 2004 22:24:13 -0400, Mark <ma**********@hotmail.com>
wrote:
[...]
Could you elaborate on 'named return value optimization'?


Read all about it
http://groups.google.co.uk/groups?q=...ptimization%22

(or if you have the standard, read 12.8/15, or read the draft version:
the last paragraph on this page
http://www.csci.csusb.edu/dick/c++std/cd2/special.html)

Tom
Jul 22 '05 #11
