read/parse flat file / performance / boost::tokenizer

Knackeback

task:
- read/parse CSV file

code snippet:
string key, line;
typedef tokenizer<char_separator<char> > tokenizer;
tokenizer tok(string(""), sep);
while ( getline(f, line) ) {
    ++lineNo;
    tok.assign(line, sep);
    short tok_counter = 0;
    for (tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        // check whether this token should be part of the key
        if ( ( idx = lineArr[tok_counter] ) != -1 ) {
            keyArr[idx] = *beg;
        }
        ++tok_counter;
    }
    // build the key, e.g. from the first and third tokens
    for (int i = 0; i < keySize; i++) {
        key += keyArr[i];
        key += delim;
    }
    m.insert(make_pair(key, LO(new Line(line, lineNo)))); // m is a multimap
    key.erase();
}

gprof hits:
  %   cumulative   self                self     total
 time   seconds   seconds     calls   s/call   s/call  name

16.89 0.50 0.50 2621459 0.00 0.00 bool boost::char_separator<char, std::char_traits<char> >::operator()<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)

11.99 0.85 0.35 24903838 0.00 0.00 boost::char_separator<char, std::char_traits<char> >::is_dropped(char) const

7.09 1.06 0.21 28508346 0.00 0.00 bool __gnu_cxx::operator!=<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)

problem:
I want to improve the performance of this code passage.

questions:
I hope the goal is reasonably clear: I want to read every line of a file into
the container as a line object (consisting of the line number and the line
content), identified by a key.
Any idea that improves the style or performance of this snippet is welcome!

Thomas

--
NO ePatents: http://swpat.ffii.org/index.de.html

Jul 22 '05 #1


John Harrison

All your performance bottlenecks seem to be inside the boost tokenizer
library. The obvious answer, then, is to replace that code with your own
custom code. The tokenizer library is a generic tokenizer, whereas you have
a specific requirement to meet, so you should be able to beat the
performance of boost by taking advantage of the specific knowledge you have
about your application.
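
As a rough illustration of what such custom code might look like, here is a minimal sketch built on std::string::find. The function name pick_fields and its parameters are hypothetical and not from the thread. It assumes a single-character delimiter, no quoted fields, and a list of wanted column indices sorted in ascending order. Unlike boost::char_separator with its default settings, it keeps empty fields instead of dropping them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): returns the fields of `line`
// whose zero-based column index appears in `wanted` (sorted ascending).
// Only those fields are copied into std::string objects; all other fields
// are skipped by jumping to the next delimiter.
std::vector<std::string> pick_fields(const std::string& line,
                                     char delim,
                                     const std::vector<std::size_t>& wanted)
{
    std::vector<std::string> out;
    out.reserve(wanted.size());

    std::size_t pos = 0;     // start of the current field
    std::size_t column = 0;  // index of the current field
    std::size_t next = 0;    // index into `wanted`

    while (next < wanted.size()) {
        std::size_t end = line.find(delim, pos);
        if (end == std::string::npos)
            end = line.size();           // last field has no trailing delimiter

        if (column == wanted[next]) {
            out.push_back(line.substr(pos, end - pos));
            ++next;
        }

        if (end == line.size())
            break;                       // no more fields on this line
        pos = end + 1;
        ++column;
    }
    return out;
}
```

The loop stops as soon as the last wanted column has been copied, so the rest of the line is never scanned. Whether that matters in practice depends on where the key columns sit in the line.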

john

Jul 22 '05 #2

Knackeback

Yes, I will try hand-crafted line parsing.
But can you say a bit more about what you mean by "generic tokenizer"?
My task is to split a line into tokens, and the example from boost::tokenizer
does exactly that.
At the moment I don't need ALL the tokens for my line key, so I think
boost::tokenizer is too expensive.
BTW, I compiled my program with g++ and icc (Intel's C++ compiler for Linux).
The icc-compiled code was five times faster, and the compile warnings from icc
are very good. Good work!

Thomas

Jul 22 '05 #3

John Harrison

That's exactly what I mean. For instance boost will probably create a string
for each token, but you throw some of those tokens away. Your custom code
will only create a string for the tokens you actually need.

Also, looking at your original code, it seems that after extracting a token
you add the delimiter back into the key you are building up. That suggests
another improvement: for your purposes a token can simply include its
trailing delimiter.
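
The two suggestions above can be sketched together as follows. This is only an illustration under stated assumptions, not the code the original poster ended up with: build_key and is_key_column are hypothetical names, the delimiter is a single character, fields are not quoted, and the key parts are taken in column order (the original snippet allows an arbitrary order through keyArr).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): appends every selected field,
// together with its trailing delimiter, directly onto `key`.
// is_key_column[i] is true when column i belongs to the key.
// Assumption: key parts appear in the key in column order.
void build_key(const std::string& line, char delim,
               const std::vector<bool>& is_key_column, std::string& key)
{
    key.clear();  // common implementations keep the buffer for reuse
    std::size_t pos = 0;  // start of the current field
    for (std::size_t column = 0;
         column < is_key_column.size() && pos <= line.size(); ++column) {
        std::size_t end = line.find(delim, pos);
        const bool last = (end == std::string::npos);
        if (last)
            end = line.size();

        if (is_key_column[column]) {
            if (last) {
                // final field: no delimiter in the line, so add one
                key.append(line, pos, end - pos);
                key += delim;
            } else {
                // copy the field and its trailing delimiter in one call
                key.append(line, pos, end - pos + 1);
            }
        }
        pos = end + 1;
    }
}
```

No temporary string is created for any field: skipped columns cost one find call, and key columns are copied straight from the line into the key.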

john

Jul 22 '05 #4

Knackeback

Thanks for the hint. The hand-crafted solution is now three times faster than
the boost tokenizer!

Jul 22 '05 #5

John Harrison

Don't take that as an argument against boost tokenizer. It still does its
job, and presumably does it efficiently (I haven't looked at the code).

What I liked about your post was that you did things the right way round.
First you got a working solution using general purpose tools available to
you, then you decided that it wasn't fast enough so you looked to replace
general purpose code with hand crafted code. That's the way it should be
done.

And of course many times, the hand crafted code isn't necessary at all.

john

Jul 22 '05 #6
