read/parse flat file / performance / boost::tokenizer

task:
- read/parse CSV file

code snippet:
string key, line;
typedef tokenizer<char_separator<char> > tokenizer;
tokenizer tok(string(""), sep);
while ( getline(f, line) ) {
    ++lineNo;
    tok.assign(line, sep);
    short tok_counter = 0;
    for (tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        // check whether this token should be part of the key
        if ( ( idx = lineArr[tok_counter] ) != -1 ) {
            keyArr[idx] = *beg;
        }
        ++tok_counter;
    }
    // build the key, e.g. from the first and third token
    for (int i = 0; i < keySize; i++) {
        key += keyArr[i];
        key += delim;
    }
    m.insert(make_pair(key, LO(new Line(line, lineNo)))); // m is a multimap
    key.erase();
}

gprof hits:
% cumulative self self total
time seconds seconds calls s/call s/call name

16.89 0.50 0.50 2621459 0.00 0.00 bool boost::char_separator<char, std::char_traits<char> >::operator()<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)

11.99 0.85 0.35 24903838 0.00 0.00 boost::char_separator<char, std::char_traits<char> >::is_dropped(char) const

7.09 1.06 0.21 28508346 0.00 0.00 bool __gnu_cxx::operator!=<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)

problem:
I want to improve the performance of this code passage.

questions:
I hope the goal is somewhat clear. I want to read all line objects (each
consisting of a line number and the line content) of a file into the
container, identified by a key.
Any idea that improves the style and performance of this snippet is welcome!

Thomas

--
NO ePatents: http://swpat.ffii.org/index.de.html

Jul 22 '05 #1

All your performance bottlenecks seem to be within the boost tokenizer
library. The obvious answer then is to replace that code with your own
custom code. The tokenizer library is a generic tokenizer, whereas you have
a specific requirement to solve, so you should be able to beat the
performance of boost by taking advantage of the specific knowledge you have
about your application.

john
Jul 22 '05 #2

Yes, I will try handcrafted line reading.
But can you say a bit more about what you mean by "generic tokenizer"?
My task is to split a line into tokens, and the example from boost::tokenizer
does exactly that.
At the moment I don't need ALL the tokens for my line key. Therefore I think
the boost::tokenizer is too expensive.
BTW, I compiled my program with g++ and icc (Intel's C++ compiler for Linux).
The icc-compiled code was five times faster, and the compile warnings from icc
are very good. Good work!

Thomas

Jul 22 '05 #3

That's exactly what I mean. For instance boost will probably create a string
for each token, but you throw some of those tokens away. Your custom code
will only create a string for the tokens you actually need.

Also, looking at your original code, it seems that after extracting a token
you add the delimiter back into the key you are building up. That would be
another improvement: for your purposes a token can include the trailing
delimiter.
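
The two ideas above might be combined roughly as follows. This is only a sketch, not the code that was eventually used: build_key is a hypothetical name, and lineArr has the same meaning as in the original snippet (it maps a token position to a slot in the key, or -1 if the token is not used). It assumes a single delimiter character and no quoting, and it keeps empty fields, which char_separator would drop by default.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch. lineArr[t] is the key slot for token t, or -1 if
// token t is not part of the key (same meaning as in the original snippet).
// Assumptions: single-character delimiter, no quoting, empty fields kept.
std::string build_key(const std::string& line, char delim,
                      const std::vector<int>& lineArr, std::size_t keySize) {
    // Offset and length of each key part inside `line`. A length includes
    // the trailing delimiter unless the field is the last one on the line.
    std::vector<std::size_t> begin(keySize, 0), len(keySize, 0);

    std::size_t pos = 0;
    // Scanning stops once lineArr is exhausted or the line ends.
    for (std::size_t tok = 0; tok < lineArr.size() && pos <= line.size(); ++tok) {
        const std::size_t end = line.find(delim, pos);
        const bool last = (end == std::string::npos);
        const int idx = lineArr[tok];
        if (idx >= 0 && static_cast<std::size_t>(idx) < keySize) {
            begin[idx] = pos;
            len[idx] = (last ? line.size() : end + 1) - pos;
        }
        pos = last ? line.size() + 1 : end + 1;
    }

    std::string key;
    for (std::size_t i = 0; i < keySize; ++i) {
        // Copy the field and its trailing delimiter in one append.
        key.append(line, begin[i], len[i]);
        // A missing field, or the last field on the line, has no delimiter yet.
        if (len[i] == 0 || line[begin[i] + len[i] - 1] != delim) key += delim;
    }
    return key;
}
```

For example, with lineArr = {0, -1, 1, -1} and keySize = 2, the line "k1;x;k3;y" yields the key "k1;k3;". Tokens that are not part of the key are skipped without ever being copied.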

john
Jul 22 '05 #4

Thanks for your hint. The handcrafted solution is now three times faster than
the boost tokenizer!

Jul 22 '05 #5

Don't take that as an argument against boost tokenizer. It still does its
job, and presumably does it efficiently (I haven't looked at the code).

What I liked about your post was that you did things the right way round.
First you got a working solution using general purpose tools available to
you, then you decided that it wasn't fast enough so you looked to replace
general purpose code with hand crafted code. That's the way it should be
done.

And of course many times, the hand crafted code isn't necessary at all.

john
Jul 22 '05 #6
