
Performance File Parsing

Hi,
I have to parse a plain ASCII text file (on a local HD). Since the file
might be many millions of lines long, I want to improve the efficiency of
my parsing process. The resulting data structure shall look like the
following:

class A {
    ...
    int value;
};

class B {
    ...
    std::vector<A> as; // NOTE: can't be a pointer
};

// All objects of type B in bs have objects of type A with the same
// value.
class C {
    ...
    std::vector<B*> bs;
};

In the text file each non-empty line represents one object A. We
know the total number of As in advance, but don't know how many As will
be part of each B. Each empty line represents the creation of a new B
to which the following As should be attached. The objects of type B are
grouped into collections in objects of type C using some value in their A
objects. That means all As that are part of one B have the same value. If
two Bs have As with different values, they belong to different Cs.

I know in advance (it's written in the first line of the text) how many
instances of A are inside the file. But unfortunately I have no
information about the structure of the tree (the objects A and B) in
advance.
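
To illustrate the layout (the numbers are made up, and the real A lines
carry more fields than just the value):

5          first line: total number of As in the file
12         A with value 12  (first B starts here)
12         A with value 12
           empty line: start a new B
12         A with value 12  (second B; same value, so same C as the first B)
           empty line: start a new B
7          A with value 7   (third B; different value, so a different C)
7          A with value 7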

Currently I have three methods in mind to create the tree. In
all cases the C structure (usually just a few hundred Cs) will be created
afterwards by grouping the Bs (a few thousand).

1) Go through the text file twice. Count how many objects A belong to each
B and use this information to initialize the vector size of each B.

2) Create one big vector to store all As (since we know the number in
advance). Then count the lines processed until an empty line appears
and set B's vector size accordingly.
This method might be the fastest, I suppose, but uses up a lot of memory.

3) Estimate the vector size in advance, set B's capacity accordingly,
and use push_back for each A. Reduce the capacity to the correct
vector size at each empty line. (A rough sketch of this is below.)
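
For reference, here is roughly what I have in mind for 3). It is only a
sketch: the file name and the per-line format (a single integer per A) are
made up, the extra members of A, B and C are left out, and the grouping of
the Bs into Cs is omitted.

#include <deque>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct A { int value; };
struct B { std::vector<A> as; };   // holds the As by value
struct C { std::vector<B*> bs; };  // groups Bs whose As share one value

int main() {
    const std::size_t estimatedAsPerB = 64;    // rough guess, would need tuning

    std::ifstream in("input.txt");             // hypothetical file name
    std::string line;
    std::getline(in, line);                    // first line: total number of As (unused here)

    // std::deque keeps element addresses stable, so B* pointers stored in
    // the Cs later stay valid while Bs keep being appended.
    std::deque<B> bs;
    bs.emplace_back();
    bs.back().as.reserve(estimatedAsPerB);

    while (std::getline(in, line)) {
        if (line.empty()) {                    // empty line: start the next B
            bs.back().as.shrink_to_fit();      // reduce to the real size (the
                                               // copy-and-swap trick does the
                                               // same on older compilers)
            bs.emplace_back();
            bs.back().as.reserve(estimatedAsPerB);
        } else {
            std::istringstream iss(line);
            A a;
            iss >> a.value;                    // assumes the line holds just the value
            bs.back().as.push_back(a);
        }
    }
    bs.back().as.shrink_to_fit();              // note: consecutive empty lines
                                               // would create empty Bs; not handled

    std::cout << "parsed " << bs.size() << " Bs\n";
    return 0;
}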

Which method would you prefer? Do you have ideas how to speed the
parsing process up? Might there be a chance to speed the whole
thing up using asynchronous I/O? I don't really know anything about
this, but usually it only makes sense for network I/O, right?

Thanks in advance,
Thomas Kowalski

Aug 18 '06 #1
In article <11*********************@p79g2000cwp.googlegroups.com>, th-ko@gmx.de says...

[ ... parsing a text file ]
> 1) Go through the text file twice. Count how many objects A belong to each
> B and use this information to initialize the vector size of each B.
Bad idea. Basic rule: processing is a lot faster than I/O, so you have
to save a LOT of processing to justify even a little extra I/O. Reading
a multi-million line file twice to save a few dynamic allocations is
_almost_ certain to be a loss.
> 2) Create one big vector to store all As (since we know the number in
> advance). Then count the lines processed until an empty line appears
> and set B's vector size accordingly.
> This method might be the fastest, I suppose, but uses up a lot of memory.
Quite true -- whether it works out well is likely to depend heavily upon
whether you have _free_ memory available. Assuming a virtual memory
system, if most memory is in use, you'll (usually) be writing out other
data to make room for the new data, which means more I/O (about which,
see above).
> 3) Estimate the vector size in advance, set B's capacity accordingly,
> and use push_back for each A. Reduce the capacity to the correct
> vector size at each empty line.
This would be the most obvious and easiest approach; if you have working
code already, it probably looks a lot like this anyway.
> Which method would you prefer? Do you have ideas how to speed the
> parsing process up? Might there be a chance to speed the whole
> thing up using asynchronous I/O? I don't really know anything about
> this, but usually it only makes sense for network I/O, right?
Asynch I/O can make a lot of sense for access to a local HD as well.
Though a local drive will (normally) give you faster access than a
network, it's still usually a lot slower than processing.
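
Just to give a concrete picture of the idea in portable terms (this is not
OS-level asynchronous I/O, just a reader thread feeding a parser; treat it
as a sketch with made-up names, written with std::thread for brevity): one
thread reads lines while the other parses, so the disk and the CPU stay
busy at the same time.

#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

int main() {
    std::queue<std::string> pending;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Reader thread: keeps the disk busy and hands lines to the parser.
    std::thread reader([&] {
        std::ifstream in("input.txt");         // hypothetical file name
        std::string line;
        while (std::getline(in, line)) {
            std::lock_guard<std::mutex> lock(m);
            pending.push(std::move(line));
            cv.notify_one();
        }
        std::lock_guard<std::mutex> lock(m);
        done = true;
        cv.notify_one();
    });

    // Parser (this thread): consumes lines while the reader is still reading.
    std::size_t parsed = 0;
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !pending.empty() || done; });
        if (pending.empty() && done)
            break;
        std::string line = std::move(pending.front());
        pending.pop();
        lock.unlock();
        // ... parse 'line' here ...
        ++parsed;
    }

    reader.join();
    std::cout << "parsed " << parsed << " lines\n";
    return 0;
}

Locking once per line is obviously too fine-grained for millions of lines;
in a real program you'd hand over whole buffers of lines (or raw file
chunks) at a time, but the structure stays the same.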

I suppose I should also add that IF you have some insanely fast RAID
array (say, data striped across a dozen disks or so) it's barely
possible to have I/O that actually can keep up with the processing, so
your priorities would change -- but at least IME, this is fairly rare.

As with most questions about optimization, the real answer is to profile
your code, and see where it's spending its time, then work at optimizing
the parts that are slow. Just for example, if only 1% of the time is
spent doing memory allocation, and 90+ % is spent reading from the disk,
then using a two-pass algorithm to optimize memory allocation will
almost certainly slow things down.
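
A crude way to get that first number, before reaching for a real profiler,
is to time the read and the parse separately. The sketch below buys that
separation by reading everything into memory first (which is itself
approach 2's memory cost), so treat it purely as a measurement aid; the
file name and the one-integer-per-line parsing are placeholders.

#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;

    // Phase 1: reading only, to see how much of the total time is I/O.
    const auto t0 = clock::now();
    std::ifstream in("input.txt");             // hypothetical file name
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    const auto t1 = clock::now();

    // Phase 2: parsing only, on the lines already sitting in memory.
    long long checksum = 0;
    for (const std::string& l : lines) {
        if (l.empty())
            continue;
        std::istringstream iss(l);
        int value = 0;
        iss >> value;                          // stand-in for the real parsing
        checksum += value;                     // keeps the work from being optimized away
    }
    const auto t2 = clock::now();

    const auto ms = [](clock::duration d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::cout << "read:  " << ms(t1 - t0) << " ms\n"
              << "parse: " << ms(t2 - t1) << " ms (checksum " << checksum << ")\n";
    return 0;
}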

--
Later,
Jerry.

The universe is a figment of its own imagination.
Aug 18 '06 #2
