Alan wrote:
It consists entirely of numbers (some integer, some real). The
primary "screening" comparison involves one integer field. For 90+% of
the data, this field will not match in the comparison. This will
reduce the data considerably.
The data does come in sequential, logical units (output of a time
cycle of a simulation). To clarify, I have to read in something like
2500 (variable) of the items at a time and compare them. Then I
calculate results and move on to the next group of 2500. If it might
be wise, I could create 1000 separate files instead of one big honker.
Any thoughts?
I will output a small file with summarized results (CSV), which
will be used in Excel for further analysis.
Thanks for the advice,
If these are numbers, then I would not read them into strings,
for you will have to convert them to int or double anyway,
to compare. String comparison will not suffice, because
0, -0.0, 0.0000E+00 are all the same number.
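
To illustrate the point about comparing values rather than text, here is
a minimal sketch. The helper name same_number() is made up for this
example, and it assumes both tokens are well-formed numbers.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: convert both tokens to double and compare the
// values, not the characters. Assumes both tokens are valid numbers.
bool same_number(const std::string& a, const std::string& b) {
    return std::strtod(a.c_str(), nullptr) == std::strtod(b.c_str(), nullptr);
}
```

With this, "0", "-0.0" and "0.0000E+00" all compare equal, while the
strings themselves obviously do not.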
I'd go for a vector of structs, if you know the struct in
advance. If you can reasonably guess the number of elements,
or if it is in a header of this file, then .reserve() the
vector in advance, this will reduce the penalty of reallocating
it during .push_back().
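
A rough sketch of what I mean. The Record layout (one integer screening
field plus one real value) is only a guess at your data, and read_group()
and demo_count() are names invented for this example.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical record layout: one integer key and one real value.
struct Record {
    int key;
    double value;
};

// Read at most `expected` records from `in`. `expected` may be a guess
// or a count taken from a file header.
std::vector<Record> read_group(std::istream& in, std::size_t expected) {
    std::vector<Record> group;
    group.reserve(expected);  // allocate once, before any push_back()
    Record r;
    while (group.size() < expected && in >> r.key >> r.value) {
        group.push_back(r);
    }
    return group;
}

// Usage on in-memory text; a std::ifstream works the same way.
std::size_t demo_count() {
    std::istringstream in("1 2.5  3 4.5  7 0.0E+00");
    return read_group(in, 2).size();  // stops after two records
}
```

For your case you would call read_group(file, 2500) once per time cycle
and process the returned vector before reading the next group.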
My experience shows that the bottleneck of such calculations
is usually the string to number conversion hidden behind the
scenes in
input_stream >> some_number;
i.e. the program becomes CPU-bound, not I/O-bound, because of
the conversion. I suggest trying the stream-based approach
and only if the performance is unsatisfactory, trying to
re-code into sscanf()/atof() or the likes, being extremely
careful.
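
If you do end up hand-coding the conversion, something along these lines
is one way to do it. This is only a sketch: parse_doubles() is a made-up
name, it assumes one line of whitespace-separated reals, and it uses
strtod() instead of atof() because atof() gives you no way to detect a
bad token.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: append every number found in `line` to `out`.
// Returns false if something other than numbers and whitespace is found.
bool parse_doubles(const std::string& line, std::vector<double>& out) {
    const char* p = line.c_str();
    for (;;) {
        char* end = nullptr;
        double v = std::strtod(p, &end);  // skips leading whitespace
        if (end == p) break;              // nothing converted: stop
        out.push_back(v);
        p = end;                          // continue after this number
    }
    // Only trailing whitespace may remain; anything else is a bad token.
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') ++p;
    return *p == '\0';
}
```

Numbers parsed before a bad token stay in `out`, so check the return
value before using the results.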
On the other hand, if you intend to compare all pairs
of data, I suspect you'll have O(n^2) complexity, or
O(n log n) if you sort it; that way, for large sets of
data the cost of reading it in, which scales as O(n)
will become less and less a hassle. Still, 2500 numbers
and a 54MB file does not sound like an awful lot.
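
To show the difference between comparing all pairs and sorting first,
here is a small sketch. I do not know what your comparison really is, so
this assumes the simplest case: counting pairs of items whose integer
keys are equal. count_equal_pairs() is a name invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Count pairs (i, j) with i < j and equal keys. Sorting puts equal keys
// next to each other, so one pass over the runs replaces the O(n^2)
// double loop; the total cost is O(n log n).
std::size_t count_equal_pairs(std::vector<int> keys) {
    std::sort(keys.begin(), keys.end());
    std::size_t pairs = 0;
    for (std::size_t i = 0; i < keys.size(); ) {
        std::size_t j = i;
        while (j < keys.size() && keys[j] == keys[i]) ++j;
        std::size_t run = j - i;          // length of this run of equal keys
        pairs += run * (run - 1) / 2;     // pairs within the run
        i = j;
    }
    return pairs;
}
```

The vector is taken by value so the caller's data keeps its original
order.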
I would not split the input into many smaller files;
IMHO it will only require more file opens and closes.
HTH,
- J.