
Need speed increase while reading large chunk of data.

I'm currently reading 1-n binary files, each with 3 different arrays
of floats containing about 10,000 values apiece, for a total of about
30,000 values per file.

I'm looking for a way to load them all into memory. I've tried using
vector push_back with reserve, but it was horribly slow. The current
method I am using is, upon opening the file and reading the number of
values, to resize the vectors (I have 3, one for each data channel),
then assign values with the [] operator:

vecLeftChannel[i] = DataBuffer[i];

This still takes a while to loop through the files, and I was wondering
if anyone knew a quicker way to load dynamic data like this without
using float arrays and reallocating manually.

Thanks!
Josh McFarlane

Jul 23 '05 #1
Darsant wrote:
I'm currently reading 1-n binary files, each with 3 different arrays
of floats containing about 10,000 values apiece, for a total of about
30,000 values per file.

I'm looking for a way to load them all into memory. I've tried using
vector push_back with reserve, but it was horribly slow. The current
method I am using is, upon opening the file and reading the number of
values, to resize the vectors (I have 3, one for each data channel),
then assign values with the [] operator:

vecLeftChannel[i] = DataBuffer[i];

This still takes a while to loop through the files, and I was wondering
if anyone knew a quicker way to load dynamic data like this without
using float arrays and reallocating manually.

Thanks!
Josh McFarlane


Appending to the end of a std::vector is guaranteed to be amortized
O(1) per element, so filling it is O(n) overall. You say you tried
that "with reserving". Do you mean that you called reserve once prior
to adding to the array (good usage), or periodically as you read more
values (bad usage, which can result in non-linear running time)?

If you don't mind temporarily using more memory, you could read
everything into a std::list, and then construct a vector from that
with something similar to:

std::vector<float> v(mylist.begin(), mylist.end());
mylist.clear();

This might be faster than adding directly to the vector (though by at
most a constant factor).

Probably the fastest possible way to access your data is to use
OS-specific mechanisms for creating a memory-mapped file (see open and
mmap for Unix variants, or CreateFile, CreateFileMapping, and
MapViewOfFile for Win32).
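
A minimal sketch of the Unix route, assuming the file holds nothing
but raw native-endian floats (illustrative names, abbreviated error
handling):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main()
{
    // Open and map the whole file read-only.
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat sb;
    if (fstat(fd, &sb) != 0) { close(fd); return 1; }

    void* p = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }

    // The floats are usable in place: the OS pages them in on demand,
    // and nothing is copied element by element.
    const float* data = static_cast<const float*>(p);
    std::size_t n = sb.st_size / sizeof(float);

    double sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    std::printf("%lu floats, sum %f\n", (unsigned long)n, sum);

    munmap(p, sb.st_size);
    close(fd);
    return 0;
}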

-Alan
Jul 23 '05 #2
Darsant wrote:
I'm currently reading 1-n binary files, each with 3 different arrays
of floats containing about 10,000 values apiece, for a total of about
30,000 values per file.

I'm looking for a way to load them all into memory. I've tried using
vector push_back with reserve, but it was horribly slow. The current
method I am using is, upon opening the file and reading the number of
values, to resize the vectors (I have 3, one for each data channel),
then assign values with the [] operator:

vecLeftChannel[i] = DataBuffer[i];

This still takes a while to loop through the files, and I was wondering
if anyone knew a quicker way to load dynamic data like this without
using float arrays and reallocating manually.


I think there's a function like "seek" or "fseek" to seek to the end
of a file. Either the seek function returns a byte offset, or you call
a "position" function such as "ftell" to get the byte offset, which is
effectively the size of the file in bytes. You then resize the vector
to sum_of_file_sizes / sizeof(float). You should then get reasonable
performance reading in the floats one at a time using buffered
I/O.
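
For a single file, the idea in outline (a sketch only; it assumes the
file is nothing but raw floats, and uses one bulk fread where a
per-float loop would also do):

#include <cstddef>
#include <cstdio>
#include <vector>

// Returns the file's floats in one vector, sized with a single
// allocation up front.
std::vector<float> readFloats(const char* path)
{
    std::vector<float> v;
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return v;

    std::fseek(fp, 0, SEEK_END);
    long bytes = std::ftell(fp);          // file size in bytes
    std::fseek(fp, 0, SEEK_SET);

    if (bytes > 0) {
        v.resize(bytes / sizeof(float));  // one allocation up front
        std::fread(&v[0], sizeof(float), v.size(), fp);
    }
    std::fclose(fp);
    return v;
}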

Jul 23 '05 #3
wk****@yahoo.com wrote:
I think there's a function like "seek" or "fseek" or something
to seek to the end of file. Either the seek function returns
a byte offset, or you call a "position" or "pos" function to
get the byte offset, which is effectively the size of the file
in bytes. You then resize the vector to
sum_of_file_sizes / sizeof(float). You should then get reasonable
performance reading in the floats one at a time using buffered
I/O.


That's what I'm doing right now.

For each file, I read the header information into a temporary header
structure. The header tells me the amount of data in each channel. I
then reserve size() + newsamples for the vector, then push_back the
new readings from the temporary read buffer.
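
Roughly, the shape of it per channel (the names here are stand-ins,
not my real code):

#include <cstddef>
#include <cstdio>
#include <vector>

// One channel's worth of samples is appended to an existing vector;
// reserve is called once per file, so push_back never reallocates
// mid-loop.
void appendChannel(std::vector<float>& channel, std::FILE* fp,
                   std::size_t newsamples)
{
    if (newsamples == 0) return;

    std::vector<float> buffer(newsamples);         // temporary read buffer
    std::fread(&buffer[0], sizeof(float), newsamples, fp);

    channel.reserve(channel.size() + newsamples);  // one reserve per file
    for (std::size_t i = 0; i < newsamples; ++i)
        channel.push_back(buffer[i]);              // no reallocation here
}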

Jul 23 '05 #4
Darsant wrote:
I'm currently reading 1-n binary files, each with 3 different arrays
of floats containing about 10,000 values apiece, for a total of about
30,000 values per file.


First you should measure (e.g. with clock()) where the time is spent.
Then consider alternatives.

Jul 23 '05 #5
Darsant wrote:
wk****@yahoo.com wrote:
I think there's a function like "seek" or "fseek" to seek to the end
of a file. Either the seek function returns a byte offset, or you call
a "position" function such as "ftell" to get the byte offset, which is
effectively the size of the file in bytes. You then resize the vector
to sum_of_file_sizes / sizeof(float). You should then get reasonable
performance reading in the floats one at a time using buffered
I/O.


That's what I'm doing right now.

For each file, I read the header information into a temporary header
structure. The header tells me the amount of data in each channel. I
then reserve size() + newsamples for the vector, then push_back the
new readings from the temporary read buffer.


In a typical implementation of the "vector" template, I think
there is, in an instance of vector<float>, a private pointer to
float. This private pointer points to an array in dynamically
allocated storage whose number of (float) elements is greater
than or equal to the value returned by vector<float>::size().
What happens when you call push_back and the number of floats in
the dynamic array is exactly equal to size()? push_back will:

1) dynamically allocate another array of floats whose dimension
is size() + Delta (Delta being a positive integral constant chosen
by the STL implementation).

2) copy the old array of floats into the newly allocated array of
floats.

3) copy the float being pushed back to the entry at offset size()
in the newly allocated array, and set size() to size()+1.

4) free the old array of floats, and set the private pointer to the
new array of floats.

The resize() member function does something similar, except that
the caller controls the new size of the private dynamically allocated
array. That is why it helps a lot to get the sizes of all the files,
add them up, resize the vector to the total size, and then copy the
floats in one at a time. You avoid a lot of repeated, unnecessary heap
allocation (which can be slow) and copying of the private array.

To put it in high-level terms, I think the Standard says that the
worst-case time complexity of both resize and push_back is O(n). So
it makes sense to do one resize rather than n push_backs.
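
One way to see the reallocation behaviour of a given implementation is
to print capacity() as you push_back (a small sketch):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> v;
    std::vector<float>::size_type lastCap = 0;
    for (int i = 0; i < 10000; ++i) {
        v.push_back(1.0f);
        if (v.capacity() != lastCap) {   // a reallocation just happened
            lastCap = v.capacity();
            std::printf("size %lu -> capacity %lu\n",
                        (unsigned long)v.size(), (unsigned long)lastCap);
        }
    }
    return 0;
}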

Jul 23 '05 #6
wk****@yahoo.com wrote:
....

To put it in high-level terms, I think the Standard says that the
worst-case time complexity of both resize and push_back is O(n). So
it makes sense to do one resize rather than n push_backs.


23.2.4.1: A vector is a kind of sequence that supports random access
iterators. In addition, it supports (amortized) *constant time* insert
and erase operations at the end;

(emphasis mine)
If what you call "Delta" is a constant, then your performance for
inserting n items at the end of the vector will be bounded by O(n^2)
overall, i.e. O(n) amortized for each insert.

A better solution (and the one I imagine most STL implementations use)
is to double the capacity of the vector every time you run out of
space (and halve it when the size drops to a quarter of the capacity).
This is provably bounded by O(n) for n insertions at the end, or O(1)
amortized per insertion.
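
A toy sketch of that growth policy, growth only (real implementations
differ in the factor chosen and in exception-safety details):

#include <cstddef>
#include <cstring>

// Minimal growth-only float vector: when full, double the capacity.
// The copying cost is then amortized constant per push_back.
struct FloatVec {
    float* data;
    std::size_t size, cap;

    FloatVec() : data(0), size(0), cap(0) {}
    ~FloatVec() { delete [] data; }

    void push_back(float x)
    {
        if (size == cap) {                 // out of room: grow geometrically
            std::size_t newCap = cap ? cap * 2 : 16;
            float* p = new float[newCap];
            if (size)
                std::memcpy(p, data, size * sizeof(float));
            delete [] data;
            data = p;
            cap = newCap;
        }
        data[size++] = x;                  // ordinary append
    }
private:
    FloatVec(const FloatVec&);             // copying omitted in this sketch
    FloatVec& operator=(const FloatVec&);
};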

-Alan
Jul 23 '05 #7
Thanks for everyone's help.

I've got the loading procedure in my class working fairly efficiently
now (4375 ms per 342,000 floats).

Question: if I am reading into a buffer, can I read the buffer of
floats directly into my vector, or is the memory management different,
so that I have to manually push_back the array buffer into the vector?
It's more of a curiosity question that might save me a few hundred ms.

Unfortunately, due to the processing algorithm it is passed off to
(not under my control), I have to create a temporary array and pass
that array to the algorithm from the vector. Allocating the array
takes roughly 1000 ms and copying takes another 688 ms. Due to a
glitch in Visual C++, I can't use a reference to the constant vector
returned by my class (it converts pointers to a _w64 vector type
instead of just a pointer to a vector array), so I also have to copy
the vector.

Anyway, thanks again for everyone's help!

Jul 23 '05 #8
Alan Johnson wrote:
wk****@yahoo.com wrote:

....
To put it in high-level terms, I think the Standard says that the
worst-case time complexity of both resize and push_back is O(n). So
it makes sense to do one resize rather than n push_backs.


23.2.4.1: A vector is a kind of sequence that supports random access
iterators. In addition, it supports (amortized) *constant time* insert
and erase operations at the end;

(emphasis mine)

If what you call "Delta" is a constant, then your performance for
inserting n items at the end of the vector will be bounded by O(n^2)
overall, i.e. O(n) amortized for each insert.
A better solution (and the one I imagine most STL implementations use)
is to double the capacity of the vector every time you run out of
space (and halve it when the size drops to a quarter of the capacity).
This is provably bounded by O(n) for n insertions at the end, or O(1)
amortized per insertion.


Well, now, they don't really clarify *what* to amortize over, do
they? But I think you're right: a constant delta is probably not a
good idea. I think your analysis applies to growing the array by a
factor of K for any K > 1. I'd guess a good implementation would use
a value of K closer to 1 than to 2.

Maybe another way to implement a vector<T> is as an array of pointers
to arrays of T whose dimension is some power of 2. The internal
accessing of a vector element would then look like
a[i >> K][i & ((1 << K) - 1)]. No copying is needed on size
changes.
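
A toy sketch of that layout (essentially how std::deque is commonly
implemented; all names here are made up):

#include <cstddef>
#include <vector>

// A growable table of pointers to fixed blocks of 2^K elements.
// Growth appends a block; existing elements never move (only the
// small pointer table may be reallocated).  Copying omitted.
template <typename T, std::size_t K>
class BlockedArray {
    std::vector<T*> blocks;
    std::size_t count;
public:
    BlockedArray() : count(0) {}
    ~BlockedArray()
    {
        for (std::size_t i = 0; i < blocks.size(); ++i)
            delete [] blocks[i];
    }
    void push_back(const T& x)
    {
        if (count == (blocks.size() << K))    // all blocks full
            blocks.push_back(new T[std::size_t(1) << K]);
        (*this)[count] = x;
        ++count;
    }
    T& operator[](std::size_t i)
    {
        return blocks[i >> K][i & ((std::size_t(1) << K) - 1)];
    }
    std::size_t size() const { return count; }
};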

Jul 23 '05 #9
