I'm currently reading 1-n number of binary files, each with 3 different
arrays of floats containing about 10,000 values a piece for a total of
about 30,000 values per file.
I'm looking for a way to load them all into memory. I've tried using
vector pushback with reserving, but it was horribly slow. The current
method I am using is upon opening the file and reading the number of
values, resizing the vectors (I have 3, one for each data channel),
then inserting the [] operator.
vecLeftChannel[i] = DataBuffer[i].
This still takes a while to loop through the files, and I was wondering
if anyone knew a quicker way to load dynamic data like this without
using float arrays and reallocating manually.
Thanks!
Josh McFarlane 8 2879
Darsant wrote: I'm currently reading 1-n number of binary files, each with 3 different arrays of floats containing about 10,000 values a piece for a total of about 30,000 values per file.
I'm looking for a way to load them all into memory. I've tried using vector pushback with reserving, but it was horribly slow. The current method I am using is upon opening the file and reading the number of values, resizing the vectors (I have 3, one for each data channel), then inserting the [] operator.
vecLeftChannel[i] = DataBuffer[i].
This still takes a while to loop through the files, and I was wondering if anyone knew a quicker way to load dynamic data like this without using float arrays and reallocating manually.
Thanks! Josh McFarlane
std::vector is guaranteed to be O(n) amortized if you are only adding to
the end of the vector. You say you tried that "with reserving". Do you
mean that you called reserve once prior to adding to the array (good
usage), or periodically as you read more values (bad usage, can result
in non-linear running time)?
If you don't mind temporarily more memory, you could read everything
into a std::list, and then construct a vector from that with something
similar to:
std::vector<float> v(mylist.begin(), mylist.end()) ;
mylist.clear() ;
This might be faster than adding directly to the vector (though by at
most a constant factor).
Probably the fastest possible way to access your data is to use OS
specific mechanisms for creating a memory mapped file (see open and mmap
for Unix variants, or CreateFile, CreateFileMapping, and MapViewOfFile
for Win32).
-Alan
Darsant wrote: I'm currently reading 1-n number of binary files, each with 3 different arrays of floats containing about 10,000 values a piece for a total of about 30,000 values per file.
I'm looking for a way to load them all into memory. I've tried using vector pushback with reserving, but it was horribly slow. The current method I am using is upon opening the file and reading the number of values, resizing the vectors (I have 3, one for each data channel), then inserting the [] operator.
vecLeftChannel[i] = DataBuffer[i].
This still takes a while to loop through the files, and I was wondering if anyone knew a quicker way to load dynamic data like this without using float arrays and reallocating manually.
I think there's a function like "seek" or "fseek" or something
to seek to the end of file. Either the seek function returns
a byte offset, or you call a "position" or "pos" function to
get the byte offset, which is effectively the size of the file
in bytes. You then resize the vector to
sum_of_file_sizes / sizeof(float). You should then get reasonable
performance reading in the floats one at a time using buffered
I/O. wk****@yahoo.com wrote: I think there's a function like "seek" or "fseek" or something to seek to the end of file. Either the seek function returns a byte offset, or you call a "position" or "pos" function to get the byte offset, which is effectively the size of the file in bytes. You then resize the vector to sum_of_file_sizes / sizeof(float). You should then get reasonable performance reading in the floats one at a time using buffered I/O.
That's what I'm doing right now.
For each file, I read the header of information to a temporary header
structure. The header tells me the amount of data in each channel. I
then reserve size+newsamples for the vector, then push_back the new
readings from the temporary read buffer.
Darsant wrote: I'm currently reading 1-n number of binary files, each with 3 different arrays of floats containing about 10,000 values a piece for a total of about 30,000 values per file.
First you should measure (e.g. with clock()) where the time is spent.
Then consider alternatives.
Darsant wrote: wk****@yahoo.com wrote: I think there's a function like "seek" or "fseek" or something to seek to the end of file. Either the seek function returns a byte offset, or you call a "position" or "pos" function to get the byte offset, which is effectively the size of the file in bytes. You then resize the vector to sum_of_file_sizes / sizeof(float). You should then get reasonable performance reading in the floats one at a time using buffered I/O.
That's what I'm doing right now.
For each file, I read the header of information to a temporary header structure. The header tells me the amount of data in each channel. I then reserve size+newsamples for the vector, then push_back the new readings from the temporary read buffer.
In a typical implementation of the "vector" template, I think
there is, in an instance of vector<float>, a private pointer to
float. This private pointer points to an array in dynamically
allocated storage whose number of (float) elements is greater
than or equal to the value returned by vector<float>::size().
What happens when you call push_back, and the number of floats in
the dynamic array is exactly equal to size()? push_back will:
1) dynamically allocate another array of floats whose dimension
is size() + Delta (Delta being a positive integral constant chosen
by the STL implementation).
2) copy the old array of floats into the newly allocated array of
floats.
3) copy the float being pushed_back to the entry at offset size()
in the newly allocated array, and set size() to size()+1.
4) free the old array of floats, and set the private pointer to the
new array of floats.
the resize() member function does something similar, except that
the caller controls the new size of the private dynamically allocated
array. So that is why it help alot to get the sizes of all the files,
add them up, resize the vector to the total size, then copy the floats
in one at a time. You may avoid alot of repeated unnecessary heap
allocation (which can be slow) and copying of the private array.
To put it in high level terms, I think that the Standard says that
the worst-case time complexity of both resize and push_back is
O(n). So it makes sense to do one resize rather than n push_backs. wk****@yahoo.com wrote: Darsant wrote:
wk****@yahoo.com wrote:
I think there's a function like "seek" or "fseek" or something to seek to the end of file. Either the seek function returns a byte offset, or you call a "position" or "pos" function to get the byte offset, which is effectively the size of the file in bytes. You then resize the vector to sum_of_file_sizes / sizeof(float). You should then get reasonable performance reading in the floats one at a time using buffered I/O.
That's what I'm doing right now.
For each file, I read the header of information to a temporary header structure. The header tells me the amount of data in each channel. I then reserve size+newsamples for the vector, then push_back the new readings from the temporary read buffer.
In a typical implementation of the "vector" template, I think there is, in an instance of vector<float>, a private pointer to float. This private pointer points to an array in dynamically allocated storage whose number of (float) elements is greater than or equal to the value returned by vector<float>::size(). What happens when you call push_back, and the number of floats in the dynamic array is exactly equal to size()? push_back will:
1) dynamically allocate another array of floats whose dimension is size() + Delta (Delta being a positive integral constant chosen by the STL implementation).
2) copy the old array of floats into the newly allocated array of floats.
3) copy the float being pushed_back to the entry at offset size() in the newly allocated array, and set size() to size()+1.
4) free the old array of floats, and set the private pointer to the new array of floats.
the resize() member function does something similar, except that the caller controls the new size of the private dynamically allocated array. So that is why it help alot to get the sizes of all the files, add them up, resize the vector to the total size, then copy the floats in one at a time. You may avoid alot of repeated unnecessary heap allocation (which can be slow) and copying of the private array.
To put it in high level terms, I think that the Standard says that the worst-case time complexity of both resize and push_back is O(n). So it makes sense to do one resize rather than n push_backs.
23.2.4.1: A vector is a kind of sequence that supports random access
iterators. In addition, it supports (amortized) *constant time* insert
and erase operations at the end;
(emphasis mine)
If what you call "Delta" is a constant, then your performance for
inserting n items at the end of the vector will be bounded by O(n^2), or
O(n) amortized for each insert.
A better solution (that I imagine most STL implementations use) is to
double the capactiy of the vector every time you run out of space (and
half the capacity when the size hits 1/4). This is provably bounded by
O(n) for n insertions at the end, or O(1) per insertion.
-Alan
Thanks for everyone's help.
I've got the loading procedure in my class working fairly efficently
now (4375ms per 342000 floats).
Question: If I am reading to a buffer, can I read the buffer of array
of floats directly into my vector, or is the memory management
different in that I have to manually pushback an array buffer into the
vector? It's more of a curiosity question that might save me a few
hundred ms.
Unfortunately, due the the processing algorithm it is passed off to
(not under my control), I have to then create a temporary array and
pass that array to the algorithm from the vector. Allocating the array
takes roughly 1000ms and copying takes another 688ms. Due to a glitch
in Visual C++, I can't use a reference the constant vector returned by
my class (converts pointers to _w64 vector type instead of just a
pointer to a vector array, so also have to also copy the vector.)
Anyway, thanks again for everyones help!
Alan Johnson wrote: wk****@yahoo.com wrote:
.... To put it in high level terms, I think that the Standard says that the worst-case time complexity of both resize and push_back is O(n). So it makes sense to do one resize rather than n push_backs.
23.2.4.1: A vector is a kind of sequence that supports random access iterators. In addition, it supports (amortized) *constant time* insert and erase operations at the end;
(emphasis mine)
If what you call "Delta" is a constant, then your performance for inserting n items at the end of the vector will be bounded by O(n^2), or O(n) amortized for each insert. A better solution (that I imagine most STL implementations use) is to double the capactiy of the vector every time you run out of space (and half the capacity when the size hits 1/4). This is provably bounded by O(n) for n insertions at the end, or O(1) per insertion.
Well, now, they don't really clarify as to *what* to amortized over,
do they? But I think you're right, a constant delta is probably
not a good idea. I think your analysis applies for increasing the
array size by multiplying by K for any K > 1. I'd guess a good
implementation would use a value of K that's closer to 1 than
2.
Maybe another way to implement a vector<T> is as an array of pointers
to arrays of T whose dimension is some power of 2. So
the internal accessing of a vector element would look like
a[i >> K][i & ((1 << K) - 1)] . No copying needed on size
changes. This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics
by: ^ |
last post by:
Hi group,
I've become interested in Python a while ago and just converted a simple
perl script to python. The script is very simple, it generates a list of
found virusses from some maillog files...
|
by: Mark Dickinson |
last post by:
I have a simple 192-line Python script that begins with the line:
dummy0 = 47
The script runs in less than 2.5 seconds. The variable dummy0 is never
referenced again, directly or indirectly,...
|
by: Ron Johnson |
last post by:
Hi,
While on the topic of "need for in-place upgrades", I got to think-
ing how the pg_restore could be speeded up.
Am I wrong in saying that in the current pg_restore, all of the
indexes are...
|
by: Erpman |
last post by:
I am trying to access the data with in a wav file. I am testing with very
small files in order to keep the code simple to start with.
Basically, im writing the entire wav file to a byte using a...
|
by: Lee Harr |
last post by:
I have a database where I remove the schema public. When I
try to use the createlang script, it fails like this ...
>createdb foo
CREATE DATABASE
>psql foo -c "select version()"
version...
|
by: Froggy / Froggy Corp. |
last post by:
First thx for your quick answer :)
iostat dont return bad load average. I never grow up to 1Mb/s and the
IDE drive can go faster.
I really can't change the server, its a locative server which...
|
by: TheMadHatter |
last post by:
Yello,
Quick Q:
What is the fastest way of getting data from a file?
Does the though put of the file increase if I collect
a great big chunk of it? Eg say 6meg? or does that
even help?
Does...
|
by: anandaraj |
last post by:
The discussion about the title already available in this forum. But i want to know how to copy the 2d array content or from a buffer into vector.
or is there any other way to read data into vector.
|
by: HYRY |
last post by:
I want to join two mono wave file to a stereo wave file by only using
the default python module.
Here is my program, but it is much slower than the C version, so how
can I increase the speed?
I...
|
by: BarryA |
last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
|
by: nemocccc |
last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: Hystou |
last post by:
There are some requirements for setting up RAID:
1. The motherboard and BIOS support RAID configuration.
2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers,...
|
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...
| |