I'm currently reading 1 to n binary files, each with 3 different arrays of floats containing about 10,000 values apiece, for a total of about 30,000 values per file.
I'm looking for a way to load them all into memory. I've tried using vector push_back with reserve, but it was horribly slow. The method I'm using now is, upon opening the file and reading the number of values, to resize the vectors (I have 3, one for each data channel) and then assign through the [] operator:
vecLeftChannel[i] = DataBuffer[i];
This still takes a while to loop through the files, and I was wondering if anyone knew a quicker way to load dynamic data like this without using float arrays and reallocating manually.
Thanks!
Josh McFarlane
Darsant wrote: I'm currently reading 1 to n binary files, each with 3 different arrays of floats... I've tried using vector push_back with reserve, but it was horribly slow... [quote trimmed]
Appending to a std::vector is guaranteed to be amortized constant time per element, so filling it by adding only to the end is O(n) overall. You say you tried that "with reserving". Do you mean that you called reserve once prior to adding to the array (good usage), or periodically as you read more values (bad usage, which can result in non-linear running time)?
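For reference, a minimal sketch of the good usage: one reserve call before the loop (the function and parameter names here are illustrative, not from the original post):

    #include <cstddef>
    #include <vector>

    // Good usage: reserve once up front, then each push_back is
    // amortized O(1) with no intermediate reallocations.
    void appendSamples(std::vector<float>& channel,
                       const float* buffer, std::size_t count)
    {
        channel.reserve(channel.size() + count); // one allocation at most
        for (std::size_t i = 0; i < count; ++i)
            channel.push_back(buffer[i]);
    }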
If you don't mind temporarily using more memory, you could read everything into a std::list and then construct a vector from it with something similar to:
std::vector<float> v(mylist.begin(), mylist.end());
mylist.clear();
This might be faster than adding directly to the vector (though by at
most a constant factor).
Probably the fastest possible way to access your data is to use OS
specific mechanisms for creating a memory mapped file (see open and mmap
for Unix variants, or CreateFile, CreateFileMapping, and MapViewOfFile
for Win32).
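A minimal POSIX sketch of that idea (error handling abbreviated, and assuming for illustration that the file contains nothing but raw floats; the caller would munmap the pointer when done, and the Win32 calls are analogous):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a whole file of raw floats into memory, read-only.
    // Returns 0 on failure; on success, sets count to the number of floats.
    const float* mapFloats(const char* path, size_t& count)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return 0;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return 0; }
        void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd); // the mapping remains valid after closing the descriptor
        if (p == MAP_FAILED) return 0;
        count = st.st_size / sizeof(float);
        return static_cast<const float*>(p);
    }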
-Alan
Darsant wrote: I'm currently reading 1 to n binary files... I was wondering if anyone knew a quicker way to load dynamic data like this without using float arrays and reallocating manually. [quote trimmed]
I think there's a function like "seek" or "fseek" to seek to the end of a file. Either the seek function returns a byte offset, or you call a "position" function ("ftell" in C) to get the byte offset, which is effectively the size of the file in bytes. You then resize the vector to sum_of_file_sizes / sizeof(float). You should then get reasonable performance reading in the floats one at a time using buffered I/O.
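A minimal sketch of that size check with C stdio (fileSize is just an illustrative helper name):

    #include <cstdio>

    // Returns the size of a file in bytes, or -1 if it can't be opened.
    long fileSize(const char* path)
    {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return -1;
        std::fseek(f, 0, SEEK_END); // seek to the end of the file
        long size = std::ftell(f);  // byte offset == size in bytes
        std::fclose(f);
        return size;
    }

The sizes can then be summed across all files for one resize up front.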
wk****@yahoo.com wrote: I think there's a function like "seek" or "fseek" to seek to the end of a file... You then resize the vector to sum_of_file_sizes / sizeof(float)... [quote trimmed]
That's what I'm doing right now.
For each file, I read the header information into a temporary header structure. The header tells me the amount of data in each channel. I then reserve size + newsamples for the vector, then push_back the new readings from the temporary read buffer.
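As an aside, if the temporary read buffer is a plain float array, a single range insert can replace the per-element push_back loop (DataBuffer and newsamples stand in for the poster's actual names):

    // One call copies the whole buffer; with capacity reserved beforehand,
    // the vector's storage grows at most once.
    vecLeftChannel.insert(vecLeftChannel.end(),
                          DataBuffer, DataBuffer + newsamples);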
Darsant wrote: I'm currently reading 1 to n binary files, each with 3 different arrays of floats containing about 10,000 values apiece... [quote trimmed]
First you should measure (e.g. with clock()) where the time is spent.
Then consider alternatives.
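A minimal timing sketch with clock() (loadFiles is a hypothetical stand-in for the code being measured):

    #include <cstdio>
    #include <ctime>

    std::clock_t t0 = std::clock();
    loadFiles(); // hypothetical: the phase being measured
    std::clock_t t1 = std::clock();
    std::printf("load took %.0f ms\n",
                1000.0 * (t1 - t0) / CLOCKS_PER_SEC);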
Darsant wrote: That's what I'm doing right now. For each file, I read the header information into a temporary header structure... I then reserve size + newsamples for the vector, then push_back the new readings from the temporary read buffer. [earlier quotes trimmed]
In a typical implementation of the "vector" template, I think
there is, in an instance of vector<float>, a private pointer to
float. This private pointer points to an array in dynamically
allocated storage whose number of (float) elements is greater
than or equal to the value returned by vector<float>::size().
What happens when you call push_back, and the number of floats in
the dynamic array is exactly equal to size()? push_back will:
1) dynamically allocate another array of floats whose dimension
is size() + Delta (Delta being a positive integral constant chosen
by the STL implementation).
2) copy the old array of floats into the newly allocated array of
floats.
3) copy the float being pushed back to the entry at offset size()
in the newly allocated array, and set size() to size()+1.
4) free the old array of floats, and set the private pointer to the
new array of floats.
The resize() member function does something similar, except that the caller controls the new size of the private dynamically allocated array. That is why it helps a lot to get the sizes of all the files, add them up, resize the vector to the total size, then copy the floats in one at a time. You may avoid a lot of repeated unnecessary heap allocation (which can be slow) and copying of the private array.
To put it in high-level terms, I think that the Standard says that the worst-case time complexity of both resize and push_back is O(n). So it makes sense to do one resize rather than n push_backs.
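One way to watch the reallocation behavior directly (a minimal sketch; the exact growth pattern is implementation-specific):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::vector<float> v;
        std::size_t lastCap = 0;
        for (int i = 0; i < 10000; ++i) {
            v.push_back(0.0f);
            if (v.capacity() != lastCap) { // a reallocation just happened
                lastCap = v.capacity();
                std::printf("size %lu -> capacity %lu\n",
                            (unsigned long)v.size(), (unsigned long)lastCap);
            }
        }
        return 0;
    }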
wk****@yahoo.com wrote: In a typical implementation of the "vector" template... push_back will: 1) dynamically allocate another array of floats whose dimension is size() + Delta (Delta being a positive integral constant chosen by the STL implementation)... To put it in high-level terms, I think that the Standard says that the worst-case time complexity of both resize and push_back is O(n). So it makes sense to do one resize rather than n push_backs. [quote trimmed]
23.2.4.1: A vector is a kind of sequence that supports random access
iterators. In addition, it supports (amortized) *constant time* insert
and erase operations at the end;
(emphasis mine)
If what you call "Delta" is a constant, then your performance for
inserting n items at the end of the vector will be bounded by O(n^2), or
O(n) amortized for each insert.
A better solution (which I imagine most STL implementations use) is to
double the capacity of the vector every time you run out of space (and
halve the capacity when the size hits 1/4). This is provably bounded by
O(n) for n insertions at the end, or O(1) amortized per insertion.
-Alan
Thanks for everyone's help.
I've got the loading procedure in my class working fairly efficiently now (4375 ms per 342,000 floats).
Question: if I'm reading into a buffer, can I read the buffer's array of floats directly into my vector, or is the memory management different, so that I have to manually push_back the array buffer into the vector? It's more of a curiosity question that might save me a few hundred ms.
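For what it's worth, a vector's elements are stored contiguously in practice (the standard was later clarified to guarantee it), so one common pattern is to resize first and read straight into the elements. A minimal sketch, assuming the stream is already positioned at the float data:

    #include <cstddef>
    #include <fstream>
    #include <vector>

    // Read 'count' floats from the stream directly into the vector's
    // storage, appending after whatever is already there.
    void readFloats(std::ifstream& in, std::vector<float>& v, std::size_t count)
    {
        std::size_t old = v.size();
        v.resize(old + count);
        in.read(reinterpret_cast<char*>(&v[old]), count * sizeof(float));
    }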
Unfortunately, due to the processing algorithm it is passed off to (not under my control), I then have to create a temporary array and pass that array to the algorithm from the vector. Allocating the array takes roughly 1000 ms and copying takes another 688 ms. Due to a glitch in Visual C++, I can't use a reference to the constant vector returned by my class (it converts pointers to a _w64 vector type instead of just a pointer to a vector array, so I also have to copy the vector.)
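If the algorithm's entry point just takes a pointer and a length, the temporary array can often be avoided by handing it the vector's own contiguous storage (a sketch; processData is a hypothetical stand-in for the real routine):

    // No temporary array: pass the vector's contiguous storage directly.
    if (!vecLeftChannel.empty())
        processData(&vecLeftChannel[0], vecLeftChannel.size());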
Anyway, thanks again for everyone's help!
Alan Johnson wrote: 23.2.4.1: A vector is a kind of sequence that supports random access iterators. In addition, it supports (amortized) *constant time* insert and erase operations at the end... If what you call "Delta" is a constant, then your performance for inserting n items at the end of the vector will be bounded by O(n^2)... A better solution is to double the capacity of the vector every time you run out of space. [quote trimmed]
Well, now, they don't really clarify what to amortize over, do they? But I think you're right: a constant delta is probably not a good idea. I think your analysis applies for increasing the array size by multiplying by K, for any K > 1. I'd guess a good implementation would use a value of K that's closer to 1 than 2.
Maybe another way to implement a vector<T> is as an array of pointers to arrays of T whose dimension is some power of 2, so that internal access to a vector element would look like a[i >> K][i & ((1 << K) - 1)]. No copying of the elements is needed on size changes.
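A minimal sketch of that two-level layout (K, BlockSize, and the class name are all illustrative; this is essentially how std::deque is commonly implemented):

    #include <cstddef>
    #include <vector>

    // Two-level storage: a table of pointers to fixed-size blocks of floats.
    // Growth appends a new block, so existing elements never move.
    class BlockedVector
    {
        static const std::size_t K = 12;             // each block holds 2^K floats
        static const std::size_t BlockSize = 1 << K;
        std::vector<float*> blocks;
        std::size_t count;
        BlockedVector(const BlockedVector&);            // not copyable: a shallow
        BlockedVector& operator=(const BlockedVector&); // copy would double-delete
    public:
        BlockedVector() : count(0) {}
        ~BlockedVector()
        {
            for (std::size_t i = 0; i < blocks.size(); ++i)
                delete [] blocks[i];
        }
        void push_back(float x)
        {
            if (count == blocks.size() * BlockSize)  // current block is full
                blocks.push_back(new float[BlockSize]);
            blocks[count >> K][count & (BlockSize - 1)] = x;
            ++count;
        }
        float& operator[](std::size_t i)
        {
            return blocks[i >> K][i & (BlockSize - 1)];
        }
        std::size_t size() const { return count; }
    };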