Container for large amount of data of various types

424 256MB

Hello,

I am relatively new to C++ and have been milling over a container related problem for a while now. I have a ~2Gb ASCII file of time-ordered data with 4 comma-separated columns. The first column is a string identifier, the second and third are date and time, and the last two some numerical data, e.g.:

frog, 10/06/2006, 23:56:03, 12.3456, 322.551
frog, 11/06/2006, 22:00:06, 6.12136, 41.1236
fish, 12/06/2006, 09:01:54, 1.3456, 0.3321

I want to perform some numerical analysis on the data columns. Suppose I only have frogs and fish, but a lot of them. I was thinking of defining a Vertebrate class and creating a frog and a fish object each of which will contain my numerical data and times. My Vertebrate constructor will read the ASCII file, and pass the data to the object's container (after reducing the date and time to a long int which is the number of seconds elapsed since 1/1/2000 or whatever).

Now I will iterate over the all object's data to compute, say, the mean of the numerical data or correlations between frog or fish data. The container must associate string, long int, and float types to each iterator value or key and should be optimized for sequential access. I considered a vector of structs:

class Vertebrate {
public:
// constructor, destructor, various methods
private:
struct animal {
char[4] species;
long int dayTime;
float data1;
float data2;
}
vector<animal> animals;
vector<animal>::iterator itr;
};

My question: is this an efficient way to store this varied data or are there STL containers better suited for this purpose? (Also, I shouldn't expect the container to fit in the machine's memory, so I considered STXXL's analog of vector which stores the vector in a kind of page file on the hard disk, but I haven't even tried to get that going yet.)

Less importantly, given the amount of time that reading and processing the data into the container would presumably take, would it be possible to somehow save the object as a binary file to disk to be reloaded directly when the program is restarted, thus avoiding reprocessing the entire ASCII data?

Sep 3 '07 #1

Subscribe Post Reply

2981

Ganon11

3,652

Expert 2GB

I think you've got a good start to solving the problem, and I think a struct is the way to go. There's no STL container that I know of that can store multiple types (unless they were subclasses of one parent class). However, looking at your program, I'm not sure if I would do it this way. Just think about the organization for a moment - does a 'Vertebrate' have-an 'animal'? No, a 'Vertebrate' is-an- 'animal'. I think you should tackle this problem using inheritance, not composition. Also, is it the Vertebrate class' job to interpret textual data, or could you write a function to process the text and then create an object using the retrieved values?

Just a few things to think about.

Sep 3 '07 #2

weaknessforcats

9,208

Expert Mod 8TB

There's a lot of stuff not being said here.

First: Your container does not need to store the objects. Instead it can store a handle to the object. See the article on Handle Classes in rthe C/C++ Articles forum.

Second: You do not need to associate a string. long int, and float. The problem arises that adding a new association will cause the conbtainer to nee3d to be relloaded.

Third: Use polytmorphism. That is have a base class. However, separate the interface (the publlic methods of the base class) from the implmentation (the virtual methods of the base class). The base class should have no public virtual methods. Instead, the base class virtual methods should be private and it is these methods that the derived class overrides. Check out the design pattern called Template Method.

Fourth: the Vertebrate constructor should not read the disc file. Instead, use a CreateContainer function that a) creates the container of handles to vertebrates, b) reads the disc file, c) creates the frog or fish object, d) loads the object with file data, e) adds the object to the container as a handle to a base object. At the end of the disc file, this function returns a handle to the container.

Fifth: Consider using a flat file where the objects are stored in the file rather than in the container. The container just has the offset from the beginning of the file to the object. That is, a frog knows it is 112345 bytes from the beginning of the file. You can seek to that location and read the frog. This would allow adding to the end of the file. If you do this, you never need to reload the the data into the coimputer. You can have a giant file and a small computer program.

Sixth: Why are you not using a SQL database?? I mean all of this database work has already been done. You just need your data model, create your tables and process your SQL queries and updates. If you are on Unix, use Oracle. If Windows use SQLServer.

Sep 3 '07 #3

arnaudk

424

256MB

Thank you very much for your comments. I guess I should elaborate a bit. I'm still 'tasting the water' when it comes to OOP so I might well be suggesting to do things in the most impractical way. Let me reply in reverse order:

Sixth. My script will be uploaded to a grid of non-identical fast machines running Unix and the Sun Grid Engine. My queued job scripts will compile my code on the machine-specific compiler for each node they spawn to and run it. The nodes do not have any sophisticated database program as far as I know. Whilst I could in principle run a database from the job directory, I suppose there is a more compact solution given the simplicity of what I'm trying to do. (And I like to understand what's going on behind the scenes, anyway).

Fifth. As for the 'flat file', I guess you mean hashing or indexing the file then using gseek() to iterate through the file. I thought of that, but isn't file access very slow compared to accessing a container's elements? Especially if this has to be done over the grid network, even though it's a gigabit ethernet. The routines I will be using involve some heavy repeated cycling through the data. A faster disk-related method might involve STXXL which apparently uses some sophisticated prefetching and caching of a local pagefile it creates, but it's limited to vector and set containers I think.

Particle physicists process terabytes of data on server farms using the TTree object from the ROOT data processing framework developed at CERN. You can store anything in a TTree. After you create one, you can save it to the disk as a compressed binary file (a "ROOT" file). I guess it's some kind of hashed table. I'd really like to use that class but ROOT is an arcane non-standard mix of C and C++ which inherits decades of legacy C and FORTRAN routines and uses the CINT interpreter, and I didn't manage to figure out how to extract the TTree class and its dependencies.

As for your first three points, coming from a brief C background, I'm still trying to get my head around OOP so I'll try to read up what you suggested to understand how to implement it. But in the meantime,

Fourth. Sounds good. I'll do that.

Third. Will read about the "Template Method".

Second. Why would the vector container be reloaded? The vector only sees a struct, do you refer to the struct as a container? I'm still getting used to the terminilogy...

First. Will read about Handle Classes. By a container storing a handle to the object, do you mean something like an array of pointers to different types in C? I was hoping to avoid using dynamic memory with objects but I'll have to learn it someday...

Sep 3 '07 #4

arnaudk

424

256MB

And in reply to Ganon11, thanks for your speedy reply. Is iterating over a vector of structs quicker than iterating over multiple vectors? For instance, I could have the following class definition instead:

class Vertebrate {
public:
// Constructor
Vertebrate(string fileName,string fish_frog);

// Datafunctions
data1(vector<animal>::iterator itr_)

char[4] name_;
vector<long int> dayTime;
vector<float>::iterator itr_dayTime

vector<float> data1;
vector<float> data2;
vector<float>::iterator itr_data;

};

then have my interation loop in main() which increments itr_data++ itr_dayTime++ and processes the vectors' values. (In my first post, there was no way to access the private vector's iterator in main() to loop over its contents so I'd have to fix that).

As I'm still getting to grips with C++ I'll think more about composition vs. inheritance as you mention. But perhaps the struct name "animal" was misleading, it should be called vertebrateProperties or something, then a Vertebrate has a name: 'frog', etc...

Sep 3 '07 #5

arnaudk

424

256MB

weaknessforcats, I read your article on Handle Classes carefully. So, if I understand correctly, you are in a sense suggesting that, instead of a vector of structs, I should have a vector of objects (a vector of handles to objects, more precisely). I can see the advantage of separating the interface from the implementation in principle, but is there much difference in terms of memory usage or time? As I excel in the creation of memory leaks in C, I would certainly spend weeks plugging the even more obscure memory leaks caused by the notoriously bad memory accounting of my objects.

If there's a great sense in doing this, I would invest the time. Now, can I write the objects pointed to in the vector of handles to a binary file and subsequently reload them easily into the vector? Does it even make sense to do this? Would it take as much time to reprocess the whole ASCII data file?

Thanks!

Sep 4 '07 #6

kreagan

153

100+

Thank you very much for your comments. I guess I should elaborate a bit. I'm still 'tasting the water' when it comes to OOP so I might well be suggesting to do things in the most impractical way. Let me reply in reverse order:

Sixth. My script will be uploaded to a grid of non-identical fast machines running Unix and the Sun Grid Engine. My queued job scripts will compile my code on the machine-specific compiler for each node they spawn to and run it. The nodes do not have any sophisticated database program as far as I know. Whilst I could in principle run a database from the job directory, I suppose there is a more compact solution given the simplicity of what I'm trying to do. (And I like to understand what's going on behind the scenes, anyway).

I guess in theory the computer which commands the nodes can store the database. The nodes don't need to be aware of its existance.

I did a project where a compact computer networked with a computer that processed everything. The main computer would just periodically send information to the hand held computer. I don't know if that idea would work for your project.

Good Luck
Kat

Sep 4 '07 #7

arnaudk

424

256MB

I just thought of another way of storing various data types in an STL container: what about using a multimap where I convert my various types, 'long int', 'double', etc to strings which I can then store in a multimap? The problem is that when I iterate through them, I'd have to convert from the string back to the long int or double. Do you think that performing this conversion at each iteration would cancel the speed benefits gained by using multimaps?

I'd still have the issue of handing a gigabyte-sized multimap, but it's progress... Perhaps I can split it into different parts somehow.

So which is fastest for storing data of various types:
1) a vector of structs;
2) a vector of handles to objects containing basic members like long int and double;
3) a multimap of strings which are converted to the various types at each access?

Sep 4 '07 #8

weaknessforcats

9,208

Expert Mod 8TB

I just did a max_size() on a vector<animal> implemented with Visual Studio.NET 2005 and got 268,435,455 as a result. This is the maximum number of animal elements you can put in a vector

Expand|Select|Wrap|Line Numbers

 
#include <vector>

#include <map>

struct animal {

char species[4];

long int dayTime;

float data1;

float data2;

};
 
class Handle

{

     int* a;   //dummy pointer for the object

     int* b;   //dummy pointer for the reference count

};
 
int main()

{

    vector<animal>

    cout << v.max_size() << endl;   //268,435,455

    vector<Handle> v1;

    cout << v1.max_size() << endl;   //536,870,911
 
}

If you will have more than 268,435,455 animals, a vector<animal> will not work.
If you have more than 536,870,911 animals, a vector<Handle> to those animals will not work.

You can see that by using a handle, the number of elements you can put into the container has doubled.

A list container produced the same answers.

Now I added two floats to the struct animnal an got 178, 956, 970 for vector<animal> and 536,870,911 for a vector<Handle>. Notice as the struct gets bigger the container capacity goes down. However, since the handle doesn't care how big the object is, the number of handles you can out in the vector<Handle> is constant.

This is why I suggested you not put your struct in the container. If you ever need to redesign the struct, your container may no longer work. By using handles, this potential problem is eliminated.

Of course, nothing prevents you from using more than one vector. In fact, your data may require several vectors. Perhaps these are loaded from separate files.

You could use a fixed array but arrays are limited to 0x7fffffff bytes. So you will need multiple arrays.

That TTree object you mention is sounding better by the minute.

I suggest you start looking into that.

Sep 5 '07 #9

arnaudk

424

256MB

OK, thank you very much for your help - since I'm new to C++ I have no way of knowing if there's a well known simple way of doing something. I'll look into STXXL (stxxl.sourceforge.net) which can store large vectors and stacks or, if I find a simple way of using ROOT's TTree (root.cern.ch) outside of its cumbersome framework, I'll post my experiences here.

Arnaud

Sep 6 '07 #10

cyp7

Use a multimap with the second object, the value, as another map
or a pair...

typedef pair < int, int > MYPAIR;
typedef multimap < string, MYPAIR > MYMULTIMAP;

- OR -

typedef map < int, float > MYMAP;
typedef multimap < string, MYMAP > MYMULTIMAP;

MYMAP m_myMap; //member variable

Use an Iterator to move through the map...

MYMAP :: iterator my_Iterator;

poke around and you'll find examples on how to load the map
and then search through it for the data your're looking for.
Super fast iteration.

Note: if you use multimap instead of map, then you have to
use .insert(...) rather than the subscript operator [] for inserting
b/c multimaps allow non-unique keys.

cyp7

Sep 17 '07 #11

Container for large amount of data of various types

Similar topics