Bytes IT Community

safely reading large files

How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("example.txt", ios::binary);
  if (myfile.is_open())
  {
    while (! myfile.eof() )
    {
      getline (myfile,line);
      cout << line << endl;
    }
    myfile.close();
  }
  else cout << "Unable to open file";

  return 0;
}

In particular, what if a line in the file is more than the amount of
available physical memory? What would happen? Seems getline() would
cause a crash. Is there a better way? Maybe... check amount of free
memory, then use 10% or so of that amount for the read. So if 1GB of
memory is free, then take 100MB for file IO. If only 10MB is free,
then just read 1MB at a time. Repeat this step until the file has been
read completely. Is something built into standard C++ to handle this?
Or is there an accepted way to do this?

Thanks,

Brad
Jun 27 '08 #1
17 Replies


by*******@gmail.com wrote:
How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
string line;
ifstream myfile ("example.txt", ios::binary);
if (myfile.is_open())
{
while (! myfile.eof() )
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}

else cout << "Unable to open file";

return 0;
}

In particular, what if a line in the file is more than the amount of
available physical memory? What would happen? Seems getline() would
cause a crash. Is there a better way. Maybe... check amount of free
memory, then use 10% or so of that amount for the read. So if 1GB of
memory is free, then take 100MB for file IO. If only 10MB is free,
then just read 1MB at a time. Repeat this step until the file has been
read completely. Is something built into standard C++ to handle this?
Or is there a accepted way to do this?
Actually, performing operations that can lead to running out of memory
is not a simple thing at all. Yes, if you can estimate the amount of
memory you will need beyond what you want to allocate right now, and you
know the size of available memory somehow, then you can allocate a chunk
and operate on that chunk until done and move over to the next chunk.
In the good ol' days that's how we solved large systems of linear
equations, one piece of the matrix at a time (or two if the algorithm
called for it).

Unfortunately there is no single straightforward solution. In most
cases you don't even know that you're going to run out of memory until
it's too late. You can write the program to handle those situations
using C++ exceptions. The pseudo-code might look like this:

std::size_t chunk_size = 1024*1024*1024;
MyAlgorithm algo;

do {
    try {
        algo.prepare_the_operation(chunk_size);
        // if I am here, the chunk_size is OK
        algo.perform_the_operation();
        algo.wrap_it_up();
    }
    catch (std::bad_alloc & e) {
        chunk_size /= 2; // or any other adjustment
    }
}
while (chunk_size > 1024*1024); // or some other threshold

That way if your preparation fails, you just restart it using a smaller
chunk, until you either complete the operation or your chunk is too
small and you can't really do anything...
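A rough, compilable sketch of that retry idea (the function name and the
vector allocation standing in for the prepare step are illustrative, not
from the original pseudo-code):

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Try to grab a working chunk: attempt an allocation, halve the
// request on std::bad_alloc, and give up below a minimum threshold.
// Returns the size that worked, or 0 if nothing above `minimum` did.
std::size_t find_workable_chunk(std::size_t start, std::size_t minimum)
{
    if (minimum == 0)
        minimum = 1; // avoid looping forever at size 0

    for (std::size_t chunk_size = start; chunk_size >= minimum; chunk_size /= 2)
    {
        try {
            std::vector<char> buffer(chunk_size); // may throw std::bad_alloc
            return chunk_size;                    // allocation succeeded
        }
        catch (std::bad_alloc &) {
            // too big -- the loop halves and retries
        }
    }
    return 0;
}
```

Note this only helps on systems where a failed allocation is actually
reported as std::bad_alloc rather than an overcommitted success.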

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Jun 27 '08 #2

Sam
by*******@gmail.com writes:
while (! myfile.eof() )
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}

else cout << "Unable to open file";

return 0;
}

In particular, what if a line in the file is more than the amount of
available physical memory?
The C++ library will fail to allocate sufficient memory, throw an exception,
and terminate the process.
What would happen? Seems getline() would
cause a crash. Is there a better way. Maybe... check amount of free
memory, then use 10% or so of that amount for the read.
And where exactly would you propose to store the remaining 90% of
std::string?
read completely. Is something built into standard C++ to handle this?
No. std::getline() reads as much as necessary, until the end-of-line
character, and the resulting std::string has to be big enough to store the
entire line.
Or is there a accepted way to do this?
If you need a way to handle this situation, you would not use
std::getline(), but some different approach, like std::istream::get or
std::istream::read.


Jun 27 '08 #3

by*******@gmail.com wrote:
How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:
Others have already answered your question, so I'm going to address
something else.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
string line;
ifstream myfile ("example.txt", ios::binary);
if (myfile.is_open())
{
This while loop does not do what you think it does. See FAQ 15.5
(http://parashift.com/c++-faq-lite/in....html#faq-15.5)
while (! myfile.eof() )
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}

else cout << "Unable to open file";

return 0;
}
Jun 27 '08 #4

On May 21, 4:11 am, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
byte8b...@gmail.com wrote:
How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main () {
string line;
ifstream myfile ("example.txt", ios::binary);
if (myfile.is_open())
{
while (! myfile.eof() )
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}
else cout << "Unable to open file";
return 0;
}
In particular, what if a line in the file is more than the
amount of available physical memory? What would happen?
Seems getline() would cause a crash. Is there a better way.
Maybe... check amount of free memory, then use 10% or so of
that amount for the read. So if 1GB of memory is free, then
take 100MB for file IO. If only 10MB is free, then just read
1MB at a time. Repeat this step until the file has been read
completely. Is something built into standard C++ to handle
this? Or is there a accepted way to do this?
Actually, performing operations that can lead to running out
of memory is not a simple thing at all.
I'm sure you don't mean what that literally says. There's
certainly nothing difficult about running out of memory. Doing
something reasonable (other than just aborting) when it happens
is difficult, however.
Yes, if you can estimate the amount of memory you will need
over what you right now want to allocate and you know the size
of available memory somehow, then you can allocate a chunk and
operate on that chunk until done and move over to the next
chunk. In the good ol' days that's how we solved large
systems of linear equations, one piece of the matrix at a time
(or two if the algorithm called for it).
And you'd manually manage overlays, as well, so that only part
of the program was in memory at a time. (I once saw a PL/1
compiler which ran in 16 KB real memory, using such techniques.
Took something like three hours to compile a 500 line program,
but it did work.)
Unfortunately there is no single straightforward solution. In
most cases you don't even know that you're going to run out of
memory until it's too late. You can write the program to
handle those situations using C++ exceptions. The pseudo-code
might look like this:
std::size_t chunk_size = 1024*1024*1024;
MyAlgorithm algo;
do {
try {
algo.prepare_the_operation(chunk_size);
// if I am here, the chunk_size is OK
algo.perform_the_operation();
algo.wrap_it_up();
}
catch (std::bad_alloc & e) {
chunk_size /= 2; // or any other adjustment
}
}
while (chunk_size > 1024*1024); // or some other threshold
Shouldn't the condition here be "while ( operation not done )",
something like:

bool didIt = false ;
do {
try {
// your code from the try block
didIt = true ;
}
// ... your catch
} while ( ! didIt ) ;
That way if your preparation fails, you just restart it using
a smaller chunk, until you either complete the operation or
your chunk is too small and you can't really do anything...
Just a note, but that isn't always reliable. Not all OS's will
tell you when there isn't enough memory: they'll return an
address, then crash or suspend your program when you try to
access it. (I've seen this happen on at least three different
systems: Windows, AIX and Linux. At least in the case of AIX
and Linux, and probably Windows as well, it depends on the
version, and some configuration parameters, but most Linux are
still configured so that you cannot catch allocation errors: if
the command "/sbin/sysctl vm.overcommit_memory" displays any
value other than 2, then a reliably conforming implementation of
C or C++ is impossible.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Jun 27 '08 #5

by*******@gmail.com wrote:
How does C++ safely open and read very large files?
How about this mate? It's a start.
// read a file into memory
// read in chunks if the file is larger than 16MB
#include <iostream>
#include <fstream>

using namespace std;

int main ()
{
  int max = 16384000;

  ifstream is;
  is.open ("test.txt", ios::binary);

  // get length of file:
  is.seekg (0, ios::end);
  int file_size = is.tellg();
  is.seekg (0, ios::beg);

  if (file_size > max)
  {
    // allocate memory for one chunk
    char* buffer = new char[max];

    cout << file_size << " bytes... break up to read" << endl;
    while (is)
    {
      // read a block; gcount() says how much the last read really got
      is.read (buffer, max);
      cout.write (buffer, is.gcount());
    }
    delete[] buffer;
  }
  else
  {
    // allocate memory for the whole file
    char* buffer = new char[file_size];

    cout << file_size << " bytes" << endl;
    is.read (buffer, file_size);
    cout.write (buffer, is.gcount());
    delete[] buffer;
  }

  is.close();

  return 0;
}
Jun 27 '08 #6

On May 21, 5:33 am, James Kanze <james.ka...@gmail.com> wrote:
[...]
Just a note, but that isn't always reliable. Not all OS's will
tell you when there isn't enough memory: they'll return an
address, then crash or suspend your program when you try to
access it.

Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file of more than 4GB. (There are
"non-standard" solutions for this.)

--
Leandro T. C. Melo

Jun 27 '08 #7

ltcmelo wrote:
[..]
Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)
Why do you say it's non-standard? The Standard defines 'streampos' for
stream positioning, which, in any decent library implementation on a
system that allows its files to exceed 32-bit sizes, should be the next
larger integral type (whatever that might be on that system). Also,
there is the 'streamoff' type for offsetting in a stream buffer. Both
are implementation-defined.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Jun 27 '08 #8

On May 21, 4:19 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
ltcmelo wrote:
[..]
Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)

Why do you say it's non-standard? The Standard defines 'streampos' for
stream positioning, which any decent library implementation on a system
that allows its files to have more than 32-bit size, should be the next
larger integral type (whatever that might be on that system). Also,
there is the 'streamoff' for offsetting in a stream buffer. Both are
implementation-defined.
Maybe "non-portable" (across systems) would have been a better
choice... I say that for particular cases like in the GNU C Library
where you can use macros to enable support for large files.
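With the GNU C Library, the usual macro for this is _FILE_OFFSET_BITS=64,
which makes off_t (and the fseeko/ftello interfaces) 64-bit
transparently; a sketch under that assumption (file_size is an
illustrative name, and these calls are POSIX/glibc, not standard C++):

```cpp
// Must appear before any system header; on glibc this makes off_t
// (and thus fseeko/ftello) 64-bit even on a 32-bit platform.
#define _FILE_OFFSET_BITS 64

#include <stdio.h>

// Report a file's size through off_t instead of truncating into long.
long long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    long long size = -1;
    if (fseeko(f, 0, SEEK_END) == 0)
        size = (long long) ftello(f);   // off_t, 64-bit here
    fclose(f);
    return size;
}
```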

--
Leandro T. C. Melo
Jun 27 '08 #9

ltcmelo wrote:
On May 21, 4:19 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
>ltcmelo wrote:
>>[..]
Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)
Why do you say it's non-standard? The Standard defines 'streampos' for
stream positioning, which any decent library implementation on a system
that allows its files to have more than 32-bit size, should be the next
larger integral type (whatever that might be on that system). Also,
there is the 'streamoff' for offsetting in a stream buffer. Both are
implementation-defined.

Maybe "non-portable" (across systems) would have been a better
choice... I say that for particular cases like in the GNU C Library
where you can use macros to enable support for large files.
I am still confused. What's non-portable if you use standard types?
Perhaps you're using some special meaning of the word "non-portable" (or
the word "portable")... Care to elaborate?

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Jun 27 '08 #10

On May 21, 4:53 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
ltcmelo wrote:
On May 21, 4:19 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
ltcmelo wrote:
[..]
Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)
Why do you say it's non-standard? The Standard defines 'streampos' for
stream positioning, which any decent library implementation on a system
that allows its files to have more than 32-bit size, should be the next
larger integral type (whatever that might be on that system). Also,
there is the 'streamoff' for offsetting in a stream buffer. Both are
implementation-defined.
Maybe "non-portable" (across systems) would have been a better
choice... I say that for particular cases like in the GNU C Library
where you can use macros to enable support for large files.

I am still confused. What's non-portable if you use standard types?
Perhaps you're using some special meaning of the word "non-portable" (or
the word "portable")... Care to elaborate?
Hmm... I think I get your point.

Once I worked with an application in which we had to treat really big
files (sometimes they were as large as 6GB). We were having problems
reading such large files. I can't remember exactly what it was, but
basically the file pointer was getting lost. Since non-portability was
not a problem, we decided to use macro _LARGEFILE64_SOURCE (it was a
Solaris system with the GNU C Library) and changed several file
processing function calls and data types (non-standard).

Well, yes... that might actually have been caused by a bug in our
code. Maybe the wrong data types were being used.

--
Leandro T. C. Melo

Jun 27 '08 #11


"Victor Bazarov" <v.********@comAcast.net> wrote in message
news:g1**********@news.datemas.de...

<...>
I am still confused. What's non-portable if you use standard types?
Perhaps you're using some special meaning of the word "non-portable" (or
the word "portable")... Care to elaborate?
Portability of a type is a function of the programming language and hardware
and other parameters.

In a contemporary language with a virtual machine, an integer will provide a
guarantee on size and semantics of expressions.

In C and C++, both quite old programming languages, neither size nor
detailed semantics is guaranteed, but pointer semantics is usually well
accommodated ... hardware dominates.

Short term I'll bet on the virtual machine. Some hardware is going that
direction, but ..

Long term, Von Neumann architecture is history too, a function of Moore's Law.

Hence I coin the term fluid virtual machine. Define your types and the
architecture will morph itself to their characteristics.

In this semantic view , hardware is portable.

software heaven....

HTH

regards
Andy Little



Jun 27 '08 #12

Sam wrote:
>In particular, what if a line in the file is more than the amount of
available physical memory?

The C++ library will fail to allocate sufficient memory, throw an
exception, and terminate the process.
You wish. What will more likely happen is that the OS will start
swapping like mad and your system will be next to unusable the next half
hour, while you try desperately to kill your program. (This will happen
in most current OSes, including Windows and Linux.) I am talking from
experience.

Too many C++ programs out there carelessly create things like
std::vectors or std::strings whose size depends on some user-given
input, without any sanity checks on the size. This is a bad idea for
the abovementioned reason.
Jun 27 '08 #13

On May 22, 8:56 am, Juha Nieminen <nos...@thanks.invalid> wrote:
Sam wrote:
In particular, what if a line in the file is more than the amount of
available physical memory?
The C++ library will fail to allocate sufficient memory, throw an
exception, and terminate the process.
You wish. What will more likely happen is that the OS will start
swapping like mad and your system will be next to unusable the next half
hour, while you try desperately to kill your program. (This will happen
in most current OSes, including Windows and Linux.) I am talking from
experience.
Hmmm. Sounds like Solaris 2.4. More recent Solaris doesn't
have this problem. (Things do slow down when you start
swapping, but the machine remains usable for simple
things---like opening an xterm to do ps and kill.)

In the case of Linux, of course, you're likely to get a program
crash before you get your exception.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Jun 27 '08 #14

On May 21, 9:19 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
ltcmelo wrote:
[..]
Even assuming you have memory, there's another concern. If
you're running on a 32-bit platform, you can have problems
working with such large files. Basically, the file pointer
might not be large enough to traverse, for example, a file
with more than 4GB. (There are "non- standard" solutions for
this.)
Why do you say it's non-standard? The Standard defines
'streampos' for stream positioning, which any decent library
implementation on a system that allows its files to have more
than 32-bit size, should be the next larger integral type
(whatever that might be on that system). Also, there is the
'streamoff' for offsetting in a stream buffer. Both are
implementation-defined.
The problem is that implementations don't want to break binary
compatibility, that support for files larger than 4GB is often a
more or less recent feature, and earlier implementations had a
32-bit streampos. If you've written streampos values to a file
in binary, you don't want their size to change, and there's
probably more existing code which does this (even though it's
really stupid) than new code which really needs files over 4GB.
So depending on the implementation, you may not be able to
access the end of the file. (streamoff doesn't help if all the
implementation does is add it to the current position,
maintained in a streampos.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Jun 27 '08 #15

James Kanze wrote:
Hmmm. Sounds like Solaris 2.4. More recent Solaris doesn't
have this problem. (Things do slow down when you start
swapping, but the machine remains usable for simple
things---like opening an xterm to do ps and kill.)
The few times I have accidentally allocated a 4-gigabyte vector
because of a bug (or because of not validating input) I have really
wished they did something to solve this problem in Linux. It really
doesn't make too much sense from a security point of view that a regular
user can hinder the entire system so badly with a simple self-made program.

It has also happened to me in Windows XP as well (similar
allocate-4GB-because-of-non-validated-input error), and the behavior was
approximately the same.
Jun 27 '08 #16

Juha Nieminen wrote:
You wish. What will more likely happen is that the OS will start
swapping like mad and your system will be next to unusable the next half
hour, while you try desperately to kill your program.
I can confirm this behavior. Tried to read a 9GB file all at once into
4GB of RAM. I think my kernel ended up in swap space :)
Jun 27 '08 #17

Juha Nieminen wrote:
It really
doesn't make too much sense from a security point of view that a regular
user can hinder the entire system so badly with a simple self-made program.
ulimit(1), setrlimit(2)

Of course the design is somewhat broken since it only applies
per-process limits, and you can't set a per-user limit. (Some other
systems, for example VMS, had better resource control.)
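A sketch of the per-process variant (RLIMIT_AS caps the address space;
the function name and the sizes used in the test are arbitrary choices,
and getrlimit/setrlimit are POSIX, not standard C++):

```cpp
#include <sys/resource.h>   // POSIX getrlimit/setrlimit
#include <cstddef>
#include <new>
#include <vector>

// Cap this process's own address space, then try a huge allocation.
// Returns true if the allocation was refused with std::bad_alloc
// instead of dragging the machine into swap.
bool huge_alloc_refused(std::size_t cap_bytes, std::size_t request_bytes)
{
    rlimit lim;
    getrlimit(RLIMIT_AS, &lim);
    lim.rlim_cur = cap_bytes;   // lowering a soft limit needs no privilege
    setrlimit(RLIMIT_AS, &lim);

    try {
        std::vector<char> huge(request_bytes);
        return false;           // it actually fit under the cap
    }
    catch (std::bad_alloc &) {
        return true;            // refused cleanly
    }
}
```

Whether the failure really surfaces as std::bad_alloc still depends on
the overcommit behavior discussed earlier in the thread.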
Jun 27 '08 #18
