
safely reading large files

How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("example.txt", ios::binary);
  if (myfile.is_open())
  {
    while (! myfile.eof() )
    {
      getline (myfile, line);
      cout << line << endl;
    }
    myfile.close();
  }
  else cout << "Unable to open file";

  return 0;
}

In particular, what if a line in the file is more than the amount of
available physical memory? What would happen? Seems getline() would
cause a crash. Is there a better way? Maybe... check amount of free
memory, then use 10% or so of that amount for the read. So if 1GB of
memory is free, then take 100MB for file IO. If only 10MB is free,
then just read 1MB at a time. Repeat this step until the file has been
read completely. Is something built into standard C++ to handle this?
Or is there an accepted way to do this?

Thanks,

Brad
Jun 27 '08 #1
by*******@gmail.com wrote:
How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("example.txt", ios::binary);
  if (myfile.is_open())
  {
    while (! myfile.eof() )
    {
      getline (myfile, line);
      cout << line << endl;
    }
    myfile.close();
  }
  else cout << "Unable to open file";

  return 0;
}

In particular, what if a line in the file is more than the amount of
available physical memory? What would happen? Seems getline() would
cause a crash. Is there a better way? Maybe... check amount of free
memory, then use 10% or so of that amount for the read. So if 1GB of
memory is free, then take 100MB for file IO. If only 10MB is free,
then just read 1MB at a time. Repeat this step until the file has been
read completely. Is something built into standard C++ to handle this?
Or is there an accepted way to do this?
Actually, performing operations that can lead to running out of memory
is not a simple thing at all. Yes, if you can estimate the amount of
memory you will need over what you right now want to allocate and you
know the size of available memory somehow, then you can allocate a chunk
and operate on that chunk until done and move over to the next chunk.
In the good ol' days that's how we solved large systems of linear
equations, one piece of the matrix at a time (or two if the algorithm
called for it).

Unfortunately there is no single straightforward solution. In most
cases you don't even know that you're going to run out of memory until
it's too late. You can write the program to handle those situations
using C++ exceptions. The pseudo-code might look like this:

     std::size_t chunk_size = 1024*1024*1024;
     MyAlgorithm algo;

     do {
         try {
             algo.prepare_the_operation(chunk_size);
             // if I am here, the chunk_size is OK
             algo.perform_the_operation();
             algo.wrap_it_up();
         }
         catch (std::bad_alloc & e) {
             chunk_size /= 2; // or any other adjustment
         }
     }
     while (chunk_size > 1024*1024); // or some other threshold

That way if your preparation fails, you just restart it using a smaller
chunk, until you either complete the operation or your chunk is too
small and you can't really do anything...
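
A minimal sketch of that retry-with-a-smaller-chunk idea applied to reading a
file might look like the following (the file name, the sizes, and the use of a
std::vector as the buffer are illustrative assumptions, not anything prescribed
above):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <new>        // std::bad_alloc
#include <vector>

int main()
{
    std::ifstream in("example.txt", std::ios::binary);
    if (!in) { std::cerr << "Unable to open file\n"; return 1; }

    std::size_t chunk_size = 1024 * 1024 * 1024;    // start by asking for 1GB
    std::vector<char> buffer;

    while (in && chunk_size >= 1024 * 1024)         // give up below 1MB
    {
        try {
            buffer.resize(chunk_size);              // may throw std::bad_alloc
        }
        catch (std::bad_alloc&) {
            chunk_size /= 2;                        // retry with a smaller chunk
            continue;
        }
        in.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
        std::cout.write(&buffer[0], in.gcount());   // write only what was read
    }
    return 0;
}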

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Jun 27 '08 #2
Sam
by*******@gmail.com writes:
    while (! myfile.eof() )
    {
      getline (myfile, line);
      cout << line << endl;
    }
    myfile.close();
  }
  else cout << "Unable to open file";

  return 0;
}

In particular, what if a line in the file is more than the amount of
available physical memory?
The C++ library will fail to allocate sufficient memory, throw an exception,
and terminate the process.
What would happen? Seems getline() would
cause a crash. Is there a better way? Maybe... check amount of free
memory, then use 10% or so of that amount for the read.
And where exactly would you propose to store the remaining 90% of
std::string?
read completely. Is something built into standard C++ to handle this?
No. std::getline() reads as much as necessary, until the end-of-line
character, and the resulting std::string has to be big enough to store the
entire line.
Or is there an accepted way to do this?
If you need a way to handle this situation, you would not use std::getline(),
but some different approach, like std::istream::get or std::istream::read.
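
For illustration, a minimal sketch using std::istream::get with a fixed-size
buffer, so that no single line ever has to fit in memory at once (the file name
and the 64KB buffer size are arbitrary choices):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("example.txt", std::ios::binary);
    if (!in) { std::cerr << "Unable to open file\n"; return 1; }

    const std::streamsize N = 64 * 1024;   // 64KB working buffer
    char buffer[N];

    while (in)
    {
        // get() stores at most N-1 characters and stops before '\n'
        in.get(buffer, N, '\n');
        std::cout.write(buffer, in.gcount());

        // get() sets failbit on an empty line (zero characters stored);
        // clear it so the loop can continue
        if (in.fail() && !in.eof() && in.gcount() == 0)
            in.clear();

        if (in.peek() == '\n')
        {
            in.ignore();                   // consume the newline
            std::cout.put('\n');
        }
    }
    return 0;
}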


Jun 27 '08 #3
by*******@gmail.com wrote:
How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:
Others have already answered your question, so I'm going to address
something else.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("example.txt", ios::binary);
  if (myfile.is_open())
  {
This while loop does not do what you think it does. See FAQ 15.5
(http://parashift.com/c++-faq-lite/in....html#faq-15.5)
    while (! myfile.eof() )
    {
      getline (myfile, line);
      cout << line << endl;
    }
    myfile.close();
  }
  else cout << "Unable to open file";

  return 0;
}
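
For reference, a minimal sketch of the loop the FAQ recommends, testing the
stream returned by getline() instead of calling eof():

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream myfile("example.txt", std::ios::binary);
    if (!myfile)
    {
        std::cout << "Unable to open file";
        return 1;
    }

    std::string line;
    while (std::getline(myfile, line))   // stops cleanly at end of file
        std::cout << line << '\n';

    return 0;
}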
Jun 27 '08 #4
On May 21, 4:11 am, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
byte8b...@gmail.com wrote:
How does C++ safely open and read very large files? For example, say I
have 1GB of physical memory and I open a 4GB file and attempt to read
it like so:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main () {
  string line;
  ifstream myfile ("example.txt", ios::binary);
  if (myfile.is_open())
  {
    while (! myfile.eof() )
    {
      getline (myfile, line);
      cout << line << endl;
    }
    myfile.close();
  }
  else cout << "Unable to open file";
  return 0;
}
In particular, what if a line in the file is more than the
amount of available physical memory? What would happen?
Seems getline() would cause a crash. Is there a better way?
Maybe... check amount of free memory, then use 10% or so of
that amount for the read. So if 1GB of memory is free, then
take 100MB for file IO. If only 10MB is free, then just read
1MB at a time. Repeat this step until the file has been read
completely. Is something built into standard C++ to handle
this? Or is there an accepted way to do this?
Actually, performing operations that can lead to running out
of memory is not a simple thing at all.
I'm sure you don't mean what that literally says. There's
certainly nothing difficult about running out of memory. Doing
something reasonable (other than just aborting) when it happens
is difficult, however.
Yes, if you can estimate the amount of memory you will need
over what you right now want to allocate and you know the size
of available memory somehow, then you can allocate a chunk and
operate on that chunk until done and move over to the next
chunk. In the good ol' days that's how we solved large
systems of linear equations, one piece of the matrix at a time
(or two if the algorithm called for it).
And you'd manually manage overlays, as well, so that only part
of the program was in memory at a time. (I once saw a PL/1
compiler which ran in 16 KB real memory, using such techniques.
Took something like three hours to compile a 500 line program,
but it did work.)
Unfortunately there is no single straightforward solution. In
most cases you don't even know that you're going to run out of
memory until it's too late. You can write the program to
handle those situations using C++ exceptions. The pseudo-code
might look like this:
     std::size_t chunk_size = 1024*1024*1024;
     MyAlgorithm algo;
     do {
         try {
             algo.prepare_the_operation(chunk_size);
             // if I am here, the chunk_size is OK
             algo.perform_the_operation();
             algo.wrap_it_up();
         }
         catch (std::bad_alloc & e) {
             chunk_size /= 2; // or any other adjustment
         }
     }
     while (chunk_size > 1024*1024); // or some other threshold
Shouldn't the condition here be "while ( operation not done )",
something like:

    bool didIt = false;
    do {
        try {
            // your code from the try block
            didIt = true;
        }
        // ... your catch
    } while ( ! didIt );
That way if your preparation fails, you just restart it using
a smaller chunk, until you either complete the operation or
your chunk is too small and you can't really do anything...
Just a note, but that isn't always reliable. Not all OS's will
tell you when there isn't enough memory: they'll return an
address, then crash or suspend your program when you try to
access it. (I've seen this happen on at least three different
systems: Windows, AIX and Linux. At least in the case of AIX
and Linux, and probably Windows as well, it depends on the
version, and some configuration parameters, but most Linux systems are
still configured so that you cannot catch allocation errors: if
the command "/sbin/sysctl vm.overcommit_memory" displays any
value other than 2, then a reliably conforming implementation of
C or C++ is impossible.)

--
James Kanze (GABI Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Jun 27 '08 #5
by*******@gmail.com wrote:
How does C++ safely open and read very large files?
How about this mate? It's a start.
// read a file into memory
// read in chunks if the file is larger than 16MB
#include <iostream>
#include <fstream>

using namespace std;

int main ()
{
  const int max = 16384000;        // chunk size, roughly 16MB

  ifstream is;
  is.open ("test.txt", ios::binary);
  if (!is.is_open())
  {
    cout << "Unable to open file";
    return 1;
  }

  // get length of file:
  is.seekg (0, ios::end);
  int file_size = is.tellg();
  is.seekg (0, ios::beg);

  if (file_size > max)
  {
    // allocate one chunk-sized buffer and reuse it
    char* buffer = new char[max];

    cout << file_size << " bytes... break up to read" << endl;
    while (is)
    {
      // read data as a block; the last block may be short
      is.read (buffer, max);

      // write only the bytes actually read to stdout
      cout.write (buffer, is.gcount());
    }
    delete[] buffer;
  }
  else
  {
    // allocate memory for the whole file:
    char* buffer = new char[file_size];

    cout << file_size << " bytes" << endl;
    is.read (buffer, file_size);
    cout.write (buffer, is.gcount());
    delete[] buffer;
  }

  is.close();

  return 0;
}
Jun 27 '08 #6
On May 21, 5:33 am, James Kanze <james.ka...@gmail.com> wrote:
[...]
Just a note, but that isn't always reliable. Not all OS's will
tell you when there isn't enough memory: they'll return an
address, then crash or suspend your program when you try to
access it. [...]

Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)

--
Leandro T. C. Melo

Jun 27 '08 #7
ltcmelo wrote:
[..]
Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)
Why do you say it's non-standard? The Standard defines 'streampos' for
stream positioning; on a system whose files can exceed what 32 bits can
represent, any decent library implementation should make it the next
larger integral type (whatever that might be on that system). Also,
there is 'streamoff' for offsetting in a stream buffer. Both are
implementation-defined.
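
For instance, a minimal sketch that computes a file's size through those types
instead of a plain int (assuming the implementation's streamoff is wide enough
for the file):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("example.txt", std::ios::binary);
    if (!in) { std::cerr << "Unable to open file\n"; return 1; }

    in.seekg(0, std::ios::beg);
    std::streampos beg = in.tellg();
    in.seekg(0, std::ios::end);
    std::streampos end = in.tellg();

    // streamoff is the implementation's offset type; with large-file
    // support it is wide enough to describe positions past 4GB
    std::streamoff size = end - beg;
    std::cout << "file size: " << size << " bytes\n";
    return 0;
}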

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Jun 27 '08 #8
On May 21, 4:19 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
ltcmelo wrote:
[..]
Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)

Why do you say it's non-standard? The Standard defines 'streampos' for
stream positioning; on a system whose files can exceed what 32 bits can
represent, any decent library implementation should make it the next
larger integral type (whatever that might be on that system). Also,
there is 'streamoff' for offsetting in a stream buffer. Both are
implementation-defined.
Maybe "non-portable" (across systems) would have been a better
choice... I say that for particular cases like in the GNU C Library
where you can use macros to enable support for large files.
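
As an illustration (glibc-specific, not standard C++), that macro is typically
defined before any system header, or passed to the compiler as
-D_FILE_OFFSET_BITS=64:

// glibc feature macro: make off_t and the stdio offset functions 64-bit
// even on a 32-bit platform
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>   // off_t
#include <iostream>

int main()
{
    std::cout << "off_t is " << sizeof(off_t) * 8 << " bits wide\n";
    return 0;
}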

--
Leandro T. C. Melo
Jun 27 '08 #9
ltcmelo wrote:
On May 21, 4:19 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
>ltcmelo wrote:
>>[..]
Even assuming you have memory, there's another concern. If you're
running on a 32-bit platform, you can have problems working with such
large files. Basically, the file pointer might not be large enough to
traverse, for example, a file with more than 4GB. (There are "non-
standard" solutions for this.)
Why do you say it's non-standard? The Standard defines 'streampos' for
stream positioning; on a system whose files can exceed what 32 bits can
represent, any decent library implementation should make it the next
larger integral type (whatever that might be on that system). Also,
there is 'streamoff' for offsetting in a stream buffer. Both are
implementation-defined.

Maybe "non-portable" (across systems) would have been a better
choice... I say that for particular cases like in the GNU C Library
where you can use macros to enable support for large files.
I am still confused. What's non-portable if you use standard types?
Perhaps you're using some special meaning of the word "non-portable" (or
the word "portable") ... Care to elaborate?

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Jun 27 '08 #10
