Working with binary files in C++

knapak

Hello

I'm a self instructed amateur attempting to read a huge file from disk... so
bear with me please... I just learned that reading a file in binary is
faster than text. So I wrote the following code that compiles OK. It runs and
shows the requested output. However, after execution, it pops one of those
windows to send error reports online to the program creator. I have managed
to find where the error is but can't see what's wrong. I'm posting the whole
code for context. I'm also marking where the problem is.

I appreciate your assistance. Thanks

#include <fstream>
#include <iostream>
#include <map>
using namespace std;

int main()
{
    typedef map<int, double> IMAP;
    IMAP Grid, NewGrid;

    int IntValue1, rows = 3;
    double DouValue2;

    for(int i=0; i < rows; i++)
    {
        IntValue1 = i + 1;
        DouValue2 = i * 2;
        Grid.insert(IMAP::value_type(IntValue1, DouValue2));
    }

    IMAP::const_iterator IteratorG = Grid.begin();

    cout << "Original Map" << endl;
    while (IteratorG != Grid.end())
    {
        cout << IteratorG->first << " " << IteratorG->second << endl;
        IteratorG++;
    }

    ofstream FileOut("C:/MyBinary.bin", ios::binary);
    FileOut.write((char*) &Grid, sizeof Grid);
    FileOut.close();

    // ******** PROBLEM IN HERE ******************
    ifstream FileIn("C:/MyBinary.bin", ios::binary);
    FileIn.read((char*) &NewGrid, sizeof NewGrid);
    FileIn.close();
    // *******************************************

    IMAP::const_iterator NewIteratorG = NewGrid.begin();

    cout << " " << endl;
    cout << "New Map" << endl;
    while (NewIteratorG != NewGrid.end())
    {
        cout << NewIteratorG->first << " " << NewIteratorG->second << endl;
        NewIteratorG++;
    }

    return 0;
}

Nov 17 '05 #1


Tom Widmer

knapak wrote:

> Hello
>
> I'm a self instructed amateur attempting to read a huge file from disk... so
> bear with me please... I just learned that reading a file in binary is
> faster than text.

However, writing in binary has a lot of potential problems, the main one
being that you can't write pointers, references or any non-POD types as
binary.

> So I wrote the following code that compiles OK. It runs and shows the
> requested output. However, after execution, it pops one of those
> windows to send error reports online to the program creator. I have managed
> to find where the error is but can't see what's wrong. I'm posting the whole
> code for context. I'm also marking where the problem is.
>
> I appreciate your assistance. Thanks
>
> #include <fstream>
> #include <iostream>
> #include <map>
> using namespace std;
>
> int main()
> {
>     typedef map<int, double> IMAP;
>     IMAP Grid, NewGrid;
>
>     int IntValue1, rows = 3;
>     double DouValue2;
>
>     for(int i=0; i < rows; i++)
>     {
>         IntValue1 = i + 1;
>         DouValue2 = i * 2;
>         Grid.insert(IMAP::value_type(IntValue1, DouValue2));
>     }
>
>     IMAP::const_iterator IteratorG = Grid.begin();
>
>     cout << "Original Map" << endl;
>     while (IteratorG != Grid.end())
>     {
>         cout << IteratorG->first << " " << IteratorG->second << endl;
>         IteratorG++;

Prefer pre-increment where possible, since it can be faster:
++IteratorG;

>     }

The above would normally be written as a for loop.

>     ofstream FileOut("C:/MyBinary.bin", ios::binary);
>     FileOut.write((char*) &Grid, sizeof Grid);

Ok, the above just wrote out the internal structure of a map object.
This structure probably consists of pointers out to various nodes of the
map, such as the root, begin and end nodes, and probably a variable
holding the size of the map. So, as a guess, the above code is writing
out the values of three pointers to structures internal to the map, and
not one entry stored in the map is actually written out.

>     FileOut.close();
>
>     // ******** PROBLEM IN HERE ******************
>     ifstream FileIn("C:/MyBinary.bin", ios::binary);
>     FileIn.read((char*) &NewGrid, sizeof NewGrid);

The above is writing over the internal pointers and size stored in the
NewGrid object, which has undefined behaviour. You now have two
different map objects (Grid and NewGrid) that are sharing the same
internal data structures! This means that both Grid and NewGrid will
attempt to destroy the same structures when they go out of scope, which
will crash at best, and corrupt the heap in some more subtle way at worst.
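
To make the point about sizeof concrete, here is a small sketch. The helper names raw_bytes and make_grid are made up for this illustration and are not part of the original program. It shows that the number of bytes the original write call copies is fixed by the map type and does not grow with the number of entries, so the entries themselves cannot be in those bytes.

```cpp
#include <cstddef>
#include <map>

// Hypothetical helper: the number of bytes that
// write((char*) &m, sizeof m) would copy. sizeof is a compile-time
// property of the type, so it cannot depend on the map's contents.
std::size_t raw_bytes(const std::map<int, double>& m)
{
    return sizeof m;
}

// Hypothetical helper: builds a map with n entries using the same
// pattern as the original post (key = i + 1, value = i * 2).
std::map<int, double> make_grid(int n)
{
    std::map<int, double> grid;
    for (int i = 0; i < n; ++i)
        grid.insert(std::map<int, double>::value_type(i + 1, i * 2.0));
    return grid;
}
```

Calling raw_bytes(make_grid(3)) and raw_bytes(make_grid(1000)) gives the same number, whatever that number happens to be on your compiler.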

In order to write out a map in either text or binary, you have to
iterate over the elements in the map and write them out one by one. You
are only allowed to binary read/write built in types, like int and
double, and C style structs that have no constructor/destructor or
private data. e.g. this is ok:

struct A
{
int a;
double b;
};

but std::pair (for example) is not. So to do the binary writing you need
to get down to the level of individual keys and values. e.g.

//write out the size, so we know how much to read in:
IMAP::size_type size = Grid.size();
//use reinterpret_cast to show we're doing something strange
FileOut.write(reinterpret_cast<char*>(&size), sizeof size);
for (IMAP::const_iterator i = Grid.begin();
     i != Grid.end();
     ++i)
{
    FileOut.write(
        reinterpret_cast<char const*>(&i->first),
        sizeof i->first);
    FileOut.write(
        reinterpret_cast<char const*>(&i->second),
        sizeof i->second);
}

Finally, read them into a new map like this:

IMAP NewGrid;
IMAP::size_type size;
FileIn.read(reinterpret_cast<char*>(&size), sizeof size);
//now we know how many entries to read
for (IMAP::size_type i = 0; i < size; ++i)
{
    int key;
    double value;
    FileIn.read(
        reinterpret_cast<char*>(&key),
        sizeof key);
    FileIn.read(
        reinterpret_cast<char*>(&value),
        sizeof value);
    //finally add it to the map
    NewGrid.insert(IMAP::value_type(key, value));
}

Hopefully, that should do it, but note that I haven't compiled or tested
the code. As a final point, you should check the return value of every
call to read and write to make sure IO hasn't failed. You should also
note that binary files written as above generally aren't portable - you
won't be able to load the file using a PowerPC-based Mac, for example.

Tom

Nov 17 '05 #2

knapak

Tom

Thank you so much for your help, this problem was driving me nuts!!!

A couple of things. The whole purpose of this code is to reduce the time to
load a big data file and load it into a map or multimap to be able to quickly
find a record in the maze of data. When I did it with text files it took a
grueling 40 minutes to read the file... yup only to read the file. Using
binaries was suggested to me to reduce the reading time by loading the data
in "one big chunk". I don't know if this is correct or not, but it certainly
reduced the time. Now your suggestion reads one record at a time...
mind you, your suggested code does work and takes only a few seconds to read
the data. Still, I wonder if those few seconds could still be somehow reduced
say from 8 to 4... I know I'm being ambitious, but I'd like to optimize this
part of the program as much as possible. If not, I'll be happy with this
solution.

The second question is related to your comment about portability. A file
saved as binary with this code in Windows cannot be read in UNIX? I thought
binary files could be read anywhere... Can this problem be solved? For
example, should I leave the data file as text (ASCII) and load it as binary
in the same amount of time? Can then the same file be read both in Windows
and UNIX?

Again thanks a million for your kind assistance!

Carlos

Nov 17 '05 #3

Tom Widmer

knapak wrote:

> Tom
>
> Thank you so much for your help, this problem was driving me nuts!!!
>
> A couple of things. The whole purpose of this code is to reduce the time to
> load a big data file and load it into a map or multimap to be able to quickly
> find a record in the maze of data. When I did it with text files it took a
> grueling 40 minutes to read the file... yup only to read the file. Using
> binaries was suggested to me to reduce the reading time by loading the data
> in "one big chunk". I don't know if this is correct or not, but it certainly
> reduced the time.

Unfortunately, std::map doesn't sit in memory in one large chunk - there
is one chunk for each entry in the map, so there is no way to write out
the map without iterating over the entries.

> Now your suggestion goes reading one record at a time... mind me, your
> suggested code does work and takes only a few seconds to read
> the data. Still, I wonder if those few seconds could still be somehow reduced
> say from 8 to 4... I know I'm being ambitious, but I'd like to optimize this
> part of the program as much as possible. If not, I'll be happy with this
> solution.

I'm sure it is possible to reduce the time further. One approach is to
remove the calls to "read" and "write" and replace them with calls like
this:

FileOut.rdbuf()->sputn(same params as for write);

FileIn.rdbuf()->sgetn(same params as for read);

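sputn/sgetn can be quite a bit faster than write/read, because they go
straight to the stream buffer and skip the per-call state checks that
write/read perform. They also don't set the stream's error flags, so
check the byte count they return instead.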

Another approach is to take the map and transfer its contents to a
vector, which can be written out in one chunk. I've posted two different
approaches, one legal but a bit slower, the other illegal, but likely to
work on most platforms:

#include <algorithm>
#include <istream>
#include <iterator>
#include <map>
#include <ostream>
#include <vector>
using namespace std;

typedef map<int, double> IMAP;

struct IMAP_POD
{
    int key;
    double value;
};

//converts map entries to POD structs and back
struct IMAPConverter
{
    IMAP_POD operator()(IMAP::const_reference val) const
    {
        IMAP_POD p = {val.first, val.second};
        return p;
    }

    std::pair<int, double> operator()(IMAP_POD const& val) const
    {
        return std::pair<int, double>(val.key, val.value);
    }
};

void writeIMAP(IMAP const& m, ostream& os)
{
    vector<IMAP_POD> v(m.size());
    transform(m.begin(), m.end(), v.begin(), IMAPConverter());
    //write the size:
    vector<IMAP_POD>::size_type size = v.size();
    os.write(reinterpret_cast<char*>(&size), sizeof size);
    //write the map as a single vector (&v[0] needs a non-empty vector):
    if (!v.empty())
        os.write(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
}

void readIMAP(IMAP& m, istream& is)
{
    vector<IMAP_POD>::size_type size;
    //read the size:
    is.read(reinterpret_cast<char*>(&size), sizeof size);
    vector<IMAP_POD> v(size);
    //read the map as a single vector:
    if (!v.empty())
        is.read(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
    vector<std::pair<int, double> > typedV;
    typedV.reserve(size);
    transform(v.begin(), v.end(), back_inserter(typedV), IMAPConverter());
    //range insert for a sorted range
    //is much faster than inserting one by one
    m.insert(typedV.begin(), typedV.end());
}

Illegal approach:

typedef map<int, double> IMAP;

void writeIMAP(IMAP const& m, ostream& os)
{
    typedef std::pair<int, double> non_const_value_type;
    vector<non_const_value_type> v;
    v.reserve(m.size());
    v.insert(v.begin(), m.begin(), m.end());
    //write the size:
    vector<non_const_value_type>::size_type size = v.size();
    os.write(reinterpret_cast<char*>(&size), sizeof size);
    //write the map as a single vector (&v[0] needs a non-empty vector):
    if (!v.empty())
        os.write(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
}

void readIMAP(IMAP& m, istream& is)
{
    typedef std::pair<int, double> non_const_value_type;
    vector<non_const_value_type>::size_type size;
    //read the size:
    is.read(reinterpret_cast<char*>(&size), sizeof size);
    vector<non_const_value_type> v(size);
    //read the map as a single vector:
    if (!v.empty())
        is.read(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
    m.insert(v.begin(), v.end());
}

The reason that is illegal is that you can only copy the bytes into and
out of POD types, and std::pair<int, double> is not a POD type. However,
pair<int, double> is close to being a POD type (it doesn't have any base
classes or virtual functions, and the destructor is basically a no-op),
so the above is very likely to work on every platform.
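
If you happen to have a compiler with C++11 type traits (an assumption, since they are newer than the code above), you can ask the compiler which types may legally be copied as raw bytes. The helper name bytewise_copy_is_defined is made up for this sketch. The answer for std::pair<int, double> can differ between standard library implementations, which is why it is queried here instead of assumed.

```cpp
#include <map>
#include <type_traits>
#include <utility>

// Same layout as the IMAP_POD struct in the legal approach.
struct IMAP_POD
{
    int key;
    double value;
};

// Hypothetical helper: true when the language guarantees that copying
// the bytes of a T in and out of a char buffer preserves its value.
template <typename T>
bool bytewise_copy_is_defined()
{
    return std::is_trivially_copyable<T>::value;
}
```

bytewise_copy_is_defined<IMAP_POD>() is true, and the same call on std::map<int, double> is false. Calling it with std::pair<int, double> tells you whether the "illegal" version is actually sanctioned on your particular library.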

> The second question is related to your comment about portability. A file
> saved as binary with this code in Windows cannot be read in UNIX?

The problem here is the format used by the CPU and compiler to hold ints
and doubles, and the sizes of those types. For example, some CPUs use a
big endian 64-bit 2s complement format for "int", while Windows (and
UNIX) compilers for x86 use a little endian 32-bit 2s-complement format.
Basically, the bits for a particular int value (such as 1234567) are
quite different on some platforms.

There's a bit more on it here:
http://www.eskimo.com/~scs/C-faq/q20.5.html

If you want portable binary, you need to decide on exactly the binary
format you want, and then make sure that your code writes out bytes in
the right format (byte-order swapping and padding as necessary).

> I thought binary files could be read anywhere... Can this problem be solved? For
> example, should I leave the data file as text (ASCII) and load it as binary
> in the same amount of time? Can then the same file be read both in Windows
> and UNIX?

It should be possible to optimize the code you use to read it as text so
that it operates much faster. If you need portability, this may be the
best option. If you want to do this, I'd suggest posting the code you
have (in a new thread) and asking for help in speeding it up.

Tom

Nov 17 '05 #4

knapak

Tom

Thanks again for your invaluable help. As for the alternative to write and
read, yes it improved the loading time... by about 0.4 of a sec (4.2 to
3.8)... which to me is quite good. I have to admit that your methods were
completely unknown to me (remember I'm an amateur). I guess my only question
would be if there's any room for problems by using the reinterpret_cast.

As for the portability problem, you actually suggested to explore some
alternatives of standardized binary formats including netCDF. Actually I've
tried using netCDF but didn't quite follow the procedures, and it is very
difficult to find people with the expertise to provide assistance. For now
I'll try to work with your solution and eventually when my files get bigger
and do require switching between Windows and UNIX I'll come back and ask
directly if anyone knows how to work with netCDF.

I very much appreciate the time you took to help me.

Carlos

Nov 17 '05 #5
