Can't read all data off file accurately with fstream

I'm writing a program to read 2,000,000 floating point numbers off a text
file, to compute the sum, mean, and median. This is an example taken
directly from Stroustrup's paper.

But the program will not display the total number of elements beyond 23,272:

#include<vector>
#include<fstream>
#include<iostream>
#include<algorithm>
using namespace std;

int main(int argc, char* argv[])
{
char* file = argv[2] ;
vector<double> buf;

double median = 0;
double mean = 0;

fstream fin("num.txt", ios::in) ; // open file for input
double d;

while(fin >> d) {
buf.push_back(d) ;
mean = (buf.size()==1) ? d : mean+(d-mean)/buf.size(); // prone to rounding errors
}

sort(buf.begin() ,buf.end()) ;

if (buf.size()) {
int mid = buf.size()/2;
median = (buf.size()%2) ? buf[mid] : (buf[mid-1]+buf[mid])/2;
}

cout << "Number of elements = " << buf.size()
<< ", median = " << median << ", mean = " << mean << "\n";

}

Anyone have any ideas?

-Don Kim
Jul 22 '05 #1
Don Kim wrote:

Anyone have any ideas?


I ran your code unmodified on almost 20 million values.

Number of elements = 19315296, median = 3.4, mean = 3.03333

Are you sure you have 2 million elements in your file ?
Jul 22 '05 #2
I also ran it with 2,000,000 numbers in the file on Windows XP, with the
program compiled with gcc: no problems.

Jul 22 '05 #3
"Gianni Mariani" <gi*******@mariani.ws> wrote
Are you sure you have 2 million elements in your file ?


Yes.

I wrote a program to generate random numbers to a text file. Originally, I
had it generate the numbers to file like this:

98.989
72.585
58.986

When the numbers are floating point and formatted like that, I get the odd
behavior. But when I make the program generate integer numbers like this
instead:

98
75
58

The program runs correctly.

I'm running this on Windows XP, and tested it on VC 7.1, 8.0, gcc 3.3.3 and
Digital Mars 8.4.1 and get the same odd behavior on all compilers.

Puzzled as to why this is the case.

-Don Kim
Jul 22 '05 #4

"Don Kim" <de*******@nospam.donkim.info> wrote in message
news:ky*****************@newssvr13.news.prodigy.com...
"Gianni Mariani" <gi*******@mariani.ws> wrote
Are you sure you have 2 million elements in your file ?
Yes.

I wrote a program to generate random numbers to a text file. Originally, I
had it generate the numbers to file like this:

98.989
72.585
58.986

When the numbers are floating point and formatted like that, I get the odd
behavior. But when I make the program generate integer numbers like this
instead:

98
75
58

The program runs correctly.

I'm running this on Windows XP, and tested it on VC 7.1, 8.0, gcc 3.3.3 and Digital Mars 8.4.1 and get the same odd behavior on all compilers.

Puzzled as to why this is the case.


Show the code that created the file.

-Mike
Jul 22 '05 #5
"Mike Wahler" <mk******@mkwahler.net> wrote
Show the code that created the file.

-Mike


Ok, here it is:

#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
using namespace std;

int main()
{
srand(time(0));

cout << "Enter a number: ";
int n;
cin >> n;

ofstream numfile("num.txt");
for (int i = 0; i < n; i++)
{
//numfile << (rand() % 99)*((double)rand()/rand()) << "\n"; //to create floats
numfile << rand() % 99 << "\n"; //to create ints
}
numfile.close();

}

Jul 22 '05 #6
On another issue, the code was adapted from Stroustrup's fine article
"Learning Standard C++ as a New Language", and running the C code against
the C++ code, my average running times on the program were as follows:

C version:

Unoptimized: 25 secs.
Optimized: 26 secs.

C++ version:

Unoptimized: 75 secs.
Optimized: 35 secs.

This was done on VC 7.1 on a P3 500 MHz, 800 MB Ram PC running WinXP with 5
million integer input values.

This seems to be the opposite of Stroustrup's results. I haven't run this
with other compilers yet.

-Don Kim

P.S. - Here's the C version for those interested:

// C-style solution:
#include<stdlib.h>
#include<stdio.h>

#include "timecpp.hpp"
using namespace timecpp;

int compare(const void* p, const void* q) // comparison function for use by qsort()
{
register double p0 = *(double*)p; // compare doubles
register double q0 = *(double*)q;
if (p0 > q0) return 1;
if (p0 < q0) return -1;
return 0;
}

void quit() // write error message and quit
{
fprintf(stderr,"memory exhausted\n") ;
exit(1) ;
}

int main(int argc, char* argv[])
{
timer t;
t.start();

int res = 1000; // initial allocation
char* file = argv[2];

double* buf = (double*)malloc(sizeof(double)*res) ;
if (buf==0) quit() ;

double median = 0;
double mean = 0;
int n = 0; // number of elements

FILE* fin = fopen("num.txt","r") ; // openfile for reading
double d;
while (fscanf(fin,"%lg",&d)==1) { // read number, update running mean
if (n==res) {
res += res;
buf = (double*)realloc(buf,sizeof(double)*res) ;
if (buf==0) quit() ;
}
buf[n++] = d;
mean = (n==1) ? d : mean+(d-mean); // prone to rounding errors
}

qsort(buf, n, sizeof(double) , compare) ;

if (n) {
int mid = n/2;
median = (n%2) ? buf[mid] : (buf[mid-1]+buf[mid])/2;
}

printf("number of elements = %d, median = %g, mean = %g\n", n, median,
mean);

t.stop("Time: ");
free(buf) ;
}
Jul 22 '05 #7
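
The running-mean update in both listings relies on the identity mean_n = mean_(n-1) + (x_n - mean_(n-1))/n. The sketch below spells that out; the function names are made up for illustration and are not from the posted programs. It also shows what happens when the division is left out, as in the C listing as posted (`mean+(d-mean)` with no `/n`): the update collapses to `mean = d`, so that version reports the last value read instead of the mean.

```cpp
#include <cstddef>
#include <vector>

// Incremental mean: after the n-th value x, mean += (x - mean) / n.
// running_mean is a hypothetical helper name, not from the original post.
double running_mean(const std::vector<double>& values)
{
    double mean = 0.0;
    std::size_t n = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        ++n;
        mean += (values[i] - mean) / static_cast<double>(n);
    }
    return mean;
}

// The same loop with the division omitted, as in the C listing:
// mathematically mean + (d - mean) is just d, so the result is
// (up to rounding) the last value read, not the mean.
double update_without_division(const std::vector<double>& values)
{
    double mean = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        mean = mean + (values[i] - mean);
    }
    return mean;
}
```

For the values 1, 2, 3, 4 the first function returns 2.5 and the second returns 4.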
Don Kim wrote:
"Mike Wahler" <mk******@mkwahler.net> wrote
Show the code that created the file.

-Mike

Ok, here it is:

....

ofstream numfile("num.txt");
for (int i = 0; i < n; i++)
{
//numfile << (rand() % 99)*((double)rand()/rand()) << "\n"; //to create floats


What happens on divide by zero ?
Jul 22 '05 #8
"Don Kim" <de*******@nospam.donkim.info> wrote in message
news:V_*****************@newssvr21.news.prodigy.com...
"Mike Wahler" <mk******@mkwahler.net> wrote
Show the code that created the file.

-Mike


Ok, here it is:


The problem is a 'range' error in your data creation.
I used your program to write 2,000,000 values to a
file. I loaded it into a text editor and visually
verified that 2 million lines were actually written.
The screenful of values I saw looked OK, but I made
no assumptions.

Then I tried to read the file in with the input
program you posted, but I added an error check to
it:

while(fin >> d) {
buf.push_back(d) ; ++count; /* I defined 'count' as a 'size_t' */
mean = (buf.size()==1) ? d : mean+(d-mean)/buf.size();
}

/* this tells us whether the above loop terminated because of
error or EOF */
if(!fin.eof())
{
cerr << "input error (count == " << count << ")\n";
cerr << "last value read == " << d << '\n';
}

For my test run I got output of:

input error (count == 85459)
last value read == 1
Number of elements = 85459, median = 41.573, mean = 299.587

So I loaded up the file in an editor, and looked at line
85459. It looked like this:

1.#INF

So of course it showed 'last value read' as 1, and the '#'
character put the stream in a 'fail' state (because that
is an invalid character for a floating point value), terminating
the 'while' loop.
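
The behavior described here can be reproduced without a file. The sketch below runs the same `while (in >> d)` loop over any input stream and records whether the loop stopped at end of input; the names `ReadResult`, `read_doubles`, and `read_doubles_from` are illustrative assumptions, not part of the posted program.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Result of draining a stream with operator>>.
struct ReadResult {
    std::vector<double> values; // everything extracted before the loop ended
    bool reached_eof;           // true if the loop ended at end of input
};

// read_doubles is a hypothetical helper, not part of the original program.
ReadResult read_doubles(std::istream& in)
{
    ReadResult r;
    double d;
    while (in >> d) {
        r.values.push_back(d);
    }
    // The stream is in a failed state either way once the loop exits;
    // eofbit is what tells "ran out of input" apart from "could not parse".
    r.reached_eof = in.eof();
    return r;
}

// Convenience wrapper for trying the loop on literal text.
ReadResult read_doubles_from(const std::string& text)
{
    std::istringstream in(text);
    return read_doubles(in);
}
```

Given the text "1.5\n1.#INF\n2.5\n", the loop extracts 1.5, then reads the "1." of "1.#INF" as 1, then fails on the '#' with `reached_eof` false, so 2.5 is never seen. With clean input such as "1.5\n2.5\n", `reached_eof` is true.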

Since you're generating values with 'rand()', of course
the exact point where this happens, and how many times it
happens, if any, can vary. Also, exactly what happens will
vary among implementations. I think this is really a case of
undefined behavior, which the compiler I used (MSVC++)
manifested as the "#INF output". Another compiler might
do something completely different, and not necessarily
consistently.

Morals:

1. *Always* check for *all* possible failures of the functions you call.

2. *Never* make assumptions about the integrity of your test data sets.
Ensure that you *know* their exact content.

3. Always be thinking about the possibility of overflow/underflow
in your numeric objects, and of ways to protect against it.
The facilities provided by std::numeric_limits (in the <limits>
header) can help with this.

4. When initially testing a program, it's best to use a known, static
data set, rather than a randomly created one. It's much easier
to determine if a result is correct if you know what it should
be in advance. Only after you've proven that should you move
on to things like random inputs.
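
Morals 1 and 2 can be put into practice with a small checker that is run over the data file before the real program sees it. This is only a sketch under stated assumptions: one value per line, as in the generator's output; `first_bad_line` is a made-up name; and `std::isfinite` and `nullptr` require C++11. It reports the first line that is not entirely one finite number.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this on literal text
#include <string>

// Returns the 1-based number of the first line that is not exactly one
// finite floating-point value, or 0 if every line is acceptable.
// first_bad_line is a hypothetical name used only for this example.
std::size_t first_bad_line(std::istream& in)
{
    std::string line;
    std::size_t line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        const char* begin = line.c_str();
        char* end = nullptr;
        const double value = std::strtod(begin, &end);
        if (end == begin) return line_no;            // nothing parsed, or empty line
        while (*end == ' ' || *end == '\t' || *end == '\r') ++end;
        if (*end != '\0') return line_no;            // trailing junk such as "#INF"
        if (!std::isfinite(value)) return line_no;   // infinity, NaN, or overflow
    }
    return 0;
}
```

Passing an `std::ifstream` opened on the data file works the same way as the `std::istringstream` used for experiments, since both are `std::istream` objects.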

HTH,
-Mike


Jul 22 '05 #9
"Gianni Mariani" <gi*******@mariani.ws> wrote in message
news:zd********************@speakeasy.net...
Don Kim wrote:
"Mike Wahler" <mk******@mkwahler.net> wrote
Show the code that created the file.

-Mike

Ok, here it is:

...

ofstream numfile("num.txt");
for (int i = 0; i < n; i++)
{
//numfile << (rand() % 99)*((double)rand()/rand()) << "\n"; //to create floats


What happens on divide by zero ?


One possibility is what I discovered. See my other post.

-Mike
Jul 22 '05 #10
"Mike Wahler" <mk******@mkwahler.net> wrote in message
news:La*****************@newsread3.news.pas.earthlink.net...

3. Always be thinking about the possibility of overflow/underflow
in your numeric objects, and of ways to protect against it.
The facilities provided by std::numeric_limits (in the <limits>
header) can help with this.


And as Gianni mentions, and I overlooked, protect from
divide by zero.
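
A guarded version of the generator's float-producing expression might look like the sketch below. It keeps `rand()` as in the original generator, refuses to divide when the divisor is zero, and double-checks the result with `std::isfinite` (C++11). The names `safe_ratio` and `random_value` are hypothetical.

```cpp
#include <cmath>
#include <cstdlib>

// Computes scale * (numerator / divisor) only when that is safe.
// Returns false, leaving 'out' untouched, instead of producing
// infinity or NaN. Hypothetical helper for illustration.
bool safe_ratio(int scale, int numerator, int divisor, double& out)
{
    if (divisor == 0) return false;   // would be a divide by zero
    const double value = scale * (static_cast<double>(numerator) / divisor);
    if (!std::isfinite(value)) return false;
    out = value;
    return true;
}

// Stand-in for (rand() % 99) * ((double)rand() / rand()):
// redraws until the three random numbers give a finite result.
double random_value()
{
    double value = 0.0;
    while (!safe_ratio(std::rand() % 99, std::rand(), std::rand(), value)) {
        // divisor was zero; try again with fresh random numbers
    }
    return value;
}
```

In the generator loop, the commented-out line would then become `numfile << random_value() << "\n";`.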

-Mike
Jul 22 '05 #11
Don Kim wrote:
On another issue, the code was adapted from Stroustrup's fine article
"Learning Standard C++ as a New Language", and running the C code against
the C++ code, my average running times on the program were as follows:

C version:

Unoptimized: 25 secs.
Optimized: 26 secs.

C++ version:

Unoptimized: 75 secs.
Optimized: 35 secs.


Take these numbers with a grain-o-salt. A number of other optimizations
should be considered. e.g. When both C and C++ versions are linked
static on the amd64 version, the times are the same.

AMD Athlon(tm) MP 2400+
gcc version 4.0.0 20050102

C++
Unoptimized: 11.4
Optimized: 7

C
Unoptimized: 8.6
Optimized: 7.6

model name : AMD Opteron(tm) Processor 248
gcc-3.4.2 amd64

C++
Unoptimized: 6.9
Optimized: 3.4

C
Unoptimized: 3.8
Optimized: 2.9

model name : AMD Athlon(tm) MP 2400+
stepping : 1
cpu MHz : 2000.085
gcc version 4.0.0 20050102 (experimental)

$ text_rdr_mkr #integers
Enter a number: 5000000
$ g++ -O0 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
11.430u 0.240s 0:11.66 100.0% 0+0k 0+0io 262pf+0w

$ g++ -O2 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
7.040u 0.270s 0:07.30 100.1% 0+0k 0+0io 259pf+0w

$ g++ -O3 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
6.900u 0.320s 0:07.19 100.4% 0+0k 0+0io 259pf+0w

$ gcc -O0 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 88
8.560u 0.180s 0:08.74 100.0% 0+0k 0+0io 130pf+0w

$ gcc -O2 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 88
7.630u 0.240s 0:07.89 99.7% 0+0k 0+0io 130pf+0w

$ gcc -O3 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 88
7.580u 0.280s 0:07.85 100.1% 0+0k 0+0io 130pf+0w

model name : AMD Opteron(tm) Processor 248
stepping : 10
cpu MHz : 2191.059
gcc-3.4.2

$ g++ -O0 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 49.0195
6.866u 0.130s 0:07.01 99.7% 0+0k 0+0io 0pf+0w

$ g++ -O2 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 49.0195
3.365u 0.105s 0:03.47 99.7% 0+0k 0+0io 0pf+0w

$ g++ -O3 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 49.0195
3.342u 0.105s 0:03.44 100.0% 0+0k 0+0io 0pf+0w

$ gcc -O0 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 48
3.787u 0.084s 0:03.87 99.7% 0+0k 0+0io 0pf+0w

$ gcc -O2 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 48
2.908u 0.076s 0:02.98 99.6% 0+0k 0+0io 0pf+0w

$ gcc -O3 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 48
2.908u 0.080s 0:02.99 99.6% 0+0k 0+0io 0pf+0w

Other optimizations
32bit

$ g++ -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
5.240u 0.270s 0:05.50 100.1% 0+0k 0+0io 102pf+0w

$ gcc -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr_c.c
$ time ./text_rdr_c
number of elements = 5000000, median = 49, mean = 88
7.740u 0.250s 0:07.97 100.2% 0+0k 0+0io 130pf+0w

64bit
$ g++ -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr.cpp
Number of elements = 5000000, median = 49, mean = 49.0195
3.007u 0.106s 0:03.11 99.6% 0+0k 0+0io 0pf+0w

$ gcc -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr_c.c
$ time ./text_rdr_c
number of elements = 5000000, median = 49, mean = 48
3.107u 0.105s 0:03.21 99.6% 0+0k 0+0io 0pf+0w
Jul 22 '05 #12
"Mike Wahler" <mk******@mkwahler.net> wrote
The screenful of values I saw looked OK, but I made
no assumptions.


Excellent points. That's what my problem was... I made assumptions about
the data.

Thanks.

-Don Kim
Jul 22 '05 #13
In article <ze*****************@newssvr21.news.prodigy.com> , Don Kim
<de*******@nospam.donkim.info> writes
C version:

Unoptimized: 25 secs.
Optimized: 26 secs.

C++ version:

Unoptimized: 75 secs.
Optimized: 35 secs.

This was done on VC 7.1 on a P3 500 MHz, 800 MB Ram PC running WinXP with 5
million integer input values.

This seems to be the opposite of Stroustrup's results. I haven't run this
with other compilers yet.

-Don Kim

And one of the points Bjarne Stroustrup has frequently made in the past
is that the variability in performance between different implementations
is far too high. We are sometimes getting an order of magnitude
variation in performance for different implementations of the Standard
Library.

In this case I suspect that the problem lies in whether a compiler (with
its current compilation switches) is inlining small functions or not.
For a compiler such as VC++7.x the term 'optimised' has no meaning
because there are so many different optimisation options available. Note
also that according to your figures the optimisation options you chose
did nothing to improve the C version.
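
One way to see where the time goes, rather than comparing a single total, is to time the read phase and the sort phase separately. The sketch below uses `std::chrono::steady_clock` (C++11). It is not the `timecpp.hpp` timer used in the C listing, whose contents are not shown in this thread, and `PhaseTimes` and `read_and_sort` are made-up names.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, for trying this on literal text
#include <vector>

// Wall-clock time spent in each phase, in seconds.
struct PhaseTimes {
    double read_seconds;
    double sort_seconds;
    std::size_t count;   // number of values read
};

// Reads all doubles from 'in' into 'buf', sorts them, and reports
// how long each phase took. Hypothetical helper for illustration.
PhaseTimes read_and_sort(std::istream& in, std::vector<double>& buf)
{
    typedef std::chrono::steady_clock clock;

    const clock::time_point t0 = clock::now();
    double d;
    while (in >> d) {
        buf.push_back(d);
    }
    const clock::time_point t1 = clock::now();
    std::sort(buf.begin(), buf.end());
    const clock::time_point t2 = clock::now();

    PhaseTimes t;
    t.read_seconds = std::chrono::duration<double>(t1 - t0).count();
    t.sort_seconds = std::chrono::duration<double>(t2 - t1).count();
    t.count = buf.size();
    return t;
}
```

If most of the difference between two builds shows up in `read_seconds`, the stream extraction code is the place to look; if it shows up in `sort_seconds`, the comparison inlining in `std::sort` versus `qsort()` is the more likely factor.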

--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects
Jul 23 '05 #14
