Reading a large number of text files into an array

Hello,

Say I have 1000 text files and each is a list of 32768 integers.

I have written a C program to read this data into a large matrix. I am
using fopen in combination with fscanf to read the data in. However, it
takes about 20 seconds to complete and I wonder if there is a faster way.

For example, I found that I could use 'fread' to read the data into a
string that looks like this:

91\n212\n34\n40\n25\n100\n300\n ... \0

and it is nearly instantaneous. However, is there a quick way to
convert this string into an array of doubles?

Thanks.
-Matt

Here is my existing code. Sorry if it is ugly. It is the first C code
I've written in a long time:

#include <stdio.h>
#include <stdlib.h>

#define nrows 32768L
#define nfiles 1000L

int main(void)
{
    double *data;
    unsigned long filenum, pos;
    char filename[20];
    FILE *fp;
    int i;

    /* Create output matrix */
    data = (double *) malloc((size_t)((nrows*nfiles)*sizeof(double)));

    for(filenum=1; filenum<=nfiles; ++filenum) {
        // Determine current file name
        sprintf(filename, "data%lu.dat", filenum);

        // Open the file
        fp = fopen (filename,"r");

        pos = nrows*(filenum-1L);

        for (i=0; i<nrows; ++i)
            fscanf(fp, "%lf", data+i+pos);

        fclose(fp);
    }

    // De-allocate memory
    free((char*) (data));

    return 0;
}
Nov 14 '05 #1
Matthew Crema wrote:
Hello,

Say I have 1000 text files and each is a list of 32768 integers.

I have written a C program to read this data into a large matrix. I am
using fopen in combination with fscanf to read the data in. However, it
takes about 20 seconds to complete and I wonder if there is a faster way.

For example, I found that I could use 'fread' to read the data into a
string that looks like this:

91\n212\n34\n40\n25\n100\n300\n ... \0

and it is nearly instantaneous. However, is there a quick way to
convert this string into an array of doubles?
sscanf() might just be what you are looking for.
Here is my existing code. Sorry if it is ugly. It is the first C code
I've written in a long time:
I'll comment on it, if you don't mind.
#include <stdio.h>
#include <stdlib.h>

#define nrows 32768L
#define nfiles 1000L
By convention, symbolic constants are written in all-capital letters. Also, if
you add a suffix, why not UL instead of L? Neither value can ever be negative.
int main(void)
{
double *data;
unsigned long filenum, pos;
char filename[20];
This is somewhat unsafe. You should think about a way to make the size of
the array dependent on the maximum length of all components from which the
actual filename is constructed. (This requires dynamic allocation or a VLA
if you have C99).
FILE *fp;
int i;

/* Create output matrix */
data = (double *) malloc((size_t)((nrows*nfiles)*sizeof(double)));
No need for either of the casts here.
for(filenum=1; filenum<=nfiles; ++filenum) {
// Determine current file name
sprintf(filename, "data%lu.dat", filenum);

// Open the file
fp = fopen (filename,"r");
You should always check the return value of fopen().
pos = nrows*(filenum-1L);

for (i=0; i<nrows; ++i)
fscanf(fp, "%lf", data+i+pos);

fclose(fp);
}

// De-allocate memory
free((char*) (data));
Absolutely no need to cast here.
return 0;
}
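
Taken together, the comments above might lead to something like the following. This is only a sketch, not a definitive rewrite: the helper name load_files, its parameters, the 256-byte filename buffer, and the use of C99's snprintf() in place of dynamic allocation are all assumptions made for illustration. It also checks the return value of fscanf(), which the comments above do not mention.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NROWS  32768UL
#define NFILES 1000UL

/* Hypothetical helper: reads `nrows` values from each of the files
   <prefix>1.dat .. <prefix><nfiles>.dat into one malloc'd array.
   Returns NULL on any failure; the caller frees the result. */
double *load_files(const char *prefix, unsigned long nfiles, unsigned long nrows)
{
    assert(prefix != NULL);

    double *data = malloc(nfiles * nrows * sizeof *data);  /* no casts needed */
    if (data == NULL)
        return NULL;

    for (unsigned long f = 1; f <= nfiles; ++f) {
        char filename[256];
        /* snprintf() never writes past the buffer; a too-long name is an error */
        int n = snprintf(filename, sizeof filename, "%s%lu.dat", prefix, f);
        if (n < 0 || (size_t)n >= sizeof filename) { free(data); return NULL; }

        FILE *fp = fopen(filename, "r");
        if (fp == NULL) { perror(filename); free(data); return NULL; }

        double *dst = data + nrows * (f - 1);
        for (unsigned long i = 0; i < nrows; ++i) {
            if (fscanf(fp, "%lf", &dst[i]) != 1) {
                fprintf(stderr, "%s: bad or missing value at row %lu\n",
                        filename, i);
                fclose(fp); free(data); return NULL;
            }
        }
        fclose(fp);
    }
    return data;
}
```

The original program's behaviour would then correspond to calling load_files("data", NFILES, NROWS) and passing the result to free() when done.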

Christian
Nov 14 '05 #2
"Matthew Crema" <cr***@bu.edu> wrote in message
news:d4**********@news3.bu.edu...
Say I have 1000 text files and each is a list of 32768 integers.

I have written a C program to read this data into a large matrix. I am
using fopen in combination with fscanf to read the data in. However, it
takes about 20 seconds to complete and I wonder if there is a faster way.

For example, I found that I could use 'fread' to read the data into a
string that looks like this:

91\n212\n34\n40\n25\n100\n300\n ... \0

and it is nearly instantaneous. However, is there a quick way to
convert this string into an array of doubles?


It is likely that most of the difference (between calling fscanf()
repeatedly and reading the whole file with fread()) comes from the
conversion. In other words, if you read everything in and then convert it,
it will probably take about the same total time as converting as you read.

Note that fread() does not terminate what it reads to make a string (i.e., it
does not append '\0' as you showed above).

You say the files contain integers, but the code (that I've snipped)
converted (in a loose sense) the files to an array of double. Is that really
what you meant? (I can see why that might be just what you want.) If so, it
may be faster to do something like the following:

fscanf(fp, "%d", &temp);
array[pos] = temp;

The bottom line is that the C language itself provides no guarantees about
the speed (or relative speed) of code sequences. If speed is an issue, try
several approaches that seem reasonable and measure to see which is fastest.
But bear in mind that your results are valid only on the system you tested.
Changes to (for example) the compiler, compiler options, the standard
library, the operating system, or the hardware could give different results
and possibly a different conclusion.
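
One rough way to do such a measurement with only the standard library is sketched below. The names time_cpu_seconds and bump are hypothetical, and bump is just a stand-in workload; the reading code under test would take its place. Note that clock() reports processor time rather than elapsed wall-clock time, so time spent waiting on the disk may not show up in the result.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: runs fn(arg) `reps` times and returns the CPU time
   consumed, in seconds, as reported by clock(). */
double time_cpu_seconds(void (*fn)(void *), void *arg, int reps)
{
    assert(fn != NULL && reps > 0);

    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        fn(arg);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Trivial stand-in workload so the sketch is complete; replace it with
   the reading code being compared. */
void bump(void *arg)
{
    unsigned long *counter = arg;
    ++*counter;
}
```

Running each candidate several times and comparing the totals helps smooth out effects such as the operating system caching the files after the first pass.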

Alex
Nov 14 '05 #3
On Tue, 26 Apr 2005 19:06:39 -0400, Matthew Crema wrote:
Hello,

Say I have 1000 text files and each is a list of 32768 integers.

I have written a C program to read this data into a large matrix. I am
using fopen in combination with fscanf to read the data in. However, it
takes about 20 seconds to complete and I wonder if there is a faster way.

For example, I found that I could use 'fread' to read the data into a
string that looks like this:

91\n212\n34\n40\n25\n100\n300\n ... \0
Note that fread() doesn't read strings, i.e. it doesn't write a
terminating null character. It is also likely to split a line between the
end of one read and the beginning of the next.
and it is nearly instantaneous. However, is there a quick way to
convert this string into an array of doubles?


You'll have to sort out the end of the buffer issues yourself but the
"simple" function to convert a string representation to a double is
strtod(). Well there is also atof() but that isn't very good at error
checking. These are likely to be the fastest ways of converting character
data to a double in the standard library.
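
As a sketch of how that could fit together, assuming a whole file fits comfortably in memory: the hypothetical helper read_all below pulls the entire stream in with fread() and appends the terminating '\0' itself, which sidesteps the split-line problem, and the hypothetical parse_doubles then walks the buffer with strtod(). The names, the initial 64 KB buffer size, and the doubling strategy are illustrative choices only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the rest of fp into a malloc'd buffer and
   appends the '\0' that fread() does not supply. Returns NULL on failure. */
char *read_all(FILE *fp)
{
    assert(fp != NULL);

    size_t cap = 1 << 16, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* always keep one byte spare for the terminator */
        len += fread(buf + len, 1, cap - len - 1, fp);
        if (len < cap - 1)
            break;                      /* short read: end of file or error */
        char *bigger = realloc(buf, cap * 2);
        if (bigger == NULL) { free(buf); return NULL; }
        buf = bigger;
        cap *= 2;
    }
    if (ferror(fp)) { free(buf); return NULL; }
    buf[len] = '\0';
    return buf;
}

/* Hypothetical helper: converts whitespace-separated numbers in text into
   out[0..max-1] with strtod(). Returns how many values were stored. */
size_t parse_doubles(const char *text, double *out, size_t max)
{
    assert(text != NULL && out != NULL);

    size_t n = 0;
    while (n < max) {
        char *end;
        double v = strtod(text, &end);
        if (end == text)
            break;                      /* nothing converted: end of data or junk */
        out[n++] = v;
        text = end;                     /* resume right after the last number */
    }
    return n;
}
```

strtod() skips leading whitespace, so the newlines between values need no special handling. Comparing the returned count with the expected number of rows is a cheap way to detect a truncated or malformed file.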

Try reading your file in a line at a time using fgets(). You may find that
this isn't much slower than using fread() and it makes the rest of your
task easier.
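
A line-at-a-time version might look like the sketch below. The helper name read_lines and the 128-byte line buffer are assumptions; a line longer than the buffer would be split across two fgets() calls, so this only suits files with one short number per line.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one number per line from fp into out[0..max-1].
   Stops at end of file, at the first line that does not start with a
   number (including a blank line), or when out is full.
   Returns the count stored. */
size_t read_lines(FILE *fp, double *out, size_t max)
{
    assert(fp != NULL && out != NULL);

    char line[128];
    size_t n = 0;
    while (n < max && fgets(line, sizeof line, fp) != NULL) {
        char *end;
        double v = strtod(line, &end);
        if (end == line)
            break;                      /* no number on this line */
        out[n++] = v;
    }
    return n;
}
```

Because fgets() always terminates what it stores, there is no end-of-buffer bookkeeping to do here, which is the main attraction over fread().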

It was suggested that if the numbers in your file data are always integers
then you might convert to an integer and then to a double. That's worth
trying too. There is a strtol() function to do that. You could even try an
inline conversion loop in that case, assuming that performance is really
that much of an issue.
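
For illustration, an inline conversion loop could be as simple as the sketch below. It only applies if every value really is a non-negative decimal integer: the hypothetical parse_uints does no sign, overflow, or error handling, treats every non-digit character as a separator, and expects a null-terminated buffer. Whether it actually beats strtol() or strtod() is something to measure rather than assume.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: hand-rolled conversion of non-negative decimal
   integers in a null-terminated buffer into out[0..max-1].
   Anything that is not a digit is treated as a separator.
   Returns how many values were stored. */
size_t parse_uints(const char *s, double *out, size_t max)
{
    assert(s != NULL && out != NULL);

    size_t n = 0;
    while (*s != '\0' && n < max) {
        if (*s < '0' || *s > '9') {     /* skip separators such as '\n' */
            ++s;
            continue;
        }
        unsigned long v = 0;
        while (*s >= '0' && *s <= '9')  /* accumulate one run of digits */
            v = v * 10 + (unsigned long)(*s++ - '0');
        out[n++] = (double)v;
    }
    return n;
}
```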

Lawrence
Nov 14 '05 #4
Thanks to all for your responses.

I think I agree with Alex's post that the time-consuming part of this
whole thing is the conversion. So 'fread'ing the data in and then
converting the whole thing would likely take a similar amount of time as
'fscanf'ing the data into a double array. Eventually I'll play with
strtod and others, but I'm going to leave my code as it is for now.

Several of you pointed out (and I have verified) that fread does not
append the '\0' as I assumed.

Also, sorry for the confusion about the ints vs. floats. My data is
generally double-precision floats.

As an aside, using fgets (to read each line) instead of fread (to read the
entire file) seems to take much longer given my large data sets. For
smaller data sets there is not much difference.

Thanks for the other tips on bugfixes as well. I will implement them
immediately.

-Matt
Nov 14 '05 #5
