Reading whole text files

Michael Mair

Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.
If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)
Regards,
Michael
--
E-Mail: Mine is an /at/ gmx /dot/ de address.

Nov 14 '05 #1
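
For readers who want to see the "Interesting" variant spelled out, here is a minimal sketch. The helper name read_chunk_fscanf, the STR helper macro and the BUFLEN value of 4096 are assumptions made for illustration; they are not from the post. Note the pitfall the question hints at: %c with a field width matches exactly that many characters, so what happens on a final chunk shorter than BUFLEN (whether the conversion counts as a success and whether %n is reached) should be checked on the target library rather than assumed.

```c
#include <stdio.h>

#define BUFLEN 4096             /* assumed chunk size, for illustration */
#define STR(x) #x
#define XSTR(x) STR(x)          /* XSTR(BUFLEN) expands to "4096" */

/* Hypothetical helper: try to read one BUFLEN-sized chunk into curr,
   which must have room for at least BUFLEN chars. Nothing is
   NUL-terminated. Returns the character count reported by %n, or 0
   if the %c conversion did not succeed (end of file, or failure). */
int read_chunk_fscanf(FILE *fp, char *curr)
{
    int nread = 0;
    int rc = fscanf(fp, "%" XSTR(BUFLEN) "c%n", curr, &nread);

    /* Only trust nread when the %c conversion was counted. */
    return (rc == 1) ? nread : 0;
}
```

A caller would loop on this until it returns 0 and then check feof() and ferror() to find out why the loop stopped.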


infobahn

Michael Mair wrote:

Cheerio,

I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
And you have to maintain /two/ buffers (quite apart from the buffer
maintained by your text stream handler) - your expanding buffer,
and the buffer you give to fgets (unless you use the expanding
buffer for that too, which is certainly doable but probably gives
you more headaches).
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one,

Fine, so use that. But it wouldn't be my choice.

Vive la difference!

Nov 14 '05 #2

Michael Mair

infobahn wrote:

Michael Mair wrote:
Cheerio,

I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.

"Probably" inefficient in that I cannot rely on getc() being
implemented as a macro and that I do not want to make assumptions
about the underlying library. So, essentially, the question is
for me whether having a loop in my code is "better" than just
telling fscanf() to get, say 8K characters in one go.
The main beauty of this approach lies for me in the clarity of the
code. Thanks for reminding me of getc() vs. fgetc().

- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.

And you have to maintain /two/ buffers (quite apart from the buffer
maintained by your text stream handler) - your expanding buffer,
and the buffer you give to fgets (unless you use the expanding
buffer for that too, which is certainly doable but probably gives
you more headaches).

Actually, I have implemented it first with fgets() and one extending
buffer but found, looking at the final code, that approach too unwieldy
and error prone, as you need more code and variables.
Usually, I would have gone for the "Low" approach due to the clarity
of the resulting code but -- as I was at it -- I just asked myself
which options do I have.

- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one,

Fine, so use that. But it wouldn't be my choice.

I _was_ asking for opinions.

Vive la difference!

:-)
Thank you for your input!
Cheers
Michael
--
E-Mail: Mine is a gmx dot de address.

Nov 14 '05 #3

jacob navia

Michael Mair wrote:

Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.
If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)
Regards,
Michael

What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
FILE *infile;
char *contents;
int actualBytesRead=0;
unsigned int len;

infile = fopen(fname,"rb");
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

int main(int argc,char *argv[])
{
if (argc < 2) {
printf("usage: readfile <filename>\n");
exit(1);
}
int len=0;
char *contents=ReadFileIntoRam(argv[1],&len);
// work with the contents of the file
}

Nov 14 '05 #4

Michael Mair

jacob navia wrote:

Michael Mair wrote:
Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.
If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)
Regards,
Michael

What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
FILE *infile;
char *contents;
int actualBytesRead=0;
unsigned int len;

infile = fopen(fname,"rb");

Here is the crux: I want/have to work with a _text_ file.
Everything else may give me wrong results.
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);
This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

int main(int argc,char *argv[])
{
if (argc < 2) {
printf("usage: readfile <filename>\n");
exit(1);
}
int len=0;
char *contents=ReadFileIntoRam(argv[1],&len);
// work with the contents of the file
}

Thank you for trying :-)
Cheers
Michael
--
E-Mail: Mine is a gmx dot de address.

Nov 14 '05 #5

S.Tobias

infobahn <in******@btinternet.com> wrote:

Michael Mair wrote:

Cheerio,

I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.

In thread-safe libraries getc() family functions can actually
be quite inefficient, because they must lock the stream object,
which takes time. This is the reason why some systems provide
getc_unlocked() (thread-unsafe) family (I remember a noticeable
difference between them in my tests some time ago).

+++

Excuse my ignorance, I have no experience with text files in
the C Std context. Why wouldn't fread() be suitable for
reading text files? In 7.19.8p2 it says the fread() call is
performed as if by calls to the fgetc() function underneath.
I haven't spotted any mention where these functions would be
constrained to binary streams only.

--
Stan Tobias
mailx `echo si***@FamOuS.BedBuG.pAlS.INVALID | sed s/[[:upper:]]//g`

Nov 14 '05 #6

Michael Mair

S.Tobias wrote:

infobahn <in******@btinternet.com> wrote:
Michael Mair wrote:
Cheerio,

I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.

Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.

In thread-safe libraries getc() family functions can actually
be quite inefficient, because they must lock the stream object,
which takes time. This is the reason why some systems provide
getc_unlocked() (thread-unsafe) family (I remember a noticeable
difference between them in my tests some time ago).

Interesting.
+++

Excuse my ignorance, I have no experience with text files in
the C Std context. Why wouldn't fread() be suitable for
reading text files? In 7.19.8p2 it says the fread() call is
performed as if by calls to the fgetc() function underneath.
I haven't spotted any mention where these functions would be
constrained to binary streams only.

It seems I am plain stupid... Somewhere in my brain, there was
"fread()/fwrite() <-> binary I/O" hardwired :-/
So, if I open the stream as text stream, everything should be
fine. (If this is wrong, please correct me.)
Moreover, if I read the data into dynamically allocated
storage pointed to by an unsigned char *, I circumvent potential
problems with the is** functions from <ctype.h> (as I asked in
another thread).

Thank you :-)
Cheers
Michael
--
E-Mail: Mine is a gmx dot de address.

Nov 14 '05 #7
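
The conclusion reached here, fread() on a stream opened in text mode, can be sketched as follows. This is only an illustration under stated assumptions: the name slurp_stream and the initial capacity of 4096 are made up, and the caller is assumed to have opened the stream with fopen(name, "r"). The buffer grows as data arrives, so no file size is needed up front.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything remaining on an open stream
   into a NUL-terminated, malloc'ed buffer. Returns NULL on read or
   allocation failure; the caller frees the result. */
char *slurp_stream(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Always keep one byte free for the terminating '\0'. */
        size_t got = fread(buf + len, 1, cap - len - 1, fp);
        len += got;
        if (len < cap - 1)
            break;                      /* short read: EOF or error */

        if (cap > (size_t)-1 / 2) {     /* doubling would overflow */
            free(buf);
            return NULL;
        }
        char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);                  /* old block is still valid */
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

If the data is later passed to the is*() functions from <ctype.h>, each char should be converted to unsigned char first, or the buffer can be declared as unsigned char * as suggested above.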

Michael Mair

Michael Mair wrote:

jacob navia wrote:
Michael Mair wrote:
Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.
If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)
Regards,
Michael
What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
FILE *infile;
char *contents;
int actualBytesRead=0;
unsigned int len;

infile = fopen(fname,"rb");

Here is the crux: I want/have to work with a _text_ file.
Everything else may give me wrong results.

Sorry, the "b" brought me back onto the wrong track I already
was on. See the other subthread.
Cheers
Michael
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);

This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

int main(int argc,char *argv[])
{
if (argc < 2) {
printf("usage: readfile <filename>\n");
exit(1);
}
int len=0;
char *contents=ReadFileIntoRam(argv[1],&len);
// work with the contents of the file
}

Thank you for trying :-)
Cheers
Michael

--
E-Mail: Mine is a gmx dot de address.

Nov 14 '05 #8

SM Ryan

Michael Mair <Mi**********@invalid.invalid> wrote:
# Cheerio,
#
#
# I would appreciate opinions on the following:
#
# Given the task to read a _complete_ text file into a string:
# What is the "best" way to do it?
# Handling the buffer is not the problem -- the character
# input is a different matter, at least if I want to remain within
# the bounds of the standard library.
#
# Essentially, I can think of three variants:
# - Low: Use fgetc(). Simple, straightforward, probably inefficient.

char *contents=0; int m=0,n=0,ch;
while ((ch=fgetc(file))!=EOF) {
if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
contents[n++] = ch; contents[n] = 0;
}
contents = realloc(contents,n+1);

You might also include #ifdef/#endif code to use memory mapping on systems
that support it.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
This is one wacky game show.

Nov 14 '05 #9

Al Bowers

Michael Mair wrote:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.

Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.

"Probably" inefficient in that I cannot rely on getc() being
implemented as a macro and that I do not want to make assumptions
about the underlying library. So, essentially, the question is
for me whether having a loop in my code is "better" than just
telling fscanf() to get, say 8K characters in one go.
The main beauty of this approach lies for me in the clarity of the
code. Thanks for reminding me of getc() vs. fgetc().
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.

My intuition is that the definition of a "_complete_" text file
would require the "ugly". Hence, I would use function fgets in
a loop.

And you have to maintain /two/ buffers (quite apart from the buffer
maintained by your text stream handler) - your expanding buffer,
and the buffer you give to fgets (unless you use the expanding
buffer for that too, which is certainly doable but probably gives
you more headaches).

Actually, I have implemented it first with fgets() and one extending
buffer but found, looking at the final code, that approach too unwieldy
and error prone, as you need more code and variables.

Use fgets to copy into a buffer, and then append to an
expanding dynamically allocated char array. This is not unwieldy.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
char buffer[128],*fstr, *tmp;
size_t slen, blen;
FILE *fp;

if((fp = fopen("test.c","r")) == NULL) exit(EXIT_FAILURE);
for(slen = 0, fstr = NULL;
(fgets(buffer,sizeof buffer, fp)) ; slen+=blen)
{
blen = strlen(buffer);
if((tmp = realloc(fstr,slen+blen+1)) == NULL)
{
free(fstr);
exit(EXIT_FAILURE);
}
if(slen == 0) *tmp = '\0';
fstr = tmp;
strcat(fstr,buffer);
}
fclose(fp);
if(fstr != NULL) puts(fstr); /* fstr stays NULL if the file was empty */
free(fstr);
return 0;
}
--
Al Bowers
Tampa, Fl USA
mailto: xa******@myrapidsys.com (remove the x to send email)
http://www.geocities.com/abowers822/

Nov 14 '05 #10

infobahn

Al Bowers wrote:

<snip>
Use fgets to copy into a buffer, and then append to an
expanding dynamically allocated char array. This is not unwieldy.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
char buffer[128],*fstr, *tmp;
size_t slen, blen;
FILE *fp;

if((fp = fopen("test.c","r")) == NULL) exit(EXIT_FAILURE);
for(slen = 0, fstr = NULL;
(fgets(buffer,sizeof buffer, fp)) ; slen+=blen)
{
blen = strlen(buffer);
Consider a file 12,800,000 or so bytes in length. This means you'll
call strlen 10,000 times, and just about every call will have to
trawl through 128 (or so) bytes. That is, modulo the last read,
you'll have to touch every character /three/ times - once while
reading, once while strlenning, and once while copying. For large
files, this is a serious overhead.
if((tmp = realloc(fstr,slen+blen+1)) == NULL)
You don't have to go to the well quite this often. You can keep
a max, and only realloc when the max is about to be exceeded.
Whenever you do this, multiply the not-enough-storage value by
some constant (some people double, others use 1.1 or 1.5 or
whatever) to decide how much to allocate next time.

Consider adding a way to stop the reading of a file larger than
the largest the user is prepared to allocate RAM for.
{
free(fstr);
exit(EXIT_FAILURE);
}
if(slen == 0) *tmp = '\0';
fstr = tmp;
strcat(fstr,buffer);

It's getting worse. strcat has to find the end of the string, which
is O(n). Put it into a loop, and you get O(n*n). This will seriously
impact on performance for large files. It's not hard to keep a
pointer to the next place to write.

Nov 14 '05 #11
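
The three suggestions in this reply (grow the allocation geometrically, keep track of where the next write goes instead of calling strcat, and refuse files above a limit) could be combined roughly as below. This is a sketch, not the code anyone in the thread posted: the function name read_lines_into_string and the max_bytes parameter are assumptions. It still calls strlen on each chunk, so it addresses the realloc and strcat costs only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append fgets() chunks to a growing string.
   max_bytes is an assumed caller-supplied cap on the total length.
   Returns NULL on read error, allocation failure, or cap exceeded. */
char *read_lines_into_string(FILE *fp, size_t max_bytes, size_t *out_len)
{
    char chunk[128];
    size_t cap = 256, len = 0;      /* len doubles as the write offset */
    char *str = malloc(cap);

    if (str == NULL)
        return NULL;
    str[0] = '\0';

    while (fgets(chunk, sizeof chunk, fp) != NULL) {
        size_t n = strlen(chunk);

        if (n > max_bytes - len) {  /* would exceed the caller's cap */
            free(str);
            return NULL;
        }
        if (len + n + 1 > cap) {    /* grow by doubling, not per chunk */
            size_t newcap = cap;
            char *tmp;

            while (newcap < len + n + 1)
                newcap *= 2;
            tmp = realloc(str, newcap);
            if (tmp == NULL) {
                free(str);
                return NULL;
            }
            str = tmp;
            cap = newcap;
        }
        memcpy(str + len, chunk, n + 1);    /* copies the '\0' too */
        len += n;
    }

    if (ferror(fp)) {
        free(str);
        return NULL;
    }
    if (out_len != NULL)
        *out_len = len;
    return str;
}
```

Because the write offset is kept in len, each chunk is copied once and the total work stays linear in the file size.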

Eric Sosman

Michael Mair wrote:

infobahn wrote:
Michael Mair wrote:
Cheerio,

I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.

Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.

"Probably" inefficient in that I cannot rely on getc() being
implemented as a macro and that I do not want to make assumptions
about the underlying library. So, essentially, the question is
for me whether having a loop in my code is "better" than just
telling fscanf() to get, say 8K characters in one go.
The main beauty of this approach lies for me in the clarity of the
code. Thanks for reminding me of getc() vs. fgetc().

Considerations of the relative efficiency of library
functions already involve matters you cannot "rely" on; the
Standard has nothing to say about it, and you're forced to
empirical methods.

I can, perhaps, offer a data point. My fgets() replacement
(everybody writes one eventually, it seems) originally used
fgets() itself, on the grounds that it might be implemented
more efficiently "under the covers" than repeated getc(). After
each fgets() I'd check whether the line was too long (no '\n'
in the buffer), and if so I'd expand the buffer and do another
fgets(). All well and good.

Just for curiosity's sake, though, I wrote a second version
that made repeated getc() calls -- and guess what? It was a
little bit faster. Whatever speed advantage fgets() might have
had was lost in the need to search for the end of the line
afterwards. strlen(buff) was a hair faster than strchr(buff,'\n'),
but either way the combined fgets()/strxxx() was slower than a
loop calling getc() and testing each character on the fly.

The "getc() is faster" result was reproducible on four
configurations: SPARC with Sun Studio compiler and Solaris' C
library, SPARC with gcc and Solaris' C library, and on two
different Pentium models with gcc and the DJgpp library.

YMMV, and the problem you're trying to solve is slightly
different from the one I attacked. Still, it's suggestive.

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 14 '05 #12

Randy Howard

In article <11*************@corp.supernews.com>,
wyrmwif@tango-sierra-oscar-foxtrot-tango.fake.org says...

Michael Mair <Mi**********@invalid.invalid> wrote:
# Cheerio,
#
#
# I would appreciate opinions on the following:
#
# Given the task to read a _complete_ text file into a string:
# What is the "best" way to do it?
# Handling the buffer is not the problem -- the character
# input is a different matter, at least if I want to remain within
# the bounds of the standard library.
#
# Essentially, I can think of three variants:
# - Low: Use fgetc(). Simple, straightforward, probably inefficient.

char *contents=0; int m=0,n=0,ch;
while ((ch=fgetc(file))!=EOF) {
if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
contents[n++] = ch; contents[n] = 0;
}
contents = realloc(contents,n+1);

What happens to contents if this realloc() fails?

--
Randy Howard (2reply remove FOOBAR)
"Making it hard to do stupid things often makes it hard
to do smart ones too." -- Andrew Koenig

Nov 14 '05 #13
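
To make the point of the question concrete: if realloc() fails it returns NULL and leaves the old block allocated, so assigning its result straight back to contents loses the only pointer to the data. A common way around that is a temporary pointer, sketched below. The function name read_all_getc and the starting capacity of 64 are assumptions for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: getc() loop with a doubling buffer. The
   realloc() result goes into a temporary, so the old block can
   still be freed when growing fails. */
char *read_all_getc(FILE *fp, size_t *out_len)
{
    char *contents = NULL;
    size_t cap = 0, len = 0;
    int ch;

    while ((ch = getc(fp)) != EOF) {
        if (len + 1 >= cap) {       /* need room for ch and '\0' */
            size_t newcap = cap ? cap * 2 : 64;
            char *tmp = realloc(contents, newcap);

            if (tmp == NULL) {
                free(contents);     /* still valid after the failure */
                return NULL;
            }
            contents = tmp;
            cap = newcap;
        }
        contents[len++] = (char)ch;
    }

    if (ferror(fp)) {
        free(contents);
        return NULL;
    }
    if (contents == NULL) {         /* empty input: return "" */
        contents = malloc(1);
        if (contents == NULL)
            return NULL;
    }
    contents[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return contents;
}
```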

CBFalconer

jacob navia wrote:

Michael Mair wrote:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.

If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)

What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
FILE *infile;
char *contents;
int actualBytesRead=0;
unsigned int len;

infile = fopen(fname,"rb");
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

No good. Note that ftell is meaningless for text files. It also
returns a long, not an int. You haven't even tested for failure
(and it will fail on input from a keyboard). Even if everything works,
use of calloc is silly: why zero what you are about to fill?
Instead just add a single '\0' after filling. From N869:

7.19.9.4 The ftell function

Synopsis

[#1]

#include <stdio.h>
long int ftell(FILE *stream);

Description

[#2] The ftell function obtains the current value of the
file position indicator for the stream pointed to by stream.
For a binary stream, the value is the number of characters
from the beginning of the file. For a text stream, its file
position indicator contains unspecified information, usable
by the fseek function for returning the file position
indicator for the stream to its position at the time of the
ftell call; the difference between two such return values is
not necessarily a meaningful measure of the number of
characters written or read.

Returns

[#3] If successful, the ftell function returns the current
value of the file position indicator for the stream. On
failure, the ftell function returns -1L and stores an
implementation-defined positive value in errno.

One way to get a whole file into memory in a useful form is to
buffer it in lines and make a linked list of those lines. An
example in my ggets package does just that. See:

<http://cbfalconer.home.att.net/download/ggets.zip>

Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #14
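
The objections about the return type and the missing failure checks can be illustrated with a small sketch. The function name binary_stream_size is an assumption, and the stream is assumed to have been opened in binary mode. The C standard also says a binary stream need not meaningfully support fseek() with SEEK_END, so the result is best treated as a hint for a first allocation, with the actual read still running until end of file.

```c
#include <stdio.h>

/* Hypothetical helper: size in bytes of a stream opened in binary
   mode, or -1L if any positioning call fails (for example on a
   terminal or a pipe). The original position is restored. */
long binary_stream_size(FILE *fp)
{
    long start, size;

    start = ftell(fp);              /* ftell returns long, not int */
    if (start == -1L)
        return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;
    size = ftell(fp);               /* may itself be -1L on failure */
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1L;
    return size;
}
```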

jacob navia

For text files it is the same as
above, but add:
char *p1 = contents, *p2 = contents;
int i = 0;
while (i < actualBytesRead) {
if (*p1 != '\r') {
*p2++ = *p1;
}
p1++;
i++;
}
*p2++ = 0;

This is a thousand times more efficient than all those
calls to realloc, or all those calls to fread.

True, you will waste some bytes because you will read
many \r that you later erase, allocating a slightly
bigger buffer than needed but this is not very
important in most applications...
Note: You could make this more robust if you want to keep
isolated \r (i.e. \r not followed by \n) in which case
you can add the corresponding tests...

Nov 14 '05 #15
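
The variant mentioned in the note, which drops a \r only when a \n follows it, might look like the sketch below. The function name crlf_to_lf is an assumption. It also assumes, as in the earlier code, that the buffer has one spare byte beyond len for the terminator, and that CR LF is the only line-ending convention that needs translating.

```c
#include <stddef.h>

/* Hypothetical helper: compact buf in place, turning each CR LF pair
   into a single LF and keeping isolated CRs. buf must have room for
   len + 1 bytes. Returns the new length, excluding the '\0'. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t r, w = 0;

    for (r = 0; r < len; r++) {
        if (buf[r] == '\r' && r + 1 < len && buf[r + 1] == '\n')
            continue;               /* skip the CR of a CR LF pair */
        buf[w++] = buf[r];
    }
    buf[w] = '\0';
    return w;
}
```

For example, the 8 bytes "a\r\nb\rc\r\n" become the 6 bytes "a\nb\rc\n": both CR LF pairs collapse and the lone CR after b is kept.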

cpg

Michael Mair wrote:

This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().

I'm just curious, why would you do anything different for text/binary
data files? The approach is the same, what you can do afterwards on
the resultant buffer is the only thing that differs. Since a "text"
file is a special case of a "raw binary" file, you only have to code the
common functionality once (buffering in this case).

Would you not simply perform raw reads into a temp buffer accumulating
your overall file buffer until the entire file is read? Apply a filter
afterwards for some sort of sanity checking that this file meets your
requirements (ctype.h), then continue on.

Obviously, if the file checks out as "text", then things like lines
make sense. I would create "text" functions to operate on these buffers
to fit your needs. Later on you may find a need to write some binary
equivalents to do other tasks (a raw strstr() equivalent becomes
particularly useful for searching binary data), and the buffering part
is already done.

Also, it's probably more useful to define a structure that abstracts
these "buffers". That way, you can add functionality without breaking
the interface.

Have fun, cpg

Nov 14 '05 #16

jacob navia

CBFalconer wrote:

jacob wrote:
What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
FILE *infile;
char *contents;
int actualBytesRead=0;
unsigned int len;

infile = fopen(fname,"rb");
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

No good.

Please Chuck, it was a program written in a few minutes!

Note that ftell is meaningless for text files.

That's why I opened in binary mode.

It also returns a long, not an int.

OK

You haven't even tested for failure (which it will on input from a keyboard).

The function receives a file name, Chuck. There is NO
keyboard input...

Even if everything works use of calloc is silly, why zero what you are about to fill.

No. This dispenses with the zeroing of the last byte,
maybe inefficient but it is a habit...

Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.

If you open it in binary mode yes, you can...

Nov 14 '05 #17

CBFalconer

infobahn wrote:

Al Bowers wrote:
<snip>

Use fgets to copy into a buffer, and then append to an
expanding dynamically allocated char array. This is not unwieldy.

.... snip ...
You don't have to go to the well quite this often. You can keep
a max, and only realloc when the max is about to be exceeded.
Whenever you do this, multiply the not-enough-storage value by
some constant (some people double, others use 1.1 or 1.5 or
whatever) to decide how much to allocate next time.

A certain Richard Heathfield has made available a routine for this
approach, found in fgetline at:

<http://users.powernet.co.uk/eton/c/fgetdata.html>

while I prefer using my own ggets/fggets, which doesn't keep a
history (thus having a much simpler calling sequence), and which
can be found at:

<http://cbfalconer.home.att.net/download/ggets.zip>

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #18

Michael Mair

cpg wrote:

Michael Mair wrote:

This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().

I'm just curious, why would you do anything different for text/binary
data files? The approach is the same, what you can do afterwards on
the resultant buffer is the only thing that differs. Since a "text"
file is a special case of a "raw binary" file, you only have to code the
common functionality once (buffering in this case).

Would you not simply perform raw reads into a temp buffer accumulating
your overall file buffer until the entire file is read? Apply a filter
afterwards for some sort of sanity checking that this file meets your
requirements (ctype.h), then continue on.

The thing is that I do not want to make _any_ assumptions like that
there is a one-to-one correspondence for certain byte ranges -- the
standard does not guarantee that and even mentions that "Characters
may have to be added, altered, or deleted ..."
Moreover, if I want to move on to wide characters/multibyte characters,
then I certainly will stick to the narrow path and not try to find
convenient shortcuts.
So, I will treat reading in a text file in a different manner than
reading in a binary file if necessary. It is quite possible that
fread() on a text stream will do what I want; then I will use it.
I have no interest in sanity checks which work with the C locale
but not every other locale as well.
If there was a standard way to read in a binary file and then convert
the resulting buffer into the "text" equivalent, then I would use this
approach.

Obviously, if the file checks out as "text", then things like lines
make sense. I would create "text" functions to operate on these buffers
to fit your needs. Later on you may find a need to write some binary
equivalents to do other tasks (a raw strstr() equivalent becomes
particularly useful for searching binary data), and the buffering part
is already done.

That is true in general but here I have a special requirement where
I am certain that I will deal only with text files and the only possible
extension is going for multibyte/wide characters. However, this will not
be any problem as I essentially will only have to create wide char
versions of my functions and get a "w" or "wc" into the called library
functions.
The only thing left is a "good" way to get a complete text file into
a buffer. The organisation in lines does not play any role at all, so
the question is using a getc loop vs. using something to obtain large
chunks of characters from text files.

Also, it's probably more useful to define a structure that abstracts
these "buffers". That way, you can add functionality without breaking
the interface.

True, but in this case only overhead.

Have fun, cpg

Thanks :-)
-Michael
--
E-Mail: Mine is a gmx dot de address.

Nov 14 '05 #19

Eric Sosman

cpg wrote:

Michael Mair wrote:

This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().

I'm just curious, why would you do anything different for text/binary
data files? The approach is the same, what you can do afterwards on
the resultant buffer is the only thing that differs. Since a "text"
file is a special case of a "raw binary" file, you only have to code the
common functionality once (buffering in this case).

Would you not simply perform raw reads into a temp buffer accumulating
your overall file buffer until the entire file is read? Apply a filter
afterwards for some sort of sanity checking that this file meets your
requirements (ctype.h), then continue on.

There's the rub: What should the "filter" do? On
one system I've used, for example, if you were to write
the line "Hello, world!\n" to a text file and then read
it back in binary, here are the bytes you would get:

\015 \000 H e l l o , w o r l d ! \000

Notice that the '\n' you wrote has vanished and that three
new bytes have appeared out of thin air. The system in
question knows how to translate this sequence of bytes back
to "Hello, world!\n" -- but do *you* know how?

By the way, the above illustrates the system's "usual"
way of storing text in a file. The system actually provides
six additional text formats, some of which permit variations.
How many "filters" are you prepared to write, simply to avoid
using what the C library already provides?

(A hint for the curious: The company that bought the
company that bought the company that made this system recently
fired its CEO.)

--
Er*********@sun.com

Nov 14 '05 #20

Chris Torek

In article <42***********************@news.wanadoo.fr>
jacob navia <ja***@jacob.remcomp.fr> wrote:

The [file-reading] function receives a file name Chuck. There is NO
keyboard input...

What if the file name is "CON:" or "CON" or "/dev/tty" or "/tyCo/0"
or whatever external file name is used to represent "keyboard input"
on that system?

The way to read the whole text file is to read the whole text file. :-)

#include <stdio.h>
#include <stdlib.h>

int read_whole_text_file(const char *fname, char **memp, size_t *sizep) {
    FILE *fp;                   /* the open file */
    char *mem, *new;            /* memory regions (current and new) */
    size_t memsize, newsize;    /* sizes of regions (current & new) */
    size_t tot;                 /* total bytes read so far */
    size_t rdattempt, rdresult; /* argument & result for fread */

    *memp = NULL;               /* optional */
    *sizep = 0;                 /* optional */

    fp = fopen(fname, "r");
    if (fp == NULL)
        return UNABLE_TO_OPEN;
    memsize = INITIAL_BLOCK_SIZE;
    mem = malloc(memsize);
    if (mem == NULL) {
        fclose(fp);
        return UNABLE_TO_GET_MEM;
    }
    tot = 0;

    /* loop, reading what we can, until we get less than we ask for */
    for (;;) {
        rdattempt = memsize - tot;
        rdresult = fread(mem + tot, 1, rdattempt, fp);
        tot += rdresult;        /* count the short final read as well */
        if (rdresult < rdattempt)
            break;
        newsize = memsize * 2;  /* use whatever strategy you like */
        new = realloc(mem, newsize);
        if (new == NULL) {
            /*
             * Here, I choose to discard the data read so far.
             * You have other options, including returning the
             * partial result, or allocating a smaller incremental
             * amount of memory.
             */
            free(mem);
            fclose(fp);
            return UNABLE_TO_GET_MEM;
        }
        mem = new;
        memsize = newsize;
    }

    /* we reach this line only when fread() stopped due to EOF or error */
    /* if (ferror(fp)) ... -- optional, handle read-error */
    fclose(fp);

    /* optional (but required if adding '\0') */
    if (tot > 0) {  /* realloc(mem, 0) is not portable; not an issue with tot+1 */
        new = realloc(mem, tot); /* or tot+1 if you want to add a '\0' */
        if (new == NULL) {
            /* since I'm not adding the '\0', can just use existing mem */
        } else {
            mem = new;
            /* mem[tot] = '\0'; -- to add '\0' */
        }
    }

    /* set return-value parameters */
    *memp = mem;
    *sizep = tot;

    return SUCCEEDED;
}

(The code above is completely untested. Note that if you want to
add a '\0', you can subtract 1 from "rdattempt", and still skip
the final realloc() or allow it to fail, as long as INITIAL_BLOCK_SIZE
and the newsize computation allow forward progress with this
subtraction. Of course, you also have to define the initial block
size and the three return values -- one success, two error codes.)
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Nov 14 '05 #21

Flash Gordon

jacob navia wrote:

CBFalconer wrote:

jacob wrote:
What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
    FILE *infile;
    char *contents;
    int actualBytesRead=0;
    unsigned int len;

    infile = fopen(fname,"rb");
    if (infile == NULL) {
        fprintf(stderr,"impossible to open %s\n",fname);
        return NULL;
    }
    fseek(infile,0,SEEK_END);
    len = ftell(infile);
    fseek(infile,0,SEEK_SET);
    contents = calloc(len+1,1);
    if (contents) {
        actualBytesRead = fread(contents,1,len,infile);
    }
    else {
        fprintf(stderr,"Can't allocate memory to read the file\n");
    }
    fclose(infile);
    *plen = actualBytesRead;
    return contents;
}

No good.

Please Chuck, it was a program written in a few minutes!
Note that ftell is meaningless for text files.

That's why I opened in binary mode

See below

It also returns a long, not an int.

OK
You haven't even tested for failure
(which it will on input from a keyboard).

The function receives a file name Chuck. There is NO
keyboard input...

And if on a windows system that file name is COM1: ? Or on any system
that provides a file name for a user input device?

Even if everything works
use of calloc is silly, why zero what you are about to fill.

No. This dispenses with the zeroing of the last byte;
maybe inefficient, but it is a habit...

It's a terrible habit which in this case leads to incredibly inefficient
code.

Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.

If you open it in binary mode yes, you can...

Then you have to do system specific things to convert it to text since
the new line might be represented by *anything*.
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

Nov 14 '05 #22

Randy Howard

In article <42***************@yahoo.com>, cb********@yahoo.com says...

infobahn wrote:
You don't have to go to the well quite this often. You can keep
a max, and only realloc when the max is about to be exceeded.
Whenever you do this, multiply the not-enough-storage value by
some constant (some people double, others use 1.1 or 1.5 or
whatever) to decide how much to allocate next time.

A certain Richard Heathfield has made available a routine for this
approach, found in fgetline at:

<http://users.powernet.co.uk/eton/c/fgetdata.html>

What a strange coincidence.

--
Randy Howard (2reply remove FOOBAR)
"Making it hard to do stupid things often makes it hard
to do smart ones too." -- Andrew Koenig

Nov 14 '05 #23

Keith Thompson

jacob navia <ja***@jacob.remcomp.fr> writes:

For text files is the same as
above, but add:
char *p1 = contents, *p2 = contents;
int i = 0;
while (i < actualBytesRead) {
    if (*p1 != '\r') {
        *p2++ = *p1;
    }
    p1++;
    i++;
}
*p2++ = 0;

He said he's reading a text file, not necessarily a DOS-format text
file. There's no reason to assume that there's anything special about
the '\r' character.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #24

SM Ryan

# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

If you want to contract me with pay to adapt the code to your system, I'll happily
include whatever warnings and work arounds you desire that are possible.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Don't say anything. Especially you.

Nov 14 '05 #25

SM Ryan

# blen = strlen(buffer);
# if((tmp = realloc(fstr,slen+blen+1)) == NULL)

If you increase the buffer size by a constant factor>1, the time and space complexity
are linear in the size of input. If you increase by a constant increment the time
complexity is quadratic.
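
A rough way to see the difference is to count the bytes that get copied if every reallocation has to move the block (the worst case). The sketch below is only an illustration: the function names are made up for this example, no real allocator is involved, and n is assumed to be well below ULONG_MAX.

```c
#include <assert.h>

/* Bytes copied if every realloc moves the block (worst case),
   doubling the capacity each time until it can hold n bytes. */
unsigned long copied_doubling(unsigned long n)
{
    unsigned long cap = 1, copied = 0;

    while (cap < n) {
        copied += cap;      /* old contents move to the new block */
        cap *= 2;
    }
    return copied;
}

/* Same worst case, but growing by a fixed step each time. */
unsigned long copied_fixed_step(unsigned long n, unsigned long step)
{
    unsigned long cap = step, copied = 0;

    assert(step > 0);       /* a zero step would never terminate */
    while (cap < n) {
        copied += cap;
        cap += step;
    }
    return copied;
}
```

For n = 1024, doubling copies 1023 bytes in total, while growing one byte at a time copies 523776.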

If you're concerned about overshooting the file size and exhausting memory, you can
instead calculate the file size (from fseek or system specific calls like stat() or
by reading through the file once without storing and then rewinding and reading again),
and then allocate one block once.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
So basically, you just trace.

Nov 14 '05 #26

Randy Howard

In article <11*************@corp.supernews.com>, wyrmwif@tango-sierra-oscar-
foxtrot-tango.fake.org says...

# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

If you want to contract me with pay to adapt the code to your system,

With that attitude about error handling, don't hold your breath.

--
Randy Howard (2reply remove FOOBAR)
"Making it hard to do stupid things often makes it hard
to do smart ones too." -- Andrew Koenig

Nov 14 '05 #27

Mac

On Fri, 11 Feb 2005 01:36:49 +0000, SM Ryan wrote:
[Randy Howard wrote]

# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

If you want to contract me with pay to adapt the code to your system, I'll happily
include whatever warnings and work arounds you desire that are possible.

This response totally misses the point.

If you want to say that you didn't bother because you were just trying to
sketch out a quick idea, that is fine. But most people in this newsgroup
try to post decent code, and if you review past posts, you will see that
it is considered very bad form to call realloc() without checking to see
if it succeeds. Newbies are consistently admonished not to do that.

If nothing else, you are setting a bad example for the newbies.

I humbly submit that you should readjust your attitude.

--Mac

Nov 14 '05 #28

Barry Schwarz

On Thu, 10 Feb 2005 08:56:10 +0100, Michael Mair
<Mi**********@invalid.invalid> wrote:

Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

Why not consider fread?
<<Remove the del for email>>

Nov 14 '05 #29

infobahn

SM Ryan wrote:

# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

Then assert that it doesn't happen, using this macro:

#define ASSERT(cond, msg) \
    do { if (!(cond)) { fprintf(stderr, "%s\n", msg); abort(); } } while (0)

like this:

ASSERT(contents != NULL,
"I don't understand it! This CAN'T happen!"
" My home phone number is...");

(insert your home phone number in the appropriate place)

If you want to contract me with pay to adapt the code to your system, I'll happily
include whatever warnings and work arounds you desire that are possible.

I wouldn't pay anyone who used realloc like THAT, except perhaps to
clean the toilets.

Nov 14 '05 #30

Jack Klein

On Thu, 10 Feb 2005 08:56:10 +0100, Michael Mair
<Mi**********@invalid.invalid> wrote in comp.lang.c:

Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.
If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)
Regards,
Michael

If you want to read a whole text file into a single string, I'd use
fread(), which is not restricted to just binary files, you know. If
you use it on files opened in text mode, the same translations, if
any, are performed the same as if you used fscanf() or fgets().

Start with an fread() of the initial size of your allocated
destination buffer. If the return value is equal to the buffer size,
you have to grow your buffer by a fixed size or some percentage. Do
another fread() and check the result.

Continue until the return value is less than the requested number of
characters. Then you can check to see whether feof() or ferror() was
the cause.

If you really want a string, add a '\0' after the last character read.

In the calls to fread(), use 1 as the second parameter (size of each
element) and the number of bytes to read as the third (number of
elements). That way, the return value is exactly the number of bytes
read.

The only potential problem is if the text file contains '\0'
characters. The C standard does not guarantee much about such files
no matter how you try to read them (see 7.19.2 P2), so if your input
files look like that, you'll have to deal with that yourself no matter
how you read them.
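
The steps above can be sketched roughly as follows. This is a minimal illustration, not tested against any particular implementation; the name slurp_stream, the initial size of 4096 and the doubling policy are arbitrary choices made for the example. The caller is assumed to have opened the stream in text mode ("r").

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read everything remaining on fp into a malloc'd, '\0'-terminated
 * buffer.  Returns NULL on allocation failure or read error; on
 * success stores the number of characters read in *lenp.
 */
char *slurp_stream(FILE *fp, size_t *lenp)
{
    size_t cap = 4096;                  /* arbitrary initial size */
    size_t len = 0;
    char *buf, *tmp;

    assert(fp != NULL && lenp != NULL);

    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        size_t want = cap - len - 1;    /* keep room for the '\0' */
        size_t got = fread(buf + len, 1, want, fp);

        len += got;
        if (got < want)                 /* EOF or error: stop reading */
            break;

        /* buffer is full: double it, guarding against overflow */
        if (cap > (size_t)-1 / 2 || (tmp = realloc(buf, cap * 2)) == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    if (ferror(fp)) {                   /* short read was an error */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    *lenp = len;
    return buf;
}
```

The caller owns the returned buffer and should free() it when done.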

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html

Nov 14 '05 #31

CBFalconer

jacob navia wrote:

CBFalconer wrote:
.... snip ...
Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.

If you open it in binary mode yes, you can...

Even if you forbid keyboard input, what if the file is on tape, or
coming from a serial line, etc. There is no requirement for ftell
to work. That's why it returns an error signal and may store
something in errno.
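
If someone tries the size-probing approach anyway, every call needs checking. A minimal sketch, assuming a stream opened in binary mode; the name probe_stream_size is made up for this example. Even a non-negative result is only a hint, since the standard does not require a binary stream to support SEEK_END meaningfully.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Try to learn the size of a stream opened in binary mode.
 * Returns the size, or -1L if the stream cannot report one
 * (terminal, pipe, tape, ...).  On success the original
 * position is restored.
 */
long probe_stream_size(FILE *fp)
{
    long start, end;

    assert(fp != NULL);

    start = ftell(fp);
    if (start == -1L)
        return -1L;                     /* position not available */
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;                     /* cannot seek to the end */
    end = ftell(fp);
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1L;                     /* could not get back */
    return end;                         /* may itself be -1L */
}
```

A caller that gets -1L back has to fall back to reading in chunks and growing the buffer.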

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #32

Michael Mair

Jack Klein wrote:

On Thu, 10 Feb 2005 08:56:10 +0100, Michael Mair
<Mi**********@invalid.invalid> wrote in comp.lang.c:

Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.
If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)
Regards,
Michael

If you want to read a whole text file into a single string, I'd use
fread(), which is not restricted to just binary files, you know. If
you use it on files opened in text mode, the same translations, if
any, are performed the same as if you used fscanf() or fgets().

Thank you very much, Jack!
As mentioned in my reply to S.Tobias, I somehow was under the (wrong)
impression that fread() only works for binary files. As I wanted to get
this right, I now was just waiting for someone who knows for sure to
tell me that it really works and I am not just jumping to a wrong
conclusion as the description of fread() does not indicate otherwise.

Start with an fread() of the initial size of your allocated
destination buffer. If the return value is equal to the buffer size,
you have to grow your buffer by a fixed size or some percentage. Do
another fread() and check the result.

Continue until the return value is less than the requested number of
characters. Then you can check to see whether feof() or ferror() was
the cause.

If you really want a string, add a '\0' after the last character read.

In the calls to fread(), use 1 as the second parameter (size of each
element) and the number of bytes to read as the third (number of
elements). That way, the return value is exactly the number of bytes
read.

The only potential problem is if the text file contains '\0'
characters. The C standard does not guarantee much about such files
no matter how you try to read them (see 7.19.2 P2), so if your input
files look like that, you'll have to deal with that yourself no matter
how you read them.

Once again: Thank you very much for your detailed reply!
I was aware of most of it (but for the last) but with this reply I could
have started working safely even if I had not been. I really appreciate
that :-)
Cheers
Michael
--
E-Mail: Mine is an /at/ gmx /dot/ de address.

Nov 14 '05 #33

Michael Mair

Michael Mair wrote:

Cheerio,
I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.
If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :-)

Essentially I was looking for a "text file replacement" of fread()
because I had the wrong impression that fread() was only for binary
input. Jack Klein's reply (<75********************************@4ax.com>)
but also S.Tobias's question in this direction made clear that this
was a misconception.
So the solution clearly is using fread(). If you are interested in
details, then read Jack's message -- it describes the complete usage
in a safe way.

Thank you very much to everyone for their input!
Cheers
Michael
--
E-Mail: Mine is an /at/ gmx /dot/ de address.

Nov 14 '05 #34

SM Ryan

Randy Howard <ra*********@FOOverizonBAR.net> wrote:
# In article <11*************@corp.supernews.com>, wyrmwif@tango-sierra-oscar-
# foxtrot-tango.fake.org says...
# > # What happens to contents if this realloc() fails?
# >
# > Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
# > the largest files I use, it doesn't happen.
# >
# > If you want to contract me with pay to adapt the code to your system,
#
# With that attitude about error handling, don't hold your breath.

The program forks; if it gets the signal, it does a traceback dump saved to a
database telling what line it failed at and local variables. The parent sees
the child exit, reports it, forks and continues. I've done daemons that run
months at a time, restarting and recovering when needed.

If 2GB of VM isn't enough, the process probably has enough other problems
that pretending you can recover within the process is a fool's errand. Better
to let the process die noisily, save enough information to figure out what went
wrong, and then restart.

Memory exhaustion is rarely a problem on machines with virtual memory; when
it does happen the real problem is almost always a stuck loop or recursion. Dinging
random memory is a more frequent problem for all programmers. And when it
happens for most programmers, they're stuck trying to figure out how to recreate
it under a debugger and usually can't, leaving a random error that persists for
months or years and no way to diagnose it.

Don't lecture people about error handling until you can guarantee you capture
the error state of every one of your program failures, even 'production' versions
with all discretionary error checking turned off.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Why are we here?
whrp

Nov 14 '05 #35

SM Ryan

# I humbly submit that you should readjust your attitude.

Probably because most people don't want to have to plow through all
_ and other macros I use to thread the stack for traceback dumps.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Who's leading this mob?

Nov 14 '05 #36

Eric Sosman

jacob navia wrote:

CBFalconer wrote:
Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.

If you open it in binary mode yes, you can...

No, you cannot. There is no necessary connection
between the number of characters you can read from a
file via a binary stream and the number you can read
from it via a text stream. The "binary count" can be
greater than, equal to, or less than the "text count."

Specific example: OpenVMS. One of its file formats
"decorates" each line stored in the file by attaching
counts of the number of empty lines to skip before or
after the line itself (I've always assumed this was
for the benefit of the COBOL implementation). Each
such count byte can thus become as many as 255 newline
characters when read by a C text stream, making the "text
count" larger than the "binary count."

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 14 '05 #37

websnarf

SM Ryan wrote:

# blen = strlen(buffer);
# if((tmp = realloc(fstr,slen+blen+1)) == NULL)

If you increase the buffer size by a constant factor>1, the time
and space complexity are linear in the size of input. If you
increase by a constant increment the time complexity is
quadratic.

Correct! (Glad someone else here has figured this out.) But actually
the issue is not limited to just performance -- one can easily *shred*
your heap by doing this. You can actually lose access to some of your
heap memory by leaning on it sufficiently, as such a scheme is likely
to do (I've seen this happen with a deployed System V-like heap).

If you're concerned about overshooting the file size and exhausting
memory, you can instead calculate the file size (from fseek or
system specific calls like stat() or by reading through the file
once without storing and then rewinding and reading again), and
then allocate one block once.

Yeah, except that these functions are useless for systems with 32bit
ints that allow for file lengths to be > 4GB (i.e., the vast majority
of desktop and workstation systems in existence today.) I mean they
didn't even make them use size_t's -- I mean that's incompetence to the
extreme.

The simplest strategy which works well and is portable is to use some
exponentially growing sequence of reallocs -- then you are guaranteed
to make at most O(log(INT_MAX)) calls to the heap.

---
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Nov 14 '05 #38

websnarf

SM Ryan wrote:

Michael Mair <Mi**********@invalid.invalid> wrote:
# Cheerio,
#
# I would appreciate opinions on the following:
#
# Given the task to read a _complete_ text file into a string:
# What is the "best" way to do it?
# Handling the buffer is not the problem -- the character
# input is a different matter, at least if I want to remain within
# the bounds of the standard library.
#
# Essentially, I can think of three variants:
# - Low: Use fgetc(). Simple, straightforward, probably inefficient.

char *contents=0; int m=0,n=0,ch;
while ((ch=fgetc(file))!=EOF) {
if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
contents[n++] = ch; contents[n] = 0;
}
contents = realloc(contents,n+1);

That's a little too condensed, and it's not surprising that people
jumped all over you about error handling. The idea, of course, is
perfectly correct, however. Let's make things a little clearer:

struct tagbstring {
    int mlen, slen;
    char * data;
} c = {0, 0, NULL};
int ch;

while ((ch = fgetc (file)) != EOF) {
    if (c.slen + 2 > c.mlen) {  /* need room for ch and a final '\0' */
        char * data;
        c.mlen = (c.mlen <= 0) ? 2 : 2*c.mlen;
        data = (c.data) ? realloc (c.data, c.mlen) : malloc (c.mlen);
        if (!data) {
            free (c.data);
            c.data = NULL;
            break;
        }
        c.data = data;
    }
    c.data[c.slen] = ch;
    c.slen ++;
}
if (c.data) {
    c.data[c.slen] = '\0';      /* always in bounds, see the check above */
    c.slen++;                   /* slen now counts the terminator too */
}

Now the value c.data has a pointer to a '\0' terminated string with
the desired contents, or else it's NULL (because we ran out of memory.)
We could do more with ferror(), but I'll leave that as an exercise to
the reader.

---
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Nov 14 '05 #39

CBFalconer

we******@gmail.com wrote:

SM Ryan wrote:

If you increase the buffer size by a constant factor>1, the time
and space complexity are linear in the size of input. If you
increase by a constant increment the time complexity is
quadratic.
Correct! (Glad someone else here has figured this out.) But
actually the issue is not limited to just performance -- one can
easily *shred* your heap by doing this. You can actually lose
access to some of your heap memory by leaning on it sufficiently,
as such a scheme is likely to do (I've seen this happen with a
deployed System V-like heap).

The reason being that the sum of the (possibly) freed chunks is not
enough to allocate a new large chunk. If other calls to malloc
have been interspersed the situation is likely to be even worse.
It is rooted in the fact that (1 + 2 + 4 + 8 .... + N) < 2N.

.... snip ...
The simplest strategy which works well and is portable is to use
some exponentially growing sequence of reallocs -- then you are
guaranteed to make at most O(log(INT_MAX)) calls to the heap.

Bearing in mind the above gotcha. Wonder if we can beat it by
allocating, in alternation, 1.5x and 2x the previous allocation?
Too lazy to work it out for now.
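
For what it's worth, the question can be checked under a heavily idealized model. The sketch below assumes that all previously freed blocks are contiguous and nothing else touches the heap, which no real allocator promises; the function name and its parameters are invented for this example.

```c
#include <assert.h>

/*
 * Idealized model: block 0 has size 1 and block n is f_odd or f_even
 * times block n-1 (f_odd for odd n, f_even for even n).  While block n
 * is being allocated, block n-1 is still live (realloc copies from it),
 * and blocks 0..n-2 are free and assumed contiguous.
 * Returns the first n at which the freed space could hold block n,
 * or -1 if that never happens within max_steps.
 */
int first_reuse_step(double f_odd, double f_even, int max_steps)
{
    double freed = 0.0;     /* total size of blocks 0..n-2 */
    double prev = 1.0;      /* size of block n-1 */
    int n;

    assert(max_steps > 0);

    for (n = 1; n <= max_steps; n++) {
        double next = prev * ((n % 2) ? f_odd : f_even);

        if (freed >= next)
            return n;
        freed += prev;      /* block n-1 is freed once block n exists */
        prev = next;
    }
    return -1;
}
```

In this model a constant factor of 1.5 first allows reuse at step 5 and a constant factor of 2 never does. Alternating 1.5x and 2x does not get there either, because the effective factor per step is sqrt(3), about 1.73, and the model only allows reuse for factors up to about 1.618.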

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #40

Richard Bos

Eric Sosman <es*****@acm-dot-org.invalid> wrote:

jacob navia wrote:
CBFalconer wrote:
Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.

If you open it in binary mode yes, you can...

No, you cannot. There is no necessary connection
between the number of characters you can read from a
file via a binary stream and the number you can read
from it via a text stream. The "binary count" can be
greater than, equal to, or less than the "text count."

There's that; and then there's the fact that you'll have to read the
entire file before you know how many bytes _or_ characters it contains
in any case, because:

# A binary stream need not meaningfully support fseek calls with a
# whence value of SEEK_END.

Richard

Nov 14 '05 #41

Richard Bos

we******@gmail.com wrote:

data = (c.data) ? realloc (c.data, c.mlen) : malloc (c.mlen);

Erm... you _do_ know that realloc(0, length) is defined by the Standard
to be identical to malloc(length), do you?
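
A small sketch of what that allows: a growing buffer that starts out as a null pointer and only ever calls realloc(). The helper name append_char and the initial capacity of 16 are arbitrary choices for this example.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Append one character to a growing, '\0'-terminated buffer.
 * *buf may start out as NULL with *len == *cap == 0; realloc()
 * then behaves like malloc().  Returns 0 on success, -1 on
 * failure (in which case the old buffer is left untouched).
 */
int append_char(char **buf, size_t *len, size_t *cap, int ch)
{
    assert(buf != NULL && len != NULL && cap != NULL);

    if (*len + 2 > *cap) {              /* room for ch and the '\0' */
        size_t newcap = (*cap == 0) ? 16 : *cap * 2;
        char *tmp = realloc(*buf, newcap);

        if (tmp == NULL)
            return -1;
        *buf = tmp;
        *cap = newcap;
    }
    (*buf)[(*len)++] = (char)ch;
    (*buf)[*len] = '\0';
    return 0;
}
```

Because the result of realloc() goes into a temporary first, the caller still holds a valid pointer to free() when the call fails.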

Richard

Nov 14 '05 #42

Flash Gordon

SM Ryan wrote:

Randy Howard <ra*********@FOOverizonBAR.net> wrote:
# In article <11*************@corp.supernews.com>, wyrmwif@tango-sierra-oscar-
# foxtrot-tango.fake.org says...
# > # What happens to contents if this realloc() fails?
# >
# > Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
# > the largest files I use, it doesn't happen.
# >
# > If you want to contract me with pay to adapt the code to your system,
#
# With that attitude about error handling, don't hold your breath.

The program forks; if it gets the signal, it does a traceback dump saved to a
database telling what line it failed at and local variables. The parent sees
the child exit, reports it, forks and continues. I've done daemons that run
months at a time, restarting and recovering when needed.

<snip>

Of course, if you handled the realloc failure and aborted when it
occurred your program would abort at the actual point of failure,
instead of later when you tried to use the memory. It would also be more
portable since it would not rely on the system catching a problem that
you could easily catch.

Mind you, if you had any sense you would not be using # as a quote
character either, since no one reading a C group will ever have a news
reader configured to recognise it as C uses # at the start of lines.
You've been told this before yet continue to be deliberately annoying.
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

Nov 14 '05 #43

SM Ryan

Flash Gordon <sp**@flash-gordon.me.uk> wrote:
# SM Ryan wrote:
# > Randy Howard <ra*********@FOOverizonBAR.net> wrote:
# > # In article <11*************@corp.supernews.com>, wyrmwif@tango-sierra-oscar-
# > # foxtrot-tango.fake.org says...
# > # > # What happens to contents if this realloc() fails?
# > # >
# > # > Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
# > # > the largest files I use, it doesn't happen.
# > # >
# > # > If you want to contract me with pay to adapt the code to your system,
# > #
# > # With that attitude about error handling, don't hold your breath.
# >
# > The program forks; if it gets the signal, it does a traceback dump saved to a
# > database telling what line it failed at and local variables. The parent sees
# > the child exit, reports it, forks and continues. I've done daemons that run
# > months at a time, restarting and recovering when needed.
#
# <snip>
#
# Of course, if you handled the realloc failure and aborted when it

Look at the code, bright stuff. Immediately after the realloc, it always
stores into the array.

# Mind you, if you had any sense you would not be using # as a quote

Why don't you deal with the trauma of your toilet training in a more
appropriate forum.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
If your job was as meaningless as theirs, wouldn't you go crazy too?

Nov 14 '05 #44

CBFalconer

Flash Gordon wrote:

.... snip ...
Mind you, if you had any sense you would not be using # as a quote
character either, since no one reading a C group will ever have a news
reader configured to recognise it as C uses # at the start of lines.
You've been told this before yet continue to be deliberately annoying.

If everybody just plonked him for it, as I have, the annoyance
would soon disappear.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #45

Walter Roberson

In article <81************@brenda.flash-gordon.me.uk>,
Flash Gordon <sp**@flash-gordon.me.uk> wrote:
:Mind you, if you had any sense you would not be using # as a quote
:character either, since no one reading a C group will ever have a news
:reader configured to recognise it as C uses # at the start of lines.
:You've been told this before yet continue to be deliberately annoying.

Hmmm, my newsreader recognizes it without difficulty, just as it
recognizes pretty much any non-whitespace character. How does
your newsreader cope with '>' as a quote character, considering
that '>' is a C operator that could appear on the beginning of a line?

I often use ':' as the quote character in cisco newsgroups, and I
haven't had a single complaint there yet, even though the ':' character
is the comment marker for Cisco PIX configurations [which I post a fair
number of.] I haven't had any complaints about quoting sh/ksh scripts
either, even though ':' is a sh/ksh command (similar to no-op but
allows output redirections to be set up.)

As best I can recall, the only time I've ever had a complaint about my
quoting style in -any- newsgroup was from someone who claimed that the
Usenet News RFC's -defined- the quoting character to be '>' and only
'>'. Which is, of course, not correct [it is just a recommendation in
news.announce.newusers, which does not even come close to being a
"standard".]
--
How does Usenet function without a fixed point?

Nov 14 '05 #46

CBFalconer

Walter Roberson wrote:

.... snip ...
As best I can recall, the only time I've ever had a complaint about
my quoting style in -any- newsgroup was from someone who claimed
that the Usenet News RFC's -defined- the quoting character to be
'>' and only '>'. Which is, of course, not correct [it is just a
recommendation in news.announce.newusers, which does not even come
close to being a "standard".]

The ':' is just as annoying. There is software (and readers) out
there that reformats quotes, that color codes them to tie them to
attribution lines, etc. This gets all screwed up when people
insist on being inventive. Even worse are the systems that stuff
the initials or identifier from the attribution to the left of the
'>'.

Also there is no need for a blank after the '>' _unless_ this is
the initial quoting of the line.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #47

Randy Howard

In article <42***************@yahoo.com>, cb********@yahoo.com
says...

Also there is no need for a blank after the '>' _unless_ this is
the initial quoting of the line.

Are we running out of spaces again? Someone please memcpy some
more and share them.

--
Randy Howard (2reply remove FOOBAR)
"Making it hard to do stupid things often makes it hard
to do smart ones too." -- Andrew Koenig

Nov 14 '05 #48

CBFalconer

Randy Howard wrote:

cb********@yahoo.com
Also there is no need for a blank after the '>' _unless_ this is
the initial quoting of the line.

Are we running out of spaces again? Someone please memcpy some
more and share them.

That is simply a move to prevent quotes falling off the right and
getting wrapped by poorer software.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #49

Richard Bos

CBFalconer <cb********@yahoo.com> wrote:

Randy Howard wrote:
cb********@yahoo.com
Also there is no need for a blank after the '>' _unless_ this is
the initial quoting of the line.

Are we running out of spaces again? Someone please memcpy some
more and share them.

That is simply a move to prevent quotes falling off the right and
getting wrapped by poorer software.

*Shrug* So get a real newsreader.

Richard

Nov 14 '05 #50
