Bytes | Developer Community
Programming in standard c

In my "Happy Christmas" message, I proposed a function to read
a file into a RAM buffer and return that buffer or NULL if
the file doesn't exist or some other error is found.

It is interesting to see that the answers to that message prove that
programming exclusively in standard C is completely impossible even
for a small and ridiculously simple program like the one I proposed.

1) I read the file contents in binary mode, which should allow me
to use ftell/fseek to determine the file size.

No objections to this were raised, except of course the obvious
one: if the "file" was some stream associated with stdin, for
instance /dev/tty01 or similar on some unix machine...

I did not test for this since it is impossible in standard C:
isatty() is not in the standard.

2) There is NO portable way to determine which characters should be
ignored when transforming a binary file into a text file. One
reader (CB Falconer) proposed to open the file in binary mode
and then in text mode and compare the two buffers to see which
characters were missing... Well, that would be too expensive.

3) I used errno values defined by POSIX but not by the C standard,
which defines only a few. Again, error handling is apparently not
important enough to be standardized, according to the committee.
errno is there, but its usage is not portable at all and goes
immediately beyond what standard C offers.

We hear again and again that this group is about standard C *"ONLY"*.
Could someone here, then, tell me how this simple program could be
written in standard C?

This confirms my arguments about the need to improve the quality
of the standard library!

You can't do *anything* in just standard C.
--
jacob navia
jacob at jacob point remcomp point fr
logiciels/informatique
http://www.cs.virginia.edu/~lcc-win32
Dec 26 '07
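For reference, a minimal sketch of the function jacob proposed, written against only the standard library (the function name is illustrative). It inherits every caveat raised in the replies: fseek(SEEK_END) on a binary stream need not be meaningfully supported (7.19.9.2p3), ftell returns a long, and non-seekable streams such as terminals will simply fail.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a malloc'd, NUL-terminated buffer.
   Returns NULL on any error; *sizep receives the byte count.
   Only meaningful for seekable binary streams. */
char *read_whole_file(const char *name, long *sizep)
{
    FILE *fp = fopen(name, "rb");
    char *buf = NULL;
    long size;

    if (fp == NULL)
        return NULL;
    if (fseek(fp, 0L, SEEK_END) == 0 &&
        (size = ftell(fp)) >= 0L &&
        fseek(fp, 0L, SEEK_SET) == 0 &&
        (buf = malloc((size_t)size + 1)) != NULL) {
        if (fread(buf, 1, (size_t)size, fp) == (size_t)size) {
            buf[size] = '\0';
            *sizep = size;
        } else {              /* file shrank under us, or read error */
            free(buf);
            buf = NULL;
        }
    }
    fclose(fp);
    return buf;
}
```

As the thread shows, even this much is only a best effort: nothing in standard C stops another program from changing the file between the ftell and the fread.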
On Dec 27 2007, 9:32 pm, "Bart C" <b...@freeuk.com> wrote:
"Eric Sosman" <esos...@ieee-dot-org.invalid> wrote in message
news:4f******************************@comcast.com...
Bart C wrote:
unsigned int getfilesize(FILE* handle)
{
unsigned int p,size;
p=ftell(handle);                /*p=current position*/
fseek(handle,0,2);              /*get eof position*/
size=ftell(handle);             /*size in bytes*/
fseek(handle,p,0);              /*restore file position*/
return size;
}
What is wrong with this, is it non-standard? (Apart from the likely 4Gb
limit)
Several things are wrong with it, even apart from the
possible 64KB limit.

I'll let that go.
Zeroth, you should have #include'd <stdio.h>. I'll let
you get away with this one, though, on the grounds that since

This is a code fragment. Assume headers are included for all C library
calls.
First, there's no error checking. None, nada, zero, zip.

I wasn't aware there was much to go wrong. But I will have a look. At worst
it will return the wrong file size; I'll make it return all 0's or all 1's
or something.
ugh. In-band error signalling. How do you tell the difference between
a zero-length (or very large) file and an error?

Second, ftell() returns a long. When you store the long

I'll change the types to long too, since it can't go past 2GB-1 anyway.
maybe
Third, what are the magic numbers 2 and 0 that you use
as the third arguments in the fseek() calls? My guess is
that they are the expansions of the macros SEEK_END and
SEEK_SET on some system you once used, and that you've
decided for some bizarre reason to avoid using the macros.

Close. The code existed in a non-C language, but using the C runtime,
without access to the headers and it was easiest to plug in these constants.
Converted to C (and tested) for the post.
Fourth, for a text stream the value returned by ftell() is
not necessarily a byte count; it is a value with an unspecified
encoding. Calling it a "file size" makes unwarranted assumptions.

My assumption is the file is in binary mode; and my wrapper of the fopen()
function ensures that.
this might surprise someone who used your function without
your wrapper.
Fifth, there's 7.19.9.2p3: "A binary stream need not...
Sixth, for a binary stream there may be an unspecified...

I don't get these. There are known issues when used with files currently
open for writing. And I know the host OS in *my* case. But yes, anyone
attempting to use this code on their OS should be aware of limitations.
yes, but this means it isn't portable
But other than that, it looks pretty good.

Thanks (?)
:-)
--
Nick Keighley

Jan 3 '08 #251
[snips]

On Thu, 03 Jan 2008 02:14:32 +0000, Bart C wrote:
Taking the size of a rapidly changing file like that is asking for problems.
But they need not be serious. Ask the OS to copy that file to a unique
filename.
This assumes you can. If the file is larger than available free space,
how do you plan to manage this?
It's a discrepancy: if the file should have been static, then an error
can be raised.
And how does he know it's supposed to be static? Simple example: a text
file viewer/editor. It's the user's call what file to use it on, how does
the code know whether the file is supposed to be static?

Actually I thought files (on Windows for example) were already sparse;
if I create an empty file, write the first byte, then write the ten
billionth, will the OS really fill in all those intermediate blocks?
Actually, most systems simply won't let you write the billionth byte
unless you've already written all the bytes before it - meaning you have a
billion bytes on disk. Sparse files generally require special handling,
which is why the problem comes up: if you use the special sparse-file
functions or modes, you can, in fact, seek to byte 1 billion and write,
without actually storing a billion bytes on disk, but a "naive" file copy
routine, one which reads a block then writes it, will read every
intervening byte - it will copy a billion bytes.

And that's kinda the point: the file has maybe 100K of actual data in it,
the rest is "virtual zero" or some equivalent. Depending which "file
size" value you get, you see either 100K - in which case you're almost
certain to lose data, if it's strewn about the file - or you see 1 billion
bytes, in which case you're copying a billion bytes to get 100,000.
Neither is particularly good, but any concept of "file size" which isn't
smart enough to deal with such cases is going to face at least one of
those two problems - though, that said, any remotely portable file read is
also probably not going to be able to use the sparse file as anything
other than a "normal" file anyhow, and see it as a billion bytes, so it
had better hope it never sees the "data size" being reported.
If the OS is responsible for sparse/compressed files, then I would
expect them to be transparent.
They are and they aren't. A naive file copy can, indeed, copy such a
file, but it will "see" a file a billion bytes long, where one that uses
the sparse file functions may be able to tell there's only certain regions
which contain data, and copy those instead.

Certainly an application written specifically for the file - eg if this
file is some sort of data storage specific to the app - will know or be
able to tell, via indexing or the like, which portions of the file are
used and thus seek to offset 750,493 to read record 19.541 or whatever -
it can use the sparse file as a sparse file, where the naive routines use
the sparse file as a "flat" file, and the OS fills in the gaps, usually
with zeros.
It should report the full size. After all
it shouldn't take long to read in non-existent blocks!
Presumably less time than reading them off disk, but the fact is, the
blocks _do_ exist. They're just not stored on disk. Think of it as a
copy-on-write deal. Until written, the "sectors" don't exist. Once
written, they're mapped into the file and stored. On reading, the ones
which don't exist yet return zeroes - full "sectors", just with all bytes
zero - rather than simply skipping over them.
And it wouldn't
do me much good to have a sparse/compressed file of unknown format in my
memory space.
However, the issue here was one of knowing the file size so you can read
the file in. If you're writing a file duplicator, or a file editor, for
example, you may want to read in some portion of the file, even all of it
if it's small enough, yet here you're faced with a file with at least two
distinctly different "size" values, each legitimate. One describes the
"theoretical" size - say 4GB. The other describes the "actual" size -
size actually occupied on disk, size of actual data stored to the file,
say 200K. Which is the "correct" value? Both are correct.
(Someone said the OS may not know the full size of compressed files. I
would call that a broken OS)
Why? If you're storing a file on a compressed file system, there are
again two perfectly legitimate "size" values - size of original file, and
size as recorded to disk. If you're asking for "the size of the file",
which do you want? Depends; for some purposes, you'd want the size of the
compressed file, for others, the size of the file before compression.
Neither is "the" correct file size; each is correct - yet each is
different.
>I suggested one such, but it involves reporting not a single answer,
but several: size on disk, size reserved (eg how big a sparse file
"actually is"), size of contained compressed data, size of contained
data after decompression. That's four; perhaps we need to add another
four,
The most useful is the number of data bytes seen by the application.
And which value is that? IIRC, some folks have pointed out that not all
OSen even record a file size, per se. Thus you could, presumably, get
whatever value is reported for the current size of the file - 6 "blocks" -
write a chunk of data to the end, get the new size - again, 6 "blocks" -
and by naive reasoning conclude that no data had been written. The fact
that the system is reporting size by "block" count and your additional
data didn't spill into the next block means your size is of more than
questionable utility for many purposes.

Compressed files, total bytes allocated, that's all OS stuff
All file size values are OS stuff, which is kinda the point here. If
someone is going to say "I want the size of the file", he's going to have
to explain what he means, while taking all this sort of thing into
consideration. What *is* the size of the file? There's too many possible
answers to that to give a useful response even in a comparatively simple
case, never mind as a general case.
The world could do with simplifying. Why not have a concept of
'filesize', then define what it might mean under all your extreme
examples?
Exactly what I'm saying: if "you" want a function that determines the size
of a file, how about "you" define what "the size of a file" means. Oddly,
the ones most insistent upon having such a function refuse to solve these
issues.

I don't know what forks are, but if people had been happy dealing with
files their way before they came along, why can't they continue to do
so?
Who is "they"?
The introduction of forks will not break existing code surely?
The code won't break. What happens to the data, though?
Whatever benefits 'forks' bestow surely can be reaped without affecting
naive applications that know nothing about them.
You'd think so. I've been bitten by them before, though; a file which you
didn't realise was forked, you read/copy and subsequently delete it, oops,
sorry, you only actually copied the "default" fork, not all the data in
the file.
>So explain to us, you - or the others who think this issue is so
trivial - which of the 8 values I mentioned, the ones which don't even
deal with forked files, is "the size of a file".
As mentioned, the one reported by a typical OS on file listings.
The one which is arguably of the least possible value of all the possible
results. Yeah, I'm gonna rush right out and use that one.
I did a little test in Windows: slowly writing file A while, in a
command window, asking the OS to copy A to B. The result: I got a
partial copy of A in B, which represented where it had got to in writing
A.

Another test, where the copying was done by an application calling C
functions, gave the same result. Was this an error? If it was, then the
OS is in error too.
Just for giggles, did you also try it using the OS-specific file APIs?
Jan 3 '08 #252
[snips]

On Thu, 03 Jan 2008 00:00:27 +0100, Syren Baran wrote:
You should skip the "-like" here. Either it's a file with an appropriate
handle or I can obtain the handle via an open with a char*.
Really? So a hard drive is a file. And a sound card is a file. And a
PS/2 connector is a file.

Every one of those can be accessed in a file-like manner. Does this make
them files?
Jan 3 '08 #253
In article <fl**********@aioe.org>, jacob navia <ja***@nospam.org> wrote:
>1) You can't open a file with
fopen("name","a+")
since somebody else could grow the file after the file is positioned at
EOF, so you would overwrite his data.
C89 4.9.5.3 The fopen Function

Opening a file with append mode ('a' as the first character in
the mode argument) causes all subsequent writes to the file
to be forced to the then current end-of-file, regardless of
intervening calls to the fseek function.
Therefore, if the implementation allows an I/O interruption after
the file is automatically repositioned, but before the writing
happens, the implementation is not conformant to the C standards,
as the writing would not be to the "then current end-of-file".
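The guarantee quoted from C89 4.9.5.3 can be demonstrated with a short sketch (the function name and scratch filename are illustrative): with mode "a", an fseek before a write does not move where the data lands.

```c
#include <stdio.h>

/* With mode "a", every write is forced to the then current
   end-of-file, regardless of intervening fseek calls.
   Returns 1 on success, 0 on any I/O failure. */
int append_ignores_seek(const char *name)
{
    FILE *fp = fopen(name, "w");

    if (fp == NULL)
        return 0;
    fputs("hello", fp);
    fclose(fp);

    fp = fopen(name, "a");
    if (fp == NULL)
        return 0;
    fseek(fp, 0L, SEEK_SET);   /* try to reposition to the start... */
    fputc('X', fp);            /* ...the write still lands at EOF */
    return fclose(fp) == 0;
}
```

After the call, the file contains "helloX", not "Xello": the append-mode write ignored the seek, exactly as the standard requires.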
--
"All is vanity." -- Ecclesiastes
Jan 3 '08 #254
Kelsey Bjarnason schrieb:
[snips]

On Thu, 03 Jan 2008 00:00:27 +0100, Syren Baran wrote:
>You should skip the "-like" here. Either it's a file with an appropriate
handle or I can obtain the handle via an open with a char*.

Really? So a hard drive is a file. And a sound card is a file. And a
PS/2 connector is a file.
Sure, that's one of the nice and simple things about unices. Once you
have a file handle, how could you tell the difference?
>
Every one of those can be accessed in a file-like manner. Does this make
them files?
You say file-like again. What is the difference between "file
manner" and "file-like manner"?
Problem is, the term "file" is not well defined.
Is a 1:1 copy of the entire contents of a hard drive a file, e.g. "dd
if=/dev/hdd of=hardrive.backup"?
Is an archive (e.g. zip-file, tar-file) a file or a filesystem? Does it
automagically change its status if an implementation of open accepts
something like "/home/me/archive.zip/folder/somefile"?
Jan 3 '08 #255

"Nick Keighley" <ni******************@hotmail.com> wrote in message
news:38**********************************@e4g2000hsg.googlegroups.com...
On Dec 27 2007, 9:32 pm, "Bart C" <b...@freeuk.com> wrote:
"Eric Sosman" <esos...@ieee-dot-org.invalid> wrote in message
news:4f******************************@comcast.com...
Bart C wrote:
unsigned int getfilesize(FILE* handle)
{
unsigned int p,size;
p=ftell(handle); /*p=current position*/
fseek(handle,0,2); /*get eof position*/
size=ftell(handle); /*size in bytes*/
fseek(handle,p,0); /*restore file position*/
return size;
}
First, there's no error checking. None, nada, zero, zip.

I wasn't aware there was much to go wrong. But I will have a look. At worst
it will return the wrong file size; I'll make it return all 0's or all 1's
or something.
> ugh. In-band error signalling. How do you tell the difference between
> a zero-length (or very large) file and an error?

Having an error condition equate to a zero-length file is workable when the
error is likely rare and unimportant.

But yes all 1's is better, and in this case won't clash with the largest
size returnable.

Bart
Jan 3 '08 #256
On Thu, 03 Jan 2008 20:20:02 GMT, "Bart C" <bc@freeuk.com> wrote:
>
"Nick Keighley" <ni******************@hotmail.com> wrote in message
news:38**********************************@e4g2000hsg.googlegroups.com...
On Dec 27 2007, 9:32 pm, "Bart C" <b...@freeuk.com> wrote:
>"Eric Sosman" <esos...@ieee-dot-org.invalid> wrote in message
news:4f******************************@comcast.com...
Bart C wrote:
>unsigned int getfilesize(FILE* handle)
{
unsigned int p,size;
p=ftell(handle); /*p=current position*/
fseek(handle,0,2); /*get eof position*/
size=ftell(handle); /*size in bytes*/
fseek(handle,p,0); /*restore file position*/
return size;
}
First, there's no error checking. None, nada, zero, zip.

I wasn't aware there was much to go wrong. But I will have a look. At worst
it will return the wrong file size; I'll make it return all 0's or all 1's
or something.

> ugh. In-band error signalling. How do you tell the difference between
> a zero-length (or very large) file and an error?

Having an error condition equate to a zero-length file is workable when the
error is likely rare and unimportant.

But yes all 1's is better, and in this case won't clash with the largest
size returnable.
I missed the beginning of this thread, but why is "size" unsigned?
ftell returns long int, with -1L indicating failure. Your function
could return long int as well.

--
Al Balmer
Sun City, AZ
Jan 3 '08 #257
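Pulling together the fixes raised in this subthread (long return type matching ftell, the SEEK_* macros instead of magic numbers, and error checking with -1L as the out-of-band failure value), a revised sketch of the function might read:

```c
#include <stdio.h>

/* Revised getfilesize: returns the end-of-file offset reported by
   ftell, or -1L on any failure.  Still only meaningful for seekable
   binary streams, per the caveats discussed in the thread. */
long getfilesize(FILE *fp)
{
    long pos, size;

    pos = ftell(fp);                      /* remember caller's position */
    if (pos < 0L)
        return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;
    size = ftell(fp);                     /* offset of end-of-file */
    if (fseek(fp, pos, SEEK_SET) != 0)    /* restore caller's position */
        return -1L;
    return size;
}
```

Since ftell itself returns -1L on failure, the error value cannot collide with any valid size, answering the in-band-signalling objection.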
>So which is the correct size of a partially compressed sparse file? The

The file size being discussed was *the size of the file when it is
read into memory* (and you have to specify text or binary mode).
This is not the only "file size" definition, but it's the one
relevant to reading the whole file into memory. The size on disk
is irrelevant for this problem (note also that C provides no way
to get "free disk space", another term that is difficult to define
exactly). A sparse file might take more memory than has ever been
manufactured to read the whole thing in.
>uncompressed size it would be if it was actually "full"?
The number of bytes when you read it in. This can change with time.
To accurately measure it, you open, read, and close it without any
intervening accesses by another program. Some systems let you
PREVENT such accesses with mandatory file locking.
>The compressed
size, based on what's actually in it?
No. That doesn't affect the size of the file when you read it,
assuming you are talking about transparent compression.
>The size it currently occupies on
the disk?
No. That doesn't affect the size of the file when you read it.
>If the latter, keep in mind that it bears absolutely no
relationship to the actual number of data bytes in the file.

So do tell, which is the "correct" size.
For this problem, the size of the file when you read it into memory
is the size of the file when you read it into memory in a particular
mode (text or binary) and at a particular time. To get a consistent
value, you need to do all your accesses in one sequence with no
intervening accesses from other programs.

There are plenty of other definitions of "file size" for other problems.

Jan 4 '08 #258
Al Balmer wrote:
On Thu, 03 Jan 2008 20:20:02 GMT, "Bart C" <bc@freeuk.com> wrote:
>>
"Nick Keighley" <ni******************@hotmail.com> wrote in message
news:38**********************************@e4g2000hsg.googlegroups.com...
On Dec 27 2007, 9:32 pm, "Bart C" <b...@freeuk.com> wrote:
>>"Eric Sosman" <esos...@ieee-dot-org.invalid> wrote in message
news:4f******************************@comcast.com...
Bart C wrote:
>>>>unsigned int getfilesize(FILE* handle)
{
unsigned int p,size;
p=ftell(handle); /*p=current position*/
fseek(handle,0,2); /*get eof position*/
size=ftell(handle); /*size in bytes*/
fseek(handle,p,0); /*restore file position*/
return size;
}
>>>First, there's no error checking. None, nada, zero, zip.

I wasn't aware there was much to go wrong. But I will have a look.
At worst it will return the wrong file size; I'll make it return
all 0's or all 1's or something.

> ugh. In-band error signalling. How do you tell the difference between
> a zero-length (or very large) file and an error?

Having an error condition equate to a zero-length file is workable
when the error is likely rare and unimportant.

But yes all 1's is better, and in this case won't clash with the
largest size returnable.
I missed the beginning of this thread, but why is "size" unsigned?
ftell returns long int, with -1L indicating failure. Your function
could return long int as well.
Probably my fault, using 12-year-old docs and source originally in non-C.

(Oh, and thanks for the OE-Quotefix link posted elsewhere. Usenet in
technicolor now..)

Bart
Jan 4 '08 #259
Kelsey Bjarnason wrote:
[snips]

On Thu, 03 Jan 2008 02:14:32 +0000, Bart C wrote:
>Taking the size of a rapidly changing file like that is asking for
problems. But they need not be serious. Ask the OS to copy that file
to a unique filename.

This assumes you can. If the file is larger than available free
space, how do you plan to manage this?
You write software that requires certain resources, and if they can't be met
then it fails. In this case you need enough space to duplicate this file.
Obviously that isn't ideal; then you have to look at other ways of dealing
with these files in a low-disk situation.
>
>It's a discrepancy: if the file should have been static, then an
error can be raised.

And how does he know it's supposed to be static? Simple example: a
text file viewer/editor. It's the user's call what file to use it
on, how does the code know whether the file is supposed to be static?
Good point. In my little test of a slowly expanding file, one editor denied
me access to the file (perhaps a good idea, except this denial could be
intermittent), while one of my own editors allowed me to view the file so
far. Probably wrong, but which approach is more useful to someone who
urgently needs to look at the file?

Actually editing the file would be problematical but multiple write access
to such data files needs certain approaches and that's probably outside the
scope of the C file functions we're talking about.

<snip lots of stuff about OS/compressed/sparse files>

You obviously know a lot about this but I'd still like my right to a simple
interface to the file system where these details are hidden, unless I call
the appropriate functions.
>...Why not have a concept of
'filesize', then define what it might mean under all your extreme
examples?

Exactly what I'm saying: if "you" want a function that determines the
size of a file, how about "you" define what "the size of a file"
means. Oddly, the ones most insistent upon having such a function
refuse to solve these issues.
I have my own ideas which are adequate most of the time, but you will likely
come up with some unusual scenario where these ideas will break down.

I notice MS Windows has a native function GetFileSize, and I would be happy
to go along with whatever that means (although I use only C functions). This
returns the expanded sizes of compressed files, and they have a separate
function for the compressed sizes of those.

Bart


Jan 4 '08 #260
Kelsey Bjarnason <kb********@gmail.com> writes:
On Thu, 03 Jan 2008 02:14:32 +0000, Bart C wrote:
<snip>
>Actually I thought files (on Windows for example) were already sparse;
if I create an empty file, write the first byte, then write the ten
billionth, will the OS really fill in all those intermediate blocks?

Actually, most systems simply won't let you write the billionth byte
unless you've already written all the bytes before it - meaning you have a
billion bytes on disk.
I don't know about most systems, but on my Linux box (and on most
Unix boxes that I remember using) this program:

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rc = 0;
    if (argc == 2) {
        FILE *fp = fopen(argv[1], "wb");
        if (fp != NULL) {
            rc = fseek(fp, 100L * 1024 * 1024 - 1, SEEK_SET) == 0 &&
                 fputc(' ', fp) != EOF;
            if (fclose(fp) != 0)
                rc = 0;
        }
    }
    return rc ? EXIT_SUCCESS : EXIT_FAILURE;
}

makes a file with size 100M but which uses only 32 blocks. Obviously this
is very file-system specific, but I think your comment is overly general.

--
Ben.
Jan 4 '08 #261
[snips]

On Fri, 04 Jan 2008 03:50:32 +0000, Bart C wrote:
Good point. In my little test of a slowly expanding file, one editor denied
me access to the file (perhaps a good idea, except this denial could be
intermittent), while one of my own editors allowed me to view the file so
far. Probably wrong, but which approach is more useful to someone who
urgently needs to look at the file?
Another editor monitors the file, notes it has changed and asks if you
want to reload. Lots of ways of dealing with this sort of thing, none of
them universally correct.
I notice MS Windows has a native function GetFileSize, and I would be happy
to go along with whatever that means (although I use only C functions).
This returns the expanded sizes of compressed files, and they have a
separate function for the compressed sizes of those.
And there's an example of what I'm talking about. The discussion has been
about a C routine to determine file sizes, yet here there are two distinct
sizes, each perfectly correct - and depending on what you want to do with
the file, arguments can be made in favour of both sizes. Yet such a
function can, presumably, only return one - so which one?

The simple fact is there isn't a general solution to the problem, because
the problem simply cannot be solved in the general case. At best you can
make an arbitrary decision: "file size means bytes available to be read
when the file is opened in binary mode, using naive file functions,
treating the entire process as atomic and as if it were being performed at
the time of determining the size."

Yeah, that'll give you a size, and yeah, that size will work on a whole
lotta files, but it'll crap out on a whole lot, too.
Jan 7 '08 #262
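One way to sidestep the "which size is correct" question entirely, implicit in several of the replies, is to never ask for a size at all: read the stream into a buffer that grows as needed. A sketch (the function name and starting capacity are arbitrary choices); it also works on non-seekable streams, which the ftell/fseek approach cannot handle.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an already-open stream of unknown length into a dynamically
   grown buffer.  Returns NULL on error; otherwise a malloc'd buffer
   holding *lenp bytes. */
char *slurp(FILE *fp, size_t *lenp)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    for (;;) {
        size_t got = fread(buf + len, 1, cap - len, fp);
        len += got;
        if (len < cap)                 /* short read: EOF or error */
            break;
        {                              /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    *lenp = len;
    return buf;
}
```

The size you get back is, by construction, "the number of bytes this program actually read in this mode at this time", which is the only definition the thread has managed to agree on.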
On Fri, 04 Jan 2008 01:22:49 +0000, Gordon Burditt wrote:
>>So which is the correct size of a partially compressed sparse file? The

The file size being discussed was *the size of the file when it is
read into memory* (and you have to specify text or binary mode).
If you can actually specify the mode, in many cases the size will be
different. Is the OS supposed to keep track of both and report back the
one you want? If we're defining a standard C routine to determine file
size, do we pass a parameter specifying which mode to use?
This is not the only "file size" definition, but it's the one relevant
to reading the whole file into memory.
It's at least two distinct sizes already.
The number of bytes when you read it in. This can change with time. To
accurately measure it, you open, read, and close it without any
intervening accesses by another program. Some systems let you PREVENT
such accesses with mandatory file locking.
Sure, and some don't, and in the context of C, you can't rely on such
measures existing, and even if they do, we're still talking a minimum of
two different sizes.
>>So do tell, which is the "correct" size.

For this problem, the size of the file when you read it into memory is
the size of the file when you read it into memory in a particular mode
(text or binary) and at a particular time.
So two distinct sizes, then. Again - which is the correct one? Which
will our file size function return? Should it have a parameter which lets
you specify? Is the size you determine _now_ the size of the file at the
time you read it? Several examples have been given where this won't be
the case.
To get a consistent value,
you need to do all your accesses in one sequence with no intervening
accesses from other programs.
Sounds good - can you explain how to ensure this in standard C code? If
you can't, then whether you can determine the file size or not sorta
becomes irrelevant, as it may well change at the drop of a hat.
Jan 7 '08 #263
[snips]

On Thu, 03 Jan 2008 20:23:46 +0100, Syren Baran wrote:
Sure, that's one of the nice and simple things about unices. Once you
have a file handle, how could you tell the difference?
Mostly you can't.
>Every one of those can be accessed in a file-like manner. Does this make
them files?
You say file-like again. What is the difference between "file
manner" and "file-like manner"?
That's kinda my question. To me, a serial port is not a file, regardless
of how you access it. Nor is a video card or sound card. These are
devices, not files.
Problem is, the term "file" is not well defined.
Is a 1:1 copy of the entire contents of a hard drive a file, e.g. "dd
if=/dev/hdd of=hardrive.backup"?
Er, you're creating a file, why wouldn't it be a file? :)
Is an archive (e.g. zip-file, tar-file) a file or a filesystem? Does it
automagicly change its status if an implementation of open accepts
something like "/home/me/archive.zip/folder/somefile"?
Make it simpler: a loopback. Say, for example, mounting an ISO image as
if it were a device. mount -o loop -t iso9660 media.iso /path/to/mount/at.

Obviously, you're dealing with a file. Then again, equally obviously,
you're not dealing with a device; you're dealing with a file mimicking a
device.

Perhaps the simplest way to differentiate is asking "can I duplicate
this?" No matter how many times you try to dupe your sound card, you're
only going to drive two speakers. No matter how many times you try to
dupe your hard drive, you're not going to turn a 10GB drive into a 1TB
drive. A file, by contrast, can be duplicated endlessly, with the result
that you do, in fact, have multiple distinct - and distinctly usable -
copies.

Granted, from a C perspective, you might not be able to tell the
difference, but C doesn't _quite_ encompass all reality. Yet. :)
Jan 7 '08 #264
>>>So which is the correct size of a partially compressed sparse file? The
>>
The file size being discussed was *the size of the file when it is
read into memory* (and you have to specify text or binary mode).

If you can actually specify the mode, in many cases the size will be
different. Is the OS supposed to keep track of both and report back the
one you want? If we're defining a standard C routine to determine file
size, do we pass a parameter specifying which mode to use?
Yes, you'd have to pass a parameter specifying which mode to use,
or open the file and let the system use the same mode as what you
opened it with. Or have two different functions, like filetextsize()
and filebinarysize(). On POSIX and Windows, stat() could be used
to provide filebinarysize(). On POSIX, filetextsize() is the same
as filebinarysize(). And in any case, you can obtain the required
value by opening the file in the appropriate mode, calling fgetc()
repeatedly, and counting the number of calls. I didn't say it would
be fast.
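The fgetc-counting measurement described there can be sketched as follows (the function name is illustrative): open the file in the mode you intend to read it with, and count the characters. The result is exactly the number of bytes a C program will see in that mode, at the time of the scan.

```c
#include <stdio.h>

/* Count the bytes visible to a C program reading the named file in
   the given mode ("r" for text, "rb" for binary).  Slow but entirely
   portable.  Returns -1L on error. */
long countbytes(const char *name, const char *mode)
{
    FILE *fp = fopen(name, mode);
    long n = 0;

    if (fp == NULL)
        return -1L;
    while (fgetc(fp) != EOF)
        n++;                 /* one call per byte seen */
    if (ferror(fp))
        n = -1L;             /* EOF was an error, not end-of-file */
    fclose(fp);
    return n;
}
```

As noted, it isn't fast, and the value is only valid if nothing else touches the file between this scan and the eventual read.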
>This is not the only "file size" definition, but it's the one relevant
to reading the whole file into memory.

It's at least two distinct sizes already.
Which file mode is relevant to the file you intend reading into
memory? If you pass it an open FILE * handle, it should already
know what mode you are interested in (if it makes a difference).
If you pass it a file name, you'll need a mode also.
>The number of bytes when you read it in. This can change with time. To
accurately measure it, you open, read, and close it without any
intervening accesses by another program. Some systems let you PREVENT
such accesses with mandatory file locking.

Sure, and some don't, and in the context of C, you can't rely on such
measures existing, and even if they do, we're still talking a minimum of
two different sizes.
>>>So do tell, which is the "correct" size.

For this problem, the size of the file when you read it into memory is
the size of the file when you read it into memory in a particular mode
(text or binary) and at a particular time.

So two distinct sizes, then. Again - which is the correct one? Which
If you are intending to read the file into memory, which mode do you
intend to use when reading it? That is the correct mode to use for
computing the size.
>will our file size function return? Should it have a parameter which lets
you specify? Is the size you determine _now_ the size of the file at the
time you read it? Several examples have been given where this won't be
the case.
The function returns a value as of a particular time. What gets
nasty is when the function's accesses to the file are interleaved
with accesses by other program or programs unknown.
>>To get a consistent value,
>>you need to do all your accesses in one sequence with no intervening
>>accesses from other programs.

>Sounds good - can you explain how to ensure this in standard C code? If
An *implementation* of a proposed function to add to standard C can
use non-standard hooks which standard C code can't (like file locking).
>you can't, then whether you can determine the file size or not sorta
>becomes irrelevant, as it may well change at the drop of a hat.
You're right. Files can change size, and trying to get the file
size ahead of time, no matter how you define it, is a problem. You
also see the same problem when people ask "how can I find out if/how
many other programs have the file open?". That won't work either:
unless you can get the function to return an accurate value *AND
PREVENT IT FROM CHANGING* until the caller of the function lets go.
That tends to be a significant opening for a denial-of-service
attack by a buggy or malicious caller of the file size function.

There are other uses of the file size, such as comparing the output
size with the expected output size in a regression test. (Next step,
if the file sizes match, is to compare them).

Jan 7 '08 #265
On Thu, 27 Dec 2007 01:46:04 -0000, go***********@burditt.org (Gordon
Burditt) wrote:
<snip, attribution to jacob navia missing>
>>Really? You have the file open and seek to a position and the OS lets
>>someone else delete it? Marvelous.

>Yes, and furthermore you can still read from the file, or write to it,
>*AFTER* someone else deletes it (until you fclose() it).
>>Show me this system that I may
>>stand in astonishment.

>UNIX or POSIX.
Real Unix (at least) doesn't truly delete the file (=inode), only the
(last) direntry for it. I'm not sure how closely a POSIX simulation or
wrapper on something else must, or does, track this.
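The read-after-delete behaviour is easy to demonstrate on a POSIX system. This sketch leaves standard C, since unlink() and <unistd.h> are POSIX, not ISO C, and the function name is invented:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* unlink() - POSIX, not ISO C */

/* Create a file, unlink() it, then read the data back through the still
   open stream. Returns 1 if the read-back succeeds, 0 otherwise. */
int read_after_unlink(void)
{
    char buf[16] = {0};
    FILE *f = fopen("ghost.txt", "w+");

    if (f == NULL)
        return 0;
    fputs("still here", f);

    unlink("ghost.txt");      /* removes the last directory entry only */

    /* The inode lives on for as long as we hold the stream open. */
    rewind(f);
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return 0;
    }
    fclose(f);                /* now the storage is actually reclaimed */
    return strcmp(buf, "still here") == 0;
}
```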
[Windows example of deleting failing on open file deleted.]
At least some versions of NFS in at least some configurations had(?)
the problem that a file can truly be deleted out from under an open,
causing real breakage.

Multics had files (or formally segments) mapped rather than opened; it
allowed you to truly delete, or just retract permission to, a mapped
file, invalidating the existing mappings so that subsequent attempts
to access it from the already-running process(es) would fail.

- formerly david.thompson1 || achar(64) || worldnet.att.net
Jan 7 '08 #266
On Fri, 28 Dec 2007 12:27:18 +0100, jacob navia <ja***@nospam.com>
wrote:
<snip>
>Nobody needs to modify the OS. But if those systems support C, they
>MUST support
>
>FILE *f = fopen("foo","a+");
>
>And they HAVE to know where the end of the file is somehow. I am
>amazed how you and the others just ignore the basic facts.
They have to know where the end of file is, but that doesn't
necessarily mean knowing the size of the file. I once worked on a
(custom) system where files were stored as double-linked lists of
sectors, with only head and tail pointers in the directory. You could
add to _or take (read&delete) from_ the end, _or the beginning_,
_only_; full editing used a variant of the TECO/Emacs 'buffer gap'
technique where you:
- open oldfile at beginning and create empty newfile open at end;
- to move forward, read&delete from oldfile and append to newfile;
- to insert just append to newfile;
- to delete just read&delete without writing;
- to replace should be obvious at this point;
- to move backward read/delete from (end of) newfile and write
'before' beginning of oldfile;
- repeat until positioned at end, with oldfile empty; then delete
oldfile and keep (i.e. catalog) newfile.
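The forward-moving part of that two-file technique can be sketched with ordinary standard C streams (stdio offers no way to take bytes back off the end of a file, so the move-backward step of the original system is omitted, and the function name is invented):

```c
#include <stdio.h>

/* Move the cursor forward past `keep` bytes of oldf, insert a string at
   the cursor, and copy the remainder across. Everything before the
   cursor ends up in newf, everything after it drains out of oldf. */
void edit_forward(FILE *oldf, FILE *newf, long keep, const char *insert)
{
    int c;
    long i;

    for (i = 0; i < keep && (c = fgetc(oldf)) != EOF; i++)
        fputc(c, newf);       /* move forward: copy across the gap */
    fputs(insert, newf);      /* insert at the cursor */
    while ((c = fgetc(oldf)) != EOF)
        fputc(c, newf);       /* finish: drain the rest of oldfile */
}
```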
>In C, any file is conceptually a sequence of bytes. Some file systems
>do not support this well. But if they support it, THEN they must
>ALREADY support this abstraction, so that filesize wouldn't require any effort.
A sequence, but not a vector. C files needn't be randomly or directly
addressable: fseek() can fail (and in text mode can use other than
byte positions); fsetpos() too. On some files, like magtape or serial
(or pseudo) or pipe, they can't work and thus must fail.

Disk file systems generally do support direct positioning, because
that was originally one of the main benefits of having disks and disk
files. But they don't inherently require it, and neither does C.
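In other words, a portable size query has to treat fseek() and ftell() as fallible. A sketch (name invented) that reports failure instead of assuming the stream is seekable:

```c
#include <stdio.h>

/* Report the size of a stream via fseek()/ftell(), checking the return
   values instead of assuming they succeed: on a pipe, terminal, or tape
   device, fseek() is allowed to fail. Returns -1 if the stream is not
   seekable or on any other error. */
long seekable_size(FILE *f)
{
    long size;

    if (fseek(f, 0L, SEEK_END) != 0)
        return -1L;           /* not a directly addressable stream */
    size = ftell(f);
    if (size == -1L)
        return -1L;
    if (fseek(f, 0L, SEEK_SET) != 0)
        return -1L;
    return size;
}
```

Even when it succeeds, the standard guarantees the result is a byte count only for binary streams; for a text stream, ftell() returns an opaque value usable only by fseek().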

- formerly david.thompson1 || achar(64) || worldnet.att.net
Jan 7 '08 #267
[snips]

On Mon, 07 Jan 2008 02:48:53 +0000, Gordon Burditt wrote:
>Yes, you'd have to pass a parameter specifying which mode to use,
>or open the file and let the system use the same mode as what you
>opened it with. Or have two different functions, like filetextsize()
>and filebinarysize().
Which means the OS, when writing a block of data, no longer has to merely
write it, but parse it - look for any embedded characters which would be
translated into greater or lesser sequences, and record that value as
well. I suspect this is going to have an impact on performance - assuming
you can get 'em to do it at all.

Alternatively, the function itself could do the job, by opening the file
and reading the file, in the appropriate mode, beginning to end. Can you
say performance hit?
>If you are intending to read the file into memory, which mode do you
>intend to use when reading it? That is the correct mode to use for
>computing the size.
This assumes I will only ever read the file in one mode, or determine size
by reading the file, bytewise, at time of determining the size. The
former isn't reliable, the latter is hellishly inefficient.
>>will our file size function return? Should it have a parameter which
>>lets you specify? Is the size you determine _now_ the size of the file
>>at the time you read it? Several examples have been given where this
>>won't be the case.
>The function returns a value as of a particular time.
Yes, but again - which value?
>There are other uses of the file size, such as comparing the output size
>with the expected output size in a regression test.
Sure. Now, again, *which* file size? Determined *how*?
Jan 9 '08 #268
>>Yes, you'd have to pass a parameter specifying which mode to use,
>>or open the file and let the system use the same mode as what you
>>opened it with. Or have two different functions, like filetextsize()
>>and filebinarysize().

>Which means the OS, when writing a block of data, no longer has to merely
>write it, but parse it - look for any embedded characters which would be
>translated into greater or lesser sequences, and record that value as
>well. I suspect this is going to have an impact on performance - assuming
>you can get 'em to do it at all.
I did not say the OS has to have either or both of the sizes
precalculated. If the result impacts performance more than reading
the whole file and counting bytes, taking into account things like
how often file sizes of either type are needed and how often writes
are done, then someone made a poor decision of pessimization.

I consider the text-mode file size to be needed too rarely, for the
purpose of reading a file into memory, to be worth caching. Your
opinion may differ.

POSIX happens to keep *both* precalculated, since there's no
difference between binary and text mode. Windows keeps the binary-mode
size precalculated. Thus, performance for getting the text-mode
size may suck significantly more than getting the binary-mode size
on Windows.
>Alternatively, the function itself could do the job, by opening the file
>and reading the file, in the appropriate mode, beginning to end. Can you
>say performance hit?
If you don't need the correct answer, you can do it in zero bytes
and zero time. But if the performance hit is so bad, maybe you
shouldn't use a method that needs a precalculated file size. That
approach of reading the file in chunks (this is to read it into
memory, NOT precalculate the size) and realloc()ing when needed
(say, doubling each time, with fallback if you run out of memory)
is starting to look more and more efficient all the time, even with
the copying (if any).
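That chunked-read strategy might look like this in standard C (the function name and starting capacity are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an already-open stream into memory without knowing its size in
   advance: grow the buffer by doubling with realloc(). No fseek()/ftell(),
   so it also works on pipes and terminals. Stores the byte count in *len
   and returns the buffer, or NULL on failure. */
char *slurp(FILE *f, size_t *len)
{
    size_t cap = 4096, used = 0, got;
    char *buf = malloc(cap), *tmp;

    if (buf == NULL)
        return NULL;
    while ((got = fread(buf + used, 1, cap - used, f)) > 0) {
        used += got;
        if (used == cap) {            /* full: double the capacity */
            tmp = realloc(buf, cap *= 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
    }
    if (ferror(f)) {
        free(buf);
        return NULL;
    }
    *len = used;
    return buf;
}
```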

While we're at it, how about revisiting the strategy of reading the
entire file into memory? Is it really a good idea? If the file is
large, you may force parts of this program or other programs to page
out. Slow. Now, depending on what you are doing with the file, reading
it in chunks might be worse. Or better. If you're just dumping the
file in hex, reading chunks at a time lets your program run in much
less memory, and makes it work on files MUCH larger than what you can
fit in memory.
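For the hex-dump case, a fixed-size chunk buffer keeps memory use constant regardless of file size. A sketch:

```c
#include <stdio.h>

/* Dump a stream in hex using a small fixed buffer: memory use stays
   constant no matter how large the input file is. */
void hexdump(FILE *in, FILE *out)
{
    unsigned char chunk[16];
    size_t n, i;
    unsigned long offset = 0;

    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        fprintf(out, "%08lx ", offset);
        for (i = 0; i < n; i++)
            fprintf(out, " %02x", chunk[i]);
        fputc('\n', out);
        offset += n;
    }
}
```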
>>If you are intending to read the file into memory, which mode do you
>>intend to use when reading it? That is the correct mode to use for
>>computing the size.

>This assumes I will only ever read the file in one mode, or determine size
>by reading the file, bytewise, at time of determining the size. The
>former isn't reliable, the latter is hellishly inefficient.
Each time you read the file into memory, you read it in *one* mode,
I hope (no switching in the middle of the file). When you want the
file size for that buffer, you read it in that one mode. How you
read it last time or will read it next time is irrelevant.

You made a bad decision, performance-wise, to use a precalculated
file length, especially in text mode if the OS doesn't keep the
value handy and text mode != binary mode. Stick with that decision,
and performance is going to suck.

If you *must* have a precalculated value, have the OS save the one
involving the same mode that the file was written in (and which
kind it is). My guess is that this will cover at least 80% of the
times that file size is needed for the purpose of reading the file
into memory.
>>>will our file size function return? Should it have a parameter which
>>>lets you specify? Is the size you determine _now_ the size of the file
>>>at the time you read it? Several examples have been given where this
>>>won't be the case.
>>The function returns a value as of a particular time.

>Yes, but again - which value?
It returns the one associated with the mode you intend to use to read
the file into memory. You have to make up your mind which mode to use
before you start reading. Use the same decision when you determine
the file size.
>>There are other uses of the file size, such as comparing the output size
>>with the expected output size in a regression test.

>Sure. Now, again, *which* file size? Determined *how*?
The size associated with the mode the file was written in, if your
application knows what that is (and no, I don't expect the OS to
keep track of it). It's up to your application to know what mode
to open its own files in. Either it knows from what prompt was
answered (e.g. text editors always do text files; graphics editors
always do binary files), or the file extension, or it asks the user,
or it just handles generic files and can do everything in binary
mode.

(This assumes that the reference "correct" output was generated on
THIS system or was converted to the local file format. If it wasn't,
well, size comparisons may be totally worthless). Since here,
you're using file size as a shortcut for comparing the files for
equality to quickly find a mismatch, you can dispense with the step
entirely and proceed to reading the files byte-by-byte and comparing
them if finding the size is a performance bottleneck.
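The byte-by-byte comparison suggested here needs no size computation at all; a length mismatch simply shows up as one stream reaching EOF while the other still has data. A sketch in standard C (name invented):

```c
#include <stdio.h>

/* Compare two streams byte for byte; returns 1 if identical, 0 if not.
   No precomputed sizes needed: a length mismatch shows up as one stream
   reaching EOF while the other still returns data. */
int files_equal(FILE *a, FILE *b)
{
    int ca, cb;

    do {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb)
            return 0;
    } while (ca != EOF);
    return 1;
}
```

Opening both streams in binary mode gives an exact byte comparison; in text mode, the comparison is of the translated contents.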

Jan 10 '08 #269
[snips]

On Thu, 10 Jan 2008 01:48:46 +0000, Gordon Burditt wrote:
>>Sure. Now, again, *which* file size? Determined *how*?

>The size associated with the mode the file was written in, if your
>application knows what that is
I see; you're one of those folks under the impression only one application
is ever allowed to be used on a file. I think we can dispense with any
further discussion, as your beliefs and reality have no bearing on each
other.

In the real world, we're still left with the questions: what size,
determined how, and with what performance penalty?

Apparently, a file size function is perfectly acceptable if it returns
multiple distinct values for the same file (even unmodified) with runtimes
ranging from, oh, a millisecond to, say, 10 minutes or more.

Sorry, not gonna work out here in reality.
Jan 10 '08 #270
Kelsey Bjarnason wrote:
>
>I see; you're one of those folks under the impression only one application
>is ever allowed to be used on a file. I think we can dispense with any
>further discussion, as your beliefs and reality have no bearing on each
>other.
The issue is that the behavior of a file being used by
more than one program is well outside the scope of a language
standard. Different operating environments do in fact define
different semantics (and even define "used" differently), and
it is not the proper purview of the C Standard to try to
dictate the environments in which it is used.

Some other languages have different and more limiting
goals. Java, for example, is quite the martinet and dictates
everything about the operating environment that it thinks it
can get away with: This is both a strength and a weakness.
S/370 assembly language is rather definite about the sizes and
representations of its data types: it relieves its users of
worrying about trap representations, but it doesn't interoperate
very well with MacOS. C's "I'll run anywhere" permissive approach
comes at a cost in specificity, but has proven to be of wide
applicability. Learn to appreciate the trade-off -- or learn
to hate it, and turn to languages more suited to your temperament.

--
Eric Sosman
es*****@ieee-dot-org.invalid
Jan 11 '08 #271
