Read only last line-

Ry*****@gmail.com writes:

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

No, there isn't. A text file is a sequence of characters, not a
sequence of lines; there's no way to jump directly to the beginning of
the last line without knowing in advance how long it is.

As far as the standard is concerned, fseek(file, -2, SEEK_END)
invokes undefined behavior. The standard says:

For a text stream, either offset shall be zero, or offset shall be
a value returned by an earlier successful call to the ftell
function on a stream associated with the same file and whence
shall be SEEK_SET.

The result of ftell() on a text stream contains unspecified
information, usable only in a call to fseek() with whence==SEEK_SET.

Realistically, if you can assume that the last line is shorter than,
say, 80 characters, you can *probably* get away with doing an
fseek(file, -80, SEEK_END), then reading everything up to the end of
the file and grabbing just the last line from that. But that's still
a bit risky; for example, if the system encodes end-of-line as a CR-LF
pair, the fseek() could land you between a CR and an LF. And there's
always the risk that you've guessed wrong, and the last line is really
90 characters long.

The only portable approach is to read the entire file. You can likely
do better by trading off portability for performance.
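As a rough sketch of the "seek back 80 characters" idea above: the helper below is hypothetical (the name last_int_from_tail and the TAIL_GUESS constant are made up for illustration). It assumes the stream was opened in binary mode ("rb") on a system where byte offsets from the end are meaningful, which the standard does not guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAIL_GUESS 80   /* assumed upper bound on the last line's length */

/* Hypothetical helper: stores the last integer of fp in *out.
   Returns 0 on success, -1 on failure or if the guess was too small.
   fp is assumed to be a seekable stream opened in binary mode. */
int last_int_from_tail(FILE *fp, long *out)
{
    char buf[TAIL_GUESS + 1];
    char *line, *end;
    long size, start;
    size_t n;

    assert(fp != NULL && out != NULL);

    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0)
        return -1;
    start = size > TAIL_GUESS ? size - TAIL_GUESS : 0;
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;
    n = fread(buf, 1, (size_t)(size - start), fp);

    /* Drop trailing line terminators (LF, or CR-LF seen in binary mode). */
    while (n > 0 && (buf[n - 1] == '\n' || buf[n - 1] == '\r'))
        n--;
    if (n == 0)
        return -1;
    buf[n] = '\0';

    line = strrchr(buf, '\n');
    if (line != NULL)
        line++;                 /* last line starts after the newline */
    else if (start == 0)
        line = buf;             /* the whole file is a single line */
    else
        return -1;              /* guessed wrong: line start not in window */

    errno = 0;
    *out = strtol(line, &end, 10);
    return (end == line || errno == ERANGE) ? -1 : 0;
}
```

If the window holds no newline and does not begin at offset 0, the sketch reports failure rather than returning a truncated number, which is the "guessed wrong" risk mentioned above.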

<OT>
The Unix "tail -1" command does this. Source code for the GNU version
of the tail command is included in the coreutils package. The
implementation is undoubtedly non-portable and more complex than you
need, but you might get some ideas from it.
</OT>

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Feb 18 '06 #2

Ry*****@gmail.com wrote:

Hello-

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

It can't be done in perfectly portable C, in part
because the way you're using fseek() isn't portable:

For a text stream, either offset shall be zero,
or offset shall be a value returned by an earlier
successful call to the ftell function [...] and
whence shall be SEEK_SET. (7.19.9.2/4)

Another problem is that each line (except the first)
begins immediately after the '\n' that ends the preceding
line; it is the preceding '\n' that marks the succeeding
character as a line-starter. You cannot tell whether an
arbitrary position in a text file is or isn't the start
of a line without reading the preceding character to see
whether it's a newline. And yet another problem is that
the file on disk may mark the line endings in some other
way, with an encoding that can only be translated to '\n'
by reading it in the forward direction.

Reading the whole file from start to finish is the
only perfectly portable approach. Remember the line most
recently read until you've successfully read another, and
when you reach end-of-file the remembered line is the last
one. (You might also wish to use ftell() to remember the
position of each line start, so you can fseek() back to the
last one again.) Unless the file is very large, this is a
perfectly reasonable approach.
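The read-and-remember method might look something like the sketch below. The function name last_int_portable is made up for illustration, and the 64-character buffer assumes that no line is longer than that.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads fp from its current position to end-of-file
   and stores the last integer seen in *out.  Returns 0 on success,
   -1 if no integer was found or a read error occurred. */
int last_int_portable(FILE *fp, long *out)
{
    char line[64];              /* assumed to be long enough for any line */
    long value = 0;
    int found = 0;

    assert(fp != NULL && out != NULL);

    while (fgets(line, sizeof line, fp) != NULL) {
        char *end;
        long v;

        errno = 0;
        v = strtol(line, &end, 10);
        if (end != line && errno != ERANGE) {
            value = v;          /* remember the most recent integer */
            found = 1;
        }
    }
    if (ferror(fp) || !found)
        return -1;
    *out = value;
    return 0;
}
```

This uses only fgets() and strtol(), so it makes no assumptions about how the file encodes line endings; blank lines are simply skipped.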

If the file is very large and you're willing to try
something that isn't portable (i.e., isn't guaranteed to
work), you could try seeking to a spot about ten or a dozen
lines' worth before the end and using the read-and-remember
method on the tail of the file. This may -- *may* -- work;
only you can judge whether the risk is worth the reward.

--
Eric Sosman
es*****@acm-dot-org.invalid

Feb 18 '06 #3

Flash Gordon

Ry*****@gmail.com wrote:

Hello-

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

Read it backwards a character at a time using fseek building your number
as you go. Don't forget to check for file operations failing and for
integer overflow.
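
One way this could look, as a sketch only: the function name last_int_backwards is invented here, and the code assumes a stream opened in binary mode where fseek() to an arbitrary byte offset works (elsewhere in this thread it is pointed out that this is not portable for text streams). It cannot produce INT_MIN, since the magnitude is built as a positive int first.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: walks backwards from the end of fp one character
   at a time, building the last integer in *out.  Returns 0 on success,
   -1 on I/O failure, missing digits, or overflow. */
int last_int_backwards(FILE *fp, int *out)
{
    long pos;
    int ch, d;
    int value = 0, place = 1, place_ok = 1, digits = 0;

    assert(fp != NULL && out != NULL);

    if (fseek(fp, 0L, SEEK_END) != 0 || (pos = ftell(fp)) < 0)
        return -1;

    while (pos > 0) {
        if (fseek(fp, pos - 1, SEEK_SET) != 0 || (ch = fgetc(fp)) == EOF)
            return -1;
        if (digits == 0 && (ch == '\n' || ch == '\r')) {
            pos--;              /* still skipping trailing line terminators */
            continue;
        }
        if (!isdigit(ch)) {
            if (digits == 0)
                return -1;      /* last line does not end in a digit */
            if (ch == '-')
                value = -value;
            break;              /* reached the start of the number */
        }
        d = ch - '0';
        if (d != 0) {
            if (!place_ok || d > (INT_MAX - value) / place)
                return -1;      /* would overflow int */
            value += d * place;
        }
        if (place <= INT_MAX / 10)
            place *= 10;
        else
            place_ok = 0;       /* any further nonzero digit overflows */
        digits++;
        pos--;
    }
    if (digits == 0)
        return -1;
    *out = value;
    return 0;
}
```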
--
Flash Gordon
Living in interesting times.
Web site - http://home.flash-gordon.me.uk/
comp.lang.c posting guidelines and intro -
http://clc-wiki.net/wiki/Intro_to_clc

Feb 18 '06 #4

websnarf

Ry*****@gmail.com wrote:

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

The core C language is not really useful for performing this kind of
operation as others have posted. fseek and ftell are really inadequate
functions as they assume file sizes are always less than LONG_MAX which
doesn't make any sense on modern file systems. Many systems have 64
bit versions of these functions which makes a little more sense, but in
about 30 years we're going to wish people were just using intmax_t.

Anyhow, if we ignore that problem, and the other bizarre ANSI-ism that
you can't seek to some offset you haven't previously visited, what I
would suggest is the following: In a loop seek to offsets of -1, -2,
-4, -8, -16, -32, ... etc. Then read until the end (actually to avoid
redundancy you only need to read half as much except the first where
you should read a whole byte). Then scan for the last '\n' found. If
you find a '\n' then you know you've found the offset of the last line,
otherwise just go to the next offset. If the file turns out to be
only one line, or a read comes up short, you can detect this
and just read the whole file as the line. In total you should expect
to read less than about twice the length of the last line of
the file.

You could also just go backwards in fixed sized offsets (this would
tend to reduce the amount of over-reading), however, I would be
suspicious of the performance of fseek(). My own experience (on
Windows 98) is that the performance of fseek can be roughly as bad as
O(n), where n is the position you are seeking to. So the exponential
offset increasing will reduce this cost.
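
A sketch of the doubling search described above, under the same caveats: it assumes a binary-mode stream where arbitrary offsets can be sought, and the name last_line_offset is made up for this example. Each pass reads only the bytes not covered by the previous pass.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: returns the byte offset at which the last line of
   fp starts, or -1 on I/O failure.  A single newline that terminates the
   file is ignored so it is not mistaken for a line separator. */
long last_line_offset(FILE *fp)
{
    long size, end, limit, start, pos, found;
    long back = 1;
    int ch;

    assert(fp != NULL);

    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0)
        return -1;
    end = size;
    if (end > 0) {              /* ignore one terminating newline */
        if (fseek(fp, end - 1, SEEK_SET) != 0 || (ch = fgetc(fp)) == EOF)
            return -1;
        if (ch == '\n')
            end--;
    }

    limit = end;                /* bytes at or after limit are already scanned */
    for (;;) {
        start = back < end ? end - back : 0;
        if (fseek(fp, start, SEEK_SET) != 0)
            return -1;
        found = -1;
        for (pos = start; pos < limit; pos++) {
            if ((ch = fgetc(fp)) == EOF)
                return -1;
            if (ch == '\n')
                found = pos + 1;    /* keep the last newline in this chunk */
        }
        if (found >= 0)
            return found;
        if (start == 0)
            return 0;           /* no newline at all: one-line file */
        limit = start;
        back = back > LONG_MAX / 2 ? LONG_MAX : back * 2;
    }
}
```

After a successful call, fseek(fp, offset, SEEK_SET) followed by fscanf() or fgets() would read the last line.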

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Feb 18 '06 #5

Michael Mair

Ry*****@gmail.com schrieb:

Hello-

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

I assume that you actually can seek to the end of the file.
This is not a given.

Determine the number of digits of INT_MAX (or, if you admit
negative numbers, the maximum number of digits plus sign from
INT_MAX and INT_MIN), numlen. INT_MAX and INT_MIN are found in
<limits.h>. For portability, use a general mechanism.
Then seek back from the file end by numlen+2.
Read numlen+2 characters. The number you are looking for is
stored between the last and the second-to-last '\n' [1].
Find the second-to-last '\n', apply sscanf() or strtol() to
extract the number from behind this position.

[1] This assumes a portable file structure i.e. the file ending
with a '\n'.
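
The steps above could be sketched roughly as follows. The names max_int_chars and last_int_bounded are invented for this example, the 64-byte buffer is an assumption about how wide int can be, and the seek relative to the file end assumes a binary-mode stream on a system where that works.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Longest decimal text of an int: digits of INT_MAX plus one sign character. */
static int max_int_chars(void)
{
    int n = 1, v = INT_MAX;

    do {
        n++;
        v /= 10;
    } while (v != 0);
    return n;
}

/* Hypothetical helper: reads the last numlen+2 bytes of fp and parses the
   text between the second-to-last and last '\n'.  Returns 0 or -1. */
int last_int_bounded(FILE *fp, int *out)
{
    char buf[64], *end;
    long size, start, v;
    long want = max_int_chars() + 2L;   /* number + its '\n' + previous '\n' */
    size_t n, i;

    assert(fp != NULL && out != NULL);

    if (want >= (long)sizeof buf)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0)
        return -1;
    start = size > want ? size - want : 0;
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;
    n = fread(buf, 1, (size_t)(size - start), fp);
    if (n == 0 || buf[n - 1] != '\n')
        return -1;              /* footnote [1]: file must end with '\n' */

    i = n - 1;                  /* walk back to the second-to-last '\n' */
    while (i > 0 && buf[i - 1] != '\n')
        i--;
    if (i == 0 && start != 0)
        return -1;              /* last line is too long to be an int */

    buf[n - 1] = '\0';
    errno = 0;
    v = strtol(buf + i, &end, 10);
    if (end == buf + i || *end != '\0' || errno == ERANGE
        || v < INT_MIN || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}
```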
Cheers
Michael
--
E-Mail: Mine is an /at/ gmx /dot/ de address.

Feb 19 '06 #6

Michael Mair wrote:

Ry*****@gmail.com schrieb:
Hello-

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

I assume that you actually can seek to the end of the file.
This is not a given.

fseek(stream, 0, SEEK_END) is fine. Of course, not all
streams are seekable, and there's always the possibility of
I/O error.

Determine the number of digits of INT_MAX (or, if you admit
negative numbers, the maximum number of digits plus sign from
INT_MAX and INT_MIN), numlen. INT_MAX and INT_MIN are found in
<limits.h>. For portability, use a general mechanism.
Then seek back from the file end by numlen+2.

That's the bogus bit: fseek(stream, nonzero, SEEK_END)
and fseek(stream, nonzero, SEEK_CUR) have undefined behavior
on text streams (they violate a "shall"). They'll work on
many systems (just as INT_MAX+1 "works" on many systems), but
not on all. Worth a try, perhaps, but not portable.

--
Eric Sosman
es*****@acm-dot-org.invalid

Feb 19 '06 #7

Michael Mair

Eric Sosman schrieb:

Michael Mair wrote:
Ry*****@gmail.com schrieb:
Hello-

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

I assume that you actually can seek to the end of the file.
This is not a given.

fseek(stream, 0, SEEK_END) is fine. Of course, not all
streams are seekable, and there's always the possibility of
I/O error.

This is what I meant; thank you for clarifying.

Determine the number of digits of INT_MAX (or, if you admit
negative numbers, the maximum number of digits plus sign from
INT_MAX and INT_MIN), numlen. INT_MAX and INT_MIN are found in
<limits.h>. For portability, use a general mechanism.
Then seek back from the file end by numlen+2.

That's the bogus bit: fseek(stream, nonzero, SEEK_END)
and fseek(stream, nonzero, SEEK_CUR) have undefined behavior
on text streams (they violate a "shall"). They'll work on
many systems (just as INT_MAX+1 "works" on many systems), but
not on all. Worth a try, perhaps, but not portable.

Gah. I love C file semantics. Footnote 225) tells you that
fseek(stream, 0, SEEK_END) has UB for binary streams and
7.19.9.2#4 tells you, that fseek(stream, nonzero, whence)
works only if nonzero has been returned by ftell() and whence
is SEEK_SET.

Thank you for the correction.

-Michael
--
E-Mail: Mine is an /at/ gmx /dot/ de address.

Feb 19 '06 #8

Joe Wright

Eric Sosman wrote:

Michael Mair wrote:
Ry*****@gmail.com schrieb:
Hello-

I am trying to write a snippet which will open a text file with an
integer on each line. I would like to read the last integer in the
file. I am currently using:
file = fopen("f.txt", "r+");
fseek(file, -2, SEEK_END);
fscanf(file, "%d", &c);
this works fine if the integer is only a single character. When I get
into larger numbers though (e.g. 502) it only reads in the 2. Is there
anything I can do to read the last line as an entity instead of looping
through the entire file? Thanks in advance-

I assume that you actually can seek to the end of the file.
This is not a given.

fseek(stream, 0, SEEK_END) is fine. Of course, not all
streams are seekable, and there's always the possibility of
I/O error.

Determine the number of digits of INT_MAX (or, if you admit
negative numbers, the maximum number of digits plus sign from
INT_MAX and INT_MIN), numlen. INT_MAX and INT_MIN are found in
<limits.h>. For portability, use a general mechanism.
Then seek back from the file end by numlen+2.

That's the bogus bit: fseek(stream, nonzero, SEEK_END)
and fseek(stream, nonzero, SEEK_CUR) have undefined behavior
on text streams (they violate a "shall"). They'll work on
many systems (just as INT_MAX+1 "works" on many systems), but
not on all. Worth a try, perhaps, but not portable.

If it's a text file stream and all lines are terminated with '\n'
including the last one, we first trip through the file looking for '\n'
characters and recording the position (offset) of the next character.

FILE *fp = fopen("file.txt", "r");   /* check fp != NULL before use */
int ch;
long prev = 0, here = 0;   /* here starts at 0 so prev is never garbage */
while ((ch = fgetc(fp)) != EOF)
    if (ch == '\n') {
        prev = here;
        here = ftell(fp);
    }

At EOF, here is really the end of file and prev is the offset to the
previous (last) line.

fseek(fp, prev, SEEK_SET);

points you to it. Good luck.

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Feb 19 '06 #9

Joe Wright <jo********@comcast.net> writes:
[...]

If it's a text file stream and all lines are terminated with '\n'
including the last one, we first trip through the file looking for
'\n' characters and recording the position (offset) of the next
character.

FILE *fp = fopen("file.txt", "r");   /* check fp != NULL before use */
int ch;
long prev = 0, here = 0;   /* here starts at 0 so prev is never garbage */
while ((ch = fgetc(fp)) != EOF)
    if (ch == '\n') {
        prev = here;
        here = ftell(fp);
    }

At EOF, here is really the end of file and prev is the offset to the
previous (last) line.

fseek(fp, prev, SEEK_SET);

points you to it. Good luck.

That will work (assuming the file is seekable at all), but it requires
reading the entire file, which the OP was trying to avoid.

It records the position of the beginning of the last line, which will
let you re-read from that position, but it assumes that the file isn't
going to change; if you're assuming that, you might as well just read
the entire file and remember the last line.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Feb 19 '06 #10

Joe Wright

Keith Thompson wrote:

Joe Wright <jo********@comcast.net> writes:
[...]
If it's a text file stream and all lines are terminated with '\n'
including the last one, we first trip through the file looking for
'\n' characters and recording the position (offset) of the next
character.

FILE *fp = fopen("file.txt", "r");   /* check fp != NULL before use */
int ch;
long prev = 0, here = 0;   /* here starts at 0 so prev is never garbage */
while ((ch = fgetc(fp)) != EOF)
    if (ch == '\n') {
        prev = here;
        here = ftell(fp);
    }

At EOF, here is really the end of file and prev is the offset to the
previous (last) line.

fseek(fp, prev, SEEK_SET);

points you to it. Good luck.

That will work (assuming the file is seekable at all), but it requires
reading the entire file, which the OP was trying to avoid.

It records the position of the beginning of the last line, which will
let you re-read from that position, but it assumes that the file isn't
going to change; if you're assuming that, you might as well just read
the entire file and remember the last line.

Good morning Keith. Thank you for reading. Regardless what the OP was
trying to avoid, we must read the entire file. Agreed? Who or what might
change the file while I am reading it? Reading the file character at a
time simplifies things so I don't need to know how long a line is. We
can know how long the last line is by subtracting prev from here,
allocating space for it, backing up (fseek()) to prev and reading the line.

It's a beautiful bright Sunday afternoon in Arlington. Who'sit on the
Left Coast?

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Feb 19 '06 #11

Joe Wright wrote:

[...] Reading the file character at a
time simplifies things so I don't need to know how long a line is. We
can know how long the last line is by subtracting prev from here,
allocating space for it, backing up (fseek()) to prev and reading the line.

7.19.9.4/2: "[...] For a text stream, its file position
indicator contains unspecified information, [...] the difference
between two such return values is not necessarily a meaningful
measure of the number of characters written or read."

So, calculating a line length by subtracting two file
positions is no use. You'd need to count each character of the
line as you read it, starting a new count after each '\n' that
isn't the last character in the file.
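
A minimal sketch of that counting scheme (the function name last_line_length is made up for illustration): the count restarts only when another character actually follows a '\n', so a newline that ends the file does not reset it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads fp from its current position to end-of-file
   and returns the number of characters in the last line, not counting
   its '\n'.  Returns -1 on a read error. */
long last_line_length(FILE *fp)
{
    long count = 0;
    int ch, ended = 0;

    assert(fp != NULL);

    while ((ch = fgetc(fp)) != EOF) {
        if (ended) {            /* a character follows the newline */
            count = 0;
            ended = 0;
        }
        if (ch == '\n')
            ended = 1;
        else
            count++;
    }
    return ferror(fp) ? -1 : count;
}
```

The result counts characters as the program sees them, so it can be used to size a buffer even on systems where file positions are not byte counts.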

--
Eric Sosman
es*****@acm-dot-org.invalid

Feb 19 '06 #12

On 18 Feb 2006 14:33:32 -0800, in comp.lang.c , we******@gmail.com
wrote:

fseek and ftell are really inadequate
functions as they assume file sizes are always less than LONG_MAX which
doesn't make any sense on modern file systems. Many systems have 64
bit versions of these functions which makes a little more sense, but in
about 30 years we're going to wish people were just using intmax_t.

Anyhow, if we ignore that problem, and the other bizarre ANSI-ism that
you can't seek to some offset you haven't previously visited,

Y'know, your advice would be considerably more useful if you skipped
the pointless diatribe at the start and just got right into answering
the question. If you don't like C, why do you hang around here?

Mark McIntyre
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----

Feb 19 '06 #13

we******@gmail.com wrote:

[...]
Anyhow, if we ignore that problem, and the other bizarre ANSI-ism that
you can't seek to some offset you haven't previously visited
[... in a text stream ...]

Lots of people (including some who ought to know better)
are seduced by the notion that one can do arithmetic on the
"file position indicators" of text streams. In Standard C,
this is not reliable and leads to undefined behavior.

Some people who have learned the above profess that
Standard C is to blame for this sad situation. Such people
have had but limited experience of different systems' notions
of "file," and would do well to refrain from insulting those
whose knowledge is greater. It is better to remain silent and
risk suspicion of folly than to open one's mouth and confirm it.

Then again, there's always the faint possibility that the
Young Turks have a better idea than the Old Fogeys. If so,
let's hear about the better idea -- but let's hear about it
without gratuitous insults. We stand on the shoulders of
giants; let us refrain from making horsefly-buzzings in their
ears lest they become annoyed and swat us.

--
Eric Sosman
es*****@acm-dot-org.invalid

Feb 20 '06 #14

websnarf

Eric Sosman wrote:

we******@gmail.com wrote:
[...]
Anyhow, if we ignore that problem, and the other bizarre ANSI-ism that
you can't seek to some offset you haven't previously visited
[... in a text stream ...]

Lots of people (including some who ought to know better)
are seduced by the notion that one can do arithmetic on the
"file position indicators" of text streams. In Standard C,
this is not reliable and leads to undefined behavior.

Some people who have learned the above profess that
Standard C is to blame for this sad situation. Such people
have had but limited experience of different systems' notions
of "file," and would do well to refrain from insulting those
whose knowledge is greater. It is better to remain silent and
risk suspicion of folly than to open one's mouth and confirm it.

ANSI C contains two functions which satisfy this: fgetpos(), and
fsetpos(). Notice how they used a data type (fpos_t) that does not
imply any arithmetic capabilities?
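
For illustration, a sketch of how fgetpos() and fsetpos() could remember the start of the last line without any arithmetic on positions. The helper name seek_to_last_line is invented here, and the stream is assumed to be seekable.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: scans fp from the beginning and leaves the stream
   positioned at the start of its last line.  Returns 0 or -1. */
int seek_to_last_line(FILE *fp)
{
    fpos_t line_start, candidate;
    int ch, pending = 0;

    assert(fp != NULL);

    rewind(fp);
    if (fgetpos(fp, &line_start) != 0)
        return -1;
    candidate = line_start;
    while ((ch = fgetc(fp)) != EOF) {
        if (pending) {          /* a character follows the newline: new line */
            line_start = candidate;
            pending = 0;
        }
        if (ch == '\n') {
            if (fgetpos(fp, &candidate) != 0)
                return -1;
            pending = 1;
        }
    }
    if (ferror(fp))
        return -1;
    return fsetpos(fp, &line_start) == 0 ? 0 : -1;
}
```

The positions are only stored and handed back to fsetpos(), never added or subtracted, which is all that fpos_t allows.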

So you're saying people who see fseek/ftell which clearly uses "long
int" as the file position are not naturally supposed to assume they can
do file pointer arithmetic, especially in light of the knowledge that
functions with more restrictive semantics exist as separate functions?

I am well aware of systems that have a hard time with byte by byte file
offsets. One such obscure system is Windows 98. The difference is,
rather than simply punting and implementing some obscure kind of fseek,
they bit the bullet and did the obvious, but very slow, implementation
that at least behaved consistently.

Then again, there's always the faint possibility that the
Young Turks have a better idea than the Old Fogeys. If so,
let's hear about the better idea

Oh it may surprise you. The first thing I would start with is to stop
idolizing the original creators and/or the C standard, and recognize
when mistakes have been made.

You want an improvement? Not a problem:

#include <stdio.h>
#include <stdint.h>
int fseekB (FILE * fp, intmax_t offs);

The value offs is always taken as an offset (in bytes) from the
beginning of the file, if the offset does not refer to a valid file
position, -1 is returned and errno is set. No value of offs can lead
to UB so long as fp is a valid and active file.

#include <stdio.h>
#include <stdint.h>
intmax_t ftellB (FILE * fp);

The value returned is the exact offset in the file corresponding
to the next byte to be read/written. For a valid
non-empty file pointer fp, fseekB (fp, ftellB (fp)) is basically a
NO-OP. At EOF, the position returned is the length of the file. The
function properly observes ungetc(), and all other ordinary file
operations.
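
To make the proposal concrete, here is a rough approximation layered on the existing fseek() and ftell(). It is a sketch only: it is limited to offsets that fit in a long, it assumes a binary-mode stream on a system where seeking to the end works, and the choice of ERANGE for errno is an assumption, since the description above does not name a value.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of the proposed fseekB: seek to byte offs from the start of the
   file.  Returns 0 on success; returns -1 and leaves the position where
   it was if offs is negative or past the end of the file. */
int fseekB(FILE *fp, intmax_t offs)
{
    long saved, end;

    assert(fp != NULL);

    if (offs < 0 || offs > LONG_MAX) {
        errno = ERANGE;
        return -1;
    }
    if ((saved = ftell(fp)) < 0 || fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    end = ftell(fp);
    if (end < 0 || offs > end) {
        int out_of_range = (end >= 0);

        (void)fseek(fp, saved, SEEK_SET);   /* put the position back */
        if (out_of_range)
            errno = ERANGE;
        return -1;
    }
    return fseek(fp, (long)offs, SEEK_SET) == 0 ? 0 : -1;
}

/* Sketch of the proposed ftellB: the offset of the next byte, or -1. */
intmax_t ftellB(FILE *fp)
{
    assert(fp != NULL);
    return (intmax_t)ftell(fp);
}
```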

These functions can retain the property of UB, that the ANSI C
committee cherishes so much, if fp is not a valid open file.
Personally, I would prefer that errors be returned if fp is either NULL
or a *closed* fp.

What could be simpler?

If a system has problems implementing it, then that's too bad. The x86
and 68K world implemented entire floating point emulators just to be
compliant with the C standard, a little work on the part of others
won't kill them. And as Microsoft and others have demonstrated -- it's
not actually an impractical thing to implement.

If that doesn't work for you, then point me out one system where 1) it
would be impractical to implement this, and 2) which has a working C99
compiler on it. The marginal platforms are not going to be updated
beyond C89 anyways, so there is little sense in making provisions for
them.

-- but let's hear about it
without gratuitous insults. We stand on the shoulders of
giants; [...]

It's hard to *stand* if you are grovelling all the time.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Feb 20 '06 #15

CBFalconer

we******@gmail.com wrote:

Eric Sosman wrote:
.... snip ...
-- but let's hear about it without gratuitous insults. We stand
on the shoulders of giants; [...]

It's hard to *stand* if you are grovelling all the time.

When I was 15 I was amazed at the ignorance of my parents. When I
was 25 I was amazed how much they had learned in the past 10
years. You are, apparently, about 15.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

Feb 20 '06 #16

On 19 Feb 2006 21:20:00 -0800, in comp.lang.c , we******@gmail.com
wrote:

No value of offs can lead
to UB so long as fp is a valid and active file.

pipes too?

The value returned is the exact offset in the file
corresponding to the next byte to be read/written.

and you've tested this for sparse files, databases, etc? Files with
multiple read/write operations permitted? Files with lockable
sections?

These functions can retain the property of UB, that the ANSI C
committee cherishes so much,

Remarks like this merely make you look like a dickhead.

If a system has problems implementing it, then that's too bad.

Good plan, reduce the portability of C
<flame bait>?
to suit your own apparent inability to programme safely.
</bait>

-- but let's hear about it
without gratuitous insults. We stand on the shoulders of
giants; [...]

It's hard to *stand* if you are grovelling all the time.

And how about when you're so far up yourself that you haven't seen
daylight in weeks?

Mark McIntyre
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Feb 22 '06 #17

On 19 Feb 2006 21:20:00 -0800, in comp.lang.c , we******@gmail.com
wrote:

The first thing I would start with is to stop
idolizing the original creators and/or the C standard, and recognize
when mistakes have been made.

You again equate "mistake" with compromise, portability and consensus.
Of course, if C were owned by one designer that person could do whatever
they liked, eliminating any inconvenient platform or feature along the
way. But it's not. It's owned by a committee representing a wide range
of interests, and therefore has to reflect a wider consensus than your
narrow view.

Unfortunately yet again any contribution you might have had to make to
this group is marred by your bias.

Mark McIntyre
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Feb 22 '06 #18

Jordan Abel

On 2006-02-22, Mark McIntyre <ma**********@spamcop.net> wrote:

On 19 Feb 2006 21:20:00 -0800, in comp.lang.c , we******@gmail.com
wrote:
No value of offs can lead
to UB so long as fp is a valid and active file.

pipes too?

A "request that cannot be satisfied" results in a nonzero return, not
UB. It is arguable that this also applies to a call of fseek on a text
stream with a value that does not correspond to a position in the file
which ftell might have returned.

The value returned is the exact offset in the file
corresponding to the next byte to be read/written.

and you've tested this for sparse files, databases, etc? Files with
multiple read/write operations permitted? Files with lockable
sections?

Again, failure is not the same as UB. What is a specific case that you
think invokes UB?

Feb 22 '06 #19

Jordan Abel wrote On 02/22/06 14:37,:

On 2006-02-22, Mark McIntyre <ma**********@spamcop.net> wrote:
On 19 Feb 2006 21:20:00 -0800, in comp.lang.c , we******@gmail.com
wrote:

No value of offs can lead
to UB so long as fp is a valid and active file.

pipes too?

A "request that cannot be satisfied" results in a nonzero return, not
UB. It is arguable that this also applies to a call of fseek on a text
stream with a value that does not correspond to a position in the file
which ftell might have returned.

The value returned is the exact offset in the file
corresponding to the next byte to be read/written.

and you've tested this for sparse files, databases, etc? Files with
multiple read/write operations permitted? Files with lockable
sections?

Again, failure is not the same as UB. What is a specific case that you
think invokes UB?

Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes. There are also systems where writing a
newline produces no bytes in the file, systems where a file
contains both data bytes and metadata bytes, and systems that
use state-dependent encodings for extended character sets.

It's not so much a problem of U.B., but of failure that
doesn't produce a reliable indication. Seek to a position
that happens to be in the middle of a multi-byte character
or in the middle of a stretch of metadata, and the problem
may be difficult to detect: a byte in a file does not always
stand alone, but may require prior context (at an arbitrary
separation) for proper interpretation. Here's the stuff of
a nightmare or two: Imagine opening a stream for update,
seeking to the middle of a stretch of metadata, successfully
writing "Hello, world!" there, and only later discovering
that the successful write has corrupted the file structure
and made the entire tail end unreadable ...

It would be nice if one could do meaningful arithmetic on
file position indicators in text streams, but given the rich
variety of file encodings that exist in the world it is not
always possible to do so. The C Standard recognizes this
difficulty, and so does not attempt to guarantee that seeking
to arbitrary positions in text files will work as desired.
The Standard is cognizant of imperfections in reality, and
does not insist that reality rearrange itself for the Standard's
convenience.

--
Er*********@sun.com

Feb 22 '06 #20

Jordan Abel

On 2006-02-22, Eric Sosman <Er*********@sun.com> wrote:

Jordan Abel wrote On 02/22/06 14:37,:
On 2006-02-22, Mark McIntyre <ma**********@spamcop.net> wrote:
On 19 Feb 2006 21:20:00 -0800, in comp.lang.c , we******@gmail.com
wrote:
No value of offs can lead
to UB so long as fp is a valid and active file.

pipes too?

A "request that cannot be satisfied" results in a nonzero return, not
UB. It is arguable that this also applies to a call of fseek on a text
stream with a value that does not correspond to a position in the file
which ftell might have returned.

The value returned is the exact offset in the file
corresponding to the next byte to be read/written.

and you've tested this for sparse files, databases, etc? Files with
multiple read/write operations permitted? Files with lockable
sections?

Again, failure is not the same as UB. What is a specific case that you
think invokes UB?

Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes. There are also systems where writing a
newline produces no bytes in the file, systems where a file
contains both data bytes and metadata bytes, and systems that
use state-dependent encodings for extended character sets.

If you're dealing with something that might be a state-dependent
encoding, you should probably be using fgetpos and fsetpos
exclusively.
It's not so much a problem of U.B., but of failure that
doesn't produce a reliable indication. Seek to a position that
happens to be in the middle of a multi-byte character or in the
middle of a stretch of metadata, and the problem may be difficult
to detect: a byte in a file does not always stand alone, but may
require prior context (at an arbitrary separation) for proper
interpretation. Here's the stuff of a nightmare or two: Imagine
opening a stream for update, seeking to the middle of a stretch of
metadata, successfully writing "Hello, world!" there, and only
later discovering that the successful write has corrupted the file
structure and made the entire tail end unreadable ...

An implementation may silently force a file opened in update mode to
be a binary stream. An implementation that has such issues probably
should do so. (It would be nice if some way were provided for the
program to detect this, but unfortunately there does not seem to be.)

It would be nice if one could do meaningful arithmetic on file
position indicators in text streams, but given the rich variety of
file encodings that exist in the world it is not always possible
to do so.

There is a difference between "not meaningful" and "undefined" - I
am entirely opposed to the dilution of the term "undefined behavior"
in this newsgroup.

I think that the implementation should detect all those issues and
treat them as "a request that cannot be satisfied", and return a
value indicating failure. I think there is a reading of the standard
which supports this view.

Feb 22 '06 #21

Jordan Abel wrote On 02/22/06 15:54,:

On 2006-02-22, Eric Sosman <Er*********@sun.com> wrote:
Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes. There are also systems where writing a
newline produces no bytes in the file, systems where a file
contains both data bytes and metadata bytes, and systems that
use state-dependent encodings for extended character sets.

If you're dealing with something that might be a state-dependent
encoding, you should probably be using fgetpos and fsetpos
exclusively.

Right. And this means you can't do arithmetic on the
file positions of a text stream, because fpos_t need not
be an arithmetic type.
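
To illustrate the point, here is a minimal sketch of the portable way to
return to an earlier spot in a text stream: save an fpos_t with fgetpos
and hand it back unchanged to fsetpos. The helper name reread_line is
made up for this example. Note that the saved position is never added
to or subtracted from.

```c
#include <stdio.h>

/* Hypothetical helper: read the next line, jump back to where it
   started, and read it again. Returns 0 on success, -1 on failure. */
int reread_line(FILE *fp, char *buf, int size)
{
    fpos_t start;                       /* opaque: no arithmetic on it */

    if (fgetpos(fp, &start) != 0)       /* remember the current position */
        return -1;
    if (fgets(buf, size, fp) == NULL)   /* first read of the line */
        return -1;
    if (fsetpos(fp, &start) != 0)       /* go back to the saved position */
        return -1;
    if (fgets(buf, size, fp) == NULL)   /* second read of the same line */
        return -1;
    return 0;
}
```

After a successful call the stream is positioned just past the line that
was read, exactly as after a single fgets.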

It's not so much a problem of U.B., [...]

There is a difference between "not meaningful" and "undefined" - I
am entirely opposed to the dilution of the term "undefined behavior"
in this newsgroup.

We seem to be in violent agreement.

I think that the implementation should detect all those issues and
treat them as "a request that cannot be satisfied", and return a
value indicating failure. I think there is a reading of the standard
which supports this view.

I don't see how the issues can be detected, not with
any pretense of efficiency. One could get reliable detection
by implementing fseek() as read-and-count, perhaps preceded
by rewind(), but the result would be horrible. True, the
Standard doesn't promise efficiency, and an fseek() that
behaved this way would satisfy the letter of the Standard's
law. Equally, an fseek() that returned -1 unconditionally
would meet the letter of the law; so would a malloc() that
always returned NULL, a time() that always returned -1, and
a rand() that always returned 42. "It's just a quality of
implementation concern," but it's folly to ignore QoI.
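
For concreteness, a rough sketch of the read-and-count idea mentioned
above, written as an ordinary function rather than as a real fseek().
The name seek_by_counting is hypothetical, and the sketch assumes the
stream can be rewound; rewind() reports no error, so a non-seekable
stream is not detected here.

```c
#include <stdio.h>

/* Hypothetical helper: position fp so that the next read delivers
   character number n (counting from 0) as the program sees it.
   Returns 0 on success, -1 if the stream has fewer than n characters. */
int seek_by_counting(FILE *fp, long n)
{
    long i;

    rewind(fp);                     /* start over from the beginning */
    for (i = 0; i < n; i++) {
        if (getc(fp) == EOF)        /* ran out of data before reaching n */
            return -1;
    }
    return 0;
}
```

The cost is proportional to n on every call, which is the "horrible"
part: seeking near the end of a large file means reading almost all of
it.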

--
Er*********@sun.com

Feb 22 '06 #22

On 22 Feb 2006 19:37:42 GMT, in comp.lang.c , Jordan Abel
<ra*******@gmail.com> wrote:

Again, failure is not the same as UB. What is a specific case that you
think invokes UB?

You get a result saying you can write to byte 23456, but by the time
you try, the file no longer contains any bytes at that location. Or
some other thread has written to them already and locked them. In
such circumstances, Paul's variants on the standard functions are
better in that they probably avoid UB, but still not reliable.

Mark McIntyre
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan


Feb 22 '06 #23

websnarf

Eric Sosman wrote:

Jordan Abel wrote On 02/22/06 14:37,:
On 2006-02-22, Mark McIntyre <ma**********@spamcop.net> wrote:
On 19 Feb 2006 21:20:00 -0800, in comp.lang.c , we******@gmail.com
wrote:
No value of offs can lead
to UB so long as fp is a valid and active file.

pipes too?
A "request that cannot be satisfied" results in a nonzero return, not
UB.

Well, in my proposal the error return is specifically -1. I hadn't
considered file streams like stdin and stdout where clearly you can't
fseek, but obviously they would just return with -1 -- certainly not
UB.

[...] It is arguable that this also applies to a call of fseek on a text
stream with a value that does not correspond to a position in the file
which ftell might have returned.

My proposal is for two new functions fseekB and ftellB which are not
ftell or fseek compatible.

The value returned is the exact offset in the file corresponding to
the next byte to be read/written.

and you've tested this for sparse files, databases, etc? Files with
multiple read/write operations permitted? Files with lockable
sections?

Again, failure is not the same as UB. What is a specific case that you
think invokes UB?

Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes.

So what? If you read that back on Windows, you also get just one
character. What does this mean? It means that it has to count as 1
character (so long as you read the file in text mode.) It doesn't
count *underlying byte representation*, it counts offset in the units
of "characters" or whatever it is that is being written to the file.
[...] There are also systems where writing a
newline produces no bytes in the file, systems where a file
contains both data bytes and metadata bytes, and systems that
use state-dependent encodings for extended character sets.

Underlying file system details do not affect what I have specified. If
you put the contents of a file into an array, then that specifies an
offset-to-data mapping. That's the mapping you have to support. It's
not impossible, and it's not even very hard. Not if your system
supports faithful read-write turnaround, and fgetpos/fsetpos.

It's not so much a problem of U.B., but of failure that
doesn't produce a reliable indication. Seek to a position
that happens to be in the middle of a multi-byte character
or in the middle of a stretch of metadata,
How does that happen for a file opened in text mode?
[...] and the problem
may be difficult to detect: a byte in a file does not always
stand alone, but may require prior context (at an arbitrary
separation) for proper interpretation. Here's the stuff of
a nightmare or two: Imagine opening a stream for update,
seeking to the middle of a stretch of metadata, successfully
writing "Hello, world!" there, and only later discovering
that the successful write has corrupted the file structure
and made the entire tail end unreadable ...

Well, explain to me how that happens -- remember I am mapping from
offsets of the original data, as if it were all coming from an array, to
positions in the underlying file (that we know *exist* because of the
existence of the fgetpos and fsetpos functions). So what bad thing is
supposed to happen?

It would be nice if one could do meaningful arithmetic on
file position indicators in text streams,
You mean it's nice to know that it is well defined and possible. (You
need a good definition of intmax_t, of course.)
[...] but given the rich
variety of file encodings that exist in the world it is not
always possible to do so.
It might be slow, but it's always possible.
[...] The C Standard recognizes this
difficulty, and so does not attempt to guarantee that seeking
to arbitrary positions in text files will work as desired.
Even though it presents an API that clearly implies that it does.
The Standard is cognizant of imperfections in reality, and
does not insist that reality rearrange itself for the Standard's
convenience.

If that were a true and complete description of the standard that would
at least be a defensible and credible stance. But it's not. If they
took this stance, ftell() and fseek() would be gone, since
fgetpos/fsetpos already give you the weaker semantics.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Feb 27 '06 #24

we******@gmail.com writes:

Eric Sosman wrote:

[...]

> [...] It is arguable that this also applies to a call of fseek on a text
> stream with a value that does not correspond to a position in the file
> which ftell might have returned.
My proposal is for two new functions fseekB and ftellB which are not
ftell or fseek compatible.

[...] Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes.

So what? If you read that back on Windows, you also get just one
character. What does this mean? It means that it has to count as 1
character (so long as you read the file in text mode.) It doesn't
count *underlying byte representation*, it counts offset in the units
of "characters" or whatever it is that is being written to the file.

[...]

So something like
fseekB(some_file, 100000, SEEK_SET);
would, on some systems, actually have to read 1 million characters
from the file to find the proper position. On Windows, where an
end-of-line is represented in a text file as a CR-LF pair, there would
be no other way to find the 1 millionth character of the file
(counting each CR-LF pair as one character). On Unix, on the other hand,
it would simply be equivalent to
fseek(some_file, 1000000, SEEK_SET);
and would be much faster.

This might be conceptually cleaner than the existing fseek/ftell
interface, but I'm not convinced that it would be useful.
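
As a rough way to see when the two counts diverge, the following sketch
counts the characters a program receives from a file in a given mode.
The function name count_chars is an assumption for this example, not
part of any proposal in this thread.

```c
#include <stdio.h>

/* Count the characters delivered when reading `path` in the given
   mode ("r" for text, "rb" for binary).
   Returns -1 if the file cannot be opened. */
long count_chars(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    long n = 0;

    if (fp == NULL)
        return -1;
    while (getc(fp) != EOF)   /* each successful getc is one character */
        n++;
    fclose(fp);
    return n;
}
```

For a file written in text mode with fputs("a\nb\n", fp), the "r" count
is 4. The "rb" count is typically also 4 on Unix but 6 on Windows, and
that gap is why a character-counting seek would have to scan the file
there.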

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Feb 27 '06 #25

Keith Thompson wrote On 02/27/06 14:44,:

we******@gmail.com writes:
[replacements for fseek/ftell that count "delivered
characters" instead of "recorded bytes"]
[...]

So something like
fseekB(some_file, 100000, SEEK_SET);

Missing a zero, I think.
would, on some systems, actually have to read 1 million characters
from the file to find the proper position. On Windows, where an
end-of-line is represented in a text file as a CR-LF pair, there would
be no other way to find the 1 millionth character of the file
(counting each CR-LF pair as one character). On Unix, on the other hand,
it would simply be equivalent to
fseek(some_file, 1000000, SEEK_SET);
and would be much faster.

Not even Unix can do this efficiently in the presence
of variable-length or state-dependent character encodings.

--
Er*********@sun.com

Feb 27 '06 #26

Eric Sosman <Er*********@sun.com> writes:

Keith Thompson wrote On 02/27/06 14:44,:
we******@gmail.com writes:
[replacements for fseek/ftell that count "delivered
characters" instead of "recorded bytes"]

[...]

So something like
fseekB(some_file, 100000, SEEK_SET);

Missing a zero, I think.

Yes.

would, on some systems, actually have to read 1 million characters
from the file to find the proper position. On Windows, where an
end-of-line is represented in a text file as a CR-LF pair, there would
be no other way to find the 1 millionth character of the file
(counting each CR-LF pair as one character). On Unix, on the other hand,
it would simply be equivalent to
fseek(some_file, 1000000, SEEK_SET);
and would be much faster.

Not even Unix can do this efficiently in the presence
of variable-length or state-dependent character encodings.

Ok, but it can do so in their absence. (I suppose it's
locale-dependent?)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Feb 27 '06 #27

Walter Roberson

In article <dt**********@news1brm.Central.Sun.COM>,
Eric Sosman <Er*********@sun.com> wrote:

True, the
Standard doesn't promise efficiency, and an fseek() that
behaved this way would satisfy the letter of the Standard's
law. Equally, an fseek() that returned -1 unconditionally
would meet the letter of the law; so would a malloc() that
always returned NULL, a time() that always returned -1, and
a rand() that always returned 42.

In an implementation where rand() always returned 42,
RAND_MAX would be 42, but C89 requires RAND_MAX to be at
least 32767.

Now, if rand() always returned 32767, then -that- might be within
the letter of the standard ;-)

Let's see, how perverse could one get...? How about:
rand() returns 0 continually upon srand(0),
rand() returns RAND_MAX continually upon srand(RAND_MAX),
rand() returns 42 continually otherwise (including the
default case srand(1))
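
Spelled out as code, that scheme might look like the sketch below. The
names perverse_srand and perverse_rand are invented here so as not to
clash with the real library functions; RAND_MAX is whatever <stdlib.h>
defines.

```c
#include <stdlib.h>   /* RAND_MAX */

/* Seed starts at 1, as if srand(1) had been called. */
static unsigned perverse_seed = 1;

void perverse_srand(unsigned seed)
{
    perverse_seed = seed;
}

/* Returns the same value forever; which value depends only on the seed. */
int perverse_rand(void)
{
    if (perverse_seed == 0)
        return 0;
    if (perverse_seed == (unsigned)RAND_MAX)
        return RAND_MAX;
    return 42;          /* every other seed, including the default */
}
```

Every value returned lies in the range 0 to RAND_MAX, and the same seed
always reproduces the same sequence.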
--
There are some ideas so wrong that only a very intelligent person
could believe in them. -- George Orwell

Feb 27 '06 #28

Jordan Abel

On 2006-02-27, Walter Roberson <ro******@ibd.nrc-cnrc.gc.ca> wrote:

In article <dt**********@news1brm.Central.Sun.COM>,
Eric Sosman <Er*********@sun.com> wrote:
True, the
Standard doesn't promise efficiency, and an fseek() that
behaved this way would satisfy the letter of the Standard's
law. Equally, an fseek() that returned -1 unconditionally
would meet the letter of the law; so would a malloc() that
always returned NULL, a time() that always returned -1, and
a rand() that always returned 42.

In an implementation where rand() always returned 42,
RAND_MAX would be 42, but C89 requires RAND_MAX to be at
least 32767.

I don't think it's required that rand() ever return RAND_MAX.

Feb 27 '06 #29

Jordan Abel <ra*******@gmail.com> writes:

On 2006-02-27, Walter Roberson <ro******@ibd.nrc-cnrc.gc.ca> wrote:
In article <dt**********@news1brm.Central.Sun.COM>,
Eric Sosman <Er*********@sun.com> wrote:
True, the
Standard doesn't promise efficiency, and an fseek() that
behaved this way would satisfy the letter of the Standard's
law. Equally, an fseek() that returned -1 unconditionally
would meet the letter of the law; so would a malloc() that
always returned NULL, a time() that always returned -1, and
a rand() that always returned 42.

In an implementation where rand() always returned 42,
RAND_MAX would be 42, but C89 requires RAND_MAX to be at
least 32767.

I don't think it's required that rand() ever return RAND_MAX.

The statement in the standard is:

The rand function computes a sequence of pseudo-random integers in
the range 0 to RAND_MAX.

Whether a rand() implementation that never returns RAND_MAX would be
conforming is a question I'm not going to try to answer.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Feb 27 '06 #30

Keith Thompson wrote On 02/27/06 17:01,:

Jordan Abel <ra*******@gmail.com> writes:
On 2006-02-27, Walter Roberson <ro******@ibd.nrc-cnrc.gc.ca> wrote:

I don't think it's required that rand() ever return RAND_MAX.

The statement in the standard is:

The rand function computes a sequence of pseudo-random integers in
the range 0 to RAND_MAX.

Whether a rand() implementation that never returns RAND_MAX would be
conforming is a question I'm not going to try to answer.

Could a conforming program prove that rand() is unable
to return RAND_MAX? The number of samples required is a
function of the number of bits in rand()'s internal state,
and the Standard does not document that number.

--
Er*********@sun.com

Feb 27 '06 #31

Eric Sosman <Er*********@sun.com> writes:

Keith Thompson wrote On 02/27/06 17:01,:
Jordan Abel <ra*******@gmail.com> writes:
On 2006-02-27, Walter Roberson <ro******@ibd.nrc-cnrc.gc.ca> wrote:

I don't think it's required that rand() ever return RAND_MAX.

The statement in the standard is:

The rand function computes a sequence of pseudo-random integers in
the range 0 to RAND_MAX.

Whether a rand() implementation that never returns RAND_MAX would be
conforming is a question I'm not going to try to answer.

Could a conforming program prove that rand() is unable
to return RAND_MAX? The number of samples required is a
function of the number of bits in rand()'s internal state,
and the Standard does not document that number.

I'm guessing you meant "strictly conforming"; a "conforming program"
can do just about anything, since it's free to use extensions.

A rand() implementation that always repeatedly returns the same number
(perhaps a different number depending on the seed) could conceivably
be truly pseudo-random, but very unlucky. Even a truly random
sequence could contain the same number repeated an arbitrary number of
times, and there's no set number of repetitions that can prove that
it's non-random. It is possible to discuss the probability that a
given non-random-appearing sequence could have been generated, and
compare that to, say, the probability that the programmer who wrote
the rand() function forgot to update the seed.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Feb 27 '06 #32