ftell() arithmetic vs. text files read as binary

Hallvard B Furuseth

I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

In particular, do anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position, and one can
do arithmetic on ftell() positions within one line. I.e.:

- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.

--
Hallvard

Nov 20 '06 #1

Subscribe Reply

3068

Eric Sosman

Hallvard B Furuseth wrote:

I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

I can (dimly) recall some OpenVMS file formats that may have
violated some of your assumptions. Not too surprising: OpenVMS
had seven basic file formats, with variations -- and that was
just for the sequential file organization, never mind the others
that departed even further from C's I/O model. Text files would
almost always be sequential, though, so the other organizations
can probably be ignored.

Whether this affects the portability of your program depends
on the likelihood that you'll need to get it running on VMS. If
that likelihood is zero, then ...

In particular, do anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position, and one can
do arithmetic on ftell() positions within one line. I.e.:

- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

ISTR that on at least some VMS file formats, fseek() could
only position to the start of a line ("record") and hence ftell()
would return the same value all through a single line. This was
back in the pre-Standard days, though, and since this behavior
doesn't meet the requirements of the Standard (or so I believe),
it may have been fixed sometime in the many intervening years.
(Of course, the fix may simply have been a documentation change:
"Don't use XYZ format with C programs.")

- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

The VAR file format was <length, contentsor <length, contents,
padding byteto make an even total. I think the padding byte was
always a zero, but I don't remember whether that was guaranteed or
just "usual practice."

The VFC format was weirder: <length, prefix, contentsor
<length, prefix, contents, padding byte>. The "prefix" portion was
of fixed length (usually two bytes), and indicated "carriage control"
to be applied before and after "printing" the line: single-advance,
double-advance, skip to new page, and so on. On text-mode input,
the C library translated these by synthesizing LF's and FF's and
such before and after the "payload" of the line.

If you read any of these things in binary mode, you'd get the
raw, uninterpreted data: length, prefix, payload, and padding, as
one undifferentiated stream of bytes.

- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.

You might want to make that "an unsigned byte < 32."

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 20 '06 #2

Random832

2006-11-20 <hb**************@bombur.uio.no>,
Hallvard B Furuseth wrote:

I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

In particular, do anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position,

and one can do arithmetic on ftell() positions within one line.

one can _do_ arithmetic, perhaps... one isn't guaranteed to get
meaningful results, particularly with multibyte streams.

I.e.:

- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

Multibytes again

>
- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

If the file is record-oriented, it could plausibly instead bump it to
the next multiple of an arbitrarily large power of two [say, record
number and offset are separate fields]

Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.

Don't forget the extra zero-padding permitted at the end of binary files
(for systems where native file size is stored in units 1 byte)

Nov 20 '06 #3

Hallvard B Furuseth

Eric Sosman wrote:

>Hallvard B Furuseth wrote:
>I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

I can (dimly) recall some OpenVMS file formats that may have
violated some of your assumptions. Not too surprising: OpenVMS
had seven basic file formats, with variations -- and that was
just for the sequential file organization, never mind the others
that departed even further from C's I/O model. Text files would
almost always be sequential, though, so the other organizations
can probably be ignored.

Sounds interesting, I'll see if I can dig out some more info about that.

Whether this affects the portability of your program depends
on the likelihood that you'll need to get it running on VMS. If
that likelihood is zero, then ...

Low, but it's not unlikely that the program will meet _some_ esoteric
system. And what one system can do, others can do as well.

I think I'll downgrade my expectations a bit and instead ask:

Am I likely to encounter a system where acessing a text file in binary
mode will give me less headaches than ftell() arithemtic on a line in
a text-mode FILE*? I'm not about to support things like <length,
contents, paddinganyway. Sounds like the binary formats that will
break my "text mode assumptions" will break just as badly in binary
mode, which is a relief in a way:-)

In any case, I guess a user option which makes the program read the
file as a text file and save it to a tmpfile() would be a good idea.
Then it'll be the user's worry instead of mine...

ISTR that on at least some VMS file formats, fseek() could
only position to the start of a line ("record") and hence ftell()
would return the same value all through a single line. This was
back in the pre-Standard days, though, and since this behavior
doesn't meet the requirements of the Standard (or so I believe),

Correct. fgetc() "advances the associated file position indicator" in
both C89 and C99.

it may have been fixed sometime in the many intervening years.
(Of course, the fix may simply have been a documentation change:
"Don't use XYZ format with C programs.")

(...)
> - files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.

You might want to make that "an unsigned byte < 32."

Good point. But I think I'm currently hoping to drop binary mode and
stay with ftell() in text mode.

--
Hallvard

Nov 20 '06 #4

Hallvard B Furuseth

Random832 wrote:

>Hallvard B Furuseth wrote:
>In particular, do anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position,

and one can do arithmetic on ftell() positions within one line.

one can _do_ arithmetic, perhaps... one isn't guaranteed to get
meaningful results, particularly with multibyte streams.

As far as I know, streams are not multibyte unless I make them so.
C99 7.19.2p4 says: "Once a wide character input/output function has
been applied to a stream without orientation, the stream becomes a
wide-oriented stream."

Though it's a point, such a program can't be extended to handle
wide-oriented streams.

> - at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

If the file is record-oriented, it could plausibly instead bump it to
the next multiple of an arbitrarily large power of two [say, record
number and offset are separate fields]

True. I don't know of an example though?

>Or for binary mode FILE*s,
(...)
Don't forget the extra zero-padding permitted at the end of binary
files (for systems where native file size is stored in units 1 byte)

Good point.

--
Hallvard

Nov 20 '06 #5

Eric Sosman

Hallvard B Furuseth wrote On 11/20/06 13:18,:

>
Am I likely to encounter a system where acessing a text file in binary
mode will give me less headaches than ftell() arithemtic on a line in
a text-mode FILE*?

My (unscientific) feeling is that text files should be
read in text mode, to take advantage of whatever format
translation the system may need. But much depends on how
the program (ab)uses the ftell() arithmetic.

Can you offer some examples of the kinds of ftell()
arithmetic the program engages in? Are the jumps "short"
(intra-line) or "long" (inter-line)? Frequent or occasional?

--
Er*********@sun.com

Nov 20 '06 #6

Hallvard B Furuseth

Eric Sosman writes:

Hallvard B Furuseth wrote On 11/20/06 13:18,:
>Am I likely to encounter a system where acessing a text file in binary
mode will give me less headaches than ftell() arithemtic on a line in
a text-mode FILE*?

My (unscientific) feeling is that text files should be
read in text mode, to take advantage of whatever format
translation the system may need. But much depends on how
the program (ab)uses the ftell() arithmetic.

Can you offer some examples of the kinds of ftell()
arithmetic the program engages in? Are the jumps "short"
(intra-line) or "long" (inter-line)? Frequent or occasional?

Frankly I'm not entirely sure yet, but I think it can be reduced to
something like:
Walk through the file and save info about each character, with
index (ftell() position of line + character's index in line).
Next,
for (i = 0; i < {max ftell() position}; i++)
if (there is a character #i)
handle(getc());
I suppose that for loop can be changed to read line by line, but
that change looks a bit messy.

There are some ugly cases like fseek(arbitrary position) as well,
but I think they can be eliminated without too much fuss.

--
Hallvard

Nov 21 '06 #7

Eric Sosman

Hallvard B Furuseth wrote On 11/21/06 05:52,:

Eric Sosman writes:

>>Hallvard B Furuseth wrote On 11/20/06 13:18,:

>>>Am I likely to encounter a system where acessing a text file in binary
mode will give me less headaches than ftell() arithemtic on a line in
a text-mode FILE*?

My (unscientific) feeling is that text files should be
read in text mode, to take advantage of whatever format
translation the system may need. But much depends on how
the program (ab)uses the ftell() arithmetic.

Can you offer some examples of the kinds of ftell()
arithmetic the program engages in? Are the jumps "short"
(intra-line) or "long" (inter-line)? Frequent or occasional?

Frankly I'm not entirely sure yet, but I think it can be reduced to
something like:
Walk through the file and save info about each character, with
index (ftell() position of line + character's index in line).
Next,
for (i = 0; i < {max ftell() position}; i++)
if (there is a character #i)
handle(getc());
I suppose that for loop can be changed to read line by line, but
that change looks a bit messy.

The "walk through," I guess, is probably line by line?
(If it were character by character you could forget about
saving the intra-line index and just save each character's
ftell() position, then fseek() back to it. That would make
everything legitimate except the "max ftell() position"
calculation, which isn't guaranteed to make sense but very
likely will.)

But it looks like the arithmetic on ftell() values is
strictly within a line, right? That is, the loop looks
more like

for (i = 0; i < max; i++) {
if (something_about_position(i)) {
fseek(stream, ftellpos[i] + offset[i],
SEEK_SET);
ch = getc(stream);
...
}
}

If that's it, you may be out of the woods. Most crudely:

for (i = 0; i < max; i++) {
if (something_about_position(i)) {
fseek(stream, ftellpos[i], SEEK_SET);
for (j = 0; j < offset[i]; j++)
(void)getc(stream);
ch = getc(stream);
...
}
}

A slightly fancier version would remember what line it
was in and what the previous offset was, to avoid seeking
over and over again to the start of the same line and
getc()'ing past longer and longer prefixes.

There are some ugly cases like fseek(arbitrary position) as well,
but I think they can be eliminated without too much fuss.

Good luck!

--
Er*********@sun.com

Nov 21 '06 #8

Similar topics

4883

[Q] Text vs Binary Files

by: Eric | last post by:

Assume that disk space is not an issue (the files will be small < 5k in general for the purpose of storing preferences) Assume that transportation to another OS may never occur. Are there...

.NET Framework

4907

How to speed up ftell()/fseek()

by: Leslaw Bieniasz | last post by:

Hello, I am trying to fastly read large binary files (order of 100-200 MB) using ftell() and fseek(). My class gets a pointer to the data stored in the file, and then uses fseek() to access and...

C / C++

5100

integral promotion, arithmetic conversion, value preserving, unsigned preserving???

by: TTroy | last post by:

Hello, I'm relatively new to C and have gone through more than 4 books on it. None mentioned anything about integral promotion, arithmetic conversion, value preserving and unsigned preserving. ...

C / C++

2161

CR-NL, NL and ftell

by: Martin Johansen | last post by:

Hello When opening a CR-NL file, ftell returns the length of the file with the CR-NL as two bytes, is it supposed to do so? I am comparing two file-sizes, one CR-NL and one NL using ftell to...

C / C++

3532

How do I (fseek, ftell, stat)?

by: cedarson | last post by:

I am writing a program and have been instructeed to use the 'fseek', 'ftell', and 'stat' functions, however, after looking in the online manual for each of these, I am still unsure on how to use...

C / C++

3625

text and binary files confusion

by: joelagnel | last post by:

hi friends, i've been having this confusion for about a year, i want to know the exact difference between text and binary files. using the fwrite function in c, i wrote 2 bytes of integers in...

C / C++

5929

Text mode fseek/ftell

by: Kenneth Brody | last post by:

I recently ran into an "issue" related to text files and ftell/fseek, and I'd like to know if it's a bug, or simply an annoying, but still conforming, implementation. The platform is Windows,...

C / C++

2924

Question about change of "fp" in function "fseek"and "ftell"

by: Chen ShuSheng | last post by:

HI, I am now study a segment of codes: ------------------------ printf("%p\t",fp); /*add by me*/ fseek(fp, 0L, SEEK_END); /* go to end of file */ printf("%p\t",fp); ...

C / C++

3306

questions on ftell and fopen

by: subramanian100in | last post by:

Consider the following program: #include <stdio.h> #include <stdlib.h> int main(int argc, char *argv) { if (argc != 2) { printf("Usage: <program-name<text-file>\n");

C / C++

7342

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

7410

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

7067

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

5650

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

4729

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

C# / C Sharp

3215

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...

Networking - Hardware / Configuration

3201

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

1570

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...

C# / C Sharp

774

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP