ftell() arithmetic vs. text files read as binary

I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

In particular, does anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position, and one can
do arithmetic on ftell() positions within one line. I.e.:

- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.

--
Hallvard
Nov 20 '06 #1
Hallvard B Furuseth wrote:
> I'm trying to clean up a program which does arithmetic on text
> file positions, and also reads text files in binary mode. I
> can't easily get rid of it all, so I'm wondering which of the
> following assumptions are, well, least unportable.
I can (dimly) recall some OpenVMS file formats that may have
violated some of your assumptions. Not too surprising: OpenVMS
had seven basic file formats, with variations -- and that was
just for the sequential file organization, never mind the others
that departed even further from C's I/O model. Text files would
almost always be sequential, though, so the other organizations
can probably be ignored.

Whether this affects the portability of your program depends
on the likelihood that you'll need to get it running on VMS. If
that likelihood is zero, then ...
> In particular, does anyone know if there are real-life systems
> where the text file assumptions below don't hold?
>
> For text mode FILE*s,
>
> * input lines will be ordered by ftell() position, and one can
> do arithmetic on ftell() positions within one line. I.e.:
>
> - getc() adds 1 to the ftell() position, except possibly at
> the end of a line and EOF.
ISTR that on at least some VMS file formats, fseek() could
only position to the start of a line ("record") and hence ftell()
would return the same value all through a single line. This was
back in the pre-Standard days, though, and since this behavior
doesn't meet the requirements of the Standard (or so I believe),
it may have been fixed sometime in the many intervening years.
(Of course, the fix may simply have been a documentation change:
"Don't use XYZ format with C programs.")
> - at the end of a line, getc() increments the position with a
> small positive number. (Or moderately small, if the file
> consists of fixed-size space-padded line records.)
>
> Or for binary mode FILE*s,
>
> * getc() data looks like it does from a text mode FILE*, except:
>
> - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
> (Fails for fixed-size line records, I know. Or lines stored
> as <length, contents>, if there are such files around.)
The VAR file format was <length, contents> or <length, contents,
padding byte> to make an even total. I think the padding byte was
always a zero, but I don't remember whether that was guaranteed or
just "usual practice."

The VFC format was weirder: <length, prefix, contents> or
<length, prefix, contents, padding byte>. The "prefix" portion was
of fixed length (usually two bytes), and indicated "carriage control"
to be applied before and after "printing" the line: single-advance,
double-advance, skip to new page, and so on. On text-mode input,
the C library translated these by synthesizing LF's and FF's and
such before and after the "payload" of the line.

If you read any of these things in binary mode, you'd get the
raw, uninterpreted data: length, prefix, payload, and padding, as
one undifferentiated stream of bytes.
> - files end at EOF or with ^Z (yuck). Or maybe that should be
> "a byte < 32 for which isspace()==0". I can assume ASCII or
> a superset, otherwise the file must be preprocessed anyway.
You might want to make that "an unsigned byte < 32."
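A minimal sketch of why "unsigned" matters here, assuming ASCII or a superset as in the original post. The helper names are hypothetical. getc() already returns the byte as an unsigned char value converted to int, but a byte stored in a plain char buffer may be negative on systems where char is signed, so it has to be converted first:

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: nonzero if byte value c (0..255, as returned by
 * getc()) is a control code below 32 that isspace() does not accept.
 * Assumes ASCII or a superset. */
static int is_odd_control_byte(int c)
{
    assert(c >= 0 && c <= 255);
    return c < 32 && !isspace(c);
}

/* Same test for a byte taken from a plain char buffer. Convert through
 * unsigned char first: plain char may be signed, and a byte such as
 * 0xE9 would otherwise be a negative value that compares below 32. */
static int is_odd_control_char(char ch)
{
    return is_odd_control_byte((unsigned char)ch);
}
```

With these, ^Z (0x1A) is flagged, '\n' is not, and a high byte such as 0xE9 is not mistaken for a control code.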

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 20 '06 #2
2006-11-20 <hb************ **@bombur.uio.n o>,
Hallvard B Furuseth wrote:
> I'm trying to clean up a program which does arithmetic on text
> file positions, and also reads text files in binary mode. I
> can't easily get rid of it all, so I'm wondering which of the
> following assumptions are, well, least unportable.
>
> In particular, does anyone know if there are real-life systems
> where the text file assumptions below don't hold?
>
> For text mode FILE*s,
>
> * input lines will be ordered by ftell() position,
>
> and one can do arithmetic on ftell() positions within one line.
one can _do_ arithmetic, perhaps... one isn't guaranteed to get
meaningful results, particularly with multibyte streams.
> I.e.:
> - getc() adds 1 to the ftell() position, except possibly at
> the end of a line and EOF.
Multibytes again
>
> - at the end of a line, getc() increments the position with a
> small positive number. (Or moderately small, if the file
> consists of fixed-size space-padded line records.)
If the file is record-oriented, it could plausibly instead bump it to
the next multiple of an arbitrarily large power of two [say, record
number and offset are separate fields]
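To make that scenario concrete, here is a purely hypothetical encoding in which a position packs a record number and an in-record offset into one long. No system in this thread is claimed to work this way; the 16-bit offset width and all names are assumptions for illustration:

```c
#include <assert.h>

/* Assumption: the low 16 bits hold the offset within a record and the
 * remaining bits hold the record number. */
#define OFFSET_BITS 16
#define OFFSET_SPAN (1L << OFFSET_BITS)

/* Build a position value from a record number and an offset. */
static long pack_pos(long record, long offset)
{
    assert(record >= 0 && offset >= 0 && offset < OFFSET_SPAN);
    return record * OFFSET_SPAN + offset;
}

/* Recover the two fields from a packed position. */
static long pos_record(long pos) { return pos / OFFSET_SPAN; }
static long pos_offset(long pos) { return pos % OFFSET_SPAN; }
```

Within one record, adding 1 to such a position still moves one character forward, but stepping from offset 79 of record 0 to the start of record 1 changes the value by 65457, which is neither small nor related to the line length.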
> Or for binary mode FILE*s,
>
> * getc() data looks like it does from a text mode FILE*, except:
>
> - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
> (Fails for fixed-size line records, I know. Or lines stored
> as <length, contents>, if there are such files around.)
>
> - files end at EOF or with ^Z (yuck). Or maybe that should be
> "a byte < 32 for which isspace()==0". I can assume ASCII or
> a superset, otherwise the file must be preprocessed anyway.
Don't forget the extra zero-padding permitted at the end of binary files
(for systems where native file size is stored in units > 1 byte)
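One way a binary-mode reader might cope with that padding is to treat trailing zero bytes as not part of the text. A rough sketch with a hypothetical function name; it cannot tell padding from zero bytes that were genuinely written at the end of the file:

```c
#include <assert.h>
#include <stdio.h>

/* Reads fp from its current position to EOF and returns the number of
 * bytes up to and including the last nonzero byte, so any run of
 * trailing zero bytes is not counted. */
static long length_without_nul_padding(FILE *fp)
{
    long count = 0;
    long last_nonzero = 0;
    int c;

    assert(fp != NULL);
    while ((c = getc(fp)) != EOF) {
        count++;
        if (c != 0)
            last_nonzero = count;
    }
    return last_nonzero;
}
```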
Nov 20 '06 #3
Eric Sosman wrote:
> Hallvard B Furuseth wrote:
>> I'm trying to clean up a program which does arithmetic on text
>> file positions, and also reads text files in binary mode. I
>> can't easily get rid of it all, so I'm wondering which of the
>> following assumptions are, well, least unportable.
>
> I can (dimly) recall some OpenVMS file formats that may have
> violated some of your assumptions. Not too surprising: OpenVMS
> had seven basic file formats, with variations -- and that was
> just for the sequential file organization, never mind the others
> that departed even further from C's I/O model. Text files would
> almost always be sequential, though, so the other organizations
> can probably be ignored.
Sounds interesting, I'll see if I can dig out some more info about that.
> Whether this affects the portability of your program depends
> on the likelihood that you'll need to get it running on VMS. If
> that likelihood is zero, then ...
Low, but it's not unlikely that the program will meet _some_ esoteric
system. And what one system can do, others can do as well.

I think I'll downgrade my expectations a bit and instead ask:

Am I likely to encounter a system where accessing a text file in binary
mode will give me less headaches than ftell() arithmetic on a line in
a text-mode FILE*? I'm not about to support things like <length,
contents, padding> anyway. Sounds like the binary formats that will
break my "text mode assumptions" will break just as badly in binary
mode, which is a relief in a way:-)

In any case, I guess a user option which makes the program read the
file as a text file and save it to a tmpfile() would be a good idea.
Then it'll be the user's worry instead of mine...
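That fallback could look roughly like the sketch below. The function name is hypothetical, and the caller is assumed to have opened the input in text mode. tmpfile() creates a binary update stream, so positions in the copy are plain character counts from the start and ordinary arithmetic on them is well defined:

```c
#include <assert.h>
#include <stdio.h>

/* Copies everything from `in` (from its current position to EOF) into
 * a temporary file and returns that file rewound to the start, or NULL
 * on failure. The text-mode translation has already happened by the
 * time the characters reach the copy. */
static FILE *copy_to_tmpfile(FILE *in)
{
    FILE *tmp;
    int c;

    assert(in != NULL);
    tmp = tmpfile();
    if (tmp == NULL)
        return NULL;
    while ((c = getc(in)) != EOF) {
        if (putc(c, tmp) == EOF) {
            fclose(tmp);
            return NULL;
        }
    }
    if (ferror(in) || fflush(tmp) == EOF) {
        fclose(tmp);
        return NULL;
    }
    rewind(tmp);
    return tmp;
}
```

The rest of the program would then work on the returned stream instead of the user's file.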
> ISTR that on at least some VMS file formats, fseek() could
> only position to the start of a line ("record") and hence ftell()
> would return the same value all through a single line. This was
> back in the pre-Standard days, though, and since this behavior
> doesn't meet the requirements of the Standard (or so I believe),
Correct. fgetc() "advances the associated file position indicator" in
both C89 and C99.
> it may have been fixed sometime in the many intervening years.
> (Of course, the fix may simply have been a documentation change:
> "Don't use XYZ format with C programs.")
(...)
>> - files end at EOF or with ^Z (yuck). Or maybe that should be
>> "a byte < 32 for which isspace()==0". I can assume ASCII or
>> a superset, otherwise the file must be preprocessed anyway.
>
> You might want to make that "an unsigned byte < 32."
Good point. But I think I'm currently hoping to drop binary mode and
stay with ftell() in text mode.

--
Hallvard
Nov 20 '06 #4
Random832 wrote:
> Hallvard B Furuseth wrote:
>> In particular, does anyone know if there are real-life systems
>> where the text file assumptions below don't hold?
>>
>> For text mode FILE*s,
>>
>> * input lines will be ordered by ftell() position,
>>
>> and one can do arithmetic on ftell() positions within one line.
>
> one can _do_ arithmetic, perhaps... one isn't guaranteed to get
> meaningful results, particularly with multibyte streams.
As far as I know, streams are not multibyte unless I make them so.
C99 7.19.2p4 says: "Once a wide character input/output function has
been applied to a stream without orientation, the stream becomes a
wide-oriented stream."
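A program that wants to check this rather than rely on it can query the orientation with fwide() from <wchar.h>; a mode argument of 0 asks without changing anything. A small sketch, where the wrapper name is hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Returns > 0 if fp is wide-oriented, < 0 if it is byte-oriented, and
 * 0 if no orientation has been set yet. Passing mode 0 to fwide() only
 * queries the stream. */
static int stream_orientation(FILE *fp)
{
    assert(fp != NULL);
    return fwide(fp, 0);
}
```

A freshly opened stream reports 0; after the first byte I/O call such as fputs() or getc() it reports a negative value.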

Though it's a point, such a program can't be extended to handle
wide-oriented streams.
>> - at the end of a line, getc() increments the position with a
>> small positive number. (Or moderately small, if the file
>> consists of fixed-size space-padded line records.)

> If the file is record-oriented, it could plausibly instead bump it to
> the next multiple of an arbitrarily large power of two [say, record
> number and offset are separate fields]
True. I don't know of an example though?
>> Or for binary mode FILE*s,
(...)
> Don't forget the extra zero-padding permitted at the end of binary
> files (for systems where native file size is stored in units > 1 byte)
Good point.

--
Hallvard
Nov 20 '06 #5


Hallvard B Furuseth wrote On 11/20/06 13:18,:
>
> Am I likely to encounter a system where accessing a text file in binary
> mode will give me less headaches than ftell() arithmetic on a line in
> a text-mode FILE*?
My (unscientific) feeling is that text files should be
read in text mode, to take advantage of whatever format
translation the system may need. But much depends on how
the program (ab)uses the ftell() arithmetic.

Can you offer some examples of the kinds of ftell()
arithmetic the program engages in? Are the jumps "short"
(intra-line) or "long" (inter-line)? Frequent or occasional?

--
Er*********@sun .com

Nov 20 '06 #6
Eric Sosman writes:
> Hallvard B Furuseth wrote On 11/20/06 13:18,:
>> Am I likely to encounter a system where accessing a text file in binary
>> mode will give me less headaches than ftell() arithmetic on a line in
>> a text-mode FILE*?
>
> My (unscientific) feeling is that text files should be
> read in text mode, to take advantage of whatever format
> translation the system may need. But much depends on how
> the program (ab)uses the ftell() arithmetic.
>
> Can you offer some examples of the kinds of ftell()
> arithmetic the program engages in? Are the jumps "short"
> (intra-line) or "long" (inter-line)? Frequent or occasional?
Frankly I'm not entirely sure yet, but I think it can be reduced to
something like:
    Walk through the file and save info about each character, with
    index (ftell() position of line + character's index in line).
    Next,
      for (i = 0; i < {max ftell() position}; i++)
        if (there is a character #i)
          handle(getc());
I suppose that for loop can be changed to read line by line, but
that change looks a bit messy.

There are some ugly cases like fseek(arbitrary position) as well,
but I think they can be eliminated without too much fuss.

--
Hallvard
Nov 21 '06 #7


Hallvard B Furuseth wrote On 11/21/06 05:52,:
> Eric Sosman writes:
>> Hallvard B Furuseth wrote On 11/20/06 13:18,:
>>> Am I likely to encounter a system where accessing a text file in binary
>>> mode will give me less headaches than ftell() arithmetic on a line in
>>> a text-mode FILE*?
>>
>> My (unscientific) feeling is that text files should be
>> read in text mode, to take advantage of whatever format
>> translation the system may need. But much depends on how
>> the program (ab)uses the ftell() arithmetic.
>>
>> Can you offer some examples of the kinds of ftell()
>> arithmetic the program engages in? Are the jumps "short"
>> (intra-line) or "long" (inter-line)? Frequent or occasional?
>
> Frankly I'm not entirely sure yet, but I think it can be reduced to
> something like:
>     Walk through the file and save info about each character, with
>     index (ftell() position of line + character's index in line).
>     Next,
>       for (i = 0; i < {max ftell() position}; i++)
>         if (there is a character #i)
>           handle(getc());
> I suppose that for loop can be changed to read line by line, but
> that change looks a bit messy.
The "walk through," I guess, is probably line by line?
(If it were character by character you could forget about
saving the intra-line index and just save each character's
ftell() position, then fseek() back to it. That would make
everything legitimate except the "max ftell() position"
calculation, which isn't guaranteed to make sense but very
likely will.)
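A sketch of that character-by-character variant: record the ftell() value in front of every character, then only ever fseek() to values that ftell() actually returned. The function names and the growable array are assumptions for illustration, not taken from the program under discussion:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads fp to EOF and returns a malloc'd array holding the ftell()
 * position in front of each character; the count goes to *n.
 * Returns NULL on allocation or ftell() failure. */
static long *record_positions(FILE *fp, size_t *n)
{
    size_t cap = 64, len = 0;
    long *pos, *grown;
    long here;

    assert(fp != NULL && n != NULL);
    pos = malloc(cap * sizeof *pos);
    if (pos == NULL)
        return NULL;
    for (;;) {
        here = ftell(fp);
        if (here == -1L) {
            free(pos);
            return NULL;
        }
        if (getc(fp) == EOF)
            break;
        if (len == cap) {
            grown = realloc(pos, 2 * cap * sizeof *pos);
            if (grown == NULL) {
                free(pos);
                return NULL;
            }
            pos = grown;
            cap *= 2;
        }
        pos[len++] = here;
    }
    *n = len;
    return pos;
}

/* Re-reads character number i by seeking to its saved position. */
static int char_at(FILE *fp, const long *pos, size_t i)
{
    if (fseek(fp, pos[i], SEEK_SET) != 0)
        return EOF;
    return getc(fp);
}
```

The cost is one long per character, which may or may not be acceptable for the files involved.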

But it looks like the arithmetic on ftell() values is
strictly within a line, right? That is, the loop looks
more like

    for (i = 0; i < max; i++) {
        if (something_about_position(i)) {
            fseek(stream, ftellpos[i] + offset[i],
                  SEEK_SET);
            ch = getc(stream);
            ...
        }
    }

If that's it, you may be out of the woods. Most crudely:

    for (i = 0; i < max; i++) {
        if (something_about_position(i)) {
            fseek(stream, ftellpos[i], SEEK_SET);
            for (j = 0; j < offset[i]; j++)
                (void)getc(stream);
            ch = getc(stream);
            ...
        }
    }

A slightly fancier version would remember what line it
was in and what the previous offset was, to avoid seeking
over and over again to the start of the same line and
getc()'ing past longer and longer prefixes.
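One possible shape for that fancier version, as a sketch only. The struct and function names are hypothetical, and it assumes nothing else moves the stream's position between calls:

```c
#include <assert.h>
#include <stdio.h>

/* Remembers which line the stream is currently inside and how far
 * into that line it has read. */
struct line_cursor {
    FILE *fp;
    long line_start;   /* ftell() value of the current line, -1 if unknown */
    long next_offset;  /* index in that line of the next getc() result */
};

static void cursor_init(struct line_cursor *cur, FILE *fp)
{
    assert(cur != NULL && fp != NULL);
    cur->fp = fp;
    cur->line_start = -1L;
    cur->next_offset = 0;
}

/* Returns the character `offset` characters into the line whose saved
 * ftell() position is line_start, or EOF on failure. Seeks only when
 * the request is for another line or lies behind the current spot. */
static int cursor_get(struct line_cursor *cur, long line_start, long offset)
{
    int ch;

    assert(cur != NULL && offset >= 0);
    if (cur->line_start != line_start || offset < cur->next_offset) {
        if (fseek(cur->fp, line_start, SEEK_SET) != 0) {
            cur->line_start = -1L;
            return EOF;
        }
        cur->line_start = line_start;
        cur->next_offset = 0;
    }
    while (cur->next_offset < offset) {   /* skip forward by reading */
        if (getc(cur->fp) == EOF) {
            cur->line_start = -1L;
            return EOF;
        }
        cur->next_offset++;
    }
    ch = getc(cur->fp);
    if (ch == EOF) {
        cur->line_start = -1L;
        return EOF;
    }
    cur->next_offset++;
    return ch;
}
```

In the loop above, the fseek() and the inner skip loop would collapse into one call such as cursor_get(&cur, ftellpos[i], offset[i]). Only saved ftell() values are ever passed to fseek(), so no arithmetic is done on them.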
> There are some ugly cases like fseek(arbitrary position) as well,
> but I think they can be eliminated without too much fuss.
Good luck!

--
Er*********@sun .com

Nov 21 '06 #8
