Binary or Ascii Text?

Hi, everyone. I have a question. How can I identify whether a file is a
binary file or an ASCII text file? For instance, I wrote a piece of
code and saved it as "Test.c". I knew it was an ASCII text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I ran the compilation myself, but how can I make the
computer know the type of a given file by writing code in C? Files are
all saved as 0's and 1's. What's the difference?

Please help me, thanks.

Mar 31 '06 #1
"Claude Yih" <wi******@gmail.com> writes:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?


There is no general solution to this. Many systems don't actually
distinguish between text and binary files; a text file is just a file
that happens to consist of printable characters -- and what's
considered a printable character can vary. You can also look at line
terminators (ASCII LF on Unix-like systems, an ASCII CR-LF sequence on
Windows-like systems, possibly something completely different
elsewhere).

<OT>Unix-like systems have a command called "file" that attempts to
classify a file based on its contents.</OT>

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 31 '06 #2
"Claude Yih" writes:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?


The best you can do is make a guess. The first 32 characters of ASCII are
control codes, and only a few of them (CR, LF, FF, HT (tab), ...) are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.

But note that this is not what is meant when C programmers discuss text vs.
binary, for example in some of the file functions. What is referred to
there is the distinction between two ways of handling end of lines. Is an
end of line marked by a single character (LF) or two characters <CR><LF>?
Unix uses only the LF to mark end of line, so the distinction is
meaningless there. Systems that use <CR><LF> or <LF><CR> have to examine the
stream and convert the two characters into one, called '\n'. So '\n' is
really <LF>.

When you open a file in binary mode, you are telling the world: Hey, you
there, keep your cotton-picking hands off this file.
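
To illustrate the kind of translation a text-mode read performs on a <CR><LF> system, here is a small sketch that collapses CR LF pairs in a buffer. It is not how any particular C library implements text mode; the function name is made up for the example, and the byte values 0x0D and 0x0A assume ASCII.

```c
#include <stddef.h>

/* Collapses every CR LF pair in buf to a single LF, in place.
   Returns the new length. A lone CR is left untouched. */
size_t crlf_to_lf(unsigned char *buf, size_t len)
{
    size_t in, out = 0;

    for (in = 0; in < len; in++) {
        if (buf[in] == 0x0D && in + 1 < len && buf[in + 1] == 0x0A)
            continue;           /* drop the CR; the LF is copied next */
        buf[out++] = buf[in];
    }
    return out;
}
```

A file opened in binary mode skips this step entirely, which is why the bytes you read then match the bytes on disk.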
Mar 31 '06 #3
"Claude Yih" <wi******@gmail.com> wrote:
# Hi, everyone. I got a question. How can I identify whether a file is a
# binary file or an ascii text file? For instance, I wrote a piece of
# code and saved as "Test.c". I knew it was an ascii text file. Then
# after compilation, I got a "Test" file and it was a binary executable
# file. The problem is, I know the type of those two files in my mind
# because I executed the process of compilation, but how can I make the
# computer know the type of a given file by writing code in C? Files are

As far as stdio is concerned, a binary file is what you get if you
include a "b" in the open mode; otherwise it's text mode. Binary and
text streams may differ in how end-of-line indicators are handled and
in how fseek offsets are interpreted. (In Unix, stdio treats binary and
text files identically.)

--
SM Ryan http://www.rawbw.com/~wyrmwif/
GERBILS
GERBILS
GERBILS
Mar 31 '06 #4
osmium writes:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.


Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Since ASCII text uses only 7 bits of each byte and the high bit is
always 0, I wonder if I could write a program that reads a fixed-length
prefix of a given file (for example, the first 1024 bytes) into an
unsigned char array and checks whether any element is greater than
0x7F. If so, the file can be judged to be a binary file.

However, the disadvantage of the above method is that it cannot handle
multi-byte characters. Take UTF-8 Japanese text for example: a
Japanese character may be encoded as three bytes, all of which are
greater than 0x7F. In that case, my method will make no
sense.
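
The idea above could be sketched roughly like this. The function names and the -1 error convention are assumptions made for the example; the 1024-byte prefix is the figure from the post.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if every byte in buf is 7-bit (<= 0x7F), else 0. */
int is_seven_bit(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            return 0;
    return 1;
}

/* Reads up to 1024 bytes from path in binary mode and applies the test.
   Returns 1 (all 7-bit), 0 (some byte above 0x7F) or -1 (cannot open). */
int file_prefix_is_seven_bit(const char *path)
{
    unsigned char buf[1024];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return is_seven_bit(buf, n);
}
```

Note that a result of 1 only says the prefix contains no high-bit bytes; it says nothing about control codes or about the rest of the file.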

Mar 31 '06 #5
"Claude Yih" writes:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are
present
in text files. So if you have quite a few of the other 25 or so codes, it
is
probably not a text file - but it's only an educated guess, no real proof.


Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.

However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F. In that case, my method will make no
sense.

It doesn't work, but it has nothing to do with UTF-8. It is the problem of
proving a negative. How many white crows are there? AFAIK no one has ever
*seen* a white crow. What does that prove? Your guess is not as good as
the guess I implicitly proposed.
Mar 31 '06 #6
Claude Yih wrote:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?

Please help me, thanks.

As others have said, this is an essentially arbitrary decision on your
part. Here we've standardized on a definition of "binary" that means:

Any line is longer than X bytes (where "X" is some arbitrarily high number,
like 16 or 23k), or any character within the file is \0 (NUL).

A line is defined as data between newlines (normalized to '\n').

Everything else fits into a reasonable notion of OEM or ANSI charset,
with some caveats.

Again, this is specific to application requirements. Your requirements
may vary.
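
A rough sketch of a rule like that one, assuming the buffer's newlines have already been normalized to a single LF byte. The function name and the max_line parameter (the "X" above) are hypothetical.

```c
#include <stddef.h>

/* Returns 1 if buf is "binary" under the rule above: it contains a
   NUL byte, or some line is longer than max_line bytes. Lines are
   the runs of bytes between '\n' characters. */
int is_binary_by_rule(const unsigned char *buf, size_t len, size_t max_line)
{
    size_t i, line_len = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\0')
            return 1;           /* NUL never appears in our "text" */
        if (buf[i] == '\n') {
            line_len = 0;       /* a new line starts after the newline */
        } else if (++line_len > max_line) {
            return 1;           /* line too long to be plausible text */
        }
    }
    return 0;
}
```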
Mar 31 '06 #7
"osmium" <r1********@comcast.net> writes:
"Claude Yih" writes:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are
present
in text files. So if you have quite a few of the other 25 or so codes, it
is
probably not a text file - but it's only an educated guess, no real proof.


Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.

However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F. In that case, my method will make no
sense.

It doesn't work, but it has nothing to do with UTF-8. It is the problem of
proving a negative. How many white crows are there? AFAIK no one has ever
*seen* a white crow. What does that prove? Your guess is not as good as
the guess I implicitly proposed.


The quoting is completely messed up. The paragraph starting with "The
best you can do" was written by osmium, the next three paragraphs
were written by Claude Yih, and the last paragraph, starting with "It
doesn't work", was written by osmium (who usually gets this stuff
right).

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 31 '06 #8
"Claude Yih" <wi******@gmail.com> writes:
osmium writes:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.
Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.


I think that's fairly close to what the Unix "file" command does.
(Versions of the command are available as open source; see
<ftp://ftp.astron.com/pub/file/>.)

As mentioned above, you should also check for control characters.
However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F. In that case, my method will make no
sense.


Multi-byte characters aren't the only problem. ISO-8859-1 is an
extension of ASCII that uses codes from 161 to 255 for printable
characters (there are several ISO-8859-N standards).
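
One partial way to tell the two cases apart is to check whether the high-bit bytes form well-formed UTF-8 sequences; ISO-8859-1 text with accented letters usually does not. The validator below is a sketch under that assumption, with a made-up function name. Passing it does not prove a file is text, only that its bytes are consistent with UTF-8.

```c
#include <stddef.h>

/* Returns 1 if buf is structurally valid UTF-8, else 0. Rejects
   overlong forms, surrogates and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need, k;
        unsigned long cp, min;

        if (c < 0x80) { i++; continue; }     /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                       /* stray continuation or bad lead byte */

        if (len - i - 1 < need)
            return 0;                        /* sequence is cut short */
        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                    /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                        /* overlong, out of range or surrogate */
        i += need + 1;
    }
    return 1;
}
```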

And none of this is portable to all possible C implementations. Some
systems distinguish between text and binary files at the filesystem
level.

Whatever it is you're trying to do, your first line of defense should
be to arrange to know what type a file is before you open it. If that
fails, as it inevitably will in some cases, you can check the contents
as a fallback, but there's no 100% reliable way to do so.

If you're writing a program that's intended to work only on text
files, it might be best to decide what's acceptable *for that
program*. If you're displaying the contents of the file, for example,
you can establish a convention for displaying non-printable characters
in some readable form. If an input line is very long, you can wrap it
or truncate it. And so on.
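
As one example of such a convention, the sketch below shows every byte that is neither printable nor a newline as a \xHH escape. The function name and the escape format are assumptions made for illustration; isprint() depends on the current locale, and a literal backslash in the input is passed through unchanged, so the output is meant for reading rather than for converting back.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Copies src into dst, replacing each byte that is neither printable
   nor a newline with a \xHH escape. dst is always NUL-terminated when
   cap > 0; output is cut short if cap is too small. Returns the number
   of characters written, not counting the terminating NUL. */
size_t escape_bytes(const unsigned char *src, size_t len, char *dst, size_t cap)
{
    size_t i, used = 0;

    if (cap == 0)
        return 0;
    for (i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (isprint(c) || c == '\n') {
            if (used + 1 >= cap)
                break;                      /* no room for char + NUL */
            dst[used++] = (char)c;
        } else {
            if (used + 4 >= cap)
                break;                      /* no room for \xHH + NUL */
            sprintf(dst + used, "\\x%02X", (unsigned)c);
            used += 4;
        }
    }
    dst[used] = '\0';
    return used;
}
```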

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 31 '06 #9
Me
Claude Yih wrote:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?


Modern computers deal with much more than just ASCII so trying to
determine the encoding is doomed to failure and mysterious acting
heuristics. All you should be doing is getting a filename from the user
and opening it in either text mode or binary mode. If you want to
distinguish between the two then just have the user also input which
mode they want to use.

Apr 1 '06 #10
Me wrote:
Claude Yih wrote:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?


Modern computers deal with much more than just ASCII so trying to
determine the encoding is doomed to failure and mysterious acting
heuristics. All you should be doing is getting a filename from the user
and opening it in either text mode or binary mode. If you want to
distinguish between the two then just have the user also input which
mode they want to use.

Too true. Reading a file to determine its format is like walking outside
and predicting the weather. You might get it right but maybe not.

Text mode implemented in C is a concession to Microsoft. It removes the
CR from the CRLF pair and ignores any trailing ^Z character. Conversely
on writing a file, when told to write LF the pair CRLF is written.

If you expect anything else you'll be disappointed. If you must
investigate the contents of a file, "rb" is your friend.
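
For instance, once the raw bytes have been read through a stream opened with "rb", the line-ending convention can be examined directly. This is only a sketch; the struct and function names are invented for the example, and the byte values 0x0D (CR) and 0x0A (LF) assume ASCII.

```c
#include <stddef.h>

/* Tallies of the line-ending styles found in a buffer. */
struct eol_counts {
    size_t crlf;    /* CR immediately followed by LF */
    size_t lf;      /* LF not preceded by CR */
    size_t cr;      /* CR not followed by LF */
};

/* Counts line endings in raw bytes, e.g. bytes read with fread()
   from a stream opened in "rb" mode so nothing has been translated. */
struct eol_counts count_line_endings(const unsigned char *buf, size_t len)
{
    struct eol_counts n = { 0, 0, 0 };
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == 0x0D) {
            if (i + 1 < len && buf[i + 1] == 0x0A) {
                n.crlf++;
                i++;            /* skip the LF of the pair */
            } else {
                n.cr++;
            }
        } else if (buf[i] == 0x0A) {
            n.lf++;
        }
    }
    return n;
}
```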

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Apr 1 '06 #11
Joe Wright wrote:

.... snip ...

Too true. Reading a file to determine its format is like walking outside
and predicting the weather. You might get it right but maybe not.

Text mode implemented in C is a concession to Microsoft. It removes the
CR from the CRLF pair and ignores any trailing ^Z character. Conversely
on writing a file, when told to write LF the pair CRLF is written.


In this case you are being unfair to Microsoft (yes, I know it's
hard to do). C is the offbeat animal here. Text lines were
terminated with cr/lf for many moons before C decided to ignore the
cr, and the protocols were largely inherited from teletype
machines. The C technique makes it awkward to overprint lines, or
to advance a line without returning to the left margin, while the
much older protocol makes those things easy.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

Apr 1 '06 #12
"Joe Wright" <jo********@comcast.net> wrote in message
news:Ma******************************@comcast.com. ..
Text mode implemented in C is a concession to Microsoft.
Nope. It was a concession to every OS *except* UNIX. By the time
the C standardization effort began in 1983, my company Whitesmiths,
Ltd. had ported C to dozens of different platforms. We added the
text/binary dichotomy to deal uniformly with the numerous
conventions for terminating lines in text files. One of those
platforms happened to be 86-DOS, which was the precursor to MS-DOS.
It was by no means the most important at that time.
It removes the CR
from the CRLF pair and ignores any trailing ^Z character.
And the bytes thereafter.
Conversely on
writing a file, when told to write LF the pair CRLF is written.

If you expect anything else you'll be disappointed. If you must
investigate the contents of a file, "rb" is your friend.


Right. Except for the possibility of trailing NUL padding, that is.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Apr 1 '06 #13
"CBFalconer" <cb********@yahoo.com> wrote in message
news:44***************@yahoo.com...
Joe Wright wrote:
... snip ...

Too true. Reading a file to determine its format is like walking outside
and predicting the weather. You might get it right but maybe not.

Text mode implemented in C is a concession to Microsoft. It removes the
CR from the CRLF pair and ignores any trailing ^Z character. Conversely
on writing a file, when told to write LF the pair CRLF is written.


In this case you are being unfair to Microsoft (yes, I know it's
hard to do). C is the offbeat animal here. Text lines were
terminated with cr/lf for many moons before C decided to ignore the
cr,


You mean, before Unix developed a uniform notation for text streams,
both inside and outside the program, and C built it into its runtime
library.
and the protocols were largely inherited from teletype
machines.
They were also terminated by lf/cr, cr, by blank padding to fixed-length
records, by line count, etc. etc.
The C technique makes it awkward to overprint lines,
No, most systems will put out a ^M so you can do that.
or
to advance a line without returning to the left margin,
Yes, that is hard to do in text mode.
while the
much older protocol makes those things easy.


True. Thus the choice of binary mode as well.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Apr 1 '06 #14
"Joe Wright" writes:
Text mode implemented in C is a concession to Microsoft.


I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this
work. And also, to make it work, a line feed had to have no side effects,
such as advancing the medium. I believe the ASCII code has been jiggered
with to redefine CR and LF since the original specification, but I have no
actual proof.

So it was a concession to ASCII.
Apr 1 '06 #15
"osmium" <r1********@comcast.net> wrote in message
news:49************@individual.net...
"Joe Wright" writes:
Text mode implemented in C is a concession to Microsoft.
I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this
work. And also, to make it work, a line feed had to have no side
effects, such as advancing the medium. I believe the ASCII code has been
jiggered with to redefine CR and LF since the original specification, but
I have no actual proof.


No, you just need backspace to work so you can overstrike a letter.
That still works with the portable C model of a text stream (on a
display that shows both characters of an overstrike, at least).
So it was a concession to ASCII.


Not really.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Apr 1 '06 #16

osmium wrote:
"Joe Wright" writes:
Text mode implemented in C is a concession to Microsoft.
I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this
work. And also, to make it work, a line feed had to have no side effects,
such as advancing the medium.


What, then, is its effect? Or are you thinking about carriage return
not advancing the media?
I believe the ASCII code has been jiggered
with to redefine CR and LF since the original specification, but I have no
actual proof.

So it was a concession to ASCII.


Apr 1 '06 #17
"P.J. Plauger" writes:
So it was a concession to ASCII.


Not really.


I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of actual
drum printers, chain printers and so on.

So are you saying that the initial release of the ASCII standard said that
LF was to do line feed AND carriage return? What was the point then, of
having them as separate codes? Unfortunately I don't have the text that
goes with my pre-historic ASCII chart, only a single page showing the
glyphs.
Apr 1 '06 #18
<me********@aol.com> wrote:

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this
work. And also, to make it work, a line feed had to have no side
effects,
such as advancing the medium.


What, then, is its effect? Or are you thinking about carriage return
not advancing the media?


I think I said it backwards. CR meant return the carriage and LF meant to
advance to next line. Neither had any side effects. The sequence <CR><LF>
was like a typewriter.
Apr 1 '06 #19
"osmium" <r1********@comcast.net> wrote in message
news:49************@individual.net...
"P.J. Plauger" writes:
So it was a concession to ASCII.
Not really.


I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of
actual drum printers, chain printers and so on.


Dunno how CR would be any better off than BS if that was the case.
Either way, as long as the device driver can do overstrikes you
can express them either by CR and spacing down or by BS and an
immediate overstrike.
So are you saying that the initial release of the ASCII standard said that
LF was to do line feed AND carriage return?
I don't recall saying anything about the ASCII standard. It describes
the effect of presenting a stream of ASCII characters to a conforming
display device. That's what goes on *outside* a C program. What I
discussed was the use of a single NL (as opposed to an assortment of
earlier conventions) for signaling the end of a line of text *within*
a C program. Unix also chose this representation for text files, so
there was no need to distinguish binary and text files. Moreover, Unix
device drivers generated whatever sequence of codes was necessary to
get the device to replicate the intent of the internal text stream.
That isolated the device peculiarities where they belong, not spread
throughout each program. (If you ever saw code written during the
1960s, you'd appreciate what a breakthrough this uniformity caused.)

I agree that you lose a bit of expressiveness over maintaining the
code internally as ASCII, but the payoff is a significantly better
unified model, IMO, for representing all text streams. Witness the
success of Unix-style software tools, and the C I/O model well
beyond Unix.
What was the point then, of
having them as separate codes?
ASCII serves one purpose, the C Standard another.
Unfortunately I don't have the text that
goes with my pre-historic ASCII chart, only a single page showing the
glyphs.


P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Apr 1 '06 #20
"P.J. Plauger" wrote:
"CBFalconer" <cb********@yahoo.com> wrote in message

.... snip ...

In this case you are being unfair to Microsoft (yes, I know it's
hard to do). C is the offbeat animal here. Text lines were
terminated with cr/lf for many moons before C decided to ignore the
cr,


You mean, before Unix developed a uniform notation for text streams,
both inside and outside the program, and C built it into its runtime
library.


Pascal is pretty well contemporaneous with C and Unix, and had/has
a well defined concept of files and streams. It doesn't make any
assumptions about line termination characters etc. The world is
not a Unix machine.

Apr 1 '06 #21
"P.J. Plauger" <pj*@dinkumware.com> writes:
"osmium" <r1********@comcast.net> wrote in message
news:49************@individual.net...
"P.J. Plauger" writes:
So it was a concession to ASCII.

Not really.


I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of
actual drum printers, chain printers and so on.


Dunno how CR would be any better off than BS if that was the case.


I had a dot-matrix printer once (Okidata ML520) that would
overheat and stop (until it cooled down) if you sent it too much
text that contained lots of backspaces to do
character-by-character bold or underline. That kind of thing
made the printhead go back and forth incredibly rapidly, and it
just wasn't designed for that.

On the other hand, using CR didn't cause a problem because it
didn't make the printhead reverse direction any more often than
normal.
--
Ben Pfaff
email: bl*@cs.stanford.edu
web: http://benpfaff.org
Apr 1 '06 #22
"CBFalconer" <cb********@yahoo.com> wrote in message
news:44***************@yahoo.com...
"P.J. Plauger" wrote:
"CBFalconer" <cb********@yahoo.com> wrote in message
... snip ...

In this case you are being unfair to Microsoft (yes, I know it's
hard to do). C is the offbeat animal here. Text lines were
terminated with cr/lf for many moons before C decided to ignore the
cr,


You mean, before Unix developed a uniform notation for text streams,
both inside and outside the program, and C built it into its runtime
library.


Pascal is pretty well contemporaneous with C and Unix, and had/has
a well defined concept of files and streams. It doesn't make any
assumptions about line termination characters etc.


Right, and it's a damn poor model, with terrible lookahead properties.
Kernighan and I had to really work at imposing decent primitives atop
it. It is no accident that the model hasn't survived.
The world is
not a Unix machine.


Actually, it is. Compare the operating systems of today with those
of 35 years ago and you'll see how ubiquitous the basic design
decisions of Unix have become. Line terminators are at least now
always embedded characters in a stream -- gone are padding blanks
and structured files -- if not always the same terminators. And
C is certainly ubiquitous, with its simple rules for mapping
C-style text streams to and from text files.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Apr 2 '06 #23
"Ben Pfaff" <bl*@cs.stanford.edu> wrote in message
news:87************@benpfaff.org...
"P.J. Plauger" <pj*@dinkumware.com> writes:
"osmium" <r1********@comcast.net> wrote in message
news:49************@individual.net...
"P.J. Plauger" writes:

So it was a concession to ASCII.

Not really.

I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of
actual drum printers, chain printers and so on.


Dunno how CR would be any better off than BS if that was the case.


I had a dot-matrix printer once (Okidata ML520) that would
overheat and stop (until it cooled down) if you sent it too much
text that contained lots of backspaces to do
character-by-character bold or underline. That kind of thing
made the printhead go back and forth incredibly rapidly, and it
just wasn't designed for that.

On the other hand, using CR didn't cause a problem because it
didn't make the printhead reverse direction any more often than
normal.


Okay, you've made a case for why a good printer *driver* might
rewrite the stream you send it (as practically every smart device
did in Unix and does in today's systems). The issue we've been
discussing is the *linguistics* of text streams. And the point
was that either CR or BS is sufficient to describe overstrikes.
ASCII doesn't have any thermal attributes.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Apr 2 '06 #24
"P.J. Plauger" wrote:
"CBFalconer" <cb********@yahoo.com> wrote in message
"P.J. Plauger" wrote:
"CBFalconer" <cb********@yahoo.com> wrote in message
... snip ...

In this case you are being unfair to Microsoft (yes, I know it's
hard to do). C is the offbeat animal here. Text lines were
terminated with cr/lf for many moons before C decided to ignore the
cr,

You mean, before Unix developed a uniform notation for text streams,
both inside and outside the program, and C built it into its runtime
library.


Pascal is pretty well contemporaneous with C and Unix, and had/has
a well defined concept of files and streams. It doesn't make any
assumptions about line termination characters etc.


Right, and it's a damn poor model, with terrible lookahead properties.
Kernighan and I had to really work at imposing decent primitives atop
it. It is no accident that the model hasn't survived.


I probably shouldn't get into this :-) but people have been
misunderstanding Pascal i/o for generations now. With the use of
lazy i/o there is no problem with interactive operation, and
prompting can be handled with a prompt function (equivalent to
writeln, but without the line advance) or by detection of
interactive pairs to force buffer flushing.

Meanwhile there are none of the problems associated with
interactive scanf and other routines, because the C stream is never
sure whether the field terminating char has been used or is still
in the stream. With Pascal, it is in the stream. With Pascal, we
always have one char. lookahead.

Granted, we can build the equivalent set in C, but that requires
the discipline to not use many existing functions, or to follow
them with an almost universal ungetc. What we can't get is the
convenience of the shorthand usage of read(ln) and write(ln),
although the C++ mechanisms make an ugly attempt at it.
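
The "follow them with ungetc" discipline mentioned above can be wrapped in a tiny helper that gives a C stream one character of lookahead. The function name is made up for the example; the C standard guarantees only one character of pushback per stream.

```c
#include <stdio.h>

/* Returns the next character of fp without consuming it, or EOF
   if the stream is at end-of-file or in an error state. */
int peek_char(FILE *fp)
{
    int c = getc(fp);

    if (c != EOF)
        ungetc(c, fp);          /* push it back for the next reader */
    return c;
}
```

A parser can call peek_char() to decide whether the next field has started, then read it with whatever function it likes, knowing the terminating character is still in the stream.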
The world is
not a Unix machine.


Actually, it is. Compare the operating systems of today with those
of 35 years ago and you'll see how ubiquitous the basic design
decisions of Unix have become. Line terminators are at least now
always embedded characters in a stream -- gone are padding blanks
and structured files -- if not always the same terminators. And
C is certainly ubiquitous, with its simple rules for mapping
C-style text streams to and from text files.


Granted the Unix philosophy has simplified file systems. This is
not necessarily good, since the old systems all had reasons for
existing. Many of those reasons have been subsumed into much
higher performance levels at the storage level, but that is
something like approving of gui bloat because cpus are faster.

Apr 2 '06 #25
"CBFalconer" <cb********@yahoo.com> wrote in message
news:44***************@yahoo.com...
"P.J. Plauger" wrote:
"CBFalconer" <cb********@yahoo.com> wrote in message
"P.J. Plauger" wrote:
"CBFalconer" <cb********@yahoo.com> wrote in message

... snip ...
>
> In this case you are being unfair to Microsoft (yes, I know it's
> hard to do). C is the offbeat animal here. Text lines were
> terminated with cr/lf for many moons before C decided to ignore the
> cr,

You mean, before Unix developed a uniform notation for text streams,
both inside and outside the program, and C built it into its runtime
library.

Pascal is pretty well contemporaneous with C and Unix, and had/has
a well defined concept of files and streams. It doesn't make any
assumptions about line termination characters etc.
Right, and it's a damn poor model, with terrible lookahead properties.
Kernighan and I had to really work at imposing decent primitives atop
it. It is no accident that the model hasn't survived.


I probably shouldn't get into this :-) but


You're probably right.
people have been
misunderstanding Pascal i/o for generations now.
That may be, but I don't. I've written tens of thousands of lines
of Pascal and hundreds of thousands of lines of C over the past
few decades. I've written essays on the various design principles
of parsing with various degrees of lookahead. I've written or
coauthored textbooks on the subject. In short, I've *thought*
about this topic for longer than the average reader of this
newsgroup has been alive. I think I understand it.
With the use of
lazy i/o there is no problem with interactive operation, and
prompting can be handled with a prompt function (equivalent to
writeln, but without the line advance) or by detection of
interactive pairs to force buffer flushing.
Yes, you can get around the problems. The only problem is that you
*have* to get around the problems.
Meanwhile there are none of the problems associated with
interactive scanf and other routines, because the C stream is never
sure whether the field terminating char has been used or is still
in the stream.
Not true. It's precisely, and usefully, defined.
With Pascal, it is in the stream.
Not always true.
With Pascal, we
always have one char. lookahead.
And with C. You never need more than one char lookahead, by design.
Granted, we can build the equivalent set in C, but that requires
the discipline to not use many existing functions, or to follow
them with an almost universal ungetc. What we can't get is the
convenience of the shorthand usage of read(ln) and write(ln),
although the C++ mechanisms make an ugly attempt at it.
I agree that, beyond a point, this becomes a matter of aesthetics.
I won't argue that. What I will observe is natural selection at
work. The C I/O model has survived and thrives. The Pascal model
is marginalized if not dead.
The world is
not a Unix machine.


Actually, it is. Compare the operating systems of today with those
of 35 years ago and you'll see how ubiquitous the basic design
decisions of Unix have become. Line terminators are at least now
always embedded characters in a stream -- gone are padding blanks
and structured files -- if not always the same terminators. And
C is certainly ubiquitous, with its simple rules for mapping
C-style text streams to and from text files.


Granted the Unix philosophy has simplified file systems. This is
not necessarily good, since the old systems all had reasons for
existing.


Yes, they did. Lots of them. In all sorts of directions. And they
haven't survived. Coincidence? I don't think so.
Many of those reasons have been subsumed into much
higher performance levels at the storage level, but that is
something like approving of gui bloat because cpus are faster.


No, it's something like adapting the total software package to the
needs of current hardware. I see no overall bloat in how buffering
is distributed today vs. 30 years ago. But I do see a significant
simplification of I/O as seen by the user over that same period.

Item: One of the seven looseleaf binders that came with RSX-11M was
titled "Preparing for I/O." There is no Unix equivalent. (Or DOS,
or Linux, or ...) You don't set up file control blocks and I/O
control blocks; you just call open, close, read, and write.
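
To make the "just call open, close, read, and write" point concrete, here is a minimal sketch of a file copy that uses nothing else. It assumes a POSIX system; the function name copy_file, the 4096-byte buffer and the 0644 mode are choices made for this illustration, not anything from the post.

```c
#include <fcntl.h>
#include <unistd.h>

/* Copy src to dst using only open, read, write and close.
 * Returns 0 on success, -1 on any error.
 * Sketch only: EINTR is not retried and a partly written dst is not removed. */
int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    ssize_t n;
    int status = 0;
    int in, out;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while (status == 0 && (n = read(in, buf, sizeof buf)) > 0) {
        char *p = buf;
        while (n > 0) {                 /* write may accept fewer bytes than asked */
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) {
                status = -1;
                break;
            }
            p += w;
            n -= w;
        }
    }
    if (status == 0 && n < 0)           /* read itself failed */
        status = -1;

    if (close(in) < 0)
        status = -1;
    if (close(out) < 0)
        status = -1;
    return status;
}
```

There is no control block to set up beforehand: the only state the caller holds is the two small integer descriptors.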

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Apr 2 '06 #26
In article <49************@individual.net> "osmium" <r1********@comcast.net> writes:
....
Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.
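
A minimal sketch of the check described in the paragraph above, assuming the 1024-byte sample size it mentions. The names all_seven_bit and file_looks_ascii are made up for this illustration, and the result is only a heuristic: it says whether the sample is 7-bit clean, not whether the file is "text".

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if every byte in buf[0..n) is <= 0x7F, 0 otherwise. */
int all_seven_bit(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (buf[i] > 0x7F)
            return 0;
    }
    return 1;
}

/* Reads up to the first 1024 bytes of path and applies the check.
 * Returns 1 (7-bit clean), 0 (some byte above 0x7F), or -1 if the
 * file cannot be opened or read. */
int file_looks_ascii(const char *path)
{
    unsigned char buf[1024];
    size_t n;
    int result;
    FILE *fp = fopen(path, "rb");   /* binary mode: no newline translation */

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    result = ferror(fp) ? -1 : all_seven_bit(buf, n);
    fclose(fp);
    return result;
}
```

An empty file passes this test, and so does any file whose first 1024 bytes happen to be below 0x80, whatever follows.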

However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F? In that case, my method will make no
sense.


No, it cannot handle other encodings, but that was not what you asked for.
Note that also files that consist of pure ASCII codes can be binary.
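
For the multi-byte case raised in the quoted question, one possible extension is to ask whether the sample is well-formed UTF-8 instead of pure ASCII. The sketch below is an illustration only: the name is_valid_utf8 is an assumption, and it checks the encoding rules (lead and continuation bytes, shortest form, no surrogates, nothing above U+10FFFF), not whether the content is meaningful text.

```c
#include <stddef.h>

/* Returns 1 if buf[0..n) is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = buf[i];
        size_t len;         /* total bytes in this sequence */
        size_t k;
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) {     /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (n - i < len)
            return 0;       /* sequence cut off by the end of the buffer */
        for (k = 1; k < len; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)
            return 0;       /* overlong form or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;       /* surrogates are not valid in UTF-8 */
        i += len;
    }
    return 1;
}
```

If only a fixed-size prefix of the file is tested, a sequence split at the end of the sample is reported as invalid, so a caller would have to allow for that. And as noted above, passing either test still does not prove that a file is text.
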
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Apr 3 '06 #27
"P.J. Plauger" wrote:
"CBFalconer" <cb********@yahoo.com> wrote in message
.... snip ...
Meanwhile there are none of the problems associated with
interactive scanf and other routines, because the C stream is never
sure whether the field terminating char has been used or is still
in the stream.


Not true. It's precisely, and usefully, defined.
With Pascal, it is in the stream.


Not always true.
With Pascal, we
always have one char. lookahead.


And with C. You never need more than one char lookahead, by design.


Now we can get this off a language war and onto pure C. My problem
with C, and the usual library, is the absence of sane and clear
methods for interactive i/o. To illustrate, the user wants to
output a prompt and receive an integer. How to do it?

The new users first choice is probably scanf. He forgets to check
the error return. And, even worse, what gets entered is:

<programmed prompt>: 1234 x<cr>

and this is being handled by:

printf("<programmed prompt>:"); fflush(stdout);
scanf("%d", &i);

which gets the first entry, but falls all over itself when the
sequence is called again. The usual advice is to read full lines,
i.e.:

printf("<programmed prompt>:"); fflush(stdout);
fgets(buffer, BUFSZ, stdin);
i = strtol(buffer, &errptr, 10);

which brings in the extraneous buffer, a magical BUFSZ derived by
gazing at the ceiling, prayer, and incense sticks, not to mention
errptr. So I consider that solution unclean by my standards. (Of
course they can use my ggets for consistent whole line treatment).
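
For reference, a minimal sketch of the read-a-full-line approach with the error checks filled in. The function names parse_int_line and prompt_int and the 128-byte buffer are assumptions made for this illustration. Rejecting trailing junk such as the "x" in the example input is one possible policy, not the only one.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse one whole line as a decimal int.
 * Returns 1 on success; 0 if there are no digits, if anything other
 * than white space follows the number, or if the value is out of range. */
int parse_int_line(const char *line, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(line, &end, 10);
    if (end == line)
        return 0;                       /* no digits at all */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;                       /* does not fit in an int */
    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0')
        return 0;                       /* trailing junk, e.g. "1234 x" */
    *out = (int)v;
    return 1;
}

/* Prompt until a valid integer is entered.
 * Returns 1 on success, 0 on end of file or read error.
 * Sketch only: a line longer than the buffer is read in pieces. */
int prompt_int(const char *prompt, int *out)
{
    char buffer[128];                   /* arbitrary size, see note above */

    for (;;) {
        fputs(prompt, stdout);
        fflush(stdout);
        if (fgets(buffer, sizeof buffer, stdin) == NULL)
            return 0;
        if (parse_int_line(buffer, out))
            return 1;
    }
}
```

The buffer, its size and the end pointer are all still there, which is the complaint made above; the sketch only shows what each of them is for.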

So instead we write a baby routine that inputs from a stream with
getc, skips leading blanks (and possibly blank lines), and ungets
the field termination char. We combine that with my favorite
flushln:

while ((EOF != (ch = getc(f))) && ('\n' != ch)) continue;

and the birds twitter, the sun shines, etc. UNTIL somebody calls
some other input routine and doesn't have the discipline to define
a quiescent i/o state and ensure that that state is achieved at
each input. That in turn leads to calling the flushln twice, and
discarding perfectly usable (and possibly needed) input.
Alternatively it leads to newbies calling fflush(stdin) and similar
curses.
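
A minimal sketch of the "baby routine" and flushln pair described above. The name read_uint, the return codes, and the decision to have flushln return the last character read are assumptions made for this illustration.

```c
#include <ctype.h>
#include <stdio.h>

/* Skip leading white space (including blank lines), read an unsigned
 * decimal field with getc, and push the terminating char back.
 * Returns 1 on success, 0 if the next non-blank char is not a digit
 * (it is pushed back), EOF at end of input.
 * Sketch only: overflow is not detected. */
int read_uint(FILE *f, unsigned *out)
{
    int ch;
    unsigned v = 0;

    do {
        ch = getc(f);
    } while (ch != EOF && isspace(ch));
    if (ch == EOF)
        return EOF;
    if (!isdigit(ch)) {
        ungetc(ch, f);
        return 0;
    }
    do {
        v = v * 10 + (unsigned)(ch - '0');
        ch = getc(f);
    } while (ch != EOF && isdigit(ch));
    if (ch != EOF)
        ungetc(ch, f);      /* the field terminator stays in the stream */
    *out = v;
    return 1;
}

/* Discard the rest of the current line, including the '\n'.
 * Returns the last char read: '\n' or EOF. */
int flushln(FILE *f)
{
    int ch;

    while ((EOF != (ch = getc(f))) && ('\n' != ch))
        continue;
    return ch;
}
```

With the input "1234 x" followed by a newline, read_uint leaves the blank in the stream and one flushln call clears the rest of the line. A second flushln call in a row would throw away the whole next line, which is the hazard described above.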

This is what I mean by saying that C doesn't provide the one char
lookahead in the right place, i.e. the i/o, where it can't be lost.

It would help if C provided the ability to detect "the last
character used was a '\n'", which would enable the above flushln to
avoid discarding extra lines. However that won't happen. It would
probably also suffice to define fflush usage on input streams.
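
Standard C has no such query, but it can be approximated in user code by routing all input through a small wrapper. The sketch below is an illustration under that assumption; the type tracked_in and the t_ functions are hypothetical names, and it only works if every read on the stream goes through them.

```c
#include <stdio.h>

/* A stream plus a record of the last char consumed from it. */
typedef struct {
    FILE *fp;
    int last;   /* most recent char consumed */
    int prev;   /* the one before it, restored on pushback */
} tracked_in;

/* Start of input is treated like the start of a fresh line. */
void t_init(tracked_in *in, FILE *fp)
{
    in->fp = fp;
    in->last = '\n';
    in->prev = '\n';
}

int t_getc(tracked_in *in)
{
    int ch = getc(in->fp);

    in->prev = in->last;
    in->last = ch;
    return ch;
}

/* One level of pushback, matching what ungetc guarantees. */
int t_ungetc(int ch, tracked_in *in)
{
    int r = ungetc(ch, in->fp);

    if (r != EOF)
        in->last = in->prev;
    return r;
}

/* Discard the rest of the current line, unless the last char
 * consumed was already a '\n'. */
void t_flushln(tracked_in *in)
{
    int ch;

    if (in->last == '\n')
        return;
    while ((EOF != (ch = t_getc(in))) && ('\n' != ch))
        continue;
}
```

Calling t_flushln twice in a row is then harmless: the second call sees that the last char consumed was a newline and reads nothing.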

Apr 3 '06 #28
On Sat, 1 Apr 2006 16:00:58 UTC, "osmium" <r1********@comcast.net>
wrote:
"P.J. Plauger" writes:
So it was a concession to ASCII.


Not really.


I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of actual
drum printers, chain printers and so on.

So are you saying that the initial release of the ASCII standard said that
LF was to do line feed AND carriage return? What was the point then, of
having them as separate codes? Unfortunately I don't have the text that
goes with my pre-historic ASCII chart, only a single page showing the
glyphs.


In the late 1960s and early 1970s the device known today as a screen
was not available. There were line printers, punch card and paper
tape readers and writers, and TTY devices combining a keyboard, a
punched paper tape reader and writer, and a character printer. That
printer was able to use single control chars like
- cr - carriage return - move the print unit back to column 1
- lf - linefeed - feed the paper to the next line
- ff - formfeed - feed the paper to the next page stop on
the control ribbon
- backspace - one fixed character position back on the same
line
- backline - move the page one line back

Some of these devices were dumb enough to print the next character
even before the device was able to reach character position 1.
So to get a clean printout you had to do cr before lf to hold the
device until lf was done.

Anyway, to get a new line you had to send lf, or the print head would
put the char at whatever position it was in at the time it got the
order to print it.

On mainframes the TTY was mainly configured to perform a cr whenever
it got an lf, to optimise the programs and save one character in
text (memory was scarce and expensive even though the machines were
able to multitask). The upcoming microprocessors (mostly home-brewed
by widely differing manufacturers) were limited in multitasking to
the different hardware levels (mostly 16) the CPU was able to control,
and were designed more primitively. They required even dumber TTYs or
more intelligent customer-built I/O devices.
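
As an illustration of the lf-implies-cr configuration described above, here is a small sketch of the translation done in software: each bare LF in the stored text is expanded to CR LF on the way to a device that needs both. The function name lf_to_crlf and its snprintf-like return convention are assumptions made for this example.

```c
#include <stddef.h>

/* Copy in to out, inserting a CR before every LF that does not
 * already have one. At most cap - 1 chars plus a terminating NUL
 * are written. Returns the length the full result needs (without
 * the NUL), so a return value >= cap means the output was cut short. */
size_t lf_to_crlf(const char *in, char *out, size_t cap)
{
    size_t need = 0;
    char prev = '\0';
    const char *p;

    for (p = in; *p != '\0'; p++) {
        if (*p == '\n' && prev != '\r') {
            if (need + 1 < cap)
                out[need] = '\r';
            need++;
        }
        if (need + 1 < cap)
            out[need] = *p;
        need++;
        prev = *p;
    }
    if (cap > 0)
        out[need < cap ? need : cap - 1] = '\0';
    return need;
}
```

The stored text keeps one character per line end, and the extra CR exists only on the way to the device.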

At the time C was created a typical computer was a mainframe with
- a lot of punch card readers as program input
- a lot of magnetic tape devices as data store
- 1 or more punch card writer(s)
- some paper tape readers and writers
- one or more line printers (the first music devices :-)
for developers)
- later, a high number of removable hard disks
- 1 TTY as operator console

No wonder that the C runtime was not created to handle user input
well but is ideal for handling computer-designed input like punch
cards.

The upcoming microprocessors were designed to control machines, having
only
- special devices to control machines
- paper tape punches and readers
- magnetic tape writers
- seldom line printers
- TTY as operator console.

Ages later they moved into offices and gained other kinds of special
devices and TTY-like devices as user input/output devices.

Modern GUIs are proprietary anyway and do not use the C runtime for
user-oriented I/O.

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2 Deutsch ist da!
Apr 9 '06 #29
On Sat, 1 Apr 2006 07:33:13 -0800, "osmium" <r1********@comcast.net>
wrote:
"Joe Wright" writes:
Text mode implemented in C is a concession to Microsoft.
I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this


Are you sure? The far-dominant early ASCII (64-graphic = uppercase
only) devices, Teletype 33 and 35, had uparrow and backarrow. The
earliest revision of the standard document I looked at, IIRC 1968 or
so, added tilde along with lowercase and described circumflex and
underscore as changed precisely so they could be used as modifiers. It
also gave NL as an acceptable alternate meaning of x0A but not the
primary one. (There was the same ambiguity over whether VT and FF
included CR or not, but those were already less important then, and
now have nearly vanished.) And of course ASCII was originally intended
and used only as an "American" (meaning US) standard.
work. And also, to make it work, a line feed had to have no side effects,
such as advancing the medium. I believe the ASCII code has been jiggered
with to redefine CR and LF since the original specification, but I have no
actual proof.

So it was a concession to ASCII.

I would say to ASCII as commonly used, _and_ to other non-Unix and
record-oriented filesystems still pretty important in the 1980s.

- David.Thompson1 at worldnet.att.net
Apr 16 '06 #30

"Dave Thompson" <da*************@worldnet.att.net> wrote in message
news:dd********************************@4ax.com...
On Sat, 1 Apr 2006 07:33:13 -0800, "osmium" <r1********@comcast.net>
wrote:
"Joe Wright" writes:
> Text mode implemented in C is a concession to Microsoft.


I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this


Are you sure? The far-dominant early ASCII (64-graphic = uppercase
only) devices, Teletype 33 and 35, had uparrow and backarrow. The
earliest revision of the standard document I looked at, IIRC 1968 or
so, added tilde along with lowercase and described circumflex and
underscore as changed precisely so they could be used as modifiers. It
also gave NL as an acceptable alternate meaning of x0A but not the
primary one. (There was the same ambiguity over whether VT and FF
included CR or not, but those were already less important then, and
now have nearly vanished.) And of course ASCII was originally intended
and used only as an "American" (meaning US) standard.
work. And also, to make it work, a line feed had to have no side
effects,
such as advancing the medium. I believe the ASCII code has been jiggered
with to redefine CR and LF since the original specification, but I have
no
actual proof.

So it was a concession to ASCII.

I would say to ASCII as commonly used, _and_ to other non-Unix and
record-oriented filesystems still pretty important in the 1980s.


I work for a main-frame manufacturer and what I said was partly based on
discussions with the guy who had represented us when the standard was
written. I detested ASCII (still do) and he was a big proponent and
defender. I hated the fact that about 25 codes were wasted to suit the
Teletype people where one could almost fit the Greek alphabet in, maybe a
blend of lower and upper case depending on common usage in the sciences. I
also thought (and think) a seven bit code was ridiculous. I thought the
sheet of glyphs that I had was from the first release of ASCII, but I can't
attest to that. I know there is no hint of a NL in it, or a soft meaning
for CR or LF. Yes, it is an American code and I don't know if there is
*any* language that can be correctly transcribed in ASCII, but the point is
they *tried*. The grave, circumflex, tilde and virgule (old? Norwegian) are
a pretty modest start. BTW the sheet I have shows the vertical bar as a
broken vertical bar - looks kind of IBMey. And I vaguely recall seeing the
"hooked bar" of EBCDIC in lieu of the tilde in some version or other of
ASCII.

I think there was a post in this thread that spoke of using BS instead of
the LF CR business that I didn't respond to. That would only work for
*character* printers. It doesn't work on line printers. BTW, I suspect that
tons of computer fluent people would be amazed to see a line printer at
work, cranking out beaucoup *pages* per minute.

There is a long thread of 282 messages on the NL situation but I don't have
the time or interest to read it all. I did note message #135 from Dennis
Ritchie supports my claims, as far as I can see. I need an intern to read
and digest this kind of thing for me. :-)

http://groups.google.com/group/alt.f...3045858df1b784
Apr 17 '06 #31
In article <dd********************************@4ax.com> Dave Thompson <da*************@worldnet.att.net> writes:
On Sat, 1 Apr 2006 07:33:13 -0800, "osmium" <r1********@comcast.net>
wrote:
"Joe Wright" writes:
Text mode implemented in C is a concession to Microsoft.
I hate Microsoft too. But that is not the case.
Indeed. The distinction predates Microsoft.
The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this


The first copy of ASCII shows neither circumflex, nor tilde. The first
version of ASCII that shows both dates from 1965, and indeed has tilde and
circumflex in upper case positions. However, although that version was
ratified, it was never published, nor used.
Are you sure? The far-dominant early ASCII (64-graphic = uppercase
only) devices, Teletype 33 and 35, had uparrow and backarrow.
Yup, ASCII 1963. The arrows and the backslash.
The
earliest revision of the standard document I looked at, IIRC 1968 or
so,
The earliest revision was 1965. Adding lower case letters, tilde,
circumflex, underscore, braces, not-symbol and vertical bar. The
arrows and the backslash were removed. But (as I said) although
ratified, never published, nor used.
added tilde along with lowercase and described circumflex and
underscore as changed precisely so they could be used as modifiers.
ASCII 1967 removed the not-symbol and re-added the backslash when
comparing to 1965, but also a few positions were moved: the tilde
moved from uppercase to lowercase. But the use as modifiers was
complex. All of apostrophe, grave accent, umlaut, tilde and
circumflex could be modifiers, depending on context.
It
also gave NL as an acceptable alternate meaning of x0A but not the
primary one.
Indeed. The meaning 'NL' was only to be used if both sender and receiver
had agreed on the meaning.
And of course ASCII was originally intended
and used only as an "American" (meaning US) standard.


The reason ASCII 1965 was never published was because the committee had
become aware that there was an international effort for standardisation.
This ultimately led to ASCII 1967 which is equal to the ISO version of
that time.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Apr 17 '06 #32
