
How to determine whether a file is ASCII or binary?

Hi, all

Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than judging by the extension?

Thank you!
Nov 14 '05 #1
12 Replies


<posted & mailed>

There's no "ASCII" in C. There is a somewhat artificial distinction between
"text" and "binary", with "text" being a special case of a binary file whereby
the operating system might do something to the data as it is written to the
disk to make it compatible with applications that operate on text.

Since there's no definition of what that magic might be, there's likewise no
way to distinguish a "text" file from a "binary" file. All text files are
binary files. The only way to recognize a text file would be to check if
the file matches the local environment's criteria for a "text" file (and
most environments don't have the concept of a "text" file at all).

The canonical example is CP/M (and Microsoft's products, which hark back
to it). There, if you open a file for writing as a "text" file, every "\n"
that is written becomes "\r\n" on disk, and when you close the file, "\032"
is appended to the end of the file. When you read from the text file, the
reverse operations occur. Windows still does the newline translation, and
its C runtime still treats "\032" as end-of-file when reading in text mode.
The only way you could differentiate between a text file and a binary file
would be to be armed with this information, then open the target file in
binary mode and check that every byte in the file returns true for isprint()
or isspace() except the last byte in the file, which must equal '\032'. If
so, you know the file is a text file. You don't need to test if the file is
a binary file, since all files are.
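
The check described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name looks_like_cpm_text is made up, the buffer is assumed to hold the file's bytes as read from a stream opened with "rb", and the trailing '\032' is treated as optional rather than required, since many text files never carry one.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (the name is not from the thread).
 * Returns true if every byte satisfies isprint() or isspace(), ignoring
 * one trailing Ctrl-Z ('\032') if present. The buffer is assumed to hold
 * the file's contents as read from a stream opened in binary mode. */
bool looks_like_cpm_text(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* Tolerate the CP/M-style end-of-file marker as the last byte. */
    if (len > 0 && buf[len - 1] == '\032')
        len--;

    for (size_t i = 0; i < len; i++) {
        /* buf[i] is an unsigned char, so it is a valid <ctype.h> argument. */
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return false;
    }
    return true;
}
```

Note that isprint() and isspace() depend on the current locale, so in the default "C" locale any byte above 127 makes the function return false.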

It gets more complicated in modern days where multiple character sets and
various encodings are used for text... In that case, the encoding needs to
be indicated within the file somehow and that frequently presumes multibyte
character sets, etc., which already preclude them from being treated as
simple text files in the first place.
Sunner Sun wrote:
Hi, all

Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than judging by the extension?

Thank you!


--
remove .spam from address to reply by e-mail.
Nov 14 '05 #2

On Fri, 9 Apr 2004 21:46:18 +0800, "Sunner Sun" <su********@163.com> wrote:
Hi, all

Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than judging by the extension?

Portably, in C? Nah, because a "binary" file can simply mimic an ASCII file
and no one, or no /thing/, could possibly tell you whether the data was
written in binary mode or text mode.

The best you can do is take the approach of the Unix "file" command. Here's
a sample output I just got from running it under Cygwin on a text file:

[/home/leor] $ file s2
s2: ASCII English text, with CRLF line terminators

It looks at the first few bytes (along with perhaps platform-specific inode
info in this case) and "takes its best shot".
-leor
Thank you!


--
Leor Zolman --- BD Software --- www.bdsoft.com
On-Site Training in C/C++, Java, Perl and Unix
C++ users: Download BD Software's free STL Error Message Decryptor at:
www.bdsoft.com/tools/stlfilt.html
Nov 14 '05 #3

Sunner Sun writes:
Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than judging by the extension?


There is no way to be certain. But note that most of the control characters
would not appear in an ASCII file.

You can make a pretty good guess by defining a subset made up of most of the
ASCII control characters, then counting the characters in the file that are
in the subset; the count should be zero for an ASCII file.

But in the final analysis you must prove a negative, which is a troublesome
thing to do.
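
As a rough sketch of the counting idea above: the post does not say which control characters belong in the subset, so the choice below (every code under 0x20 plus DEL, minus tab, LF, VT, FF and CR) is an assumption, as are the names in_suspicious_subset and count_suspicious. The code also assumes the bytes are ASCII-coded.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed subset: the C0 control codes (0x00-0x1F) and DEL (0x7F),
 * except the ones that commonly occur in text files. */
static int in_suspicious_subset(unsigned char c)
{
    /* 0x09 tab, 0x0A LF, 0x0B VT, 0x0C FF, 0x0D CR are allowed. */
    if (c >= 0x09 && c <= 0x0D)
        return 0;
    return c < 0x20 || c == 0x7F;
}

/* Hypothetical helper: counts the bytes that fall in the subset.
 * A result of zero is consistent with (but does not prove) ASCII text. */
size_t count_suspicious(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (in_suspicious_subset(buf[i]))
            count++;
    }
    return count;
}
```

A caller would read the file (or a leading part of it) into a buffer and treat any non-zero count as a hint that the data is not plain text.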
Nov 14 '05 #4

Sunner Sun wrote:
Hi, all

Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than judging by the extension?

Thank you!


In addition to what the others have replied, there is no
guarantee in some operating systems that the extension
matches the type of file.

For example, in MS-DOS land, one could create a file
containing "The big Ogre" and give it an extension of
".exe". On the other hand, one could rename an executable,
such as "command.com", to "command.txt".

Whether a file is binary or ASCII is an attribute of the
file. Maintaining file attributes is the responsibility
of the operating system (and perhaps the application
creating the file).

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Nov 14 '05 #5

In <%5ydc.107110$K91.305670@attbi_s02> James McIninch <ja************@comcast.net.spam> writes:
<posted & mailed>

There's no "ASCII" in C. There is a somewhat artificial distinction between
"text" and "binary", with "text" being a special case of a binary file whereby
the operating system might do something to the data as it is written to the
disk to make it compatible with applications that operate on text.


The difference is as natural as you can get on those systems where binary
and text files are completely different beasts. Unix and Windows do not
define the whole world of hosted computing...

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #6

In <c5**********@news.yaako.com> "Sunner Sun" <su********@163.com> writes:
Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than judging by the extension?


If *your* implementation allows opening a file in the wrong (text vs
binary) mode (that's technically undefined behaviour), you can try
opening it in text mode and using some heuristics to decide whether
it contains text or binary data. I wouldn't recommend opening it in
binary mode, as that could expose some of the internals of the text file
representation and lead you to the wrong conclusion.

First, if you find any null character inside, it is reasonable to decide
that you have a binary file (text files seldom contain null characters,
as they upset any input function that returns a string, while binary files
almost always contain at least one null byte).

In the absence of a null character, try finding characters for which
iscntrl() is true but isspace() isn't. Any such beast is also a good
hint that the file is a binary file.

If the file is too large, you may want to restrict your search to the
first N bytes. There are also files containing essentially text, but
having embedded terminal/printer control sequences. It is hard to
say whether they qualify as text files or as binary files and even
harder to identify them.
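
The heuristics above might be put together along these lines. This is a minimal sketch under stated assumptions: the names guess_buffer, guess_stream and SNIFF_BYTES are invented for illustration, 512 is an arbitrary choice for "the first N bytes", and the caller is assumed to have opened the stream in text mode as suggested.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define SNIFF_BYTES 512  /* assumed value for "the first N bytes" */

enum guess { GUESS_TEXT, GUESS_BINARY };

/* Applies the two heuristics to a buffer of already-read bytes. */
enum guess guess_buffer(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* A null character is the strongest hint of binary data. The
         * next test would also catch it, but the check is kept separate
         * to mirror the two steps described in the post. */
        if (buf[i] == '\0')
            return GUESS_BINARY;
        /* Control characters other than whitespace are also suspicious. */
        if (iscntrl(buf[i]) && !isspace(buf[i]))
            return GUESS_BINARY;
    }
    return GUESS_TEXT;
}

/* Examines at most SNIFF_BYTES bytes from the current stream position. */
enum guess guess_stream(FILE *fp)
{
    unsigned char buf[SNIFF_BYTES];

    assert(fp != NULL);
    size_t n = fread(buf, 1, sizeof buf, fp);
    return guess_buffer(buf, n);
}
```

An empty file is reported as text here, and a text file with embedded escape sequences (ESC is a non-whitespace control character) is reported as binary, which is the ambiguous case mentioned above.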

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #7

In article <c5*************@ID-179017.news.uni-berlin.de> "osmium" <r1********@comcast.net> writes:
But in the final analysis you must prove a negative, which is a troublesome
thing to do.


Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nov 14 '05 #8

In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
In article <c5*************@ID-179017.news.uni-berlin.de> "osmium" <r1********@comcast.net> writes:
But in the final analysis you must prove a negative, which is a troublesome
thing to do.


Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #9


In article <Hw********@cwi.nl>, "Dik T. Winter" <Di********@cwi.nl> writes:
In article <c5*************@ID-179017.news.uni-berlin.de> "osmium" <r1********@comcast.net> writes:
> But in the final analysis you must prove a negative, which is a troublesome
> thing to do.


Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


Executable machine code for Pentiums and the like that's entirely
printable ASCII or UTF-16 is quite the rage these days, since it's
useful for exploiting some kinds of buffer overflows and other
security vulnerabilities. It shows up all the time on the security
mailing lists (Bugtraq, Vuln-Dev, etc).

Back to the original problem: If a file consists of nothing but
printable ASCII characters, then it is by definition an ASCII file.
It may not be human-readable text, but it's ASCII. Problem solved.

If the OP wants to determine the *intent* of a file, of course, that's
a bit harder, inasmuch as it's not even well-defined.
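
Under that definition the test is purely mechanical. A minimal sketch, assuming "printable ASCII" means the codes 0x20 through 0x7E; the function name is_printable_ascii and the allow_ws flag (which additionally admits tab, LF and CR, since a strict reading excludes even newlines) are illustrative additions, not something the post specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte is a printable ASCII code
 * (0x20-0x7E). If allow_ws is set, tab (0x09), LF (0x0A) and CR (0x0D)
 * are accepted as well. Numeric codes are used so the result does not
 * depend on the locale or on the execution character set. */
bool is_printable_ascii(const unsigned char *buf, size_t len, bool allow_ws)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        bool printable = (c >= 0x20 && c <= 0x7E);
        bool ws = allow_ws && (c == 0x09 || c == 0x0A || c == 0x0D);

        if (!printable && !ws)
            return false;
    }
    return true;
}
```

This says nothing about whether the bytes are meant as text; a printable-only executable passes just as well.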

--
Michael Wojcik mi************@microfocus.com

How can I sing with love in my bosom?
Unclean, immature and unseasonable salmon. -- Basil Bunting
Nov 14 '05 #10

"Dan Pop" <Da*****@cern.ch> wrote in message
news:c5**********@sunnews.cern.ch...
In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)


It is arguable whether they qualify as "executables". Dik was probably
talking about EICAR or something similar. Look up EICAR in Google.

Peter
Nov 14 '05 #11

In article <40**********@mk-nntp-2.news.uk.tiscali.com> "Peter Pichler" <pi*****@pobox.sk> writes:
"Dan Pop" <Da*****@cern.ch> wrote in message
news:c5**********@sunnews.cern.ch...
In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)


It is arguable whether they qualify as "executables". Dik was probably
talking about EICAR or something similar. Look up EICAR in Google.


Nope, much older. It is an old story about a workstation where an rm -rf /
in progress was aborted, but not in time to avoid losing almost all object
files (/bin /usr/bin). Luckily there was a window open with a shell and a
window with a text editor (vi). On a similar machine an executable was
produced that consisted of ASCII characters only. This was entered in
the text editor, a file was written, a chmod (built-in) was performed in
the shell, and so a start of recovery was found.

--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nov 14 '05 #12

In <40**********@mk-nntp-2.news.uk.tiscali.com> "Peter Pichler" <pi*****@pobox.sk> writes:
"Dan Pop" <Da*****@cern.ch> wrote in message
news:c5**********@sunnews.cern.ch...
In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
>There has been an example of an executable which consisted only
>of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)


It is arguable whether they qualify as "executables".


Nothing to argue about on a Unix system. If execve() can execute it,
it is an executable.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #13
