how to determine a file is ASCII or binary?

Hi, all

Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than by judging the extension?

Thank you!
Nov 14 '05 #1
<posted & mailed>

There's no "ASCII" in C. There is a somewhat artificial distinction between
"text" and "binary": a "text" file is a special case of a binary file, where
the operating system might transform the data as it is written to disk to
make it compatible with applications that operate on text.

Since there's no definition of what that magic might be, there's likewise no
way to distinguish a "text" file from a "binary" file. All text files are
binary files. The only way to recognize a text file would be to check if
the file matches the local environment's criteria for a "text" file (and
most environments don't have the concept of a "text" file at all).

The canonical example is CP/M (and Microsoft's products, which hark back
to it). There, if you open a file for writing as a "text" file, every "\n"
that is written becomes "\r\n" on disk, and when you close the file, "\032"
is appended to the end of the file. When you read from the text file, the
reverse operations occur. Windows still does the newline translation, and
its C runtime still treats "\032" as end-of-file when reading in text mode.
The only way you could differentiate between a text file and a binary file
under this convention would be to be armed with this information, then open
the target file in binary mode and check that every byte in the file returns
true for isprint() or isspace() except the last byte in the file, which must
equal '\032'. If so, you know the file is a text file. You don't need to
test if the file is a binary file, since all files are.
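
As a rough illustration of that check, here is a minimal C sketch. It assumes the file has already been read into memory after being opened in binary mode (fopen() with "rb"). The function name is_cpm_text is made up for this example, and isprint()/isspace() answer according to the current locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not a standard function.
   Returns 1 if buf[0..len-1] fits the CP/M-style text layout described
   above (only printable or whitespace bytes, then a final Ctrl-Z),
   and 0 otherwise. */
int is_cpm_text(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* The last byte must be the Ctrl-Z terminator, '\032'. */
    if (len == 0 || buf[len - 1] != '\032')
        return 0;

    /* Every byte before the terminator must be printable or whitespace. */
    for (size_t i = 0; i + 1 < len; i++) {
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;
    }
    return 1;
}
```

A file that merely lacks the trailing '\032' is rejected by this sketch, so it only recognises text files written under the CP/M convention, not text files in general.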

It gets more complicated in modern days where multiple character sets and
various encodings are used for text... In that case, the encoding needs to
be indicated within the file somehow and that frequently presumes multibyte
character sets, etc., which already preclude them from being treated as
simple text files in the first place.
Sunner Sun wrote:
Hi, all

Since the OS look both ASCII and binary file as a sequence of bytes, is
there any way to determine the file type except to judge the extension?

Thank you!


--
remove .spam from address to reply by e-mail.
Nov 14 '05 #2
On Fri, 9 Apr 2004 21:46:18 +0800, "Sunner Sun" <su********@163.com> wrote:
Hi, all

Since the OS look both ASCII and binary file as a sequence of bytes, is
there any way to determine the file type except to judge the extension?

Portably, in C? Nah, because a "binary" file can simply mimic an ASCII file
and no one, or no /thing/, could possibly tell you whether the data was
written in binary mode or text mode.

The best you can do is take the approach of the Unix "file" command. Here's
a sample output I just got from running it under Cygwin on a text file:

[/home/leor] $ file s2
s2: ASCII English text, with CRLF line terminators

It looks at the first few bytes (along with perhaps platform-specific inode
info in this case) and "takes its best shot".
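
The "look at the first few bytes" idea can be sketched in C as a lookup against a table of well-known leading signatures. This is only an illustration: the real "file" command consults a much larger magic database, and the table, the struct and the function name sniff_magic below are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One leading-byte signature and a label for it (illustrative only). */
struct magic {
    const char *bytes;
    size_t      len;
    const char *name;
};

/* A tiny, hand-picked table of widely documented signatures. */
static const struct magic known[] = {
    { "\x7f" "ELF",          4, "ELF executable" },
    { "\x1f\x8b",            2, "gzip compressed data" },
    { "%PDF-",               5, "PDF document" },
    { "\x89PNG\r\n\x1a\n",   8, "PNG image" },
};

/* Hypothetical helper: returns a label if the buffer starts with a
   known signature, or NULL if nothing in the table matches. */
const char *sniff_magic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (len >= known[i].len &&
            memcmp(buf, known[i].bytes, known[i].len) == 0)
            return known[i].name;
    }
    return NULL;
}
```

A NULL result only means "no signature recognised". Deciding that such a file is text still needs a content heuristic, which is why the result remains a best guess.
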
-leor
Thank you!


--
Leor Zolman --- BD Software --- www.bdsoft.com
On-Site Training in C/C++, Java, Perl and Unix
C++ users: Download BD Software's free STL Error Message Decryptor at:
www.bdsoft.com/tools/stlfilt.html
Nov 14 '05 #3
Sunner Sun writes:
Since the OS look both ASCII and binary file as a sequence of bytes, is
there any way to determine the file type except to judge the extension?


There is no way to be certain. But note that most of the control characters
would not appear in an ASCII file.

You can make a pretty good guess by defining a subset containing most of the
ASCII control characters, then counting the characters in the file that fall
in that subset. The count should be zero for an ASCII file.
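
A minimal C sketch of that counting approach follows. The choice of subset is an assumption made for this example: all ASCII control codes (0x00 to 0x1F, plus 0x7F) except TAB, LF, VT, FF and CR. The function name count_suspicious is likewise made up, and the numeric ranges assume an ASCII execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes of buf that fall in the
   "should not appear in ASCII text" subset chosen for this sketch. */
size_t count_suspicious(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        /* ASCII control codes: 0x00..0x1F and DEL (0x7F). */
        int is_control = (c < 0x20 || c == 0x7F);

        /* Control codes that routinely occur in text files. */
        int is_allowed = (c == '\t' || c == '\n' || c == '\v' ||
                          c == '\f' || c == '\r');

        if (is_control && !is_allowed)
            count++;
    }
    return count;
}
```

A count of zero is consistent with an ASCII text file but does not prove it, which is the caveat raised in the next paragraph.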

But in the final analysis you must prove a negative, which is a troublesome
thing to do.
Nov 14 '05 #4
Sunner Sun wrote:
Hi, all

Since the OS look both ASCII and binary file as a sequence of bytes, is
there any way to determine the file type except to judge the extension?

Thank you!


In addition to what the others have replied, there is no
guarantee in some operating systems that the extension reflects
the type of file.

For example, in MS-DOS land, one could create a file
containing "The big Ogre" and give it an extension of
".exe". On the other hand, one could rename an executable,
such as "command.com", to "command.txt".

Whether a file is binary or ASCII is an attribute of the
file. Maintaining file attributes is the responsibility
of the operating system (and perhaps the application
creating the file).

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Nov 14 '05 #5
In <%5ydc.107110$K91.305670@attbi_s02> James McIninch <ja************@comcast.net.spam> writes:
<posted & mailed>

There's no "ASCII" in C. There is a somewhat artificial distinction between
"text" and "binary". "text" being a special case of a binary file whereby
the operating system might do something to the data as it is written to the
disk to make it compatible with applications that operate on text.


The difference is as natural as you can get on those systems where binary
and text files are completely different beasts. Unix and Windows do not
define the whole world of hosted computing...

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #6
In <c5**********@news.yaako.com> "Sunner Sun" <su********@163.com> writes:
Since the OS look both ASCII and binary file as a sequence of bytes, is
there any way to determine the file type except to judge the extension?


If *your* implementation allows opening a file in the wrong (text vs
binary) mode (that's technically undefined behaviour), you can try
opening it in text mode and using some heuristics to decide whether
it contains text or binary data. I wouldn't recommend opening it in
binary mode, as it could expose some of the internals of the text file's
representation and lead you to draw the wrong conclusion from that.

First, if you find any null character inside, it is reasonable to decide
that you have a binary file (text files seldom contain null characters,
as they upset any input function that returns a string, while binary files
seldom don't contain at least one null byte).

In the absence of a null character, try finding characters for which
iscntrl() is true but isspace() isn't. Any such beast is also a good
hint that the file is a binary file.

If the file is too large, you may want to restrict your search to the
first N bytes. There are also files containing essentially text, but
having embedded terminal/printer control sequences. It is hard to
say whether they qualify as text files or as binary files and even
harder to identify them.
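
The heuristics above can be sketched in C roughly as follows. The names sample_looks_binary and file_looks_binary and the SAMPLE_SIZE value are assumptions made up for this example. The file is opened in text mode as suggested above, which is only meaningful on implementations that tolerate opening a binary file that way.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

enum { SAMPLE_SIZE = 4096 };  /* hypothetical choice for "the first N bytes" */

/* Hypothetical helper: returns 1 if the sample hints at binary data,
   0 if it looks like text. */
int sample_looks_binary(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        /* Step 1: a null character is a strong hint of binary data. */
        if (buf[i] == '\0')
            return 1;

        /* Step 2: a control character that is not whitespace.
           (A null would also match here; the separate test above just
           mirrors the two steps of the description.) */
        if (iscntrl(buf[i]) && !isspace(buf[i]))
            return 1;
    }
    return 0;
}

/* Returns 1 for "probably binary", 0 for "probably text",
   -1 if the file cannot be opened. */
int file_looks_binary(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t n;
    FILE *fp = fopen(path, "r");  /* text mode, per the advice above */

    if (fp == NULL)
        return -1;

    n = fread(buf, 1, sizeof buf, fp);  /* inspect at most SAMPLE_SIZE bytes */
    fclose(fp);
    return sample_looks_binary(buf, n);
}
```

Text with embedded terminal escape sequences is reported as binary by this sketch, because ESC is a non-whitespace control character. That is exactly the ambiguous case mentioned above.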

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #7
In article <c5*************@ID-179017.news.uni-berlin.de> "osmium" <r1********@comcast.net> writes:
But in the final analysis you must prove a negative, which is a troublesome
thing to do.


Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nov 14 '05 #8
In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
In article <c5*************@ID-179017.news.uni-berlin.de> "osmium" <r1********@comcast.net> writes:
But in the final analysis you must prove a negative, which is a troublesome
thing to do.


Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #9

In article <Hw********@cwi.nl>, "Dik T. Winter" <Di********@cwi.nl> writes:
In article <c5*************@ID-179017.news.uni-berlin.de> "osmium" <r1********@comcast.net> writes:
> But in the final analysis you must prove a negative, which is a troublesome
> thing to do.


Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


Executable machine code for Pentiums and the like that's entirely
printable ASCII or UTF-16 is quite the rage these days, since it's
useful for exploiting some kinds of buffer overflows and other
security vulnerabilities. It shows up all the time on the security
mailing lists (Bugtraq, Vuln-Dev, etc).

Back to the original problem: If a file consists of nothing but
printable ASCII characters, then it is by definition an ASCII file.
It may not be human-readable text, but it's ASCII. Problem solved.

If the OP wants to determine the *intent* of a file, of course, that's
a bit harder, inasmuch as it's not even well-defined.

--
Michael Wojcik mi************@microfocus.com

How can I sing with love in my bosom?
Unclean, immature and unseasonable salmon. -- Basil Bunting
Nov 14 '05 #10
"Dan Pop" <Da*****@cern.ch> wrote in message
news:c5**********@sunnews.cern.ch...
In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)


It is arguable whether they qualify as "executables". Dik was probably
talking about EICAR or something similar. Look up EICAR in Google.

Peter
Nov 14 '05 #11
In article <40**********@mk-nntp-2.news.uk.tiscali.com> "Peter Pichler" <pi*****@pobox.sk> writes:
"Dan Pop" <Da*****@cern.ch> wrote in message
news:c5**********@sunnews.cern.ch...
In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
There has been an example of an executable which consisted only
of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)


It is arguable whether they qualify as "executables". Dik was probably
talking about EICAR or something similar. Look up EICAR in Google.


Nope, much older. It is an old story about a workstation where an rm -rf /
in progress was aborted, but not in time to avoid losing almost all object
files (/bin /usr/bin). Luckily there was a window open with a shell and a
window with a text editor (vi). On a similar machine an executable was
produced that consisted of ASCII characters only. This was entered in
the text editor, a file was written, a chmod was performed in the shell
(built-in), and so a start of recovery was found.

--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nov 14 '05 #12
In <40**********@mk-nntp-2.news.uk.tiscali.com> "Peter Pichler" <pi*****@pobox.sk> writes:
"Dan Pop" <Da*****@cern.ch> wrote in message
news:c5**********@sunnews.cern.ch...
In <Hw********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
>There has been an example of an executable which consisted only
>of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)


It is arguable whether they qualify as "executables".


There is nothing to argue about on a Unix system. If execve() can execute it,
it is an executable.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #13
