how to determine a file is ASCII or binary?

Hi, all

Since the OS treats both ASCII and binary files as a sequence of bytes, is
there any way to determine the file type other than by judging the extension?

Thank you!
Nov 14 '05 #1
<posted & mailed>

There's no "ASCII" in C. There is a somewhat artificial distinction between
"text" and "binary": a "text" file is a special case of a binary file for which
the operating system might do something to the data as it is written to
disk to make it compatible with applications that operate on text.

Since there's no definition of what that magic might be, there's likewise no
way to distinguish a "text" file from a "binary" file. All text files are
binary files. The only way to recognize a text file would be to check if
the file matches the local environment's criteria for a "text" file (and
most environments don't have the concept of a "text" file at all).

The canonical example is CP/M (and Microsoft's products, which harken back
to it). There, if you open a file for writing as a "text" file, every "\n"
that is written becomes "\r\n" on disk, and when you close the file, "\032"
is appended to the end of the file. When you read from the text file, the
reverse operations occur. Windows still does the "\n" translation and still
treats "\032" as end-of-file when reading in text mode, although its C
runtime does not append one on close. The only way you
could differentiate between a text file and binary file would be to be
armed with this information, then open the target file in binary mode and
check that every byte in the file returns true for isprint() or isspace()
except the last byte in the file, which must equal '\032'. If so, you know
the file is a text file. You don't need to test if the file is a binary
file, since all files are.
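
The check described above could be sketched roughly as follows. This is only an illustration: the function name and the require_ctrl_z flag are made up here, the caller is assumed to have read the file's bytes into memory after opening it in binary mode ("rb"), and the default "C" locale is assumed for isprint() and isspace().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte is printable or
   whitespace, apart from a single trailing Ctrl-Z ('\032').
   If require_ctrl_z is nonzero the trailing marker is mandatory,
   as in the CP/M convention; otherwise it is merely tolerated. */
int looks_like_cpm_text(const unsigned char *buf, size_t len,
                        int require_ctrl_z)
{
    assert(buf != NULL || len == 0);

    if (len > 0 && buf[len - 1] == '\032')
        len--;                  /* skip the end-of-file marker */
    else if (require_ctrl_z)
        return 0;               /* marker demanded but missing */

    for (size_t i = 0; i < len; i++) {
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;           /* neither printable nor whitespace */
    }
    return 1;
}
```

Passing 0 for require_ctrl_z gives a looser test that would also accept text written on systems that never append the marker.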

It gets more complicated in modern days where multiple character sets and
various encodings are used for text... In that case, the encoding needs to
be indicated within the file somehow and that frequently presumes multibyte
character sets, etc., which already preclude them from being treated as
simple text files in the first place.
Sunner Sun wrote:
> Hi, all
>
> Since the OS treats both ASCII and binary files as a sequence of bytes, is
> there any way to determine the file type other than by judging the extension?
>
> Thank you!


--
remove .spam from address to reply by e-mail.
Nov 14 '05 #2
On Fri, 9 Apr 2004 21:46:18 +0800, "Sunner Sun" <su********@163 .com> wrote:
> Hi, all
>
> Since the OS treats both ASCII and binary files as a sequence of bytes, is
> there any way to determine the file type other than by judging the extension?

Portably, in C? Nah, because a "binary" file can simply mimic an ASCII file
and no one, or no /thing/, could possibly tell you whether the data was
written in binary mode or text mode.

The best you can do is take the approach of the Unix "file" command. Here's
a sample output I just got from running it under Cygwin on a text file:

[/home/leor] $ file s2
s2: ASCII English text, with CRLF line terminators

It looks at the first few bytes (along with perhaps platform-specific inode
info in this case) and "takes its best shot".
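
For what it's worth, the "look at the first few bytes" idea can be sketched in a few lines of C. The table below is a tiny hand-picked sample of well-known signatures, and the struct and function names are invented for this illustration; the real file command consults a far larger database and applies further tests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative signature table (an assumption, not file's database). */
struct magic {
    const char *bytes;   /* leading bytes to match */
    size_t      len;     /* number of bytes in the signature */
    const char *desc;    /* human-readable description */
};

static const struct magic known[] = {
    { "\x7f" "ELF", 4, "ELF executable or object" },
    { "MZ",         2, "DOS/Windows executable"   },
    { "\x1f\x8b",   2, "gzip compressed data"     },
    { "%PDF-",      5, "PDF document"             },
};

/* Returns a description if the buffer starts with a known signature,
   or NULL so the caller can fall back to other heuristics. */
const char *guess_by_magic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (len >= known[i].len &&
            memcmp(buf, known[i].bytes, known[i].len) == 0)
            return known[i].desc;
    }
    return NULL;
}
```

A NULL result says nothing either way; it only means none of the listed signatures matched.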
-leor
> Thank you!


--
Leor Zolman --- BD Software --- www.bdsoft.com
On-Site Training in C/C++, Java, Perl and Unix
C++ users: Download BD Software's free STL Error Message Decryptor at:
www.bdsoft.com/tools/stlfilt.html
Nov 14 '05 #3
Sunner Sun writes:
> Since the OS treats both ASCII and binary files as a sequence of bytes, is
> there any way to determine the file type other than by judging the extension?


There is no way to be certain. But note that most of the control characters
would not appear in an ASCII file.

You can make a pretty good guess by picking a subset containing most of the
ASCII control characters, then counting the characters in the file that are
in the subset. The count should be zero for an ASCII file.

But in the final analysis you must prove a negative, which is a troublesome
thing to do.
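
A rough sketch of that counting idea, assuming an ASCII execution character set. Which control characters belong in the subset is a guess made for this example (everything below 0x20 except tab, line feed, vertical tab, form feed and carriage return, plus DEL), and the function names are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed subset: control codes that ordinary text rarely contains. */
static int is_suspicious_control(unsigned char c)
{
    if (c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r')
        return 0;                   /* common in text files */
    return c < 0x20 || c == 0x7f;   /* other control codes and DEL */
}

/* Counts the bytes in buf that fall in the suspicious subset.
   A result of zero is consistent with an ASCII text file,
   but does not prove that the file is one. */
size_t count_suspicious(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += is_suspicious_control(buf[i]);
    return n;
}
```

Note that this subset counts Ctrl-Z and ESC as suspicious, so a CP/M-style text file or one with embedded terminal sequences would give a nonzero count.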
Nov 14 '05 #4
Sunner Sun wrote:
> Hi, all
>
> Since the OS treats both ASCII and binary files as a sequence of bytes, is
> there any way to determine the file type other than by judging the extension?
>
> Thank you!


In addition to what the others have replied, there is no
guarantee in some operating systems that an extension reflects
the actual type of the file.

For example, in MS-DOS land, one could create a file
containing "The big Ogre" and give it an extension of
".exe". On the other hand, one could rename an executable,
such as "command.com", to "command.txt".

Whether a file is binary or ASCII is an attribute of the
file. Maintaining file attributes is the responsibility
of the operating system (and perhaps the application
creating the file).

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Nov 14 '05 #5
In <%5ydc.107110$K 91.305670@attbi _s02> James McIninch <ja************ @comcast.net.sp am> writes:
> <posted & mailed>
>
> There's no "ASCII" in C. There is a somewhat artificial distinction between
> "text" and "binary". "text" being a special case of a binary file whereby
> the operating system might do something to the data as it is written to the
> disk to make it compatible with applications that operate on text.


The difference is as natural as you can get on those systems where binary
and text files are completely different beasts. Unix and Windows do not
define the whole world of hosted computing...

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #6
In <c5**********@n ews.yaako.com> "Sunner Sun" <su********@163 .com> writes:
> Since the OS treats both ASCII and binary files as a sequence of bytes, is
> there any way to determine the file type other than by judging the extension?


If *your* implementation allows opening a file in the wrong (text vs
binary) mode (that's technically undefined behaviour), you can try
opening it in text mode and using some heuristics to decide whether
it contains text or binary data. I wouldn't recommend opening it in
binary mode, as that could expose some of the internals of the text file's
representation and lead you to the wrong conclusion.

First, if you find any null character inside, it is reasonable to decide
that you have a binary file (text files seldom contain null characters,
as they upset any input function that returns a string, while binary files
almost always contain at least one null byte).

In the absence of a null character, try finding characters for which
iscntrl() is true but isspace() isn't. Any such beast is also a good
hint that the file is a binary file.

If the file is too large, you may want to restrict your search to the
first N bytes. There are also files containing essentially text, but
having embedded terminal/printer control sequences. It is hard to
say whether they qualify as text files or as binary files and even
harder to identify them.
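
The heuristics above might be sketched like this. The enum, the function names and the 512-byte chunk size are choices made for the illustration only; the caller decides how the stream was opened and how many bytes (limit) to examine, and the default "C" locale is assumed for iscntrl() and isspace().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

enum guess { GUESS_TEXT, GUESS_BINARY };

/* Applies both hints to a block of bytes that is already in memory. */
enum guess guess_bytes(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0')
            return GUESS_BINARY;    /* null character: strong hint */
        if (iscntrl(buf[i]) && !isspace(buf[i]))
            return GUESS_BINARY;    /* control code that is not whitespace */
    }
    return GUESS_TEXT;
}

/* Examines at most `limit` bytes read from an already opened stream. */
enum guess guess_stream(FILE *fp, size_t limit)
{
    unsigned char chunk[512];
    size_t seen = 0;

    assert(fp != NULL);
    while (seen < limit) {
        size_t want = limit - seen < sizeof chunk ? limit - seen
                                                  : sizeof chunk;
        size_t got = fread(chunk, 1, want, fp);
        if (got == 0)
            break;                  /* end of file or read error */
        if (guess_bytes(chunk, got) == GUESS_BINARY)
            return GUESS_BINARY;
        seen += got;
    }
    return GUESS_TEXT;
}
```

In the "C" locale the null test is already implied by the iscntrl() test; it is kept separate to mirror the two steps. A file with embedded escape sequences is reported as binary by this sketch, which is one arbitrary answer to the ambiguity mentioned above.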

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #7
In article <c5************ *@ID-179017.news.uni-berlin.de> "osmium" <r1********@com cast.net> writes:
> But in the final analysis you must prove a negative, which is a troublesome
> thing to do.


Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nov 14 '05 #8
In <Hw********@cwi .nl> "Dik T. Winter" <Di********@cwi .nl> writes:
> In article <c5************ *@ID-179017.news.uni-berlin.de> "osmium" <r1********@com cast.net> writes:
>> But in the final analysis you must prove a negative, which is a troublesome
>> thing to do.
>
> Indeed. There has been an example of an executable which consisted only
> of bytes in the printable ASCII range.


ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #9

In article <Hw********@cwi .nl>, "Dik T. Winter" <Di********@cwi .nl> writes:
> In article <c5************ *@ID-179017.news.uni-berlin.de> "osmium" <r1********@com cast.net> writes:
>> But in the final analysis you must prove a negative, which is a troublesome
>> thing to do.
>
> Indeed. There has been an example of an executable which consisted only
> of bytes in the printable ASCII range.


Executable machine code for Pentiums and the like that's entirely
printable ASCII or UTF-16 is quite the rage these days, since it's
useful for exploiting some kinds of buffer overflows and other
security vulnerabilities. It shows up all the time on the security
mailing lists (Bugtraq, Vuln-Dev, etc).

Back to the original problem: If a file consists of nothing but
printable ASCII characters, then it is by definition an ASCII file.
It may not be human-readable text, but it's ASCII. Problem solved.

If the OP wants to determine the *intent* of a file, of course, that's
a bit harder, inasmuch as it's not even well-defined.

--
Michael Wojcik mi************@ microfocus.com

How can I sing with love in my bosom?
Unclean, immature and unseasonable salmon. -- Basil Bunting
Nov 14 '05 #10
