Binary or Ascii Text?

Hi, everyone. I have a question. How can I identify whether a file is a
binary file or an ASCII text file? For instance, I wrote a piece of
code and saved it as "Test.c". I knew it was an ASCII text file. Then,
after compilation, I got a "Test" file, which was a binary executable
file. The problem is that I know the types of those two files only
because I ran the compilation myself. How can I make the
computer determine the type of a given file by writing code in C? Files are
all saved as 0's and 1's. What's the difference?

Please help me, thanks.

Mar 31 '06 #1
"Claude Yih" <wi******@gmail.com> writes:
> Hi, everyone. I got a question. How can I identify whether a file is a
> binary file or an ascii text file? For instance, I wrote a piece of
> code and saved as "Test.c". I knew it was an ascii text file. Then
> after compilation, I got a "Test" file and it was a binary executable
> file. The problem is, I know the type of those two files in my mind
> because I executed the process of compilation, but how can I make the
> computer know the type of a given file by writing code in C? Files are
> all save as 0's and 1's. What's the difference?


There is no general solution to this. Many systems don't actually
distinguish between text and binary files; a text file is just a file
that happens to consist of printable characters -- and what's
considered a printable character can vary. You can also look at line
terminators (ASCII LF on Unix-like systems, an ASCII CR-LF sequence on
Windows-like systems, possibly something completely different
elsewhere).

<OT>Unix-like systems have a command called "file" that attempts to
classify a file based on its contents.</OT>

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 31 '06 #2
"Claude Yih" writes:
> Hi, everyone. I got a question. How can I identify whether a file is a
> binary file or an ascii text file? For instance, I wrote a piece of
> code and saved as "Test.c". I knew it was an ascii text file. Then
> after compilation, I got a "Test" file and it was a binary executable
> file. The problem is, I know the type of those two files in my mind
> because I executed the process of compilation, but how can I make the
> computer know the type of a given file by writing code in C? Files are
> all save as 0's and 1's. What's the difference?


The best you can do is make a guess. The first 32 characters of ASCII are
control codes, and only a few of them (CR, LF, FF, HT (tab), ...) are normally
present in text files. So if a file contains quite a few of the other 25 or so
codes, it is probably not a text file. But that is only an educated guess, not
proof.
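
A minimal sketch of that guess in C, for anyone who wants something to start from. It assumes the file's bytes are already in a buffer and that the data is ASCII-based. The function names and the percentage threshold are made up for illustration; nothing in this thread prescribes them.

```c
#include <assert.h>
#include <stddef.h>

/* Control codes that commonly appear in text: HT, LF, FF and CR (ASCII). */
static int is_texty_control(unsigned char c)
{
    return c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D;
}

/* Counts bytes below 0x20 that are not HT, LF, FF or CR. */
size_t count_suspicious_controls(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] < 0x20 && !is_texty_control(buf[i]))
            n++;
    }
    return n;
}

/* Guess "probably text" if at most max_percent of the bytes are suspicious.
   The threshold is an arbitrary assumption; tune it for your own data. */
int looks_like_text(const unsigned char *buf, size_t len, unsigned max_percent)
{
    if (len == 0)
        return 1; /* an empty buffer is trivially text-like */
    return count_suspicious_controls(buf, len) * 100
           <= (size_t)max_percent * len;
}
```

Calling looks_like_text(buf, n, 5) on the first few kilobytes of a file gives a yes/no guess. As noted above, it is only a guess and will be wrong for some files.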

But note that this is not what C programmers mean when they discuss text vs.
binary, for example in some of the file functions. What is referred to
there is the distinction between two ways of handling ends of lines. Is an
end of line marked by a single character (LF) or by two characters, <CR><LF>?
Unix uses only LF to mark the end of a line, so there the distinction is
meaningless. Systems that use <CR><LF> or <LF><CR> have to examine the
stream and convert the two characters into one, called '\n'. So '\n' is
really <LF>.

When you open a file in binary mode, you are telling the world: Hey, you
there, keep your cotton-picking hands off this file.
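
One way to see the difference between the two modes is to write the same lines in each mode and count the bytes that actually land on disk. The sketch below does that; the function name stored_size and the scratch file name in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Writes two short lines to `path` using the given fopen mode ("w" or "wb"),
   then reopens the file in binary mode and returns the number of bytes
   actually stored, or -1 on error. */
long stored_size(const char *path, const char *write_mode)
{
    FILE *fp;
    long n = 0;

    assert(path != NULL && write_mode != NULL);

    fp = fopen(path, write_mode);
    if (fp == NULL)
        return -1;
    if (fputs("a\nb\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");   /* binary mode: bytes are read untranslated */
    if (fp == NULL)
        return -1;
    while (getc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}
```

stored_size("mode_demo.tmp", "wb") should report 4 everywhere. With "w", a Unix-like system should also report 4, while a system that stores <CR><LF> would be expected to report 6.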
Mar 31 '06 #3
"Claude Yih" <wi******@gmail.com> wrote:
# Hi, everyone. I got a question. How can I identify whether a file is a
# binary file or an ascii text file? For instance, I wrote a piece of
# code and saved as "Test.c". I knew it was an ascii text file. Then
# after compilation, I got a "Test" file and it was a binary executable
# file. The problem is, I know the type of those two files in my mind
# because I executed the process of compilation, but how can I make the
# computer know the type of a given file by writing code in C? Files are

As far as stdio is concerned, a binary file is what you get if you
include a "b" in the open mode; otherwise it's text mode. Binary and
text streams may differ in how end-of-line indicators are handled and in
how fseek offsets are interpreted. (In Unix, stdio treats binary and
text files identically.)

--
SM Ryan http://www.rawbw.com/~wyrmwif/
GERBILS
GERBILS
GERBILS
Mar 31 '06 #4
osmium writes:
> The best you can do is make a guess. The first 32 characters of ASCII are
> control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
> in text files. So if you have quite a few of the other 25 or so codes, it is
> probably not a text file - but it's only an educated guess, no real proof.


Well, as a matter of fact, I just had an idea for handling that problem, but
I don't know if it is feasible.

We know that ASCII text uses only 7 bits of each byte, and that the high bit
is always 0. So I wonder if I could write a program that reads a
fixed-length prefix of a given file (for example, the first 1024 bytes),
stores it in an unsigned char array, and checks whether any
element is greater than 0x7F. If so, the file can be judged to be a binary
file.

However, the disadvantage of this method is that it cannot handle
multi-byte characters. Take Japanese text in UTF-8, for
example: a Japanese character may be encoded as three bytes, and some of
them may be greater than 0x7F. In that case, my method makes no
sense.

Mar 31 '06 #5
"Claude Yih" writes:
> The best you can do is make a guess. The first 32 characters of ASCII are
> control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
> in text files. So if you have quite a few of the other 25 or so codes, it is
> probably not a text file - but it's only an educated guess, no real proof.
>
> Well, as matter of fact, I just got an idea to handle that problem. But
> I don't know if it is feasible.
>
> Now that we know ascii text only use 7 bits of a byte and the first bit
> is always set as 0. So I wonder if I could write a program to get a
> fixed length of a given file(for example, the first 1024 bytes) , to
> store them in a unsigned char array and to check if there is any
> elements greater than 0x7F. If any, the file can be judged as a binary
> file.
>
> However, the disadvantage of the above method is that it cannot handle
> the multi-byte character. Take the UTF-8's japanese character for
> example, a japanese character may be encoded as three bytes and some of
> them may be greater than 0x7F. In that case, my method will make no
> sense.

It doesn't work, but it has nothing to do with UTF-8. It is the problem of
proving a negative. How many white crows are there? AFAIK no one has ever
*seen* a white crow. What does that prove? Your guess is not as good as
the guess I implicitly proposed.
Mar 31 '06 #6
Claude Yih wrote:
> Hi, everyone. I got a question. How can I identify whether a file is a
> binary file or an ascii text file? For instance, I wrote a piece of
> code and saved as "Test.c". I knew it was an ascii text file. Then
> after compilation, I got a "Test" file and it was a binary executable
> file. The problem is, I know the type of those two files in my mind
> because I executed the process of compilation, but how can I make the
> computer know the type of a given file by writing code in C? Files are
> all save as 0's and 1's. What's the difference?
>
> Please help me, thanks.

As others have said, this is an essentially arbitrary decision on your
part. Here we've standardized on a definition of "binary" that means:

Any line is longer than X bytes (where "X" is some arbitrarily high number,
like 16 or 23k), or any character within the file is \0 (NUL).

A line is defined as the data between newlines (normalized to '\n').

Everything else fits into a reasonable notion of OEM or ANSI charset,
with some caveats.

Again, this is specific to application requirements. Your requirements
may vary.
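
For illustration only, the rule above could be sketched like this. It is not our production code; the function name is invented, max_line stands in for "X", and the buffer is assumed to have had its newlines normalized to '\n' already.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 ("binary") if the buffer contains a NUL byte or a line longer
   than max_line bytes; otherwise returns 0. Lines are delimited by '\n'. */
int is_binary_by_rule(const unsigned char *buf, size_t len, size_t max_line)
{
    size_t line_len = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0')
            return 1;            /* any NUL byte means binary */
        if (buf[i] == '\n') {
            line_len = 0;        /* a new line starts after the newline */
        } else if (++line_len > max_line) {
            return 1;            /* line exceeds the chosen limit */
        }
    }
    return 0;
}
```

A caller would read the file (or a prefix of it) in binary mode and pass whatever limit suits the application.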
Mar 31 '06 #7
"osmium" <r1********@comcast.net> writes:
> "Claude Yih" writes:
> The best you can do is make a guess. The first 32 characters of ASCII are
> control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
> in text files. So if you have quite a few of the other 25 or so codes, it is
> probably not a text file - but it's only an educated guess, no real proof.
>
> Well, as matter of fact, I just got an idea to handle that problem. But
> I don't know if it is feasible.
>
> Now that we know ascii text only use 7 bits of a byte and the first bit
> is always set as 0. So I wonder if I could write a program to get a
> fixed length of a given file(for example, the first 1024 bytes) , to
> store them in a unsigned char array and to check if there is any
> elements greater than 0x7F. If any, the file can be judged as a binary
> file.
>
> However, the disadvantage of the above method is that it cannot handle
> the multi-byte character. Take the UTF-8's japanese character for
> example, a japanese character may be encoded as three bytes and some of
> them may be greater than 0x7F. In that case, my method will make no
> sense.
>
> It doesn't work, but it has nothing to do with UTF-8. It is the problem of
> proving a negative. How many white crows are there? AFAIK no one has ever
> *seen* a white crow. What does that prove? Your guess is not as good as
> the guess I implicitly proposed.


The quoting is completely messed up. The paragraph starting with "The
best you can do" was written by osmium, the next three paragraphs
were written by Claude Yih, and the last paragraph, starting with "It
doesn't work", was written by osmium (who usually gets this stuff
right).

Mar 31 '06 #8
"Claude Yih" <wi******@gmail.com> writes:
> osmium writes:
>> The best you can do is make a guess. The first 32 characters of ASCII are
>> control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
>> in text files. So if you have quite a few of the other 25 or so codes, it is
>> probably not a text file - but it's only an educated guess, no real proof.
>
> Well, as matter of fact, I just got an idea to handle that problem. But
> I don't know if it is feasible.
>
> Now that we know ascii text only use 7 bits of a byte and the first bit
> is always set as 0. So I wonder if I could write a program to get a
> fixed length of a given file(for example, the first 1024 bytes) , to
> store them in a unsigned char array and to check if there is any
> elements greater than 0x7F. If any, the file can be judged as a binary
> file.


I think that's fairly close to what the Unix "file" command does.
(Versions of the command are available as open source; see
<ftp://ftp.astron.com/pub/file/>.)

As mentioned above, you should also check for control characters.

> However, the disadvantage of the above method is that it cannot handle
> the multi-byte character. Take the UTF-8's japanese character for
> example, a japanese character may be encoded as three bytes and some of
> them may be greater than 0x7F. In that case, my method will make no
> sense.


Multi-byte characters aren't the only problem. ISO-8859-1 is an
extension of ASCII that uses codes from 161 to 255 for printable
characters (there are several ISO-8859-N standards).
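
As a rough sketch of what accepting ISO-8859-1 might mean for such a check, the code below treats printable ASCII plus the range 161..255 as text. The function names are invented for illustration, and the sketch assumes the data really is ISO-8859-1, which the bytes themselves cannot tell you. Note that it counts 160 (the no-break space) as non-text, following the range given above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if `c` is printable ASCII (0x20..0x7E) or lies in the range
   161..255 (0xA1..0xFF). Bytes 0x80..0x9F are not assigned to printable
   characters in ISO-8859-1. */
int is_latin1_printable(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) || c >= 0xA1;
}

/* Counts bytes that are neither printable by the test above nor one of
   LF, CR or HT. */
size_t count_non_latin1_text(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (!is_latin1_printable(c) && c != 0x0A && c != 0x0D && c != 0x09)
            n++;
    }
    return n;
}
```

A file in some other 8-bit encoding, or in UTF-8, can pass this test just as easily, so it narrows the guess without settling it.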

And none of this is portable to all possible C implementations. Some
systems distinguish between text and binary files at the filesystem
level.

Whatever it is you're trying to do, your first line of defense should
be to arrange to know what type a file is before you open it. If that
fails, as it inevitably will in some cases, you can check the contents
as a fallback, but there's no 100% reliable way to do so.

If you're writing a program that's intended to work only on text
files, it might be best to decide what's acceptable *for that
program*. If you're displaying the contents of the file, for example,
you can establish a convention for displaying non-printable characters
in some readable form. If an input line is very long, you can wrap it
or truncate it. And so on.
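
One possible convention, sketched here under the assumption that only printable ASCII, newline and tab should pass through unchanged, is to show every other byte as a \xHH escape. The function name print_escaped is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Writes `len` bytes to `out`, replacing every byte outside printable ASCII
   (0x20..0x7E), except newline and tab, with a \xHH escape.
   Returns the number of bytes that were escaped. */
size_t print_escaped(FILE *out, const unsigned char *buf, size_t len)
{
    size_t escaped = 0;

    assert(out != NULL && (buf != NULL || len == 0));
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if ((c >= 0x20 && c <= 0x7E) || c == '\n' || c == '\t') {
            putc(c, out);
        } else {
            fprintf(out, "\\x%02X", (unsigned)c);
            escaped++;
        }
    }
    return escaped;
}
```

With this approach the program never has to decide whether the file "is" text; it just shows whatever it finds in a form that won't disturb the terminal.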

Mar 31 '06 #9
Me
Claude Yih wrote:
> Hi, everyone. I got a question. How can I identify whether a file is a
> binary file or an ascii text file? For instance, I wrote a piece of
> code and saved as "Test.c". I knew it was an ascii text file. Then
> after compilation, I got a "Test" file and it was a binary executable
> file. The problem is, I know the type of those two files in my mind
> because I executed the process of compilation, but how can I make the
> computer know the type of a given file by writing code in C? Files are
> all save as 0's and 1's. What's the difference?


Modern computers deal with much more than just ASCII, so trying to
determine the encoding is doomed to failure and to mysterious-acting
heuristics. All you should be doing is getting a filename from the user
and opening it in either text mode or binary mode. If you want to
distinguish between the two, then just have the user also specify which
mode they want to use.

Apr 1 '06 #10
