Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all saved as 0's and 1's. What's the difference?
Please help me, thanks.
"Claude Yih" <wi******@gmail.com> writes:
> How can I identify whether a file is a binary file or an ascii text
> file?
> [...]
There is no general solution to this. Many systems don't actually
distinguish between text and binary files; a text file is just a file
that happens to consist of printable characters -- and what's
considered a printable character can vary. You can also look at line
terminators (ASCII LF on Unix-like systems, an ASCII CR-LF sequence on
Windows-like systems, possibly something completely different
elsewhere).
<OT>Unix-like systems have a command called "file" that attempts to
classify a file based on its contents.</OT>
--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
"Claude Yih" writes:
> How can I identify whether a file is a binary file or an ascii text
> file?
> [...]
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), ...) are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.
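A minimal sketch of that guess in C, working on a buffer that has already been read from the file. The function names and the percentage threshold are hypothetical, chosen only for illustration, and the byte values assume ASCII. It counts control codes other than HT, LF, FF and CR and calls the data "probably text" when they are rare.

```c
#include <assert.h>
#include <stddef.h>

/* Count control codes (0x00-0x1F and 0x7F) that are unusual in text.
   HT (0x09), LF (0x0A), FF (0x0C) and CR (0x0D) are treated as normal.
   Byte values assume ASCII. */
size_t count_suspicious_controls(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int is_control = (c < 0x20) || (c == 0x7F);
        int is_common = (c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D);
        if (is_control && !is_common)
            n++;
    }
    return n;
}

/* Educated guess only: returns 1 if at most threshold_percent of the
   bytes are suspicious control codes, 0 otherwise. Empty input counts
   as text. */
int looks_like_text(const unsigned char *buf, size_t len,
                    unsigned threshold_percent)
{
    if (len == 0)
        return 1;
    return count_suspicious_controls(buf, len) * 100
           <= (size_t)threshold_percent * len;
}
```

A caller would typically fread() the first few kilobytes of a file opened with "rb" and pass that buffer in. The result is still a guess, not a proof.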
But note that this is not what is meant when C programmers discuss text vs.
binary, for example in some of the file functions. What is referred to
there is the distinction between two ways of handling end of lines. Is an
end of line marked by a single character (LF) or two characters <CR><LF>?
Unix uses only the LF to mark end of line, so the distinction is
meaningless there. Systems that use <CR><LF> or <LF><CR> have to examine the
stream and convert the two characters into one, called '\n'. So '\n' is
really <LF>.
When you open a file in binary mode, you are telling the world: Hey, you
there, keep your cotton-picking hands off this file.
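One way to see the difference between the two modes is to write a line in text mode and then count the raw bytes in binary mode. This is only a sketch; the function name and the file name passed to it are hypothetical. On a Unix-like system the count for "hi\n" would be expected to be 3, while a system that stores <CR><LF> would be expected to report 4.

```c
#include <assert.h>
#include <stdio.h>

/* Write "hi\n" in text mode, then reopen the file in binary mode and
   count the bytes actually stored. Returns the count, or -1 on an
   I/O failure. */
long raw_size_of_text_line(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "w");            /* text mode */
    if (fp == NULL)
        return -1;
    if (fputs("hi\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");                 /* binary mode: no translation */
    if (fp == NULL)
        return -1;
    long count = 0;
    while (fgetc(fp) != EOF)
        count++;
    fclose(fp);
    return count;
}
```

Calling raw_size_of_text_line("demo.txt") and printing the result shows what the text-mode stream did to '\n' on the system at hand.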
"Claude Yih" <wi******@gmail.com> wrote:
# Hi, everyone. I got a question. How can I identify whether a file is a
# binary file or an ascii text file? For instance, I wrote a piece of
# code and saved as "Test.c". I knew it was an ascii text file. Then
# after compilation, I got a "Test" file and it was a binary executable
# file. The problem is, I know the type of those two files in my mind
# because I executed the process of compilation, but how can I make the
# computer know the type of a given file by writing code in C? Files are
As far as stdio is concerned, a binary file is what you get if you
include a "b" in the open mode; otherwise it's text mode. Binary and
text files may differ in how end-of-line indicators are handled and in
how fseek offsets are interpreted. (In unix, stdio treats binary and
text files identically.)
--
SM Ryan http://www.rawbw.com/~wyrmwif/
GERBILS
GERBILS
GERBILS
osmium writes: The best you can do is make a guess. The first 32 characters of ASCII are control codes and only a few of them (CR, LF, FF, HT (tab), ...) are present in text files. So if you have quite a few of the other 25 or so codes, it is probably not a text file - but it's only an educated guess, no real proof.
Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.
We know ASCII text only uses 7 bits of each byte, and the high bit
is always 0. So I wonder if I could write a program to read a
fixed-length part of a given file (for example, the first 1024 bytes),
store it in an unsigned char array, and check whether any
element is greater than 0x7F. If so, the file can be judged to be a binary
file.
However, the disadvantage of the above method is that it cannot handle
multi-byte characters. Take UTF-8 Japanese text for
example: a Japanese character may be encoded as three bytes, and some of
them may be greater than 0x7F. In that case, my method will make no
sense.
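The idea described above could be sketched roughly as follows. The names and the SAMPLE_SIZE constant are made up for this example. Note that it flags any UTF-8 or ISO-8859-1 text containing non-ASCII characters as "binary", which is exactly the weakness mentioned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SAMPLE_SIZE 1024   /* how many leading bytes to inspect */

/* Returns 1 if any byte in buf is greater than 0x7F, else 0. */
int buffer_has_high_bit(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F)
            return 1;
    }
    return 0;
}

/* Reads up to SAMPLE_SIZE bytes from the start of the file.
   Returns 1 if a byte above 0x7F was seen, 0 if not, and -1 if the
   file could not be opened. */
int file_has_high_bit(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    FILE *fp = fopen(path, "rb");   /* binary mode: see the raw bytes */
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return buffer_has_high_bit(buf, n);
}
```

A result of 0 only says that the sampled bytes are 7-bit; it says nothing about the rest of the file or about control codes.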
"Claude Yih" writes: The best you can do is make a guess. The first 32 characters of ASCII are control codes and only a few of them (CR, LF, FF, HT (tab), .... are present in text files. So if you have quite a few of the other 25 or so codes, it is probably not a text file - but it's only an educated guess, no real proof.
Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.
Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.
However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F. In that case, my method will make no
sense.
It doesn't work, but it has nothing to do with UTF-8. It is the problem of
proving a negative. How many white crows are there? AFAIK no one has ever
*seen* a white crow. What does that prove? Your guess is not as good as
the guess I implicitly proposed.
Claude Yih wrote:
> How can I identify whether a file is a binary file or an ascii text
> file?
> [...]
As others have said, this is an essentially arbitrary decision on your
part. Here we've standardized on a definition of "binary" that means:
any line longer than X bytes (where "X" is some arbitrarily high number,
like 16 or 23k), or any character within the file that is \0 (null).
A line is defined as data between newlines (normalized to '\n').
Everything else fits into a reasonable notion of OEM or ANSI charset,
with some caveats.
Again, this is specific to application requirements. Your requirements
may vary.
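That rule is simple enough to sketch in C. The function name and the max_line parameter (the "X" above) are hypothetical, and the sketch assumes the data is already in memory with line endings normalized to '\n'.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 ("binary") if the buffer contains a NUL byte or any line
   longer than max_line bytes, and 0 otherwise. A line is the data
   between '\n' characters; the '\n' itself is not counted. */
int is_binary_by_rule(const unsigned char *buf, size_t len, size_t max_line)
{
    size_t line_len = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0')
            return 1;               /* NUL anywhere means binary */
        if (buf[i] == '\n') {
            line_len = 0;           /* start of a new line */
        } else if (++line_len > max_line) {
            return 1;               /* line too long */
        }
    }
    return 0;
}
```

The value passed as max_line is an application decision, as noted above.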
"osmium" <r1********@comcast.net> writes: "Claude Yih" writes:
The best you can do is make a guess. The first 32 characters of ASCII are control codes and only a few of them (CR, LF, FF, HT (tab), ...) are present in text files. So if you have quite a few of the other 25 or so codes, it is probably not a text file - but it's only an educated guess, no real proof.
Well, as matter of fact, I just got an idea to handle that problem. But I don't know if it is feasible.
Now that we know ascii text only use 7 bits of a byte and the first bit is always set as 0. So I wonder if I could write a program to get a fixed length of a given file(for example, the first 1024 bytes) , to store them in a unsigned char array and to check if there is any elements greater than 0x7F. If any, the file can be judged as a binary file.
However, the disadvantage of the above method is that it cannot handle the multi-byte character. Take the UTF-8's japanese character for example, a japanese character may be encoded as three bytes and some of them may be greater than 0x7F. In that case, my method will make no sense.
It doesn't work, but it has nothing to do with UTF-8. It is the problem of proving a negative. How many white crows are there? AFAIK no one has ever *seen* a white crow. What does that prove? Your guess is not as good as the guess I implicitly proposed.
The quoting is completely messed up. The paragraph starting with "The
best you can do" was written by osmium, the next three paragraphs
were written by Claude Yih, and the last paragraph, starting with "It
doesn't work", was written by osmium (who usually gets this stuff
right).
--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
"Claude Yih" <wi******@gmail.com> writes: osmium writes: The best you can do is make a guess. The first 32 characters of ASCII are control codes and only a few of them (CR, LF, FF, HT (tab), ...) are present in text files. So if you have quite a few of the other 25 or so codes, it is probably not a text file - but it's only an educated guess, no real proof. Well, as matter of fact, I just got an idea to handle that problem. But I don't know if it is feasible.
Now that we know ascii text only use 7 bits of a byte and the first bit is always set as 0. So I wonder if I could write a program to get a fixed length of a given file(for example, the first 1024 bytes) , to store them in a unsigned char array and to check if there is any elements greater than 0x7F. If any, the file can be judged as a binary file.
I think that's fairly close to what the Unix "file" command does.
(Versions of the command are available as open source; see
<ftp://ftp.astron.com/pub/file/>.)
As mentioned above, you should also check for control characters.
However, the disadvantage of the above method is that it cannot handle the multi-byte character. Take the UTF-8's japanese character for example, a japanese character may be encoded as three bytes and some of them may be greater than 0x7F. In that case, my method will make no sense.
Multi-byte characters aren't the only problem. ISO-8859-1 is an
extension of ASCII that uses codes from 161 to 255 for printable
characters (there are several ISO-8859-N standards).
And none of this is portable to all possible C implementations. Some
systems distinguish between text and binary files at the filesystem
level.
Whatever it is you're trying to do, your first line of defense should
be to arrange to know what type a file is before you open it. If that
fails, as it inevitably will in some cases, you can check the contents
as a fallback, but there's no 100% reliable way to do so.
If you're writing a program that's intended to work only on text
files, it might be best to decide what's acceptable *for that
program*. If you're displaying the contents of the file, for example,
you can establish a convention for displaying non-printable characters
in some readable form. If an input line is very long, you can wrap it
or truncate it. And so on.
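As one possible convention for displaying non-printable characters, the sketch below copies printable ASCII bytes unchanged and renders everything else as \xNN. The function name is hypothetical, the byte ranges assume ASCII, and a real program would also need to decide how to show a literal backslash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Render buf into out as a NUL-terminated string. Printable ASCII
   (0x20-0x7E) is copied as-is; any other byte becomes "\xNN".
   Stops early if out is too small. Returns the number of characters
   written, not counting the terminating NUL. */
size_t render_printable(const unsigned char *buf, size_t len,
                        char *out, size_t out_size)
{
    size_t used = 0;
    assert(out != NULL && (buf != NULL || len == 0));
    if (out_size == 0)
        return 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c >= 0x20 && c <= 0x7E) {
            if (used + 1 >= out_size)   /* need room for char + NUL */
                break;
            out[used++] = (char)c;
        } else {
            if (used + 4 >= out_size)   /* need room for \xNN + NUL */
                break;
            snprintf(out + used, out_size - used, "\\x%02X", (unsigned)c);
            used += 4;
        }
    }
    out[used] = '\0';
    return used;
}
```

With this convention a tab between 'a' and 'b' is shown as a\x09b, so the output stays readable whatever the file contains.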
--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Claude Yih wrote:
> How can I identify whether a file is a binary file or an ascii text
> file?
> [...]
Modern computers deal with much more than just ASCII so trying to
determine the encoding is doomed to failure and mysterious-acting
heuristics. All you should be doing is getting a filename from the user
and opening it in either text mode or binary mode. If you want to
distinguish between the two then just have the user also input which
mode they want to use.