Determine file type (binary or text)

Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..
Thanks in adv.
Jul 18 '05 #1
> How can I check if a file is binary or text?

>>> import os
>>> f = os.popen('file -bi test.py', 'r')
>>> f.read().startswith('text')
1

(btw, f.read() returns 'text/x-java; charset=us-ascii\n')

--
bromden[at]gazeta.pl

Jul 18 '05 #2
> >>> f = os.popen('file -bi test.py', 'r')
> >>> f.read().startswith('text')

Sorry, that's not general, since "file -i" returns
"application/x-shellscript" for shell scripts.
It's better to do it like this:

>>> import os
>>> f = os.popen('file test.py', 'r')
>>> f.read().find('text') != -1
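
A present-day sketch of the same idea, assuming Python 3.7+ and a `file` build that accepts `--mime-type` (an assumption; other builds may differ). The helper names `mime_is_text` and `file_says_text` are made up for illustration. The second function returns None when no `file` command is on the PATH, which is the situation on a stock Windows machine:

```python
import shutil
import subprocess


def mime_is_text(mime_output):
    """Return True if a MIME string such as 'text/x-python' names a text type."""
    return mime_output.strip().startswith("text/")


def file_says_text(path):
    """Ask the external `file` command about path; None if `file` is missing."""
    if shutil.which("file") is None:
        return None  # no `file` executable available
    result = subprocess.run(
        ["file", "-b", "--mime-type", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return mime_is_text(result.stdout)
```

As noted above, a prefix test on the MIME type misses anything that `file` reports under `application/`, so this is only as good as the installed magic database.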


--
bromden[at]gazeta.pl

Jul 18 '05 #3
Works well in Unix but I'm making a script that works on both
Unix and Windows.

Win doesn't have that 'file -bi' command.

"bromden" <br*****@gazeta.pl.no.spam> wrote in message
news:bh**********@absinth.dialog.net.pl...
How can I check if a file is binary or text?

>>> import os
>>> f = os.popen('file -bi test.py', 'r')
>>> f.read().startswith('text')

1

(btw, f.read() returns 'text/x-java; charset=us-ascii\n')

--
bromden[at]gazeta.pl

Jul 18 '05 #4
Hi,
yes there is more than just Unix in the world ;-)
Windows directories have no means to specify their content type in any way.
The approved method is using three-letter extensions, though this rule is
not strictly followed (lots of files without extensions nowadays!)

When I had a similar problem I read 1000 characters, counted the amount of
<32 and >255 characters and classified it "binary" when this quota exceeded
20%. I have no idea whether it will work well with Chinese Unicode files or
some funny depositories or project files that store uncompressed texts....

Kindly
Michael P

"Sami Viitanen" <no**@none.net> wrote in message
news:v7*****************@news2.nokia.com...
Works well in Unix but I'm making a script that works on both
Unix and Windows.

Win doesn't have that 'file -bi' command.

"bromden" <br*****@gazeta.pl.no.spam> wrote in message
news:bh**********@absinth.dialog.net.pl...
How can I check if a file is binary or text?

>>> import os
>>> f = os.popen('file -bi test.py', 'r')
>>> f.read().startswith('text')

1

(btw, f.read() returns 'text/x-java; charset=us-ascii\n')

--
bromden[at]gazeta.pl


Jul 18 '05 #5
Michael Peuser wrote:
> Hi,
> yes there is more than just Unix in the world ;-)
> Windows directories have no means to specify their content type in any way.

That's even more true with Linux/Unix, as there is no need to do
any stuff like line-terminator conversion.

> The approved method is using three-letter extensions, though this rule is
> not strictly followed (lots of files without extensions nowadays!)
>
> When I had a similar problem I read 1000 characters, counted the amount of
> <32 and >255 characters and classified it "binary" when this quota exceeded
> 20%. I have no idea whether it will work well with Chinese Unicode files or
> some funny depositories or project files that store uncompressed texts....

Based on the idea from Mr. "bromden", why not use mimetypes.MimeTypes()
and guess_type('file://...') and analyze the returned string?
This should work on Windows / Linux / Unix / whatever.
Karl
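
A minimal sketch of Karl's suggestion, assuming Python 3 and the module-level `mimetypes.guess_type` function. The helper name `looks_like_text_by_name` is hypothetical. Note that `guess_type` only inspects the file name or URL, never the file contents, so this is really the extension method in portable form:

```python
import mimetypes


def looks_like_text_by_name(filename):
    """Classify by file name only: True, False, or None when nothing is known."""
    mime_type, encoding = mimetypes.guess_type(filename)
    if encoding is not None:
        return False  # compressed wrapper such as .gz, so the bytes are not text
    if mime_type is None:
        return None   # unknown or missing extension
    return mime_type.startswith("text/")


print(looks_like_text_by_name("notes.txt"))
print(looks_like_text_by_name("photo.png"))
print(looks_like_text_by_name("README"))
```

A file with no extension yields None, which is exactly the case Michael raised, so a content-based check is still needed as a fallback.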


> Kindly
> Michael P
>
> "Sami Viitanen" <no**@none.net> wrote in message
> news:v7*****************@news2.nokia.com...
> > Works well in Unix but I'm making a script that works on both
> > Unix and Windows.
> >
> > Win doesn't have that 'file -bi' command.
> >
> > "bromden" <br*****@gazeta.pl.no.spam> wrote in message
> > news:bh**********@absinth.dialog.net.pl...
> > > > How can I check if a file is binary or text?
> > >
> > > >>> import os
> > > >>> f = os.popen('file -bi test.py', 'r')
> > > >>> f.read().startswith('text')
> > > 1
> > >
> > > (btw, f.read() returns 'text/x-java; charset=us-ascii\n')
> > >
> > > --
> > > bromden[at]gazeta.pl




Jul 18 '05 #6
Sami Viitanen wrote:

How can I check if a file is binary or text?

There was some easy way but I forgot it..


First you need to define what you mean by binary and text.
Is a file "text" simply because it contains only the
printable (in ASCII) bytes between 31 and 127, plus
CR and/or LF, or do you have a more complex definition
in mind?

Better yet, what do you need the information for? Maybe
the answer to that will show us the proper path to take.
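
For reference, the strict definition mentioned above (printable ASCII plus CR and LF) can be sketched as follows, assuming Python 3 and data already read in binary mode. The function name `is_printable_ascii` is made up for this example:

```python
def is_printable_ascii(data):
    """True if every byte is printable ASCII (32..126) or CR/LF."""
    allowed = set(range(32, 127)) | {0x0A, 0x0D}
    return all(byte in allowed for byte in data)


print(is_printable_ascii(b"hello\r\nworld\n"))
print(is_printable_ascii(b"col1\tcol2\n"))
```

Under this definition a tab-separated file is not "text", because HT (9) is outside the allowed set, which shows how much the answer depends on the definition chosen.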
Jul 18 '05 #7
[Sami Viitanen wrote]
Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..


Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary). Assuming that, then:

def is_binary(filename):
    """Return true iff the given filename is binary.

    Raises an EnvironmentError if the file does not exist or cannot be
    accessed.
    """
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while True:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found null byte
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done (a short read means end of file)
    finally:
        fin.close()

    return False

Cheers,
Trent
--
Trent Mick
Tr****@ActiveState.com

Jul 18 '05 #8
In article <AF******************@news1.nokia.com>, Sami Viitanen wrote:
> How can I check if a file is binary or text?

In order to provide an answer, you'll have to define "binary"
and "text".

> There was some easy way but I forgot it..


To _me_ a file isn't "binary" or "text". Those are two modes
you can use to read a file. The file itself is neutral on the
matter. At least under Windows and Unix. VMS and FILES-11
contained a _lot_ more meta-data and actually did have several
different fundamental file types (fixed length records,
variable length records, byte-stream, etc.).

--
Grant Edwards (grante at visi.com)
Yow! Will it improve my CASH FLOW?
Jul 18 '05 #9
Trent Mick wrote:

[Sami Viitanen wrote]
Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..


Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary).


"Contains only printable characters" is probably a more useful definition
of text in many cases. I can't say off the top of my head exactly when
either definition might be a problem.... wait, how about this one: in
CVS, if you don't have a file that is effectively line-oriented, human
readable information, you probably don't want to let it be treated as
"text" and stored as diffs. In that situation, "contains primarily
printable characters organized in lines" is probably a more thorough,
though less deterministic, definition.

-Peter
Jul 18 '05 #10
Trent Mick wrote:
[Sami Viitanen wrote]

Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..


Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary).


Dangerous assumption. Even if many or most binary files contain NULs, it
doesn't mean that they all do.

It is trivial to create a non-text file that has no NULs.

f = open('no_zeroes.bin', 'rb')
for x in range(1, 256):
    f.write(chr(x))
f.close()

Sami, I would suggest that you need to stop thinking in terms of tools,
and instead think in terms of the problem you're trying to solve. Why do
you need to (or think you need to) determine whether a file is "binary"
or "text"? Why would your application fail if it received a
(binary/text) file when it expected a (text/binary) one?

My guess is that the trait you are trying to identify will prove not to
be "binary or text", but something more application-specific.

-- Graham

P.S. Sami, it's very bad form to "make up" an e-mail address, such as
<no**@none.net>. I'm sure the owners of the none.net domain would agree.
Can't you provide a real address?

Jul 18 '05 #11
Grant Edwards wrote:

In article <3F***************@engcorp.com>, Peter Hansen wrote:
"Contains only printable characters" is probably a more useful definition
of text in many cases.


The definition of "printable" is dependent on the character
set, that will have to be specified.


That's why I said "printable (in ASCII)" in another message, so I
definitely agree. The problem was rather under-specified. :-)
Jul 18 '05 #12
"Michael Peuser" <mp*****@web.de> wrote in message news:<bh*************@news.t-online.com>...

> When I had a similar problem I read 1000 characters, counted the amount of
> <32 and >255 characters and classified it "binary" when this quota exceeded


How many characters > 255 did you get? Did you mean 127? If so, what
about accented characters ... like umlauts?

On a slightly more serious note, CR, LF, HT and FF would have to be
considered "text" but their ordinal values are < 32.

What was the problem that you thought you were solving?
Jul 18 '05 #13
Trent Mick <tr****@ActiveState.com> wrote in message news:<ma**********************************@python.org>...
Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary).


Data file written by C program which has an off-by-one error and is
including a trailing '\0' byte ...
Jul 18 '05 #14
Graham Fawcett <fa*****@teksavvy.com> wrote in message news:<ma**********************************@python.org>...

It is trivial to create a non-text file that has no NULs.

f = open('no_zeroes.bin', 'rb')
for x in range(1, 256):
    f.write(chr(x))
f.close()


I tried this but it didn't work. It said:

IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.

So I thought I had to be persistent but after doing it a few more times it said:

SerialIdiotError: What I tell you three times is true.
NotLispingError: You need 'wb' as in 'wascally wabbit'

This is very strange behaviour -- does my computer have worms?
Jul 18 '05 #15
John Machin wrote:
Graham Fawcett <fa*****@teksavvy.com> wrote in message news:<ma**********************************@python.org>...

It is trivial to create a non-text file that has no NULs.

f = open('no_zeroes.bin', 'rb')
for x in range(1, 256):
    f.write(chr(x))
f.close()


I tried this but it didn't work. It said:

IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.

So I thought I had to be persistent but after doing it a few more times it said:

SerialIdiotError: What I tell you three times is true.
NotLispingError: You need 'wb' as in 'wascally wabbit'

This is very strange behaviour -- does my computer have worms?


No, but my brain does. Glad you caught my typo.

However, it looks like your computer definitely has an AttitudeError!

-- Graham
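
With the mode corrected to 'wb', the counterexample can be sketched in Python 3 as below. This is an illustration rather than Graham's original code: it writes into a temporary directory, and the UTF-8 check at the end is an added assumption about what "non-text" might mean.

```python
import os
import tempfile

payload = bytes(range(1, 256))  # every byte value except 0

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "no_zeroes.bin")
    with open(path, "wb") as f:   # 'wb', not 'rb'
        f.write(payload)
    with open(path, "rb") as f:
        contents = f.read()

has_nul = b"\0" in contents       # False: a null-byte test calls this "text"

try:
    contents.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False       # bytes such as 0x80 are not valid UTF-8 here

print(has_nul, decodes_as_utf8)
```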

Jul 18 '05 #16
John Machin wrote:

Trent Mick <tr****@ActiveState.com> wrote in message news:<ma**********************************@python.org>...
Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary).


Data file written by C program which has an off-by-one error and is
including a trailing '\0' byte ...


To be fair, I'd call that a "binary" file in any case, or at least
a defective text file...
Jul 18 '05 #17
Peter Hansen <pe***@engcorp.com> wrote in message news:<3F***************@engcorp.com>...
"Contains only printable characters" is probably a more useful definition
of text in many cases. I can't say off the top of my head exactly when
either definition might be a problem.... wait, how about this one: in
CVS, if you don't have a file that is effectively line-oriented, human
readable information, you probably don't want to let it be treated as
"text" and stored as diffs. In that situation, "contains primarily
printable characters organized in lines" is probably a more thorough,
though less deterministic, definition.


We check for binary files in our CVS commitprep script like this:

look for the -kb arg
open the file in binary mode, read 4k from the file and...

non_text = 0
for a in bytearray(buff):  # byte values as integers
    if (a < 8) or (a > 13 and a < 32) or (a > 126):
        non_text = non_text + 1

If 10 percent of the characters are found to be non-text, we reject
the file if it was not committed with the -kb flag, or print a warning
if the file appears to be text but is being checked in as a binary.

We don't bother checking for charsets other than ascii, because
localized files have to be checked in as binaries or bad things
(tm) happen.
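
The check described above can be sketched as a self-contained function, assuming Python 3. The names `non_text_ratio` and `looks_binary`, the treatment of an empty file as text, and the reading of "10 percent" as "at least 10 percent" are all assumptions, not details of the actual commitprep script:

```python
def non_text_ratio(buff):
    """Fraction of bytes outside 8..13 and 32..126 (0.0 for empty input)."""
    if not buff:
        return 0.0
    non_text = sum(1 for a in buff if a < 8 or 13 < a < 32 or a > 126)
    return non_text / len(buff)


def looks_binary(path, sample_size=4096, threshold=0.10):
    """Read the first sample_size bytes and compare the ratio to threshold."""
    with open(path, "rb") as handle:
        buff = handle.read(sample_size)
    return non_text_ratio(buff) >= threshold
```

Bytes 8 to 13 cover BS, HT, LF, VT, FF and CR, so ordinary whitespace control characters count as text here, which addresses the objection raised earlier in the thread.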
Jul 18 '05 #18
Thanks for the answers.

To be more specific I'm making a script that should
identify binary files as binary and text files as text.

The script is for automating CVS commands, and
with CVS you have to add the -kb flag when you
add (or import) binary files (because CVS can't itself
determine what type a file is). If a binary file is not
added with -kb the results are awful.

Script example usage:
-import.py <directory_name>

The script makes a list of all files under that directory
and then determines each file's type. After that
all files are added with the Add command, and binary
files get the additional -kb automatically.
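
A rough sketch of the workflow just described, assuming Python 3. The names `has_null_byte` and `build_cvs_add_commands` are hypothetical, the null-byte test is only one of the heuristics discussed in this thread, and the function merely builds argument lists; it does not run cvs and ignores details such as adding directories first:

```python
import os


def has_null_byte(path, chunk_size=1024):
    """One possible 'binary' test: does the file contain a NUL byte?"""
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            if b"\0" in chunk:
                return True
    return False


def build_cvs_add_commands(directory, is_binary=has_null_byte):
    """Return one 'cvs add' argument list per file, with -kb for binaries."""
    commands = []
    for root, dirs, files in os.walk(directory):
        if "CVS" in dirs:
            dirs.remove("CVS")  # do not descend into CVS admin directories
        for name in sorted(files):
            path = os.path.join(root, name)
            command = ["cvs", "add"]
            if is_binary(path):
                command.append("-kb")
            command.append(path)
            commands.append(command)
    return commands
```

Passing the classifier in as `is_binary` makes it easy to swap the heuristic or to force -kb for particular files.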
"Sami Viitanen" <no**@none.net> wrote in message
news:AF******************@news1.nokia.com...
Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..
Thanks in adv.

Jul 18 '05 #19
In article <gw*****************@news2.nokia.com>, Sami Viitanen wrote:
To be more specific I'm making a script that should
identify binary files as binary and text files as text.


That's "more specific"? ;)

--
Grant Edwards (grante at visi.com)
Yow! I hope I bought the right relish... zzzzzzzzz...
Jul 18 '05 #20
"David C. Fox" wrote:

Sami Viitanen wrote:
Thanks for the answers.

To be more specific I'm making a script that should
identify binary files as binary and text files as text.

The script is for automating CVS commands and
with CVS you have to add the -kb flag to
add (or import) binary files. (because it can't itself
determine what type the file is). If binary file is not
added with -kb the results are awful.


You should note that the question of when to use -kb is not simply based
on the contents of the file, but on whether you want CVS/RCS to try to
merge conflicting versions.

For example, I recently added some files containing pickled objects
(used as test data sets for a regression test) to the CVS repository for
my project. Although the pickle files are in fact all printable text, a
CVS/RCS merge of two valid pickle files won't yield a valid pickle file.
Therefore, I used -kb to ensure that the developer would always be
forced to choose a version in the event of a version conflict.


Exactly. We had the same issue with the project files for the Codewright
text editor. They are sort of like Windows .INI files, but merging such
files leads to complete disaster, including inability to run Codewright
until the files are manually fixed or removed!
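
The pickle case is easy to reproduce. The sketch below assumes Python 3 and the ASCII-based pickle protocol 0 with plain ASCII data; the sample object is made up. It shows that such a pickle passes both content heuristics from this thread even though it should not be merged line by line:

```python
import pickle

sample = {"name": "test", "values": [1, 2, 3]}  # made-up test data
data = pickle.dumps(sample, protocol=0)         # protocol 0 is ASCII-based

no_nul = b"\0" not in data
all_printable = all(b == 10 or 32 <= b <= 126 for b in data)

print(data)
print(no_nul, all_printable)
```

So a script that decides on -kb from file contents alone would add this file as text; a list of names or extensions that always get -kb is one way to cover such cases.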

-Peter
Jul 18 '05 #21
Graham Fawcett <fa*****@teksavvy.com> wrote:
> P.S. Sami, it's very bad form to "make up" an e-mail address, such as
> <no**@none.net>. I'm sure the owners of the none.net domain would agree.

Very true.

> Can't you provide a real address?


Some non-real addresses are allowed/harmless too:
- everything ending with the .invalid TLD
e.g.: no**@none.invalid
- me@privacy.net (the owner of the domain gave his permission)

--
JanC

"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9
Jul 18 '05 #22
