473,385 Members | 1,555 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

Identifying File type by reading files

This is not really a Python-centric question, however, I am using
Python to solve this problem (as of now) so I thought it appropiate to
pose the question here.

I have some functions that search for files that contain certian
strings and if the files found to have these string do not already
have a filename extension (such as '.doc' or '.xls') the function will
append that to the files and rename them. So, if a file named 'report'
was found to have the string 'Microsoft' and the string
'Word.Document.' (notice the '.' at the end of both words) and it does
not already have an extension, then a rename would take place that
would name the file 'report.doc'

These functions work very well on most files (98% guessed correctly).
However, I would like the functions to be more precise (100%). So,
what should I look for in a file to determine whether or not it is a
MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
list of some of the strings I use to ID files, but I can't help but
wonder that there must be a more precise way of doing this. I know of
the Unix 'file' command. It is not very useful for me as it doesn't
distinguish between MS Office documents... all .xls, .docs, .ppts are
MS documents to it.

Are there certain sets of binary data that are unique to files that
would be a better way of identifying them? For example, on the N line
of a MS doc file begining at position X a binary string that is L
digits in lentgh that begins with B and ends with E will *ALWAYS* be
present... some one tell me that I'm not dreaming and that something
like the above example exists???

A few of my string searches today:

doc = string.find(file(os.path.join(root,fname), 'rb').read(),
'Word.Document.')
xls = string.find(file(os.path.join(root,fname), 'rb').read(),
'Excel.Sheet.')
pdf = string.find(file(os.path.join(root,fname), 'rb').read(),
'PDF-1.')
jpg = string.find(file(os.path.join(root,fname), 'rb').read(), 'JFIF')

Any suggestions or information that better describes how to positively
ID files w/o the possibiliy of mistake would be very helpful to me. As
of now, some of my files, though not many (~ 2%) will be given the
wrong extension, but the logic of the functions is such that they
append any extension that probably applies to the file so at that
point it is a simple process of elimination to determine which
extension is actually the correct one. Normally, I never have more
than 2 unique extensions attached to the same file.

Thank you!!!
Jul 18 '05 #1
1 5132
hokiegal99:
what should I look for in a file to determine whether or not it is a
MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
list of some of the strings I use to ID files, but I can't help but
wonder that there must be a more precise way of doing this. I know of
the Unix 'file' command. It is not very useful for me as it doesn't
distinguish between MS Office documents... all .xls, .docs, .ppts are
MS documents to it.


That likely means you have an incomplete 'magic' file. This is the
file used by the 'file' command to figure out the file type. Take a
look at http://www.unixhideout.com/freebsd/share/misc/magic for
a more complete (I think) version.

That's dated 1995 and is close the one on my Mac. It doesn't support
the newer MS Word and Excel formats. I'm having trouble
finding the most recent, definitive version. One link pointed me
to ftp://ftp.astron.com/pub/file/ but I haven't investigated it further.

There's also a pymagic, http://thomas.mangin.me.uk/software/python.html
which may help for a pure Python implementation of 'file'.

Andrew
da***@dalkescientific.com
Jul 18 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: hokiegal99 | last post by:
This is not really a Python-centric question, however, I am using Python to solve this problem (as of now) so I thought it appropiate to pose the question here. I have some functions that search...
9
by: Hans-Joachim Widmaier | last post by:
Hi all. Handling files is an extremely frequent task in programming, so most programming languages have an abstraction of the basic files offered by the underlying operating system. This is...
4
by: Michael J. Fromberger | last post by:
Greetings, The following question pertains to a problem I am solving in Python 2.4b2 on a MacOS 10 (Panther) system with an HFS+ filesystem. Given the pathname of a directory in my filesystem,...
0
by: Peter | last post by:
I am having a problem reading an Excel file that is XML based. The directory I am reading contains Excel files that can be of two types. Either generic Microsoft based or XML based. I am reading...
1
by: Galen Somerville | last post by:
And yet another VB6 to VB2005 problem. All helpful suggestions appreciated. As you can see in the code below, my structures use fixed length strings and known array sizes. Consequently I can save...
6
by: Frank Rizzo | last post by:
I have an interesting problem. I have a directory of image files. However, none of the files have an extension. I need to figure out what type if image it is and attach an extension to the file. ...
1
AdrianH
by: AdrianH | last post by:
Assumptions I am assuming that you know or are capable of looking up the functions I am to describe here and have some remedial understanding of C programming. FYI Although I have called this...
1
by: dwaterpolo | last post by:
Hi Everyone, I am trying to read two text files swY40p10t3ctw45.col.txt and solution.txt and compare them, the first text file has a bunch of values listed like: y y y y y y y
1
by: shyaminf | last post by:
hi everybody! iam facing a problem with the transfer of file using servlet programming. i have a code for uploading a file. but i'm unable to execute it using tomcat5.5 server. kindly help me how to...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.