Identifying File type by reading files

This is not really a Python-centric question; however, I am using
Python to solve this problem (as of now), so I thought it appropriate to
pose the question here.

I have some functions that search for files that contain certain
strings. If a file found to contain these strings does not already
have a filename extension (such as '.doc' or '.xls'), the function will
append the matching extension and rename the file. So, if a file named
'report' was found to have the string 'Microsoft' and the string
'Word.Document.' (notice the '.' at the end of both words) and it does
not already have an extension, then a rename would take place that
would name the file 'report.doc'.

These functions work very well on most files (98% guessed correctly).
However, I would like the functions to be more precise (100%). So,
what should I look for in a file to determine whether or not it is a
MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
list of some of the strings I use to ID files, but I can't help but
think there must be a more precise way of doing this. I know of
the Unix 'file' command. It is not very useful for me as it doesn't
distinguish between MS Office documents... all .xls, .doc, and .ppt
files are MS documents to it.

Are there certain sets of binary data that are unique to files that
would be a better way of identifying them? For example, on line N
of a MS doc file, beginning at position X, a binary string that is L
digits in length that begins with B and ends with E will *ALWAYS* be
present... someone tell me that I'm not dreaming and that something
like the above example exists???

A few of my string searches today:

import os

# root and fname come from the surrounding directory walk (not shown here)
with open(os.path.join(root, fname), 'rb') as f:
    data = f.read()  # read the file once instead of once per search

# bytes.find returns the offset of the match, or -1 if the marker is absent
doc = data.find(b'Word.Document.')
xls = data.find(b'Excel.Sheet.')
pdf = data.find(b'PDF-1.')
jpg = data.find(b'JFIF')

Any suggestions or information that better describes how to positively
ID files w/o the possibility of mistake would be very helpful to me. As
of now, some of my files, though not many (~ 2%), will be given the
wrong extension, but the logic of the functions is such that they
append any extension that probably applies to the file, so at that
point it is a simple process of elimination to determine which
extension is actually the correct one. Normally, I never have more
than 2 unique extensions attached to the same file.
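
To make the logic above concrete, here is a minimal sketch of the
"collect every extension that probably applies" step. The names MARKERS,
candidate_extensions and needs_extension are made up for illustration and
are not the actual functions; only the marker strings come from the
searches listed above.

```python
import os

# Marker strings from the searches above. The mapping and the function
# names are illustrative assumptions, not the original functions.
MARKERS = {
    '.doc': b'Word.Document.',
    '.xls': b'Excel.Sheet.',
    '.pdf': b'PDF-1.',
    '.jpg': b'JFIF',
}


def candidate_extensions(data):
    """Return, sorted, every extension whose marker occurs in data."""
    return sorted(ext for ext, marker in MARKERS.items() if marker in data)


def needs_extension(fname):
    """Return True when the file name has no extension yet."""
    return os.path.splitext(fname)[1] == ''
```

A file that yields more than one candidate is one of the ambiguous cases
that still has to be settled by elimination.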

Thank you!!!
Jul 18 '05 #1
hokiegal99:
what should I look for in a file to determine whether or not it is a
MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
list of some of the strings I use to ID files, but I can't help but
think there must be a more precise way of doing this. I know of
the Unix 'file' command. It is not very useful for me as it doesn't
distinguish between MS Office documents... all .xls, .doc, and .ppt
files are MS documents to it.


That likely means you have an incomplete 'magic' file. This is the
file used by the 'file' command to figure out the file type. Take a
look at http://www.unixhideout.com/freebsd/share/misc/magic for
a more complete (I think) version.

That's dated 1995 and is close to the one on my Mac. It doesn't support
the newer MS Word and Excel formats. I'm having trouble
finding the most recent, definitive version. One link pointed me
to ftp://ftp.astron.com/pub/file/ but I haven't investigated it further.

There's also a pymagic, http://thomas.mangin.me.uk/software/python.html
which may help for a pure Python implementation of 'file'.
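
As a rough illustration of the kind of test a magic file encodes, here is
a minimal sketch that checks the leading bytes of a file against a few
well-known signatures. The table is a tiny subset chosen for this example,
and the names SIGNATURES, sniff and sniff_office are made up here. Older
Word and Excel files share the same OLE2 container signature, which is
why a leading-bytes test alone cannot tell them apart. The second function
falls back on the marker strings from the original post, which is a
heuristic rather than a guaranteed test.

```python
# Leading-byte signatures for a few common formats. This is a small
# illustrative subset, not a replacement for a full magic file.
SIGNATURES = [
    (b'%PDF-', 'pdf'),
    (b'\xff\xd8\xff', 'jpeg'),
    (b'\x89PNG\r\n\x1a\n', 'png'),
    # OLE2 compound file: the container shared by older MS Office documents
    (b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1', 'ole2'),
]


def sniff(data):
    """Return a type name if data starts with a known signature, else None."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return None


def sniff_office(data):
    """Narrow an OLE2 container using marker strings (heuristic only)."""
    if sniff(data) != 'ole2':
        return None
    if b'Word.Document.' in data:
        return 'doc'
    if b'Excel.Sheet.' in data:
        return 'xls'
    return 'ole2'  # an OLE2 container of some other or unknown kind
```

Reading only the first few bytes is enough for sniff(), while
sniff_office() needs the whole file because the marker strings can appear
anywhere inside the container.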

Andrew
da***@dalkescientific.com
Jul 18 '05 #2
